Handling Missing Data

How to handle Missing Data?

  1. Data Imputation
  2. Deletion
    1. Remove the column (feature)
    2. Remove the row (data point)
  3. Use algorithms that work with missing data
    1. K-nearest Neighbor (KNN)
    2. Naive Bayes
    3. XGBoost
    4. Random Forest
  4. Train a model to replace missing data
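
The deletion and imputation options can be sketched in a few lines. This is a minimal illustration in plain Python, not a reference implementation: the toy table, the column names, and the helper names (`drop_rows`, `drop_columns`, `impute_mean`) are all made up for the example, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Toy table (hypothetical data): each row is a dict, None marks a missing value.
rows = [
    {"id": 1, "age": 25, "income": 50000},
    {"id": 2, "age": None, "income": 62000},
    {"id": 3, "age": 40, "income": None},
    {"id": 4, "age": 31, "income": 58000},
]

def drop_rows(rows):
    """Deletion by row: keep only data points with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_columns(rows):
    """Deletion by column: keep only features with no missing values."""
    complete = [c for c in rows[0] if all(r[c] is not None for r in rows)]
    return [{c: r[c] for c in complete} for r in rows]

def impute_mean(rows, column):
    """Imputation: fill missing entries in one column with the observed mean."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

complete_rows = drop_rows(rows)        # rows 1 and 4 survive
no_gappy_columns = drop_columns(rows)  # only "id" survives
filled = impute_mean(rows, "age")      # row 2 gets age 32
```

Row deletion here throws away half the data points and column deletion throws away both real features, which is why imputation is often tried first on small tables.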

What are the pros and cons of handling missing data with imputation?

  • Pros:
    1. Easy to implement
  • Cons
    1. Does not work for categorical data (at least not easily)
    2. May not account for outliers (for example, a mean fill is pulled toward extreme values)
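
Both cons can be seen on a small example. The sketch below uses made-up numbers and assumes simple statistic-based filling (mean, median, mode); it is meant only to illustrate the trade-off, not to recommend a particular strategy.

```python
from statistics import mean, median, mode

# Hypothetical incomes in thousands; 1000 is an outlier, None is missing.
incomes = [40, 42, 45, None, 47, 1000]
observed = [x for x in incomes if x is not None]

mean_fill = mean(observed)      # 234.8, dragged upward by the outlier
median_fill = median(observed)  # 45, stays near the bulk of the data

# A mean is undefined for categories; the most frequent value is one workaround.
colors = ["red", "blue", None, "blue", "green"]
mode_fill = mode([c for c in colors if c is not None])  # "blue"
```

A fill value of 234.8 is far from every typical income in the list, so a more robust statistic such as the median is often a safer default when outliers are present.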

What are the pros and cons of handling missing data with tree-based methods?

  • Pros
    1. Can capture more underlying patterns (and outliers)
    2. Suitable for both categorical and numerical values
  • Cons
    1. Adds a level of complexity
    2. Model needs to be retrained from scratch for new data or if the distribution changes
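
One way a tree can work with missing values directly is to learn, at each split, which side missing values should go to. XGBoost uses a default direction of this kind. The code below is a heavily simplified sketch of that idea for a single split with a fixed threshold, written in plain Python; it is not XGBoost's implementation, and `fit_stump`, `predict`, and the toy data are hypothetical. It assumes both sides of the split contain at least one observed value.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, ys, threshold):
    """Fit a one-split regression stump at a fixed threshold.

    Rows where x is None are tried on each side of the split; the side
    with the lower training error becomes the default direction.
    """
    left = [y for x, y in zip(xs, ys) if x is not None and x < threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x >= threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]

    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    default_left = loss_if_left <= loss_if_right

    if default_left:
        left = left + missing
    else:
        right = right + missing
    return {
        "threshold": threshold,
        "default_left": default_left,
        "left_value": sum(left) / len(left),
        "right_value": sum(right) / len(right),
    }

def predict(stump, x):
    """Route a value through the stump; None follows the learned default."""
    go_left = stump["default_left"] if x is None else x < stump["threshold"]
    return stump["left_value"] if go_left else stump["right_value"]

# Toy data: the targets of the missing rows resemble the high-x group.
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [10.0, 12.0, 31.0, 30.0, 32.0, 29.0]
stump = fit_stump(xs, ys, threshold=5.0)
```

On this toy data the missing rows are routed to the right branch, because their targets look like those of the high-x rows. No separate imputation step is needed, but the default direction is learned from the training data, so it has to be relearned if the pattern of missingness changes.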

TODO:

  1. Learn how the algorithms handle missing data

Related Notes