Handling Missing Data

How to handle Missing Data?

  1. Data Imputation
    1. Mean
    2. Median
    3. Mode
    4. Replace with the most correlated data (see "Finding Correlation between two data or distribution")
    5. Assign a new category, e.g. "unknown" (for categorical values)
    6. Interpolation (time series data)
    7. Use K-nearest Neighbors (KNN) to impute data
  2. Deletion
    1. Remove the column (feature)
    2. Remove the row (data point)
  3. Use algorithms that work with missing data
    1. K-nearest Neighbors (KNN)
    2. Naive Bayes
    3. XGBoost
    4. Random Forest
  4. Train a model to replace missing data
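
A minimal sketch of the simpler options from the list above, using pandas. The DataFrame, its column names, and its values are made up for illustration. KNN-based imputation and model-based replacement are left out here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are assumptions for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "age": [20.0, np.nan, 30.0, 40.0, 30.0],
    "city": ["Oslo", None, "Rome", "Oslo", None],
    "temp": [10.0, np.nan, 14.0, np.nan, 20.0],
})

# Imputation with a summary statistic of the observed values.
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())
age_mode = df["age"].fillna(df["age"].mode().iloc[0])

# New category for missing categorical values.
city_unknown = df["city"].fillna("unknown")

# Linear interpolation, assuming the rows are ordered in time.
temp_interp = df["temp"].interpolate(method="linear")

# Deletion: drop rows or columns that contain any missing value.
rows_dropped = df.dropna(axis=0)
cols_dropped = df.dropna(axis=1)
```

Note that deletion is aggressive on this toy data: dropping rows keeps only two of five data points, and dropping columns keeps only "id".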

What are the pros and cons of handling missing data with imputation?

  • Pros
    1. Easy to implement
  • Cons
    1. Does not work for categorical data (at least not easily)
    2. Might not account for outliers (a mean, for example, is pulled toward extreme values)
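
To illustrate the outlier problem, here is a small sketch using only the Python standard library. The income values are made up. It compares the fill value chosen by the mean with the one chosen by the median when a single extreme value is present.

```python
from statistics import mean, median

# Hypothetical incomes (made-up numbers); None marks the missing entry.
incomes = [30, 32, 35, None, 31, 1000]

# Compute the fill values from the observed entries only.
observed = [x for x in incomes if x is not None]
fill_mean = mean(observed)      # pulled upward by the single outlier 1000
fill_median = median(observed)  # stays close to the typical values

imputed_mean = [fill_mean if x is None else x for x in incomes]
imputed_median = [fill_median if x is None else x for x in incomes]

print(fill_mean, fill_median)
```

Here the mean-based fill value lands far from every typical observation, while the median-based one does not.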

What are the pros and cons of handling missing data with tree-based methods?

  • Pros
    1. Can capture more underlying patterns (and outliers)
    2. Suitable for both categorical and numerical values
  • Cons
    1. Adds a level of complexity
    2. The model needs to be retrained from scratch when new data arrives or the distribution changes
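
One possible reading of "tree-based methods" here is training a tree model on the complete rows and using it to predict the missing values. The sketch below assumes that reading and uses scikit-learn's DecisionTreeRegressor. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: "area" is missing for the last two rows.
df = pd.DataFrame({
    "rooms": [1, 2, 3, 4, 2, 3],
    "area": [30.0, 55.0, 80.0, 110.0, np.nan, np.nan],
})

# Split into rows where the target column is known and rows where it is missing.
known = df[df["area"].notna()]
missing = df[df["area"].isna()]

# Fit a tree on the known rows, predicting "area" from "rooms".
model = DecisionTreeRegressor(random_state=0)
model.fit(known[["rooms"]], known["area"])

# Replace the missing values with the model's predictions.
df.loc[missing.index, "area"] = model.predict(missing[["rooms"]])
```

The fitted model only reflects the data it was trained on, which is why it has to be refit when new data arrives or the distribution shifts.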

TODO:

  1. Learn how the algorithms handle missing data

Related Notes