Handling Missing Data

How to handle missing data?

Data imputation
    Mean
    Median
    Mode
    Replace with a value from the most correlated feature (see: Finding the correlation between two data sets or distributions)
    Assign a new category, e.g. "unknown" (for categorical values)
    Interpolation (time series data)
    Use K-nearest neighbors (KNN) to interpolate data
Deletion
    Remove the column (feature)
    Remove the row (data point)
Use algorithms that can work with missing data (support depends on the implementation)
    K-nearest neighbors (KNN)
    Naive Bayes
    XGBoost
    Random Forest
Train a model to replace missing data
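
The simpler options above (mean/median/mode, an "unknown" category, interpolation, and deletion) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes missing values are represented as None, that a time series is evenly spaced, and all function names and sample data are hypothetical.

```python
from statistics import mean, median, mode


def impute_constant(values, strategy="mean"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


def impute_unknown(categories, label="unknown"):
    """Treat missingness in a categorical column as its own category."""
    return [label if c is None else c for c in categories]


def interpolate_linear(series):
    """Fill interior gaps of an evenly spaced series by linear interpolation.

    Leading and trailing gaps stay None because they have only one neighbor.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def drop_incomplete_rows(rows):
    """Deletion by row: keep only rows with no missing field."""
    return [row for row in rows if None not in row]


def drop_column(rows, index):
    """Deletion by column: remove the feature at position `index` from every row."""
    return [row[:index] + row[index + 1:] for row in rows]


ages = [20, None, 30, 100]                       # hypothetical sample column
mean_filled = impute_constant(ages, "mean")      # gap filled with 50
median_filled = impute_constant(ages, "median")  # gap filled with 30
```

In practice a library would normally do this work; pandas offers `fillna`, `interpolate`, and `dropna`, and scikit-learn offers `SimpleImputer` and `KNNImputer` for the same ideas.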

What are the pros and cons of handling missing data with imputation?

Pros
1. Easy to implement
Cons
1. Does not work for categorical data, at least not easily (mean and median are undefined for categories)
2. Might not account for outliers (the mean in particular is distorted by them)
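
The outlier problem is easy to see with a small example. The sketch below uses made-up income figures with one extreme value and compares the fill value that mean and median imputation would produce; the data and variable names are assumptions for illustration only.

```python
from statistics import mean, median

# Hypothetical column with one missing entry and one extreme value.
incomes = [30_000, 32_000, None, 35_000, 1_000_000]
observed = [v for v in incomes if v is not None]

mean_fill = mean(observed)      # 274250: dragged up by the single outlier
median_fill = median(observed)  # 33500.0: close to the typical observed value

mean_imputed = [mean_fill if v is None else v for v in incomes]
median_imputed = [median_fill if v is None else v for v in incomes]
```

Here the mean-imputed value is far from any typical observation, which is one reason the median is often preferred for skewed columns.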

What are the pros and cons of handling missing data with tree-based methods?

Pros
1. Can capture more underlying patterns (and outliers)
2. Suitable for both categorical and numerical values
Cons
1. Adds a level of complexity
2. Model needs to be retrained from scratch for new data or if the distribution changes
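
To make the idea of model-based imputation concrete, here is a deliberately tiny sketch: a one-split regression tree (a "stump") is fitted on the complete rows and then used to predict the missing values. The stump is only a stand-in for a real tree-based model such as Random Forest or XGBoost, and the function names and data are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature.

    Tries every candidate threshold on x and keeps the one that minimizes
    the squared error of predicting y by the mean on each side.
    """
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct x value: fall back to predicting the overall mean.
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


# Hypothetical rows of (years_of_experience, salary); salary is sometimes missing.
rows = [(1, 30), (2, 32), (3, None), (8, 80), (9, 84), (10, None)]

# Train only on rows where the target is observed.
complete = [(x, y) for x, y in rows if y is not None]
predict = fit_stump([x for x, _ in complete], [y for _, y in complete])

# Replace each missing salary with the model's prediction.
imputed = [(x, predict(x) if y is None else y) for x, y in rows]
```

The split point and leaf means are learned from the complete rows, so if new data arrives or the distribution shifts, the model has to be fitted again, which is the retraining cost listed above.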

Related Notes