Handling Missing Data
How to handle missing data?
- Data Imputation
- Mean
- Median
- Mode
- Replace using the most correlated feature (see: finding the correlation between two variables or distributions)
- Assign new category i.e. unknown (For categorical values)
- Interpolation (Time series data)
- Use K-nearest Neighbor (KNN) to impute missing values from similar data points
- Deletion
- Remove the column (feature)
- Remove the row (data point)
- Use algorithms that can work with missing data (support varies by implementation)
- K-nearest Neighbor (KNN)
- Naive Bayes
- XGBoost
- Random Forest
- Train a model to replace missing data
- Imputation works best when only a small fraction of the values is missing
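
The simpler imputation options above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not a reference implementation; the helper names `impute` and `interpolate_linear`, the use of `None` as the missing-value marker, and the sample data are all assumptions made for the example.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean", fill_value="unknown"):
    """Return a copy of `values` with None replaced according to `strategy`."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)          # also works for categorical values
    elif strategy == "constant":
        fill = fill_value              # e.g. a new "unknown" category
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [fill if v is None else v for v in values]


def interpolate_linear(series):
    """Fill interior gaps of an evenly spaced series by linear interpolation.

    Leading and trailing gaps have no neighbor on one side and stay None.
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


ages = [22.0, None, 30.0, 41.0]             # numerical feature
colors = ["red", None, "blue", "red"]       # categorical feature
temperatures = [10.0, None, None, 16.0]     # time series

print(impute(ages, "median"))
print(impute(colors, "mode"))
print(impute(colors, "constant"))
print(interpolate_linear(temperatures))     # [10.0, 12.0, 14.0, 16.0]
```

In practice a library such as pandas or scikit-learn would usually do this work; the sketch only shows what each strategy computes.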
What are the pros and cons of handling missing data with imputation?
- Pros
- Easy to implement
- Cons
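- Does not work for categorical data (at least not easily): mean and median are undefined there, so mode or an "unknown" category is needed
- Might not account for outliers (for example, the mean is pulled toward extreme values)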
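
A small worked example of the outlier problem, using made-up numbers: a single extreme value drags the mean far from the typical value, while the median barely moves. The variable names and data are illustrative assumptions.

```python
from statistics import mean, median

# One missing entry (None) and one outlier (1000) in an otherwise tight range.
incomes = [10, 12, 11, None, 1000]
observed = [x for x in incomes if x is not None]

mean_fill = mean(observed)      # (10 + 12 + 11 + 1000) / 4 = 258.25
median_fill = median(observed)  # middle of sorted [10, 11, 12, 1000] = 11.5

print("mean fill:  ", mean_fill)
print("median fill:", median_fill)
```

Filling the gap with 258.25 would invent a value unlike any typical observation, which is why the median is often preferred for skewed data.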
What are the pros and cons of handling missing data with tree based methods?
- Pros
- Can capture more underlying patterns (and outliers)
- Suitable for both categorical and numerical values
- Cons
- Adds a level of complexity
- Model needs to be retrained from scratch for new data or if the distribution changes
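
One way tree-based methods can cope with missing values is to learn, for each split, a default direction that rows with a missing feature follow. XGBoost does this using gradient statistics; the sketch below is a simplified analogue that uses squared error on a single feature, so treat it as an illustration of the idea rather than the library's algorithm. The function names and data are hypothetical.

```python
def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split_with_default(xs, ys):
    """Search thresholds and both default directions for missing x values.

    Returns (threshold, default_direction, error) with the lowest error.
    """
    present = [(x, y) for x, y in zip(xs, ys) if x is not None]
    missing_ys = [y for x, y in zip(xs, ys) if x is None]
    best = None
    for threshold in sorted({x for x, _ in present}):
        left = [y for x, y in present if x < threshold]
        right = [y for x, y in present if x >= threshold]
        for direction in ("left", "right"):
            # Send every row with a missing x to one side and measure the fit.
            if direction == "left":
                err = sse(left + missing_ys) + sse(right)
            else:
                err = sse(left) + sse(right + missing_ys)
            if best is None or err < best[2]:
                best = (threshold, direction, err)
    return best


# Rows with missing x have targets that resemble the high-x group.
xs = [1, 2, None, 8, 9, None]
ys = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]

threshold, direction, err = best_split_with_default(xs, ys)
print(threshold, direction)  # 8 right
```

Because the default direction is learned from the data, missingness itself can carry signal, which is one reason these methods can pick up patterns that simple imputation hides.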
TODO:
- Learn how the algorithms handle missing data
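
As a starting point for this TODO, here is a hedged sketch of how a KNN-style approach can tolerate gaps: distances are computed only over features that both rows have, then rescaled by the share of features used (the same idea as scikit-learn's NaN-aware Euclidean distance). The helper names, the `None` marker, and the data are assumptions for illustration.

```python
import math


def partial_distance(a, b):
    """Euclidean distance over coordinates present in both rows.

    The squared sum is scaled up by (total features / features used) so rows
    compared on fewer features are not unfairly favored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    scale = len(a) / len(pairs)
    return math.sqrt(scale * sum((x - y) ** 2 for x, y in pairs))


def knn_impute(rows, k=2):
    """Fill each None with the mean of that feature over the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows where feature j is observed.
            donors = [
                (partial_distance(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [v for dist, v in donors[:k] if dist != math.inf]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


rows = [
    [1.0, 2.0],
    [1.1, None],   # second feature is missing
    [5.0, 6.0],
    [0.9, 2.2],
]
result = knn_impute(rows, k=2)
print(result[1])  # second value is the mean of the two closest rows' values
```

Here the two nearest rows to `[1.1, None]` are the first and last, so the gap is filled with the average of 2.0 and 2.2.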