ML System Design
Framework:
flowchart TD
    A[Clarifying Requirements] --> B[Framing the problem as an ML Task]
    B --> C[Data Preparation]
    C --> D[Model Development]
    D --> E[Evaluation]
    E --> F[Deployment and Serving]
    F --> G[Monitoring and Infrastructure]
Clarifying Requirements
- Ask questions to understand the exact requirements
- Business Objective: What is the end goal of this model? For a recommender system, for example, do we want the user to watch the recommended video, click on it, or like it?
- Data: Do we already have labeled data? Can we use data from the current platform? What is the size of the data?
- Constraints: Is there any constraint on inference time? Is there any constraint on training time? Constraint on compute?
- Scale of the System: How many users does it have to handle? How many requests can arrive at the same time?
- Performance: How fast should the prediction be? Does it need to be real time? What is the acceptable trade-off between model quality and latency?
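The scale and latency answers can be turned into a rough capacity estimate. The sketch below is a back-of-the-envelope calculation only: the function names, the peak-to-average ratio of 3, and the example traffic numbers are all hypothetical assumptions, not figures from any real system.

```python
import math


def estimate_peak_qps(daily_active_users: int,
                      requests_per_user_per_day: float,
                      peak_to_average_ratio: float = 3.0) -> float:
    """Rough peak requests per second from daily traffic.

    The peak-to-average ratio is an assumed fudge factor; measure it
    on the real platform if traffic logs are available.
    """
    seconds_per_day = 24 * 60 * 60
    average_qps = daily_active_users * requests_per_user_per_day / seconds_per_day
    return average_qps * peak_to_average_ratio


def replicas_needed(peak_qps: float,
                    model_latency_ms: float,
                    concurrency_per_replica: int) -> int:
    """Number of serving replicas needed to absorb the peak load.

    Assumes each replica serves `concurrency_per_replica` requests in
    parallel and each request takes `model_latency_ms` on average.
    """
    qps_per_replica = concurrency_per_replica * 1000 / model_latency_ms
    return math.ceil(peak_qps / qps_per_replica)


if __name__ == "__main__":
    # Hypothetical numbers: 1M daily users, 20 requests each per day.
    peak = estimate_peak_qps(1_000_000, 20)
    print(f"peak QPS: {peak:.1f}")
    print("replicas:", replicas_needed(peak, model_latency_ms=50, concurrency_per_replica=4))
```

An estimate like this is mainly useful for deciding early whether the inference-time constraint rules out a heavy model.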
Framing the problem as an ML Task
- Define the ML Objective
- Define the input and output
- Choosing the right ML Category
Define ML Objective:
I need to convert the business objective from step 1 into an ML objective. Say the business objective is to increase sales by 20% with a product recommendation model. We can't ask a model to increase sales by 20%, so we have to think about how a product recommendation model actually helps increase sales: by showing more relevant products that the user wants to see or buy. A better ML objective would therefore be to increase click-through rate, or to correctly predict which products are relevant to a given user. The former is an online metric, whereas the latter can be measured offline and used as a training objective.
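To make the online/offline distinction concrete, here is a minimal sketch of one metric of each kind. Click-through rate comes from the text above; precision@k is my own choice of offline relevance metric for illustration, and the function names and sample data are hypothetical.

```python
from typing import Sequence, Set


def click_through_rate(clicks: int, impressions: int) -> float:
    """Online metric: fraction of shown recommendations that were clicked."""
    if impressions == 0:
        return 0.0
    return clicks / impressions


def precision_at_k(recommended: Sequence[str], relevant: Set[str], k: int) -> float:
    """Offline metric: share of the top-k recommendations that are relevant.

    `relevant` would come from held-out labeled data, so this can be
    computed without deploying the model.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(click_through_rate(clicks=30, impressions=1000))
    print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=3))
```

The offline metric is what we can optimize during training; the online metric is what we check after deployment, for example in an A/B test.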
Define the Input & Output:
Here we need to define the input and output of the model. The same model can take different kinds of input. For a product recommendation model, we can use only user-product interaction history (collaborative filtering), use product features matched against the user's preferences (content-based filtering), or combine both (hybrid). In all of these cases the output can be binary (relevant or not); in practice, the model outputs the probability that a product is relevant.
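A minimal sketch of this input/output contract, assuming the user and the product are each represented by an embedding vector of the same length, and that relevance is scored as their dot product passed through a sigmoid. Both the scoring rule and the name `relevance_probability` are illustrative assumptions; a real model would learn the embeddings and may score them differently.

```python
import math
from typing import Sequence


def relevance_probability(user_embedding: Sequence[float],
                          product_embedding: Sequence[float]) -> float:
    """Map a (user, product) pair to a probability in (0, 1)."""
    if len(user_embedding) != len(product_embedding):
        raise ValueError("embeddings must have the same dimension")
    # Dot product as the raw relevance score (assumed scoring rule).
    score = sum(u * p for u, p in zip(user_embedding, product_embedding))
    # Numerically stable sigmoid: avoids overflow for large |score|.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


def is_relevant(probability: float, threshold: float = 0.5) -> bool:
    """Turn the probability into the binary output (threshold is a choice)."""
    return probability >= threshold


if __name__ == "__main__":
    p = relevance_probability([1.0, 2.0], [0.5, 0.25])
    print(p, is_relevant(p))
```

Keeping the probability and the thresholding separate lets the threshold be tuned later without retraining.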
Choosing the right ML category
Choose among the following:
- Regression - what is the range of the output?
- Classification - binary / multi-class / multi-label
- Supervised / Unsupervised / Semi-supervised
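As a rough illustration of how the label column drives this choice, the hypothetical helper below inspects the labels and suggests a category. The rules are deliberately simplistic assumptions (missing labels are `None`, float labels mean regression, list-like labels mean multi-label, other labels are hashable scalars); it is a thinking aid, not a rule to apply blindly.

```python
from typing import Sequence


def suggest_ml_category(labels: Sequence[object]) -> str:
    """Heuristically map a column of labels to an ML category."""
    known = [y for y in labels if y is not None]
    if not known:
        # No labels at all: nothing to supervise on.
        return "unsupervised"

    supervision = "supervised" if len(known) == len(labels) else "semi-supervised"

    if all(isinstance(y, (list, tuple, set, frozenset)) for y in known):
        # Each example carries several labels at once.
        task = "multi-label classification"
    elif all(isinstance(y, float) for y in known):
        # Continuous target (assumption: floats are never class ids).
        task = "regression"
    else:
        # Remaining labels are assumed to be hashable scalars.
        n_classes = len(set(known))
        task = "binary classification" if n_classes <= 2 else "multi-class classification"

    return f"{supervision} {task}"


if __name__ == "__main__":
    print(suggest_ml_category([0, 1, 1, 0]))
    print(suggest_ml_category([3.2, 4.5, None]))
    print(suggest_ml_category([["a", "b"], ["c"]]))
```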
Data Preparation
Data Engineering:
There are 3 key
Questions to Ask
- What are the constraints of the system?
- End devices?
- What is the use of the model?
- Data characteristics
- Size
- Output - Categorical or Continuous
- Labeled - Supervised Learning / Unsupervised Learning / Semi-supervised Learning
- Missing data - Handling Missing Data
- Imbalanced? - Handling Imbalanced Dataset
- Outliers - Handling Outliers
- Which machine learning algorithm to use?
- Machine Learning Algorithm Selection
- Need to be interpretable?
- Online Learning?
- Recommendation system?
- Model Evaluation
- #evaluation
- Is the positive or the negative class more important?
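The question of which class matters more usually shows up as a precision versus recall trade-off. One common way to encode it is the F-beta score: beta > 1 weights recall more (missing a positive is costly), beta < 1 weights precision more (a false alarm is costly). The sketch below is a plain-Python illustration with hypothetical data, assuming binary labels where 1 is the positive class.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: recall counts beta times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # Hypothetical predictions: the model is precise but misses most positives.
    y_true = [1, 1, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0]
    precision, recall = precision_recall(y_true, y_pred)
    for beta in (0.5, 1.0, 2.0):
        print(f"F{beta}: {f_beta(precision, recall, beta):.4f}")
```

With these predictions the recall-weighted score (beta = 2) comes out lowest, which is the signal we want when false negatives are the expensive error.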