ML System Design

#machine-learning #deep-learning #nlp #interview #need-review

Framework:

flowchart TD
    A[Clarifying Requirements] --> B[Framing the problem as an ML Task]
    B --> C[Data Preparation]
    C --> D[Model Development]
    D --> E[Evaluation]
    E --> F[Deployment and Serving]
    F --> G[Monitoring and Infrastructure]

Features: features - Google Sheets

Clarifying Requirements

Ask questions to understand the exact requirements.
Business Objective: What is the end goal of this model? For a recommender system, for example, do we want the user to watch the recommended video, click on it, or like it?
Data: Do we already have labeled data? Can we use data from the current platform? How large is the dataset?
Constraints: Is there any constraint on inference time? On training time? On compute?
Scale of the System: How many users does it have to handle? How many requests can arrive at a time?
Performance: How fast should the prediction be? Is it real time? What is the trade-off between model accuracy and latency?
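
The scale and latency questions can be turned into a quick back-of-the-envelope estimate. This is only a rough sketch: the function names, the peak factor, and the per-replica concurrency are my own assumptions for illustration, not numbers from any real system.

```python
import math


def peak_qps(daily_active_users: int,
             requests_per_user_per_day: float,
             peak_factor: float = 3.0) -> float:
    """Average requests per second, scaled by an assumed peak-to-average factor."""
    seconds_per_day = 24 * 60 * 60
    average_qps = daily_active_users * requests_per_user_per_day / seconds_per_day
    return average_qps * peak_factor


def replicas_needed(qps: float,
                    latency_ms: float,
                    concurrency_per_replica: int = 1) -> int:
    """Replicas required if each one serves `concurrency_per_replica` requests at a time."""
    # One slot finishes 1000 / latency_ms requests per second.
    per_replica_qps = concurrency_per_replica * 1000.0 / latency_ms
    return math.ceil(qps / per_replica_qps)


# Hypothetical numbers: 8.64M daily users, 10 requests each, peak = 2x average.
qps = peak_qps(8_640_000, 10, peak_factor=2.0)
print(qps, replicas_needed(qps, latency_ms=50.0, concurrency_per_replica=4))
```

The point is not the exact numbers but that the answers to "how many users" and "how fast" directly bound the serving budget, which in turn limits how heavy the model can be.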

Framing the problem as an ML Task

Define the ML Objective
Define the input and output
Choosing the right ML Category

Define ML Objective:

I need to convert the business objective from step 1 into an ML objective. Say the business objective is to increase sales by 20% with a product recommendation model. We can't ask a model to "increase sales by 20%", so we have to think about how a product recommendation model actually helps to increase sales: by showing more relevant products that the user wants to see or buy. A better ML objective is therefore to increase click-through rate, or to correctly predict which products are relevant to a given user. The former is an online metric, whereas the latter is an offline metric that we can use for training.
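
To make the online vs. offline distinction concrete, here is a minimal sketch of one metric of each kind. Click-through rate comes from live traffic. For the offline side I picked precision@k as one possible way to measure "predicting the relevant product"; that choice, and the function names, are assumptions on my part.

```python
from typing import Sequence, Set


def click_through_rate(clicks: int, impressions: int) -> float:
    """Online metric: fraction of shown recommendations that were clicked."""
    return clicks / impressions if impressions else 0.0


def precision_at_k(recommended: Sequence[str], relevant: Set[str], k: int) -> float:
    """Offline metric: share of the top-k recommendations that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


# Hypothetical logged data for one user.
print(click_through_rate(clicks=30, impressions=1000))
print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=3))
```

The offline metric can be computed on historical data before launch, while the online metric only exists once real users see the recommendations.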

Define the Input & Output:

Here, we need to define the input and output of the model. The same model can take different types of input. A product recommendation model, for example, can learn only from user-product interaction history (collaborative filtering), or it can take both user and product feature embeddings as input (content-based filtering). Either way, the output can be binary (relevant or not); in practice, the model outputs the probability that a product is relevant.
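
A minimal sketch of the "embeddings in, probability out" idea. Scoring with a dot product followed by a sigmoid is just one common choice that I am assuming here; the function name and the example vectors are hypothetical.

```python
import math
from typing import Sequence


def relevance_probability(user_embedding: Sequence[float],
                          product_embedding: Sequence[float]) -> float:
    """Map a (user, product) embedding pair to a probability of relevance."""
    if len(user_embedding) != len(product_embedding):
        raise ValueError("embeddings must have the same dimension")
    # Dot product as the raw relevance score (assumed scoring function).
    score = sum(u * p for u, p in zip(user_embedding, product_embedding))
    # Sigmoid squashes the score into (0, 1).
    return 1.0 / (1.0 + math.exp(-score))


user = [0.2, -0.1, 0.7]
product = [0.5, 0.3, 0.9]
probability = relevance_probability(user, product)
is_relevant = probability >= 0.5  # threshold to get the binary output
print(probability, is_relevant)
```

Thresholding the probability gives the binary relevant / not relevant label, while the raw probability is what would be used to rank products.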

Choosing the right ML category

Choose among the following:

Regression - what is the range
Classification - binary / multi-class / multi-label
Supervised / Unsupervised / Semi-supervised
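
The choices above can be written down as a small decision helper. This is a simplified sketch: the function names and the rule of deciding the paradigm purely from the fraction of labeled data are my own assumptions, and real projects often mix approaches.

```python
def learning_paradigm(labeled_fraction: float) -> str:
    """Pick a paradigm from how much of the data is labeled (simplified rule)."""
    if not 0.0 <= labeled_fraction <= 1.0:
        raise ValueError("labeled_fraction must be between 0 and 1")
    if labeled_fraction == 0.0:
        return "unsupervised"
    if labeled_fraction == 1.0:
        return "supervised"
    return "semi-supervised"


def task_type(target_is_continuous: bool,
              num_classes: int = 0,
              multiple_labels_per_example: bool = False) -> str:
    """Pick regression or a classification flavor from the shape of the target."""
    if target_is_continuous:
        return "regression"
    if multiple_labels_per_example:
        return "multi-label classification"
    if num_classes == 2:
        return "binary classification"
    if num_classes > 2:
        return "multi-class classification"
    raise ValueError("a categorical target needs at least 2 classes")


print(learning_paradigm(0.2), "/", task_type(False, num_classes=2))
```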

Data Preparation

Data Engineering:

There are 3 key

Questions to Ask

What are the constraints of the system?
1. End devices?
2. What is the use of the model?
Data characteristics
1. Size
2. Output - Categorical or Continuous
3. Labeled - Supervised Learning / Unsupervised Learning / Semi-supervised Learning
4. Missing data - Handling Missing Data
5. Imbalanced? - Handling Imbalanced Dataset
6. Outliers - Handling Outliers
Which machine learning approach to use?
1. Machine Learning Algorithm Selection
2. Need to be interpretable?
3. Online Learning?
4. Recommendation system?
Model Evaluation
1. #evaluation
2. Which is more important: positives or negatives?
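
For the evaluation question of whether positives or negatives matter more, one way to encode the answer is through precision, recall, and F-beta. This is a sketch under my own assumptions: labels are 0/1, and F-beta is the metric I chose for illustration (beta > 1 weights recall more, beta < 1 weights precision more).

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical imbalanced example: 3 positives out of 8.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r, f_beta(p, r, beta=1.0), f_beta(p, r, beta=2.0))
```

If missing a positive is the costly mistake, recall (or F-beta with beta > 1) is the number to watch. If false alarms are costly, precision matters more. On an imbalanced dataset plain accuracy hides both.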

Related Notes