TF-IDF

#machine-learning #interview #nlp

TF-IDF measures how important a word is to a document within a collection of documents (a corpus)
TF = Term Frequency
- how often the word occurs in a document, normalized by the document's length
IDF = Inverse Document Frequency
- how rare the term is across all documents; a word that appears in most documents gets a low score
TF-IDF is the product of TF and IDF

TF-IDF

$\text{TF-IDF}(w) = \frac{\text{count of word } w \text{ in a doc}}{\text{total number of words in a doc}} \times \log \frac{\text{total number of docs}}{\text{number of docs with word } w}$
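
A minimal sketch of the formula in plain Python, assuming documents are already tokenized into lists of words. The function name `tf_idf`, the toy corpus, and the choice to return `0.0` for a word that appears in no document are assumptions made for illustration. Libraries such as scikit-learn use a smoothed variant of IDF, so their scores will differ.

```python
import math
from collections import Counter

def tf_idf(word, doc, docs):
    """TF-IDF of `word` in the tokenized `doc`, given the tokenized corpus `docs`."""
    # TF: count of the word in the doc / total number of words in the doc
    tf = Counter(doc)[word] / len(doc)
    # Number of documents that contain the word
    n_containing = sum(1 for d in docs if word in d)
    if n_containing == 0:
        return 0.0  # assumption: an unseen word scores 0 instead of dividing by zero
    # IDF: log(total number of docs / number of docs with the word)
    idf = math.log(len(docs) / n_containing)
    return tf * idf

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

print(tf_idf("the", docs[0], docs))  # 0.0, because "the" is in every document
print(tf_idf("mat", docs[0], docs))  # (1/6) * ln(3), about 0.183
```

"the" is the most frequent word in the first document, but its IDF is $\log(3/3) = 0$, so its score is zero. "mat" occurs once and only in that document, so it gets the highest score there.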

Pros:

- Simple
- Efficient
- Effective in document retrieval

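Limitations:

- All the limitations of count-based word embeddings (e.g. no notion of word order or meaning)
- Bias towards rare tokens (a rare token gets a high IDF even when it is noise, such as a typo)
- Sparse, high-dimensional representation (one dimension per vocabulary word, mostly zeros)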

TF-IDF scores can also be used as word embeddings by replacing the $1$ in a word's one-hot vector with its TF-IDF score.
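
A rough sketch of that idea, assuming the score is computed per document, so the same word can get a different vector in each document. The names `tf_idf` and `tfidf_one_hot`, the toy corpus, and the alphabetical vocabulary order are all assumptions for illustration. The scoring helper is repeated here so the snippet runs on its own.

```python
import math

def tf_idf(word, doc, docs):
    """TF-IDF of `word` in the tokenized `doc`, given the tokenized corpus `docs`."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    # assumption: a word found in no document scores 0
    return tf * math.log(len(docs) / df) if df else 0.0

def tfidf_one_hot(word, doc, docs, vocab):
    """One-hot vector for `word` with the 1 replaced by its TF-IDF score in `doc`."""
    vec = [0.0] * len(vocab)  # one dimension per vocabulary word
    vec[vocab.index(word)] = tf_idf(word, doc, docs)
    return vec

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vocab = sorted({w for d in docs for w in d})  # alphabetical order, 8 words

vec = tfidf_one_hot("mat", docs[0], docs, vocab)
print(vocab)
print(vec)  # zeros everywhere except at the index of "mat"
```

The vector keeps the sparsity of the one-hot encoding: its length equals the vocabulary size and only one entry is non-zero, which is the sparse, high-dimensional limitation listed above.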