TF-IDF
- TF-IDF measures how important a word is to a document within a collection of documents (a corpus)
- TF = Term Frequency
- the number of times the word appears in a document
- IDF = Inverse Document Frequency
- a measure of how rare the term is across all documents: $\mathrm{IDF}(t) = \log\frac{N}{\mathrm{df}(t)}$, where $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$
- TF-IDF is the product of the two: $\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$ (see the sketch below)
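To make these definitions concrete, here is a minimal from-scratch sketch; the toy corpus and the helper function names are illustrative, not from any library:

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
docs = [doc.split() for doc in corpus]
N = len(docs)  # total number of documents

def tf(term, doc):
    # Term Frequency: raw count of the term in one document
    return Counter(doc)[term]

def idf(term):
    # Inverse Document Frequency: log(N / number of docs containing term)
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df) if df else 0.0

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("cat", docs[0]))  # appears in 1 of 3 docs -> higher score
print(tf_idf("the", docs[0]))  # appears in 2 of 3 docs -> lower IDF weight
```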
Why can't we use just one?
- If we used TF alone, terms like "a" and "the" would receive very high importance, even though in reality those words carry no importance
- That is where the IDF part comes in to down-weight the TF. If a word appears in every document, it carries no discriminative importance, so its IDF becomes 0 ($\mathrm{IDF} = \log\frac{N}{N} = \log 1 = 0$); but if it appears in only 1 document, that term is highly distinctive across the corpus, and its IDF is large ($\mathrm{IDF} = \log\frac{N}{1} = \log N$).
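As a quick numeric check of the two extremes above (the corpus size $N = 1000$ is an assumed toy value):

```python
import math

N = 1000
print(math.log(N / N))  # term in every document -> IDF = log(1) = 0
print(math.log(N / 1))  # term in a single document -> IDF = log(1000) ~= 6.9
```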
Pros
- Simple
- Efficient
- Effective in document retrieval
Limitations:
- All the limitations of count-based word embeddings
- Bias towards rare tokens
- Sparse representation (high-dimensional, mostly-zero vectors)
Applications:
- Information Retrieval
- Text Mining
- Document Classification
- Search Engines
- Recommendation Systems
- TF-IDF can also be used as a word embedding, by replacing the 1s in the one-hot vector with the TF-IDF scores (see the sketch below).
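A sketch of this idea using scikit-learn's TfidfVectorizer (assuming scikit-learn is installed; the corpus is illustrative): each document becomes a vector over the vocabulary in which the positions that a one-hot/count encoding would fill with 1s hold TF-IDF scores instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # vocabulary order of the columns
print(X.toarray().round(2))                # TF-IDF score in each position
```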