TF-IDF

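$$\mathrm{TF}(w, d) = \frac{\text{count of word } w \text{ in doc } d}{\text{total \# of words in doc } d}$$

$$\mathrm{IDF}(w, D) = \log \frac{\text{total \# of docs in corpus } D}{\text{\# of documents with word } w}$$

$$\mathrm{TFIDF}(w) = \mathrm{TF}(w, d) \times \mathrm{IDF}(w, D)$$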
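The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `tf`, `idf`, and `tfidf` and the toy corpus are made up here, tokenization is plain whitespace splitting, and the logarithm is assumed to be the natural log because the formula does not state a base.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Term frequency: count of `word` in the doc divided by the doc length."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[word] / len(doc_tokens)

def idf(word, corpus_tokens):
    """Inverse document frequency: log(total docs / docs containing `word`)."""
    n_containing = sum(1 for doc in corpus_tokens if word in doc)
    if n_containing == 0:
        # Assumption: a word seen in no document gets weight 0 instead of
        # a division-by-zero error; the source formula leaves this case open.
        return 0.0
    return math.log(len(corpus_tokens) / n_containing)

def tfidf(word, doc_tokens, corpus_tokens):
    """TF-IDF score of `word` for one document within a corpus."""
    return tf(word, doc_tokens) * idf(word, corpus_tokens)

# Toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

score_the = tfidf("the", corpus[0], corpus)  # "the" occurs in every document
score_mat = tfidf("mat", corpus[0], corpus)  # "mat" occurs in one document
```

Under these assumptions, a word that occurs in every document (here "the") gets an IDF of log(1) = 0, so its TF-IDF score is zero however often it appears, while "mat" receives a positive score.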

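Why can't we use just one? TF alone rewards words that are frequent in every document (such as "the"), while IDF alone is the same for every document and ignores how often a word occurs in the document at hand.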
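To make the question concrete, the sketch below ranks the words of one toy document by TF alone, by IDF alone, and by their product. The corpus is invented for illustration, tokens are split on whitespace, and the natural log is assumed.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration; the first document is mostly about a cat.
docs = [
    "the cat saw the cat chase the cat off the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "the bird and the dog sang".split(),
]
doc = docs[0]
counts = Counter(doc)

# TF depends only on the document; IDF depends only on the corpus.
tf = {w: c / len(doc) for w, c in counts.items()}
idf = {w: math.log(len(docs) / sum(w in d for d in docs)) for w in counts}
tfidf = {w: tf[w] * idf[w] for w in counts}

top_by_tf = max(tf, key=tf.get)           # most frequent word in the document
top_by_tfidf = max(tfidf, key=tfidf.get)  # highest combined score
```

In this toy setup TF alone picks "the", which says nothing about the topic. IDF alone scores "mat" above "cat" because "mat" is rarer in the corpus, even though the document mentions "cat" three times. The product ranks "cat" first, which is the word that is both frequent in this document and uncommon elsewhere.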

Pros:

  1. Simple
  2. Efficient
  3. Effective in document retrieval

Limitations:

  1. All limitations of count-based word embeddings
  2. Bias towards rare tokens
  3. Sparse representation (high-dimensional, sparse data)
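The last two limitations can be observed even on a tiny example. The following sketch builds a dense TF-IDF matrix over a made-up corpus and measures how many entries are zero. The helper `vectorize` and all variable names are hypothetical, and the natural log is assumed.

```python
import math

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stock markets fell sharply today".split(),
    "the zebra ran".split(),
]

# One vector dimension per distinct word in the corpus.
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)
df = {w: sum(w in d for d in docs) for w in vocab}  # document frequency
idf = {w: math.log(n_docs / df[w]) for w in vocab}

def vectorize(doc):
    """Return the TF-IDF vector of `doc` over the whole vocabulary."""
    return [(doc.count(w) / len(doc)) * idf[w] for w in vocab]

matrix = [vectorize(d) for d in docs]

# Fraction of zero entries in the document-term matrix.
n_zeros = sum(v == 0.0 for row in matrix for v in row)
sparsity = n_zeros / (n_docs * len(vocab))
```

Each document uses only a handful of the vocabulary's words, so most entries of every row are zero, and the vector length grows with every new distinct word in the corpus. The bias towards rare tokens also shows up: "zebra", which occurs in a single document, gets a higher IDF than "the", and the same would hold for a typo or other noise token that happens to appear only once.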

Applications:

References: