Word Tokenizer

#deep-learning #nlp #interview

A word tokenizer splits each sentence (or any piece of text) into tokens on spaces. It is also known as a space (whitespace) tokenizer.

Example:

Sentence: the low you go, the lower you find yourself.
Tokens: [the, low, you, go, the, lower, you, find, yourself]
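
A minimal Python sketch of the example above. The function name `word_tokenize` is made up for this note, and stripping punctuation from each piece is an assumption, added because the token list above drops the comma and the final period:

```python
import string

def word_tokenize(sentence):
    # Split on whitespace, then strip leading/trailing punctuation
    # from each piece (assumption: punctuation is not kept as tokens).
    pieces = [piece.strip(string.punctuation) for piece in sentence.split()]
    # Drop pieces that consisted of punctuation only.
    return [tok for tok in pieces if tok]

tokens = word_tokenize("the low you go, the lower you find yourself.")
print(tokens)
# ['the', 'low', 'you', 'go', 'the', 'lower', 'you', 'find', 'yourself']
```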

Cons:

In the previous example, "low" and "lower" are treated as completely different words, even though they are closely related.
At test time, if unseen words such as "high" and "mid" appear, both are mapped to the same "OOV" (out-of-vocabulary) token, even though they have different meanings.
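
Both cons can be seen in a small sketch. It assumes a vocabulary built only from the example sentence and a single catch-all `<OOV>` entry; the helper names `build_vocab` and `encode` are hypothetical, not from any library:

```python
def build_vocab(tokens, unk_token="<OOV>"):
    # Reserve ID 0 for every word not seen during training.
    vocab = {unk_token: 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab, unk_token="<OOV>"):
    # Unknown words all fall back to the same OOV ID.
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

train_tokens = "the low you go the lower you find yourself".split()
vocab = build_vocab(train_tokens)

# "low" and "lower" get unrelated IDs; nothing links them.
print(encode(["low", "lower"], vocab))  # [2, 5]

# "high" and "mid" were never seen, so both collapse to the OOV ID.
print(encode(["high", "mid"], vocab))   # [0, 0]
```

The IDs carry no information about spelling, so the model cannot tell that "lower" is derived from "low", and after encoding it cannot distinguish "high" from "mid" at all.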

References

Related Notes