- Parallelization: The Transformer is highly parallelizable because self-attention attends to all tokens in a sequence simultaneously; an LSTM, by contrast, processes data sequentially, since each cell depends on the hidden state of the previous cell (see the sketch after this list).
- Long-Range Dependency: The Transformer's self-attention gives a direct path to tokens generated long before, no matter how distant they are. The LSTM's gating mechanism helps, but it still struggles to capture very long-range dependencies because of vanishing or exploding gradients.
- Computation & Memory: Transformers are computation-heavy, since attention compares every pair of positions (quadratic in sequence length). LSTMs are lighter per step, but their sequential nature makes training and inference very slow.
- Sequential vs Non-Sequential: The Transformer inherently lacks sequential information and relies on positional encodings to supply token order; the LSTM processes text sequentially, so order information is already built in (see the positional-encoding sketch below).
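To make the parallelization contrast concrete, here is a minimal PyTorch sketch (the dimensions and the choice of `nn.MultiheadAttention` and `nn.LSTMCell` are illustrative assumptions, not from the original notes): self-attention produces outputs for every position in one batched operation, while the LSTM must loop over time steps because each hidden state depends on the previous one.

```python
# Minimal sketch (assumes PyTorch; dimensions are illustrative).
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 16, 64
x = torch.randn(batch, seq_len, d_model)

# Transformer-style self-attention: every token attends to every other
# token in one batched operation -- all positions computed in parallel.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
attn_out, attn_weights = attn(x, x, x)          # (batch, seq_len, d_model)

# LSTM: the hidden state at step t depends on step t-1, so the loop over
# time cannot be parallelized across the sequence dimension.
cell = nn.LSTMCell(input_size=d_model, hidden_size=d_model)
h = torch.zeros(batch, d_model)
c = torch.zeros(batch, d_model)
lstm_outputs = []
for t in range(seq_len):                        # inherently sequential
    h, c = cell(x[:, t, :], (h, c))
    lstm_outputs.append(h)
lstm_out = torch.stack(lstm_outputs, dim=1)     # (batch, seq_len, d_model)

# attn_weights has shape (batch, seq_len, seq_len): this pairwise map is
# where the quadratic compute/memory cost of attention comes from.
print(attn_out.shape, lstm_out.shape, attn_weights.shape)
```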
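Because self-attention by itself is permutation-invariant, Transformers typically inject order through positional encodings. The sketch below uses the common sinusoidal formulation as an assumed example (not taken from the original notes) to show how position information can be added to token embeddings before the first attention layer.

```python
# Minimal sketch of sinusoidal positional encodings (assumes PyTorch).
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of fixed positional encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )                                                                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
    return pe

# Added to token embeddings so attention can distinguish token order.
x = torch.randn(2, 16, 64)                         # (batch, seq_len, d_model)
x = x + sinusoidal_positional_encoding(16, 64)     # broadcast over batch
```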
References