Pre-Training LLMs
The pre-training objective of an LLM depends on its architecture:
- Next Sentence Prediction (NSP): Given a pair of sentences, the model has to predict whether the second sentence actually follows the first in the original text. Training with a Binary Cross Entropy loss teaches the model sentence-level semantic relationships (see the sketch after this list).
- Masked Language Modeling (MLM): The model has to predict the tokens that have been masked out of the input. It is trained with a multi-class Cross Entropy loss computed only at the masked positions, so the model learns how tokens relate semantically to their surrounding context (see the sketch after this list).
- Auto-Regressive Language Modeling: The model has to predict the next token given the sequence so far, and is trained with a multi-class Cross Entropy loss between the predicted token distribution and the ground-truth next token. The only difference from Instruction Fine-Tuning (IFT) is which tokens are scored: in pre-training the Cross Entropy loss is computed over all tokens, while in IFT it is computed only over the assistant's response tokens (see the sketch after this list).
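
A minimal PyTorch sketch of the NSP loss, assuming a hypothetical encoder whose [CLS] representation is stood in for by a random tensor; the sizes and single-logit BCE head are illustrative (BERT itself uses a 2-way softmax head), not a specific model's implementation:

```python
import torch
import torch.nn as nn

hidden_size, batch_size = 64, 4  # toy sizes for illustration

# Stand-in for the encoder's [CLS] output for each sentence pair.
cls_embeddings = torch.randn(batch_size, hidden_size)

# NSP head: a single linear layer producing one logit per pair.
nsp_head = nn.Linear(hidden_size, 1)

# Labels: 1 if sentence B actually follows sentence A, else 0.
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

logits = nsp_head(cls_embeddings).squeeze(-1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(loss.item())
```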
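A minimal sketch of the MLM loss, with toy sizes and random hidden states standing in for a real encoder; the `-100` ignore-index convention follows common PyTorch practice, so only the masked positions contribute to the loss:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, seq_len, batch_size = 1000, 64, 8, 2  # toy sizes

# Stand-in for the encoder's per-token hidden states.
hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# MLM head projects each hidden state onto the vocabulary.
mlm_head = nn.Linear(hidden_size, vocab_size)
logits = mlm_head(hidden_states)  # (batch, seq_len, vocab)

# Labels hold the original token ids at masked positions and -100 elsewhere,
# so the Cross Entropy loss is computed only over the masked tokens.
labels = torch.full((batch_size, seq_len), -100)
labels[0, 2] = 17   # e.g. position 2 of sample 0 was masked
labels[1, 5] = 423  # e.g. position 5 of sample 1 was masked

loss = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)
print(loss.item())
```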
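A minimal sketch contrasting the auto-regressive pre-training loss with the IFT loss under the usual shift-by-one convention; the logits are random stand-ins for a real decoder's output, and the prompt length used for masking is an arbitrary choice for illustration:

```python
import torch
import torch.nn as nn

vocab_size = 1000  # toy vocabulary

# Toy sequence of token ids; a real decoder would produce the logits from these.
input_ids = torch.tensor([[5, 17, 42, 8, 99, 3]])
logits = torch.randn(1, input_ids.size(1), vocab_size)  # stand-in for model output

# Shift so the prediction at position t is scored against the token at t+1.
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:].clone()

# Pre-training: Cross Entropy over every position.
pretrain_loss = nn.functional.cross_entropy(
    shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1)
)

# IFT: mask the prompt tokens (here the first 3 positions, chosen arbitrarily)
# with -100 so only the assistant response tokens are scored.
ift_labels = shift_labels.clone()
ift_labels[:, :3] = -100
ift_loss = nn.functional.cross_entropy(
    shift_logits.reshape(-1, vocab_size), ift_labels.reshape(-1), ignore_index=-100
)
print(pretrain_loss.item(), ift_loss.item())
```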