Self-supervised learning

Supervised learning needs labeled data; self-supervised learning needs no labels (the supervision signal is constructed from the input itself).

Bidirectional Encoder Representations from Transformers (BERT)
- Architecture: Transformer encoder. Pre-training works like a cloze test.
- Masking input: randomly mask some of the input tokens (replace them with a special mask token or with a random token); the model learns to predict the original tokens.
- Next sentence prediction: predict whether the second sentence actually follows the first in the corpus.
- Fine-tune: continue training the pre-trained BERT on new, task-specific data.
- General Language Understanding Evaluation (GLUE): a set of benchmark tasks; fine-tuned models are compared by their GLUE score.

Downstream tasks (label some data for the task we care about):
- Sentiment analysis
- Natural Language Inference (NLI)
- Part-of-speech (POS) tagging
- Extraction-based Question Answering (QA): the answer is a span of the document.
  Document: D = {d_1, d_2, ..., d_N}
  Query: Q = {q_1, q_2, ..., q_M}
  Answer: A = {d_s, ..., d_e}, i.e. the document tokens from start position s to end position e

Multi-lingual BERT: BERT pre-trained on text from many languages.

Generative Pre-trained Transformer (GPT)
- Architecture: Transformer decoder. Pre-training task: predict the next token.
- In-context learning: the task is specified by examples in the prompt, with no gradient descent (few-shot, one-shot, zero-shot).
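
A minimal sketch of BERT's input-masking step described above. The notes only say "mask or replace with a random token"; this sketch assumes the 80/10/10 corruption scheme from the BERT paper, and the MASK_ID and VOCAB_SIZE constants are those of the standard bert-base-uncased vocabulary.

```python
import random

MASK_ID = 103       # [MASK] token id in the bert-base-uncased vocabulary
VOCAB_SIZE = 30522  # bert-base WordPiece vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Randomly corrupt tokens for masked language modeling.

    Returns the corrupted sequence and the labels; -100 marks
    positions that do not contribute to the loss.
    """
    corrupted, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            labels.append(tid)  # model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_ID)  # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tid)  # 10%: keep the original token
        else:
            corrupted.append(tid)
            labels.append(-100)  # unmasked positions are ignored by the loss
    return corrupted, labels
```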
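
One way to realize the fine-tuning step for a downstream task like sentiment analysis, sketched here with the Hugging Face transformers library (a tooling choice not made in the notes): a fresh classification head is placed on top of the pre-trained encoder, and the whole model is trained on a small amount of labeled task data.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Pre-trained encoder plus a randomly initialized classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One labeled example from the downstream task (sentiment analysis).
batch = tokenizer("this movie was great", return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive

# A single fine-tuning step: the pre-trained weights are the initialization.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```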
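
For extraction-based QA, the answer A = {d_s, ..., d_e} is a span of the document, so fine-tuning only has to predict the two positions s and e. A common realization (assumed here, not spelled out in the notes) learns two vectors that score each document token's encoder output as the start or end of the span:

```python
import torch

def predict_span(hidden_states, start_vec, end_vec):
    """Pick the answer span A = {d_s, ..., d_e} from encoder outputs.

    hidden_states: (N, H) BERT outputs for the N document tokens
    start_vec, end_vec: (H,) learned vectors scoring span boundaries
    """
    start_scores = hidden_states @ start_vec   # one score per token
    end_scores = hidden_states @ end_vec
    s = int(torch.argmax(start_scores))        # start position s
    e = int(torch.argmax(end_scores[s:])) + s  # end position e, constrained to e >= s
    return s, e
```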
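
In-context learning updates no weights: the task is specified entirely inside the prompt, and GPT's next-token prediction produces the answer. A few-shot prompt might look like the sketch below (the translation examples are illustrative, not from the notes):

```python
# Few-shot: the prompt carries task demonstrations; the model's
# next-token prediction completes the final line. No gradient descent.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"       # demonstration 1
    "peppermint => menthe poivrée\n"     # demonstration 2
    "cheese =>"                          # query; the model should predict "fromage"
)
```

Zero-shot drops the demonstrations entirely and keeps only the task description and the query; one-shot keeps a single demonstration.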