Sunday, 23 November 2025

Mastering BERT and DistilBERT

This is BERT.  

Introduced in 2018, it stands for "Bidirectional Encoder Representations from Transformers".  A possible simplification of BERT could be "BET" or "Bidirectional Encoding from Transformers".

The "bidirectional" component means it uses context to both the left and the right of each word.
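One way to picture "bidirectional" is as an attention mask. The sketch below is a plain-Python illustration, not BERT's actual code. The function names `bidirectional_mask` and `left_to_right_mask` and the example sentence are made up for this post. A 1 at row i, column j means the word at position i may look at the word at position j.

```python
def bidirectional_mask(n):
    """Every position may attend to every position (left and right context)."""
    return [[1] * n for _ in range(n)]


def left_to_right_mask(n):
    """Position i may attend only to positions 0..i (left context only)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


# Hypothetical example sentence, already split into words.
tokens = ["the", "bank", "of", "the", "river"]
n = len(tokens)

bi = bidirectional_mask(n)
ltr = left_to_right_mask(n)

# "bank" sits at index 1. Which words can it see under each scheme?
visible_bi = [t for t, m in zip(tokens, bi[1]) if m]
visible_ltr = [t for t, m in zip(tokens, ltr[1]) if m]

print(visible_bi)   # every word in the sentence, including "river"
print(visible_ltr)  # only "the" and "bank"
```

With the bidirectional mask, "bank" can see "river" to its right, which is the kind of context that helps tell a river bank from a financial one.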

Its GLUE score is about 80% (the BERT paper reports 80.5%).

The BERT paper was published by researchers from Google's AI Language group.

Don't know what a GLUE score is? It's a score that evaluates how well a language model understands natural language, and it comes from the GLUE benchmark (General Language Understanding Evaluation). It is used less today, as most modern LLMs max out GLUE: they are too good for it to tell them apart.
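As far as I understand the benchmark, the headline GLUE number is an unweighted average over its tasks, where a task that reports more than one metric has those metrics averaged first. Here is a minimal sketch of that arithmetic. The task names and numbers are placeholders I made up, not real results, and `glue_style_score` is a hypothetical helper, not part of any official tooling.

```python
def task_score(metrics):
    """Average the metrics reported for a single task."""
    return sum(metrics) / len(metrics)


def glue_style_score(per_task_metrics):
    """Unweighted average of the per-task scores."""
    scores = [task_score(m) for m in per_task_metrics.values()]
    return sum(scores) / len(scores)


# Placeholder values for illustration only (NOT real benchmark results).
placeholder_results = {
    "task_a": [80.0],        # one metric
    "task_b": [90.0, 70.0],  # two metrics, averaged to 80.0 first
    "task_c": [60.0],
}

overall = glue_style_score(placeholder_results)
print(round(overall, 1))
```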

This is DistilBERT, a smaller and faster version of BERT introduced in 2019 with edge computing in mind.

In the DistilBERT paper, the authors also point out: "We have made the trained weights available along with the training code in the Transformers library from HuggingFace".
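Since the weights are in the Transformers library, loading them could look something like the sketch below. It assumes `transformers` and PyTorch are installed, and that the checkpoint name `distilbert-base-uncased` is the one you want. The helper names `load_distilbert` and `embed` are my own, not something from the paper.

```python
def load_distilbert(model_name="distilbert-base-uncased"):
    """Download (or load from the local cache) a DistilBERT checkpoint."""
    # Imported inside the function so this file can be read and imported
    # even on a machine where the library is not installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return tokenizer, model


def embed(text, tokenizer, model):
    """Return one hidden-state vector per token for a single sentence."""
    inputs = tokenizer(text, return_tensors="pt")  # "pt" = PyTorch tensors
    outputs = model(**inputs)
    return outputs.last_hidden_state  # shape: (1, num_tokens, hidden_size)


# Usage (needs network access the first time, to fetch the weights):
# tokenizer, model = load_distilbert()
# print(embed("This is DistilBERT.", tokenizer, model).shape)
```

Swapping the checkpoint name for a BERT one (for example `bert-base-uncased`) should work with the same two helpers, which is a nice side effect of the two models sharing a general architecture.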
 
It is worthwhile to study BERT as DistilBERT has the same general architecture as BERT.
