Monday, 16 March 2026

OpenAI's Open Source Tokeniser

OpenAI has released a Python package called tiktoken, which is an open-source BPE tokeniser.

Tokenising means breaking text into tokens. Chat interfaces need it, as do other applications such as compilers and interpreters.

BPE stands for byte-pair encoding. It was described by Philip Gage in 1994, and a modified version is used in LLMs. The original algorithm is a clever compression technique: it repeatedly replaces the most frequently occurring pair of adjacent bytes with a new byte that does not appear in the data, and uses a lookup table to recreate the original text. The modified version extends this technique into tokenisation, merging frequent pairs into new tokens rather than substituting unused bytes.
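To make the original compression idea concrete, here is a minimal sketch in Python. It is an illustration only, not Gage's code and not how tiktoken is implemented; the function names bpe_compress and bpe_decompress are made up for this post, and it assumes the input leaves at least some of the 256 byte values unused.

```python
from collections import Counter


def bpe_compress(data: bytes):
    """Replace the most frequent adjacent byte pair with an unused byte, repeatedly."""
    seq = list(data)
    present = set(seq)
    # Byte values that never occur in the input are free to stand in for pairs.
    unused = [b for b in range(256) if b not in present]
    table = []  # (new_byte, (left, right)), in the order replacements were made

    while unused:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another replacement cannot help
        new_byte = unused.pop()
        table.append((new_byte, pair))

        # Scan left to right, swapping each occurrence of the pair for new_byte.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_byte)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out

    return bytes(seq), table


def bpe_decompress(data: bytes, table):
    """Undo the replacements in reverse order using the lookup table."""
    seq = list(data)
    for new_byte, (left, right) in reversed(table):
        out = []
        for b in seq:
            if b == new_byte:
                out.extend((left, right))
            else:
                out.append(b)
        seq = out
    return bytes(seq)


original = b"aaabdaaabac"
compressed, table = bpe_compress(original)
restored = bpe_decompress(compressed, table)
```

The lookup table is what makes the compression reversible: each entry records which pair a substitute byte stands for, and undoing the entries from last to first rebuilds the input. A tokeniser built on the same idea keeps the learned merges as its vocabulary rather than throwing them away after decompression.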
