Tokenising is needed by chat interfaces (and by other applications such as compilers and interpreters) to break text into tokens.
BPE stands for byte-pair encoding. It was described in 1994 by Philip Gage, and a modified version is used in LLMs. The original algorithm is a clever compression technique: it repeatedly replaces the most frequently occurring pair of adjacent bytes with a new byte that does not occur in the original data, and uses a lookup table to recreate the original text. A modification extends this technique into tokenisation.
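To make the compression idea concrete, here is a minimal Python sketch of the original byte-replacement scheme as described above. It is an illustration under simplifying assumptions, not Gage's actual implementation: the function names `bpe_compress` and `bpe_decompress` are my own, the whole input is held in memory, and pair counts include overlapping occurrences.

```python
from collections import Counter


def bpe_compress(data: bytes):
    """Repeatedly replace the most frequent adjacent byte pair with an unused byte.

    Returns the compressed bytes and the lookup table needed to undo it.
    """
    seq = list(data)
    present = set(data)
    # Byte values that never occur in the input are free to use as new symbols.
    unused = [b for b in range(256) if b not in present]
    table = {}  # new byte -> (left, right) pair it stands for

    while unused and len(seq) >= 2:
        # Count adjacent pairs (overlapping occurrences are counted too).
        pair, count = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if count < 2:
            break  # nothing left that repeats
        new = unused.pop()
        table[new] = pair

        # Rewrite the sequence left to right, substituting the new byte.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out

    return bytes(seq), table


def bpe_decompress(data: bytes, table):
    """Expand replacement bytes using the lookup table, newest entry first."""
    seq = list(data)
    for new, (left, right) in reversed(list(table.items())):
        out = []
        for b in seq:
            if b == new:
                out.extend((left, right))
            else:
                out.append(b)
        seq = out
    return bytes(seq)
```

Calling `bpe_compress(b"aaabdaaabac")` gives back a shorter byte string plus a table, and passing both to `bpe_decompress` recovers the original input. The table has to be undone newest entry first because later replacements can themselves contain earlier replacement bytes.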