Wednesday, 9 July 2025

LLM Training Data

LLMs are trained on large data sets. One such data set is Common Crawl, which consists of 250 billion Internet pages, with 3-5 billion pages added each month. This amounts to petabytes of data (1 petabyte = 10^15 bytes). The data is stored on Amazon's S3 service, allowing direct download or access for MapReduce processing in EC2.
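
To get a feel for these numbers, here is a rough back-of-the-envelope sketch in Python. The page counts and the petabyte definition come from the figures above; the average stored size per page (`ASSUMED_BYTES_PER_PAGE`) is a made-up placeholder for illustration only, not a published Common Crawl statistic, so the resulting size is only as good as that guess.

```python
# Back-of-the-envelope arithmetic for the Common Crawl figures quoted above.

PETABYTE = 10**15  # bytes, as defined in the text

TOTAL_PAGES = 250_000_000_000       # 250 billion pages (from the text)
MONTHLY_PAGES_LOW = 3_000_000_000   # 3 billion pages per month (from the text)
MONTHLY_PAGES_HIGH = 5_000_000_000  # 5 billion pages per month (from the text)

# ASSUMPTION: hypothetical average stored size of one page, in bytes.
# This is an illustrative placeholder, not a measured value.
ASSUMED_BYTES_PER_PAGE = 40_000


def bytes_to_petabytes(n_bytes: float) -> float:
    """Convert a byte count to petabytes (1 PB = 10^15 bytes)."""
    return n_bytes / PETABYTE


def monthly_growth_fraction(new_pages: float, total_pages: float) -> float:
    """Fraction by which the corpus grows in one month."""
    return new_pages / total_pages


def estimated_corpus_petabytes(
    pages: int = TOTAL_PAGES,
    bytes_per_page: int = ASSUMED_BYTES_PER_PAGE,
) -> float:
    """Estimated corpus size in petabytes under the assumed page size."""
    return bytes_to_petabytes(pages * bytes_per_page)


if __name__ == "__main__":
    low = monthly_growth_fraction(MONTHLY_PAGES_LOW, TOTAL_PAGES)
    high = monthly_growth_fraction(MONTHLY_PAGES_HIGH, TOTAL_PAGES)
    print(f"Monthly growth: {low:.1%} to {high:.1%} of the corpus")
    print(f"Estimated size: {estimated_corpus_petabytes():.1f} PB (assumed page size)")
```

Under these numbers, the monthly additions are a small percentage of the existing corpus, and even a modest assumed page size puts the total in the petabyte range.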
