I have seen the Future, and it is Not JavaScript: LLM Training Data

Wednesday, 9 July 2025

LLM Training Data

LLMs are trained on large data sets. One such data set is Common Crawl, which consists of 250 billion Internet pages, with 3-5 billion pages added each month. This amounts to petabytes of data (1 petabyte = 10^15 bytes). The data is stored on Amazon's S3 service, allowing direct download or access for MapReduce processing in EC2.
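To get a feel for these numbers, here is a rough back-of-the-envelope sketch in Python. The page count and the petabyte definition come from the paragraph above; the total size of 5 petabytes and the 1 Gbit/s link speed are purely hypothetical values chosen for illustration, not published Common Crawl figures, and the function names are my own.

```python
PETABYTE = 10**15  # bytes, as defined in the text above


def average_page_bytes(total_petabytes: float, pages: int) -> float:
    """Average stored size of one page, in bytes."""
    return total_petabytes * PETABYTE / pages


def download_days(total_petabytes: float, gbit_per_second: float) -> float:
    """Days needed to transfer the data over a link of the given speed."""
    bytes_per_second = gbit_per_second * 1e9 / 8  # 8 bits per byte
    seconds = total_petabytes * PETABYTE / bytes_per_second
    return seconds / 86_400  # seconds in a day


if __name__ == "__main__":
    pages = 250 * 10**9       # page count quoted in the text
    assumed_petabytes = 5.0   # ASSUMPTION: illustrative total size only
    assumed_link_gbit = 1.0   # ASSUMPTION: illustrative link speed only

    avg = average_page_bytes(assumed_petabytes, pages)
    days = download_days(assumed_petabytes, assumed_link_gbit)
    print(f"Average page size: {avg / 1000:.0f} KB")
    print(f"Download time at {assumed_link_gbit:g} Gbit/s: {days:.0f} days")
```

Under those assumed values the average works out to about 20 KB per page, and a full download over a single 1 Gbit/s link would take more than a year, which suggests why processing the data in place on EC2 can be more practical than downloading it.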
