LLMs are trained on large data sets. One such data set is Common Crawl, which consists of roughly 250 billion web pages, with 3-5 billion new pages added each month. That amounts to petabytes of data (1 petabyte = 10^15 bytes). The data is stored on Amazon's S3 service, allowing direct download or access for MapReduce processing in EC2.
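For readers who want to poke at the data themselves, here is a minimal Python sketch that fetches the list of WARC files for one crawl from the public S3 bucket. The bucket name, crawl ID, and key layout follow Common Crawl's published conventions, but they are assumptions on my part rather than details from this post, so check the current crawl listings before running it.

```python
# Minimal sketch: fetch the WARC file listing for one Common Crawl crawl
# from the public S3 bucket (bucket name and key layout are assumptions
# based on Common Crawl's published conventions).
import gzip

import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = "commoncrawl"                      # public S3 bucket (assumed)
CRAWL = "CC-MAIN-2024-10"                   # example crawl ID (assumed)
PATHS_KEY = f"crawl-data/{CRAWL}/warc.paths.gz"

# Anonymous (unsigned) requests: the bucket is publicly readable.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Download the gzipped list of WARC object keys for this crawl.
s3.download_file(BUCKET, PATHS_KEY, "warc.paths.gz")

with gzip.open("warc.paths.gz", "rt") as f:
    warc_keys = [line.strip() for line in f]

print(f"{len(warc_keys)} WARC files in {CRAWL}; first key: {warc_keys[0]}")

# Each key can then be downloaded the same way, e.g. as an input split
# for a MapReduce or Spark job running in EC2:
# s3.download_file(BUCKET, warc_keys[0], "segment-0.warc.gz")
```

Listing the keys first, rather than downloading blindly, matters at this scale: a single monthly crawl contains tens of thousands of WARC files, so you would typically hand the key list to a distributed job rather than pull the archives to one machine.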