Thursday, 23 April 2026

Apache Spark (and its roots in Scala)

Apache Spark is the distributed compute engine underlying many data platforms.

It is written in both Scala and Java. The source code is in the apache/spark repository on GitHub.

A good starting point is SparkSession.scala.

One of Spark's "selling points" is "Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling" (see detailed post on downsampling). 
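To make the EDA claim a little more concrete, here is a minimal sketch of what a first look at a dataset might look like in Scala. It assumes the spark-sql library is on the classpath; the object name `EdaSketch`, the helper `profile`, and the dataset path passed as `args(0)` are all hypothetical, and `local[*]` is only there so the sketch can be tried on a laptop.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EdaSketch {
  // Summary statistics computed over every row of the input,
  // with no sampling step in between.
  def profile(df: DataFrame): DataFrame =
    df.summary("count", "mean", "stddev", "min", "max")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("eda-sketch")
      .master("local[*]") // assumption: local mode, just for trying this out
      .getOrCreate()

    // Hypothetical Parquet dataset location, passed on the command line.
    val events = spark.read.parquet(args(0))

    events.printSchema()   // column names and types
    profile(events).show() // one row per requested statistic

    spark.stop()
  }
}
```

The same code is meant to work whether the path points at a few megabytes or at a much larger dataset on a cluster; what changes is the master setting and the resources behind it, not the analysis code.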

A petabyte (PB) is 1,000 terabytes, or 10^15 bytes (one thousand million million bytes).
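The arithmetic is easy to check in plain Scala. This is just a small sketch using decimal (SI) units; the object and value names are made up for illustration.

```scala
object PetabyteArithmetic {
  // Decimal (SI) units: each step up is a factor of 1000.
  val BytesPerTerabyte: Long = 1000L * 1000L * 1000L * 1000L // 10^12
  val BytesPerPetabyte: Long = 1000L * BytesPerTerabyte      // 10^15

  // "One thousand million million", spelled out factor by factor.
  val ThousandMillionMillion: Long = 1000L * 1000000L * 1000000L

  def main(args: Array[String]): Unit = {
    println(s"1 PB = ${BytesPerPetabyte / BytesPerTerabyte} TB")
    println(s"1 PB = $BytesPerPetabyte bytes")
    println(s"Matches the spelled-out figure: ${BytesPerPetabyte == ThousandMillionMillion}")
  }
}
```

The `L` suffixes matter here: 10^15 does not fit in a 32-bit `Int`, so the multiplication has to be done on `Long` values.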
