Thursday, 23 April 2026

Downsampling from a Data Science Perspective

In data science and data processing, downsampling is defined as follows. (The DSP, or digital signal processing, definition of downsampling is similar in spirit but technically different, and is not covered here.)

Downsampling involves reducing the number of data points in the over-represented class (or classes) of a data set so that class sizes become comparable (sometimes referred to as "balancing the data").  This helps machine learning models avoid bias towards a dominant class.

Various approaches to downsampling (e.g. random downsampling) are described in this IBM article.
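
Random downsampling, the simplest of these approaches, can be sketched in a few lines of Python. This is a minimal illustration rather than the method from any particular article: the function name random_downsample, the label_of callback and the toy "ok"/"fraud" data set are all made up for the example, and it assumes every class should be cut down to the size of the smallest one.

```python
import random
from collections import Counter, defaultdict


def random_downsample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    if not by_class:
        return []

    # Every class is reduced to the size of the smallest class.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sample without replacement
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data set: 90 "ok" rows and 10 "fraud" rows.
rows = [("ok", i) for i in range(90)] + [("fraud", i) for i in range(10)]
balanced = random_downsample(rows, label_of=lambda row: row[0])
print(Counter(label for label, _ in balanced))
```

After the call, each class contributes 10 rows. The trade-off is that 80 of the "ok" rows are thrown away, which is why downsampling is usually weighed against the information it discards.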

Scala, Scala, Everywhere

For legacy observations on Scala, check out JVM stuff.  Here we build a fresh relationship with Scala.

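Scala is a strongly, statically typed language supporting OOP and functional programming.  Static typing means types are checked at compile time; strong typing means values are not silently converted to unrelated types when calling functions or in other scenarios (numeric widening and conversions that have been explicitly declared as implicit are the exceptions).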

A good starting point for learning Scala is scala.dev here.

Apache Spark (and its roots in Scala)

Apache Spark is a foundational layer underlying many data platforms. 

It is written mainly in Scala, with parts in Java. Read the source code here.

A good starting point is SparkSession.scala.

One of Spark's "selling points" is "Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling" (see detailed post on downsampling). 

A petabyte (PB) holds 1000 terabytes (one thousand million million bytes).
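
The arithmetic can be checked quickly. The sketch below assumes decimal (SI) units, as the sentence above does; the constant names are just for illustration.

```python
# Decimal (SI) storage units, matching the definition in the text.
TERABYTE = 10**12            # bytes
PETABYTE = 1000 * TERABYTE   # bytes

# "One thousand million million", spelled out factor by factor.
thousand_million_million = 1_000 * 1_000_000 * 1_000_000

print(PETABYTE == thousand_million_million == 10**15)  # True
print(f"{PETABYTE:,} bytes")  # 1,000,000,000,000,000 bytes
```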

The Apache Incubator

The Apache Incubator serves projects seeking to enter the almighty Apache Software Foundation. Projects (called "podlings") are "ingested" and become subject to Apache-style governance and operation.

The name Apache was taken from the Apache people, a Native American nation known for their warrior spirit and inexhaustible endurance. It was first used in the context of the cross-platform Apache Web Server (launched in 1995). Despite being cross-platform, most instances run on Linux distributions.

Wednesday, 22 April 2026

Qwen Series of Models

The Qwen series of models comes from Alibaba Cloud.  The Qwen 3.5 models, released in early 2026, have set new records for sub-2B-parameter models. They are much smaller than gpt-oss.

Compile to WASM - The Emscripten Toolchain

Emscripten is an open-source compiler toolchain that targets WebAssembly (Wasm). C/C++ (or any other language that uses LLVM) can be compiled to Wasm and run on the Web, in Node.js or in other Wasm runtimes.

WebAssembly Not Automatically Blocked by Browsers

WebAssembly (WASM) is a low-level binary code format designed to run in modern web browsers.  It runs alongside JavaScript through the WebAssembly JavaScript APIs, creating an option for performance-critical functionality.

Because WebAssembly increases the browser's attack surface, browsers contain WASM inside the browser's sandbox and restrict its system access.

One risk may be code breaking out of the sandbox. Adobe Flash was sandboxed after a series of exploits, and exploits still occurred after sandboxing.

WASM can be served without TLS, HSTS or any other transport security mechanism. A module delivered over plain HTTP is therefore susceptible to man-in-the-middle attacks.

Integrity checking is also not enforced, as WASM modules need not be signed by the author. Any check has to be added by whoever loads the module.
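
Since a module carries no mandatory signature, any integrity check has to come from outside it. The sketch below shows one common approach, comparing a SHA-256 digest of the received bytes against a digest obtained separately. It assumes the author publishes such a digest over a channel you already trust, which is not something WASM itself provides. The function names are made up for the example, and the code is the kind of check a build script or a non-browser host might run before loading a module.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_published_digest(module_bytes: bytes, published_hex: str) -> bool:
    """Compare the digest of the received bytes with the expected digest."""
    actual = sha256_hex(module_bytes)
    # compare_digest compares the two strings in constant time.
    return hmac.compare_digest(actual, published_hex.lower())


# The smallest valid module: the magic bytes "\0asm" followed by version 1.
EMPTY_MODULE = b"\x00asm\x01\x00\x00\x00"

# Stand-in for a digest obtained out of band (assumption for this sketch).
published = sha256_hex(EMPTY_MODULE)

# Simulate tampering in transit by appending one byte.
tampered = EMPTY_MODULE + b"\x00"

print(matches_published_digest(EMPTY_MODULE, published))  # True
print(matches_published_digest(tampered, published))      # False
```

A digest check like this only detects modification; it says nothing about whether the original module was safe to run.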

Some security-focused browser configurations can block WASM.