Thursday, 23 April 2026

Apache Spark

Apache Spark is a foundational layer underlying many data platforms. 

One of its "selling points" is "Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling" (see detailed post on downsampling).

The Apache Incubator

The Apache Incubator services projects seeking to enter the almighty Apache Software Foundation. Projects (called "podlings") are "ingested" and become subject to Apache-style governance and operation.

The name Apache was taken from the Apache people, a Native American nation known for their warrior spirit and inexhaustible endurance, and was first used for the cross-platform Apache Web Server (launched in 1995; despite being cross-platform, most instances run on Linux distributions).

Wednesday, 22 April 2026

Qwen Series of Models

The Qwen series of models comes from Alibaba Cloud. The Qwen 3.5 models, released in early 2026, have set new records for sub-2B-parameter models. At that size they are much smaller than gpt-oss.

Compile to WASM - The Emscripten Toolchain

Emscripten is an open-source compiler toolchain that targets WebAssembly (Wasm). C/C++ (or any other language that compiles through LLVM) can be compiled to Wasm and run on the Web, in Node.js or in other Wasm runtimes.

WebAssembly Not Automatically Blocked by Browsers

WebAssembly is a binary code format designed to run in modern web browsers. It runs alongside JavaScript via the WebAssembly JavaScript APIs - creating an option for performance-critical functionality.

Because WebAssembly increases the browser's attack surface, browsers contain WASM inside the browser's sandbox and restrict its system access.

One risk may be code breaking out of the sandbox. Adobe Flash was sandboxed after a series of exploits, and exploits still occurred after sandboxing.

Transmission of WASM does not require TLS, HSTS or any other transport security mechanism, which makes it susceptible to man-in-the-middle attacks when a module is served over plain HTTP.

Integrity checking is also not built in, as WASM modules need not be signed by the author.
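
Without author signatures, one workaround is for the consumer to pin a digest obtained out of band and compare it before loading a module. The snippet below is a minimal sketch of that idea, not a standard WASM mechanism: the function names sri_digest and verify_module are hypothetical, the "sha256-<base64>" string follows the Subresource Integrity style, and the 8-byte module is just the Wasm header (magic bytes plus version).

```python
import base64
import hashlib
import hmac


def sri_digest(data: bytes) -> str:
    """Return an SRI-style digest string ("sha256-<base64>") for the bytes."""
    digest = hashlib.sha256(data).digest()
    return "sha256-" + base64.b64encode(digest).decode("ascii")


def verify_module(data: bytes, expected: str) -> bool:
    """Compare the digest of `data` with a pinned value in constant time."""
    return hmac.compare_digest(sri_digest(data), expected)


# A Wasm binary starts with the magic bytes "\0asm" followed by version 1.
module = b"\x00asm\x01\x00\x00\x00"
pinned = sri_digest(module)  # assumed to be obtained out of band in practice

print(verify_module(module, pinned))            # True
print(verify_module(module + b"\x00", pinned))  # False
```

This only helps if the pinned digest itself arrives over a trusted channel.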

Some security-focused browser configurations can block WASM.

An Insider Look at CPython: The "Compiler-Interpreter"

A run-of-the-mill Python programmer may not necessarily think about CPython on a day-to-day basis. 

But CPython is an interesting thing to think about.

It is the reference implementation of Python, written in C and Python. C was chosen, in theory, to make portability easier - it is also efficient (and there is no C++ or STL in there).

CPython is both a compiler and an interpreter. Python code is compiled (into bytecode) before being interpreted.  So you can think of it as a "compiler-interpreter".
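
The two stages can be seen from within Python itself. The sketch below uses the built-in compile and exec functions and the standard dis module; the source string and the "<example>" filename are made up for illustration, and the exact instructions printed by dis vary between CPython versions.

```python
import dis

source = "total = 2 + 3 * 4"

# Stage 1: compile the source text into a code object that holds bytecode.
code = compile(source, "<example>", "exec")

# Inspect the bytecode (instruction names differ between CPython versions).
dis.dis(code)

# Stage 2: hand the code object to the interpreter for execution.
namespace = {}
exec(code, namespace)
print(namespace["total"])  # 14
```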

One (potentially) painful feature of CPython is the Global Interpreter Lock (GIL). In the default build, each interpreter process has its own GIL, which means that effectively only one thread can run at any one time (more precisely, only one thread can execute Python bytecode at any one time). While this simplifies the implementation, it becomes a bottleneck for CPU-intensive tasks.

Parallelism can be achieved by running multiple Python processes (and so, by extension, multiple interpreters) that use inter-process communication. The Python multiprocessing module aims to make this paradigm simpler to implement. It is, however, not available on mobile platforms or WebAssembly platforms.
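
As a rough illustration of the multi-process approach, the sketch below spreads a CPU-bound task over worker processes with multiprocessing.Pool. The prime-counting task and the names count_primes and count_primes_parallel are hypothetical, chosen only to give each worker something to compute.

```python
from multiprocessing import Pool


def count_primes(limit: int) -> int:
    """CPU-bound work: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, workers=2):
    """Run count_primes for each limit in a separate worker process."""
    with Pool(processes=workers) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    # Each worker is its own interpreter process with its own GIL.
    print(count_primes_parallel([10_000, 20_000, 30_000]))
```

The __main__ guard matters here: on platforms that start workers by re-importing the main module, unguarded code would run again in every child.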

Thursday, 16 April 2026

UTM is Urchin Tracking Module

UTM is something you may come across first in URLs. 

UTM refers to Urchin Tracking Module, named after Urchin, the firm Google acquired in 2005 to form the basis for Google Analytics.

  • utm_source is a tracking parameter in a URL that denotes where traffic is coming from
  • utm_source=google indicates traffic came from Google
  • utm_source=email indicates traffic came from an email

Google Analytics alternatives such as Plausible have been built to follow the UTM convention.
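
Reading these parameters needs nothing more than a query-string parser. Here is a minimal sketch using Python's urllib.parse; the helper name utm_params and the example.com URL are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def utm_params(url: str) -> dict[str, str]:
    """Return the utm_* query parameters of a URL (first value of each)."""
    query = urlparse(url).query
    return {
        key: values[0]
        for key, values in parse_qs(query).items()
        if key.startswith("utm_")
    }


print(utm_params("https://example.com/post?id=7&utm_source=google"))
# {'utm_source': 'google'}
```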