Monday, 1 December 2025

IaC Zoology

Infrastructure-as-Code has had its fair share of tech standards. Let's look at them.
  • Terraform - multi-cloud, open-source Infrastructure-as-Code tool that works across AWS, Azure, GCP and more
  • ARM templates - Azure's native JSON-based IaC format, verbose but powerful. Many companies still have ARM templates in their arsenal, even though their time has passed
  • Bicep - domain-specific language (DSL) for Azure (where said domain is IaC, or more broadly "declarative deployment of Azure resources") that simplifies ARM templates with cleaner syntax. A good one if you are not hybrid-clouding


Saturday, 29 November 2025

Boosting versus Bagging

Boosting and bagging are two classes of machine learning techniques.
Boosting trains small models in sequence, each one incrementally improving on the errors of the previous models. Bagging (bootstrap aggregating) uses ensemble/averaging techniques, i.e. training models in parallel on resampled data and computing some form of average.
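
To make the contrast concrete, here is a minimal pure-Python sketch. The one-split "stump" model, the toy data and all function names are assumptions made up for illustration, and the boosting shown is the residual-fitting (gradient boosting style) flavour rather than AdaBoost.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # splitting at the largest x leaves nothing on the right
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x identical: fall back to a constant model
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def bagging(xs, ys, n_models=25, seed=0):
    """Train stumps independently on bootstrap samples, then average them."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(m, x) for m in models) / len(models)

def boosting(xs, ys, n_rounds=25, lr=0.5):
    """Train stumps in sequence, each one fitted to the remaining residuals."""
    models = []
    residuals = list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - lr * predict_stump(stump, x) for x, r in zip(xs, residuals)]
    return lambda x: sum(lr * predict_stump(m, x) for m in models)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]
bagged = bagging(xs, ys)
boosted = boosting(xs, ys)
print(round(mse(bagged, xs, ys), 3), round(mse(boosted, xs, ys), 3))
```

The bagged stumps never see each other, so they could be trained in parallel. The boosted stumps cannot, because each round depends on the residuals left by the previous one.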

Friday, 28 November 2025

ufunc in numpy - understanding universal functions

It operates on ndarrays. Here's the lowdown.

ufuncs operate on ndarrays in an element-by-element fashion. Several features are supported, such as array broadcasting and type casting.

Broadcasting is when a smaller array is spread over a bigger one. For example, a single number can be "broadcast" to every element of an array. Analogously, a 1D array can be "broadcast" across every row of a 2D array, provided its length matches the number of columns.
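
A short sketch of both cases, assuming NumPy is installed; the array values are arbitrary and chosen only for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# np.add and np.multiply are ufuncs: they are applied element by element.
print(isinstance(np.add, np.ufunc))

# A single number is broadcast to every element of the array.
scaled = np.multiply(a, 10)

# A 1D array of length 3 is broadcast across each row of a 2x3 array.
grid = np.array([[0, 0, 0],
                 [10, 10, 10]])
shifted = np.add(grid, a)

# Type casting: integer and float64 operands give a float64 result.
print(scaled)
print(shifted)
print(shifted.dtype)
```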

The idea of "broadcasting" predates NumPy and was first implemented in an array-focused scientific programming language called Yorick from the 1990s.

statsmodels in Python

statsmodels is a Python package that complements scipy for statistical computation. 

The stable version is found here.

statsmodels takes ideas from other libraries and ecosystems: specifically, it uses R-style formulas and pandas DataFrames.

Chances are you are using the library with other libraries too, like numpy.

It can be installed via Conda or pip. Examples:

conda install -c conda-forge statsmodels
python -m pip install statsmodels

Among the tricks statsmodels can perform are: time series analysis, various flavours of regression (OLS, generalized and weighted least squares), as well as PCA.

Thursday, 27 November 2025

Validating DataFrames in pandas

You may ingest a time series into a DataFrame in pandas. 

You may then need to access part of that DataFrame using something like dataframe.iloc[0] where the aforementioned command gives the row at position 0. 

However, what if the DataFrame has no rows? In that case dataframe.iloc[0] raises an IndexError. There is a property, dataframe.empty, that you can check first:

if df.empty:
    print("DataFrame is empty")
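
A minimal sketch of guarding the positional lookup, assuming pandas is installed. The helper name first_row, the column name and the two sample prices are hypothetical.

```python
import pandas as pd

def first_row(df):
    """Return the first row as a Series, or None when the frame is empty."""
    if df.empty:
        return None
    return df.iloc[0]

empty = pd.DataFrame({"price": []})
prices = pd.DataFrame(
    {"price": [101.5, 102.0]},
    index=pd.to_datetime(["2025-11-26", "2025-11-27"]),
)

print(first_row(empty))
print(first_row(prices))
```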

DataFrames are like spreadsheets or SQL tables. They are the most commonly used data structures in pandas, and like a spreadsheet, columns don't need to be of the same type.

In Advance of Node.js Learning

Learn the rudiments of JavaScript first, including asynchronous JavaScript.

Connoisseur's Guide to JavaScript Engines: V8 Rules

Node.js uses the V8 JavaScript engine which powers Google Chrome (and is open sourced by Google).

Other browsers use their own engine, for example Firefox uses SpiderMonkey and Safari uses JavaScriptCore (aka Nitro). Edge was based on Chakra (a Microsoft project that was open-sourced) before being rebuilt with V8 and Chromium.

What JavaScript is Not Allowed to Do in the Browser

JavaScript in the browser is not normally allowed to do too much for security reasons.  

Stuff it cannot do includes anything OS related - specifically:

  • Cannot read/write arbitrary files
  • Cannot access hardware directly
  • Cannot control processes

JavaScript can do certain things through Web APIs:
  • DOM manipulation (HTML/CSS)
  • Local/session storage
  • IndexedDB (database built into the browser)
  • Cookies (with same origin)
  • Clipboard (with user permission)
  • Geolocation, camera, microphone (with user consent)

Node.js relationship with Electron

Node.js is needed to "scaffold" an Electron project - the Node package manager (npm) is used to download Electron packages, install dependencies and generate starter files.

Electron itself comes bundled with its own Node.js runtime. 

This serves as the "back end" of your Electron app. 

It manages windows, menus, system events and native OS integration. You can use Node modules directly in this environment e.g. fs, path and http.

The "renderer" process is Chromium. 

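In early versions of Electron the renderer could also access Node.js APIs by default. Since Electron 5 this (the nodeIntegration setting) is off by default for security reasons and has to be enabled explicitly. With it enabled you effectively have a "contained" web app with OS access.

The main process (Node.js runtime) communicates with the renderer via IPC.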

This combined architecture of Node.js and Chromium enables applications to be written that run on Windows, macOS and Linux without experience of native UI development.

The Same Origin Policy (SOP) on Modern Web Browsers

The Same Origin Policy (SOP) is a browser-enforced security rule that prevents scripts from one "origin" (PDP -> protocol + domain + port) from accessing resources from another origin.
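
The "protocol + domain + port" comparison can be sketched with the standard library. This is an illustration of the idea only, not how a browser implements it; the helper names and the example.com URLs are assumptions.

```python
from urllib.parse import urlsplit

# Ports implied by the protocol when the URL does not spell one out.
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (protocol, domain, port) triple for a URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def same_origin(a, b):
    """Two URLs share an origin only if all three components match."""
    return origin(a) == origin(b)

print(same_origin("https://example.com/a", "https://example.com:443/b"))
print(same_origin("https://example.com", "http://example.com"))
print(same_origin("https://example.com", "https://api.example.com"))
```

Note that the path plays no part in the comparison, while a different subdomain or a non-default port is enough to make the origins differ.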

The SOP prevents cookies, DOM and local storage from being read by malicious cross-site scripts.

The SOP does not just apply to web browsers. For example, Electron apps (desktop apps built with web tech) enforce SOP because they embed Chromium.

The Same Origin Policy is an "isolation model" designed to ensure "secure workflow".

Basics of Selenium

Selenium started out in 2004 at ThoughtWorks as a way to automate UI testing for a timesheet application in Python and Plone. Discussions were held on open sourcing this (internal) tool and Selenium was born.

Wednesday, 26 November 2025

Microsoft Edge has an optional WebDriver for Automation

Find out more here

git remote add

The git remote add command is used to link your local Git repository to a remote repository (e.g. on GitHub, GitLab or BitBucket).

Syntax:
git remote add <name> <url>

name is the name (alias) you are giving to that remote. By convention, the main remote is called origin. Hence you will see commonly:

git remote add origin https://github.com/username/my-project.git

Note that my-project should be pre-created from within GitHub using New Repository.

Tuesday, 25 November 2025

pandas and DataFrames

pandas provides data structures and data analysis tools for Python.

The basic data structures in pandas are:

  • Series: one-dimensional (labelled) array holding data of any type e.g. integers, strings
  • DataFrame: a two-dimensional data structure, holding a 2d-array or table with rows and columns
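
A small sketch of the two structures, assuming pandas is installed. The labels and values are placeholders with no real-world meaning.

```python
import pandas as pd

# A Series: one-dimensional data with a label for each element.
readings = pd.Series([12.5, 14.0, 9.5], index=["mon", "tue", "wed"])

# A DataFrame: two-dimensional; each column is itself a Series,
# and different columns may hold different types.
df = pd.DataFrame({"name": ["a", "b"], "count": [1, 2]})

print(readings["tue"])        # label-based access into the Series
print(df["count"].sum())      # a column pulled out as a Series
print(df.shape)               # (rows, columns)
```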

Monday, 24 November 2025

Microsoft Launch Fara-7B: A CUA (Computer Use Agent) in SLM Form

And here we have it. Ready for action on Hugging Face, Sir.

Sandboxing and monitoring are recommended. The agent itself is a wrapper around Playwright.

Sunday, 23 November 2025

Hacking Transformers with Hugging Face

Knowledge here. But in short:

pip install huggingface_hub
pip install --upgrade huggingface_hub

To test the install, you can try the below:

 python -c "import huggingface_hub; print(huggingface_hub.__version__)"
 python3 -c "import huggingface_hub; print(huggingface_hub.__version__)"

pip install transformers tensorflow
pip install transformers tensorflow datasets

(If you are using PyTorch rather than TensorFlow, replace tensorflow with torch.)

To test, bring up the Python REPL and run:

from transformers import AutoTokenizer

The Runtime Formerly Known as TensorFlow Lite

LiteRT is Google's on-device runtime for machine learning, formerly known as TensorFlow Lite.

You can convert TensorFlow, PyTorch and JAX models to the TFLite format. 

This can be done using AI Edge conversion tools.

LiteRT rises to various ODML (On-Device Machine Learning) challenges:

1. Connectivity - ability to execute without an Internet connection
2. Size - reduced model and binary size
3. Privacy/data restrictions - no personal data leaves the device
4. Power consumption - efficient inference and a lack of network connections

Operationally, LiteRT models use an efficient portable format known as FlatBuffers, and the .tflite file extension. (See here for the difference between FlatBuffers and protobuf).


Mastering BERT and DistilBERT

This is BERT.  

Introduced in 2018, it stands for "Bidirectional Encoder Representations from Transformers". 

The "bidirectional" component means it uses context to both the left and the right of each word.

Its GLUE score is 80%.

This is DistilBERT, introduced in the context of edge computing.

In the paper, the authors also point out "We have made the trained weights available along with the training code in the Transformers library from HuggingFace".
 
It is worthwhile to study BERT as DistilBERT has the same general architecture as BERT.

IBM's Guide to Small Language Models

IBM have made a guide to SLMs.

Example SLMs listed:

  • DistilBERT (Google's groundbreaking BERT model in "distilled" form, hence the name "Distilled BERT"; it retains 97% of BERT's NLU abilities)
  • Gemma
  • GPT-4o mini
  • Granite
  • Llama
  • Ministral
  • Phi

sbs_diasymreader.dll

This DLL is part of the .NET Framework.
  • sbs - side by side (Recall - allows multiple versions of a DLL to sit side-by-side without conflicts)
  • dia - Debug Interface Access, reference to SDK used to read debugging symbols (PDB files)
In Windows 11, it sits in C:\WINDOWS\Microsoft.NET\Framework along with other sbs_xxx DLLs.

Friday, 21 November 2025

Programming Realities - The Awkward Error Message

The awkward error message can stop you in your tracks. Keep pushing on. Discover. Remediate error. Every error is a massively valuable learning opportunity.

Thursday, 20 November 2025

Activation Functions in Neural Networks

The Concept of C# Scripting (.csx files)

The concept of C# scripting bears a resemblance to Jython in Java (more so than JavaScript to Java).

Scripting directives (such as #r below) are not allowed in regular C# programs.

Here's an example:

#r "nuget: Microsoft.SuperAdvancedKernel, 1.23.0"

It can be used in environments like .NET Interactive and Jupyter Notebooks with .NET. It came into being in Visual Studio 2015.

The file needs to be a .csx file.

#r is the reference directive, to reference an assembly or package.

Compiling C# in VS Code

For this you need the C# Dev Kit extension. It's Roslyn-powered. 

The code name Roslyn was first written publicly by engineer Eric Lippert (the code was originally hosted on Microsoft's CodePlex before being moved to GitHub).

Dev Containers Extension in VS Code

What it is, What it does

The Dev Containers ("DC") extension is needed by Semantic Kernel in VS Code. It's worth expanding on its purpose here.

DC allows you to use a Docker container as a full-featured development environment (this is independent of how you deploy the thing).

More details here.

Dev Containers Dependencies

It requires Docker Desktop to be installed, which interacts with WSL2. If you don't have it, don't worry: VS Code will prompt you to install it automatically. After installation, you will see a status bar labelled "Setting up Dev Containers" followed by "Connecting to Dev Container (show log)".

Dev Container Configuration

This is located in semantic-kernel\.devcontainer\devcontainer.json.  This is similar to launch.json for debugging configurations. More info here.

Git on Windows

Git on Windows is good.

However there are a few options to select before you get this going.

  • add Git to the PATH (so it can be used from cmd.exe, PowerShell etc.)
  • use bundled OpenSSH (uses the ssh.exe that comes with Git) - the alternative is to use an ssh.exe you install yourself and add to your PATH
  • which SSL/TLS library should Git use for HTTPS connections? The OpenSSL library or the native Windows Secure Channel (choose the latter). Here server certificates will be validated against the Windows Certificate Store. This also allows you to use your company's internal Root CA certificates distributed e.g. via Active Directory Domain Services
  • Git Bash to use MinTTY for terminal emulation (better scrollback than cmd.exe)
  • Use Git Credential Manager or use none

Do You Understand Fully How This Works

 I think you need to do. Attribution: One software engineer to another.

Mastering git clone

You will want to use git clone to clone a repository and see its contents. This is similar to an svn checkout, or svn co.

So what does git clone actually do?

git clone clones a repository into a newly created directory, plus a lot more - read and summarise this later. Pay close attention to the notion of "remote-tracking branches".

There is a -l or --local option for git clone, for when the repository being cloned is on the local machine.

The Unstable Book for Rust

Rust has The Unstable Book to cover unstable features of Rust. When using an unstable feature of Rust, you need to use a special flag or rather a special attribute:  #![feature(...)].

Unstable features in Rust refer to specific capabilities (language or library) as yet unstabilized for general use.  You can access these on the nightly compiler (not in stable or beta channels).

Unstable features may be experimental, incomplete or subject to change.

They are in the language as a means of balancing innovation with stability. Developers get access to new features but basically on a trial basis. While a feature is classed as unstable, the Rust team can refine the design, fix edge cases or abandon the feature if it proves problematic.

cargo new hello_cargo

For details on how to use cargo new, type cargo new --help at the command line.

cargo new hello_cargo

creates a new directory for your project, containing a .git subdirectory, a src directory for your source code, and a Cargo.toml file.

TOML (Tom's Obvious Minimal Language) is Cargo's configuration format. (TOML brands itself as "A Config File Format for Humans" and is nice, simple and neat).

It has a [package] section which configures package settings, and a [dependencies] section for any of the crates required for the package to run (crates are Rust packages).

cargo init --help

for help integrating any Rust code developed outside of Cargo.

Elliptic Curves Assert Presence on Linux

WinJoes may be surprised to see refs to elliptic curves appearing in their Linux VMs in their D2D Ops.

These appearances often reference EdDSA (e.g. in messages like "using EDDSA key 0327BE68.....").

EdDSA refers to the Edwards-curve Digital Signature Algorithm (EdDSA) in public key cryptography.

Edwards curves are an artefact of algebraic geometry and named after the American mathematician Harold Edwards. The specific form of Edwards curve used in the algorithm is known as a twisted Edwards curve, where the "twist" comes from an extra coefficient a (from some field F, and not necessarily equal to 1) in the curve equation ax^2 + y^2 = 1 + dx^2y^2. The twisted Edwards curve equation is an interesting equation and the first question when you see it is how did someone come up with it.

source command in Linux shell

The source command in the Linux shell sources (loads) a script, running it in the current shell. Its short form is a single dot (.).

emacsx - An Emacs Launcher for .bashrc

Here is a cool emacs launcher for your .bashrc file if you like reverse video.

# Launch Emacs in full screen, optionally with a file

emacsx() {
  if [ "$#" -eq 0 ]; then
    command emacs -rv --fullscreen
  else
    command emacs -rv --fullscreen "$@"
  fi
}

$# in bash is a special parameter that represents the number of positional parameters passed to a script or function.

Cargo for Rustaceans

A simple Rust program with no dependencies can be compiled with good old rustc. But for Rustaceans who want more - more dependencies, more complexity - cargo is the tool to use. Read hello cargo here.

Wednesday, 19 November 2025

The DevOps HUD - GoLand

DevOps brothers and sisters. Discover GoLand.

What is the "tensor" in TensorFlow?

In TensorFlow, a tensor is a multidimensional array used to represent data in a machine learning model. It generalizes scalars (0D-array), vectors (1D-array) and matrices (2D-array) to higher dimensions.
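
The progression from scalar to higher dimensions can be sketched with NumPy arrays standing in for tensors. This is an assumption made to keep the example light; it does not use TensorFlow itself, and the shapes are arbitrary.

```python
import numpy as np

scalar = np.array(7.0)                # 0D: a single number
vector = np.array([1.0, 2.0, 3.0])    # 1D
matrix = np.ones((2, 3))              # 2D
batch = np.zeros((32, 28, 28))        # 3D: e.g. 32 grayscale images of 28x28

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    # ndim is the number of dimensions; shape is the size along each one.
    print(name, t.ndim, t.shape)
```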

But don't be fooled. A true mathematical tensor has much more going on behind the scenes: it is a multilinear map with specified transformation rules.

A TensorFlow tensor is a looser construct than that.

Describing Einstein's general relativity mathematically involves the use of tensors of the math variety.

Deeper Look at Sequential Model Building in TensorFlow

Let's revisit the model building command in TensorFlow in our "hello world" equivalent example.

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

So Sequential lets you build up a model in layers (see Layer class, or tf.keras.Layer, that inherits from Operation). 

Layers are callable objects. In Python, a callable is any object that can be called using parentheses (optionally with arguments). Read the implementation here (in keras/src/layers/layer.py).
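
A minimal sketch of a callable object in plain Python. The Scale class is a made-up toy and is not how Keras implements Layer; it only shows the "configure once, then call like a function" pattern.

```python
class Scale:
    """A toy 'layer': configured once, then called like a function."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, values):
        # Defining __call__ is what makes instances callable.
        return [self.factor * v for v in values]

double = Scale(2)
print(callable(double))
print(double([1, 2, 3]))
```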

But what does the Flatten layer do? It reshapes each 28x28 input into a flat vector of 784 values, so that the Dense layers that follow can consume it.

Dense creates a densely-connected (fully connected) NN layer; this is not a convolutional architecture.
  • The first argument is the positive integer units, representing the dimensionality of the output space
  • The second argument is the activation function to use (if this is missing, no activation is applied, which amounts to the linear activation a(x) = x)

Activation functions in a neural net introduce non-linearity into the network.
  • Essentially these functions work with neurons and transform neural computations into output signals
  • ReLU (rectified linear unit) is one of the most widely used activation functions in neural networks (f(x) = max(0,x)).
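
The effect of the activation on a single neuron can be sketched in plain Python. The function names, weights and inputs are made up for illustration, and this is not the Keras implementation.

```python
def linear(x):
    """The 'no activation' case: a(x) = x."""
    return x

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)

def dense_unit(inputs, weights, bias, activation=linear):
    """One neuron of a dense layer: weighted sum plus bias, then activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

inputs, weights = [1.0, 2.0], [0.5, -1.0]
print(dense_unit(inputs, weights, 0.0))                   # linear output
print(dense_unit(inputs, weights, 0.0, activation=relu))  # negative sum is clipped
```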

Train your First Neural Network on the MNIST dataset

You can train your first neural network on the MNIST dataset (used for image recognition models). The MNIST example is tantamount to being a "hello world" of machine learning programs.

Key features:
  • Use Keras API as a "portal" into TensorFlow library to build the neural network
  • Use "Sequential" model - allows you to add "layers" sequentially
  • "Feed" the model the training data (creating the model takes a bit of study/effort)
  • Model gives back a vector of "logits" (or "log-odds") scores, one per class
  • Run softmax to convert these scores to probabilities
  • Compile the model - with an optimizer and a loss function, and track the 'accuracy' metric
  • model.fit to train the model on the training data
  • model.evaluate to see how the model performed (was it a good fit to the data)

Doing this example immediately raises a billion questions! Answering these questions will help you in future machine learning projects with TensorFlow. So get your answers now!
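
One of those questions is what softmax actually does to the logits. A minimal standard-library sketch follows; the three logit values are hypothetical, and subtracting the maximum first is a common numerical-stability trick rather than something stated above.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # hypothetical scores for a 3-class problem
probs = softmax(logits)
predicted = probs.index(max(probs))   # index of the most likely class

print(probs)
print(predicted)
```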

Some numbers to remember in this "post game analysis" are 0 to 255 and 28x28.

All About the Data - the MNIST Dataset & (NumPy-friendly) Data Format

The MNIST dataset consists of 60,000 training images of handwritten digits and 10,000 test images, each 28x28 pixels in size. Images are grayscale and numbers are 0-9. The data set is vectorized and in numpy format. Each pixel has an encoding of 0 to 255 (typical for grayscale images) where the number represents brightness, 0 is black and 255 is white.

The Data Set Loading Process (Involves Normalization)

So MNIST is one of the built-in datasets in Keras. 

The first step is to normalize the data by dividing each pixel value (in the training and testing data set) by PIXEL_MAX=255, which creates a value between 0 and 1 (inclusive) and converts an integer value into a floating-point value.
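
The normalization step can be sketched on a tiny made-up "image" using plain lists. With the real dataset the same division would be applied to whole NumPy arrays; the helper name and pixel values here are assumptions.

```python
PIXEL_MAX = 255

def normalize(image):
    """Scale a 2D list of 0-255 integer pixels to floats in [0.0, 1.0]."""
    return [[pixel / PIXEL_MAX for pixel in row] for row in image]

# A made-up 2x3 grayscale patch: 0 is black, 255 is white.
tiny = [[0, 128, 255],
        [255, 64, 0]]
scaled = normalize(tiny)
print(scaled)
```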

Model.fit - In depth

How does this work from a function-calling perspective?

How do I see how good this model is visually?

This requires some additional programming.

I had to bring pip into my Ubuntu Linux VM

sudo apt install python3-pip

Even with Python installed a whole host of packages are needed for pip. These include (non-exhaustively):

  • build-essential
  • bzip2
  • cpp
  • cpp-11
  • fakeroot
  • ...
  • zlib1g-dev

Just to mention a few.

build-essential is a Debian-specific package, consisting of a bunch of useful build tools, to build software from source and create Debian packages.

Tuesday, 18 November 2025

Rustup

what rustup is and what it does for rust

Rust is installed and managed with the rustup tool. There are some concepts underlying how the tool works, including its role as a toolchain multiplexer.

rustup's role as a toolchain multiplexer

A toolchain represents a full installation of a Rust compiler (rustc) and related tools (like cargo).  Rustup enables switching between multiple versions of these tools which can be selected dynamically.

analogies in wider programming pantheon

rustup is similar to pyenv.

Microsoft's Big Bet on Rust

Microsoft is writing more code in Rust (a memory safe "cousin" of C++).  It is also investing time in developing the language ecosystem in its role as a founding member of the Rust Foundation.

This is due to 1) performance 2) memory safety and 3) developer mindshare.

Mark Russinovich (CTO of Microsoft Azure) has called for a halt to starting new projects in C and C++, and for using Rust "for those scenarios where a non-GC language is required".

Linus Torvalds also confirmed that version 6.1 of the Linux kernel would include initial support for Rust.

Rust was created by Graydon Hoare in 2006 while working at Mozilla as a personal side project to solve memory safety and concurrency issues in systems languages like C and C++.

Other companies that have embraced Rust include Cloudflare who coded up Pingora in Rust to overcome limitations in nginx.

Will two git commits ever have the same id?

This is highly unlikely due to the design of the commit algorithm. The chance that two given commits share an id is about 1 in 1.46 x 10^48. Digging into this, the commit id is a 40-character hexadecimal number produced by a cryptographic hash function (SHA-1 by default; newer versions of Git can optionally use SHA-256, which gives 64-character ids), so there are 2^(4*40) = 2^160 possible SHA-1 hashes.

Metadata that's fed into the cryptographic hash function includes a reference to the snapshot of the project tree (folders, files and contents), the parent commit(s), the commit message, author and committer information and timestamps (time elapsed in seconds since Jan 1, 1970).
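
The hashing scheme can be tried out with hashlib. The sketch below computes the id Git gives to a blob (file contents) object, which uses the same "type, size, NUL byte, body" recipe as commit objects; the helper name git_object_id is an assumption, and only SHA-1 repositories are covered.

```python
import hashlib

def git_object_id(kind, body):
    """SHA-1 over '<kind> <size>' + NUL byte + body, as Git does for its objects."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

blob_id = git_object_id("blob", b"hello world\n")
print(blob_id)
print(len(blob_id))            # 40 hexadecimal characters

# 40 hex digits at 4 bits each gives 2^160 possible ids.
print(f"{2 ** 160:.3e}")
```

You can cross-check the blob id on your own machine by piping the same text into git hash-object --stdin.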

TCPL Still Rules the Linux Roost

The programming language C (born in the 1970s, created by Dennis Ritchie as a successor to Ken Thompson's B) still rules the roost as far as the Linux kernel is concerned: of 37 million lines of code, just over 35m LOCs are written in C as per analysis from OpenHub.

Linux is an open source, Unix-like system. Unix itself was written in C, and prior to that in assembler. The introduction of the programming language C made the code portable to different hardware.

Ken Thompson and Dennis Ritchie and their colleagues at Bell Labs (AT&T) were the co-creators of the Unix Operating System, created in 1969.   Douglas McIlroy introduced the Unix philosophy of small, composable tools. Brian Kernighan helped to popularise UNIX and C through co-authoring The C Programming Language with Dennis Ritchie.

git add and git commit

The Mechanics of Commits 

You can add a new file to your repo doing:

git add <mynewfile>
git status (this will show any changes staged for commit)
git commit -m "a little comment, if you please" <mynewfile>

Once you commit, you should get a message like:

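[master (root-commit) strange_hexadecimal_code] comment from -m command line flag
 1 file changed, X insertions(+)
 create mode 100644 filename.xx

If you didn't do a git add for a new file, nothing is committed: Git reports something like "nothing added to commit but untracked files present" (or "nothing to commit, working tree clean" if there are no changes at all).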
The 100644 is Unix style permissions - 100 means regular file, 644 means read/write for owner and read-only for group and others.
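
The two halves of that number can be pulled apart with Python's stat module. This is a small sketch that treats the value printed by Git as an ordinary Unix mode.

```python
import stat

mode = 0o100644   # the octal number shown after "create mode"

print(stat.S_ISREG(mode))        # True when the file-type bits say "regular file"
print(oct(stat.S_IMODE(mode)))   # the permission bits on their own
print(stat.filemode(mode))       # the same mode in ls -l style
```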

Committing to Git is a Two-Step Process

Stating the obvious here, but as stated above, you need to git add before you git commit for new files.

The Semantics of Commits

When we commit, Git invokes an algorithm to create a "commit object", a binary object which it stores inside the .git folder. A unique identifier is used to "stamp" the commit. Part of the "stamp" is shown in the git commit response message - and is computed from a bunch of metadata including the committer's name, time of commit, commit message and information on the change.

The Index and the Object Database

When you do a git add, a copy is made of the file and put into the index. The index is like a "waiting room" where we put in our "candidate" objects until we are ready to commit them.  git commit takes the objects in the waiting room and puts them into the object database.

Allowing wsl.localhost in list of allowed hosts

When navigating to a directory in WSL from VS Code you may get the message:

The host 'wsl.localhost' was not found in the list of allowed hosts. Do you want to allow it anyway?

You will get the opportunity to hit Allow (because you trust the host, it is after all, your WSL installation) together with the option to flag: "Permanently allow host 'wsl.localhost'".

Accessing Linux Directories in WSL From Explorer

Directories in WSL can be accessed by navigating to your distribution and filesystem after opening \\wsl$ in File Explorer.

Telling git who you are with git config

To get your commits attributed correctly you need to let git know who you are. This means supplying the name and email address you want to use for git. Here is the way:

git config --global user.name "Your Name"
git config --global user.email "youremail@provider.com"

Then you can type:

git config -l

to see if git remembered your changes. The user settings will remain even if you then eliminate the project (project specific settings will disappear).

Creating a git repository: git init

What git init does

Run  git init in the folder where your code will be based (this assumes you have already created an appropriately named folder and done cd into it).

This creates a .git subdirectory.

What does the .git subdirectory hold?

.git is filled with configuration files and a subfolder for snapshots to be stored. All commits are stored in .git as well.

Name of initial branch

This may change in successive versions of Git, but as of 2.34, the name of the initial branch created by git init is 'master' (other commonly used names include 'main' and 'trunk' and 'development').

Re-running git init

Re-running git init is harmless: it does not overwrite what is already there. It simply outputs "Reinitialized existing Git repository in /home/YOURLOGIN/YOURPROJECT/.git/".

Re-running git init in the .git subdirectory

This is weird and will create a new .git subdirectory inside the .git subdirectory, so if you do this, you need to clean up your .git. There is no valid use case for doing this.

Renaming a branch in git

The git branch command can be used to rename a branch, via its -m option: git branch -m <oldname> <newname>.

Linux VM Skills - The ls command

ls -a and ls -A will both list all files including hidden files.

ls -A (capital A) omits the . and .. entries (i.e. the current and parent directories).

A Head-First Guide to Git

A head-first guide to Git is available on O'Reilly. Head First Git also has an accompanying website.

Saturday, 15 November 2025

Canonical Snaps and Approach to Linux Packaging

Concept of Snaps and Why It's Useful

When running wsl in Windows (with Ubuntu) you will eventually come across the concept of snaps. 

Snaps are a package management feature that offers an alternative to the usual sudo apt-get (or sudo apt install, which is a wrapper over apt-get).

Snaps were developed by Canonical for security (via sandboxing, or in Canonical language "confinement") and convenience (an "all-in-one" snap removes the need to download and install individual dependencies).

The security side aims to guarantee safe execution of software by mandating packages abide by the principle of least privilege (this is diluted however by the option of classic confinement).

Concrete Example: Installing emacs editor

Trying to run emacs at the command line, you find it is not installed. You may see:

Command 'emacs' not found, but can be installed with
sudo snap install emacs # version 30.2

Security is achieved by placing the package in a sandbox, with snapd mediating all access to host system resources.
The snap's confinement level controls the degree of isolation from the user's system.
  • Strict confinement - abide by sandbox rules
  • Classic confinement - liberal / "laissez-faire" (but needs explicit user approval on install)

Searching for Pre-Created Snaps using Canonical's Search Engine

There is a search engine for Snaps on Canonical's website. Canonical are calling it the "app store" for Linux.

Friday, 14 November 2025

Wayland for WinDevs Who May Not Know It

What is Wayland?

Wayland is intended as a replacement for the X Window system with the aim to make it easier to develop and maintain.

Specifically, Wayland is the protocol used to talk to the display server to render the UI and receive input from the user.  

Wayland also refers to a system architecture (more below), which will give you an understanding of how the protocol is used to build a working UI.

Wayland versus X Architecture?  Call it a "Simplified X".

In an X setup, X clients talk to an X server and the server talks to a compositor. The comms between server and compositor are bidirectional.

The X Server also talks to the kernel. There is a critical interface called evdev (short for event device) which is the input event interface in the Linux kernel. 

In Wayland - the display server and the compositor are rolled into one. The architecture is thus simpler.

What is WSLg?

WSLg is the Windows Subsystem for Linux GUI to enable running Linux GUI applications (X11 and Wayland) on Windows.

Google Colab

Google Colaboratory ("Colab") is a hosted Jupyter notebook which includes free access to GPUs and TPUs. It is for machine learning, data science and education.

For cool datasets to explore ML with, check out Google Dataset Search.

There is an interesting Colab workbook by Ashwin Rao on the SVB crisis.

Colab supports a large number of constantly upgraded Python packages including kagglehub (to use Kaggle resources) and narwhals (dataframe library).

Tuesday, 11 November 2025

Addressing AI Misuse

OpenAI has a Preparedness Framework aimed at addressing AI misuse.

The domain of cybersecurity features prominently here, since AI can be used to enhance security, but equally make it easier to scale up cyberattacks.

Getting Jiggy with gpt-oss-20b (and why open weights matter)

gpt-oss-20b is an open weight language model. These so-called "open weights" reflect the pre-training the model has received.

The model is a significant 12GB download.

LM Studio Setup

LM Studio has two setup options:
1. For anyone who uses the computer
2. Only for the currently logged in user
Both have advantages, but if you are working with Studio to build custom LLMs tailored to you, you may want option 2 despite the (potential) convenience of option 1 where LM Studio is "universally" available to all users of the machine.

Disk Space in Windows 11

Type Storage Settings in the Search bar. This will lead you into Settings -> System -> Storage.

Monday, 10 November 2025

Python Wheels

Python wheels are pre-built binary packages for Python which make installation via pip faster and more efficient. Learn more here.

Why Python From Windows Store is Flawed

Installing Python from the Windows Store is flawed as everything goes into AppData\Local.  This is a local directory associated with the logged in user C:\Users\<Name>\AppData\Local.

This is a way to sidestep a "proper installation" in C:\Program Files which requires administrator privileges. It's a way to overcome potential UAC restrictions.

Once installed - there is no proper way to uninstall. You need to get into the AppData directory (which is hidden in File Explorer). Once opened, you can navigate to Microsoft\WindowsStore to find python and related exe files (e.g. pip.exe).  Then do a clean of the registry.

You may also find Python subdirectories in the WindowsStore directory.

While cleaning out registry entries you may come across references to .whl files (or Python Wheels).

Understand the Simple Power of Backpropagation but also the Dangers

So states Lex Fridman in his lecture on Recurrent Neural Networks (from the course on Deep Learning for Self-Driving Cars).

pip install tensorflow

This will install the current stable release for TensorFlow.

Friday, 7 November 2025

Control-Star - The Magic Key Combination in Word

 Control-Star - will make hidden characters appear and disappear in Word.

Word AI is Not Intelligent

Macros are better. Even with feedback, Word AI is not good. It cannot be classified as "intelligent".

Wednesday, 5 November 2025

Azure Kubernetes Service - Rationale for Using AKS for .NET Applications

Why use the Azure Kubernetes Service (AKS) for your .NET applications (rather than just deploy them to virtual machines)?

First, there are scalability benefits - these are ideal for microservices.

Kubernetes has "canned processes" for scaling.

Kubernetes has what's called Horizontal Pod Autoscaling (HPA): when load increases, the system deploys more pods.  It doesn't change the resources of the pods themselves, i.e. it doesn't allocate more memory or CPU to them; it just makes more pods.
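
The scaling decision can be sketched in a few lines. The formula is the one given in the Kubernetes documentation for the Horizontal Pod Autoscaler; the function name, the bounds and the example numbers are made up for illustration, and the real controller also applies tolerances and stabilisation windows that are left out here.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Hypothetical helper mirroring the documented HPA rule:
    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to the [min_replicas, max_replicas] range."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# 4 pods averaging 90% CPU against a 60% target: scale out to 6 pods
print(desired_replicas(4, 90, 60))
```

Note that only the pod count changes; the CPU and memory assigned to each pod stay as they were.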

Exciting Thing about AI

The exciting thing about AI is accelerated learning - and the field where the impact is most immediate is computer programming.   But the edge is the human learning that is accelerated - not so much the automatic code generation or prototype generation, but the ability to come to quick conclusions around technology choices, implementation choices and optimised solution building.

Technology Knowledge

Technology knowledge always needs refreshing.

Software is changing all the time.
Hardware is advancing - making new things possible (new software).
Models of software hosting are changing as well (centralised models - decentralised - centralised in cloud).

Patterns repeat themselves, but with variations.

That's why you need to know the past and the present. Keeping up with technology change is a full-time job, and that's why you need people in a team who are always updating themselves.

Technology is a knowledge business.

BizTalk and Azure Integration Services

BizTalk was pitched as an application server and an application integration server.

It is designed to integrate different software systems and automate business processes (you may have a business process that touches on multiple IT systems - so this solution makes perfect sense - deploy a middleware to manage all the interaction).

BizTalk has/had a bunch of adapters including file adapters, FTP, HTTP, SOAP, SMTP, WCF, MSMQ, MQSeries (now IBM MQ) etc.

It's actually a brilliant product name because it captures the concept of what it does so well.

The replacement for BizTalk is Azure Integration Services.

It is interesting to consider the relationship between "AIS" (as defined above in the Azure context) and "RPA" (or "Robotic Process Automation"). The latter focuses on streamlining human-system interactions, and the former focuses on system-to-system interactions.

Monday, 27 October 2025

Keras Models API

The Keras Models API provides three ways to create models.

  • Sequential Model - the simple model - consists of layers applied in succession.  You can create a Sequential model by passing a list of layers to the constructor of Sequential.
  • The alternative, and preferred method for most use cases, is the Functional API, which is more flexible than the keras.Sequential API.  It enables the building of graphs of Layers.
  • Model subclassing is building from scratch, for out-of-the-box use cases.

All Eyes on Keras - Layer = IO TRANSFORMATION

What is it and Why Use it

Keras is the high-level API for TensorFlow, covering every aspect of the workflow, from data processing to (hyperparameter) tuning to deployment.  It is the API to be used by default.
 
Layers and Models, Layers and Models

The core data structures of Keras are layers and models.
  • A layer is a simple input/output transformation
  • A model is a directed acyclic graph (DAG) of layers (production flow of layers, and thus a production flow of transformations)
A model is thus a "special" series of input-output transformations i.e. a "series" of layers.
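
The "layer = IO transformation" idea can be illustrated without Keras at all. The sketch below is plain Python, not Keras code: the layer functions and the sequential helper are invented for illustration, and a real Keras layer also carries trainable weights.

```python
from functools import reduce

# A "layer" here is just a function from input to output.
def scale(factor):
    return lambda xs: [x * factor for x in xs]

def shift(offset):
    return lambda xs: [x + offset for x in xs]

def relu(xs):
    return [max(0, x) for x in xs]

def sequential(layers):
    """Chain layers so the output of one feeds the next (a linear DAG)."""
    return lambda xs: reduce(lambda acc, layer: layer(acc), layers, xs)

model = sequential([scale(2), shift(-3), relu])
print(model([0, 1, 2, 3]))  # [0, 0, 1, 3]
```

A straight chain like this is the simplest possible DAG; a graph with branches and merges is what the Functional API adds.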

Sidebar: what are hyperparameters

Hyperparameters are parameters you set before training a model. They can be set by a user or by a tuning algorithm. Examples include the learning rate (how large a step the model takes on each update) and the number of layers in the neural network (the more layers, the more complex the patterns the neural net can learn).

Thursday, 16 October 2025

AI as an Accelerated Research Tool

AI is amazing at research and automation of research.  It leads you down all the relevant rabbit holes in a particular area. Using probability and statistics, it finds useful associations and linkages for you, and saves you having to click down a whole tree of links.

Tuesday, 14 October 2025

What is PCI-DSS?

PCI DSS (Payment Card Industry Data Security Standard) is a set of security standards to protect cardholder data during payment processing. Anyone processing, storing or transmitting payment card data needs to comply. There are some generic requirements around firewalls and vulnerability management, as well as more specific rules around, for example, sensitive data in transit.

Friday, 10 October 2025

Updating your Printer Driver

 Go to Device Manager -> Printers and right click, then click on Update Driver.

Renaming your Printer

It can be good to rename your printer - particularly if the current name is a very complex model number.  Simply open the printer in Settings, click on Additional Printer Settings and hit Rename.

Printer Showing Out of Paper But Not

Simply go to services.msc and restart the Print Spooler service.  Apart from queuing print jobs it also does error processing.

Relationship of WMIC to WBEM

WMI is the Microsoft implementation of Web-based Enterprise Management (WBEM), a set of systems management technologies for use in distributed systems.  It is based on common standards like the Common Information Model (CIM).

The WBEM initiative was started in 1996 by BMC Software, Cisco, Compaq, Intel and Microsoft.

The specification for the CIM is maintained by the DMTF (Distributed Management Task Force), an industry standards organization consisting of members and alliance partners collaborating on specifications. It also maintains a repository of its operating policies on issues including IP.

Use wmic to debug printer status in cmd.exe

WMIC is Windows Management Instrumentation Command Line.  It is a command-line interface to WMI.

To use it to debug printer status on Windows, try the following in cmd.exe.

wmic printer get name, status

If status is Error this needs investigation.

Why Does Printing in Windows Always Try to Print in Letter Format?

Go to "Printers & Scanners" in System Settings. Select your device and click on "Printing Preferences". Then click on "Advanced" (Alt-V).  Switch paper size to a new default size (e.g. A4). Click OK, Apply.

Tuesday, 30 September 2025

The Downlow on Explainable AI

Here's a great compilation of the latest in Explainable AI  (XAI). Deep learning models (CNNs, RNNs, LLMs) as well as older models such as Support Vector Machines are covered. Techniques such as SHAP (short for Shapley Additive Explanations) are covered as well.

Monday, 29 September 2025

Getting to Know Svelte

Get to know Svelte.

Open Source Community Communication with Discord

The Discord instant messaging platform, conceived by San-Francisco-born Jason Citron, is popular with virtual communities such as open source projects and gaming groups. It has various safety mechanisms built in, such as the Safety Rules Engine, and the company also emphasises Safety-by-Design (a concept prevalent in other engineering disciplines such as civil engineering).

Mermaid Diagramming

Mermaid diagramming is the concept of using text commands to create flowchart-like diagrams. 

This idea is realised in Mermaid.js, which renders Markdown-inspired text definitions to create and modify diagrams. It is a competitor to Lucidchart.

Among the different charts Mermaid can build are the old database favourite of entity-relationship diagrams (E-R diagrams). Conventionally, we represent entities with capital letters and relations with lowercase letters.  The crow's foot notation is used to connect entities; intuitively expressing the idea of 1-many relationships.

Another popular chart is the sequence diagram. These express interactions between entities, e.g. the favourite Alice and Bob exchanges in cryptography. Entities are drawn as rectangular boxes by default, but stick figures can be used as icons too via the actor syntax.

Quadrant charts, or 2x2 matrices, can easily be built as well. These are common in consulting e.g. the BCG matrix, also known as the product-portfolio matrix. It helps to identify promising investment areas and areas which should be closed down e.g. low growth and low market share quadrant products.

Architecture diagrams can also be rendered in Mermaid.

There are also various experimental diagrams such as Sankey diagrams, which are used in science, especially physics.

Mermaid chart rendering is now available in the latest Visual Studio 2022 update.

Friday, 12 September 2025

dotnet.exe - what it means

dotnet.exe has many uses, but running compiled .NET executables distributed as EXE files is not one of them. One application is running .NET SDK commands.

File format for Jupyter Notebooks

Jupyter notebooks are stored as .ipynb files (the extension comes from "IPython Notebook", the project's original name). 
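
Under the hood an .ipynb file is a JSON document, so it can be inspected with nothing more than the standard library. The sketch below builds a tiny notebook loosely following the nbformat 4 layout (it is hand-written for illustration, not produced by Jupyter) and pulls out the code cells; code_sources is a made-up helper name.

```python
import json

# A minimal notebook document, loosely following the nbformat 4 layout.
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hello')"]},
    ],
})

def code_sources(text):
    """Return the source of every code cell in a notebook JSON string."""
    nb = json.loads(text)
    return ["".join(c["source"]) for c in nb["cells"] if c["cell_type"] == "code"]

print(code_sources(notebook_text))
```

The same function works on a real notebook if you pass it the contents of the file read as text.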

Microsoft Semantic Kernel

Microsoft Semantic Kernel is a "lightweight, open-source development kit" to build AI agents and integrate models into C#, Python and Java code.

When you load up SK into a fresh Visual Studio Code (no extensions) it will prompt to install recommended extensions. These will include:

  • ESLint - integrates ESLint JavaScript into VS Code (for static analysis)
  • Prettier - integrates Prettier, the opinionated code formatter (for JavaScript, TypeScript and other webby stuff)
  • Azure Functions - to quickly manage serverless apps directly from VS Code
  • vscode-pdf - to display pdf files in VSCode (required to open PDF code maps for .NET and Python)

Cost Effective Deployment of Language Models

Cost effective deployment of language models (explicit financial as well as implicit environmental cost) is partly responsible for triggering the interest in small language models (SLMs) as alternatives for specific applications. 

Nvidia Research have a great paper on this entitled "Small Language Models are the Future of Agentic AI" with the recommendation that more routine tasks (non reasoning tasks) move from LLMs to SLMs. Fine tuning these SLMs for specific tasks can also enhance the effectiveness of deployed models.

Thursday, 11 September 2025

Introducing the Mojo Programming Language

Mojo is a programming language in the Python family.  Informally, it is a kind of "high performance Python".

It is currently available in browsers, via Jupyter notebooks and locally on Linux and macOS.

The developers of Mojo are Chris Lattner (ex Apple, ex Tesla, ex Google) and Tim Davis, a former Google employee. Lattner is the original architect of Swift (which he began developing in 2010, with other developers subsequently adding to it) and of LLVM (working with Vikram Adve at the University of Illinois).

Chris has described Mojo as "AI First" whereby AI is driving the design and requirements but it is not designed to be AI-Only, and so it can be described as moving towards a general purpose programming language.

Mojo is built on the MLIR (Multi-Level Intermediate Representation) compiler infrastructure, which Lattner co-created and one of whose aims is to reduce the cost of building domain-specific compilers.


Tuesday, 9 September 2025

Where WinForms lives on GitHub

This is where WinForms lives on GitHub.  You can use it to understand how WinForms works under the hood. 

WinForms is a .NET wrapper over Windows user interface libraries, such as User32 and GDI+.  It also offers "control and functionality ... unique to Windows Forms" (taken from README.md on GitHub). 

Can't Resize a Form in Design Mode for Winforms

Check that a control with Dock set to Fill is not blocking the resize. Such a control can intercept clicks meant for the form. You can work around this by temporarily setting Dock=None on the blocking control.

Resource-Aware Design in WinForms - TextBox versus RichTextBox

The TextBox is the clear winner when it comes to memory usage - as it is optimised for plain text only, versus the RichTextBox which supports rich text formatting as well as images and tables. 

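This means the startup cost is also less.  Layout logic is also a source of resource consumption in the latter: the WinForms RichTextBox wraps the Win32 rich edit control, which maintains formatting and layout state, whereas TextBox wraps the much simpler Win32 edit control and holds just a plain stream of text.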

Here are some metrics to compare 1000 instances of TextBox versus 1000 instances of RichTextBox:

Initial memory: 45MB vs 60MB  (33% heavier)
Memory used: 3MB
Control creation time: 120ms vs 250ms

Restoring a mass of text boxes from memory will take half the time if we use a regular TextBox versus a RichTextBox.

Monday, 8 September 2025

F7 and Shift-F7 - Key Visual Studio Solution Explorer Shortcuts

If you have a C# control in Visual Studio you want to look at, hit the following:

F7 - to see the code (or FN-F7 on certain laptop keyboards that overload function keys)
Shift-F7 - to see the control in Design mode (or Shift-FN-F7 on certain laptop keyboards)

Environment Variables in Windows 11

To see your environment variables, type set in the command line. This will show you a bunch of stuff like:

ALLUSERSPROFILE=C:\ProgramData
APPDATA=C:\Users\windowsjoe\AppData\Roaming
CommonProgramFiles=C:\Program Files\Common Files
CommonProgramFiles(x86)=C:\Program Files (x86)\Common Files
COMPUTERNAME=windowsjoemachine
...
OS=Windows_NT (Even if you are using Windows 11)
...
USERDOMAIN=windowsjoemachine
USERDOMAIN_ROAMINGPROFILE=windowsjoemachine
USERNAME=windowsjoe
WINDIR=C:\Windows
..

And so on, and so forth.

The setx command is an extension of set which allows you to create or modify environment variables, persisting the result across sessions.  It was first integrated into Windows Vista and is a staple for Windows 10 and Windows 11. 

To use setx to append to your PATH variable you will want to do something like this:

setx PATH "%PATH%;C:\Your\New\Directory"

For Windows Joe, a special scripts directory holds a lot of useful scripts. Hence, the path is updated to:

setx PATH "%PATH%;C:\users\windowsjoe\scripts"

If successful, you will see the message "SUCCESS: Specified value was saved.".

However, you will not be able to see the results using echo %PATH% until you start a new session.
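
Two things are worth checking before running setx against PATH: whether the directory is already there, and how long the resulting value is, since setx truncates values longer than 1024 characters. The sketch below does both on plain strings; appended_path is a hypothetical helper and the example PATH is made up.

```python
def appended_path(current_path, new_dir, limit=1024):
    """Return (new_value, fits) for appending new_dir to a PATH-style string.

    Skips the append when new_dir is already present (case-insensitive,
    as Windows paths are), and flags values longer than `limit` characters.
    """
    entries = [e for e in current_path.split(";") if e]
    known = {e.rstrip("\\").lower() for e in entries}
    if new_dir.rstrip("\\").lower() not in known:
        entries.append(new_dir)
    value = ";".join(entries)
    return value, len(value) <= limit

value, fits = appended_path(r"C:\Windows;C:\Windows\System32",
                            r"C:\users\windowsjoe\scripts")
print(value)
print("fits within setx limit:", fits)
```

On a real machine the first argument could be os.environ["PATH"], which, like %PATH% in cmd.exe, is the merged system and user value.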

Sunday, 7 September 2025

Alt-TNP - Package Manager Settings in Visual Studio

This includes a feature called "Clear All NuGet Storage" which wipes out all the local NuGet caches.

Disk space is freed, but builds may take longer initially, since cached metadata and binaries are gone.

Alt-TNO - the NuGet Package Manager Console in Visual Studio

Alt-TNO in Visual Studio 2022 leads you to:

Tools
NuGet Package Manager
Package Manager Console

from where you can use Find-Package to find your desired packages.

An example usage:

Find-Package Newtonsoft.Json

The Jungle of Text Encoding

Dealing with textual data on the Internet is like navigating a jungle. 

Without some normalisation, you need to get adept at handling multiple encodings.

System.Text is your partner here.

This holds the Encoding class, which has various useful properties.

  • Encoding.ASCII 
  • Encoding.Default (default encoding for current .NET implementation)
  • Encoding.Latin1 (Latin 1 character set, ISO-8859-1)
  • Encoding.Unicode (encoding for UTF16 in little endian byte order)
  • Encoding.UTF32 (little endian)
  • Encoding.UTF7 (obsolete)
  • Encoding.UTF8
Recall that Windows is little-endian by default, running primarily on x86 and x86-64 architectures which are little endian.  Even Windows on ARM uses little endian mode (ARM is bi-endian which means it can be used in little or big endian mode).
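
To see what these encodings do to the same character, here is a short sketch in Python rather than C#. The Python codec names are stand-ins for the .NET Encoding properties listed above; the mapping is approximate and for illustration only.

```python
text = "é"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# Python codec names standing in for the .NET Encoding properties
encodings = {
    "Latin1": "latin-1",
    "UTF8": "utf-8",
    "Unicode (UTF-16 LE)": "utf-16-le",
    "UTF32 (LE)": "utf-32-le",
}

for label, codec in encodings.items():
    # hex(' ') prints the encoded bytes separated by spaces
    print(f"{label:20} {text.encode(codec).hex(' ')}")
```

In the little-endian encodings the low byte (e9) comes first, followed by zero padding. ASCII is left out because it cannot represent this character at all.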

Saturday, 6 September 2025

PascalCase for WinForms Controls - Always

PascalCase is used for readability and consistency.  It aligns with how .NET controls are named, such as Button, Label and ComboBox.  

In fact, all public types and members are in PascalCase. Private fields are often in camelCase or _camelCase.

Key Properties of SplitContainer

There are some properties of SplitContainer that are well-worth remembering for programming efficiency.

Recall that a property in C# is a member that encapsulates access to a class field, typically through get and set accessors. The set accessor receives the assigned value through the implicit value keyword.

Here are some key ones:
  1. Panel1 - the leftmost or topmost panel in a SplitContainer
  2. Panel2 - the rightmost or bottommost panel in a SplitContainer
  3. FixedPanel - which panel keeps its size when the container is resized
  4. Orientation - orientation of the panels
  5. Panel1Collapsed - whether Panel1 is collapsed
  6. Panel2Collapsed - whether Panel2 is collapsed
  7. BorderStyle - container border style

Prompt Maintenance

So-called prompt engineering is fraught with execution risk. New features are added to LLMs all the time. What worked yesterday may not work (as well) today and may need to be further "tuned".

This high touch maintenance requirement may trigger a shift away from usage of prompts in production workflows to more truly engineering-oriented solutions that have better stability and resilience properties.

Shortcuts for WinForms Custom Controls

To swiftly create custom controls in WinForms (which is needed for applications beyond a certain size), command of some keyboard shortcuts is preferable.

Control-Alt-X      Make the Toolbox appear
F4     When selecting a control, to make the Properties window appear (on some laptops you need to do FN-F4)
F12   Go to definition (or FN-F12)
Alt-F12    Peek definition (or FN-ALT-F12)

Changes in .NET 6 to WinForms

While .NET 6 may seem like a distant memory, a very important change was made to WinForms at the time.

In the main Program class, and static Main method, the previous initialization routines were replaced with a more succinct statement - ApplicationConfiguration.Initialize().

Compiler Intrinsics - Basics, Pros and Cons

In Microsoft C++, many functions come from libraries, whilst some are built in to the compiler. The latter functions are known as compiler intrinsics.

PROS

An intrinsic function is usually inserted inline, avoiding overhead of a function call. Hence, compiler intrinsics are efficient!

They can be even jiffier than inline functions, as the optimizer in the compiler has knowledge of them and can optimise usage!

POTENTIAL CONS

A potential downside of using intrinsic functions is reduced portability of the code - other compilers may not support these functions. Also, some intrinsics may be available for some target architectures and not others.

COMMON INTRINSICS

Microsoft make some intrinsics available on all target architectures. Here's an example!

_AddressOfReturnAddress

signature: void * _AddressOfReturnAddress();

<intrin.h>

This function provides the address of the memory location that holds the return address of the current function.

MASM Decoded

MASM is the Microsoft Macro Assembler. 

It was introduced by Microsoft in the early 1980s to support x86 programming on DOS and Windows Platforms, and competed with IBM and Borland assembly tools.  The "macro" in the naming refers to assembly macros - reusable code snippets to simplify complex or repetitive assembly tasks.   It is an assembler in the sense it converts assembly language into executable machine code.

Inline assembler used to be a "thing" in earlier versions of Visual Studio. This allowed you to embed assembly language in higher-level programming language source code. It is not supported for x64 or ARM64 targets.

Options to port inline assembler include: converting to C++, creating separate assembly language source files, or using compiler intrinsics (supported by the Microsoft C++ compiler).

Visual Studio 2022 August 2025 Updates

August 2025 brought new amendments to Visual Studio 2022. 

For the August 19th release (17.14.13), these included the latest MAUI install (9.0.82), a fix for a stack overflow crash when linking MASM-generated debug information, and a fix for the Live Visual Tree horizontal scrollbar (Live Visual Tree gives you a real-time view of running XAML code).

Versions follow the MAJOR.MINOR.BUILD convention, so the August 19th release is the 13th build of the 14th minor release of the 17th major release.
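
Splitting a version string into its three parts also makes comparisons behave properly. This is a minimal sketch: VsVersion and parse_version are made-up names, and 17.9.40 is an arbitrary value chosen only to show the comparison.

```python
from typing import NamedTuple

class VsVersion(NamedTuple):
    major: int
    minor: int
    build: int

def parse_version(text):
    """Split 'MAJOR.MINOR.BUILD' into three integers."""
    major, minor, build = (int(part) for part in text.split("."))
    return VsVersion(major, minor, build)

v = parse_version("17.14.13")
print(v)
# Tuples compare numerically field by field, so minor 14 beats minor 9
print(v > parse_version("17.9.40"))  # True
```

Comparing the raw strings would get this wrong, since "17.14" sorts before "17.9" alphabetically.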

Friday, 5 September 2025

Python platform module API shows errors - Oh Yeah

The Python platform module can report Windows 10 on Windows 11 machines. This is similar to other APIs, which identify Windows 11 as a more advanced build of Windows 10 to avoid breaking backward compatibility. If the build number is 22000 or higher, you have Windows 11. 
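
A build-number check is easy to write. The sketch below assumes the version string has the "10.0.BUILD" shape that platform.version() returns on Windows; is_windows_11 is a hypothetical helper, and 22000 is the first Windows 11 build.

```python
import platform

def is_windows_11(version):
    """version is a string like '10.0.22631', as returned by platform.version()."""
    parts = version.split(".")
    return len(parts) >= 3 and parts[0] == "10" and int(parts[2]) >= 22000

print(is_windows_11("10.0.19045"))  # False: a Windows 10 build
print(is_windows_11("10.0.22631"))  # True: a Windows 11 build

# Only consult the live value on Windows, where the format is known
if platform.system() == "Windows":
    print(is_windows_11(platform.version()))
```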

Sunday, 24 August 2025

Silverlight is Dead, Long Live Silverlight

Silverlight lives on in XAML. Visual Studio supports XAML but has lost the visual XAML editor. It's ok.

Figma Basics and First Impressions

Figma feels a lot "lighter" than Microsoft's historical desktop design tool Expression Blend. Being a SaaS it is also easy to get started without installing software - which is time consuming and taxing on laptops.

It operates on design files, has a central canvas, toolbox and two sidebars. The left sidebar is Navigation and the right sidebar is the Properties bar. Very neat and simple.

Figma for Non Design People

The Concept of Figma

Figma is a SaaS tool - popular with UI/UX designers - for creating prototypes and designs enabling real-time collaboration.  Its purpose is to "bridge the gap" between designers and developers.

Adobe XD (Adobe Experience Design), now being discontinued, and Sketch are alternatives - however Figma was designed from the ground up to be cloud-native. This adds convenience as well as practical usefulness e.g. collaborating around live synchronised files.

Microsoft Integration with Figma

Microsoft has built custom Figma plugins to integrate its Fluent Design System into Figma. This creates design consistency e.g. in icons, text strings etc. The Fluent Design System is the design language used across Windows, Office and other products.

Microsoft PowerApps

Microsoft enables app creation from Figma designs using Power Apps, part of the Power Platform series of low code tools.

Figuring out Fluent 2

Fluent 2 is the next iteration of Microsoft's design "product", which consists of the Fluent 2 UI Kit built in Figma.

Monday, 18 August 2025

Sidestep Kernel with RDMA

Remote Direct Memory Access - or RDMA for short - allows machines to access memory of other machines without going through the operating system kernel.  This is a traditional computer science trick - if you have an abstraction in the way of your objective, bypass the abstraction.

The idea is to get performance improvement - low latency and high throughput. Sometimes these tricks come under the label of "ultra-low-latency software development" or "ultra low latency techniques".

The goal is zero-copy (or near-zero-copy) data transfer - data moves between memory buffers directly. 

Key benefits include:
  • Skipping kernel involvement
  • Avoiding the creation of multiple intermediate copies
  • Avoiding switches between kernel space and user space
Protocols to support this include InfiniBand, RoCE (RDMA over Converged Ethernet) and iWARP.

Kernel Bypass Networking

Kernel bypass networking is a technique to improve network performance by enabling applications to interact with network hardware directly, instead of going via an abstraction layer (the OS kernel's network stack).  All intermediate abstractions are bypassed. There are various specialized technologies and libraries, such as DPDK, RDMA and OpenOnload, for accessing Network Interface Cards (NICs) directly. 

Monday, 11 August 2025

Built-in Secure VPN for Microsoft Edge

Microsoft Edge now comes with a free built-in VPN to keep your location private, safeguard sensitive data, fill out forms and more. There is an option to automatically use VPN for public wifi. The VPN is reportedly powered by Cloudflare and comes with usage limits.

Friday, 8 August 2025

OpenAI Releases Open Models under Apache 2.0

OpenAI has released two open models as part of what is now dubbed the gpt-oss series.
  • gpt-oss-120b
  • gpt-oss-20b
The first is more suited to data centers; the second is more for laptop and personal use.

They are both trained on the harmony response format (if going through a provider like HuggingFace, Ollama or vLLM you don't need to worry about it - the provider will deal with the format).

AWS S3 (and Auth Methods) for Azure Folks

Amazon S3 is something akin to Azure Blob Storage.

A quick revision of Azure Blob Storage: 

Azure Blob Storage is a general-purpose cloud storage option for cloud native workloads.  

It supports WORM operating scenarios (Write Once Read Many) aka WORM-compliant storage (which prevents data from being edited or deleted). Role based access control (RBAC) is supported as is authentication with Microsoft Entra (formerly Azure Active Directory).

Amazon S3 (Simple Storage Service) is a similar offering in AWS.

The service is exposed as a web service with a custom HTTP authentication scheme based on a keyed HMAC (Hash-based Message Authentication Code).  A more "high level" authentication option is available via AWS IAM.
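
The keyed-HMAC idea itself fits in a few lines of standard-library Python. This is a generic sketch, not the actual AWS signing process (which derives a signing key through several chained HMAC steps over a canonical request); the key, the request summary and the function names are all made up.

```python
import hashlib
import hmac

def sign(secret_key: bytes, message: bytes) -> str:
    """Compute a hex HMAC-SHA256 tag over the message with the shared key."""
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

def verify(secret_key: bytes, message: bytes, signature: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(secret_key, message), signature)

key = b"example-secret-key"  # made-up shared secret
request = b"GET\n/example-bucket/report.csv\n20250808T000000Z"  # made-up request summary
sig = sign(key, request)
print(verify(key, request, sig))                # True
print(verify(key, request + b"tampered", sig))  # False
```

The point is that only a party holding the secret key can produce a valid tag, and any change to the request invalidates it.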

The End of Password Autofill in Microsoft Authenticator

Microsoft Authenticator is a mechanism for MFA (or multi-factor authentication) - proving your identity using more than just username and password. As of 1 August 2025, however, its password storage and autofill feature has been shut down, with passkeys being the recommended alternative. The app itself continues to work for MFA.

Thursday, 10 July 2025

Do you know Windsurf? Oh, sorry, Codeium, actually under the hood!

Windsurf is Codeium. A rebrand, but not a bad one. It states it is the "most powerful AI code editor". Investors include Founders Fund, General Catalyst, Greenoaks and Kleiner Perkins, but that need not trouble Windows Joe. The main thing is this is an invested platform, so developers can invest time in it.

Wednesday, 9 July 2025

Apache Nutch - the Tool that Drives Common Crawl

Apache Nutch is the tool that delivers data for Common Crawl. Its GitHub repository also contains a link to the wiki which tells you the active version.

LLM Training Data

LLMs are trained on large data sets.  One such data set is Common Crawl which consists of 250 billion Internet pages with 3-5 billion pages added each month.  This is petabytes worth of data (1 petabyte = 10^15 bytes of digital information). The data is stored on Amazon's S3 service allowing direct download or access for Map-Reduce processing in EC2.

Tuesday, 8 July 2025

What is InstructLab?

InstructLab was developed by IBM Research and Red Hat and is an open source project. It is designed to improve training of LLMs (specifically reducing cost of training).  A basic intro can be found here. It uses fine-tuning (both knowledge tuning and skills tuning).

Friday, 4 July 2025

Latest .NET Version as of July 2025

The latest supported .NET version as of July 2025 is .NET 9 (STS).

The latest stable .NET version as of July 2025 is 9.0.6, released on June 10, 2025. 

.NET 9  Patch version 9.0.6;  Release: STS; End of support: May 2026 (ORD: November 2024)

.NET 8  Patch version 8.0.17; Release: LTS; End of support: November 2026 (ORD: November 2023)

ORD means Original Release Date. 

Release Schedule

Major .NET versions are released annually in November. Each release is defined as STS or LTS at the beginning of the release.

Details of Microsoft's Lifecycle Policy are Below

Microsoft Lifecycle Policy | Microsoft Learn

2 Become 1 - Story of .NET Framework and .NET Core

The merger of .NET Framework and .NET Core was completed with .NET 5 in November 2020. 

The new cross-platform framework was now known simply as .NET.

.NET Framework was Windows only; .NET Core was its cross-platform, open source, cooler cousin.

The "Core" branding was dropped and .NET was now the mainline successor of .NET Core 3.1. .NET Framework 4.8 was frozen with no new features planned.

What is TriG in Semantic Computing?

TriG is an extension of Turtle for representing all the data in RDF graphs in a compact format. It is a W3C recommendation as of February 2014. TriG stands for "triples in graphs".

Any Turtle statement is also a valid statement in TriG.

Thursday, 3 July 2025

Can't fit your model into 1 GPU - try Fully Sharded Data Parallel

 PyTorch details how this works.

Unsloth

Unsloth aims to speed up the expensive process of LLM training. It does this by rewriting different components of the training pipeline, including the gradient calculation. Their motto is "24 hours not 30 days", which is a reference to LLM training time. It also claims to rewrite GPU kernels (functions designed to be executed on GPUs) for efficiency.

The Hugging Face Transformers Library and MRM

The Transformers library puts pretrained open models in the hands of Python programmers. 

It is maintained by Hugging Face, a hub for "SOTA" AI models.

Hugging Face also maintain markdowns called Model Cards in each relevant model repo to give you insight into the models.

The concept of Model Cards is explained in this paper. It argues, for high impact applications, the Model Card brings critical usage information for deployers to consider.  This could be seen as a tool to support Model Risk Management (MRM).

Transformers is available on PyPI and can be installed with pip.

So - what is Preference Alignment in LLM Training?

Preference alignment in LLM training aims to improve an LLM's behavior by forcing it to follow rules and preferences.  It could relate to stopping offensive language or some other restriction. 

Some approaches to preference alignment are detailed in this blog post from Miguel Mendez. There are a number of known techniques for this - these include:

PPO: Proximal Policy Optimization

DPO: Direct Preference Optimization

ORPO: Odds Ratio Preference Optimization (works without a reference model)

For preference alignment we usually need data which is good or bad.  Human annotation of such data is often expensive and in some cases a clear "winner" in terms of contrasting data points is not decidable. With KTO two answers can both be regarded as good. This arguably is closer to reality. 

KTO stands for Kahneman-Tversky Optimization and is detailed more in a blog post from contextual.ai.

The research paper on KTO should be read to understand how to construct the relevant KTO loss function.

SPARQL is the query language for RDF - Know It

SPARQL is THE query language for RDF. Here are some learning resources.

SPARQL.dev

SPARQL 1.1 (W3C Site)

SPARQL 1.2 (W3C Site)

RDF-Star (W3C WG)

Both universities and commercial firms are involved in the RDF Star WG.

SPARQL uses pattern matching to query an RDF graph and also allows aggregate algebra operations (such as COUNT) to be performed on qualifying nodes.
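
The pattern-matching idea can be mimicked in plain Python. This is a toy illustration only, not a SPARQL engine: the triples are made up, and None plays the role of a SPARQL variable.

```python
triples = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "knows", "carol"),
    ("alice", "worksAt", "acme"),
]

def match(pattern, data):
    """Yield triples matching a pattern; None acts like a SPARQL variable."""
    for triple in data:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple

# Roughly: SELECT (COUNT(?o) AS ?n) WHERE { :alice :knows ?o }
print(sum(1 for _ in match(("alice", "knows", None), triples)))  # 2
```

A real query engine joins several such patterns on shared variables, which is where most of the work lies.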

The Power is INDEED the LLAMA

Llama 4 ("Leading Intelligence") is out. 

There is something called the Llama 4 Community License Agreement. This states you can use Llama models in derived products - but you must tell the world you are using Llama and what version it is.

Machine Unlearning

As a machine learns, so must it unlearn.  

This ability is needed if an LLM ingests copyrighted content or personal data - it must be able to unlearn information it is not permitted to have. This could also apply to fallacious or untrusted data.

IBM, in an article, has noted the lack of industry-wide tools to evaluate the effectiveness of unlearning.

The IBM piece also highlights research by Microsoft on machine unlearning. This also states the problem of the high cost of retraining models (this costly training process is what has spiked demand for GPUs).

A research paper, which styles itself as a "bridge" paper between unlearning research in classification models to unlearning in generative models focusing on the I2I (image-to-image) generation space.

In the IBM article, the writers go on to describe the SPUNGE framework they have developed for machine unlearning (SPUNGE being short for Split, Unlearn, Merge).

Tuesday, 1 July 2025

Concept of LoRA or Low Rank Adaptation in LLMs

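LoRA is an approach to fine-tuning LLMs efficiently by reducing the "size" of the matrix of trainable parameters, as measured by the "rank" of the matrix, i.e. the number of linearly independent rows or columns. Rather than updating a full weight matrix, it trains two much smaller matrices whose product is a low-rank update.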
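A rough NumPy sketch of the idea follows. The dimensions, the rank r and the random values are made up for illustration, and no training takes place; the point is only to compare parameter counts and check the rank of the update.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 4          # layer dimensions and a small rank (illustrative values)

W = rng.normal(size=(d, k))  # stands in for a frozen pretrained weight matrix

# The two small trainable factors. Random values stand in for trained ones here
# (in LoRA itself, B starts at zero so the update is initially zero).
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))

delta_W = B @ A              # the low-rank update; its rank is at most r
W_adapted = W + delta_W      # same shape as W, so the layer is used as before

full_params = d * k          # parameters trained by full fine-tuning
lora_params = r * (d + k)    # parameters trained with the two factors
update_rank = np.linalg.matrix_rank(delta_W)
```

With these numbers the full matrix has 4096 trainable entries while the two factors have 512 between them, and the saving grows as the layer gets larger relative to r.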

What is RIO in RDF? Is it relevant post OpenRDF?

RIO is the "RDF Input/Output" toolkit.  

The RIO appellation persists even in the post OpenRDF world.

RIO was part of OpenRDF and is now part of RDF4J. Docs are here

RIO's parsers and writers can be used independently of the rest of the library. An important interface in the toolkit is the RDFHandler, which receives parsed RDF triples.  This can be used as a pure listener, or as a reporting tool (being passed to a function that needs to report results back).

It's good to understand RIO both for comprehending legacy messages from OpenRDF and also more recent exceptions from RDF4J.

OpenRDF Usage in Blazegraph

Blazegraph (no longer maintained and unofficially superseded by Amazon Neptune) uses OpenRDF under the hood (rather than the renamed version RDF4J).

This can be seen by forcing an exception in the Blazegraph workbench. Typing some bad syntax into the Update window (typing "hello world" will do nicely) produces:

org.openrdf.rio.RDFParseException

Why does Copilot want to rewrite my RDF when it doesn't know how?

Copilot wants to change the "tone" of an RDF file created by Windows Joe, an operation that would consume AI credits. The tone could be Formal, Casual, Inspirational or Humor (sic).  Um, no thanks Copilot. 

What is Turtle in the World of Semantic Web?

Turtle is a notation for writing RDF and has been a W3C Recommendation since February 2014.  

Its full name is Terse RDF Triple Language. 

It is one of a family of ways to write RDF, also known as RDF serialization formats. Turtle's plus point is that it is user-friendly (and less verbose than, say, XML).

Turtle's official documentation states that it:

 "allows an RDF graph to be completely written in a compact and natural text form, with abbreviations for common usage patterns and datatypes. Turtle provides levels of compatibility with the N-Triples [N-TRIPLES] format as well as the triple pattern syntax of the SPARQL W3C Recommendation".

Monday, 30 June 2025

Why are GPUs fundamental to AI? And AI training?

Deep learning algorithms require linear algebra (matrix multiplications), and GPUs are very well suited to linear algebra and matrices (as per their original design to compute graphics transformations efficiently).

The matrix multiplications are used to update weights, and multiple cycles of updating weights, often referred to as epochs, are required to adequately train these neural network models.
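Here is a small sketch of that loop using NumPy. It runs on the CPU and trains a single linear layer on made-up data, so it is only a stand-in for a real network; GPU frameworks perform the same kind of matrix multiplications, just in parallel and at far larger sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(256, 8))        # 256 made-up samples with 8 features
true_W = rng.normal(size=(8, 1))
y = X @ true_W                       # targets generated from a known weight matrix

W = np.zeros((8, 1))                 # weights to be learned
learning_rate = 0.1
losses = []

for epoch in range(50):              # one epoch = one full pass over the data
    predictions = X @ W              # forward pass: a matrix multiplication
    error = predictions - y
    losses.append(float(np.mean(error ** 2)))
    gradient = X.T @ error * (2 / len(X))   # backward pass: another matrix multiplication
    W -= learning_rate * gradient    # weight update
```

Each epoch costs two matrix multiplications here, and the recorded loss shrinks as the weights approach true_W.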

GPUs are basically the "free weights" of the AI training world.  More GPUs means more "reps" of AI training moves.

Geoffrey Hinton was one of the early pioneers who harnessed the power of GPUs to achieve AI training speeds hitherto unknown.

Is Blazegraph now Amazon Neptune?

Blazegraph is an open source graph database written in Java. It has been abandoned since 2020 but is still used in production by the Wikidata Query Service. It is licensed under GPL version 2.0. 

Amazon Neptune is Amazon's high-performance graph database available through AWS. As Amazon hired Blazegraph developers, it is possible that Neptune is the new, maintained incarnation of Blazegraph.

Apache JENA

Apache JENA provides APIs for Semantic Web and linked data applications. It features ARQ, a SPARQL 1.1-compliant engine.

The Rationale Behind "Internationalized" Resource Identifiers

The "names" of graph nodes in RDF are IRIs, or "Internationalized" Resource Identifiers.  They are "internationalized" in the sense that the allowable character set covers the full range of Unicode, which means Chinese characters, Hindi script and other writing systems can be used in names. This differs from URIs, or Uniform Resource Identifiers, which are restricted to a subset of ASCII (IRIs are a superset of URIs and URLs).
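A quick sketch of the difference, using Python's standard library. The Chinese text is just an example value; an IRI containing it can be mapped to an ASCII-only URI by percent-encoding its UTF-8 bytes.

```python
from urllib.parse import quote, unquote

# Example IRI path segment using Chinese characters (the city name "Beijing").
iri_segment = "北京"

# Percent-encoding the UTF-8 bytes gives an ASCII-only form usable in a URI.
uri_segment = quote(iri_segment)

# Decoding recovers the original Unicode text.
round_trip = unquote(uri_segment)
```

Here uri_segment comes out as %E5%8C%97%E4%BA%AC, which is much harder on the eye than the two original characters, and that readability gap is part of the rationale for IRIs.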

What is OpenRDF Better Known as?

RDF hackers will know about OpenRDF, which officially became Eclipse RDF4J in May 2016. 

Its tagline is its power to "create applications that leverage the power of linked data and Semantic Web".

Sesame was another name for what is now known as RDF4J.

Many name changes were also effected in the move to a new governance structure. For example:

org.openrdf.*  Java packages moved to org.eclipse.rdf4j.*

In particular the RDF4J project houses the SAIL (Storage and Inference Layer) API for low level transactional access to RDF data. Sail is dubbed the "JDBC of the RDF database world".

What is Dublin Core in Computer Systems?

Dublin Core is a metadata labelling system.  It is maintained by the DCMI, or Dublin Core Metadata Initiative.

org.openrdf.query.MalformedQueryException

This is the exception you get when you get your SPARQL queries wrong in Blazegraph.

What is Snake Case in Computer Programming?

Snake case is a way of writing compound words, so that each word is separated by an underscore symbol. An example would be "cavendish_laboratory" or "sainsbury_laboratory". 

It is meant to be easier to read than camelCase, _camelCase or PascalCase.

A variant, Kebab Case, uses dashes and though commonly found in URLs, is also used in other contexts e.g. SPARQL-QUERY.
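As a rough sketch, converting between these styles can be done with a couple of regular expressions. The function names below are invented for this example, and acronym handling (e.g. "HTTPServer") is deliberately ignored.

```python
import re


def split_words(name):
    """Split a camelCase, PascalCase, snake_case or kebab-case name into lowercase words."""
    # Insert a space at each lower-to-upper boundary, e.g. "camelCase" -> "camel Case".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Treat underscores and dashes as separators too.
    return [word.lower() for word in re.split(r"[\s_\-]+", spaced) if word]


def to_snake_case(name):
    return "_".join(split_words(name))


def to_kebab_case(name):
    return "-".join(split_words(name))
```

For example, to_snake_case("CavendishLaboratory") gives "cavendish_laboratory" and to_kebab_case("SPARQL_QUERY") gives "sparql-query".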

Change Font Size in Terminal on Windows 11

Changing font size in the Terminal in Windows 11 is a multistep process.  It is an important skill for developers.

1. Press the DOWN ARROW on the menu bar

2. Go to SETTINGS

3. Select the THREE BARS at the top left

4. Select the appropriate profile  (COMMAND PROMPT, as opposed to e.g. Windows PowerShell, Azure Cloud Shell).

5. Click Appearance and Edit Font Size (Default is 12 point, 10 point better for development tasks).

The Blazegraph Database - 50 Billion Edges Supported

The Blazegraph database is an ultra-high-performance graph database that supports the Blueprints and RDF/SPARQL APIs and handles up to 50 billion edges on a single machine. 

It powers the Wikidata Query Service.

There is a Quick Start guide that shows you how to start the Blazegraph JAR file from its installed location. It will then greet you with a Welcome Message from SYSTAP.

java -server -Xmx4g -jar blazegraph.jar

What is THE Semantic Web Stack?

The Semantic Web Stack illustrates the architecture of the Semantic Web.

Another weird name for this is Semantic Web Cake or Semantic Web Layer Cake.

It's built from hypertext technologies (such as XML, XML namespaces) and utilizes middle layer technologies like RDF and SPARQL (which is a middle layer RDF query language).

The security layer and UI layer are evolving areas of Semantic Web technology which are not yet standardised.

Win Joe's Buzzword Alert - What is a SIEM?

SIEM is a buzzword in information security that stands for Security Information and Event Management. Basically this is observability for security events, which supports threat protection for organizations.

Friday, 6 June 2025

The Model Context Protocol (Merci, Anthropique)

The MCP or Model Context Protocol was introduced by Anthropic as a way of sharing data with LLMs, or put differently, connecting LLMs to wider data sources. 

Anthropic has dubbed it a "universal translator". 

MCP is also highly relevant for those developing AI agents. A standard protocol makes integration easier. 

MCP was mentioned by Sundar Pichai in the Google I/O 2025 keynote.
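To give a flavour of what "a standard protocol" means in practice: MCP messages are JSON-RPC 2.0. The sketch below only builds two request messages as Python dicts. It does not talk to a real server, the make_request helper is invented for this example, and the tool name and arguments are hypothetical.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an ID so responses can be matched to them


def make_request(method, params=None):
    """Build a JSON-RPC 2.0 request as a dict."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask a server which tools it offers.
list_tools = make_request("tools/list")

# Call one tool; "search_docs" and its arguments are hypothetical.
call_tool = make_request(
    "tools/call",
    {"name": "search_docs", "arguments": {"query": "canary deployment"}},
)

wire_format = json.dumps(call_tool)
```

Because every server accepts the same message shapes, an agent written against one MCP server needs no new plumbing to talk to another.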

What is WebRTC?

WebRTC (Web Real-Time Communication) is an open standard providing real-time communication between web applications using APIs. It was initially released in 2011 and was the work of Justin Uberti and Peter Thatcher. 

The official website can be found on the Google for Developers web.

One application it enables is browser-based VoIP telephony, or "web phones", enabling calls to be made and received from within a web browser.

Tuesday, 3 June 2025

Against AI-First

AI-first is often not what you want in software.  Systems need to start with humans and human intentions, and AI needs to provide seamless support rather than take center stage. The other thing is that AI written by humans introduces human biases. These may make sense where the creator and end user are the same, but too often programmer biases enter consumer software, which is bad for human-centered software development.

Monday, 26 May 2025

Progressive Rollout (aka "Canary Deployment") in the Cloud

A canary deployment is an old concept with new branding in the age of cloud, and the term is used by Google, AWS and Azure alike. Kubernetes technology is one way to manage canary deployments.

Canary deployments are progressive rollouts where new functionality is released to a subset of specially selected users. The canary deployment therefore runs in parallel to the current production deployment used by your regular users. This gives you more time and space to test the reliability of new features "in the wild".
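One common way to pick that subset is to hash each user ID into a stable bucket. The sketch below is a minimal illustration of that approach, not how any particular cloud provider implements it; the function names and the allowlist parameter are invented for this example.

```python
import hashlib


def bucket_for(user_id):
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_deployment(user_id, canary_percent, allowlist=frozenset()):
    """Return "canary" or "stable" for a user.

    Users on the allowlist always get the canary; everyone else is assigned
    by bucket, so the same user sees the same version on every request.
    """
    if user_id in allowlist or bucket_for(user_id) < canary_percent:
        return "canary"
    return "stable"
```

Raising canary_percent step by step (say 1, 5, 25, 100) widens the rollout, and dropping it back to 0 sends everyone outside the allowlist back to the stable version.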

Introducing the Azure SRE Agent

The new Azure SRE agent (announced in May 2025 and demonstrated at MS Build) is designed to make it easier to "sustain production environments". This includes taking the toil out of checking log files and analyzing historical changes, and augmenting this work with LLMs. Incident and infrastructure management is set to be transformed, with the Azure SRE agent able to partner in incident investigation and root cause analysis. An example prompt may be: "visualize HTTP request and 500 errors for last week for my app".

Wednesday, 14 May 2025

Data Flywheels

The concept of a data flywheel is central to continuous improvement of AI systems: usage generates data, the data improves the model, and the better model attracts more usage.

AI Resources

Good AI resources (what's happening in the AI world):


And company specific AI news:

Tuesday, 13 May 2025

CNCF

We have discussed CNCF in the context of gRPC. 

Other famous CNCF-hosted projects are Kubernetes, Prometheus and CoreDNS.

In their own words they host "critical components of the global technology infrastructure". They also organize conferences.

wsl for Windows 11

wsl is not installed by default on Windows 11. To install it, just type wsl and tap any key when prompted.

You can then type wsl --version to get version info.

This will tell you the wsl version (e.g. 2.4.13.0), kernel version (e.g. 5.15.167.4-1) and MSRDC version (e.g. 1.2.5716).

The kernel version does not refer to the Windows kernel version but the WSL kernel version. 

Kernel releases can be found here.  MSRDC version refers to Remote Desktop Client whose versions can be found here (and which can be used to connect with Azure Virtual Desktop).

TCF Vendors

On some websites, when presented with the option to accept cookies, you may see a header with TCF vendors.

TCF refers to the Transparency and Consent Framework (a voluntary standard) being promoted by IAB Europe, a Europe-level association for the digital marketing and advertising ecosystem.

This facilitates compliance with GDPR (the General Data Protection Regulation, which has applied since 25 May 2018) and the ePrivacy Directive (aka "ePD", a 2002 directive also known as the "cookie law"), which an upcoming EU regulation has been proposed to replace.

An example of a data collector who might be surfaced through the TCF are:

Friday, 9 May 2025

Papers with Code

Papers with Code is a Meta AI initiative that organizes machine learning papers under various themes including Computer Vision, Natural Language Processing, Reasoning, Time Series and Knowledge Representation. Some of these papers are written by corporate researchers contributing to open source.

Thursday, 1 May 2025

Visual Studio Magazine

MSVS is large, complex and changes quickly enough to warrant its own magazine. Read it well.

Dot net (.NET) MAUI for Dummies

Dot net (.NET) MAUI (Multi-platform App UI) is a cross-platform framework for creating native mobile and desktop applications with C# (and XAML optionally).

The upside is you just have one codebase which can be used to render UI on Windows, Android, iOS, macOS and Samsung Tizen.

For Windows, WinUI 3 is used as the native platform (this means it will work on Windows 10 version 1809 or later, and Windows 11).

MAUI has evolved from multifarious UI technologies.