Thursday, 10 July 2025

Do you know Windsurf? Oh, sorry, Codeium, actually under the hood!

Windsurf is Codeium. A rebrand, but not a bad one. It describes itself as the "most powerful AI code editor". Investors include Founders Fund, General Catalyst, Greenoaks and Kleiner Perkins, but that need not trouble Windows Joe. The main thing is that this is a well-funded platform, so developers can reasonably invest time in it.

Wednesday, 9 July 2025

Apache Nutch - the Tool that Drives Common Crawl

Apache Nutch is the tool that delivers data for Common Crawl. Its GitHub repository also contains a link to the wiki which tells you the active version.

LLM Training Data

LLMs are trained on large data sets.  One such data set is Common Crawl which consists of 250 billion Internet pages with 3-5 billion pages added each month.  This is petabytes worth of data (1 petabyte = 10^15 bytes of digital information). The data is stored on Amazon's S3 service allowing direct download or access for Map-Reduce processing in EC2.

Tuesday, 8 July 2025

What is InstructLab?

InstructLab was developed by IBM Research and Red Hat and is an open source product. It is designed to improve the training of LLMs (specifically by reducing the cost of training).  A basic intro can be found here. It uses fine-tuning (both knowledge tuning and skills tuning).

Friday, 4 July 2025

Latest .NET Version as of July 2025

The latest supported .NET version as of July 2025 is .NET 9 (STS)

The latest stable .NET version as of July 2025 is 9.0.6, released on June 10, 2025. 

.NET 9  Patch version 9.0.6;  Release: STS; End of support: May 2026 (ORD: November 2024)

.NET 8  Patch version 8.0.17; Release: LTS; End of support: November 2026 (ORD: November 2023)

ORD means Original Release Date. 

Release Schedule

Major .NET versions are released annually in November. Each release is defined as STS or LTS at the beginning of the release.
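The dates listed above are consistent with a support window of 18 months for an STS release and 36 months for an LTS release. Those window lengths are inferred here from the listed dates, not quoted from the lifecycle policy, so the following is only a rough sketch. The function and constant names are illustrative.

```python
# Support window lengths in months. These are assumptions inferred from
# the dates listed above, not values quoted from Microsoft's policy.
SUPPORT_MONTHS = {"STS": 18, "LTS": 36}


def end_of_support(release_year, release_month, release_type):
    """Return (year, month) in which support ends for a release."""
    # Count months from year 0 so the addition is plain integer arithmetic.
    total = release_year * 12 + (release_month - 1) + SUPPORT_MONTHS[release_type]
    return total // 12, total % 12 + 1


print(end_of_support(2024, 11, "STS"))  # .NET 9, released November 2024
print(end_of_support(2023, 11, "LTS"))  # .NET 8, released November 2023
```

Under these assumptions the two calls reproduce the May 2026 and November 2026 dates given for .NET 9 and .NET 8.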

Details of Microsoft's Lifecycle Policy are Below

Microsoft Lifecycle Policy | Microsoft Learn

2 Become 1 - Story of .NET Framework and .NET Core

The merger of .NET Framework and .NET Core was completed with .NET 5 in November 2020. 

The new cross-platform framework was now known simply as .NET.

.NET Framework was Windows only; .NET Core was its cross-platform, open source, cooler cousin.

The "Core" branding was dropped and .NET was now the mainline successor of .NET Core 3.1. .NET Framework 4.8 was frozen with no new features planned.

What is TriG in Semantic Computing?

TriG is an extension of Turtle for representing all the data in RDF graphs in a compact format. It is a W3C recommendation as of February 2014. TriG stands for "triples in graphs".

Any Turtle statement is also a valid statement in TriG.
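As a rough illustration, a small TriG document could look like the text below: first a plain Turtle statement, which lands in the default graph, then a named graph wrapped in braces. The example.org IRIs are invented, and Python is used here only to hold and print the text.

```python
# A hypothetical TriG document held as a Python string.
# All example.org IRIs are invented for illustration.
TRIG_DOC = """\
@prefix ex: <http://example.org/> .

# A plain Turtle statement: valid TriG, stored in the default graph.
ex:joe ex:knows ex:jane .

# A named graph: the graph IRI followed by triples inside braces.
ex:graph1 {
    ex:jane ex:knows ex:joe .
}
"""

print(TRIG_DOC)
```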

Unsloth

Unsloth aims to speed up the expensive process of LLM training. It does this by rewriting components of the training pipeline, including the gradient calculation. Their motto is "24 hours not 30 days", which is a reference to LLM training time. It also claims to rewrite GPU kernels (functions designed to be executed on GPUs) for efficiency.

The Hugging Face Transformers Library and MRM

The Transformers library puts pretrained open models in the hands of Python programmers.

It is maintained by Hugging Face, a hub for "SOTA" AI models.

Model repos on Hugging Face also carry Markdown files called Model Cards, which give you insight into the models.

The concept of Model Cards is explained in this paper. It argues that, for high-impact applications, the Model Card brings critical usage information for deployers to consider.  This could be seen as a tool to support Model Risk Management (MRM).

Transformers is available on PyPI and can be installed with pip.

So - what is Preference Alignment in LLM Training?

Preference alignment in LLM training aims to improve an LLM's behavior by training it to follow rules and preferences.  It could relate to stopping offensive language or some other restriction.

Some approaches to preference alignment are detailed in this blog post from Miguel Mendez. There are a number of known techniques for this - these include:

PPO: Proximal Policy Optimization

DPO: Direct Preference Optimization

ORPO: Odds Ratio Preference Optimization (works without a reference model)
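To make one of these concrete, the DPO loss for a single preference pair can be sketched with the standard library alone. It takes the log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the model being trained and under a frozen reference model, and computes -log(sigmoid(beta * (chosen margin - rejected margin))). The function name, argument names and the default beta are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) is the same as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability towards the chosen response: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy matches the reference the loss equals log 2, and it falls as the policy favours the chosen response more strongly than the reference does.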

For preference alignment we usually need data which is good or bad.  Human annotation of such data is often expensive and in some cases a clear "winner" in terms of contrasting data points is not decidable. With KTO two answers can both be regarded as good. This arguably is closer to reality. 

KTO stands for Kahneman-Tversky Optimization and is detailed more in a blog post from contextual.ai.

The research paper on KTO should be read to understand how to construct the relevant KTO loss function.

SPARQL is the query language for RDF - Know It

SPARQL is THE query language for RDF. Here are some learning resources.

SPARQL.dev

SPARQL 1.1 (W3C Site)

SPARQL 1.2 (W3C Site)

RDF-Star (W3C WG)

Both universities and commercial firms are involved in the RDF-Star WG.

SPARQL uses pattern matching to query an RDF graph and also allows aggregate algebra operations (such as COUNT) to be performed on qualifying nodes.
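The pattern matching idea can be sketched in plain Python. This is a toy illustration and not a SPARQL engine: triples are tuples, terms starting with "?" are variables, and the prefixed names are invented.

```python
# Toy graph: each triple is (subject, predicate, object). Names are invented.
GRAPH = [
    ("ex:joe", "ex:knows", "ex:jane"),
    ("ex:joe", "ex:knows", "ex:sam"),
    ("ex:jane", "ex:knows", "ex:sam"),
    ("ex:sam", "ex:name", '"Sam"'),
]


def match(pattern, graph):
    """Yield one variable binding (a dict) per triple that fits the pattern.

    Terms starting with "?" are variables; anything else must match exactly.
    """
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


# Roughly: SELECT (COUNT(?o) AS ?n) WHERE { ex:joe ex:knows ?o }
bindings = list(match(("ex:joe", "ex:knows", "?o"), GRAPH))
print(len(bindings), [b["?o"] for b in bindings])
```

A real engine joins many such patterns and applies aggregates over the resulting bindings; the sketch covers only a single triple pattern followed by a count.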

The Power is INDEED the LLAMA

Llama 4 ("Leading Intelligence") is out. 

There is something called the Llama 4 Community License Agreement. This states you can use Llama models in derived products - but you must tell the world you are using Llama and what version it is.

Machine Unlearning

As a machine learns, so must it unlearn.  

This ability is needed if an LLM ingests copyrighted content or personal data - it must be able to unlearn information it is not permitted to have. This could also apply to fallacious or untrusted data.

In an article, IBM has noted the lack of industry-wide tools to evaluate the effectiveness of unlearning.

The IBM piece also highlights research by Microsoft on machine unlearning. It also notes the high cost of retraining models (this costly training process is what has spiked demand for GPUs).

One research paper styles itself as a "bridge" between unlearning research in classification models and unlearning in generative models, focusing on the I2I (image-to-image) generation space.

In the IBM article, the writers go on to describe the SPUNGE framework they have developed for machine unlearning (SPUNGE being short for Split, Unlearn, Merge).

Tuesday, 1 July 2025

Concept of LoRA or Low-Rank Adaptation in LLMs

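LoRA is an approach to fine-tuning LLMs more cheaply. Rather than updating a full weight matrix, it keeps the pretrained weights frozen and trains a small update made from the product of two much smaller matrices. The "size" being reduced is the "rank" of that update, i.e. the number of linearly independent rows or columns.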
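As a rough sketch of why this saves work, assume the common formulation in which a frozen d x k weight matrix receives an update equal to B times A, with B of shape d x r and A of shape r x k. The layer size and rank below are hypothetical, and the helper names are illustrative.

```python
def full_params(d, k):
    """Trainable parameters when updating a d x k weight matrix directly."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for a rank-r update B (d x r) times A (r x k)."""
    return r * (d + k)


def matmul(b, a):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*a)] for row in b]


# Hypothetical layer size and rank, chosen only for illustration.
d, k, r = 4096, 4096, 8
print(full_params(d, k), lora_params(d, k, r))

# Tiny rank-1 example: a 3 x 1 matrix times a 1 x 3 matrix gives a 3 x 3 update.
B = [[1.0], [2.0], [3.0]]
A = [[0.5, 0.0, -1.0]]
delta_w = matmul(B, A)
print(delta_w)
```

In the tiny example every row of the 3 x 3 update is a multiple of the single row of A, which is what rank 1 means: six trained numbers stand in for nine.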

What is RIO in RDF? Is it relevant post OpenRDF?

RIO is the "RDF Input/Output" toolkit.  

The RIO appellation persists even in the post OpenRDF world.

RIO was part of OpenRDF and is now part of RDF4J. Docs are here.

RIO provides a set of parsers and writers for RDF formats, and these can be used independently of the rest of the library. An important interface in the toolkit is the RDFHandler, which receives parsed RDF triples.  This can be used as a pure listener, or as a reporting tool (being passed to a function that needs to report results back).

It's good to understand RIO both for comprehending legacy messages from OpenRDF and also more recent exceptions from RDF4J.

OpenRDF Usage in Blazegraph

Blazegraph (no longer maintained and unofficially superseded by Amazon Neptune) uses OpenRDF under the hood (rather than the renamed version RDF4J).

This can be seen by forcing an exception in the Blazegraph workbench. Type some bad syntax into the Update window ("hello world" will do nicely) and the resulting error names the OpenRDF class:

org.openrdf.rio.RDFParseException

Why does Copilot want to rewrite my RDF when it doesn't know how?

Copilot offers to change the "tone" of an RDF file created by Windows Joe, an operation that will consume AI credits. This tone could be Formal, Casual, Inspirational or Humor (Sic).  Um, no thanks Copilot.

What is Turtle in the World of Semantic Web?

Turtle is a type of notation and has been a W3C Recommendation since February 2014.

Its full name is Terse RDF Triple Language. 

It is one of a family of ways to write RDF, also known as RDF serialization formats. Turtle's plus point is that it is user-friendly (and less verbose than, say, XML).
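To make the "less verbose" point concrete, the sketch below writes one triple in Turtle using prefixed names and then expands it to the longer N-Triples form with full IRIs. The example.org namespace and the helper names are invented for illustration, and this is not a Turtle parser.

```python
# Hypothetical prefix table; the example.org namespace is invented.
PREFIXES = {
    "ex": "http://example.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(term, prefixes=PREFIXES):
    """Expand a prefixed name such as foaf:knows into a full <IRI>."""
    prefix, _, local = term.partition(":")
    return f"<{prefixes[prefix]}{local}>"


def to_ntriples(subject, predicate, obj):
    """Write one triple of prefixed names as a single N-Triples line."""
    return f"{expand(subject)} {expand(predicate)} {expand(obj)} ."


print("Turtle:    ex:joe foaf:knows ex:jane .")
print("N-Triples:", to_ntriples("ex:joe", "foaf:knows", "ex:jane"))
```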

Turtle's official documentation states that it:

 "allows an RDF graph to be completely written in a compact and natural text form, with abbreviations for common usage patterns and datatypes. Turtle provides levels of compatibility with the N-Triples [N-TRIPLES] format as well as the triple pattern syntax of the SPARQL W3C Recommendation".