Windsurf is Codeium, rebranded, and not badly. It bills itself as the "most powerful AI code editor". Investors include Founders Fund, General Catalyst, Greenoaks and Kleiner Perkins, but that need not trouble the average Joe. The main thing is that this is a well-funded platform, so developers can invest time in it with some confidence.
Programming is Not Rocket Science, Don't let AI Write Your Code, Fight Back. And if you must use AI, find provenance, and Attribute. Long Live GNU/Linux. Full praise to SSA-Based Compilation.
Thursday, 10 July 2025
Wednesday, 9 July 2025
Apache Nutch - the Tool that Drives Common Crawl
LLM Training Data
LLMs are trained on very large data sets. One such data set is Common Crawl, which consists of over 250 billion web pages, with 3-5 billion pages added each month. This is petabytes worth of data (1 petabyte = 10^15 bytes of digital information). The data is stored on Amazon's S3 service, allowing direct download or in-place access for Map-Reduce processing on EC2.
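A back-of-envelope calculation shows how those page counts translate into petabytes. The average compressed page size used below is an assumption for illustration, not a Common Crawl figure.

```python
# Rough estimate of Common Crawl's scale.
# Page count is from the post; the average compressed page size
# (~100 KB) is an assumed value purely for illustration.

PAGES = 250e9                   # ~250 billion pages (from the post)
AVG_PAGE_BYTES = 100e3          # assumed average compressed size per page
PETABYTE = 1e15                 # 1 PB = 10^15 bytes

total_bytes = PAGES * AVG_PAGE_BYTES
print(f"Estimated corpus size: {total_bytes / PETABYTE:.0f} PB")  # 25 PB
```

Even with a conservative assumed page size, the corpus lands comfortably in the petabyte range, which is why S3-hosted access matters: downloading it to a laptop is not realistic.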
Tuesday, 8 July 2025
What is InstructLab?
Friday, 4 July 2025
Latest .NET Version as of July 2025
The latest supported .NET version as of July 2025 is .NET 9 (STS)
The latest stable .NET version as of July 2025 is 9.0.6, released on June 10, 2025.
.NET 9 Patch version 9.0.6; Release: STS; End of support: May 2026 (ORD: November 2024)
.NET 8 Patch version 8.0.17; Release: LTS; End of support: November 2026 (ORD: November 2023)
ORD means Original Release Date.
Release Schedule
Major .NET versions are released annually in November. Each release is designated STS (18 months of support) or LTS (36 months of support) from the outset.
Details of Microsoft's Lifecycle Policy are Below
2 Become 1 - The Story of .NET Framework and .NET Core
What is TriG in Semantic Computing?
TriG is an extension of Turtle for representing multiple RDF graphs (an RDF dataset) in a compact textual format. It has been a W3C Recommendation since February 2014. TriG stands for "triples in graphs".
Any Turtle statement is also a valid statement in TriG.
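A short illustrative snippet shows both points at once: the default graph uses plain Turtle syntax, and a named graph adds TriG's braces. The `example.org` vocabulary is made up for illustration.

```
@prefix ex: <http://example.org/> .

# Default graph: this line is plain Turtle, and therefore also valid TriG.
ex:alice ex:knows ex:bob .

# Named graph: the TriG extension wraps triples in a graph name and braces.
ex:graph1 {
    ex:bob ex:age 42 .
}
```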
Thursday, 3 July 2025
Unsloth
Unsloth aims to speed up the expensive process of LLM training. It does this by rewriting components of the training pipeline, including the gradient calculation. Their motto, "24 hours not 30 days", is a reference to LLM training time. It also claims to rewrite GPU kernels (functions designed to be executed on GPUs) for efficiency.
The Hugging Face Transformers Library and MRM
The Transformers library puts trained, openly released AI models in the hands of Python programmers.
It is maintained by Hugging Face, a hub for "SOTA" AI models.
Hugging Face also maintains Markdown documents called Model Cards in each relevant model repo to give you insight into the model.
The concept of Model Cards is explained in this paper. It argues that, for high-impact applications, the Model Card carries critical usage information for deployers to consider. Model Cards can thus be seen as a tool supporting Model Risk Management (MRM).
Transformers is available on PyPI and can be installed with pip.
So - what is Preference Alignment in LLM Training?
Preference alignment in LLM training aims to improve an LLM's behaviour by steering it to follow rules and preferences. This could relate to avoiding offensive language or some other restriction.
Some approaches to preference alignment are detailed in this blog post from Miguel Mendez. There are a number of known techniques for this - these include:
PPO: Proximal Policy Optimization
DPO: Direct Preference Optimization
ORPO: Odds Ratio Preference Optimization (which works without a reference model)
For preference alignment we usually need data labelled as good or bad. Human annotation of such data is often expensive, and in some cases a clear "winner" between two contrasting data points is not decidable. With KTO, two answers can both be regarded as good. This is arguably closer to reality.
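The practical difference shows up in the shape of the training data: pairwise methods like DPO need a chosen/rejected pair per prompt, while KTO needs only single completions with independent good/bad labels. The field names below are illustrative assumptions (though similar conventions appear in common preference-tuning libraries).

```python
# DPO-style example: each record pairs one prompt with a chosen
# and a rejected answer - a "winner" must be decided.
dpo_example = {
    "prompt": "Explain RDF in one sentence.",
    "chosen": "RDF models data as subject-predicate-object triples.",
    "rejected": "RDF is a kind of database, I think.",
}

# KTO-style examples: each record is a single completion with an
# independent thumbs-up/down label, so two answers to the same
# prompt can both be marked good.
kto_examples = [
    {"prompt": "Explain RDF in one sentence.",
     "completion": "RDF models data as subject-predicate-object triples.",
     "label": True},
    {"prompt": "Explain RDF in one sentence.",
     "completion": "RDF is the W3C's graph data model built from triples.",
     "label": True},   # both answers regarded as good - fine under KTO
]
```

Because the KTO labels are independent, annotation can be as simple as a thumbs-up/thumbs-down, with no forced ranking between answers.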
KTO stands for Kahneman-Tversky Optimization and is detailed more in a blog post from contextual.ai.
The research paper on KTO should be read to understand how to construct the relevant KTO loss function.
SPARQL is the query language for RDF - Know It
SPARQL is THE query language for RDF. Here are some learning resources.
Both universities and commercial firms are involved in the RDF Star WG.
SPARQL uses graph pattern matching to query an RDF graph and also allows aggregate operations (such as COUNT) to be performed over the matching solutions.
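A minimal query shows both ideas together: a triple pattern that matches nodes in the graph, and an aggregate over the matches. The FOAF vocabulary is used purely for illustration.

```
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Pattern match: bind ?person and ?friend wherever the foaf:knows
# predicate links them. Aggregate: count each person's contacts.
SELECT ?person (COUNT(?friend) AS ?contacts)
WHERE {
    ?person foaf:knows ?friend .
}
GROUP BY ?person
```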
The Power is INDEED the LLAMA
Llama 4 ("Leading Intelligence") is out.
There is something called the Llama 4 Community License Agreement. It states you can use Llama models in derived products, but you must tell the world you are using Llama and which version (for example, by prominently displaying "Built with Llama").
Machine Unlearning
Tuesday, 1 July 2025
Concept of LoRA or Low Rank Adaptation in LLMs
LoRA is an approach to optimizing LLM fine-tuning: rather than updating a full matrix of trainable parameters, it learns a low-rank update, where "rank" is the number of linearly independent rows or columns of a matrix. A low-rank update can be stored as the product of two much smaller matrices, drastically reducing the number of trainable parameters.
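The parameter saving is easy to quantify for a single weight matrix: a full d x d update has d*d trainable parameters, while a rank-r LoRA update stores only a d x r matrix and an r x d matrix. The sizes below are illustrative assumptions, not values from any particular model.

```python
# Parameter savings from a LoRA update on one weight matrix.
# Instead of training a full d x d delta, LoRA trains B (d x r) and
# A (r x d); the update B @ A then has rank at most r.
d, r = 4096, 8                      # hidden size and LoRA rank (assumed values)

full_delta = d * d                  # trainable params for a full update
lora_delta = d * r + r * d          # trainable params for B and A together

print(f"full:  {full_delta:,}")     # 16,777,216
print(f"lora:  {lora_delta:,}")     # 65,536
print(f"ratio: {full_delta // lora_delta}x fewer")  # 256x
```

With these assumed sizes the trainable parameter count drops by a factor of 256 for this one matrix, which is why LoRA fine-tuning fits on hardware that full fine-tuning does not.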
What is RIO in RDF? Is it relevant post OpenRDF?
RIO is the "RDF Input/Output" toolkit.
The RIO appellation persists even in the post OpenRDF world.
RIO was part of OpenRDF and is now part of RDF4J. Docs are here.
These parsers and writers can be used independently of the rest of the RIO library. An important interface in the toolkit is RDFHandler, which receives parsed RDF statements from a parser. It can be used as a pure listener, or as a collector (being passed to a function that needs to report results back).
It's good to understand RIO both for comprehending legacy error messages from OpenRDF and also more recent exceptions from RDF4J.
OpenRDF Usage in Blazegraph
Blazegraph (no longer maintained and unofficially superseded by Amazon Neptune) uses OpenRDF under the hood (rather than the renamed version RDF4J).
This can be confirmed by forcing an exception in the Blazegraph workbench: type some bad syntax into the Update window ("hello world" will do nicely) and the resulting exception comes from the old namespace:
org.openrdf.rio.RDFParseException