Kirby Arinder
2025/08/29
The discourse frequently has little to do with the thing.
So insofar as it is possible, I want to talk about the thing itself. This is hard in only one hour.
(But of course, just like intelligence isn't one thing, AI isn't one thing. We're talking about the current incarnation.)
I. The foundational math of prediction
II. Machine learning
III. Transformers and scaling
IV. The present moment
V. Extrinsics and implications
Let's start by cutting through the mystery:
At its core, contemporary AI runs on really complicated mathematical models that are supposed to predict some stuff.
(What stuff? Words, mostly. It's really good autocomplete. We'll get to that.)
If you understand that, you understand AI!
Let's start with the regression model.
It's a method of assessing the central tendency of multidimensional data and making certain inferences (or predictions) based on that assessment.
It's central tendency is in blue.
Now, a prediction that's one- (or more!) dimensional may not be very useful.
By holding a number of predictor dimensions constant, we can reduce the dimensionality of our expected value!
This creates a conditional expectation.
This practice is very useful:
Well, let's just hold the x axis constant!
Think of predictive text on your phone.
With nothing to go on, the possibilities for the next word are essentially limitless.
With knowledge of the previous word, they begin to be bounded.
With knowledge of two previous words, they begin to be more bounded still…
So that's the math. Now let's bring in the machines!
Ordinarily, much of the art in modeling is in specifying the mathematical form of the model itself.
Nominally, the form of the model ought to reflect the causal structure of reality.
The model models what's going on in the world, and that's how it produces successful predictions!
But that is extremely hard! What if we had alternatives?
Enter machine learning.
This is the first thing today we might recognize as AI.
Instead of taking a set of data and laboriously formulating an equation to relate a set of independent variables to a dependent variable…
We take the same set of data and let the machine decide how the IVs relate to the DV!
SepalL SepalW PetalL PetalW Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
This is it, by the way. You see 100% of the code; I wanted there to be no mystery here.
library(caret)
library(randomForest)
part <- createDataPartition(y=iris$Species, p=0.7, list=FALSE)
trainingdata <- iris[part,]
testingdata <- iris[-part,]
modFit <- train(Species ~ .,data=trainingdata, method="rf")
Random Forest
105 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 105, 105, 105, 105, 105, 105, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9509804 0.9250961
3 0.9530017 0.9281329
4 0.9541128 0.9298062
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 4.
(And this is live, so I don't know before you do!)
mypred setosa versicolor virginica
setosa 15 0 0
versicolor 0 14 2
virginica 0 1 13
All very well and good, you might say.
But there's a world of difference between that and modern LLMs.
Well, yes and no.
There are many methods of machine learning, and they're good for different purposes.
The one I used above is called a random forest. It's good for classification. It's a supervised method.
But there also unsupervised methods. They are good for other purposes… like predicting text.
For the modern world, there is one method that dominates all others:
In 2017, eight scientists at Google published a paper.
Superficially, it's just a new unguided machine learning method.
But it's faster to train and allows for larger models than previous methods.
The transformer – the name of this new method – seemingly overnight became almost synonymous with AI, rightly or wrongly, in the public imagination, and transformers got the lion's share of investment money.
All the major products you've heard about are transformers!
The “T” in “GPT” stands for “transformer.” You get the idea.
What do we mean, larger models?
Size is important in large language models in two respects:
One, in terms of the number of parameters in the model itself;
And two, in terms of the amount of data upon which the model is trained.
In an old-school predictive modeling, if you make it too fine-grained, you run the risk of overfitting.
In old-school machine-learning, if you train on too much data, you run the risk of just memorizing inputs verbatim.
But then, in 2020, researchers at OpenAI publish another paper.
“At present we do not have a solid theoretical understanding for any of our proposed scaling laws. The scaling relations with model size and compute are especially mysterious… Without a theory or a systematic understanding of the corrections to our scaling laws, it’s difficult to determine in what circumstances they can be trusted.” -p. 22
Now it looks like the sky is the limit!
With the transformer architecture, the notion that arbitrary scaling helps rather than hurting, and a ton of VC money, we entered what looked like a golden age of AI (at least, from some perspectives).
Ever-larger transformers with access to ever-larger percentages of the internet began to achieve ever-more-impressive capabilities, to the point that they seemed not just to be able to mimic linguistic capabilities, but to be able to solve essentially arbitrary problems.
And that brings us to…
LLMs have ingested the entire internet.
The current generation of models has about 1% of the number of parameters that a brain has neurons (though, asterisk, that's a super imprecise analogy) and does fancy agentic stuff and chain-of-thought reasoning to correct for its shortcomings.
So what do we get for all that?
Outside the models themselves, things look funny.
Capital expenditures on AI contributed more to GDP growth this summer than consumer spending – which is itself a majority of that growth!
But no AI company is profitable, and 95% of actual business projects using AI fail. (There are theories about the purely economic aspects of all this, but I'm not competent to evaluate them, so I'll just let this rest here. Not a money guy.)
AI is impressive in the moment but under extended use its cracks begin to show.
Well, here are some other important things we should know.
1. Scaling isn't a panacea after all.
2. AIs don't do formal reasoning.
3. AIs break rapidly on complex problems.
4. Techniques designed to compensate for these problems don't.
If we go back to our basics, these facts shouldn't be too surprising!
So let's talk practicality.
For our work, I think there are five major concerns we face:
AI hallucinates. We've all seen the stories.
It makes up legal citations; it screws up accounting ledgers. If you give it powers to actually do stuff, it can inflict some real damage.
Why?
Remember that AIs don't have internal world models; they don't know about law, accounting, physics, math, or logic. They are contextual text predictors.
Complex problems, novel (out-of-training-distribution) problems, and even problems phrased in a novel way can foil their powers of prediction, and thus their powers of “reasoning.”
Additionally, because their internal workings are opaque (to us, and to some degree to everyone) what constitutes novelty is an open question.
Let's think more broadly. What is a hallucination?
In humans, it's (often considered to be) a belief-state produced by a causally inappropriate process.
But every assertion of an AI is like this!
Accuracy is one thing.
But for our work, we need to not only be able to produce facts; we need to show how we came to know them!
Not only does AI's work have to be checked, but its citations and even its reasoning and code have to be checked.
Remember the chain of thought paper: CoT is not the actual under-the-hood process, and may or may not be true or even coherent!
(Incidentally, that means that if you can't check the code, you shouldn't ask it to code. This is not that kind of workplace.)
Remember, access to data made contemporary AI. And faith in scaling still drives the companies. So you should assume that any input to an AI is used in training that AI, and that training data can emerge at any point as output given the right prompt.
(OpenAI is explicit about this. I'm sure others are as well if you look.)
This is why confidential data can NEVER be used as input to AI you don't control 100%. Note that the linked article about OpenAI points out that many major banks and similar security-minded organizations forbid business use of outside AI for just this reason.
This isn't as big an issue right now, but things are moving fast, and so it seems worth mentioning.
A huge problem with AIs is that data = input = command.
That is, any input to a LLM, in any format, can also be used to issue commands to that LLM.
This is called prompt injection. With agents… this is quite an issue.
Remember that no AI company makes a profit; OpenAI, the most profitable, loses money on every query.
If it is to survive, this is going to have to change.
Which, in turn, means that AI services, if they survive in something like their current form, will have to become more expensive.
“Generate the longest grammatically correct English sentence you can using the smallest number of distinct words.”
[1] "buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo buffalo"
It could go on forever, but if it went to 1000 iterations of 'buffalo', it would be like running your computer at top capacity for an hour!
Even what you can see on the previous screen is six minutes of your computer, pedal to the metal.
Somebody is paying for it….