K U E I Z E N

A New Kind of Program

Language Processing Units

A Language Model can be viewed as a "Language Processing Unit" capable of interpreting "Natural Language" as code and producing relevant output from it.

In a sense, this is a dynamic program that can mimic a number of computing processes. Because of that, it cannot escape the limitations inherent in Computer Science; most importantly, the limitation introduced by Kolmogorov Complexity.

Kolmogorov Complexity

Introduced by Andrey Kolmogorov, this concept refers to the length of the shortest computer program that can generate a particular output or dataset. It is a measure of the complexity of a piece of data, based on the length of its simplest possible description.

In simpler terms, imagine you have a sequence or a piece of data. The Kolmogorov Complexity is like finding the shortest possible recipe or set of instructions that can recreate this exact sequence. If the sequence is simple and repetitive, the recipe will be short. But if the sequence is random or highly intricate, the recipe will be longer. This complexity is an important aspect in fields like data compression (where the goal is to represent data in the most compact form) and artificial intelligence.
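
As a rough illustration of the "recipe" idea: Kolmogorov Complexity itself is uncomputable, but compressed size gives a practical upper bound on it. A minimal Python sketch, using zlib as a stand-in for "the shortest description we could find":

import random
import string
import zlib

repetitive = "ab" * 500                                                # simple, repetitive data
random_str = "".join(random.choices(string.ascii_lowercase, k=1000))   # intricate data

print(len(zlib.compress(repetitive.encode())))   # short: the pattern compresses well
print(len(zlib.compress(random_str.encode())))   # long: no shorter description is found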

In the context of language models, Kolmogorov Complexity becomes a fascinating concept. It challenges us to think about how efficiently a language model can be designed to generate specific outputs. It also hints at a deeper question: what's the simplest way to capture the essence of a language or conversation pattern? However, due to the abstract nature of this measure, it's more of a theoretical tool than a practical one in the day-to-day development of language models. It invites us to consider the balance between simplicity and richness in the models we create.

This enigmatic measure weaves a tapestry that intertwines uncomputability with the subtleties of relative language dependence, painting a picture that is both theoretically profound and practically elusive.

It finds its rhythm in the realms of randomness, data compression, and artificial intelligence, but its true beauty lies in its declaration: the simplest descriptions often hold the most profound truths, revealing the elegant complexity of the universe's underlying code.

Breadth and Depth

The "elegant complexity" of this dynamic program, the Language Model, can be reduced to two measures: Breadth and Depth.

Breadth

We will define Breadth as the number of topics for which a specific program exists. This is akin to the concept of a "function" in traditional Computer Science. If a function exists that can sum two numbers, that counts as one. The more functions we can mimic, the bigger the model's breadth.

Depth

In traditional register-based Processing Units, the execution of a sum can be as simple as one instruction. If a given function requires more steps than another, we can say that it has more "depth". It is not the same to compute 2 + 2 as it is to outline the Fibonacci sequence (1, 1, 2, 3, 5, 8, ..., n). The more computationally complex a function, the deeper we will say it is.
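
A minimal sketch of that difference in steps, counting operations executed:

def add(a: int, b: int) -> int:
    return a + b                       # one step

def fibonacci(n: int) -> list[int]:
    seq = [1, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])  # one extra step per element
    return seq[:n]

print(add(2, 2))       # shallow function
print(fibonacci(10))   # deeper function: more instructions for the same "call"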

These two concepts, while simplistic in definition, highlight relevant characteristics of contemporary Language Models. Namely, that "narrow-minded" 7 billion parameter models can outperform models an order of magnitude larger on very specific tasks.

This has been shown experimentally to be the case, as we can see from the data below:

Predibase's LoRA Land

LoRA Land is a collection of 25 fine-tuned Mistral-7b large language models (LLMs) that outperform standard models like GPT-4 on various tasks. These models are developed using Parameter-Efficient Fine-Tuning (PEFT) and Quantized Low Rank Adaptation (QLoRA).

PEFT is a technique that allows large pre-trained models to be adapted to specific tasks by only adjusting a small, impactful subset of parameters. This method conserves computational resources and speeds up adaptation, as it avoids the need to update the entire model, which can be especially beneficial for massive models.

A LoRA is a similar technique, but its main difference is that it does not change the original model. It rather "adapts" it by applying a patch on top, which is not only non-destructive but also very efficient. It's like changing the color of an image in Photoshop with a new layer.
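
To make the "patch" metaphor concrete, here is a minimal, hypothetical sketch of the LoRA idea in PyTorch: the base weight stays frozen and a low-rank update is added on top. The dimensions and hyperparameters are arbitrary and not those of any LoRA Land model.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights are untouched
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # base output plus a low-rank "patch", like a non-destructive layer in Photoshop
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096))
y = layer(torch.randn(1, 4096))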

Pareto Frontiers in Language Models

Inference time in Language Models is directly correlated to size. The more parameters a model has, the slower its inference speed.

To understand this relationship better, we used symbolic regression—a method that finds the best-fitting mathematical formula for a dataset without being restricted to predefined equations.

This approach helped us analyze our consumer hardware performance data and create a model that accurately predicts the rate at which different language models process data ("tokens per second") with a 9% margin of error and Mean Absolute Error of 2.21.

In the regressed equation below, Q is the number of quantization bits and size is the model size in MiB. MoE models are excluded from the dataset.

Mothership TKS Model Symbolic Regression by the Mothership AI Model

This equation looks a bit convoluted, but it just represents a simple surface where lower quantization bit-widths and smaller model sizes yield faster inference, as one would reasonably expect.

Mothership TKS Model 3D Plot
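
For readers curious how such a fit could be reproduced, here is a minimal sketch using the PySR symbolic regression library on made-up measurements. PySR and the toy data are my own assumptions for illustration; the actual Mothership dataset, operators, and resulting equation are not reproduced here.

import numpy as np
from pysr import PySRRegressor   # assumes PySR (and its Julia backend) is installed

rng = np.random.default_rng(0)
Q = rng.uniform(2, 16, 200)                            # quantization bits
size = rng.uniform(500, 50_000, 200)                   # model size in MiB
tks = 1e5 / (size * Q / 16) + rng.normal(0, 1, 200)    # toy relation: smaller, lower-bit models run faster

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp", "log"],
)
model.fit(np.column_stack([Q, size]), tks)
print(model)   # prints the discovered closed-form candidates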

The larger—and better trained—a model is, the more things it can do, and the better it can do them.

Training is of utmost importance. Just as you can have a "talented yet lazy" person, you can have a big model with less performance than would be expected for its size. Like Grok-1, which users put on par with Mixtral 8x7B, a model almost seven times smaller (314B vs 46.7B parameters).

Also, larger models converge faster than smaller ones (e.g. compared to TinyLlama).

Larger LLMs Converge Faster From Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models. (arxiv: 2205.10770)
Larger Models Train Faster From Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers (arxiv: 2002.11794)
RoBERTa Gradient Steps From Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers (arxiv: 2002.11794)

Going back to our original measures, it is no surprise that a model's size can increase both its depth and its breadth. However, we do not yet know what limit there is to the breadth of functions a model is capable of executing, but we do have some clues as to the depth it can reach.

To put it simply, a shallow definition of depth—pun intended—in Language Models is directly related to the model's context size. The more instructions we can fit in a prompt, the more "code" it is executing for that given task.

In-context learning (ICL) in language models refers to a model's ability to understand and respond appropriately within the specific context of a conversation or text. It's like having a conversation where the model remembers not just your last sentence, but the overall topic and nuances of the conversation so far.

Although ICL is not always straightforward, and is affected by different factors—which are beyond the scope of this writing—two factors are directly correlated with model size:

  1. Number of model parameters. Studies by Wei et al. (2022b) and Brown et al. (2020) found that the in-context learning ability of language models grows as the number of model parameters increases.
  2. Number of pre-training steps. Wei et al. (2022b) also suggested that language models suddenly acquire emergent in-context learning abilities once they reach a large enough scale in terms of the number of pre-training steps.

So increasing the model size, either by having more parameters or by training for more steps, seems to be an important factor that enables stronger in-context learning performance.

In-context Learning Factors From A Survey on In-context Learning (arxiv: 2301.00234)

Performance is all that matters

The concept of 'depth' in a model can also be gauged through its benchmark performance on specific tasks. A model with higher depth is expected to score better on these standardized benchmarks, indicating its superior ability to handle complex tasks or process information efficiently. Essentially, the higher the depth of the model, the better benchmark scores it will have.

The "problem" of memorization

However, there's a big caveat around this. Just like a mediocre student can improve their exam scores by illegally carrying a "cheat sheet", a savantic one could memorize the answers altogether, without mediating any understanding whatsoever, like Funes the Memorious of the Jorge Luis Borges story.

"I suspect, however, that he was not very capable of thinking. To think is to forget differences, it is to generalize, to abstract. In Funes' crowded world there was nothing but details, almost immediate."

JL Borges, Funes the Memorious

If the exam's answers are part of the initial training data of the model, and/or it is fine-tuned on them, it's akin to learning by rote memorization rather than true understanding. In Computer Science terms, a form of Dictionary storage that can be recalled on command to perform a task.

Data Contamination

This is 'data contamination': the model is simply recalling information it has previously seen, not genuinely processing or understanding new data. Detecting this contamination is crucial for assessing a model's true learning capabilities, as it helps differentiate between genuine knowledge and mere recall.
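
As a rough sketch of one simple contamination check, assuming access to the training corpus: measure n-gram overlap between a benchmark item and that corpus. Real detection methods are more sophisticated; the texts below are invented for illustration.

import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(item: str, corpus: str, n: int = 8) -> float:
    q = ngrams(item, n)
    return len(q & ngrams(corpus, n)) / max(len(q), 1)

item = "Paris is the capital and most populous city of France"
corpus = "... Paris is the capital and most populous city of France, with over two million residents ..."
print(overlap_ratio(item, corpus))   # a high ratio hints at memorization, not understanding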

Like we've seen before, there are fundamental limits to computation. And there are only so many things that these systems can learn by memorization. At some point, a large set of {question, answer} pairs becomes just another dataset to train on—and it will generalize.

Smaller Models match larger ones by focusing

Another indirect measure of this depth is provided by the performance metrics of smaller models against big ones on very specific tasks. If a 7B model outperforms a 70B one on a benchmark, we can say that whatever amount of depth the task requires, 7B parameters are able to reach it.

This can, again, be seen in the previous figure by LoRA Land.

LoRA Land

If you look closely, you'll see that the tasks on the right are much less complex/abstract than the ones on the left. Particularly the code generation task, where the fine-tuned model performs 645% below the baseline.

Without a more specific and thorough look at the LoRA Land papers and the datasets used for their training, it is hard to come up with a definitive statement on the matter.

However, it is not hard to take into account the concept of depth and say that, for Sentiment Detection and Linguistic Acceptability tasks, a model needs at least 7B parameters to perform them successfully at the level of a foundational model.

Strategies for measuring depth

Given the black box nature of these models, knowing the minimum number of parameters needed to encode a function while retaining general capabilities is something we will have to ascertain empirically.

There are several ways of doing this without going through the computationally expensive route of training a foundational model from scratch at different sizes. One of the easiest methods is quantization, which simplifies a model's calculations by using less precise data. This process makes the model progressively 'dumber', retaining most of the original structure but decreasing its performance.
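
As a minimal sketch of what "using less precise data" means in practice, here is a toy round-to-nearest quantizer in NumPy. It is not any specific GGUF/IQ scheme, just the general mechanism of snapping weights to a coarser grid.

import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int) -> np.ndarray:
    levels = 2 ** (bits - 1) - 1         # e.g. 7 positive levels for 4 bits
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale   # precision is lost at low bit-widths

w = np.random.randn(8).astype(np.float32)
print(w)
print(quantize_dequantize(w, bits=4))    # visibly rougher approximation
print(quantize_dequantize(w, bits=2))    # 'dumber' still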

The effects of quantization are evident in benchmark performance. As we simplify a model, it tends to perform worse, particularly in tasks requiring greater computational 'depth'. Tasks with high complexity are the first to show a decline in performance when a model is quantized, demonstrating how this simplification affects the model's capability to handle intricate functions.

Very High Quantization Reduces Depth

As an example, quantizing the nous-hermes-2-solar model to 25% of its original size will create a performance drop in the arc-challenge benchmark of ~18%. Meanwhile, its performance drop in the arc-easy benchmark will be only ~6%.

As part of this scaling down procedure, there will be tasks that—at some threshold—will inevitably fail. Once that happens we can say that we found one of the minimal "depths" at which the task can be completed.

What unit we use to measure said depth is yet to be defined. What we can do, however, is take a composite measure of both parameter count and floating point precision, regardless of context size.

If a 7B model fails at a given task when each weight is 2.06 bits (IQ2_XXS), but succeeds when each weight is 2.31 bits (IQ2_XS), we can multiply the parameter count by the bits per weight to get the "bits" of the system. In this case, we know that the lower bound of this depth is 14,420,000,000 bits. The actual number could be anywhere between that and 16,170,000,000 bits—the figure for 2.31 bpw.
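
The arithmetic behind those bounds, spelled out:

params = 7_000_000_000
lower = params * 2.06    # fails at IQ2_XXS  -> 14,420,000,000 bits
upper = params * 2.31    # works at IQ2_XS   -> 16,170,000,000 bits
print(f"{lower:,.0f} to {upper:,.0f} bits")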

The depth baseline for a Language Model is "General Ability"

Another thing worth mentioning is that the actual depth of this function is likely much lower than this bit estimation. A Language Model isn't explicitly designed to do simple math; it emergently acquires the ability from the training data. Like this skill, it acquires many others in the course of developing high "completion" capabilities, which are then fine-tuned into instruction-following templates.

Through this training process it builds something akin to a "world model". This costly initial training, plus the conversational fine-tuning, is what I holistically call "general ability".

The counter-example in humans is the autistic savant with enormous—yet narrow—cognitive capabilities who is barely able to function normally in society. Since some savants have developed their amazing abilities after head trauma, it raises the question of whether human brains can be repurposed—chemically or physically—to display savantic traits.

In Language Models, this also provides some insights into the minimum number of parameters required for "general ability". It is clear from testing that this threshold is reached after a couple billion parameters. Anecdotally speaking, 3B parameter models seem to lack in that department. The exercise is left to some enterprising independent researcher willing to find the exact number and to define more precisely what "general ability" feels like.

Circling back to our previous estimation, part of the 14.4B bit depth of the model surely implies that many of those bits are just there to provide "general ability". If we take 5B parameters as a reasonable ballpark for "general ability", then 10,300,000,000 bits are implied to be "spent" providing said ability, if we can apply this simplistic math to the more complex schema behind it.

For instance, while a person with above-average intelligence will naturally be able to perform mental calculations at a faster pace, he or she will underperform when compared to a savantic calculating genius. And both of these would be no match... for a simple pocket calculator.

We can extend this to the more complicated realm of chess, where mechanistic evaluation of boards has far surpassed the best of human minds at the game. This speaks to the incredible effectiveness—and low Kolmogorov Complexity—of specialized tools at their tasks.

By Kolmogorov Complexity:
Optimized Mechanistic Solution < Specialized ANN < General ANN

This means that another way of solving the problem is just having a model with sufficient intelligence to know when to deploy a tool to solve a given problem, and how to interpret its output.
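
A toy sketch of that tool-dispatch loop. Here `ask_model` is a hypothetical stand-in for any LLM call, and the "CALC:" protocol is invented purely for illustration.

def ask_model(prompt: str) -> str:
    # Pretend the model asks for a calculator first, then interprets its output.
    return "CALC: 24+3127" if "calculator returned" not in prompt else "The sum is 3151."

def run_with_tools(question: str) -> str:
    reply = ask_model(question)
    if reply.startswith("CALC:"):
        expression = reply.removeprefix("CALC:").strip()
        result = eval(expression, {"__builtins__": {}})   # toy calculator; never use eval in production
        reply = ask_model(f"{question}\nThe calculator returned {result}. Answer:")
    return reply

print(run_with_tools("How much is 24 + 3127?"))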

To sum this up:

capacity * (generalAbility + specificSkills) = someConstant

What's more, "general ability" can be viewed as a set of skills over a consistent world model; some of which overlap with the specific skills.

capacity * (skill_1 + skill_2 + ... + skill_n) = someConstant

Overlapping Domains

From this overview, it stands that this neural way of computing will have overlapping domains. To illustrate this, we'll map three arbitrary domains (Math, Management and General Abilities) with their common areas.

  1. Analysis: The intersection of mathematical skills and management abilities, focusing on analytical, data-driven decision-making in management.

  2. Practical Mathematics: Mathematical skills applied to everyday situations and problems, highlighting the practical side of math in daily life.

  3. Adaptability: It covers the intersection of management skills with general life abilities, emphasizing adaptability, and interpersonal skills in a management context.

  4. Competence: A combination of mathematical knowledge, management acumen, and general life skills, representing a well-rounded proficiency in handling complex, real-world scenarios that require a mix of these abilities.

Venn Diagram

From this perspective, we know that some of the skills displayed by these models will have overlapping "bits" of their parameters across many domains. A simpler example would be asking a model for a person's BMI, as it will need to know the formula and how to perform basic arithmetic to obtain the result.

We can see from this that the previous equation was rather simplistic.

It rather looks more like:

capacity * (domain_a * (skill_1 + skill_2) + domain_b * (skills) + ...) = someConstant

Where these "domains" could be seen as fat cohesive nodes on a graph. Without a priori probing of a Language Model's Latent Space, I can assume that this type of architecture should resemble something like this:

Worm Connectome The neuron partner diagram for PVX-class neurons in C. elegans.

Or this:

C. elegans Connectome The whole-animal connectome of a male C. elegans. doi.org/10.1038/s41586-019-1352-7

While it is a large stretch to say that our text-token, synthetically trained, transformer-based (not spiking) artificial neural networks resemble biological connectomes, some principles will hold; most importantly, that similar tokens will invariably be nested proximally: in Language Models because of the semantic multi-dimensional positions of the embeddings and the attention mechanism, and in Caenorhabditis elegans because of physical restrictions and survival demands.

Physical neurons have synapses, axons and dendrites to do the connecting; our artificial equivalents have weights and biases to do similar things. It then follows that "capacity" is literally the "bit building blocks" of these artificial neural graphs.

We can then change the previous simplifying equation to:

(domain_a * (skill_1 + skill_2) + domain_b * (skills) + ...) = capacity

Where a domain is not necessarily a static thing, but rather a combination of other domains as well.

domain_a = subdomain_1 + subdomain_2 + ... + subdomain_n
domain_b = subdomain_2 + subdomain_3 + ... + subdomain_n

Since we're limited to a specific number of parameters, we might want to visualize this "frontier", where the line is the sum of all parameters.

The Pareto Frontier

The Pareto Frontier acts as a visual guide: a line on a graph showing the optimal balance in a model's distribution of resources.

Pareto Frontier

This will not faithfully represent the actual "pareto frontier" for the model, but it should be an elegant way of visualizing the concept.

We can attempt a performance based metric for this, where "Depth" is represented by "Difficulty/Complexity" and "Breadth" by "Tasks". It turns out, we already kind of do this with the Benchmark Leaderboards.

Benchmark Scores

Each one of those metrics represents the sum of the evaluations for that set of questions.

The problems of Benchmarks

Like with human IQ tests, this has multiple problems:

Problem #1: We do not know what kind of "mental profile" the model has

For GSM8K, a single number does not tell us whether the model is better at addition than at subtraction or division. It just tells us: "this model should be kind of good at middle school math".

With MMLU—which is probably the best of the bunch—the problem is even worse. What "data" does the score reveal? Is this better at STEM? Humanities? Social Sciences? Law? Medicine? Accounting?

Solution: Make a very explicit set of tasks and show each of the results.

This is a much better representation of the model's intelligence than a single score. And, if the Performance score is correlated with depth (as in high-complexity questions), this "Cognitive Profile" should be a much clearer depiction of the "Pareto Frontier" of the model.

Model Cognitive Profile Bonus points for identification of models by overlapping test scores.

It has the added benefit that it could be implemented by showing every single score on the current benchmarks.
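
A sketch of how such a profile could be rendered from per-task scores as a radar chart. The task names and scores below are made up for illustration.

import matplotlib.pyplot as plt
import numpy as np

tasks = ["Addition", "Subtraction", "Division", "STEM", "Humanities", "Law"]
scores = [0.92, 0.88, 0.61, 0.70, 0.83, 0.55]   # hypothetical per-task results

angles = np.linspace(0, 2 * np.pi, len(tasks), endpoint=False).tolist()
ax = plt.subplot(polar=True)
ax.plot(angles + angles[:1], scores + scores[:1])          # close the polygon
ax.fill(angles + angles[:1], scores + scores[:1], alpha=0.2)
ax.set_xticks(angles)
ax.set_xticklabels(tasks)
plt.show()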

Problem #2: There are infinite possible tasks

How do we know where to start? How do we figure out when to stop? It is not practical to have an extremely large benchmark.

If we look at math, the combinatorial problem explodes instantly. We can't have a question for each of the items in a countably infinite set of pairs of numbers and their operations.

Solution: Dimensionality reduction by range

We don't need every single pair of numbers to know that our model can do two digit additions. We do need, however, to add a reasonably difficult task so there are no adjacent cognitive tricks that the model can pull.

For instance, for a Language Model it is very much not the same to answer "How much is 15 + 1000?" as it is to answer "How much is 24 + 3127?" Just like humans, LMs will resort to the shortcut of writing the 15 on top of the 1000.

Problem #2.1: Many tasks are highly correlated

It is very likely that performance on two-digit additions will bleed into performance on three-digit additions, and so on.

In regular human intelligence testing, this is closely related to the concept of g. With Language Models however, no such thing applies—that we know of.

Solution: Task clustering and trimming

If two given tasks have a correlation so high that the score on one is a reliable predictor of the other, then, by choosing your p-values wisely, you won't need to run both tasks to evaluate the model.

Anecdotally speaking, in my systematic tests of arithmetic performance, one such question is: "How much is 1652 + 21256165?"

I can instantly tell whether a model is good at basic math by asking that proxy question.
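
A sketch of what that clustering and trimming could look like, using a made-up score matrix: tasks whose scores across several models are nearly collinear with an already-kept task get dropped.

import numpy as np

# rows = models, columns = tasks (2-digit addition, 3-digit addition, code generation)
scores = np.array([
    [0.95, 0.93, 0.40],
    [0.80, 0.78, 0.70],
    [0.60, 0.63, 0.50],
    [0.40, 0.35, 0.85],
])
corr = np.corrcoef(scores, rowvar=False)

keep = []
for task in range(corr.shape[0]):
    if all(abs(corr[task, k]) < 0.95 for k in keep):   # redundant with a kept task? drop it
        keep.append(task)
print(keep)   # the two addition tasks collapse into one; code generation stays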

Problem #3: Tasks can be memorized

"When a measure becomes a target, it ceases to be a good measure."

Goodhart's law

And in Computer Science optimization, processes can be "memoized": hitting what is basically a HashMap for the function call in question.

For Language Models, this means training the model on the benchmark's questions. We can detect this by measuring Data Contamination, which makes it the number one concern of a benchmark enjoyer.

Solution: Data shuffling

If a free-fall high school physics problem always has the same bodies in motion, we should change the specific numbers of these bodies to discourage memorization.
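
A minimal sketch of such number shuffling for a templated free-fall problem, so the exact figures cannot simply be memorized from training data:

import random

def free_fall_item() -> tuple[str, float]:
    height = random.randint(5, 500)              # drop height in meters
    g = 9.81
    t = (2 * height / g) ** 0.5                  # time until impact
    question = f"A ball is dropped from a {height} m tower. How long until it lands?"
    return question, round(t, 2)

question, answer = free_fall_item()
print(question, "->", answer, "s")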

Problem #3.1: Answer patterns can be predicted

If a given test makes it really obvious that the correct answer is always A, then a Language Model is likely to catch on to that and take advantage of it.

Solution: Answer shuffling

Self-explanatory. By changing the position of the answers randomly, we avoid this problem.
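
A minimal sketch of that shuffle for a single multiple-choice item:

import random

def shuffle_choices(question: str, choices: list[str], correct: str) -> tuple[str, str]:
    shuffled = random.sample(choices, k=len(choices))
    letters = "ABCD"
    body = "\n".join(f"{letters[i]}) {c}" for i, c in enumerate(shuffled))
    return f"{question}\n{body}", letters[shuffled.index(correct)]

prompt, key = shuffle_choices("How much is 24 + 3127?", ["3151", "3141", "3251", "2151"], "3151")
print(prompt)
print("Correct answer:", key)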

Problem #3.2: Task patterns can be memorized

The Cliff's Notes version of Data Contamination. Physics tasks/problems are amenable to this: free-fall, fixed-body, tension problems and many more can fall into clearly defined patterns.

In itself, this is not bad. But there's a difference between really understanding a problem and memorizing where a formula goes.

Solution 1: Mixing

If two trains are going to collide head-on, transform the trains into two people eating a mile-long spaghetti towards each other on the moon. No amount of training data will be able to capture the range of strange possibilities.

Solution 2: Indirection

Don't let the model have the problem right away; rather, let it solve a precursor problem to figure out what type of problem it actually has.

If it needs to solve an Electrical Physics quiz, don't give it the electrical charges. Make those charges show up from a previous deduction. Have two water volumes spring from different sources, and electrify those. Let the model calculate the charges from there.

Knowing how to divide a problem into different sub-problems is a hallmark of effective problem solving.

Problem #4: Simple Questions are very low depth

Self-explanatory. Many questions have very simple answers, or rather imply a Dictionary lookup.

Solution: Use multi-stage, continuous effort problems.

Run a simulation. Have the model survive in the wilderness, wait tables at a restaurant, design a bridge.

Conclusion

In conclusion, language models represent a fascinating intersection of computational power, language understanding, and artificial intelligence. By conceptualizing them as "language processing units" capable of interpreting natural language as code, we gain insight into their versatility and potential.

The concepts of breadth and depth provide a minimal—yet expressive—framework for evaluating their capabilities, with breadth referring to the range of topics they can handle and depth representing the computational complexity they can navigate.

While larger models generally exhibit greater breadth and depth, the relationship is not always linear, as evidenced by the remarkable performance of some smaller, fine-tuned models on specific tasks. By solving some of the problems related to benchmarking, we can derive an accurate "cognitive profile" that can also serve as a "Pareto Frontier" of sorts for measuring the intelligence of a model. This has plenty of implications for benchmarking in general.

Ultimately, language models are dynamic programs that acquire general abilities and specific skills through costly initial training and further fine-tuning. Their architecture may vaguely resemble biological neural networks, with overlapping domains and shared subdomains. As we continue to explore and refine these models, we move closer to unlocking their full potential, paving the way for more sophisticated natural language processing and a deeper understanding of the intricate interplay between language, computation, and artificial intelligence.


2024.03.24 Omar Bessa