If you ask most people what's going on under the hood with artificial intelligence, they'll likely say "I have no idea." If they venture further, it'll be something like: it's a "statistical something-or-other" or a "probabilistic whatchamacallit" (both are the correct technical terms, btw). And they aren't wrong, but if I recall my probability and statistics courses, many eyes were closed.

But there is a completely different way to view AI, and it's one that is more fun because it has visual examples, everyone loves pictures!


It's Compression

There is a lot of data out there, like so much it's mind boggling, estimated to be around 181 zettabytes or 181,000,000,000,000,000,000,000 bytes [1], but it takes time, energy and space to store and transfer it. In fact it would take about 1.8 billion years to download all that data on an average internet connection.

To make the data less bulky, we compress it, like what you do to your clothes when you pack for a trip.

How? Well take a book. Say, and probably everyone's favorite, Marcel Proust's À la recherche du temps perdu. It's a light read at around 1.2 million words, probably finish it in a single sunny afternoon. But if you wanted to store that on a computer, it would take 6 megabytes or 6,291,456 bytes (computers use binary, i.e. base 2, so multiples of two, like ants marching) of space, because the computer has to store every single letter of every single word plus all the punctuation and whitespace.

But to compress it, we can get clever. Let's say you have the sentence "the cat sat on the mat." Normally, you'd store:

t-h-e- -c-a-t- -s-a-t- -o-n- -t-h-e- -m-a-t-.

That's 23 characters. But notice "the" appears twice. So instead, we build a dictionary:

1 = the
2 = cat
3 = sat
4 = on
5 = mat

Now we can store: 1-2-3-4-1-5

Way smaller. And when you want to read it, just look up each number in the dictionary and boom, you get the exact original sentence back. This is called lossless compression, because the original data is perfectly preserved.

Lossless Compression: Dictionary Encoding

But what if we went further? What if instead of storing exact words, you converted "sat" to "was on" because they mean roughly the same thing? You'd compress even more, but now you can't perfectly recreate the original. You'd get "the cat was on the mat" instead of "the cat sat on the mat." Close, but not exact.

That's lossy compression. You lose some of the original data, but preserve the approximate meaning. You save space by accepting "close enough."


Sound and Images

For text, lossy compression is rarely used, because it's difficult to preserve meaning when you're messing with discrete words. Change one word and the whole sentence can flip.

But for sound and images? Totally different story.

Sound is a continuous wave. Images are grids of pixel values that blend smoothly into each other. Videos are just images plus sound, multiplied by frame rate and duration. Massive amounts of data.

But here's the thing: they operate on smooth continuums. A pixel that's RGB (100, 150, 200) is pretty similar to a pixel that's RGB (101, 149, 201). Your ear can't really tell the difference between a soundwave at 440.0 Hz and 440.1 Hz.

And that smooth continuum? That's where the magic happens.

If you remember functions from math class, they map an input to an output. Like y = 2x + 1 gives you a straight line: feed in any x value, get a y value out.

So instead of storing every single pixel value or soundwave point, we can find a function that approximates those values. Store the function parameters (like the 2 and 1 in our example), then when you want it back, just run the function and get something very close to the original.

Function Approximation: Storing a Curve Instead of Data Points

We're not storing the data anymore. We're storing the recipe to recreate something very similar to the data. (There is actual math behind why this works, called the universal approximation theorem [2]. The gist: enough layered functions, tuned right, can approximate almost any smooth relationship. Which, it turns out, describes a lot of the world.)

Now, "approximation" is the key word here. The function isn't perfect. It's close. The original data is gone. We can't get it back, because we only saved the function parameters.

And here's where it gets messy: if you compress once, it looks pretty good. But if you take that compressed version and compress it again? Then compress that? You're making an approximation of an approximation of an approximation. Each generation gets worse.

The same portrait photo shown four times, progressing from original detail to visible JPEG artifacts after repeated re-compression.

Left to right: original, one JPEG save, several saves, and repeated re-compression. Built from a photo by Desiray Green on Unsplash.

Eventually you end up with garbage. You've probably seen it: a photo that looks like someone poured wet concrete over the pixels. That's a JPEG that got re-compressed a dozen times. The format was actually designed around this [3]: it quietly discards fine detail your eyes are unlikely to miss, and that works great until it doesn't.


AI is Compression

If you took a statistics class, you might remember linear regression. You have a bunch of data points scattered on a graph, and you want to find the line that best fits them. That line is defined by just two numbers (slope and intercept), but it approximates the relationship across all your data points.

Well... that's exactly what we just did with sound and images. Find a function that approximates the data.

And that's what AI is doing.

AI models are functions with parameters that approximate patterns in data. When you send text to an LLM (Large Language Model, like ChatGPT or Claude), it runs the function with the parameters it learned during training plus your input, and outputs a result.

An approximation of patterns it saw in its training data.

But wait, didn't I say lossy compression on text is hard?


How Text Gets Compressed

Yeah, text is tricky because words are discrete chunks and meaning depends on context. So how do LLMs compress text?

They cheat. They turn text into something that acts like a smooth continuum.

Here's how:

  1. Take words, convert them to numbers (this is called tokenization)
  2. Pass those numbers through functions that analyze relationships between them
  3. Output new numbers that represent each word's meaning in a multi-dimensional space where words with similar meanings end up close to each other

So "cat" and "dog" end up near each other because they're both animals. "King" and "queen" end up near each other. "Happy" and "joyful" cluster together.

This happens in something called latent space. Fancy term for the hidden middle layers of the model where all the approximation happens. You can't see it directly, but that's where your text gets turned into coordinates in a giant multi-dimensional space.

There's a famous party trick from this: king - man + woman ≈ queen [4]. The math works because those words really are neighbors in meaning-space. A little eerie when you first see it.

Word Embeddings: Words as Points in Meaning Space

Now text exists as points in a smooth continuum, and we can use the same function approximation tricks we use on images and sound.

We pass those values through many layers of functions. Each layer builds on the last: first finding patterns in raw numbers, then patterns in those patterns, then patterns in those. The deeper you go, the more abstract and meaningful the relationships become. At the end, we convert it all back into words.


Useful

Here's where it gets interesting. AI training often happens in two main phases.

Pre-training is like the initial compression phase. You feed the model massive amounts of data (like trillions of words) and it learns to compress that data into function parameters by approximating all those relationships, and the relationships of relationships, and so on. At this stage it's very good at recreating text that looks like what it was trained on. Give it the title of a book it was trained on and it can generate a pretty good approximation of the text of that book.

Pre-training is a general term for the initial training phase in multi-phase training processes (which not all ML models use). But it's not just used in LLMs. For example at Siua, our base spatial models are pre-trained using various aspects of synthetic ballistic images in order to learn spatial relationships from sub-pixel features (a pixel is one solid color, so finding a 'feature' inside that is quite a feat!). We then train on real datasets for release models, like mlSpatial's v2 golf model, letting us achieve exceptional accuracy beyond what other systems can offer.

Very cool, but not the most useful on its own.

And functions take an input, remember f(x) = 2x + 1 from algebra? The input is x, the output is y. So when we give it an input, it runs the function with that input and outputs a result.

So what happens when we give it an input it wasn't trained on?

Well it's going to run that function, and it's going to output something... something new.

But because the functions have been tuned to approximate so much, trillions of words, that output will likely share enough of the same patterns that we get comprehensible text rather than gibberish. That's the whole trick: predict the next word, one at a time, repeat [5]. It sounds almost too simple. But trained on enough text, those parameters end up encoding something surprisingly close to language itself.

And to make that more useful, we can then 'fine-tune' the model. This takes all those parameters that represent all that compressed information, and it just tweaks them a little bit so that instead of outputting the original data, it can do things like output answers to questions, or write code.


So What?

Much of the behavior of LLMs, and AI in general, can make a lot of sense when you start thinking of it as lossy compression, including many of the failure cases. Why does it output wrong information?

Because it's not retrieving facts, it's just executing functions that produce approximations of the distribution of the training data.

The model doesn't 'know' anything. It has no access to the original training data. That's gone. What remains are the compressed patterns, the function parameters that approximate relationships it saw during training.

When you ask it a question, it's not thinking or reasoning or looking anything up. It's running those functions to generate the most probable next word, one at a time, based on the patterns it compressed.

Sometimes that lands you incredibly close to accurate information, especially for common patterns it saw thousands of times during training. Sometimes it lands you in a part of latent space that generates confident, plausible-sounding nonsense, because the approximation looked right from where it stood. Systematic surveys of this failure mode, where models generate confident, fluent, factually wrong statements, have documented it extensively across task types [6].

Why is it good at some things and terrible at others? Things that appeared thousands of times in training, common code patterns, standard writing structures, typical summaries, were compressed very well. The function approximation there is tight, the output reliable. Things that were rare, or require precise facts, or combine concepts in unusual ways, land in sparse regions of latent space, where the approximation is lossy in ways that matter.

Why does it degrade with longer context? Attention isn't magic — every token competes for influence, and that signal gets diluted across greater distance. The model also trained mostly on shorter documents, so very long contexts push it toward the edges of what it practiced. The approximation loosens the further you stray from the training sweet spot. This is also part of what drives model collapse: when AI output becomes training data for the next model, each generation compresses an already-compressed signal, and the artifacts compound. The JPEG problem, but for ideas.

None of this is a criticism. It's just what the thing is.

Knowing that AI is lossy compression tells you exactly when to trust it and when not to. Use it for things where "close enough" is genuinely close enough: drafting, summarizing, brainstorming, pattern recognition, code that you're going to review anyway. Don't use it for things where exactness matters: precise facts, legal text you won't verify, calculations you'll stake money on without checking.

The model doesn't know it's compressing. It doesn't know when it's in the wrong neighborhood. It just runs the function and returns the output. That's not a bug. It's the mechanism.

And honestly, most useful things work that way. Memory is lossy compression. Language itself is lossy compression. We approximate experience with words and most of the original gets lost. We navigate most of life on close enough.

The question isn't whether AI is lossy. Everything useful is.

The question is whether the approximation is good enough for what you actually need to do.

Sometimes it is. Sometimes it isn't. Now you can tell the difference.


References

  1. Statista. (2025). Volume of data or information created, captured, copied, and consumed worldwide from 2010 to 2029 (in zettabytes). The published series reports 181 zettabytes for 2025 and is based on IDC Global DataSphere forecasting. statista.com/statistics/871513/worldwide-data-created
  2. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. The foundational proof that feedforward networks with a single hidden layer can approximate any continuous function to arbitrary precision. link.springer.com/article/10.1007/BF02551274
  3. Wallace, G. K. (1991). The JPEG still picture compression standard. Communications of the ACM, 34(4), 30–44. Explains the DCT-based lossy compression pipeline designed around perceptual limits of human vision. doi.org/10.1145/103085.103089
  4. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS 2013. Introduced Word2Vec and demonstrated that semantic relationships are preserved as geometric relationships in embedding space. arxiv.org/abs/1310.4546
  5. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language Models are Few-Shot Learners. Introduced GPT-3 as a large autoregressive language model and documented how next-token prediction at scale produces broad downstream capabilities. arxiv.org/abs/2005.14165
  6. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., … Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38. A systematic review of factual errors in language model outputs across summarization, question answering, dialogue, and LLM settings. arxiv.org/abs/2202.03629