Neural text compression: prediction, range coding, and deterministic decoding

Compression as prediction

The compressor reads a file as raw bytes, predicts each next byte with a neural model, and then uses a range encoder to store the real byte efficiently. On decode, it runs the same predictor again and reconstructs the exact original bytes.

Short version: the neural model creates good probabilities; the range encoder turns those probabilities into actual compression.

encode: bytes -> neural guesser -> probability table -> range coder -> .nc file
decode: .nc file -> range decoder + same neural guesser -> original bytes

Why text compression is different from neural image compression

Neural image codecs can reconstruct an image approximately, and the result can still look good. Text is different. If one byte changes, a word, number, or piece of code can change meaning immediately.

Images

Approximate reconstruction is often acceptable.

Text

For lossless compression, the output must match the original bytes exactly.

That is why a lossless neural text compressor does not store an embedding of the text and regenerate it later. Instead, it predicts the next exact byte and entropy-codes that byte.

The moving parts

Input bytes: the file is treated as byte values 0..255.
Context window: the predictor sees recent bytes.
Neural model: scores all 256 possible next bytes.
Softmax + quantizer: converts scores into deterministic integer frequencies.
Range coder: likely bytes use fewer bits, unlikely bytes use more bits.
Container: stores metadata plus the compressed payload.

                +------------------------------+
Input bytes --->| Context window (last N bytes)|
                +---------------+--------------+
                                |
                                v
                +------------------------------+
                | Neural model                 |
                | outputs logits/probabilities |
                +---------------+--------------+
                                |
                                v
                +------------------------------+
                | Softmax + quantizer          |
                | -> integer frequency table   |
                +---------------+--------------+
                                |
                                v
                +------------------------------+
                | Range encoder / decoder      |
                +---------------+--------------+
                                |
                                v
                +------------------------------+
                | .nc container                |
                | header + checkpoints + data  |
                +------------------------------+
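
The softmax + quantizer step can be sketched in Python as follows. This is a minimal illustration, not the format's actual quantizer: the function name quantize, the 16-bit total, the one-count floor per byte, and the rule that hands the rounding remainder to the most likely byte are all assumptions. The floating-point softmax is only safe if encoder and decoder evaluate it identically.

```python
import math

def quantize(logits, total=1 << 16):
    """Turn raw scores into integer frequencies that sum to `total`."""
    # Softmax, with the maximum subtracted for numerical stability.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    norm = sum(weights)
    n = len(logits)
    # Reserve one count per symbol so no byte ever gets zero probability.
    spare = total - n
    freqs = [1 + int(spare * w / norm) for w in weights]
    # Flooring leaves a small remainder. Give it to the most likely symbol;
    # ties go to the lowest index so the rule is fully deterministic.
    best = max(range(n), key=lambda i: (freqs[i], -i))
    freqs[best] += total - sum(freqs)
    return freqs

# Hypothetical logits: the model strongly expects "e".
logits = [0.0] * 256
logits[ord("e")] = 5.0
table = quantize(logits)
print(sum(table), min(table), table[ord("e")] == max(table))
```

The range coder only ever sees the integer table, so the floor of one count is what guarantees that any byte, however unlikely, can still be encoded.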

Prediction quality

The neural model is important because it is the part that discovers what is predictable in the data. Compression gets better when the correct next byte gets a high probability.

A byte-level neural model can learn patterns that are hard to capture with a simple frequency table. It may learn that "Hell" is often followed by "o", that JSON keys continue after an opening quote, or that Ruby code after "def" tends to form an identifier. This does not mean the model understands the file semantically. It means it has learned regularities that make the next byte less surprising.

Bad model:
a: 0.004
b: 0.004
c: 0.004
...

Good model:
space: 0.45
e:     0.20
o:     0.10
others: small

A weak model still decodes correctly. It just compresses poorly. A strong model gives the range encoder a better chance to save bits. Research systems such as DeepZip frame this as next-symbol prediction plus entropy coding, and more recent neural lossless compression work follows the same basic pattern at larger scale.²³
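
The link between probability and cost can be made concrete with the ideal code length, -log2(p) bits for a byte that received probability p. The sketch below applies it to the two example tables above; the helper name bits is only for illustration, and a real coder adds a small overhead on top of this ideal.

```python
import math

def bits(p):
    """Ideal code length, in bits, for a byte the model gave probability p."""
    return -math.log2(p)

# Bad model: every byte gets about 1/256, so every byte costs 8 bits.
print(bits(1 / 256))

# Good model: the cost depends on how much mass the actual byte received.
for name, p in [("space", 0.45), ("e", 0.20), ("o", 0.10)]:
    print(name, round(bits(p), 2))
```

Under the flat table every byte costs 8 bits, which is no compression at all. Under the sharper table a space costs about 1.15 bits, "e" about 2.32, and "o" about 3.32.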

Turning probabilities into bits

The range encoder is the part that performs the actual bit-level compression. It takes the probability table from the model and the real next byte, then stores that byte using about -log2(p) bits, where p is the probability the model gave it.

A range coder is a practical form of entropy coding. It assigns larger ranges to more likely symbols and smaller ranges to less likely symbols. More likely bytes therefore cost fewer bits; surprising bytes cost more. This is the same broad principle behind arithmetic coding, which represents a message using probability-proportional intervals and can approach the entropy limit when the probability model is good.¹

Division of labor: the model supplies the probabilities; the range encoder converts those probabilities into a compressed bitstream.

Visual intuition

Start with one interval:
[0, 1)

Split it by probabilities:
A = 50%
B = 30%
C = 20%

[0.00 --------------------|-----------|------ 1.00)
           A                  B          C

If the actual symbol is B, keep only the B slice:
[0.50 ----------- 0.80)

Then split that smaller interval again for the next symbol.
Repeat until the whole message is encoded.

Likely symbols get larger slices and usually need fewer bits. Rare symbols get smaller slices and usually need more bits.
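
The interval narrowing above can be reproduced with exact fractions. This is a teaching sketch of the idea only: a practical range coder works with fixed-width integers and emits bits as it goes, and the names PROBS, narrow, and encode_interval are made up for this example.

```python
from fractions import Fraction

# Symbol order and probabilities from the example above.
PROBS = [("A", Fraction(50, 100)), ("B", Fraction(30, 100)), ("C", Fraction(20, 100))]

def narrow(low, high, symbol):
    """Return the slice of [low, high) that belongs to `symbol`."""
    width = high - low
    start = Fraction(0)
    for name, p in PROBS:
        if name == symbol:
            return low + width * start, low + width * (start + p)
        start += p
    raise ValueError(f"unknown symbol {symbol!r}")

def encode_interval(message):
    """Narrow [0, 1) once per symbol and return the final interval."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        low, high = narrow(low, high, symbol)
    return low, high

print(encode_interval("B"))   # the B slice: 1/2 to 4/5
print(encode_interval("BA"))  # the A slice inside it: 1/2 to 13/20
```

The width of the final interval is the product of the probabilities of the symbols seen, which is why a message made of likely symbols ends in a wide interval that takes few bits to point into.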

One pass through the pipeline

The phrase "Hello world is hello world" is read as bytes, not as words.

H e l l o _ w o r l d _ i s _ h e l l o _ w o r l d
(_ = space)

Per-byte flow

Step  Context   Model says likely      Actual byte   Stored in payload
----  --------  ---------------------  -----------   ---------------------------
1     ""        H, T, A, ...           H             code for H under dist #1
2     "H"       e, a, i, ...           e             code for e under dist #2
3     "He"      l, r, ...              l             code for l under dist #3
4     "Hel"     l, p, ...              l             code for l under dist #4
5     "Hell"    o, a, ...              o             code for o under dist #5
6     "Hello"   space, !, ...          space         code for space under dist #6
7     "Hello "  w, t, ...              w             code for w under dist #7
...

Later in the phrase, the second "hello world" can become cheaper if the model or the copy/retrieval heads recognize that the same pattern already appeared.

What the .nc file stores

The compressed file does not store the phrase directly and does not literally store a probability table for every step. It stores metadata plus one continuous range-coded payload.

.nc file

+--------------------------------------------------------------+
| HEADER                                                       |
|--------------------------------------------------------------|
| magic                  = identifies this as an .nc file      |
| format version         = container version                   |
| original size          = 26 bytes                            |
| predictor              = neural                              |
| model identifier       = which model to use                  |
| predictor config       = temperature / quantizer settings    |
| checksum               = hash of original bytes              |
+--------------------------------------------------------------+

+--------------------------------------------------------------+
| CHECKPOINTS (if present)                                     |
|--------------------------------------------------------------|
| where decoding can resume / seek                             |
+--------------------------------------------------------------+

+--------------------------------------------------------------+
| PAYLOAD                                                      |
|--------------------------------------------------------------|
| one continuous range-coded bitstream                         |
|                                                              |
| [code(H | "")][code(e | "H")][code(l | "He")]...             |
|                                                              |
| not stored as text                                           |
| not stored as per-byte probability tables                    |
| just the compressed choices                                  |
+--------------------------------------------------------------+
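
To make the header concrete, here is one way such fields could be packed and checked. Everything specific in this sketch is an assumption: the text above does not give field widths, byte order, a magic value, or a hash algorithm, so the layout, the NC01 magic, the integer temperature in thousandths, and the use of CRC32 as the checksum are all hypothetical.

```python
import struct
import zlib

# Hypothetical layout, little-endian, no padding:
# magic(4s) version(B) original_size(Q) predictor(B)
# model_id(H) temperature_in_thousandths(H) checksum(I)
HEADER = struct.Struct("<4sBQBHHI")
MAGIC = b"NC01"  # made-up magic value

def pack_header(original, model_id=1, temperature_milli=1000):
    """Build a header describing the original bytes."""
    return HEADER.pack(MAGIC, 1, len(original), 1, model_id,
                       temperature_milli, zlib.crc32(original))

def unpack_header(blob):
    """Parse the header at the start of `blob`."""
    magic, version, size, predictor, model_id, temp, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not an .nc file")
    return {"version": version, "original_size": size, "predictor": predictor,
            "model_id": model_id, "temperature_milli": temp, "checksum": crc}

def verify(decoded, header):
    """Check decoded bytes against the size and checksum in the header."""
    return (len(decoded) == header["original_size"]
            and zlib.crc32(decoded) == header["checksum"])

data = b"Hello world is hello world"
header = unpack_header(pack_header(data))
print(header["original_size"], verify(data, header))
```

Storing the temperature as an integer rather than a float is a deliberate choice in this sketch: the decoder then reads back exactly the setting the encoder used.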

Exact decoding

Read the metadata.
Load the same model and settings.
Start with the same empty context.
Recompute the same probability table for each step.
Use the range decoder to recover the next byte.
Append that byte to context and repeat.

Even if the model is weak, reconstruction is exact. Weak predictions hurt ratio, not correctness.
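
The decode loop above can be shown end to end with a toy coder. This is a sketch under heavy simplification: an adaptive byte counter stands in for the neural model, and Python's unbounded integers replace the fixed-width arithmetic of a real range coder. The names table, encode, and decode, the 16-bit total, and the count increment of 32 are all assumptions.

```python
K = 16
TOTAL = 1 << K  # every frequency table sums to 2**K

def table(counts):
    """Stand-in predictor: quantize byte counts using integers only."""
    norm = sum(counts)
    spare = TOTAL - 256
    freqs = [1 + spare * c // norm for c in counts]
    best = max(range(256), key=lambda i: (freqs[i], -i))
    freqs[best] += TOTAL - sum(freqs)
    return freqs

def encode(data):
    counts = [1] * 256
    low, width = 0, 1  # interval [low, low + width), in units of TOTAL**-i
    for byte in data:
        freqs = table(counts)
        low = low * TOTAL + width * sum(freqs[:byte])
        width *= freqs[byte]
        counts[byte] += 32  # the model adapts after seeing the byte
    # Pick the value inside the interval with the most trailing zero bits,
    # then drop those zeros.
    shift = width.bit_length() - 1
    code = (low + (1 << shift) - 1) >> shift
    return len(data), shift, code

def decode(n, shift, code):
    value = code << shift
    counts = [1] * 256
    low, width = 0, 1
    out = bytearray()
    for i in range(n):
        freqs = table(counts)  # same table the encoder built at this step
        target = value >> (K * (n - i - 1))
        offset = (target - low * TOTAL) // width
        byte, cum = 0, 0
        while cum + freqs[byte] <= offset:
            cum += freqs[byte]
            byte += 1
        low = low * TOTAL + width * cum
        width *= freqs[byte]
        counts[byte] += 32  # identical update keeps both sides in sync
        out.append(byte)
    return bytes(out)

data = b"Hello world is hello world"
n, shift, code = encode(data)
print(decode(n, shift, code) == data, code.bit_length(), 8 * len(data))
```

The counter is a poor predictor, yet the round trip is still exact, because the decoder rebuilds the same table before reading each byte. A better predictor would only change how short the code is.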

Determinism in the decoder

Encoder and decoder must make exactly the same decisions. That means the whole process must follow fixed rules: same model, same math, same context handling, same quantization, same gate decisions.

Plain version

The compressed file does not contain the original text. It contains a coded path through a sequence of probability tables. The decoder must rebuild the same tables in the same order to follow that path back to the original bytes.

Same context bytes
-> same expert outputs
-> same gate formula
-> same final mixture
-> same frequency table
-> same decoded byte

No randomness is allowed at runtime. If anything changes between encode and decode, the decoder's tables diverge from the encoder's and the rest of the bitstream decodes to the wrong bytes. Work on model-driven compression highlights the same synchronization requirement: compression and decompression must reproduce the same probability predictions, because small nondeterministic differences can break decoding.⁴
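
One common source of such differences is floating-point arithmetic, where the order of operations changes the result. The sketch below shows the effect and one possible mitigation, converting values to fixed-point integers before combining them. The 16-bit scale and the helper to_fixed are assumptions, and the conversion itself still has to be done identically on both sides.

```python
# Floating-point addition is not associative, so evaluation order matters.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

SCALE = 1 << 16  # hypothetical fixed-point scale

def to_fixed(x):
    """Convert a float to an integer number of 1/SCALE units."""
    return round(x * SCALE)

# Once the values are integers, the order of addition no longer matters.
fixed = [to_fixed(v) for v in (0.1, 0.2, 0.3)]
print(sum(fixed) == sum(reversed(fixed)))  # True
```

A difference in the last bit of a float is harmless in most programs. Here it can move a quantized frequency by one count, which is enough to desynchronize the decoder.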

Copy, retrieval, and the meta-gate

Text, code, logs, JSON, and configuration files often repeat. A neural predictor may learn this partially, but exact repetition is so important that it deserves specialized machinery. A copy head looks into the recent context and asks whether the next byte is probably copied from something nearby. A retrieval head looks for a similar earlier context and asks whether the continuation from that earlier location is likely to repeat now.

Different prediction sources are therefore good in different situations:

Base neural model: good at general byte patterns and syntax.
Short-copy expert: good at short repeats like punctuation and whitespace patterns.
Long-copy expert: good at longer repeated words, identifiers, or templates.
Retrieval expert: good at finding a similar earlier context and reusing its likely continuation.

A meta-gate is a small deterministic selector that decides how much to trust each expert at a given byte position. It might use expert agreement, exact-match strength, retrieval agreement, and context-class features. The gate can be rule-based or implemented as a frozen tiny model, but the contract is the same: given the same context and the same expert outputs, it must always produce the same weights.

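def choose_expert(exact_match_score, retrieval_score, retrieval_agreement, short_repeat_score):
    # Rules are checked in priority order; the first match wins.
    if exact_match_score >= 92:
        return "long_copy"
    if retrieval_score >= 85 and retrieval_agreement >= 2:
        return "retrieval"
    if short_repeat_score >= 70:
        return "short_copy"
    return "base_model"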

This is not a bypass around compression. Copy and retrieval are still probability estimators; they just provide better probability tables to the entropy coder. That resembles the broader history of context-mixing compressors, where multiple statistical models are combined to improve next-symbol prediction.⁵
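
Beyond a hard choice, "how much to trust each expert" can also mean blending the experts' tables. A minimal sketch of such a blend, using only integer arithmetic so the result is reproducible, might look like this. The function mix, the integer weights, and the two example tables are hypothetical, and each input table is assumed to sum to the same total with no zero entries.

```python
TOTAL = 1 << 16  # assumed: every expert table sums to this

def mix(tables, weights):
    """Blend integer frequency tables using integer weights only."""
    wsum = sum(weights)
    size = len(tables[0])
    mixed = [sum(w * t[i] for w, t in zip(weights, tables)) // wsum
             for i in range(size)]
    # Flooring loses a few counts; return them to the most likely symbol,
    # breaking ties by lowest index so the rule is deterministic.
    best = max(range(size), key=lambda i: (mixed[i], -i))
    mixed[best] += TOTAL - sum(mixed)
    return mixed

# Base model is unsure; the long-copy expert is nearly certain of "o".
base = [TOTAL // 256] * 256
long_copy = [1] * 256
long_copy[ord("o")] = TOTAL - 255

blended = mix([base, long_copy], [2, 14])
print(sum(blended), min(blended), blended[ord("o")])
```

Because every input entry is at least one, every blended entry is too, so the copy expert can dominate without making any other byte impossible to encode.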

Where ratios fall apart

If the range coder is correct but the compression ratio is weak, the main problem is usually the probability model. The model may be too flat, too local, or too poor at recognizing exact repetition and structure.

Likely issue         Effect on compression
-------------------  --------------------------------------------------
Context too short    Misses longer structure and repeated patterns
Weak copy behavior   Cannot exploit exact recurrence well
Weak retrieval       Finds similar context but not stable continuations
Poor calibration     Correct byte gets too little probability mass
Quantization damage  Sharp probabilities get flattened before coding

Calibration

Good ranking is not enough. A compressor benefits when probabilities are numerically well calibrated. If the model says a byte has 90% probability but it is right only half the time, the coder will overpay when the prediction fails. If the model is too cautious, it will miss chances to save bits.

Temperature settings, quantization knobs, and domain profiles such as prose, JSON, Ruby, logs, or mixed text are ways of adjusting probability sharpness and representation. The goal is not to make the model sound smarter. The goal is to make its probability tables match the actual byte stream more closely.
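
The cost of miscalibration can be worked out directly. The sketch below simplifies to two outcomes: the model puts p_model on its top guess and the rest on a single alternative, while the guess is actually right with frequency hit_rate. Both the simplification and the helper name expected_bits are assumptions made for illustration.

```python
import math

def expected_bits(p_model, hit_rate):
    """Average bits per byte when the model claims p_model but is right hit_rate of the time."""
    return -(hit_rate * math.log2(p_model)
             + (1 - hit_rate) * math.log2(1 - p_model))

# Overconfident: claims 90%, right half the time.
print(round(expected_bits(0.9, 0.5), 3))
# Calibrated for the same source: claims 50%.
print(round(expected_bits(0.5, 0.5), 3))
# Too cautious: claims 60%, right 90% of the time.
print(round(expected_bits(0.6, 0.9), 3))
# Calibrated for that source: claims 90%.
print(round(expected_bits(0.9, 0.9), 3))
```

In both pairs the calibrated model pays fewer bits on average. The overconfident model pays about 1.74 bits where 1 bit would do, and the cautious model pays about 0.80 where about 0.47 would do.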

Next experiments without classic codecs

Context-dependent gating: choose expert weights based on the current kind of context.
Multi-order copy experts: treat short, medium, and long repeats as separate specialists.
Agreement-aware gating: trust experts more when they agree on the same next byte.
Better retrieval: score candidates by continuation agreement, not only similarity.
Within-file specialist blending: mix prose, JSON, Ruby, and log specialists per context rather than per file.
Long-range state: carry a deterministic summary of broader document context.

Most practical next step: add a deterministic meta-gate that uses expert agreement, exact-match strength, retrieval agreement, and context-class features to decide how much to trust the base model, short-copy, long-copy, and retrieval experts.

The shortest version

Raw bytes
  -> several predictors propose what comes next
  -> a deterministic gate decides whose advice matters most
  -> probabilities are quantized into integer counts
  -> the range encoder stores the real byte efficiently
  -> the container stores metadata + the final compressed payload

That is the whole system: prediction creates compressibility, range coding turns it into actual bit savings, and determinism keeps the whole process lossless.

Sources

¹ Wikipedia contributors, Arithmetic coding, including references to Witten, Neal, and Cleary's Arithmetic Coding for Data Compression, Communications of the ACM, 1987.
² Mohit Goyal, Kedar Tatwawadi, Shubham Chandak, Idoia Ochoa, DeepZip: Lossless Data Compression using Recurrent Neural Networks, arXiv, 2018.
³ Roberto Tacconelli, Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding, arXiv, 2026.
⁴ Adam Adler, Synchronizing Probabilities in Model-Driven Compression, OpenReview, 2026.
⁵ Byron Knoll, cmix, lossless data compression program documentation.