Neural text compression in Ruby

The big idea

The compressor reads a file as raw bytes, predicts each next byte with a neural model, and then uses a range encoder to store the real byte efficiently. On decode, it runs the same predictor again and reconstructs the exact original bytes.

Short version: the neural model creates good probabilities; the range encoder turns those probabilities into actual compression.

encode: bytes -> neural guesser -> probability table -> range coder -> .nc file
decode: .nc file -> range decoder + same neural guesser -> original bytes

Why text compression is different from neural image compression

Neural image codecs can reconstruct an image approximately and still look good. Text is different. If one byte changes, a word, number, or piece of code can change meaning immediately.

Images

Approximate reconstruction is often acceptable.

Text

For lossless compression, the output must match the original bytes exactly.

That is why a lossless neural text compressor does not store an embedding of the text and regenerate it later. Instead, it predicts the next exact byte and entropy-codes that byte.

The main pieces

Input bytes: the file is treated as byte values 0..255.
Context window: the predictor sees recent bytes.
Neural model: scores all 256 possible next bytes.
Softmax + quantizer: converts scores into deterministic integer frequencies.
Range coder: likely bytes use fewer bits, unlikely bytes use more bits.
Container: stores metadata plus the compressed payload.

                +------------------------------+
Input bytes --->| Context window (last N bytes)|
                +---------------+--------------+
                                |
                                v
                +------------------------------+
                | Neural model                 |
                | outputs logits/probabilities |
                +---------------+--------------+
                                |
                                v
                +------------------------------+
                | Softmax + quantizer          |
                | -> integer frequency table   |
                +---------------+--------------+
                                |
                                v
                +------------------------------+
                | Range encoder / decoder      |
                +---------------+--------------+
                                |
                                v
                +------------------------------+
                | .nc container                |
                | header + checkpoints + data  |
                +------------------------------+
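
The softmax + quantizer step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the method name quantize_logits, the 2**16 total, and the "at least 1 per byte" rule are all assumptions. It also uses floating-point exp, which a real codec would have to make bit-identical on the encoding and decoding machines.

```ruby
# Hypothetical helper: turn raw model scores (logits) into integer
# frequencies that sum to exactly `total`. Every symbol gets at least 1,
# so no byte is ever impossible to encode.
def quantize_logits(logits, total: 1 << 16)
  raise ArgumentError, "total too small" if total < logits.size

  # Numerically stable softmax: subtract the max before exponentiating.
  max = logits.max
  exps = logits.map { |x| Math.exp(x - max) }
  sum = exps.sum

  # Reserve one count per symbol, then share the rest proportionally.
  spare = total - logits.size
  freqs = exps.map { |e| 1 + (e / sum * spare).floor }

  # Flooring leaves a small remainder. Give it to the most likely symbol
  # (lowest index wins ties) so the result is fully determined.
  top = freqs.each_with_index.max_by { |f, i| [f, -i] }.last
  freqs[top] += total - freqs.sum
  freqs
end
```

Calling quantize_logits with 256 scores returns 256 integers that always sum to the chosen total, which is the form a range coder needs.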

Why the neural model matters

The neural model is the part that discovers what is predictable in the data. Compression improves when the correct next byte gets a high probability.

Bad model:
a: 0.004
b: 0.004
c: 0.004
...

Good model:
space: 0.45
e:     0.20
o:     0.10
others: small

A weak model still decodes correctly. It just compresses poorly. A strong model gives the range encoder a better chance to save bits.
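
To put numbers on the two models above, here is a small sketch of the ideal code length: a byte that the model gave probability p costs about -log2(p) bits. The helper name ideal_bits is made up for this example.

```ruby
# Ideal cost, in bits, of coding a symbol that the model gave probability p.
def ideal_bits(p)
  -Math.log2(p)
end

flat_cost  = ideal_bits(1.0 / 256) # flat model: every byte equally likely
sharp_cost = ideal_bits(0.45)      # the "space: 0.45" entry above

puts format("flat model:  %.2f bits", flat_cost)
puts format("sharp model: %.2f bits", sharp_cost)
```

The flat model pays a full 8 bits for the byte, which is no compression at all. The sharp model pays a little over 1 bit when the space really is the next byte.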

Why the range encoder matters

The range encoder performs the actual bit-level compression. It takes the probability table from the model and the real next byte, then stores that byte in roughly -log2(p) bits, where p is the probability the model gave it. A byte predicted at 50% costs about 1 bit; a byte predicted at 1/256 costs about 8 bits.

Division of labor: the model supplies the probabilities; the range encoder converts them into a compressed bitstream.

Visual intuition

Start with one interval:
[0, 1)

Split it by probabilities:
A = 50%
B = 30%
C = 20%

[0.00 --------------------|-----------|------ 1.00)
           A                  B          C

If the actual symbol is B, keep only the B slice:
[0.50 ----------- 0.80)

Then split that smaller interval again for the next symbol.
Repeat until the whole message is encoded.

Likely symbols get larger slices and usually need fewer bits. Rare symbols get smaller slices and usually need more bits.
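
The slicing above can be written out with exact fractions. This is only a sketch of the interval idea, using the toy A/B/C alphabet; a real range coder works with fixed-width integers and emits bytes as it goes. The names TOY_SLICES and narrow_interval are invented for the example.

```ruby
# Slice boundaries for the toy alphabet above, as exact fractions.
TOY_SLICES = {
  "A" => [Rational(0), Rational(1, 2)],    # 50%
  "B" => [Rational(1, 2), Rational(4, 5)], # 30%
  "C" => [Rational(4, 5), Rational(1)]     # 20%
}.freeze

# Narrow [low, high) to the sub-slice that belongs to `symbol`.
def narrow_interval(low, high, symbol)
  width = high - low
  lo_frac, hi_frac = TOY_SLICES.fetch(symbol)
  [low + width * lo_frac, low + width * hi_frac]
end

low = Rational(0)
high = Rational(1)
"BA".each_char { |s| low, high = narrow_interval(low, high, s) }
puts "final interval: [#{low}, #{high})"
```

After B the interval is [1/2, 4/5), matching the picture. After the following A it shrinks to the first half of that slice, [1/2, 13/20).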

A full example

The phrase "Hello world is hello world" is read as bytes, not as words.

H e l l o _ w o r l d _ i s _ h e l l o _ w o r l d
(_ = space)

Per-byte flow

Step  Context   Model says likely      Actual byte   Stored in payload
----  --------  ---------------------  -----------   ---------------------------
1     ""        H, T, A, ...           H             code for H under dist #1
2     "H"       e, a, i, ...           e             code for e under dist #2
3     "He"      l, r, ...              l             code for l under dist #3
4     "Hel"     l, p, ...              l             code for l under dist #4
5     "Hell"    o, a, ...              o             code for o under dist #5
6     "Hello"   space, !, ...          space         code for space under dist #6
7     "Hello "  w, t, ...              w             code for w under dist #7
...

Later in the phrase, the second hello world can become cheaper if the model or copy/retrieval heads recognize that the same pattern already appeared.
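
Here is a rough way to see that effect without a neural network. The predictor below is a stand-in, not the real model: it just counts which byte followed each 2-byte context so far, with add-one smoothing. The method name adaptive_costs and the order of 2 are assumptions for the sketch.

```ruby
# Stand-in predictor (NOT the neural model): counts which byte followed
# each `order`-byte context so far, with add-one smoothing over 256 bytes.
# Returns the ideal cost in bits of every byte in `text`.
def adaptive_costs(text, order: 2)
  counts = Hash.new { |h, ctx| h[ctx] = Hash.new(0) }
  bytes = text.bytes
  bytes.each_with_index.map do |byte, i|
    ctx = bytes[[i - order, 0].max...i]
    seen = counts[ctx]
    p = (seen[byte] + 1).to_f / (seen.values.sum + 256)
    seen[byte] += 1 # update only after the byte has been "coded"
    -Math.log2(p)
  end
end

costs = adaptive_costs("Hello world is hello world")
first_world  = costs[6, 5].sum  # bytes 6..10, the first "world"
second_world = costs[21, 5].sum # bytes 21..25, the second "world"
puts format("first: %.2f bits, second: %.2f bits", first_world, second_world)
```

Even this crude counter charges less for the second "world" than the first, because each context has already been seen once. A stronger model or a copy expert would make the gap much larger.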

What the .nc file stores

The compressed file does not store the phrase directly and does not literally store a probability table for every step. It stores metadata plus one continuous range-coded payload.

.nc file

+--------------------------------------------------------------+
| HEADER                                                       |
|--------------------------------------------------------------|
| magic                  = identifies this as an .nc file      |
| format version         = container version                   |
| original size          = 26 bytes                            |
| predictor              = neural                              |
| model identifier       = which model to use                  |
| predictor config       = temperature / quantizer settings    |
| checksum               = hash of original bytes              |
+--------------------------------------------------------------+

+--------------------------------------------------------------+
| CHECKPOINTS (if present)                                     |
|--------------------------------------------------------------|
| where decoding can resume / seek                             |
+--------------------------------------------------------------+

+--------------------------------------------------------------+
| PAYLOAD                                                      |
|--------------------------------------------------------------|
| one continuous range-coded bitstream                         |
|                                                              |
| [code(H | "")][code(e | "H")][code(l | "He")]...             |
|                                                              |
| not stored as text                                           |
| not stored as per-byte probability tables                    |
| just the compressed choices                                  |
+--------------------------------------------------------------+
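
As a sketch of how such a header could be written and read back, here is one possible byte layout. Everything specific here is an assumption for illustration: the "NCv0" magic, the field order, the field widths, and the use of SHA-256 as the checksum. It covers only some of the header fields listed above.

```ruby
require "digest"

# Hypothetical layout, for illustration only:
#   magic (4 bytes) | version (u8) | original size (u64, big-endian) | SHA-256 (32 bytes)
NC_MAGIC = "NCv0".b
NC_HEADER_FORMAT = "a4CQ>a32"
NC_HEADER_SIZE = 4 + 1 + 8 + 32

# Build the header bytes for the given original data.
def pack_nc_header(original, version: 1)
  [NC_MAGIC, version, original.bytesize, Digest::SHA256.digest(original)]
    .pack(NC_HEADER_FORMAT)
end

# Parse the header from the start of a blob and check the magic.
def unpack_nc_header(blob)
  magic, version, size, checksum =
    blob.byteslice(0, NC_HEADER_SIZE).unpack(NC_HEADER_FORMAT)
  raise ArgumentError, "not an .nc file" unless magic == NC_MAGIC

  { version: version, original_size: size, checksum: checksum }
end

header = pack_nc_header("Hello world is hello world")
info = unpack_nc_header(header + "payload bytes would follow")
puts "original size: #{info[:original_size]} bytes"
```

The decoder needs the original size to know when to stop, and the checksum lets it confirm that the reconstructed bytes really match.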

Why decompression is exact

Read the metadata.
Load the same model and settings.
Start with the same empty context.
Recompute the same probability table for each step.
Use the range decoder to recover the next byte.
Append that byte to context and repeat.

Even if the model is weak, reconstruction is exact. Weak predictions hurt ratio, not correctness.
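
The steps above can be tried end to end with a toy coder. This sketch swaps in two simplifications, both assumptions: the predictor is a plain adaptive byte counter (ToyCountModel) instead of a neural model, and the coder keeps the interval as exact Rational numbers instead of the fixed-width integers a real range coder uses. The symmetry between encoder and decoder is the point.

```ruby
# Stand-in predictor: adaptive byte counts, every byte starts at 1.
class ToyCountModel
  def initialize
    @freqs = Array.new(256, 1)
  end

  # Current integer frequency table (one count per byte value).
  def table
    @freqs.dup
  end

  def update(byte)
    @freqs[byte] += 1
  end
end

# Narrow the interval once per byte; return a number inside the final one.
def toy_encode(bytes)
  model = ToyCountModel.new
  low = Rational(0)
  width = Rational(1)
  bytes.each do |b|
    freqs = model.table
    total = freqs.sum
    low += width * Rational(freqs.take(b).sum, total)
    width *= Rational(freqs[b], total)
    model.update(b)
  end
  low # any value in [low, low + width) identifies the message
end

# Rebuild the same tables, and at each step pick the slice holding `code`.
def toy_decode(code, length)
  model = ToyCountModel.new
  low = Rational(0)
  width = Rational(1)
  Array.new(length) do
    freqs = model.table
    total = freqs.sum
    target = (code - low) / width * total # position inside the table
    cum = 0
    b = freqs.index { |f| (cum += f) > target }
    cum -= freqs[b] # cumulative count below the chosen byte
    low += width * Rational(cum, total)
    width *= Rational(freqs[b], total)
    model.update(b)
    b
  end
end

original = "Hello world is hello world".bytes
restored = toy_decode(toy_encode(original), original.size)
puts restored.pack("C*")
```

The decoder never sees the text. It only sees the code value and the length, and it recovers each byte because it rebuilds the same table the encoder used at that step.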

What deterministic means

Encoder and decoder must make exactly the same decisions. That means the whole process must follow fixed rules: same model, same math, same context handling, same quantization, same gate decisions.

Same context bytes
-> same expert outputs
-> same gate formula
-> same final mixture
-> same frequency table
-> same decoded byte

No randomness is allowed at runtime. If anything differs between encode and decode, even a single frequency count, the decoder picks a wrong byte and everything after it is garbage.

Why use copy, retrieval, and a meta-gate

Different prediction sources are good in different situations:

Base neural model: good at general byte patterns and syntax.
Short-copy expert: good at short repeats like punctuation and whitespace patterns.
Long-copy expert: good at longer repeated words, identifiers, or templates.
Retrieval expert: good at finding a similar earlier context and reusing its likely continuation.
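
To make the copy idea concrete, here is a minimal sketch of one possible copy expert. It is an assumption about how such an expert might work, not the project's implementation: it looks for the most recent earlier place where the last few bytes already appeared, and proposes the byte that followed them. The name copy_prediction and the default match length of 4 are made up.

```ruby
# Hypothetical copy expert: find the most recent earlier position where
# the last `min_len` bytes of history also occurred, and predict the
# byte that followed them. Returns nil when there is no match.
def copy_prediction(history, min_len: 4)
  return nil if history.size <= min_len

  suffix = history.last(min_len)
  # Scan candidate start positions from newest to oldest.
  (history.size - min_len - 1).downto(0) do |start|
    return history[start + min_len] if history[start, min_len] == suffix
  end
  nil
end

guess = copy_prediction("Hello world is hello wo".bytes)
puts guess ? guess.chr : "no match"
```

With the history "Hello world is hello wo", the last four bytes "o wo" also appear inside the first "Hello world", so the expert proposes "r". A full expert would turn this guess into a frequency table rather than a single byte.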

A meta-gate is a small referee that decides how much to trust each expert at a given byte position.

Imagine four students guessing the next byte:
- one knows general grammar
- one spots short repeats
- one spots long repeats
- one finds similar earlier text

The meta-gate is the teacher deciding whose guess should count more right now.

It stays deterministic by using fixed rules or a frozen tiny model. Given the same context and the same expert outputs, it always produces the same weights.
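
One simple way to apply the gate's weights is to mix the experts' integer frequency tables with integer weights, so no floating point is involved at all. This is a sketch under assumptions: the name mix_experts is invented, and every expert table is assumed to use the same total with no zero counts.

```ruby
# Mix expert frequency tables using integer gate weights.
# `tables` and `weights` are parallel arrays; each table holds one
# integer count per symbol. All arithmetic is integer-only.
def mix_experts(tables, weights)
  raise ArgumentError, "need one weight per expert" unless tables.size == weights.size

  Array.new(tables.first.size) do |sym|
    tables.each_with_index.sum { |table, i| table[sym] * weights[i] }
  end
end

base_table = [6, 1, 1] # toy 3-symbol alphabet: base model prefers symbol 0
copy_table = [1, 6, 1] # copy expert prefers symbol 1
mixed = mix_experts([base_table, copy_table], [1, 3]) # gate trusts copy 3x more
puts mixed.inspect
```

With weights 1 and 3 the mixed table is [9, 19, 4], so the copy expert's favorite wins. The same inputs always give the same table, which is what the decoder relies on.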

Why ratios can still be bad

If the range coder is correct but the compression ratio is weak, the main problem is usually the probability model. The model may be too flat, too local, or too poor at recognizing exact repetition and structure.

Likely issue         Effect on compression
-------------------  --------------------------------------------------
Context too short    Misses longer structure and repeated patterns
Weak copy behavior   Cannot exploit exact recurrence well
Weak retrieval       Finds similar context but not stable continuations
Poor calibration     Correct byte gets too little probability mass
Quantization damage  Sharp probabilities get flattened before coding
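
Quantization damage is easy to measure in isolation. The sketch below uses a made-up quantizer (coarse_quantize, at least 1 count per byte) and a made-up confident prediction of 99% on the space byte, then checks what the coder would actually pay at different table totals.

```ruby
# Hypothetical quantizer: integer counts, every symbol kept at >= 1.
# The counts sum to roughly `total` (flooring may leave it slightly under).
def coarse_quantize(probs, total)
  spare = total - probs.size
  probs.map { |p| 1 + (p * spare).floor }
end

# Bits the coder actually spends on `sym` under the quantized table.
def quantized_bits(freqs, sym)
  -Math.log2(freqs[sym].to_f / freqs.sum)
end

# A confident model: 99% on byte 32 (space), the rest spread evenly.
probs = Array.new(256, 0.01 / 255)
probs[32] = 0.99

[1 << 9, 1 << 12, 1 << 16].each do |total|
  bits = quantized_bits(coarse_quantize(probs, total), 32)
  puts format("total=%-6d cost of space: %.3f bits", total, bits)
end
```

With a total of only 512, the 255 reserved counts for the other bytes eat half the table, and the "99% sure" byte costs about a full bit. With a total of 65536 the cost drops close to the ideal, which is a small fraction of a bit.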

What to improve next without using classic compression algorithms

Context-dependent gating: choose expert weights based on the current kind of context.
Multi-order copy experts: treat short, medium, and long repeats as separate specialists.
Agreement-aware gating: trust experts more when they agree on the same next byte.
Better retrieval: score candidates by continuation agreement, not only similarity.
Within-file specialist blending: mix prose, JSON, Ruby, and log specialists per context rather than per file.
Long-range state: carry a deterministic summary of broader document context.

Most practical next step: add a deterministic meta-gate that uses expert agreement, exact-match strength, retrieval agreement, and context-class features to decide how much to trust the base model, short-copy, long-copy, and retrieval experts.
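
As a starting point, the agreement feature alone could look something like this. It is a deliberately tiny sketch with an invented name (agreement_weights) and covers only expert agreement, not exact-match strength or context-class features. Each expert votes for its top byte, and its weight is the number of experts, itself included, that voted the same way.

```ruby
# Hypothetical agreement-aware gate. Each expert votes for its single
# most likely symbol (lowest index wins ties). An expert's weight is the
# number of experts, itself included, that voted for the same symbol.
def agreement_weights(tables)
  votes = tables.map { |t| t.each_with_index.max_by { |f, i| [f, -i] }.last }
  votes.map { |v| votes.count(v) }
end

tables = [
  [5, 1, 1], # base model votes for symbol 0
  [4, 2, 1], # short-copy expert also votes for symbol 0
  [1, 1, 6]  # retrieval expert votes for symbol 2
]
puts agreement_weights(tables).inspect
```

Here the two agreeing experts each get weight 2 and the lone dissenter gets weight 1. The weights are plain integers computed from the tables themselves, so the decoder can recompute them exactly.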

Final mental model

Raw bytes
  -> several predictors propose what comes next
  -> a deterministic gate decides whose advice matters most
  -> probabilities are quantized into integer counts
  -> the range encoder stores the real byte efficiently
  -> the container stores metadata + the final compressed payload

That is the whole system: prediction creates compressibility, range coding turns it into actual bit savings, and determinism keeps the whole process lossless.