Hypernym Infinite Memory

Institutional brief for the v0.9 research lane: a small 4,096-context-class model is being evaluated with a memory substrate that accepts and curates prompts far beyond the native context window. This page separates measured evidence from large-model implications.

Status

Measured results exist. The current Tundra public gateway is waiting on its backend process. A separate direct-IP route is alive but currently needs the correct direct-route API key for continued v0.27 testing.

Base Model Class

~3B

IBM Granite 4.0 Micro class in the evaluated lane.

Native Context Class

4,096

This is the base model attention window class, not the accepted prompt pressure.

Observed Accepted Pressure

~199k

Prompt-token pressure reached before the ~300s serving wall in byte-threshold runs.

Pressure Ratio

~49x

~198,888 observed prompt tokens divided by 4,096 native context class.

One-Sentence Institutional Claim

Hypernym v0.9 shows that a small local/mobile-class model can operate under memory pressure tens of times larger than its native context, with measured exact structured-recall pockets produced by a substrate layer rather than by enlarging the base transformer context.

Executive frame

This is not a bigger context window claim. The observed behavior is a memory-substrate claim: a small model remains small, but the serving layer changes what survives into active attention under extreme input pressure.

Deck-Safe Language

Hypernym is a memory substrate for small models: it lets a 4,096-context-class, ~3B model accept and process workloads that create roughly 50x native-context pressure, while preserving task-relevant information through substrate curation rather than brute-force long-context attention.

What is unusual: the system is showing useful recall behavior in a compute profile that belongs closer to local/mobile inference than to frontier long-context serving. The substrate is the product: it changes the memory economics around the model.

What remains open: the exact recall pocket has to be widened from tail-sensitive structured recall into stronger position-independent recall before claiming general infinite memory.

Measured fact

A 4,096-context-class small model path accepted and returned responses under prompt pressure up to ~198,888 prompt tokens, about 48.6x native context.

Defensible interpretation

The substrate is acting as a memory-control layer: preserving selected useful information while forcing irrelevant or lower-priority material out of the active window.

Open hypothesis

If the multiplier transfers to larger native-context models, the same substrate pattern may reduce long-context cost and improve isolated enterprise memory.

What is measured

Prompts in the ~88k, ~133k, ~178k, and ~199k token-pressure range were accepted on serving paths built around a 4k-context-class model. A strict JSON pocket-generalization run processed 825,958 prompt tokens total across completed rows.

What is meaningful

The base model is not becoming a 200k-context transformer. The substrate is acting as a memory-control layer: deciding what to preserve, compress, evict, or surface into the active attention window.

What is not proven

This does not yet prove general arbitrary-position perfect recall. The strongest exact provenance behavior is currently a tail-position pocket, not universal random access.

Hard Numbers

Run / Evidence	Measured Result	Interpretation
Transport liveness	300,680 chars / 88,442 prompt tokens in 130.5s; 451,080 chars / 133,242 tokens in 197.9s; 601,480 chars / 178,042 tokens in 266.3s.	Serving path processed far beyond native context with roughly stable prefill throughput near the observed band.
Byte threshold	777,747 chars / 198,888 prompt tokens returned HTTP 200 at 299.38s; next compact row around 818,726 chars failed at 300.8s.	The current operational ceiling is a ~300s request wall, not a demonstrated substrate memory ceiling.
Pocket generalization	9 completed response rows, 7 HTTP-200 rows, 825,958 prompt tokens total, parse success mean 1.0 on HTTP-200 rows, semantic mean 0.8333, exact provenance mean 0.2857.	Exact structured recall exists in specific regimes, but does not generalize across all target positions yet.
Quality floor	20/20 HTTP-200 rows at low pressure, semantic mean 1.0, provenance mean 0.8.	The model can perform the structured task under low pressure; failures at high pressure are pressure/geometry effects, not just task misunderstanding.

Why This Is Different

Conventional context scaling makes the model window larger. This experiment keeps the base model small and adds a substrate that curates the oversized input into what the model can use. The result is not opaque RAG over a vector database; the memory behavior is driven through prompt/input pressure and substrate curation.

small modelsubstrate memorycurated attentionstructured recall

Why It Matters

If this behavior can be made robust, the deployment model changes: many isolated small instances could carry useful long-memory behavior without every user session needing a frontier-scale context window. That is the cost, privacy, locality, and concurrency thesis.

local/mobile classtenant isolationlower memory footprinthorizontal scale

Open Question: What If This Is Applied to Large Models?

This has not been proven yet, but the institutional implication is straightforward: if a substrate can give a 4k-context-class 3B model useful behavior under ~50x pressure, then applying the same substrate to larger models may create a multiplier on already-strong reasoning and longer native windows. The right claim is not that a 128k model automatically becomes perfect at millions of tokens; the right next test is whether the substrate shifts the cost/reliability curve for large models the same way it appears to shift the curve for a small one.

Large-model implication	Potential value	Evidence status
Context multiplier	A 32k, 128k, or larger context model may use substrate curation to survive workloads beyond its nominal window without full brute-force attention over everything.	Open hypothesis; not yet measured in this lane.
Cost shaping	Instead of scaling every request as full long-context attention, the system can attempt to preserve relevant memory geometry and reduce active attention burden.	Architectural implication; requires large-model A/B benchmarks.
Enterprise memory	Per-user or per-workflow memory stores could be isolated while still giving each session a long-lived recall surface.	Supported as a design direction; exact product SLA not proven.
Reliability frontier	Larger models may reduce synthesis errors while the substrate handles memory pressure; the combination could improve both recall and reasoning.	Open; needs comparative eval against long-context frontier baselines.

Institutional Implication If the Multiplier Transfers

The strongest institutional thesis is not that every model should be made huge. It is that memory can become a substrate-level service: each tenant, operator, device, or workflow can have an isolated memory store while the active model remains comparatively small, cheap, and deployable.

Deployment frame	What changes if real	Current evidence level
Small/local models	A 3B-class model can behave as though it has access to a much larger working memory surface for selected tasks.	Measured in current v0.9 lane under ~49x pressure.
Enterprise fleets	Instead of one expensive long-context request per user, run many smaller isolated memory-backed instances and route memory by tenant/workflow.	Architectural implication; concurrency economics still need formal measurement.
Frontier models	Use substrate memory to reduce the need to spend full attention over every token in every long-context interaction.	Open hypothesis; needs A/B against native 32k/128k+ model baselines.
Operational safety	Keep memory stores separated per user or mission context, reducing cross-run contamination compared with shared ad hoc context stuffing.	Design goal; isolation audits still required.

How To Say It In A SpaceX-Grade Deck

Slide phrase	Use this wording	Avoid this wording
Core capability	Memory substrate lets a 4k-context-class small model process workloads at ~49x native-context pressure in measured v0.9 runs.	Infinite context is solved.
Economic thesis	Move part of long-memory behavior from brute-force transformer context into a substrate that curates what reaches active attention.	It is just RAG, vector search, or chunk stuffing.
Small-model value	Local/mobile-class models may gain useful long-memory behavior without becoming frontier-scale serving stacks.	A 3B model is now equivalent to a frontier model.
Large-model upside	If transfer holds, larger models could use the same substrate to extend useful memory while reducing full-attention cost pressure.	Large-model impact is already proven.

Boundaries We Should Not Overstate

Do not claim solved infinite memory, perfect recall, or universal random access. The defensible statement is narrower and stronger: the system has accepted and processed extreme prompt pressure relative to native context, and it has produced exact structured recall in measurable pockets. The current experiments are pushing that pocket toward position-independent recall.

Current Research Queue

Tail causality is waiting on the Tundra backend. It will test whether the exactness pocket follows physical tail position or record identity. v0.27 overlapping snowflake set-cover is ready on the direct route, but the live run is paused on a direct-route API-key mismatch after a single 401 response.