Hypernym Infinite Memory
Institutional brief for the v0.9 research lane: a small 4,096-context-class model is being evaluated with a memory substrate that accepts and curates prompts far beyond the native context window. This page separates measured evidence from large-model implications.
Measured results exist. The current Tundra public gateway is waiting on its backend process. A separate direct-IP route is alive but currently needs the correct direct-route API key for continued v0.27 testing.
One-Sentence Institutional Claim
Hypernym v0.9 shows that a small local/mobile-class model can operate under memory pressure tens of times larger than its native context, with measured exact structured-recall pockets produced by a substrate layer rather than by enlarging the base transformer context.
This is not a bigger-context-window claim. The observed behavior supports a memory-substrate claim: the model stays small, but the serving layer changes what survives into active attention under extreme input pressure.
Deck-Safe Language
Hypernym is a memory substrate for small models: it lets a 4,096-context-class, ~3B model accept and process workloads that create roughly 49x native-context pressure, while preserving task-relevant information through substrate curation rather than brute-force long-context attention.
What is unusual: the system is showing useful recall behavior in a compute profile that belongs closer to local/mobile inference than to frontier long-context serving. The substrate is the product: it changes the memory economics around the model.
What remains open: the exact-recall pocket has to be widened from tail-sensitive structured recall into stronger position-independent recall before anyone can claim general infinite memory.
Measured fact
A 4,096-context-class small model path accepted and returned responses under prompt pressure up to ~198,888 prompt tokens, about 48.6x native context.
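The multiplier is plain arithmetic: prompt tokens divided by the native context size. A minimal sketch, assuming the native window is exactly 4,096 tokens and using the prompt-token counts reported in the Hard Numbers table; the function and variable names are illustrative only.

```python
NATIVE_CONTEXT = 4096  # tokens; assumes the "4,096-context-class" figure is exact

# Prompt-token counts as reported in the Hard Numbers table of this brief.
measured_prompt_tokens = [88_442, 133_242, 178_042, 198_888]


def pressure_multiplier(prompt_tokens: int, native_context: int = NATIVE_CONTEXT) -> float:
    """Return prompt size as a multiple of the model's native context window."""
    return prompt_tokens / native_context


for tokens in measured_prompt_tokens:
    print(f"{tokens:>7,} tokens -> {pressure_multiplier(tokens):.1f}x native context")
```

Under that assumption the four measured points land at roughly 21.6x, 32.5x, 43.5x, and 48.6x.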
Defensible interpretation
The substrate is acting as a memory-control layer: preserving selected useful information while forcing irrelevant or lower-priority material out of the active window.
Open hypothesis
If the multiplier transfers to larger native-context models, the same substrate pattern may reduce long-context cost and improve isolated enterprise memory.
What is measured
Prompts in the ~88k, ~133k, ~178k, and ~199k token-pressure range were accepted on serving paths built around a 4k-context-class model. A strict JSON pocket-generalization run processed 825,958 prompt tokens total across completed rows.
What is meaningful
The base model is not becoming a 200k-context transformer. The substrate is acting as a memory-control layer: deciding what to preserve, compress, evict, or surface into the active attention window.
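To make the preserve, compress, and evict vocabulary concrete, here is a toy budgeted-curation policy. It is not Hypernym's algorithm, which this brief does not describe: the `Record` fields, the priority scores, and the greedy ordering are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    record_id: str
    tokens: int          # size if kept verbatim
    summary_tokens: int  # size if compressed (hypothetical)
    priority: float      # hypothetical relevance score


def curate(records: list[Record], budget: int) -> tuple[dict[str, str], int]:
    """Greedy toy policy: keep a record whole if it fits, else compress, else evict."""
    decisions: dict[str, str] = {}
    used = 0
    # Visit the highest-priority records first so they claim budget first.
    for rec in sorted(records, key=lambda r: r.priority, reverse=True):
        if used + rec.tokens <= budget:
            decisions[rec.record_id] = "preserve"
            used += rec.tokens
        elif used + rec.summary_tokens <= budget:
            decisions[rec.record_id] = "compress"
            used += rec.summary_tokens
        else:
            decisions[rec.record_id] = "evict"
    return decisions, used


# Hypothetical records competing for a 4,096-token active window.
records = [
    Record("a", 3000, 300, 0.9),
    Record("b", 2000, 200, 0.7),
    Record("c", 1500, 900, 0.4),
    Record("d", 500, 50, 0.1),
]
decisions, used = curate(records, budget=4096)
print(decisions, used)
```

The point of the sketch is only that the active window stays fixed while a policy outside the model decides what occupies it.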
What is not proven
This does not yet prove general arbitrary-position perfect recall. The strongest exact provenance behavior is currently a tail-position pocket, not universal random access.
Hard Numbers
| Run / Evidence | Measured Result | Interpretation |
|---|---|---|
| Transport liveness | 300,680 chars / 88,442 prompt tokens in 130.5s; 451,080 chars / 133,242 tokens in 197.9s; 601,480 chars / 178,042 tokens in 266.3s. | Serving path processed prompts far beyond native context with roughly stable prefill throughput, about 669 to 678 prompt tokens per second of total request time. |
| Byte threshold | 777,747 chars / 198,888 prompt tokens returned HTTP 200 at 299.38s; next compact row around 818,726 chars failed at 300.8s. | The current operational ceiling is a ~300s request wall, not a demonstrated substrate memory ceiling. |
| Pocket generalization | 9 completed response rows, 7 HTTP-200 rows, 825,958 prompt tokens total, parse success mean 1.0 on HTTP-200 rows, semantic mean 0.8333, exact provenance mean 0.2857. | Exact structured recall exists in specific regimes, but does not generalize across all target positions yet. |
| Quality floor | 20/20 HTTP-200 rows at low pressure, semantic mean 1.0, provenance mean 0.8. | The model can perform the structured task under low pressure; failures at high pressure are pressure/geometry effects, not just task misunderstanding. |
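The reading that the ceiling is a time wall rather than a memory limit can be checked against the table's own numbers. The sketch below computes tokens per second for each timed row and projects how many prompt tokens fit inside a 300-second request. The 300.0s constant is an assumption inferred from the failure at 300.8s, not a documented timeout setting.

```python
# (prompt_tokens, seconds) pairs copied from the table above.
timed_rows = [
    (88_442, 130.5),
    (133_242, 197.9),
    (178_042, 266.3),
    (198_888, 299.38),
]

REQUEST_WALL_S = 300.0  # assumed request timeout, inferred from the ~300s failure


def tokens_per_second(prompt_tokens: int, seconds: float) -> float:
    """End-to-end prompt throughput for one request."""
    return prompt_tokens / seconds


rates = [tokens_per_second(tokens, secs) for tokens, secs in timed_rows]
for (tokens, secs), rate in zip(timed_rows, rates):
    print(f"{tokens:>7,} tokens in {secs:>6.2f}s -> {rate:.1f} tokens/s")

# Project the largest prompt that would finish inside the wall at the slowest observed rate.
projected_ceiling = min(rates) * REQUEST_WALL_S
print(f"projected ceiling at {REQUEST_WALL_S:.0f}s: ~{projected_ceiling:,.0f} prompt tokens")
```

At the slowest observed rate the projection lands just above the largest accepted prompt, which is consistent with the timeout being the binding limit in these runs.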
Why This Is Different
Conventional context scaling makes the model window larger. This experiment keeps the base model small and adds a substrate that curates the oversized input into what the model can use. The result is not opaque RAG over a vector database; the memory behavior is driven through prompt/input pressure and substrate curation.
small model · substrate memory · curated attention · structured recall
Why It Matters
If this behavior can be made robust, the deployment model changes: many isolated small instances could carry useful long-memory behavior without every user session needing a frontier-scale context window. That is the cost, privacy, locality, and concurrency thesis.
local/mobile class · tenant isolation · lower memory footprint · horizontal scale
Open Question: What If This Is Applied to Large Models?
This has not been proven yet, but the institutional implication is straightforward. If a substrate can give a 4k-context-class 3B model useful behavior under ~49x pressure, then applying the same substrate to larger models may multiply already-strong reasoning and longer native windows. The right claim is not that a 128k model automatically becomes perfect at millions of tokens. The right next test is whether the substrate shifts the cost/reliability curve for large models the same way it appears to shift the curve for a small one.
| Large-model implication | Potential value | Evidence status |
|---|---|---|
| Context multiplier | A 32k, 128k, or larger context model may use substrate curation to survive workloads beyond its nominal window without full brute-force attention over everything. | Open hypothesis; not yet measured in this lane. |
| Cost shaping | Instead of scaling every request as full long-context attention, the system can attempt to preserve relevant memory geometry and reduce active attention burden. | Architectural implication; requires large-model A/B benchmarks. |
| Enterprise memory | Per-user or per-workflow memory stores could be isolated while still giving each session a long-lived recall surface. | Supported as a design direction; exact product SLA not proven. |
| Reliability frontier | Larger models may reduce synthesis errors while the substrate handles memory pressure; the combination could improve both recall and reasoning. | Open; needs comparative eval against long-context frontier baselines. |
Institutional Implication If the Multiplier Transfers
The strongest institutional thesis is not that every model should be made huge. It is that memory can become a substrate-level service: each tenant, operator, device, or workflow can have an isolated memory store while the active model remains comparatively small, cheap, and deployable.
| Deployment frame | What changes if real | Current evidence level |
|---|---|---|
| Small/local models | A 3B-class model can behave as though it has access to a much larger working memory surface for selected tasks. | Measured in current v0.9 lane under ~49x pressure. |
| Enterprise fleets | Instead of one expensive long-context request per user, run many smaller isolated memory-backed instances and route memory by tenant/workflow. | Architectural implication; concurrency economics still need formal measurement. |
| Frontier models | Use substrate memory to reduce the need to spend full attention over every token in every long-context interaction. | Open hypothesis; needs A/B against native 32k/128k+ model baselines. |
| Operational safety | Keep memory stores separated per user or mission context, reducing cross-run contamination compared with shared ad hoc context stuffing. | Design goal; isolation audits still required. |
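The isolation goal in the last row can be pictured as a store keyed by tenant, where a read only ever sees that tenant's records. This is a toy sketch of the design direction, not the product's storage layer; `TenantMemory` and its methods are hypothetical names.

```python
from collections import defaultdict


class TenantMemory:
    """Toy per-tenant memory store: reads never cross tenant boundaries."""

    def __init__(self) -> None:
        # tenant_id -> list of stored records (hypothetical layout)
        self._stores: dict[str, list[str]] = defaultdict(list)

    def write(self, tenant_id: str, record: str) -> None:
        self._stores[tenant_id].append(record)

    def read(self, tenant_id: str) -> list[str]:
        # Return a copy so callers cannot mutate another session's view.
        return list(self._stores.get(tenant_id, []))


memory = TenantMemory()
memory.write("tenant-a", "mission log entry 1")
memory.write("tenant-b", "invoice batch 7")
print(memory.read("tenant-a"))
print(memory.read("tenant-c"))
```

A real isolation audit would also have to cover caches, logs, and any shared model state, none of which this sketch models.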
How To Say It In A SpaceX-Grade Deck
| Slide phrase | Use this wording | Avoid this wording |
|---|---|---|
| Core capability | Memory substrate lets a 4k-context-class small model process workloads at ~49x native-context pressure in measured v0.9 runs. | Infinite context is solved. |
| Economic thesis | Move part of long-memory behavior from brute-force transformer context into a substrate that curates what reaches active attention. | It is just RAG, vector search, or chunk stuffing. |
| Small-model value | Local/mobile-class models may gain useful long-memory behavior without becoming frontier-scale serving stacks. | A 3B model is now equivalent to a frontier model. |
| Large-model upside | If transfer holds, larger models could use the same substrate to extend useful memory while reducing full-attention cost pressure. | Large-model impact is already proven. |
Boundaries We Should Not Overstate
Do not claim solved infinite memory, perfect recall, or universal random access. The defensible statement is narrower and stronger: the system has accepted and processed extreme prompt pressure relative to native context, and it has produced exact structured recall in measurable pockets. The current experiments are pushing that pocket toward position-independent recall.
Current Research Queue
Tail causality is waiting on the Tundra backend. It will test whether the exactness pocket follows physical tail position or record identity. v0.27 overlapping snowflake set-cover is ready on the direct route, but the live run is paused on a direct-route API-key mismatch after a single 401 response.
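One way to frame the tail-causality question is as a set of prompt orderings in which only the target record's position changes. The sketch below is an assumed experimental design, not the queued harness: the helper names, the three positions, and the interpretation rule are illustrative.

```python
def place_target(record_ids: list[str], target_id: str, position: int) -> list[str]:
    """Return an ordering with the target at `position`, other records in original order."""
    others = [r for r in record_ids if r != target_id]
    return others[:position] + [target_id] + others[position:]


def tail_causality_variants(record_ids: list[str], target_id: str) -> dict[str, list[str]]:
    """Build orderings that hold record identity fixed while varying position."""
    n = len(record_ids)
    return {
        "target_at_tail": place_target(record_ids, target_id, n - 1),
        "target_at_middle": place_target(record_ids, target_id, (n - 1) // 2),
        "target_at_head": place_target(record_ids, target_id, 0),
    }


def interpret(exact_by_variant: dict[str, bool]) -> str:
    """Hypothetical decision rule over per-variant exact-recall outcomes."""
    if all(exact_by_variant.values()):
        return "follows record identity"
    moved = [ok for name, ok in exact_by_variant.items() if name != "target_at_tail"]
    if exact_by_variant.get("target_at_tail") and not any(moved):
        return "follows physical tail position"
    return "inconclusive"


variants = tail_causality_variants(["r0", "r1", "r2", "r3", "r4"], target_id="r1")
for name, order in variants.items():
    print(name, order)
```

If exact recall holds only in the tail variant, the pocket is positional; if it holds wherever the same record is placed, it tracks record identity.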