Post-Transformer Hackathon · Pathway × IIT Ropar
Live demo counters: Tokens processed · Avg BDH activation · Synapses probed

Same token.
Completely different neural response.

Type any word and watch BDH fire ~5% of neurons while a Transformer activates nearly all of them — on the exact same input.
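The contrast the demo animates can be sketched in a few lines. This is a toy comparison, not the real BDH model: `n_neurons`, the top-5% thresholding trick, and the dense tanh baseline are all illustrative stand-ins for sparse ReLU vs. dense activations.

```python
# Toy sketch (not the real BDH model): compare activation density of a
# sparse, thresholded ReLU layer vs. a dense nonlinearity on the same input.
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 128
x = rng.normal(size=64)                 # stand-in embedding for one token

W = rng.normal(size=(n_neurons, 64))
pre = W @ x                             # pre-activations, shared by both "models"

# "BDH-style" layer: ReLU with a high firing threshold -> only ~5% fire.
bdh_act = np.maximum(pre - np.quantile(pre, 0.95), 0.0)

# "Transformer-style" layer: smooth nonlinearity -> nearly all neurons nonzero.
dense_act = np.tanh(pre)

bdh_density = np.mean(bdh_act > 0)                  # roughly 0.05
dense_density = np.mean(np.abs(dense_act) > 1e-6)   # roughly 1.0
```

Same `pre` vector in both branches, so the difference in density comes entirely from the nonlinearity, which is the point the demo makes.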

Path A: Visualization · Section 6.4 of the BDH Paper · Sparse ReLU Activations · Interpretability by Design
Input token
Dragon Hatchling (BDH)
Post-transformer architecture · Pathway
~5%
neurons fired
activation rate
Transformer (GPT-style)
Dense matrix attention · Current standard
~95%
neurons fired
activation rate
Top Active Neurons
BDH sparse activations
Active neurons: — / — (—%)
Silent neurons: — / — (—%)
Transformer would activate: ~95% (—) neurons
Syntax/structure — neurons 0–10
Semantic meaning — neurons 11–25
Currency / numbers — neurons 26–40
Geography / places — neurons 41–55
Medical / biology — neurons 56–70
Language pattern — neurons 71–90
Abstract reasoning — neurons 91+
Legend
Active — BDH
Active — Transformer
Silent neuron
Activation density across layers
% of neurons firing at each depth — same input token
BDH
Transformer
Hebbian Memory — σ Matrix
Synaptic state for the last matched token
Each cell shows how strongly two neurons wired together while processing this token. Brighter = stronger synaptic bond. This is Hebbian learning made visible.
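The update behind this panel can be sketched as a classic Hebbian outer-product rule. This is a minimal illustration of "fire together, wire together"; `sigma`, `eta`, and the activation pattern are illustrative, not the paper's exact formulation.

```python
# Minimal Hebbian sketch: co-active neurons strengthen their shared synapse.
# sigma, eta, and the firing pattern are illustrative, not from the BDH paper.
import numpy as np

n = 8
sigma = np.zeros((n, n))   # synaptic state (the sigma matrix in the panel)
eta = 0.5                  # learning rate

# Sparse activation for one token: only neurons 0 and 3 fire.
y = np.zeros(n)
y[0] = y[3] = 1.0

# "Fire together, wire together": the outer product brightens exactly the
# cells linking co-active neurons, i.e. (0,3) and (3,0).
sigma += eta * np.outer(y, y)
```

Cells between a firing and a silent neuron stay at zero, which is why the heatmap stays mostly dark under sparse activations.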
Memory History

Insight cards

Why sparsity matters for efficiency
Fewer active neurons = less computation per token. BDH's ~5% activation rate means roughly 19× fewer neuron computations compared to a dense transformer on the same input.
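The arithmetic behind the 19× figure is just the ratio of the two activation densities:

```python
# Back-of-envelope behind the "19x fewer computations" claim: per-token
# compute scales with the fraction of neurons that actually fire.
bdh_density = 0.05
transformer_density = 0.95
speedup = transformer_density / bdh_density  # 0.95 / 0.05 = 19.0
```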
🔬
Why sparsity enables interpretability
When only 5% of neurons fire, you can actually inspect them. The BDH paper demonstrates "currency synapses" and "country synapses" — single synapses that encode one concept, consistently, across languages.
🧠
This isn't pruning or distillation
Transformer sparsity requires post-hoc tricks (L1 regularization, pruning, distillation). BDH's sparsity emerges naturally from its architecture — sparse ReLU activations are built into the design, not forced.

Architecture comparison at a glance

Property | Transformer | BDH (Dragon Hatchling)
Activation density | ~95–100% of neurons fire | ~5% of neurons fire
Memory mechanism | KV-cache (grows with context) | Hebbian synapses (constant size)
Attention complexity | O(T²) — quadratic | O(T) — linear
Interpretability | Black box, polysemantic neurons | Graph structure, monosemantic synapses
Learning after training | Frozen weights, no adaptation | Inference-time Hebbian updates
Structure | Dense matrix layers | Scale-free graph of neurons
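The memory row of the table above can be sketched numerically: a KV-cache stores two vectors per token, so it grows linearly with context length T, while a fixed synapse matrix does not grow at all. `d_model` and `n_neurons` here are made-up sizes, not BDH's real dimensions.

```python
# Toy float counts for the two memory mechanisms in the comparison table.
# d_model and n_neurons are illustrative sizes, not BDH's real dimensions.
d_model, n_neurons = 64, 128

def kv_cache_floats(T: int) -> int:
    # Transformer: one key + one value vector cached per token seen.
    return T * 2 * d_model

def hebbian_state_floats(T: int) -> int:
    # BDH-style: one n x n synapse matrix, however many tokens were seen.
    return n_neurons * n_neurons
```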

Synapse Inspector

In BDH, individual synapses reliably encode specific concepts — this is called monosemanticity. Transformer neurons, by contrast, are typically polysemantic: one neuron entangles many unrelated concepts, so no comparable single-synapse readout exists.
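A toy version of how monosemanticity can arise: if sparse activations are concept-aligned, Hebbian updates concentrate a concept's evidence onto a handful of synapses. The word lists and neuron indices below are invented for illustration and are not taken from the BDH paper.

```python
# Toy monosemanticity sketch: with sparse, concept-aligned activations,
# one neuron's synapses end up tracking one concept. All indices and word
# lists here are invented for illustration, not from the BDH paper.
import numpy as np

n = 64
sigma = np.zeros((n, n))

# Each word activates a shared "concept" neuron plus a word-specific one.
activations = {
    "dollar": (26, 50), "euro": (26, 51), "yen": (26, 52),  # currency -> 26
    "paris": (41, 53), "tokyo": (41, 54),                   # places   -> 41
}

for i, j in activations.values():
    y = np.zeros(n)
    y[i] = y[j] = 1.0
    sigma += np.outer(y, y)

# Neuron 26's synapses accumulated strength only from currency words,
# and currency and place neurons never wired together.
```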

Probe a word

Continuous Memory

BDH learns at inference time via Hebbian updates. The BDH panel strengthens synapses and stores facts instantly, while the Transformer panel remains fixed and cannot learn without retraining.
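The "teach a fact" flow can be sketched as a Hebbian associative memory: writing a key→value pair strengthens synapses in place, with no gradient retraining. The sparse codes, the `teach`/`recall` helpers, and the fact itself are invented for illustration.

```python
# Sketch of the "teach BDH a fact" panel: a Hebbian associative memory that
# stores a key -> value pair at inference time, with no gradient retraining.
# The sparse codes, helper names, and the fact are invented for illustration.
import numpy as np

n = 32

def code(idxs):
    """Sparse binary code: a few active neurons represent one concept."""
    v = np.zeros(n)
    v[list(idxs)] = 1.0
    return v

codes = {
    "capital of Atlantis": code([2, 7, 19]),
    "Coralton":            code([4, 11, 23]),
}

sigma = np.zeros((n, n))

def teach(key, value):
    # Hebbian write: wire the key's active neurons to the value's neurons.
    sigma[:] += np.outer(codes[key], codes[value])

def recall(key):
    # Readout: project the key through the synapses, return the best match.
    out = sigma.T @ codes[key]
    return max(codes, key=lambda w: float(codes[w] @ out))

teach("capital of Atlantis", "Coralton")
```

A frozen-weights model has no analogous write path, which is why the Transformer panel can only answer from what it saw during training.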

Teach BDH a fact

BDH

Synapse strength: idle — teach a fact to watch synapses strengthen

Transformer

Cannot learn at inference time
Requires full retraining to incorporate new facts.
Sample response: "I don't know."