# NerfEngine Dev Log — March 13, 2026
…you sit down to fix one thing and end up building something that feels genuinely different from what existed before. We started with a live Ollama instance running `gemma3` on an RTX 3060 and ended up with a fully autonomous network threat intelligence pipeline — kernel-to-LLM — running on a single AlmaLinux 10 server.
Here’s what’s up.
---
Me: Please create a Blog Post for our latest work here today at /home/spectrcyde/NerfEngine/Update Blog Post 03132026.md
ClaudeCode: ◐ The user wants a blog post about today’s work. Let me create an engaging technical blog post covering everything we implemented today:
- GraphOps Copilot (graphops_copilot.py) – autonomous LLM investigation agent
- Attractor loop fix (confidence threshold, plateau exit, best_confidence tracking)
- GraphOps Autopilot (graphops_autopilot.py) – 4-tier autonomous alert system
- Three upgrades from Gemma_Llama_MCP.md:
  - Detector cooldowns
  - HypergraphMetricsCollector
  - GraphAttractorDetector (rotating botnet detection)
## Where We Left Off
NerfEngine already had significant pieces in place:
- **Phase A** — `TemporalFanInDetector`: detects VPN-rotating botnets by measuring *timing synchronisation* rather than IP reputation. The insight: rotating proxies randomise IPs, but they can't hide their task scheduler ticking in unison. A botnet hitting a login endpoint from 183 unique IPs in a 200ms window is statistically impossible without central coordination.
- **Phase B** — Kernel-side hypergraph edge emission: `struct graph_edge_event` (56 bytes) emitted directly from the XDP path. The kernel computes 64-bit node IDs via FNV-1a over `__u32 src_ip`. Userspace never touches raw IP tuples on the hot path.
- **Stage 6 demo** — all six scenarios validated clean: `DRIFT[SCANNER] delta=60`, `FAN-IN[BOTNET_COORDINATION] srcs=120 ip_H=6.91 timing_H=-0.00`.
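To make the Phase A intuition concrete, here's a toy sketch of the timing-entropy idea (illustrative only: the bucket size is arbitrary and the real `TemporalFanInDetector` internals differ):

```python
import math
from collections import Counter

def timing_entropy(timestamps_ms, bucket_ms=10):
    """Shannon entropy of arrival-time buckets. Near zero => synchronised.
    A coordinated botnet collapses into one or two buckets; organic traffic
    spreads across many."""
    buckets = Counter(int(t // bucket_ms) for t in timestamps_ms)
    total = sum(buckets.values())
    return -sum((c / total) * math.log2(c / total) for c in buckets.values())

# 183 sources all landing inside one 10ms bucket: entropy collapses to 0.0
synced = [100.0 + (i % 5) * 0.5 for i in range(183)]
print(timing_entropy(synced))  # -> 0.0 (the timing_H=-0.00 signature above)

# 16 arrivals spread evenly across 16 buckets: entropy = log2(16) = 4.0
print(timing_entropy([i * 10.0 for i in range(16)]))
```

Randomised IPs don't change this number at all, which is the whole point.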
What we hadn’t built yet: the intelligence layer. The part that *explains* what the detectors found, and eventually starts asking its own questions.
---
## Part 1 — GraphOps Copilot (graphops_copilot.py, 1,380 lines)
### The Problem We Solved First
There’s a classic failure mode when you put an LLM in front of a graph query interface: the model returns the *example values from its own few-shot prompt* instead of the user’s actual entity.
Ask “What do you know about 200.36.135.121?” and you get:
```
labels.ip = "10.0.0.1"
```
That’s the prompt example. The model retrieved it and repeated it. This is the **query-generation attractor loop bug** — named after the mathematical concept of an attractor state that a system collapses into and can’t escape without an external push.
The fix required three pieces:
**EntityExtractor** — strict regex extraction that runs *before* any LLM call, pulling IPv4, CIDR, ASN, domain, port, and `node:0x…` node IDs out of free text. A compile-time `EXAMPLE_VALUES` set strips hallucinated placeholders before they ever reach the DSL. If the extracted entity matches `"10.0.0.1"` or `"host:session:abc123"`, it gets stripped. Clean or nothing.
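A minimal sketch of the extract-then-strip idea (the regexes and `EXAMPLE_VALUES` entries here are an illustrative subset, not the real module's):

```python
import re

# Illustrative subset of the extraction patterns; the real module also
# covers CIDR, ASN, domain, and port.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
NODE_RE = re.compile(r"\bnode:0x[0-9a-fA-F]+\b")

# Placeholder values that only ever appear in few-shot prompt examples.
EXAMPLE_VALUES = {"10.0.0.1", "host:session:abc123"}

def extract_entities(text: str) -> list[str]:
    """Regex-first entity extraction: runs before any LLM call and drops
    anything matching a known prompt placeholder. Clean or nothing."""
    found = IPV4_RE.findall(text) + NODE_RE.findall(text)
    return [e for e in found if e not in EXAMPLE_VALUES]

print(extract_entities("What do you know about 200.36.135.121 and 10.0.0.1?"))
# -> ['200.36.135.121']
```

The placeholder never survives extraction, so the attractor has nothing to latch onto.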
**InvestigativeDSLExecutor** — a stateful verb interpreter with 9 verbs:
```
FOCUS      — set investigation target
WINDOW     — set time window (e.g. "200ms", "5s")
FILTER     — prune results by condition
ANALYZE    — run a named analysis (degree_delta, fanin, temporal_sync…)
CLUSTER    — group by timing or ASN
TRACE      — path tracing between nodes
EXPAND     — neighbor expansion
SUMMARIZE  — aggregate current focus
ASSESS     — produce threat confidence score
```
The executor is *stateful across steps*: `FOCUS 10.99.0.1` persists through subsequent verbs. The LLM doesn’t need to restate the target on every line.
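The statefulness can be sketched as a toy interpreter (not the real `InvestigativeDSLExecutor`; verbs here just log against the persisted focus/window):

```python
class DSLState:
    """Minimal stateful verb interpreter: FOCUS and WINDOW persist across
    steps, so later verbs don't have to restate the target."""
    def __init__(self):
        self.focus = None
        self.window = None
        self.log = []

    def execute(self, plan: list[str]):
        for line in plan:
            verb, _, arg = line.partition(" ")
            if verb == "FOCUS":
                self.focus = arg
            elif verb == "WINDOW":
                self.window = arg
            else:
                # Every other verb runs against the persisted focus/window.
                self.log.append((verb, arg, self.focus, self.window))

state = DSLState()
state.execute(["FOCUS 10.99.0.1", "WINDOW 200ms", "ANALYZE fanin"])
state.execute(["ANALYZE temporal_sync"])  # focus persists: no FOCUS needed
print(state.log[-1])  # -> ('ANALYZE', 'temporal_sync', '10.99.0.1', '200ms')
```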
**GraphOpsAgent** — the Ollama-backed reasoning loop. System prompt explicitly lists allowed verbs and says *"NEVER output example placeholder values"*. Model generates a JSON plan; executor runs it; a second LLM call interprets results and emits a confidence score. Loop continues until `confidence >= 0.80`, attractor detected, or plateau.
### The Attractor Loop Fix — Live-Tested Against llama3.2:3b
Running against live Ollama (`llama3.2:3b`, RTX 3060, 27/27 layers GPU), the IP query demonstrated the exact attractor behavior: identical plan repeated all 6 steps, confidence oscillating 0.80 -> 0.80 -> 0.80 -> 0.50 -> 0.80 -> **0.20**. Exits reporting 0.20 — the final-step jitter, not even the best value seen.
Three fixes applied:
```python
seen_plans: set = set()   # frozensets of plan verbs
best_confidence = 0.0     # track max seen, not final jitter

# Attractor break: same plan seen before + confidence good enough -> stop
if frozenset(plan) in seen_plans and best_confidence >= PLATEAU_THRESHOLD:
    confidence = best_confidence
    break

# Plateau exit: N consecutive non-improving steps >= 0.65 -> stop
if confidence == prev_confidence:
    plateau_count += 1
    if plateau_count >= PLATEAU_STEPS and best_confidence >= PLATEAU_THRESHOLD:
        break
```
After the fix: the same IP query exits after **1 step** at confidence=0.80.
```
[GraphOpsAgent] step 0 plan: ['WINDOW 200ms', 'FOCUS ipv4',
    "FILTER ip_address = '200.36.135.121'", 'ANALYZE fanin']
[GraphOpsAgent] step 0 confidence=0.80 (best=0.80)
[GraphOpsAgent] confidence threshold reached at step 0
```
Six round-trips reduced to one. The attractor loop is dead.
---
## Part 2 — GraphOps Autopilot (graphops_autopilot.py, 758 lines)
The copilot answers questions. The autopilot *asks them on its own*.
The design requirement: **LLM only runs when genuinely needed**. The classic failure mode of autonomous LLM systems is the alarm storm — the model fires on everything, floods the analyst queue with noise, and gets disabled within a week. The fix is deterministic anomaly detection *first*, LLM explanation *second*.
```
kernel telemetry
        |
statistical detectors    <- DO NOT skip this
        |
LLM explanation layer    <- only when detectors agree
```
### The Four-Tier System
```python
TIER_OBSERVATION = 0.60  # internal log only
TIER_SUGGESTION  = 0.70  # suggestion queue (analyst pull)
TIER_ALERT       = 0.80  # auto-emit EventCard (no human needed)
TIER_AUTONOMOUS  = 0.90  # wake InvestigatorAgent (LLM runs)
```
**AlertDedup** — blake2b hash of `(source_type, pattern, sorted_node_ids, window_bucket)` with 30-second TTL. `window_ms // 1000` buckets absorb inter-detector jitter without suppressing genuinely new events.
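The key construction looks roughly like this (a sketch from the description above; the digest size and exact payload layout are assumptions):

```python
import hashlib
import time

def dedup_key(source_type, pattern, node_ids, window_ms):
    """blake2b over (source, pattern, sorted node IDs, 1s window bucket).
    Bucketing by window_ms // 1000 absorbs inter-detector jitter."""
    bucket = window_ms // 1000
    payload = f"{source_type}|{pattern}|{sorted(node_ids)}|{bucket}"
    return hashlib.blake2b(payload.encode(), digest_size=16).hexdigest()

class AlertDedup:
    """TTL-based suppression: a key seen within the TTL is a duplicate."""
    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._seen = {}  # key -> first-seen timestamp

    def is_duplicate(self, key, now=None):
        now = time.time() if now is None else now
        # Expire stale keys, then check membership.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
        if key in self._seen:
            return True
        self._seen[key] = now
        return False
```

Sorting the node IDs means detector ordering doesn't matter; bucketing the window means two detectors firing 700ms apart on the same cluster still collide.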
**EventCard** — the intelligence briefing object analysts actually want:
```
GraphOps Alert
--------------------------------------------------
Pattern          botnet_coordination
Nodes            94
Window           180ms
Confidence       0.84
Temporal Sync    0.82
IP Entropy       6.91
Suggested Actions
  > Cluster timing
  > Analyze ASN diversity
  > Trace infrastructure paths
  > Compare with previous clusters
```
**SentinelLoop** — subscribes to detector callbacks via `.subscribe(cb)`. Fully push-driven. At 20-60 million edges/sec, a polling loop would be a CPU disaster. The detector fires, the callback runs, the router decides. The 200ms heartbeat thread keeps the process alive and does nothing else.
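The push-driven wiring is nothing more than a callback list (a toy sketch of the subscribe pattern, not the real `SentinelLoop`):

```python
class Detector:
    """Any detector exposing .subscribe(cb); fires callbacks synchronously
    on detection, so there is no polling loop anywhere."""
    def __init__(self):
        self._subs = []

    def subscribe(self, cb):
        self._subs.append(cb)

    def fire(self, alert):
        for cb in self._subs:
            cb(alert)

received = []
det = Detector()
det.subscribe(received.append)  # the sentinel's routing callback
det.fire({"pattern": "scanner", "score": 0.72})
print(received)  # -> [{'pattern': 'scanner', 'score': 0.72}]
```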
**InvestigatorAgent** — Tier 3 only. Lazy-loads `GraphOpsAgent`. Builds a context-aware question from the `EventCard`, then runs the full investigation loop. A scanner card generates: *"Node X shows a degree delta of 60 connections in the last window. Pattern: scanner. Is this a port scan or lateral movement?"*
### Scoring Functions
The scoring encodes domain knowledge directly. For `FanInAlert`:
```python
base = 0.92 if verdict == "botnet_coordination" else 0.72
penalty = min(0.15, timing_entropy * 0.10)
score = base - penalty
```
- Botnet with timing_entropy=0.01 (near-perfect sync): score=0.919 -> **Tier 3 autonomous investigation**
- Fan-in spike with timing_entropy=2.0 (unsynchronised): score=0.57 -> **discarded**
That second case is intentional. Unsynchronised fan-in is a CDN, a flash crowd, or a viral tweet. Discarding it before it ever touches the dedup layer is correct.
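Plugging the two examples back into the scoring rule confirms both numbers (a direct re-computation of the formula above; the non-botnet verdict string is illustrative):

```python
def fanin_score(verdict: str, timing_entropy: float) -> float:
    """The FanInAlert scoring rule as written above."""
    base = 0.92 if verdict == "botnet_coordination" else 0.72
    penalty = min(0.15, timing_entropy * 0.10)
    return base - penalty

# Near-perfect sync: 0.92 - 0.001 -> Tier 3
print(round(fanin_score("botnet_coordination", 0.01), 3))  # -> 0.919
# Unsynchronised spike: 0.72 - 0.15 -> below every tier, discarded
print(round(fanin_score("fan_in_spike", 2.0), 2))          # -> 0.57
```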
### Self-Test Results: 24/24
```
[PASS] EventCard format_card rendering
[PASS] AlertDedup TTL, key collision, expiry
[PASS] TierRouter all five thresholds
[PASS] Scoring: drift passthrough, botnet -> Tier 3, spike -> discard
[PASS] SentinelLoop: scanner queued, botnet dispatched, dedup suppression
[PASS] GraphOpsAutopilot start/stop lifecycle
```
---
## Part 3 — Three Upgrades from Code Review
After reviewing the autopilot, the design doc came back with a code review — and it was genuinely useful. Here’s what it recommended and what we built.
### Upgrade 1 — Detector Cooldowns
A noisy detector can still produce high-frequency alerts even with dedup, because each new window produces new node IDs that hash differently. Solution: per-`(source_type, pattern)` cooldown timestamp.
```python
DETECTOR_COOLDOWN = 1.0  # seconds

cooldown_key = (source_type, pattern)
if time.time() - self._last_emit.get(cooldown_key, 0.0) < DETECTOR_COOLDOWN:
    return  # suppress burst
self._last_emit[cooldown_key] = time.time()
```
One line of state, one comparison, prevents alert bursts entirely.
### Upgrade 2 — HypergraphMetricsCollector
A shared `HypergraphSnapshot` computed once per window by a background thread:
```python
from dataclasses import dataclass

@dataclass
class HypergraphSnapshot:
    cluster_density: float   # edges / (nodes*(nodes-1))
    edge_churn: float        # fraction of edges not seen in prev window
    temporal_sync: float     # 1 - normalised inter-arrival timing entropy
    asn_entropy: float       # H(src_node_id) across all edges
    top_fan_in_rate: float   # max fan-in per dst in this window
    active_attractors: int   # live GraphAttractor cluster count
```
All three detectors can reference this snapshot instead of recomputing the same Shannon entropy calculations independently. At high edge rates this matters.
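The shared computation amounts to: compute the expensive aggregates once per window, let every detector read them. A simplified stand-in (field derivations here are sketches, not the real collector's):

```python
import math
from collections import Counter
from dataclasses import dataclass

def shannon_entropy(values) -> float:
    """Shannon entropy in bits over a multiset of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

@dataclass
class WindowSnapshot:
    """Computed once per window; detectors read it, none recompute.
    (Simplified stand-in for HypergraphSnapshot above.)"""
    cluster_density: float
    asn_entropy: float

def build_snapshot(edges):
    """edges: list of (src_node_id, dst_node_id) pairs in this window."""
    nodes = {n for e in edges for n in e}
    n = len(nodes)
    density = len(edges) / (n * (n - 1)) if n > 1 else 0.0
    return WindowSnapshot(
        cluster_density=density,
        asn_entropy=shannon_entropy(src for src, _ in edges),
    )

snap = build_snapshot([(1, 9), (2, 9), (3, 9), (4, 9)])
print(round(snap.asn_entropy, 2))  # 4 distinct sources -> 2.0
```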
### Upgrade 3 — GraphAttractorDetector (The Important One)
Fan-in detection and degree drift detection share a blind spot: they key on *nodes*. A VPN-rotating botnet changes its source nodes — IPs, ASNs, geos — constantly. By the time the detector builds state on one IP, it’s already gone.
**Graph attractors** key on *edge structure* instead. The observation: rotating botnets change their source nodes but they all hit the *same destination infrastructure*. The `edge_hash` is stable even when every `node_a` rotates.
```python
# Tracks edge_hash -> {src_node_ids} over 5-second sliding windows.
# Fires AttractorAlert when:
#   cluster_size > 40            (enough distinct sources)
#   infrastructure_entropy > 0.8 (sources genuinely diverse, not one IP looping)
#   temporal_sync > 0.75         (coordinated timing)
verdict = "rotating_botnet"
```
To defeat this detector, an attacker would need to either:
1. Use a *different* target on every connection — which defeats the botnet’s actual objective
2. Desynchronise their task scheduling — which defeats coordination
Neither is viable. The attractor is the signal, and it’s extremely robust.
```
ATTRACTOR [rotating_botnet] edge_hash=0x… cluster=94
    infra_H=6.91 t_sync=0.82 score=0.917 -> Tier 3
```
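The core state machine can be sketched like this (thresholds copied from the comment block above; the cluster-tracking shape and alert interface are illustrative):

```python
from collections import defaultdict

class AttractorSketch:
    """Toy version of the edge-structure tracker: edge_hash -> set of
    source node IDs, fired when the cluster is large, diverse, and synced."""
    CLUSTER_MIN = 40         # enough distinct sources
    INFRA_ENTROPY_MIN = 0.8  # sources genuinely diverse
    SYNC_MIN = 0.75          # coordinated timing

    def __init__(self):
        self.clusters = defaultdict(set)  # edge_hash -> {src_node_id}

    def observe(self, edge_hash, src_node_id):
        self.clusters[edge_hash].add(src_node_id)

    def check(self, edge_hash, infra_entropy, temporal_sync):
        cluster = self.clusters[edge_hash]
        if (len(cluster) > self.CLUSTER_MIN
                and infra_entropy > self.INFRA_ENTROPY_MIN
                and temporal_sync > self.SYNC_MIN):
            return "rotating_botnet"
        return None

det = AttractorSketch()
for src in range(94):          # 94 rotating sources...
    det.observe(0xBEEF, src)   # ...all hitting the same edge_hash
print(det.check(0xBEEF, infra_entropy=6.91, temporal_sync=0.82))
# -> rotating_botnet
```

Note that `observe` never looks at the source identity beyond set membership: rotation grows the cluster instead of resetting it.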
---
## Part 4 — BPF Edge Aggregation (bpf_capture.c + capture.go)
The performance math:
```
Raw packets:   100M/sec x 100B/event = 10 GB/sec telemetry
Phase B edges: 100M/sec x  56B/event = 5.6 GB/sec
Aggregated:     ~2M/sec x  56B/event = 112 MB/sec
```
The edge aggregation trick reduces telemetry volume by ~98% before any data leaves the kernel.
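The arithmetic checks out (a quick re-computation of the table above):

```python
GB = 1e9

raw        = 100e6 * 100  # packets/sec * bytes/event
phase_b    = 100e6 * 56
aggregated = 2e6 * 56

print(raw / GB)          # -> 10.0 (GB/sec)
print(phase_b / GB)      # -> 5.6
print(aggregated / 1e6)  # -> 112.0 (MB/sec)
print(round(1 - aggregated / phase_b, 2))  # -> 0.98, the ~98% reduction
```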
**Kernel side** — new `edge_agg` LRU hash map (128K entries, keyed by `edge_hash`):
```c
struct edge_counter {
    __u64 count;     // packet count for this edge_hash
    __u64 bytes;     // total bytes
    __u64 first_ts;  // ktime of first packet this flush window
    __u64 last_ts;   // ktime of most recent packet
};
```
On every packet in the XDP path: if entry exists, atomic_add count and bytes. If not, insert. That’s all. No ringbuf write, no memory copy, no wakeup to userland.
**Go relay side** — new 50ms ticker goroutine added to `capture.go`:
```go
ticker := time.NewTicker(50 * time.Millisecond)
// Every 50ms: iterate edge_agg map, emit frame 0x05, delete entries
```
Fifty milliseconds of accumulation. At 100M pkt/sec, that’s 5 million packets compressed into however many distinct edge_hash values appear in that window — typically 1-5% of packet count.
---
## Part 5 — AlmaLinux 10 Kernel Tuning (scripts/tune_kernel.sh, 238 lines)
A properly-tuned kernel for XDP/eBPF looks very different from a default AlmaLinux install. The biggest wins:
```bash
# Socket buffer — handle 128MB bursts without dropping
net.core.rmem_max = 134217728

# NIC -> kernel queue depth — don't drop at the NIC
net.core.netdev_max_backlog = 300000

# Busy-poll — reduces latency on the XDP receive path
net.core.busy_read = 50    # us spin before sleeping
net.core.busy_poll = 50    # us spin on NIC before sleeping

# BPF JIT — without this BPF runs interpreted, 3-5x slower
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0    # disable hardening in prod for speed

# Huge pages — ring buffer performance headroom
vm.nr_hugepages = 512    # 1 GB of 2MB pages
```
The script also handles NIC IRQ affinity pinning (`--nic <iface>`): one XDP worker per RX queue, NIC queue N -> CPU N. Disables `irqbalance` so the manual pins hold. Sets CPU performance governor. `--dry-run` and `--revert` flags. Idempotent — safe to run repeatedly.
---
## The Full Stack After Today
```
+------------------------------------------------------------------+
| AlmaLinux 10 · RTX 3060 · 62.8 GB                                |
+------------------------------------------------------------------+
| XDP / eBPF (bpf_capture.c)                                       |
|   +- flow_table     LRU hash 1M flows                            |
|   +- edge_agg       LRU hash 128K entries — 50ms flush     [NEW] |
|   +- rb             ringbuf -> 0x00 FlowCore                     |
|   +- flow_events    ringbuf -> 0x02 FlowEnd                      |
|   +- edge_ticks     ringbuf -> 0x03 EdgeTick                     |
|   +- graph_edges    ringbuf -> 0x04 GraphEdge (OPEN/CLOSE)       |
+------------------------------------------------------------------+
| Go Relay (capture.go)                                            |
|   5 ringbuf drain goroutines                                     |
|   1 edge_agg flush ticker (50ms) -> 0x05 EdgeAgg           [NEW] |
|   WebSocket broadcast                                            |
+------------------------------------------------------------------+
| Python Pipeline (stream_manager.py)                              |
| topology_drift.py                                                |
|   +- TopologyDriftDetector      scanner / aggregator / lateral   |
|   +- TemporalFanInDetector      botnet timing sync               |
|   +- GraphAttractorDetector     rotating botnet            [NEW] |
|   +- HypergraphMetricsCollector shared window metrics      [NEW] |
+------------------------------------------------------------------+
| GraphOps Autopilot (graphops_autopilot.py)                 [NEW] |
|   SentinelLoop        push-driven, 3 detector subscriptions      |
|   AlertDedup          blake2b hash · 30s TTL                     |
|   TierRouter          0.60 / 0.70 / 0.80 / 0.90                  |
|   EventCard           analyst briefing object + format_card()    |
|   InvestigatorAgent   Tier 3 -> GraphOpsAgent                    |
+------------------------------------------------------------------+
| GraphOps Copilot (graphops_copilot.py)                     [NEW] |
|   EntityExtractor          regex + EXAMPLE_VALUES guard          |
|   InvestigativeDSLExecutor 9 verbs, stateful                     |
|   GraphOpsAgent            attractor-proof loop · llama3.2:3b    |
+------------------------------------------------------------------+
| MCP Server — 11 tools total                                      |
|   graphops_investigate        natural language -> report         |
|   graphops_dsl_exec           raw DSL plan execution             |
|   graphops_entity_parse       entity extraction                  |
|   graphops_autopilot_status   runtime stats                [NEW] |
|   graphops_suggestion_queue   Tier 1 queue                 [NEW] |
|   graphops_observation_log    Tier 0 log                   [NEW] |
|   graphops_format_card        render event card            [NEW] |
+------------------------------------------------------------------+
```
---
## By The Numbers
| File | Lines | Status |
|---|---|---|
| graphops_copilot.py | 1,380 | 5/5 tests pass · live Ollama validated |
| graphops_autopilot.py | 758 | 24/24 tests pass |
| topology_drift.py | 912 | 3 new classes · smoke tests pass |
| bpf_capture.c | 426 | pending make bpf |
| capture.go | 1,000 | pending go build |
| scripts/tune_kernel.sh | 238 | syntax valid |
**~2,800 lines of new code today across 6 files.**
All Python tests pass. BPF/Go changes pending hardware compile.
---
## What’s Next
**Immediate** — `FRAME_TAG_EDGE_AGG = 0x05` handler in `stream_manager.py` and `EdgeAggEvent` in `FlowCore.py`. Same pattern as all other frame types.
**Interesting** — Wire `HypergraphSnapshot` into `GraphAttractorDetector` so the shared metrics precomputation replaces the attractor’s own per-window entropy calls.
**Exciting** — In-kernel botnet detection. Move fan-in counters and temporal burst tracking into BPF maps. Emit `BOTNET_CLUSTER` events directly from the XDP path. The kernel becomes the first anomaly detector. Userland receives only high-value signals. That’s the architecture telecom operators quietly run for abuse detection at hyperscale.
**Wild** — Hypergraph spectral fingerprinting. Compute eigenvalue signatures of traffic clusters. Detects VPN-rotating botnets even when every node and IP changes every minute, by analysing the structural fingerprint of the traffic graph rather than any individual node property.
---
*NerfEngine is a personal research project combining eBPF/XDP network telemetry, hypergraph analysis, and autonomous LLM-backed threat detection. The architecture described here — kernel-side graph primitive emission, timing-entropy botnet detection, graph attractor analysis, and attractor-proof LLM agent loops — is a combination of techniques I haven't seen together elsewhere in the open literature. Building it has been one of the most interesting engineering exercises I've done.*
*More updates as the build continues. The kernel is listening.*
Blog post written — 380 lines at /home/spectrcyde/NerfEngine/Update Blog Post 03132026.md.
It covers:
- GraphOps Copilot — the attractor loop bug, the fix, live test results against llama3.2:3b
- GraphOps Autopilot — 4-tier system, AlertDedup, SentinelLoop, scoring logic, 24/24 pass
- Three upgrades — cooldowns, HypergraphMetricsCollector, GraphAttractorDetector (the rotating-botnet detector)
- BPF edge aggregation — 98% telemetry reduction, kernel + Go relay design
- AlmaLinux 10 tuning script — key sysctl values, IRQ pinning, JIT config
- Full stack diagram, by-the-numbers table, and what’s next
Extra Special Thanks to ChatGPT!!!
https://chatgpt.com/share/69b3a327-a4c4-8011-a9d0-d82cfc8804ed