
# NerfEngine Stage 6: Taming the Torrent

In the world of real-time data analysis, the sheer volume of information can be overwhelming. At the heart of the NerfEngine project, we’re constantly pushing the boundaries of what’s possible, and our latest “Stage 6” advancements are a major step forward in our ability to process and understand massive data streams in real time.

## The Challenge: A Tsunami of Data

The NerfEngine is designed to analyze complex, high-volume data streams in real-time. As our capabilities grew, so did the data. We were facing a tsunami of information that threatened to overwhelm our systems. To combat this, we initiated Stage 6, a project focused on three key areas: edge compression, time-series data storage, and sophisticated event detection.

## Edge Compression: A Diet for Data

One of the most significant advancements in Stage 6 is our new “EdgeTick” data format. Previously, our `FlowCore` struct, which represents a single data event, was 56 bytes. This might not sound like much, but when you’re processing millions of events per second, it adds up quickly.

The `EdgeTick` is a highly compressed, 32-byte representation of the same data. It’s like putting your data on a diet, shedding unnecessary bytes while retaining all the essential nutrients. This 43% reduction in size has a massive impact on our data transmission and storage, allowing us to handle significantly more data with the same resources.
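The post doesn’t show the actual field layout, but the idea of a fixed 32-byte packed record can be sketched with Python’s `struct` module. The field names and widths below are illustrative assumptions, not the real `EdgeTick` schema:

```python
import struct

# Hypothetical EdgeTick layout (NOT the real schema -- the post doesn't
# publish the field set). One plausible 32-byte packing:
#   8B timestamp_ns | 4B src_ip | 4B dst_ip | 2B src_port | 2B dst_port
#   4B byte_count   | 4B packet_count | 1B protocol | 3B flags/padding
EDGE_TICK = struct.Struct("<Q4s4sHHIIB3s")

def pack_edge_tick(ts_ns, src_ip, dst_ip, sport, dport, nbytes, npkts, proto):
    # src_ip / dst_ip are raw 4-byte IPv4 addresses
    return EDGE_TICK.pack(ts_ns, src_ip, dst_ip, sport, dport,
                          nbytes, npkts, proto, b"\x00\x00\x00")

tick = pack_edge_tick(1_700_000_000_000_000_000,
                      bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]),
                      443, 51234, 1500, 3, 6)
assert EDGE_TICK.size == 32  # fixed-width, so batches are trivially indexable
```

Fixed-width packing like this is what makes the format cheap to batch: a million ticks is exactly 32 MB, and any record can be located by offset arithmetic alone.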

## QuestDB Integration: Time-Series at Scale

To handle the incredible volume of data, we’ve integrated QuestDB, a high-performance, open-source time-series database. QuestDB is built for speed and is a perfect match for the NerfEngine’s real-time analysis needs.

We’re using the InfluxDB Line Protocol (ILP) to write data to QuestDB in batches, which is an incredibly efficient way to insert large volumes of time-series data. This allows us to keep up with the torrent of `EdgeTick` events and store them for historical analysis and model training.
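As a rough sketch of what batched ILP writes look like, here’s a minimal renderer and sender over a raw TCP socket. The table and column names are invented for illustration; QuestDB’s default ILP-over-TCP port is 9009:

```python
import socket

def ilp_batch(rows):
    """Render EdgeTick-like rows as InfluxDB Line Protocol text.

    Table and column names here are illustrative, not NerfEngine's
    actual schema. ILP format: measurement,tags fields timestamp
    """
    lines = []
    for r in rows:
        lines.append(
            f"edge_ticks,proto={r['proto']} "
            f"src=\"{r['src']}\",dst=\"{r['dst']}\",bytes={r['bytes']}i "
            f"{r['ts_ns']}"
        )
    return ("\n".join(lines) + "\n").encode()

def send_batch(payload, host="localhost", port=9009):
    # One socket write per batch amortizes connection cost across rows.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)

batch = ilp_batch([
    {"proto": "tcp", "src": "10.0.0.1", "dst": "10.0.0.2",
     "bytes": 1500, "ts_ns": 1700000000000000000},
])
```

Batching is the whole trick: each ILP line is tiny, so the win comes from sending thousands of them per socket write rather than one insert per event.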

## Topology Drift Detection: Finding the Ghosts in the Machine

With the data flowing smoothly into QuestDB, the next challenge was to make sense of it. Stage 6 introduces two new powerful detection mechanisms:

*   **`TopologyDriftDetector`**: This detector uses a sliding window to look for spikes in the number of connections to and from nodes in our graph. This can indicate a variety of events, from a new server coming online to a coordinated attack.

*   **`TemporalFanInDetector`**: This is where things get really interesting. The `TemporalFanInDetector` is designed to detect coordinated botnets, even those hiding behind rotating proxies. It does this by analyzing the timing entropy of connections. In simple terms, it looks for patterns in the timing of connections that are too regular to be random, a tell-tale sign of a botnet.

These detectors are the “ghost hunters” of our system, constantly searching for the subtle signs of anomalous activity that could indicate a threat.
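To make the timing-entropy idea concrete, here’s a minimal sketch of the kind of check a `TemporalFanInDetector` might run. The bucketing scheme and thresholds are illustrative assumptions, not the real detector’s tuning:

```python
import math
from collections import Counter

def timing_entropy(timestamps, bucket_ms=100):
    """Shannon entropy of inter-arrival times, bucketed to bucket_ms.

    Low entropy means connection timing is too regular to be organic --
    the beaconing signature the post attributes to coordinated botnets.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    buckets = Counter(int(d * 1000) // bucket_ms for d in deltas)
    n = sum(buckets.values())
    return -sum((c / n) * math.log2(c / n) for c in buckets.values())

beaconing = [i * 5.0 for i in range(50)]  # a connection exactly every 5s
organic = [0, 1.2, 7.9, 8.3, 15.0, 21.7, 22.1, 30.5, 44.0, 45.2]

assert timing_entropy(beaconing) == 0.0  # one bucket: perfectly regular
assert timing_entropy(organic) > 1.0     # spread across many buckets
```

A rotating proxy changes *where* the connections come from, but not *when* — which is why a timing-domain feature survives proxy rotation when address-based features don’t.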

## Conclusion: A Foundation for the Future

The advancements in Stage 6 are more than just a set of new features; they represent a fundamental shift in how the NerfEngine handles data. By taming the data torrent, we’ve built a solid foundation for the future. We’re now better equipped than ever to tackle the challenges of real-time data analysis and to continue pushing the boundaries of what’s possible.

Stay tuned for more updates as we continue to build on this powerful new foundation!

# NerfEngine Stage 7: Eyes, Memory, and Scale

Every system reaches a threshold where raw capability stops being the bottleneck, and the way the system *thinks about itself* becomes the constraint. Stage 7 is where NerfEngine crossed that threshold. This release isn’t about adding more sensors or faster pipelines — it’s about giving the platform a semantic memory, the ability to repair its own reasoning errors, and a multi-instance architecture that scales without fighting itself.

## The Problem We Were Really Solving

By Stage 6, we had a system that could ingest network flows in real time, build hypergraphs, run LLM inference over graph state, and stream everything to a live operator dashboard with Cesium globe rendering, Recon Entity tracking, and RF signal overlays. That’s a lot of moving parts — and they were moving well.

But three cracks were visible under load:

1. **The LLM was hallucinating edge kinds.** Gemma 3b would output semantically correct ideas like `FLOW_HOST_TO_HOST` or `SESSION_BETWEEN_HOSTS` — things that *mean* something — but they weren’t in our ontology. The static validator dropped them. Drop rates hit 70–90%. The inference pipeline was burning cycles producing outputs the system immediately discarded.

2. **Spawning new instances broke them.** Our orchestrator (`scythe_orchestrator.py`) correctly assigns each spawned instance its own data directory, port, and identity. But two critical subsystems — the DuckDB event store and the EmbeddingEngine semantic memory — were still opening hardcoded global file paths. Every new instance tried to grab the same DuckDB lock already held by the primary process. Instances started degraded.

3. **The asyncio event loop crashed in every spawned instance.** `StreamManager` created a background thread to run an asyncio event loop alongside Flask-SocketIO’s eventlet hub. Eventlet monkey-patches Python’s `asyncio` module, including `_get_running_loop()`, so it returns the eventlet hub from *any* OS thread. The new background loop’s `run_forever()` call hit `RuntimeError: Cannot run the event loop while another loop is running` on every spawn.

These weren’t cosmetic bugs. They meant that every new instance the orchestrator spawned started life with no semantic memory, no event store, and a crashed WebSocket relay thread.

## The Semantic Edge Compiler

The breakthrough here is conceptually simple: **instead of enforcing schema at the hard wall of the validator, enforce it at a semantic distance layer first.**

### What changed

We introduced `semantic_edge_repair.py` — a lazy-initialized, thread-safe repair engine that sits between the static alias table and the drop decision:

```
LLM output → normalize_edge_kind() (static aliases) → None?
                └→ SemanticEdgeRepair.repair()
                      cosine_similarity(embed(raw_kind), embed(VALID_KIND))
                      → score ≥ 0.82 → ACCEPT (canonical)
                      → score < 0.82 → DROP  (logged for evolution)
```

On first use, `SemanticEdgeRepair` embeds all 13 valid inferred edge kinds using `embeddinggemma` (768-dimensional) and caches them. For every unknown kind, it computes the cosine similarity between the raw output and each valid kind. If the best match scores above 0.82, the edge is accepted under its canonical kind.

| Raw LLM Output | Semantic Repair | Score |
|---|---|---|
| `FLOW_HOST_TO_HOST` | `INFERRED_FLOW` | 0.87 |
| `SESSION_BETWEEN_HOSTS` | `INFERRED_FLOW` | 0.84 |
| `HOST_IN_ASN` | `INFERRED_HOST_IN_ORG` | 0.91 |
| `FLOW_OBSERVED_HOST` | drops (observed zone) | 0.61 |

The expected result: edge drop rates fall from ~70–90% down to under 10% for the class of novel-but-semantically-valid hallucinations that the static alias table can’t enumerate.
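The repair step itself is a nearest-neighbor search under cosine similarity. A toy sketch, using tiny hand-made vectors in place of the real 768-dimensional `embeddinggemma` embeddings:

```python
import math

VALID_KINDS = {
    # Toy 3-d "embeddings" standing in for the real 768-d
    # embeddinggemma vectors; only the geometry matters here.
    "INFERRED_FLOW":        [0.9, 0.1, 0.0],
    "INFERRED_HOST_IN_ORG": [0.1, 0.9, 0.1],
}
THRESHOLD = 0.82  # acceptance cutoff from the post

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def repair(raw_vec):
    """Return (canonical_kind, score) if the best match clears the
    threshold, else (None, score) -- the drop-and-log path."""
    best_kind, best = max(
        ((k, cosine(raw_vec, v)) for k, v in VALID_KINDS.items()),
        key=lambda kv: kv[1],
    )
    return (best_kind if best >= THRESHOLD else None), best

kind, score = repair([0.85, 0.2, 0.05])  # geometrically near INFERRED_FLOW
assert kind == "INFERRED_FLOW" and score >= THRESHOLD
```

Because the valid-kind embeddings are computed once and cached, each repair costs only 13 dot products — negligible next to the embedding call itself.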

### Ontology evolution built in

Every repair attempt — accepted or rejected — is logged with `{raw_kind, canonical, score, timestamp}`. The `promote_candidates()` method surfaces kinds that appear frequently with decent scores but fall below the acceptance threshold. These are the system’s way of saying: *“this concept is appearing a lot — consider adding it to the alias table or promoting it to a valid kind.”*

```
GET /api/semantic-repair/stats
→ { total, accepted, rejected, accept_rate,
    top_repairs: [{mapping: "FLOW_HOST_TO_HOST → INFERRED_FLOW", count: 47}],
    promotion_candidates: [{raw: "PORT_TCP_OBSERVED", avg_score: 0.78, occurrences: 31}] }
```

The schema is no longer static. It now evolves from evidence.
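A minimal sketch of what `promote_candidates()` might do over that log — the parameter defaults and return shape are assumptions for illustration:

```python
from collections import defaultdict

def promote_candidates(repair_log, min_occurrences=10, min_avg=0.70,
                       threshold=0.82):
    """Surface raw kinds that keep appearing with decent scores but
    never clear the acceptance threshold. Defaults are illustrative,
    not NerfEngine's actual tuning."""
    scores = defaultdict(list)
    for entry in repair_log:
        if entry["score"] < threshold:  # only sub-threshold attempts
            scores[entry["raw_kind"]].append(entry["score"])
    return [
        {"raw": k, "avg_score": sum(v) / len(v), "occurrences": len(v)}
        for k, v in scores.items()
        if len(v) >= min_occurrences and sum(v) / len(v) >= min_avg
    ]

log = [{"raw_kind": "PORT_TCP_OBSERVED", "score": 0.78}] * 31
assert promote_candidates(log)[0]["occurrences"] == 31
```

The important design choice is that promotion is advisory: the function surfaces evidence, and a human (or, later, the ontology daemon) decides whether the schema grows.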

## Multi-Instance Architecture: Storage Sovereignty

The orchestrator model is elegant: each `scythe_orchestrator.py` spawn gets a unique `instance_id`, a free port, and a `--data-dir` pointing to `instances/<id>/`. Every SQLite database, snapshot, operator session, and log already respected this boundary. Two components didn’t.

### DuckDB instance-scoping

`ScytheDuckStore` defaulted to a hardcoded `metrics_logs/scythe_events.duckdb`. `EmbeddingEngine` defaulted to a hardcoded `embedding_store.duckdb`. Both open exclusive file locks. The fix is now wired directly into `main()`, after `--data-dir` is parsed:

```python
_duck_store = ScytheDuckStore(
    db_path=os.path.join(data_dir, 'scythe_events.duckdb'),
    parquet_dir=os.path.join(data_dir, 'parquet_blocks'),
)
```

The EmbeddingEngine received `db_path` and `index_path` parameters so the FAISS index also stays instance-local. Each instance now owns its own semantic memory — no shared locks, no cross-contamination.

As a resilience fallback, if a lock conflict somehow occurs, `EmbeddingEngine` gracefully degrades to a PID-scoped temp path rather than failing the entire MCP tool registration.
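That fallback pattern can be sketched generically — `open_fn` below stands in for the real DuckDB connect call, and the temp-file naming is an assumption:

```python
import os
import tempfile

def open_store_with_fallback(preferred_path, open_fn):
    """Try the instance-local path first; on a lock conflict, degrade
    to a PID-scoped temp path instead of failing tool registration."""
    try:
        return open_fn(preferred_path), preferred_path
    except (OSError, RuntimeError):
        fallback = os.path.join(
            tempfile.gettempdir(),
            f"embedding_store.{os.getpid()}.duckdb")
        return open_fn(fallback), fallback

def fake_open(path):
    # Simulates a connect call that fails when the lock is held.
    if path == "locked.duckdb":
        raise RuntimeError("lock held by another process")
    return f"conn:{path}"

conn, path = open_store_with_fallback("locked.duckdb", fake_open)
assert str(os.getpid()) in path  # degraded, but alive
```

The trade-off is deliberate: a PID-scoped store loses cross-restart memory, but the instance boots and every MCP tool still registers.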

### The asyncio/eventlet loop fix

The `StreamManager` creates a background thread for WebSocket relay connections. The fix: `asyncio.SelectorEventLoop(_sel.DefaultSelector())` — a plain stdlib loop, created directly without going through eventlet’s monkeypatched `new_event_loop()`. We patch `_check_running` on this specific loop instance to allow it to coexist alongside the eventlet hub. The background thread runs cleanly, `asyncio.run_coroutine_threadsafe()` works as expected, and the `Thread-5 RuntimeError` is gone from every spawned instance log.
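Stripped of the eventlet-specific `_check_running` patch (which only matters once eventlet has monkey-patched `asyncio`), the background-loop pattern looks like this against plain CPython:

```python
import asyncio
import selectors
import threading

# Build the loop directly from stdlib parts, bypassing any
# monkeypatched asyncio.new_event_loop(). Under eventlet you would
# additionally patch _check_running on this instance, per the post.
loop = asyncio.SelectorEventLoop(selectors.DefaultSelector())

t = threading.Thread(target=loop.run_forever, daemon=True)
t.start()

async def relay_ping():
    # Stand-in for real WebSocket relay work.
    return "relay alive"

# Schedule work onto the background loop from another thread.
future = asyncio.run_coroutine_threadsafe(relay_ping(), loop)
assert future.result(timeout=5) == "relay alive"

loop.call_soon_threadsafe(loop.stop)
t.join(timeout=5)
```

This is the same shape `StreamManager` needs: one dedicated loop on its own OS thread, fed via `run_coroutine_threadsafe()` from the Flask-SocketIO side.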

**Result:** A fresh spawn now initializes completely — DuckDB event store, FAISS semantic memory, stream relay, and all 37 MCP tools registered — in under 8 seconds.

## Recon Entities: 19,412 Entities, Live

The Recon Entities panel received a complete redesign to handle the PCAP-derived entity set, which now exceeds 19,000 entries.

**What’s new:**

- Entities load in grouped, collapsed sections by type and geographic cluster. Opening the panel no longer causes a full reload — the group structure is cached.
- Live SSE streaming keeps the panel current as new entities are detected.

**Hover-to-probe**: Mouse over any PCAP node and a 350ms debounced `ping -c 1 -W 1` fires against the IP. The status bullet turns green (alive) or red (dead) in real time. For botnet tracking, this is significant — many of these IPs are hit-and-run nodes that go offline between scan windows.

**JIT info cards**: Clicking ℹ️ on any entity loads `/api/recon/entity/<id>` on demand, then renders IP, org, city, country flag, coordinates, byte counts, threat classification, and disposition. Nothing is pre-loaded; the card appears in ~80ms.

**Load 1000 more**: the pagination control now fetches 1000 entries at a time instead of 100, with the server endpoint raising its grouped query limit to match.

## Room Chat: Live Operator Coordination

The room chat system was rebuilt from scratch. The old implementation sent messages that never appeared. The new one is a YouTube-live-style feed:

**No login required.** Users who aren’t signed in are identified by their IP address, formatted as `Guest-192-168-x-x`.

**Operators become Recon Entities.** Every user who joins a room is automatically registered as an `OPERATOR` type entity in the Recon panel — geolocated, live, and visible on the Cesium globe.

**SSE primary, polling fallback.** The server pushes messages via `GET /api/chat/<room>/stream` (Server-Sent Events). Browsers that can’t hold SSE connections fall back to polling `GET /api/chat/<room>/messages?since=<ts>`.

**Persistent callsigns.** Operators can set a callsign that persists in localStorage across sessions.

**Colored badges** distinguish operators from guests at a glance.

## Hyperedge Arc Lines

The Cesium hypergraph renderer now draws animated parabolic arcs between connected nodes instead of flat lines.

Arc height follows `max(50km, distance_meters × 0.15)` — short edges get a minimum 50km arc so they’re always visible on the globe; intercontinental edges sweep elegantly across the atmosphere. Cardinality-2 edges use `PolylineDashMaterialProperty` with a `CallbackProperty`-driven animated `dashOffset`. Higher-cardinality hyperedge spokes use `PolylineGlowMaterialProperty` for a distinct visual treatment. A shared `requestAnimationFrame` loop increments all offsets at ~0.48/second, giving the hypergraph the appearance of directed signal flow.
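The arc-height rule is a one-liner; a quick sketch showing the clamp behavior at both ends:

```python
def arc_height_m(distance_m):
    """Arc apex height per the renderer's rule: at least 50 km,
    otherwise 15% of the ground distance between the two nodes."""
    return max(50_000.0, distance_m * 0.15)

assert arc_height_m(10_000) == 50_000.0        # short hop: clamped to 50 km
assert arc_height_m(8_000_000) == 1_200_000.0  # intercontinental: 1200 km apex
```

The clamp is what keeps dense local clusters legible: without it, same-city edges would hug the terrain and vanish under the globe imagery.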

## GraphOps Inference: Timeout and Keepalive

The `embeddinggemma` model on the RTX 3060 takes 90–120 seconds to cold-load after GPU eviction. The previous 45-second abort controller was killing inference before the model finished loading.

**Changes:**

- Frontend `AbortController`: 45s → 300s, with a “⏳ Model loading…” hint at the 60-second mark
- `GemmaRunnerConfig.timeout`: 60s → 300s
- `GraphOpsAgent._llm_call` urlopen: 30s → 150s
- New `ollama_keepalive.py` daemon pings `gemma3:1b` every 25s and `embeddinggemma` every 40s, preventing the GPU eviction cycle entirely
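A minimal sketch of what such a keepalive daemon could look like — the endpoint shape assumes a stock Ollama server at its default port, and the tick-based scheduling is an illustrative choice, not the actual `ollama_keepalive.py`:

```python
import json
import threading
import urllib.request

MODELS = {"gemma3:1b": 25, "embeddinggemma": 40}  # ping period, seconds

def ping(model, base="http://localhost:11434"):
    # A generate request with no prompt keeps the model resident;
    # endpoint shape assumes a stock Ollama server (an assumption).
    req = urllib.request.Request(
        f"{base}/api/generate",
        data=json.dumps({"model": model, "keep_alive": "5m"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10).read()

def due(elapsed_s):
    """Which models need a ping after elapsed_s seconds on a fixed tick."""
    return [m for m, period in MODELS.items() if elapsed_s % period == 0]

def run(stop: threading.Event, tick=5):
    elapsed = 0
    while not stop.wait(tick):
        elapsed += tick
        for model in due(elapsed):
            try:
                ping(model)
            except OSError:
                pass  # server restarting; retry on the next tick

assert due(25) == ["gemma3:1b"]
assert due(40) == ["embeddinggemma"]
```

Keeping both models warm trades a little idle VRAM for never paying the 90–120 second cold-load again.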

## Command Console: Promoted from Footer

The command console was a `position: fixed` overlay at the bottom of the page — it blocked the globe in mobile viewports and conflicted with other overlays. It’s now a first-class sidebar panel (`showPanel('console')`) reachable from the main menu. The `#command-console` div is unchanged for JavaScript compatibility; only its CSS and DOM position changed.

## What’s Next

The semantic repair layer has shown us something important: the LLM is generating *good concepts* that don’t fit the current schema. The `promote_candidates()` endpoint is already surfacing patterns. The next logical step is a **self-healing ontology daemon** — a background agent that watches repair logs, identifies high-frequency promotion candidates, and proposes schema expansions through the GraphOps MCP interface.

The other frontier is the Android `ScytheCommandApp` APK — all of these UI changes (chat, probe dots, arc lines, console panel) need to be deployed to the WebView layer. The 16KB ELF alignment build process is clean; it’s just a matter of `./gradlew assembleDebug`.

And the streams on `ws://localhost:8765`, `ws://localhost:8766`, and `http://localhost:8234` that `StreamManager` is trying to reach — those endpoints are the live `eve-streamer` WebSocket feeds. Getting those wired up closes the loop between the live packet capture layer and the graph inference engine.

*NerfEngine is a local-first, RF-aware tactical intelligence platform. All inference runs on local hardware. No data leaves the edge.*

