Edge Agent Architecture: Full Stack Reference Guide

Last reviewed: 2026-05-22 · Marcus Rüb

An edge agent architecture consists of layered subsystems — a data ingestion layer, an agent runtime, a local inference engine, a state and memory layer, an action layer, and an optional hybrid sync layer — all coordinated to enable autonomous, locally-grounded AI decisions at or near the data source.

This page provides a complete reference architecture. Each component is described independently so teams can adapt the stack to their hardware constraints and use-case requirements.

Top-Level Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                        CLOUD / ON-PREM DC                       │
│   ┌──────────────┐  ┌────────────────┐  ┌───────────────────┐  │
│   │ Cloud Agent  │  │ Model Registry │  │ Cross-Asset Store │  │
│   │ (Frontier LLM│  │ (Versions,     │  │ (Event history,   │  │
│   │  reasoning)  │  │  Manifests)    │  │  KPIs, Policies)  │  │
│   └──────┬───────┘  └────────┬───────┘  └─────────┬─────────┘  │
└──────────┼───────────────────┼────────────────────┼────────────┘
           │ HTTPS / MQTT over │ TLS        Delta    │ Sync
           │ WAN (async)       │            Sync     │
┌──────────┼───────────────────┼────────────────────┼────────────┐
│          │      EDGE GATEWAY / AGENT HOST          │            │
│   ┌──────▼───────────────────────────────────────▼──────────┐  │
│   │                   AGENT ORCHESTRATOR                     │  │
│   │  Task queue │ Tool router │ Context builder │ Planner    │  │
│   └──────┬──────────────┬────────────────────────────────────┘  │
│          │              │                                        │
│   ┌──────▼──────┐ ┌─────▼──────────┐  ┌──────────────────────┐ │
│   │  LOCAL LLM  │ │ VECTOR DB /    │  │  ACTION LAYER        │ │
│   │  INFERENCE  │ │ RAG CORPUS     │  │  OPC UA write        │ │
│   │  (llama.cpp │ │ (Qdrant,       │  │  MQTT publish        │ │
│   │   Ollama,   │ │  ChromaDB,     │  │  REST API call       │ │
│   │   OpenVINO) │ │  Milvus)       │  │  Dashboard update    │ │
│   └─────────────┘ └────────────────┘  └──────────────────────┘ │
│          │                                                       │
│   ┌──────▼───────────────────────────────────────────────────┐  │
│   │                  DATA INGESTION LAYER                    │  │
│   │   OPC UA Client │ MQTT Sub │ Modbus Poller │ S7 Reader   │  │
│   └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
           │
     Field Devices: PLCs, Sensors, Drives, Vision Systems

Component Descriptions

Data Ingestion Layer

The ingestion layer is the bridge between field data and the agent. It abstracts the industrial protocol complexity so the agent runtime works with normalized, typed data objects regardless of whether the source is OPC UA, Modbus TCP, or MQTT.

Key responsibilities:

Subscribe to OPC UA nodes (subscriptions, not polling, where possible)
Poll Modbus TCP holding registers on configurable intervals
Subscribe to MQTT topics from field SCADA or broker
Normalize all incoming data to a common schema (tag name, value, timestamp, quality code)
Apply dead-band filtering to suppress noise before forwarding to the agent

Common client libraries include Eclipse Milo (a Java OPC UA client, also usable from Kotlin), pymodbus for Modbus, and paho-mqtt for MQTT in Python. For higher-performance deployments, teams use open-source gateway frameworks such as Eclipse Kura or custom C++ adapters.
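The normalization and dead-band steps can be sketched in a few lines. This is a minimal illustration, not a production adapter: the `TagReading` class, the `DeadbandFilter` class, the quality strings, and the tag name `line1/temp` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TagReading:
    """Common schema: tag name, value, timestamp, quality code."""
    tag: str
    value: float
    timestamp: datetime
    quality: str  # hypothetical codes, e.g. "good", "uncertain", "bad"


class DeadbandFilter:
    """Forward a reading only if it moved more than `deadband`
    away from the last value that was forwarded for the same tag."""

    def __init__(self, deadband: float) -> None:
        self.deadband = deadband
        self._last_forwarded: dict[str, float] = {}

    def accept(self, reading: TagReading) -> bool:
        last = self._last_forwarded.get(reading.tag)
        if last is None or abs(reading.value - last) > self.deadband:
            self._last_forwarded[reading.tag] = reading.value
            return True
        return False


def normalize(tag: str, raw_value: float, quality: str = "good") -> TagReading:
    """Wrap a protocol-specific raw value in the common schema (UTC timestamp)."""
    return TagReading(tag, float(raw_value), datetime.now(timezone.utc), quality)


flt = DeadbandFilter(deadband=0.5)
samples = [20.0, 20.2, 20.4, 20.6, 21.3]
forwarded = [v for v in samples if flt.accept(normalize("line1/temp", v))]
print(forwarded)  # [20.0, 20.6, 21.3]
```

The filter compares against the last forwarded value rather than the last observed one, so slow drift is still reported once it accumulates past the dead-band.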

Agent Orchestrator

The orchestrator is the core control loop. It:

Receives events from the ingestion layer or a scheduler
Builds a context object (relevant recent readings, agent memory, retrieved documents)
Routes the context to the appropriate tool (LLM, classifier, rules engine, external API)
Interprets the tool’s output and decides on actions
Executes actions via the action layer
Persists the decision and its outcome to the local state store

In Python-based implementations, a framework such as LangChain or LlamaIndex, or a custom asyncio loop, can serve as the orchestrator. The orchestrator runs its own local event loop, so it keeps operating when cloud connectivity is unavailable.
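The control loop described above can be sketched with a plain asyncio queue. This is an illustrative skeleton under stated assumptions: the event fields (`tag`, `value`, `limit`), the in-memory `state_log` list standing in for the local state store, and the rule-based `decide` function standing in for the tool router are all hypothetical.

```python
import asyncio
from typing import Any, Optional


def decide(context: dict[str, Any]) -> dict[str, str]:
    """Stand-in for a routed tool (here: a trivial rules engine)."""
    event = context["event"]
    if event["value"] > event["limit"]:
        return {"action": "notify", "reason": "limit exceeded"}
    return {"action": "none", "reason": "within limits"}


async def handle_event(event: dict[str, Any], state_log: list) -> dict[str, Any]:
    # Build a context object from the event and recent agent memory.
    context = {"event": event, "memory": state_log[-3:], "documents": []}
    # Route the context to a tool and interpret its output.
    decision = decide(context)
    # Execute the action (stubbed) and persist decision plus outcome.
    outcome = "sent" if decision["action"] == "notify" else "skipped"
    record = {"tag": event["tag"], "decision": decision, "outcome": outcome}
    state_log.append(record)
    return record


async def run_orchestrator(queue: "asyncio.Queue[Optional[dict]]", state_log: list) -> None:
    while True:
        event = await queue.get()
        if event is None:  # sentinel used here to stop the demo loop
            break
        await handle_event(event, state_log)


async def main() -> list:
    queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
    state_log: list = []
    for value in (71.0, 93.5):
        queue.put_nowait({"tag": "motor1/temp", "value": value, "limit": 90.0})
    queue.put_nowait(None)
    await run_orchestrator(queue, state_log)
    return state_log


log = asyncio.run(main())
print([record["outcome"] for record in log])  # ['skipped', 'sent']
```

In a real deployment the `decide` step would dispatch to the LLM, a classifier, or an external API, and persistence would go to the durable local store rather than a list.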

Local LLM Inference

The inference component serves model requests from the orchestrator. In production industrial deployments, the recommended setup is a locally running inference server rather than a direct library call — this allows the model to be swapped without redeploying the agent code.

Recommended serving configurations:

Hardware	Inference Server	Recommended Model
Intel x86 industrial PC, 16 GB RAM	Ollama (OpenVINO backend) or llama.cpp server	Qwen3-4B Q4_K_M or Phi-4-mini
NVIDIA Jetson AGX Orin	Ollama (CUDA) or TensorRT-LLM	Llama 3.1 8B Q4
ARM DIN-rail gateway, 8 GB RAM	llama.cpp server (CPU)	Phi-4-mini Q4_K_M or SmolLM3
Industrial PC + discrete GPU (8 GB VRAM)	Ollama CUDA or llama.cpp CUDA	Llama 3.1 8B or Gemma 2 9B
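To illustrate why a local inference server decouples the model from the agent code, here is a minimal sketch of a client for Ollama's `/api/generate` HTTP endpoint using only the standard library. The URL assumes Ollama's default port on the same host, and the model tag `qwen3:4b` and the prompt are example values; in practice the model name would come from configuration.

```python
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request; the model is just a string parameter."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout_s: float = 60.0) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]


# Building the request needs no server, which keeps this example runnable offline.
req = build_request("qwen3:4b", "Summarize the last fault on conveyor 2.")
print(req.get_method(), req.full_url)
```

Swapping the model then means changing one configuration string, not the orchestrator code. A llama.cpp server would need a different endpoint and payload shape.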

Vector DB / RAG Corpus

The local vector database stores embedded chunks of machine documentation, historical fault logs, standard operating procedures, and any domain knowledge the agent needs to retrieve at inference time.

Component choices:

Qdrant — Rust-based, embeddable, excellent performance, good fit for industrial edge
ChromaDB — Python-native, easy to embed, lower performance ceiling
Milvus Lite — Embedded mode of Milvus; appropriate for single-node deployments
SQLite + sqlite-vss — Minimal footprint option for very constrained hardware

The embedding model runs locally. Common choices: nomic-embed-text (768-dim, good quality), all-MiniLM-L6-v2 (384-dim, faster), or a domain-fine-tuned encoder.
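What the vector database does at query time can be shown with a toy in-memory retriever. This sketch is not the API of Qdrant, ChromaDB, or Milvus: it uses hand-written 3-dimensional vectors in place of real embeddings, and the document snippets are invented for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], corpus: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy corpus: (chunk text, embedding). Real embeddings would have 384 or 768 dimensions.
corpus = [
    ("Bearing overheating: check lubrication", [0.9, 0.1, 0.0]),
    ("Belt misalignment: re-tension belt", [0.1, 0.9, 0.0]),
    ("Drive fault: reset inverter", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]  # stands in for the embedded operator question
results = top_k(query, corpus, k=2)
print(results)
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index, which is what keeps retrieval fast as the corpus grows.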

Action Layer

The action layer executes decisions made by the orchestrator. Actions are categorized by authorization level:

Action Class	Authorization Required	Examples
Read / observe	None	OPC UA read, historian query
Inform / notify	None	Dashboard update, email alert, MQTT publish to status topic
Recommend	None	Advisory text to operator UI
Actuate (low risk)	Operator acknowledgment	Write a setpoint within pre-approved bounds
Actuate (high risk)	Formal approval workflow	Shutdown sequence, mode change

This tiered authorization model aligns with IEC 62443 access control requirements and prevents autonomous actuation in safety-relevant contexts.
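One way to encode the table above is a lookup from action class to required authorization, checked before any action runs. The enum names and the `write_setpoint` helper are hypothetical, and the sketch assumes that a higher authorization level also satisfies lower requirements.

```python
from enum import Enum, IntEnum


class Authorization(IntEnum):
    NONE = 0
    OPERATOR_ACK = 1
    FORMAL_APPROVAL = 2


class ActionClass(Enum):
    READ = "read"
    INFORM = "inform"
    RECOMMEND = "recommend"
    ACTUATE_LOW = "actuate_low"
    ACTUATE_HIGH = "actuate_high"


# Mirrors the authorization table: action class -> minimum authorization.
REQUIRED_AUTHORIZATION = {
    ActionClass.READ: Authorization.NONE,
    ActionClass.INFORM: Authorization.NONE,
    ActionClass.RECOMMEND: Authorization.NONE,
    ActionClass.ACTUATE_LOW: Authorization.OPERATOR_ACK,
    ActionClass.ACTUATE_HIGH: Authorization.FORMAL_APPROVAL,
}


def is_permitted(action_class: ActionClass, granted: Authorization) -> bool:
    """True if the granted authorization meets the minimum for this class."""
    return granted >= REQUIRED_AUTHORIZATION[action_class]


def write_setpoint(value: float, low: float, high: float, granted: Authorization) -> str:
    """Low-risk actuation: needs operator acknowledgment and a value inside pre-approved bounds."""
    if not is_permitted(ActionClass.ACTUATE_LOW, granted):
        return "rejected: operator acknowledgment required"
    if not (low <= value <= high):
        return "rejected: outside pre-approved bounds"
    return f"write {value}"


print(write_setpoint(72.0, 60.0, 80.0, Authorization.NONE))
print(write_setpoint(72.0, 60.0, 80.0, Authorization.OPERATOR_ACK))
```

Keeping the mapping in data rather than scattered `if` statements makes the policy easy to audit against the table.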

State and Memory Layer

Edge agents require persistent state across restarts. A lightweight embedded database (SQLite is a common choice) stores:

Short-term conversational memory (last N interactions)
Long-term event log (structured records of agent decisions)
Outbox queue (unsynced events for deferred upload)
Agent configuration and tool manifests

For multi-agent setups, a shared Redis instance on the local network enables inter-agent memory sharing.
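A minimal sketch of the event log and outbox using Python's built-in `sqlite3` module follows. As a simplification, it merges the two into one table with a `synced` flag; the table name, column names, and helper functions are assumptions for illustration, and the example uses an in-memory database where a deployment would use a file path.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS event_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    payload TEXT NOT NULL,
    synced INTEGER NOT NULL DEFAULT 0
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_decision(conn: sqlite3.Connection, decision: dict) -> int:
    """Append a structured decision record; it starts out unsynced."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO event_log (payload) VALUES (?)", (json.dumps(decision),)
        )
    return cur.lastrowid


def pending_outbox(conn: sqlite3.Connection, limit: int = 100) -> list:
    """Unsynced records, oldest first, ready for deferred upload."""
    rows = conn.execute(
        "SELECT id, payload FROM event_log WHERE synced = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    return [(row_id, json.loads(payload)) for row_id, payload in rows]


def mark_synced(conn: sqlite3.Connection, ids: list) -> None:
    with conn:
        conn.executemany(
            "UPDATE event_log SET synced = 1 WHERE id = ?", [(i,) for i in ids]
        )


conn = open_store()
first = record_decision(conn, {"tag": "motor1/temp", "action": "notify"})
record_decision(conn, {"tag": "pump3/vibration", "action": "recommend"})
mark_synced(conn, [first])  # pretend the first record was uploaded
outbox = pending_outbox(conn)
print(len(outbox), outbox[0][1]["tag"])  # 1 pump3/vibration
```

The count of unsynced rows is also the outbox queue depth, a useful health signal for connectivity.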

Model Registry

The model registry is a lightweight service (local or cloud-hosted) that tracks:

Current deployed model version per device
Available model updates and their integrity hashes
RAG corpus version and document manifest

The edge agent polls the registry on a configurable interval. Updates are downloaded, hash-verified, and staged before being activated during a maintenance window. This prevents mid-session model swaps that could change agent behavior unpredictably.
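The download, hash-verify, stage, and activate sequence can be sketched with the standard library. The file names, directory layout, and function names here are hypothetical, and the sketch assumes the registry publishes a SHA-256 digest; the source text only says "integrity hashes".

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large model files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_update(downloaded: Path, expected_sha256: str, staging_dir: Path) -> Path:
    """Verify the download and move it into staging; the active model is untouched."""
    if sha256_of(downloaded) != expected_sha256:
        raise ValueError("integrity hash mismatch, refusing to stage update")
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / downloaded.name
    os.replace(downloaded, staged)
    return staged


def activate(staged: Path, active_path: Path) -> None:
    """Swap the staged file into place; intended to run in a maintenance window."""
    os.replace(staged, active_path)  # atomic when both paths are on one filesystem


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    download = root / "model-v2.gguf"
    download.write_bytes(b"fake model weights")
    expected = hashlib.sha256(b"fake model weights").hexdigest()

    staged = stage_update(download, expected, root / "staging")
    activate(staged, root / "active.gguf")
    activated_ok = (root / "active.gguf").read_bytes() == b"fake model weights"

print(activated_ok)  # True
```

Separating `stage_update` from `activate` is what allows the download to happen at any time while the swap waits for the maintenance window.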

Observability Layer

Production edge agents require observability. Key telemetry:

Inference latency per request (P50, P95, P99)
Token throughput (tokens/second)
RAG retrieval latency and hit rate
Action execution success/failure counts
Outbox queue depth (connectivity health proxy)
Model version in use

Lightweight options: Prometheus + Grafana (local), or structured JSON logs shipped to a central ELK stack when connectivity is available.
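As a small illustration of the latency percentiles listed above, the following sketch computes P50, P95, and P99 from raw samples with the nearest-rank method. The sample values are invented, and this in-process calculation is an assumption for the example; a Prometheus setup would typically derive percentiles from histogram buckets instead.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical inference latencies in milliseconds.
latencies_ms = [120, 135, 150, 160, 180, 210, 240, 300, 450, 900]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)  # {'p50': 180, 'p95': 900, 'p99': 900}
```

With only ten samples, P95 and P99 both land on the single slowest request, which is why tail percentiles need a reasonably large window to be meaningful.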

Reference Implementation: edge-agents

The abstract architecture above has a concrete open-source reference: ForestHub’s edge-agents runtime (github.com/ForestHubAI/edge-agents). It implements the agent orchestrator and action layer as a directed workflow graph with five typed edges:

Edge type	Maps to
`control`	Orchestrator control flow
`tool`	Tool router invocation
`agentTask`	A planned agent work step
`agentChoice`	A branching decision
`agentDelegate`	Delegation to another agent

The data ingestion and action layers appear as first-class nodes — GPIO, ADC, DAC, PWM, UART, and MQTT — and the whole graph is defined by a contract-first OpenAPI 3.0.3 schema that is code-generated into both the Go engine and the TypeScript tooling. See the Open-Source Edge Agent Runtime page or the workflow examples for how these map to running graphs.

Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.

FAQ

How much disk space does a full edge agent stack require? A practical estimate: OS + runtime = 5–10 GB; model weights (7B Q4_K_M) = 4–5 GB; vector DB + corpus = 1–10 GB depending on documentation volume; agent code = <500 MB. Total: roughly 10–26 GB on the edge device. A 64 GB SSD is a comfortable minimum, leaving headroom for logs, staged model updates, and corpus growth.

Does the agent orchestrator need to be multi-threaded? It needs to be concurrent, though not necessarily multi-threaded. The ingestion layer, the calls to the LLM inference server, and the action layer should run asynchronously so that slow inference does not block sensor monitoring. A common pattern is a single-threaded Python asyncio event loop in the orchestrator, with the inference server running as a separate process.
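A short asyncio sketch shows the effect: sensor monitoring keeps sampling while a slow inference call is pending. The `slow_inference` coroutine uses `asyncio.sleep` as a stand-in for an HTTP request to the inference server, and all names and timings are illustrative assumptions.

```python
import asyncio


async def slow_inference(prompt: str) -> str:
    # Stand-in for awaiting a response from the local inference server.
    await asyncio.sleep(0.2)
    return f"answer to: {prompt}"


async def monitor_sensors(samples: list, stop: asyncio.Event) -> None:
    # Keeps sampling on a fixed interval until asked to stop.
    while not stop.is_set():
        samples.append(len(samples))
        await asyncio.sleep(0.02)


async def main() -> tuple:
    samples: list = []
    stop = asyncio.Event()
    monitor = asyncio.create_task(monitor_sensors(samples, stop))
    answer = await slow_inference("diagnose vibration")  # monitor runs meanwhile
    stop.set()
    await monitor
    return answer, len(samples)


answer, sample_count = asyncio.run(main())
print(answer, "| samples taken during inference:", sample_count)
```

This only works if the inference call is awaitable. A blocking call inside the event loop would stall monitoring, which is one reason to keep inference in a separate process behind an HTTP interface.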

Can the local vector database handle millions of document chunks? Qdrant and Milvus can handle millions of vectors on a single node with moderate RAM. For typical industrial deployments (machine manuals, fault histories, SOPs), 100K–500K chunks is a realistic corpus size, which the embedded options listed above can generally handle; validate the most constrained options, such as SQLite-based search, on the target hardware.

How do you handle model version drift across many edge devices? The model registry pattern addresses this. Each device reports its current model version at sync time. The registry maintains a version matrix and can flag devices that are behind. A rollout policy (e.g., canary: update 5% of devices first, monitor, then roll out) is implemented in the registry service.
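A canary cohort can be selected deterministically by hashing device IDs, so the same devices stay in the cohort across registry restarts. This is one possible approach, not a description of any particular registry; the device ID format, the `salt` parameter, and the function name are hypothetical.

```python
import hashlib


def in_canary(device_id: str, rollout_percent: float, salt: str = "") -> bool:
    """Map the device to a stable bucket in [0, 10000) and compare to the rollout share."""
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rollout_percent * 100


devices = [f"edge-{i:04d}" for i in range(2000)]

# Salting with the release name gives each rollout a different canary cohort.
canary = [d for d in devices if in_canary(d, 5.0, salt="model-v2")]
print(f"{len(canary)} of {len(devices)} devices selected for the canary stage")
```

Raising `rollout_percent` only ever adds devices to the cohort, so widening the rollout after monitoring does not reshuffle which devices already received the update.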

What is the restart behavior if the edge node loses power mid-inference? The inference server should be configured as a system service with automatic restart. The agent orchestrator persists task state to SQLite before invoking actions, so an interrupted inference results in a retry of the current task, not data loss.
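The retry-on-restart behavior can be sketched with a small task table. This is an assumed design for illustration: the `tasks` table, its `status` column, and the helper names are hypothetical, and closing the connection stands in for the power loss.

```python
import os
import sqlite3
import tempfile


def open_tasks(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, prompt TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, prompt: str) -> int:
    """Persist the task before any inference or action is attempted."""
    with conn:
        return conn.execute("INSERT INTO tasks (prompt) VALUES (?)", (prompt,)).lastrowid


def mark_done(conn: sqlite3.Connection, task_id: int) -> None:
    with conn:
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))


def tasks_to_retry(conn: sqlite3.Connection) -> list:
    """Tasks that were persisted but never completed, oldest first."""
    rows = conn.execute("SELECT id FROM tasks WHERE status != 'done' ORDER BY id")
    return [row[0] for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "agent_state.db")

    conn = open_tasks(db_path)
    first = enqueue(conn, "diagnose vibration on pump 3")
    second = enqueue(conn, "summarize shift faults")
    mark_done(conn, first)
    conn.close()  # simulated power loss before the second task finished

    conn = open_tasks(db_path)  # restart: reopen the same database file
    retry_ids = tasks_to_retry(conn)
    conn.close()

print(retry_ids)  # [2]
```

Retrying is only safe if the task's action is idempotent or if the action result is also recorded before the task is marked done.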