Edge Agent Architecture: Full Stack Reference Guide

Last reviewed: 2026-05-22 · Marcus Rüb

An edge agent architecture consists of layered subsystems — a data ingestion layer, an agent runtime, a local inference engine, a state and memory layer, an action layer, and an optional hybrid sync layer — all coordinated to enable autonomous, locally-grounded AI decisions at or near the data source.

This page provides a complete reference architecture. Each component is described independently so teams can adapt the stack to their hardware constraints and use-case requirements.

Top-Level Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                        CLOUD / ON-PREM DC                       │
│   ┌──────────────┐  ┌────────────────┐  ┌───────────────────┐  │
│   │ Cloud Agent  │  │ Model Registry │  │ Cross-Asset Store │  │
│   │ (Frontier LLM│  │ (Versions,     │  │ (Event history,   │  │
│   │  reasoning)  │  │  Manifests)    │  │  KPIs, Policies)  │  │
│   └──────┬───────┘  └────────┬───────┘  └─────────┬─────────┘  │
└──────────┼───────────────────┼────────────────────┼────────────┘
           │ HTTPS / MQTT over │ TLS        Delta    │ Sync
           │ WAN (async)       │            Sync     │
┌──────────┼───────────────────┼────────────────────┼────────────┐
│          │      EDGE GATEWAY / AGENT HOST          │            │
│   ┌──────▼───────────────────────────────────────▼──────────┐  │
│   │                   AGENT ORCHESTRATOR                     │  │
│   │  Task queue │ Tool router │ Context builder │ Planner    │  │
│   └──────┬──────────────┬────────────────────────────────────┘  │
│          │              │                                        │
│   ┌──────▼──────┐ ┌─────▼──────────┐  ┌──────────────────────┐ │
│   │  LOCAL LLM  │ │ VECTOR DB /    │  │  ACTION LAYER        │ │
│   │  INFERENCE  │ │ RAG CORPUS     │  │  OPC UA write        │ │
│   │  (llama.cpp │ │ (Qdrant,       │  │  MQTT publish        │ │
│   │   Ollama,   │ │  ChromaDB,     │  │  REST API call       │ │
│   │   OpenVINO) │ │  Milvus)       │  │  Dashboard update    │ │
│   └─────────────┘ └────────────────┘  └──────────────────────┘ │
│          │                                                       │
│   ┌──────▼───────────────────────────────────────────────────┐  │
│   │                  DATA INGESTION LAYER                    │  │
│   │   OPC UA Client │ MQTT Sub │ Modbus Poller │ S7 Reader   │  │
│   └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

     Field Devices: PLCs, Sensors, Drives, Vision Systems

Component Descriptions

Data Ingestion Layer

The ingestion layer is the bridge between field data and the agent. It abstracts the industrial protocol complexity so the agent runtime works with normalized, typed data objects regardless of whether the source is OPC UA, Modbus TCP, or MQTT.

Key responsibilities:

Common implementations use Eclipse Milo (Java/Kotlin OPC UA client) for OPC UA, pymodbus for Modbus, and the paho-mqtt Python client for MQTT. Higher-performance deployments use open-source bridges like Eclipse Kura or custom C++ adapters.
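The idea of "normalized, typed data objects" can be illustrated with a small sketch. The `Reading` schema, the field names, and the two adapter functions below are hypothetical and not taken from any of the libraries named above; they only show one way that protocol-specific values might be mapped to a common shape.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Union

Value = Union[float, int, bool, str]


@dataclass(frozen=True)
class Reading:
    """Protocol-independent reading handed to the agent runtime (hypothetical schema)."""
    source: str          # "opcua", "modbus", or "mqtt"
    tag: str             # logical signal name, e.g. "press1.oil_temp"
    value: Value
    unit: str
    timestamp: datetime


def from_modbus(tag: str, raw_register: int, scale: float, unit: str) -> Reading:
    """Convert a raw holding-register value into an engineering value."""
    return Reading("modbus", tag, raw_register * scale, unit, datetime.now(timezone.utc))


def from_mqtt(topic: str, payload: bytes, unit: str) -> Reading:
    """Map a topic such as 'plant/press1/oil_temp' to the dotted tag 'press1.oil_temp'."""
    tag = ".".join(topic.split("/")[1:])
    return Reading("mqtt", tag, float(payload.decode("utf-8")), unit, datetime.now(timezone.utc))
```

With adapters like these, the orchestrator only ever sees `Reading` objects and does not need to know which protocol produced them.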

Agent Orchestrator

The orchestrator is the core control loop. It:

  1. Receives events from the ingestion layer or a scheduler
  2. Builds a context object (relevant recent readings, agent memory, retrieved documents)
  3. Routes the context to the appropriate tool (LLM, classifier, rules engine, external API)
  4. Interprets the tool’s output and decides on actions
  5. Executes actions via the action layer
  6. Persists the decision and its outcome to the local state store

In Python-based implementations, a framework such as LangChain or LlamaIndex, or a custom asyncio loop, serves as the orchestrator. The orchestrator runs its own local event loop and does not depend on cloud connectivity.
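The six steps above can be sketched as a custom asyncio loop. This is a minimal illustration, not a framework API: the handler names (`build_context`, `tools`, `decide`, `execute`, `persist`) and the `None` shutdown sentinel are assumptions made for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Event = Dict[str, Any]
Context = Dict[str, Any]


async def handle_event(
    event: Event,
    build_context: Callable[[Event], Context],
    tools: Dict[str, Callable[[Context], Awaitable[str]]],
    decide: Callable[[str], List[str]],
    execute: Callable[[str], None],
    persist: Callable[[Event, List[str]], None],
) -> List[str]:
    context = build_context(event)               # step 2: build the context object
    tool = tools[context.get("tool", "rules")]   # step 3: route to a tool
    output = await tool(context)
    actions = decide(output)                     # step 4: interpret the tool output
    for action in actions:                       # step 5: execute via the action layer
        execute(action)
    persist(event, actions)                      # step 6: persist decision and outcome
    return actions


async def run(queue: asyncio.Queue, **handlers: Any) -> None:
    """Step 1: consume events until a None sentinel arrives (hypothetical shutdown signal)."""
    while (event := await queue.get()) is not None:
        await handle_event(event, **handlers)
```

Passing the handlers in as plain callables keeps the loop independent of any particular LLM client, protocol adapter, or database.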

Local LLM Inference

The inference component serves model requests from the orchestrator. In production industrial deployments, the recommended setup is a locally running inference server rather than a direct library call — this allows the model to be swapped without redeploying the agent code.

Recommended serving configurations:

Hardware                                 | Inference Server                              | Recommended Model
-----------------------------------------+-----------------------------------------------+------------------------------
Intel x86 industrial PC, 16 GB RAM       | Ollama (OpenVINO backend) or llama.cpp server | Qwen3-4B Q4_K_M or Phi-4-mini
NVIDIA Jetson AGX Orin                   | Ollama (CUDA) or TensorRT-LLM                 | Llama 3.3 8B Q4
ARM DIN-rail gateway, 8 GB RAM           | llama.cpp server (CPU)                        | Phi-4-mini Q4_K_M or SmolLM3
Industrial PC + discrete GPU (8 GB VRAM) | Ollama CUDA or llama.cpp CUDA                 | Llama 3.3 8B or Gemma 3 9B
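Because the orchestrator talks to an inference server rather than loading the model itself, swapping models becomes a configuration change. The sketch below assumes an Ollama server on its default port (11434) and uses its `/api/generate` endpoint with streaming disabled; the model tag `phi4-mini` is a placeholder, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "phi4-mini",                    # placeholder tag, set per deployment
    base_url: str = "http://localhost:11434",    # Ollama's default local address
) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs: str) -> str:
    """Send the request and return the generated text (requires a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Changing the model then means changing the `model` argument (or the server behind `base_url`), with no change to the agent code that calls `generate`.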

Vector DB / RAG Corpus

The local vector database stores embedded chunks of machine documentation, historical fault logs, standard operating procedures, and any domain knowledge the agent needs to retrieve at inference time.

Component choices named in the architecture diagram are Qdrant, ChromaDB, and Milvus.

The embedding model runs locally. Common choices: nomic-embed-text (768-dim, good quality), all-MiniLM-L6-v2 (384-dim, faster), or a domain-fine-tuned encoder.
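Retrieval at inference time comes down to a nearest-neighbour search over embedded chunks. The sketch below shows the idea with plain cosine similarity over an in-memory dictionary; it is a stand-in for what a vector database does internally, and the chunk IDs and three-dimensional vectors are toy values (real embeddings would have 384 or 768 dimensions, as noted above).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k chunk IDs most similar to the query, best match first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy corpus: hypothetical chunk IDs mapped to made-up embedding vectors.
corpus = {
    "manual_p12": [1.0, 0.0, 0.0],
    "fault_log_7": [0.9, 0.1, 0.0],
    "sop_3": [0.0, 1.0, 0.0],
}
hits = top_k([1.0, 0.0, 0.0], corpus, k=2)
```

The retrieved chunk texts would then be added to the context object that the orchestrator passes to the LLM.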

Action Layer

The action layer executes decisions made by the orchestrator. Actions are categorized by authorization level:

Action Class        | Authorization Required   | Examples
--------------------+--------------------------+-----------------------------------------------------------
Read / observe      | None                     | OPC UA read, historian query
Inform / notify     | None                     | Dashboard update, email alert, MQTT publish to status topic
Recommend           | None                     | Advisory text to operator UI
Actuate (low risk)  | Operator acknowledgment  | Write a setpoint within pre-approved bounds
Actuate (high risk) | Formal approval workflow | Shutdown sequence, mode change

This tiered authorization model aligns with IEC 62443 access control requirements and prevents autonomous actuation in safety-relevant contexts.
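One way to enforce the action classes above in code is a lookup table plus a single gate function that every action must pass. The enum names below are hypothetical, and the sketch assumes that approval levels are ordered, so that a formal approval also satisfies an operator-acknowledgment requirement.

```python
from enum import Enum


class ActionClass(Enum):
    READ = "read"
    INFORM = "inform"
    RECOMMEND = "recommend"
    ACTUATE_LOW = "actuate_low"
    ACTUATE_HIGH = "actuate_high"


class Approval(Enum):
    NONE = 0
    OPERATOR_ACK = 1
    FORMAL_WORKFLOW = 2


# Required approval per action class, mirroring the authorization tiers above.
REQUIRED = {
    ActionClass.READ: Approval.NONE,
    ActionClass.INFORM: Approval.NONE,
    ActionClass.RECOMMEND: Approval.NONE,
    ActionClass.ACTUATE_LOW: Approval.OPERATOR_ACK,
    ActionClass.ACTUATE_HIGH: Approval.FORMAL_WORKFLOW,
}


def is_permitted(action: ActionClass, granted: Approval) -> bool:
    """Allow an action only if the granted approval meets the required level.

    Assumes approvals are ordered (an assumption of this sketch, not of IEC 62443).
    """
    return granted.value >= REQUIRED[action].value
```

An agent acting with no human in the loop would call `is_permitted(action, Approval.NONE)`, which rejects both actuation classes by construction.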

State and Memory Layer

Edge agents require persistent state across restarts. A lightweight embedded database (SQLite is standard) stores:

For multi-agent setups, a shared Redis instance on the local network enables inter-agent memory sharing.
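As a rough illustration of SQLite-backed state, the sketch below keeps a decision log that survives restarts, matching the orchestrator step that persists each decision and its outcome. The table layout, column names, and default file name are assumptions for the example, not a prescribed schema.

```python
import json
import sqlite3
import time
from typing import List, Tuple


def open_store(path: str = "agent_state.db") -> sqlite3.Connection:
    """Open (or create) the local state store with a hypothetical decisions table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS decisions ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, ts REAL NOT NULL, "
        "event TEXT NOT NULL, action TEXT NOT NULL, outcome TEXT)"
    )
    conn.commit()
    return conn


def record_decision(conn: sqlite3.Connection, event: dict, action: str, outcome: str) -> int:
    """Insert one decision; the 'with' block commits on success and rolls back on error."""
    with conn:
        cur = conn.execute(
            "INSERT INTO decisions (ts, event, action, outcome) VALUES (?, ?, ?, ?)",
            (time.time(), json.dumps(event), action, outcome),
        )
    return cur.lastrowid


def recent_decisions(conn: sqlite3.Connection, limit: int = 5) -> List[Tuple[str, str]]:
    """Return the latest (action, outcome) pairs, newest first."""
    rows = conn.execute(
        "SELECT action, outcome FROM decisions ORDER BY id DESC LIMIT ?", (limit,)
    )
    return rows.fetchall()
```

Recent decisions read back from this table can be fed into the context builder as short-term agent memory.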

Model Registry

The model registry is a lightweight service (local or cloud-hosted) that tracks model versions and manifests, as shown in the architecture diagram.

The edge agent polls the registry on a configurable interval. Updates are downloaded, hash-verified, and staged before being activated during a maintenance window. This prevents mid-session model swaps that could change agent behavior unpredictably.
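The download, verify, stage, and activate sequence might look like the following sketch. It assumes the registry publishes a SHA-256 digest for each model file and that staging and active paths are on the same filesystem; the function names and the boolean maintenance-window flag are hypothetical.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large model weights are not loaded into RAM at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def stage_update(downloaded: Path, expected_sha256: str, staging_dir: Path) -> Path:
    """Verify a downloaded model file and move it into the staging directory."""
    if sha256_of(downloaded) != expected_sha256:
        downloaded.unlink()  # discard the corrupt or tampered download
        raise ValueError("hash mismatch, update rejected")
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / downloaded.name
    os.replace(downloaded, staged)
    return staged


def activate(staged: Path, active: Path, in_maintenance_window: bool) -> bool:
    """Swap the staged model in only during a maintenance window."""
    if not in_maintenance_window:
        return False
    os.replace(staged, active)  # atomic when both paths are on the same filesystem
    return True
```

Keeping verification and activation as separate steps means a verified update can wait in staging for as long as needed without affecting the running agent.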

Observability Layer

Production edge agents require observability. Key telemetry:

Lightweight options: Prometheus + Grafana (local), or structured JSON logs shipped to a central ELK stack when connectivity is available.
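For the structured-JSON-log option, Python's standard `logging` module is enough. The sketch below emits one JSON object per line; the logger name, the `fields` convention for extra data, and the example field `latency_ms` are assumptions chosen for illustration.

```python
import json
import logging
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Extra structured data is passed via extra={"fields": {...}} (hypothetical convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(stream: Optional[TextIO] = None) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger("edge_agent")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `make_logger().info("inference_done", extra={"fields": {"latency_ms": 840}})` produces a line that a log shipper can buffer locally and forward once connectivity returns.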


Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.

FAQ

How much disk space does a full edge agent stack require? A practical estimate: OS + runtime = 5–10 GB; model weights (7B Q4_K_M) = 4–5 GB; vector DB + corpus = 1–10 GB depending on documentation volume; agent code = <500 MB. These items sum to roughly 10–26 GB, so budgeting 15–30 GB on the edge device leaves some headroom. A 64 GB SSD is a comfortable minimum.
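The estimate can be checked with a few lines of arithmetic. The dictionary keys below are labels invented for this sketch; the ranges are the figures from the answer above, with "<500 MB" taken as 0 to 0.5 GB.

```python
# (low, high) estimates in GB for each line item of the disk budget.
budget_gb = {
    "os_and_runtime": (5.0, 10.0),
    "model_weights_7b_q4_k_m": (4.0, 5.0),
    "vector_db_and_corpus": (1.0, 10.0),
    "agent_code": (0.0, 0.5),
}

low = sum(lo for lo, _ in budget_gb.values())
high = sum(hi for _, hi in budget_gb.values())

# Space left on a 64 GB SSD in the worst case of the listed items.
free_on_64gb_ssd = 64.0 - high
```

The listed items alone come to between 10 and 25.5 GB, which suggests the quoted overall range already includes some headroom (an interpretation, not something the estimate states) and leaves well over half of a 64 GB SSD free.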

Does the agent orchestrator need to be multi-threaded? It needs to be concurrent, though not necessarily multi-threaded. The ingestion layer, LLM inference server, and action layer should run asynchronously so that slow inference does not block sensor monitoring. The common pattern is a Python asyncio event loop in the orchestrator, with the inference server running as a separate process.
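A small experiment shows why this matters. In the sketch below, `slow_inference` stands in for a blocking call to the inference server (simulated here with a 0.2 s sleep), and it is pushed onto a worker thread so the event loop keeps sampling sensors in the meantime. The function names and timings are illustrative assumptions.

```python
import asyncio
import time
from typing import List, Tuple


async def monitor_sensors(samples: List[float], stop: asyncio.Event, period_s: float = 0.01) -> None:
    """Record a timestamp every period_s seconds until asked to stop."""
    while not stop.is_set():
        samples.append(time.monotonic())
        await asyncio.sleep(period_s)


def slow_inference(prompt: str) -> str:
    """Stand-in for a blocking request to the local inference server."""
    time.sleep(0.2)
    return f"answer to: {prompt}"


async def main() -> Tuple[str, int]:
    samples: List[float] = []
    stop = asyncio.Event()
    monitor = asyncio.create_task(monitor_sensors(samples, stop))
    loop = asyncio.get_running_loop()
    # Run the blocking call on a worker thread so the event loop stays responsive.
    answer = await loop.run_in_executor(None, slow_inference, "why is spindle 3 vibrating?")
    stop.set()
    await monitor
    return answer, len(samples)
```

Running `asyncio.run(main())` returns the answer together with the number of sensor samples taken while inference was in flight; calling `slow_inference` directly inside the coroutine would instead freeze monitoring for the full duration.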

Can the local vector database handle millions of document chunks? Qdrant and Milvus can handle millions of vectors on a single node with moderate RAM. For typical industrial deployments (machine manuals, fault histories, SOPs), 100K–500K chunks is a realistic corpus size, which all embedded vector databases handle comfortably.

How do you handle model version drift across many edge devices? The model registry pattern addresses this. Each device reports its current model version at sync time. The registry maintains a version matrix and can flag devices that are behind. A rollout policy (e.g., canary: update 5% of devices first, monitor, then roll out) is implemented in the registry service.
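Both halves of this answer, flagging devices that are behind and picking a canary cohort, fit in a few lines. The sketch below selects canary devices by hashing the device ID, so the same devices are chosen on every run without storing a list; the device ID format and function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def in_canary(device_id: str, percent: float) -> bool:
    """Deterministically place a device in the first `percent` of a 0..9999 hash range."""
    bucket = int(hashlib.sha256(device_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < percent * 100


def devices_behind(reported: Dict[str, str], target: str) -> List[str]:
    """List devices whose reported model version differs from the target version."""
    return sorted(device for device, version in reported.items() if version != target)


# Version matrix as reported at sync time (made-up device IDs and versions).
reported = {"gw-1": "1.0", "gw-2": "1.1", "gw-3": "1.0"}
stale = devices_behind(reported, target="1.1")
first_wave = [device for device in stale if in_canary(device, percent=5)]
```

After the canary cohort has been monitored, raising `percent` widens the rollout while keeping the earlier cohort included, because each device's bucket never changes.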

What is the restart behavior if the edge node loses power mid-inference? The inference server should be configured as a system service with automatic restart. The agent orchestrator writes its task state durably to SQLite before invoking actions, so after a restart an interrupted inference is retried as the current task rather than lost.
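A minimal task journal illustrates the retry behavior. In this sketch a task is written as `pending` before any work starts and marked `done` only after it completes, so whatever is still pending after a restart is retried. The table layout and function names are assumptions for the example.

```python
import sqlite3
from typing import List, Tuple


def open_journal(path: str) -> sqlite3.Connection:
    """Open (or create) a hypothetical task journal in SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, payload: str) -> int:
    """Durably record a task before inference starts; returns its ID."""
    with conn:
        return conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,)).lastrowid


def mark_done(conn: sqlite3.Connection, task_id: int) -> None:
    """Mark a task complete once its actions have been carried out."""
    with conn:
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))


def tasks_to_retry(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """On startup, return every task that never reached 'done', oldest first."""
    rows = conn.execute(
        "SELECT id, payload FROM tasks WHERE status = 'pending' ORDER BY id"
    )
    return rows.fetchall()
```

Because a retried task may have partially executed before the power loss, actions in the actuate classes should be written so that repeating them is safe.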