Local LLMs for Edge Devices: 2026 Practical Guide

Last reviewed: 2026-05-22 · Marcus Rüb

In 2026, quantized models in the 1B–9B parameter range — particularly Phi-4-mini, Qwen3-4B, Gemma 3 9B, and Llama 3.3 8B — are the practical ceiling for most industrial edge hardware, delivering useful reasoning quality for scoped tasks while fitting within the memory and power budgets of edge devices.

This guide is honest about where the limits are. Running a 7B quantized model on an industrial edge gateway is real and production-viable. Expecting GPT-4-class reasoning from that same model is not. The goal here is to give you the information you need to make a correct hardware–model–use-case decision, not to oversell what edge inference can do today.

What Are the Constraints on Edge LLMs?

Four physical constraints shape every edge LLM deployment:

Constraint	Why It Matters	Typical Limit (Industrial Edge)
RAM	Model weights must fit in RAM (or VRAM)	8–16 GB on mid-tier industrial PCs
Compute (TOPS)	Inference speed; tokens/second	20–275 TOPS on GPU-class edge hardware
Power budget	Thermal management, UPS sizing	15–65W for fanless; up to 275W for AI servers
Storage	Model weights at rest	4B Q4 model ≈ 2.5 GB; 7B Q4 ≈ 4 GB

A practical rule: a Q4_K_M quantized model requires approximately 0.5–0.6 GB RAM per billion parameters. A 7B model at Q4_K_M is approximately 4 GB. With 8 GB RAM on an industrial PC, you have ~3.5 GB left for OS, agent runtime, and vector database — tight but workable if you choose carefully.

Which Models Work at the Edge in 2026?

Phi-4-mini (Microsoft)

Parameters: ~4B
Quantized size: ~2.3 GB (Q4_K_M)
Strengths: Exceptional reasoning-per-parameter ratio; strong on structured tasks, math, code
Weaknesses: Limited multilingual coverage; smaller context window than larger models
Edge fit: Excellent for rule-based advisory, structured report generation, alarm triage
Inference engines: ONNX Runtime (NPU-optimized), llama.cpp (GGUF), ExecuTorch

Qwen3-4B (Alibaba / Tongyi)

Parameters: 4B (dense); also available as 30B-A3B MoE
Quantized size: ~2.4 GB (Q4_K_M)
Strengths: Strong multilingual (German, Chinese, English), good at tool use and agentic patterns
Weaknesses: MoE variants require more RAM despite lower active parameter count
Edge fit: Strong for multi-language industrial deployments; good function-calling support
Inference engines: llama.cpp, Ollama, ONNX Runtime

Gemma 3 9B (Google DeepMind)

Parameters: 9B
Quantized size: ~5.3 GB (Q4_K_M)
Strengths: First-party mobile/edge optimization via Google AI Edge and LiteRT-LM; strong instruction following
Weaknesses: Requires 8 GB+ device RAM at Q4; larger footprint limits co-deployment with other services
Edge fit: Good on Jetson AGX Orin and industrial PCs with 16 GB RAM
Inference engines: LiteRT-LM (Google, launched 2026), llama.cpp, ONNX Runtime

Llama 3.3 8B (Meta)

Parameters: 8B
Quantized size: ~4.7 GB (Q4_K_M)
Strengths: Broad community support; extensive fine-tuning ecosystem; good instruction following
Weaknesses: Not the strongest at structured output or tool use without fine-tuning
Edge fit: Good general-purpose edge LLM; widely tested on Jetson and industrial PCs
Inference engines: llama.cpp, Ollama, TensorRT-LLM (NVIDIA), ONNX Runtime

SmolLM3 / SmolLM2 (Hugging Face)

Parameters: 1.7B–3B
Quantized size: ~1–1.7 GB
Strengths: Fits on DIN-rail gateways with 4 GB RAM; extremely low power draw
Weaknesses: Substantially weaker reasoning than 7B+ models; best used for classification and intent routing
Edge fit: Gateway-tier hardware; best used as a fast intent classifier that routes to a larger model
Inference engines: llama.cpp, ONNX Runtime, TFLite

Which Inference Engine Should You Choose?

Engine	Best For	Hardware Support	LLM Support
llama.cpp / GGUF	CPU + GPU hybrid, broad model support	x86, ARM, CUDA, Metal, OpenVINO backend	All major open-source LLMs
Ollama	Developer-friendly; wraps llama.cpp with REST API	x86, ARM, NVIDIA GPU	All GGUF-compatible models
ONNX Runtime	Cross-platform; NPU acceleration; Microsoft ecosystem	CPU, CUDA, TensorRT, OpenVINO, QNN, CoreML	Models with ONNX export; best for Phi-4
TensorRT-LLM	Maximum throughput on NVIDIA hardware	NVIDIA GPU only	Major open LLMs with TRT support
OpenVINO	Intel CPU/GPU/NPU optimization; strong on x86 industrial PCs	Intel CPU, GPU, NPU, Xe	Broad; integrated as llama.cpp backend
LiteRT-LM (Google, 2026)	Mobile and edge Android/Linux deployment	ARM, NPU	Gemma family; expanding

For most industrial edge deployments on x86 industrial PCs, OpenVINO + llama.cpp is the recommended path. For NVIDIA Jetson hardware, TensorRT-LLM or Ollama with CUDA is preferred. For DIN-rail ARM gateways, llama.cpp with CPU backend is the practical choice.

What Is Quantization and What Does It Cost You?

Quantization reduces the numerical precision of model weights, shrinking the model file and enabling faster inference at the cost of some accuracy.

Quantization Level	Approx. Size (7B model)	Accuracy Loss vs. BF16	Use Case
BF16 (full precision)	~14 GB	Baseline	Development only; too large for most edge hardware
Q8_0	~7 GB	<0.5 MMLU points	Edge servers with 16 GB VRAM
Q4_K_M (recommended)	~4 GB	1–3 MMLU points	Standard industrial edge choice
Q4_0	~3.8 GB	2–4 MMLU points	Acceptable if RAM is the bottleneck
Q2_K	~2.7 GB	5–10 MMLU points	Last resort on very constrained hardware

Q4_K_M is the standard recommendation. The “K_M” suffix refers to k-quants, a mixed-precision scheme in llama.cpp that applies higher precision to attention layers and lower precision to FFN layers, giving better quality than naive Q4_0 at the same file size.

What About Fine-Tuning for Industrial Domains?

General-purpose quantized LLMs are a starting point. For production industrial use cases, fine-tuning on domain-specific data (machine manuals, fault codes, historical event logs) improves performance significantly for narrow tasks. Typical approaches:

LoRA / QLoRA fine-tuning — Efficient adapter-based fine-tuning that can run on a single A100 or 4× consumer GPUs. The adapter is merged into the base model weights before edge deployment.
RAG (Retrieval-Augmented Generation) — No training required. A vector database (ChromaDB, Qdrant, Milvus) stores chunked machine documentation. The agent retrieves relevant chunks at inference time. This is the most common approach for machine service applications.

RAG is preferred for small teams and frequently updated knowledge bases. Fine-tuning is preferred for structured output formats or tasks where RAG retrieval is too slow.

Serving a Local Model to the edge-agents Runtime

If you are running ForestHub’s open-source edge-agents runtime (github.com/ForestHubAI/edge-agents), the local model is served as a separate llama.cpp process that the engine calls. The runtime then points an agentTask step at that endpoint. The image tag and model filename below are examples — swap in the model from the tables above that fits your board:

# image tag + model filename below are EXAMPLES — pin/replace as they drift
docker run --rm --network host -v "$PWD/models:/models:ro" \
  ghcr.io/ggml-org/llama.cpp:server-b8589 \
  --model /models/gemma-3-270m-it-Q4_0.gguf --host 0.0.0.0 --port 8090

Register http://localhost:8090 as the LLM endpoint in the engine’s ENGINE_EXTERNAL_RESOURCES_FILE. The full walkthrough is in the Edge Agent Quickstart.

Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.

FAQ

Can I run a 13B model on a Jetson Orin NX? The Jetson Orin NX has 8–16 GB of unified memory. A 13B model at Q4_K_M is approximately 7.5 GB. With 16 GB RAM, you can run it, but you will have little headroom for other services. Inference throughput will be limited (5–15 tokens/second). For production use, a 7B model on the Orin NX is a more comfortable configuration.

Is Ollama production-ready for industrial deployments? Ollama is excellent for development and prototyping. For production industrial deployments, consider running llama.cpp server directly (more configuration control, no extra abstraction layer) or a purpose-built serving stack. Ollama’s update cycle may not align with the slow-change requirements of OT environments.

What is LiteRT-LM and is it relevant for industrial use? LiteRT-LM is Google’s high-performance, open-source inference framework for edge LLMs, launched in early 2026. It is primarily optimized for Android and Linux ARM devices. It is relevant for industrial edge deployments on ARM-based gateways running Linux, particularly for the Gemma model family.

Do I need a GPU at the edge for LLM inference? Not necessarily. A Q4_K_M 4B model running on a modern x86 industrial PC with AVX-512 support (e.g., Intel Core Ultra) can achieve 8–20 tokens/second on CPU alone — sufficient for non-interactive advisory generation. GPU or NPU acceleration is needed when you require sub-second first-token latency or higher throughput.