Local LLMs for Edge Devices: 2026 Practical Guide

Last reviewed: 2026-05-22 · Marcus Rüb

In 2026, quantized models in the 1B–9B parameter range — particularly Phi-4-mini, Qwen3-4B, Gemma 3 9B, and Llama 3.3 8B — are the practical ceiling for most industrial edge hardware, delivering useful reasoning quality for scoped tasks while fitting within the memory and power budgets of edge devices.

This guide is honest about where the limits are. Running a 7B quantized model on an industrial edge gateway is real and production-viable. Expecting GPT-4-class reasoning from that same model is not. The goal here is to give you the information you need to make a correct hardware–model–use-case decision, not to oversell what edge inference can do today.

What Are the Constraints on Edge LLMs?

Four physical constraints shape every edge LLM deployment:

ConstraintWhy It MattersTypical Limit (Industrial Edge)
RAMModel weights must fit in RAM (or VRAM)8–16 GB on mid-tier industrial PCs
Compute (TOPS)Inference speed; tokens/second20–275 TOPS on GPU-class edge hardware
Power budgetThermal management, UPS sizing15–65W for fanless; up to 275W for AI servers
StorageModel weights at rest4B Q4 model ≈ 2.5 GB; 7B Q4 ≈ 4 GB

A practical rule: a Q4_K_M quantized model requires approximately 0.5–0.6 GB RAM per billion parameters. A 7B model at Q4_K_M is approximately 4 GB. With 8 GB RAM on an industrial PC, you have ~3.5 GB left for OS, agent runtime, and vector database — tight but workable if you choose carefully.

Which Models Work at the Edge in 2026?

Phi-4-mini (Microsoft)

Qwen3-4B (Alibaba / Tongyi)

Gemma 3 9B (Google DeepMind)

Llama 3.3 8B (Meta)

SmolLM3 / SmolLM2 (Hugging Face)

Which Inference Engine Should You Choose?

EngineBest ForHardware SupportLLM Support
llama.cpp / GGUFCPU + GPU hybrid, broad model supportx86, ARM, CUDA, Metal, OpenVINO backendAll major open-source LLMs
OllamaDeveloper-friendly; wraps llama.cpp with REST APIx86, ARM, NVIDIA GPUAll GGUF-compatible models
ONNX RuntimeCross-platform; NPU acceleration; Microsoft ecosystemCPU, CUDA, TensorRT, OpenVINO, QNN, CoreMLModels with ONNX export; best for Phi-4
TensorRT-LLMMaximum throughput on NVIDIA hardwareNVIDIA GPU onlyMajor open LLMs with TRT support
OpenVINOIntel CPU/GPU/NPU optimization; strong on x86 industrial PCsIntel CPU, GPU, NPU, XeBroad; integrated as llama.cpp backend
LiteRT-LM (Google, 2026)Mobile and edge Android/Linux deploymentARM, NPUGemma family; expanding

For most industrial edge deployments on x86 industrial PCs, OpenVINO + llama.cpp is the recommended path. For NVIDIA Jetson hardware, TensorRT-LLM or Ollama with CUDA is preferred. For DIN-rail ARM gateways, llama.cpp with CPU backend is the practical choice.

What Is Quantization and What Does It Cost You?

Quantization reduces the numerical precision of model weights, shrinking the model file and enabling faster inference at the cost of some accuracy.

Quantization LevelApprox. Size (7B model)Accuracy Loss vs. BF16Use Case
BF16 (full precision)~14 GBBaselineDevelopment only; too large for most edge hardware
Q8_0~7 GB<0.5 MMLU pointsEdge servers with 16 GB VRAM
Q4_K_M (recommended)~4 GB1–3 MMLU pointsStandard industrial edge choice
Q4_0~3.8 GB2–4 MMLU pointsAcceptable if RAM is the bottleneck
Q2_K~2.7 GB5–10 MMLU pointsLast resort on very constrained hardware

Q4_K_M is the standard recommendation. The “K_M” suffix refers to k-quants, a mixed-precision scheme in llama.cpp that applies higher precision to attention layers and lower precision to FFN layers, giving better quality than naive Q4_0 at the same file size.

What About Fine-Tuning for Industrial Domains?

General-purpose quantized LLMs are a starting point. For production industrial use cases, fine-tuning on domain-specific data (machine manuals, fault codes, historical event logs) improves performance significantly for narrow tasks. Typical approaches:

RAG is preferred for small teams and frequently updated knowledge bases. Fine-tuning is preferred for structured output formats or tasks where RAG retrieval is too slow.


Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.

FAQ

Can I run a 13B model on a Jetson Orin NX? The Jetson Orin NX has 8–16 GB of unified memory. A 13B model at Q4_K_M is approximately 7.5 GB. With 16 GB RAM, you can run it, but you will have little headroom for other services. Inference throughput will be limited (5–15 tokens/second). For production use, a 7B model on the Orin NX is a more comfortable configuration.

Is Ollama production-ready for industrial deployments? Ollama is excellent for development and prototyping. For production industrial deployments, consider running llama.cpp server directly (more configuration control, no extra abstraction layer) or a purpose-built serving stack. Ollama’s update cycle may not align with the slow-change requirements of OT environments.

What is LiteRT-LM and is it relevant for industrial use? LiteRT-LM is Google’s high-performance, open-source inference framework for edge LLMs, launched in early 2026. It is primarily optimized for Android and Linux ARM devices. It is relevant for industrial edge deployments on ARM-based gateways running Linux, particularly for the Gemma model family.

Do I need a GPU at the edge for LLM inference? Not necessarily. A Q4_K_M 4B model running on a modern x86 industrial PC with AVX-512 support (e.g., Intel Core Ultra) can achieve 8–20 tokens/second on CPU alone — sufficient for non-interactive advisory generation. GPU or NPU acceleration is needed when you require sub-second first-token latency or higher throughput.