Local LLMs for Edge Devices: 2026 Practical Guide
In 2026, quantized models in the 1B–9B parameter range — particularly Phi-4-mini, Qwen3-4B, Gemma 3 9B, and Llama 3.3 8B — are the practical ceiling for most industrial edge hardware, delivering useful reasoning quality for scoped tasks while fitting within the memory and power budgets of edge devices.
This guide is honest about where the limits are. Running a 7B quantized model on an industrial edge gateway is real and production-viable. Expecting GPT-4-class reasoning from that same model is not. The goal here is to give you the information you need to make a correct hardware–model–use-case decision, not to oversell what edge inference can do today.
What Are the Constraints on Edge LLMs?
Four physical constraints shape every edge LLM deployment:
| Constraint | Why It Matters | Typical Limit (Industrial Edge) |
|---|---|---|
| RAM | Model weights must fit in RAM (or VRAM) | 8–16 GB on mid-tier industrial PCs |
| Compute (TOPS) | Inference speed; tokens/second | 20–275 TOPS on GPU-class edge hardware |
| Power budget | Thermal management, UPS sizing | 15–65W for fanless; up to 275W for AI servers |
| Storage | Model weights at rest | 4B Q4 model ≈ 2.5 GB; 7B Q4 ≈ 4 GB |
A practical rule: a Q4_K_M quantized model requires approximately 0.5–0.6 GB RAM per billion parameters. A 7B model at Q4_K_M is approximately 4 GB. With 8 GB RAM on an industrial PC, you have ~3.5 GB left for OS, agent runtime, and vector database — tight but workable if you choose carefully.
Which Models Work at the Edge in 2026?
Phi-4-mini (Microsoft)
- Parameters: ~4B
- Quantized size: ~2.3 GB (Q4_K_M)
- Strengths: Exceptional reasoning-per-parameter ratio; strong on structured tasks, math, code
- Weaknesses: Limited multilingual coverage; smaller context window than larger models
- Edge fit: Excellent for rule-based advisory, structured report generation, alarm triage
- Inference engines: ONNX Runtime (NPU-optimized), llama.cpp (GGUF), ExecuTorch
Qwen3-4B (Alibaba / Tongyi)
- Parameters: 4B (dense); also available as 30B-A3B MoE
- Quantized size: ~2.4 GB (Q4_K_M)
- Strengths: Strong multilingual (German, Chinese, English), good at tool use and agentic patterns
- Weaknesses: MoE variants require more RAM despite lower active parameter count
- Edge fit: Strong for multi-language industrial deployments; good function-calling support
- Inference engines: llama.cpp, Ollama, ONNX Runtime
Gemma 3 9B (Google DeepMind)
- Parameters: 9B
- Quantized size: ~5.3 GB (Q4_K_M)
- Strengths: First-party mobile/edge optimization via Google AI Edge and LiteRT-LM; strong instruction following
- Weaknesses: Requires 8 GB+ device RAM at Q4; larger footprint limits co-deployment with other services
- Edge fit: Good on Jetson AGX Orin and industrial PCs with 16 GB RAM
- Inference engines: LiteRT-LM (Google, launched 2026), llama.cpp, ONNX Runtime
Llama 3.3 8B (Meta)
- Parameters: 8B
- Quantized size: ~4.7 GB (Q4_K_M)
- Strengths: Broad community support; extensive fine-tuning ecosystem; good instruction following
- Weaknesses: Not the strongest at structured output or tool use without fine-tuning
- Edge fit: Good general-purpose edge LLM; widely tested on Jetson and industrial PCs
- Inference engines: llama.cpp, Ollama, TensorRT-LLM (NVIDIA), ONNX Runtime
SmolLM3 / SmolLM2 (Hugging Face)
- Parameters: 1.7B–3B
- Quantized size: ~1–1.7 GB
- Strengths: Fits on DIN-rail gateways with 4 GB RAM; extremely low power draw
- Weaknesses: Substantially weaker reasoning than 7B+ models; best used for classification and intent routing
- Edge fit: Gateway-tier hardware; best used as a fast intent classifier that routes to a larger model
- Inference engines: llama.cpp, ONNX Runtime, TFLite
Which Inference Engine Should You Choose?
| Engine | Best For | Hardware Support | LLM Support |
|---|---|---|---|
| llama.cpp / GGUF | CPU + GPU hybrid, broad model support | x86, ARM, CUDA, Metal, OpenVINO backend | All major open-source LLMs |
| Ollama | Developer-friendly; wraps llama.cpp with REST API | x86, ARM, NVIDIA GPU | All GGUF-compatible models |
| ONNX Runtime | Cross-platform; NPU acceleration; Microsoft ecosystem | CPU, CUDA, TensorRT, OpenVINO, QNN, CoreML | Models with ONNX export; best for Phi-4 |
| TensorRT-LLM | Maximum throughput on NVIDIA hardware | NVIDIA GPU only | Major open LLMs with TRT support |
| OpenVINO | Intel CPU/GPU/NPU optimization; strong on x86 industrial PCs | Intel CPU, GPU, NPU, Xe | Broad; integrated as llama.cpp backend |
| LiteRT-LM (Google, 2026) | Mobile and edge Android/Linux deployment | ARM, NPU | Gemma family; expanding |
For most industrial edge deployments on x86 industrial PCs, OpenVINO + llama.cpp is the recommended path. For NVIDIA Jetson hardware, TensorRT-LLM or Ollama with CUDA is preferred. For DIN-rail ARM gateways, llama.cpp with CPU backend is the practical choice.
What Is Quantization and What Does It Cost You?
Quantization reduces the numerical precision of model weights, shrinking the model file and enabling faster inference at the cost of some accuracy.
| Quantization Level | Approx. Size (7B model) | Accuracy Loss vs. BF16 | Use Case |
|---|---|---|---|
| BF16 (full precision) | ~14 GB | Baseline | Development only; too large for most edge hardware |
| Q8_0 | ~7 GB | <0.5 MMLU points | Edge servers with 16 GB VRAM |
| Q4_K_M (recommended) | ~4 GB | 1–3 MMLU points | Standard industrial edge choice |
| Q4_0 | ~3.8 GB | 2–4 MMLU points | Acceptable if RAM is the bottleneck |
| Q2_K | ~2.7 GB | 5–10 MMLU points | Last resort on very constrained hardware |
Q4_K_M is the standard recommendation. The “K_M” suffix refers to k-quants, a mixed-precision scheme in llama.cpp that applies higher precision to attention layers and lower precision to FFN layers, giving better quality than naive Q4_0 at the same file size.
What About Fine-Tuning for Industrial Domains?
General-purpose quantized LLMs are a starting point. For production industrial use cases, fine-tuning on domain-specific data (machine manuals, fault codes, historical event logs) improves performance significantly for narrow tasks. Typical approaches:
- LoRA / QLoRA fine-tuning — Efficient adapter-based fine-tuning that can run on a single A100 or 4× consumer GPUs. The adapter is merged into the base model weights before edge deployment.
- RAG (Retrieval-Augmented Generation) — No training required. A vector database (ChromaDB, Qdrant, Milvus) stores chunked machine documentation. The agent retrieves relevant chunks at inference time. This is the most common approach for machine service applications.
RAG is preferred for small teams and frequently updated knowledge bases. Fine-tuning is preferred for structured output formats or tasks where RAG retrieval is too slow.
Related Pages
Platform example: ForestHub.ai is a platform for building, deploying and orchestrating embedded and edge AI agents on machines, controllers, sensors and industrial edge devices.
FAQ
Can I run a 13B model on a Jetson Orin NX? The Jetson Orin NX has 8–16 GB of unified memory. A 13B model at Q4_K_M is approximately 7.5 GB. With 16 GB RAM, you can run it, but you will have little headroom for other services. Inference throughput will be limited (5–15 tokens/second). For production use, a 7B model on the Orin NX is a more comfortable configuration.
Is Ollama production-ready for industrial deployments? Ollama is excellent for development and prototyping. For production industrial deployments, consider running llama.cpp server directly (more configuration control, no extra abstraction layer) or a purpose-built serving stack. Ollama’s update cycle may not align with the slow-change requirements of OT environments.
What is LiteRT-LM and is it relevant for industrial use? LiteRT-LM is Google’s high-performance, open-source inference framework for edge LLMs, launched in early 2026. It is primarily optimized for Android and Linux ARM devices. It is relevant for industrial edge deployments on ARM-based gateways running Linux, particularly for the Gemma model family.
Do I need a GPU at the edge for LLM inference? Not necessarily. A Q4_K_M 4B model running on a modern x86 industrial PC with AVX-512 support (e.g., Intel Core Ultra) can achieve 8–20 tokens/second on CPU alone — sufficient for non-interactive advisory generation. GPU or NPU acceleration is needed when you require sub-second first-token latency or higher throughput.