
Local AI / LLM

Running large language models locally on your own hardware, without sending data to cloud services. Performance is determined almost entirely by GPU VRAM: if the model doesn’t fit in VRAM, inference speed drops from roughly 30–50 tokens/sec to 1–5 tokens/sec.

📋 TABLE OF CONTENTS

  1. What It Is
  2. Key Hardware Requirements
  3. People Also Ask

WHAT IT IS

Local AI / LLM refers to running AI models like LLaMA 3, Mistral, Mixtral, or Phi directly on your own computer rather than via a cloud API. This enables full privacy, no subscription costs, and offline use. Popular tools include Ollama, LM Studio, and llama.cpp.

Category: AI / Machine Learning  |  Common models: LLaMA 3, Mistral, Mixtral, Phi-3, Qwen  |  Tools: Ollama, LM Studio, llama.cpp
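
For a concrete sense of how this works in practice, below is a minimal sketch that queries a model served by a local Ollama instance on its default port (11434). The model name, prompt, and timeout are illustrative, and it assumes "ollama pull llama3" has already been run.

# Minimal sketch: prompting a locally served model via Ollama's REST API.
# Assumes Ollama is running on its default port and the "llama3" model
# has been pulled. No data leaves your machine.
import requests

def ask_local_llm(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to the local Ollama API and return the full reply."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # large models can take a while to respond
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask_local_llm("Explain VRAM in one sentence."))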

KEY HARDWARE REQUIREMENTS

  • GPU (VRAM): The critical bottleneck. 7B models need ~5–6 GB of VRAM at Q4; 70B models need 40+ GB. The model must fit entirely in VRAM for fast GPU inference (see the sizing sketch after this list).
  • RAM: 32 GB for 7B models; 64 GB for 13B–34B models with partial CPU offload; 128 GB for 70B models with heavy CPU offload.
  • CPU: Handles tokenisation and any layers offloaded from the GPU. A faster CPU reduces latency whenever part of the model runs in system RAM.
  • Storage: 1–4 TB NVMe SSD. A single 7B model file is ~4–5 GB, a 70B model is 40+ GB, and a library of several models adds up quickly.
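
The VRAM figures above follow from simple arithmetic: parameter count times bits per weight, plus some runtime overhead. The sketch below encodes that rule of thumb; the 4.5-bit figure for Q4 and the 1.2x overhead factor are assumptions rather than exact values, since real usage also depends on context length and the KV cache.

# Back-of-the-envelope VRAM estimate for a quantised model. These are
# rules of thumb only: real usage adds the KV cache (grows with context
# length) and runtime overhead, roughly covered by the fudge factor.

def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Approximate VRAM needed to hold the model weights, in GB."""
    weight_gb = params_billion * bits_per_weight / 8  # 8 bits per byte
    return weight_gb * overhead

for label, params, bits in [("7B @ Q4", 7, 4.5), ("13B @ Q4", 13, 4.5),
                            ("70B @ Q4", 70, 4.5), ("7B @ FP16", 7, 16)]:
    print(f"{label}: ~{estimate_vram_gb(params, bits):.1f} GB")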

→ Read the full Local AI / LLM PC Requirements guide

PEOPLE ALSO ASK

What is quantisation in LLMs?

Quantisation reduces the precision of model weights to lower-bit formats such as Q4 or Q8 to shrink VRAM requirements. Q4 cuts memory to roughly a quarter of FP16 (about 4 bits per weight instead of 16) with only a modest quality reduction; Q8 roughly halves it.
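
The arithmetic behind that claim is straightforward. This illustrative snippet works it through for a 7B model; the parameter count is just an example.

# Worked example of the Q4-vs-FP16 memory arithmetic. An FP16 weight
# takes 2 bytes; a Q4 weight takes roughly half a byte.
PARAMS = 7e9  # a 7B-parameter model, for example

fp16_gb = PARAMS * 2 / 1e9    # 16 bits = 2 bytes per weight -> ~14 GB
q4_gb = PARAMS * 0.5 / 1e9    # 4 bits = 0.5 bytes per weight -> ~3.5 GB
print(f"FP16: ~{fp16_gb:.0f} GB, Q4: ~{q4_gb:.1f} GB, "
      f"{fp16_gb / q4_gb:.0f}x smaller")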

What is CPU offloading in local LLM inference?

CPU offloading means running some model layers on the CPU, in system RAM, because the full model does not fit in VRAM. The more layers offloaded to the CPU, the slower inference gets: from ~30–50 tokens/sec fully on the GPU to 1–5 tokens/sec with heavy CPU offloading.
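
In practice, offloading is usually a single knob. Below is a sketch using llama-cpp-python (the Python bindings for llama.cpp); the model path is a placeholder, and n_gpu_layers controls how many transformer layers live in VRAM, with the remainder running on the CPU.

# Sketch: controlling the GPU/CPU layer split with llama-cpp-python.
# The model path is a placeholder for a GGUF file you have downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # -1 = all layers on the GPU (fastest, needs enough VRAM)
    # n_gpu_layers=20  # partial offload: 20 layers on GPU, the rest on CPU
)
out = llm("Q: What is CPU offloading? A:", max_tokens=64)
print(out["choices"][0]["text"])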

Need a PC built for local AI?

We build high-VRAM AI workstations with RTX 40 series GPUs, 64–128 GB of RAM, and fast NVMe storage, ready to run LLMs fully offline.

🛠️ Build a Custom PC →

RELATED TERMS & READING