When running large language models in Ollama, the difference between CPU and GPU performance is significant. The best choice depends on your hardware, model size, and intended workload.
CPU vs GPU Comparison
Feature | CPU | GPU
Inference Speed | Slow | Much faster
Parallel Processing | Limited | Thousands of cores
Memory | Uses system RAM | Uses VRAM
Power Efficiency | Lower for AI workloads | Higher for AI workloads
Cost | No extra hardware | Requires dedicated GPU
Best For | Small models, testing | Large models, production
CPU Performance
Advantages
No dedicated graphics card required.
Uses available system RAM.
Suitable for development, small language models, and learning Ollama.
Disadvantages
Much slower token generation.
High CPU usage.
Less efficient with models larger than 7B.
Example:
ollama run llama3
If no compatible GPU is detected, Ollama automatically runs on the CPU.
Typical speeds:
Model | CPU Tokens/sec
3B | 10–25
7B | 5–15
13B | 2–8
70B | Often impractical
Performance depends heavily on:
CPU generation
Number of cores
Memory bandwidth
Quantization level
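As a rough illustration of why memory bandwidth and quantization matter, token generation is often treated as bandwidth-bound: each new token requires reading roughly all model weights once. The sketch below applies that rule of thumb. The function name and the example figures (about 50 GB/s of RAM bandwidth, about 4 GB for a 4-bit 7B model) are assumptions for illustration, not measurements, and real speeds will usually fall below this ceiling.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough ceiling on generation speed, assuming every generated token
    reads all model weights from memory exactly once."""
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # Hypothetical desktop: ~50 GB/s RAM bandwidth, ~4 GB quantized 7B model.
    print(max_tokens_per_sec(50.0, 4.0))  # 12.5
```

Under these assumptions, halving the model size through stronger quantization doubles the ceiling, which is why quantization level appears in the list above.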
GPU Performance
GPUs are designed for massive parallel computation, making them ideal for transformer inference.
Advantages
Typically 5×–20× faster inference than CPU-only execution.
Lower latency.
Better for coding assistants, chatbots, long context windows, and concurrent users.
Typical speeds:
GPU | Speed with a 7B model
RTX 3060 12GB | 45–70 tokens/sec
RTX 4070 | 70–120 tokens/sec
RTX 4090 | 120–250+ tokens/sec
Apple M3 Max | 50–100 tokens/sec
VRAM Requirements
Approximate VRAM needed:
Model | VRAM
3B | 2–4 GB
7B | 4–8 GB
13B | 8–12 GB
32B | 20–24 GB
70B | 40–48+ GB
Quantized models (Q4, Q5, Q6) require less memory.
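The effect of quantization on memory can be approximated with simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below is a minimal estimate under stated assumptions. The 20% overhead for context cache and runtime buffers is an assumed figure, and the function names are hypothetical, so treat the results as ballpark values only.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_fraction: float = 0.2) -> float:
    """Weights plus an assumed fractional overhead for cache and buffers."""
    weights = estimate_weights_gb(params_billion, bits_per_weight)
    return weights * (1 + overhead_fraction)


if __name__ == "__main__":
    # Compare a 7B model at 16-bit and at an effective 4 bits per weight.
    for bits in (16, 4):
        print(f"7B at {bits}-bit: about {estimate_vram_gb(7, bits):.1f} GB")
```

With these assumptions, a 7B model drops from roughly 17 GB at 16-bit to roughly 4 GB at 4-bit.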
CPU Bottlenecks
CPU inference is limited by:
Matrix multiplication speed
Cache size
Memory bandwidth
Thread scheduling
Even a 16-core CPU cannot match the throughput of a modern GPU for LLM inference.
GPU Bottlenecks
GPU performance depends on:
VRAM capacity
Memory bandwidth
CUDA cores (NVIDIA)
Metal Performance Shaders (Apple Silicon)
ROCm support (AMD)
If the model exceeds available VRAM, Ollama may offload layers to system RAM, reducing performance.
Mixed CPU + GPU Execution
Ollama can split computation between GPU and CPU when necessary.
Benefits:
Allows running models larger than available VRAM.
Drawbacks:
Slower than full GPU execution.
Increased latency due to data transfer between system RAM and VRAM.
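To get a feel for how a split might look, the sketch below estimates how many layers fit in VRAM if all layers are assumed to be the same size. This is a simplified model for illustration, not Ollama's actual placement logic. The function name, the 1 GB reserve, and the example figures are all assumptions.

```python
import math


def gpu_layer_count(model_size_gb: float, n_layers: int,
                    free_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after setting
    aside an assumed reserve for context cache and buffers."""
    if model_size_gb <= 0 or n_layers <= 0:
        raise ValueError("model size and layer count must be positive")
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    fitting = math.floor(usable_gb * n_layers / model_size_gb)
    return min(n_layers, fitting)


if __name__ == "__main__":
    # Hypothetical: 8 GB model with 40 layers on a GPU with 6 GB free.
    on_gpu = gpu_layer_count(8.0, 40, 6.0)
    print(f"{on_gpu} layers on GPU, {40 - on_gpu} on CPU")  # 25 on GPU, 15 on CPU
```

The more layers that remain on the CPU, the closer overall speed falls toward CPU-only performance.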
Example Performance
Prompt:
"Explain quantum computing in 500 words."
Approximate response times:
Hardware | Response time
Intel Core i7 (CPU only) | 20–40 seconds
RTX 3060 | 4–8 seconds
RTX 4070 | 2–5 seconds
RTX 4090 | 1–3 seconds
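Figures like these vary by system, so it can help to measure your own hardware. Ollama's /api/generate response reports eval_count (tokens generated) and eval_duration (in nanoseconds), from which tokens per second can be derived. The sketch below assumes a response already parsed into a dictionary. The helper name and the sample values are made up for illustration.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response dictionary.

    eval_count is the number of generated tokens and eval_duration is the
    time spent generating them, in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / (eval_duration_ns / 1e9)


if __name__ == "__main__":
    # Made-up sample: 600 tokens generated in 10 seconds.
    sample = {"eval_count": 600, "eval_duration": 10_000_000_000}
    print(tokens_per_second(sample))  # 60.0
```

Running the same prompt on each machine and comparing this value gives a more reliable comparison than wall-clock time, which also includes model loading and prompt processing.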
Which Should You Choose?
Use Case | Recommendation
Learning Ollama | CPU is sufficient
Casual chatting | CPU or entry-level GPU
Software development | GPU preferred
Large code generation | GPU strongly recommended
Running 13B+ models | GPU
Multi-user API server | High-end GPU
Production deployments | GPU