Licensed for use in conjunction with basebox only.
LLM Recommendations for On-Premises Deployment
Quick reference for selecting models based on available GPU hardware.
Methodology
Disclaimer: This list does not replace testing and verification on the target hardware.
System Requirements
The inference engine (vLLM) requires:
- NVIDIA GPU with compute capability 7.0+ (Volta architecture or newer)
- NVIDIA Driver (based on CUDA 12.9 compatibility):
  - Linux: version 575.51.03 or newer
  - Windows: version 576.02 or newer
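The driver minimums above can be checked programmatically. The sketch below is an illustration only, assuming driver versions are available as dotted strings (in practice you would obtain them from `nvidia-smi` or the NVML bindings); the function names are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Split a driver version string like '575.51.03' into comparable integers."""
    return tuple(int(part) for part in version.split("."))

# Minimum driver versions from the requirements above (CUDA 12.9 baseline).
MIN_DRIVER = {"linux": "575.51.03", "windows": "576.02"}

def meets_driver_requirement(installed: str, platform: str) -> bool:
    """True if the installed NVIDIA driver satisfies the platform minimum.

    Relies on Python's element-wise tuple comparison, so '576.02' is
    correctly treated as newer than '575.51.03'.
    """
    return parse_version(installed) >= parse_version(MIN_DRIVER[platform])
```

For example, a Linux host on driver 570.x fails the check, while 576.x passes on either platform.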
Hardware Tiers
| Tier | Example GPUs | Total VRAM | Typical Use |
|---|---|---|---|
| Enterprise | 4x H100 | 320 GB | High-volume RAG, multiple concurrent users |
| Professional | 4x L40S | 192 GB | Medium RAG workloads, team usage |
| Workstation | 2x RTX 4090 | 48 GB | Light RAG, single-user scenarios |
| Entry | 1x L4 or 1x 24GB GPU | 24 GB | Testing, low-volume usage |
Recommended Models by Hardware
TP (Tensor Parallelism): Splits model layers across multiple GPUs. TP=2 uses 2 GPUs, TP=4 uses 4 GPUs. Higher TP increases available VRAM for context and concurrency. See GPU Allocation for constraints.
4x H100 (320 GB VRAM)
Best for: High-volume RAG, large context windows, production workloads.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 120B (TP=4) | MXFP4 | 128k | 15-20 | Link |
| GPT-OSS 120B (TP=2) | MXFP4 | 32-64k | 8-12 | Link |
| Llama 3.3 70B Instruct | FP8 | 65k | 10-12 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 20-25 | Link |
| Qwen2.5 72B Instruct | Int8 | 65k | 10-12 | Link |
| DeepSeek R1 Distill 70B | FP8 | 32k | 10-12 | Link |
Notes:
- Recommended: GPT-OSS 120B with TP=4 for full 128k context; MoE architecture (117B total, 5.1B active) delivers fast inference
- GPT-OSS 120B with TP=2 leaves 2 GPUs free for Whisper/RAG but reduces max context
- Llama 3.3 70B FP8 is a solid alternative if GPT-OSS compatibility is a concern
- FP8/Int8 preserves quality while halving memory vs FP16; prefer over AWQ when VRAM allows
- Qwen2.5 72B shows faster response times than Llama 3.3 70B in testing
4x L40S (192 GB VRAM)
Best for: Team usage, medium RAG workloads.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 120B (TP=4) | AWQ | 32-64k | 5-8 | Link |
| Llama 3.3 70B Instruct | FP8 | 65k | 5-6 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 10-12 | Link |
| Qwen2.5 72B Instruct | Int8 | 32k | 5-6 | Link |
| Qwen 3 32B | BNB-4bit | 32k | 8-10 | Link |
Notes:
- Recommended: GPT-OSS 120B AWQ with TP=4; requires all 4 GPUs (doesn't fit on single L40S)
- Llama 3.3 70B FP8 is a solid alternative with higher concurrency at 65k context
- Qwen 3 32B requires the /no_think suffix to disable thinking mode
- L40S has less VRAM per GPU (48 GB vs 80 GB), limiting context compared to H100
2x RTX 4090 (48 GB VRAM)
Best for: Single-user, light RAG, development.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 20B | MXFP4 | 64k | 2-3 | Link |
| Llama 3.1 8B Instruct | FP8 | 128k | 1-2 | Link |
| DeepSeek R1 Distill 8B | AWQ | 32k | 1-2 | Link |
| Mistral 7B Instruct v0.3 | FP16 | 32k | 1-2 | Link |
Notes:
- Recommended: GPT-OSS 20B for best quality; MoE model (21B total, 3.6B active) fits in ~16GB with MXFP4, leaving room for 64k context
- 8B models are alternatives when a longer context (128k) matters more than model quality
- DeepSeek R1 requires backend handling of thinking tokens
1x L4 / 24 GB GPU
Best for: Testing, demos, low-volume single user.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 20B | MXFP4 | 32k | 1 | Link |
| Qwen3 4B Instruct | FP16 | 30k | 1 | Link |
| Llama 3.1 8B Instruct | FP8 | 32k | 1 | Link |
| Mistral 7B Instruct v0.3 | FP8 | 16k | 1 | Link |
Notes:
- Recommended: GPT-OSS 20B for best quality on 24GB hardware; MoE with only 3.6B active parameters; fits in 16GB with MXFP4
- Qwen3 4B is an alternative with best throughput (23 tokens/sec) if speed is priority over quality
- Max practical context around 30k tokens
Quick Selection Guide
| Your Situation | Recommended Model |
|---|---|
| Enterprise (4x H100) | GPT-OSS 120B MXFP4, TP=2 or TP=4 |
| Professional (4x L40S) | GPT-OSS 120B AWQ or Llama 3.3 70B FP8 |
| Workstation (2x RTX 4090) | GPT-OSS 20B MXFP4, TP=2 |
| Entry (1x 24GB GPU) | GPT-OSS 20B MXFP4 |
| Faster response times | Qwen2.5 72B Int8 on 4x H100 |
| Minimum hardware, testing | Qwen3 4B Instruct |
| Need reasoning capabilities | DeepSeek R1 Distill (8B or 70B) |
VRAM Requirements Reference
Approximate VRAM per model (inference only, excludes KV cache overhead):
| Model | FP16 | FP8/Int8 | Int4/AWQ/MXFP4 |
|---|---|---|---|
| GPT-OSS 120B (MoE) | ~240 GB | ~120 GB | ~60 GB |
| 70-72B | ~140 GB | ~70 GB | ~35 GB |
| 34B | ~68 GB | ~34 GB | ~18 GB |
| 32B | ~64 GB | ~32 GB | ~16 GB |
| GPT-OSS 20B (MoE) | ~42 GB | ~22 GB | ~16 GB |
| 13B | ~26 GB | ~13 GB | ~7 GB |
| 8B | ~16 GB | ~8 GB | ~4 GB |
| 7B | ~14 GB | ~7 GB | ~4 GB |
| 4B | ~8 GB | ~4 GB | ~2 GB |
Note: GPT-OSS models use Mixture-of-Experts (MoE) architecture with significantly fewer active parameters (3.6B for 20B, 5.1B for 120B), resulting in faster inference than dense models of similar total size.
Context length significantly increases memory usage. Longer contexts require additional VRAM for KV cache.
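The per-model figures in the table follow a simple rule of thumb: total parameters multiplied by bytes per parameter for the chosen quantization. A minimal sketch of that estimate, assuming roughly 1 GB per billion parameters at 1 byte each and ignoring KV cache and activation overhead (the mapping of formats to byte widths is an approximation, not an exact property of each format):

```python
# Approximate bytes per parameter for each quantization family in the table.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "FP8": 1.0,
    "Int8": 1.0,
    "Int4": 0.5,
    "AWQ": 0.5,
    "MXFP4": 0.5,
}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Estimate VRAM for model weights only (no KV cache, no activations)."""
    return params_billion * BYTES_PER_PARAM[quant]
```

This reproduces the table rows: a 70B model at FP16 needs ~140 GB, and GPT-OSS 120B at MXFP4 lands near the ~60 GB shown above.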
Context Size vs Concurrent Users
There is a direct trade-off between context window size and the number of concurrent users:
- KV cache grows with context: Each token in the context requires memory for key-value cache. A 65k context uses roughly 4x the KV cache memory of a 16k context.
- Concurrent requests multiply memory: Each concurrent user needs their own KV cache allocation.
- Practical formula:
Required VRAM ≈ Model weights + (Context size × Concurrent users × KV cache per token)
Example for Llama 3.3 70B FP8 on 4x H100 (320 GB):
| Context | Max Concurrent Users | Use Case |
|---|---|---|
| 65k | 10-12 | Large document RAG |
| 32k | 20-25 | Standard RAG, team usage |
| 16k | 40-50 | Short queries, high concurrency |
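The trade-off in the table can be sketched numerically. The architecture constants below (80 layers, 8 KV heads via GQA, head dimension 128, FP16 KV cache) are assumptions typical of Llama-3-class 70B models, not values taken from this document; the 10% overhead reserve is likewise an illustrative guess.

```python
def kv_cache_bytes_per_token(layers: int = 80, kv_heads: int = 8,
                             head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Per-token KV cache: key + value tensors for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent_users(total_vram_gb: float, weights_gb: float,
                         context_tokens: int, overhead_frac: float = 0.10) -> int:
    """Users that fit after subtracting weights and a safety overhead."""
    usable_gb = total_vram_gb * (1 - overhead_frac) - weights_gb
    per_user_gb = context_tokens * kv_cache_bytes_per_token() / 1024**3
    return int(usable_gb / per_user_gb)
```

With 320 GB total and ~70 GB of FP8 weights, this yields about 10 users at a 65k context and about 21 at 32k, in line with the ranges above.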
Recommendation: For RAG workloads, prioritize context size over concurrent users. A 65k context allows processing larger documents and more retrieved chunks, improving answer quality.
KV cache optimization: Most inference engines support KV cache quantization, which can roughly double concurrent user capacity at a slight quality cost. The estimates above assume default (FP16) KV cache.
GPU Allocation
The basebox stack includes multiple GPU-capable components:
| Component | GPU Usage | Purpose |
|---|---|---|
| LLM Server (vLLM) | High VRAM, latency-sensitive | Text generation, chat |
| Audio Server (Whisper) | Low VRAM (~8GB) | Audio transcription |
| RAG Server | Low-moderate GPU usage | Embedding, reranking, OCR |
Tensor Parallelism Constraints
LLM GPU allocation depends on model tensor parallelism (TP) requirements:
- TP must divide attention heads evenly - e.g., GPT-OSS 120B (64 heads) supports TP=1, 2, 4, 8 but not TP=3
- GPT-OSS 120B: TP=2 minimum recommended; Data Parallelism not supported (see Known Issues)
- 70B models: TP=2 or TP=4 typical
This means not all GPUs may be used for the LLM. Remaining GPUs can run Whisper and RAG.
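The divisibility constraint above can be expressed as a one-line check. This is a simplified sketch: real engines such as vLLM validate additional properties (KV head divisibility, vocabulary sharding), so treat it as illustrative only.

```python
def valid_tp_sizes(num_attention_heads: int, available_gpus: int) -> list:
    """TP sizes that divide the attention heads evenly and fit the GPU count."""
    return [tp for tp in range(1, available_gpus + 1)
            if num_attention_heads % tp == 0]
```

For GPT-OSS 120B (64 attention heads) on a 4-GPU node this gives TP = 1, 2, or 4; TP = 3 is excluded because 64 is not divisible by 3.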
Example: 4x H100 with GPT-OSS 120B
| GPUs | Service | Notes |
|---|---|---|
| 0, 1 | LLM (TP=2) | GPT-OSS 120B with tensor parallel |
| 2 | Whisper + RAG | Shared GPU for auxiliary services |
| 3 | Unused | Available for future scaling |
Default: LLM + Whisper Shared GPU
When the LLM uses all available GPUs, Whisper shares one GPU with this memory split:
| Service | GPU Memory | ~Usage (80GB GPU) |
|---|---|---|
| LLM | 85% | 68 GB |
| Whisper | 10% | 8 GB |
| Buffer | 5% | 4 GB |
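The split in the table generalizes to any GPU size. A minimal sketch, assuming the same 85/10/5 fractions; the LLM fraction roughly corresponds to an inference-engine memory cap such as vLLM's gpu_memory_utilization setting, though the exact mapping depends on deployment:

```python
# Default memory split on a shared LLM + Whisper GPU (fractions from the table).
SPLIT = {"llm": 0.85, "whisper": 0.10, "buffer": 0.05}

def shared_gpu_split_gb(gpu_vram_gb: float) -> dict:
    """Translate the fractional split into GB for a given GPU size."""
    return {service: round(gpu_vram_gb * frac, 1) for service, frac in SPLIT.items()}
```

On an 80 GB H100 this reproduces the 68 / 8 / 4 GB figures above; on a 48 GB L40S the LLM share shrinks to ~40.8 GB.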
When to Use GPU for RAG
RAG runs on CPU by default but supports GPU acceleration. Enable GPU mode for RAG if:
- OCR is required for scanned PDFs or image files
- Reranking performance is critical and CPU is too slow
- Spare GPUs are available due to TP constraints (as in the 4x H100 example above)
Known Issues
- Llama 3.3 70B AWQ: May show high perplexity; tune repetition penalty and temperature
- DeepSeek R1 models: Backend must filter thinking tokens from responses
- Qwen 3 32B: Only BNB-4bit works with vLLM; append /no_think to user prompts
- GPT-OSS 120B: Does not work reliably with Data Parallelism (produces garbled output); use Tensor Parallelism instead
- GPT-OSS models: Require vLLM 0.10.1+; reasoning parser must be configured for proper reasoning token handling
- GGUF models: vLLM has experimental GGUF support but with limitations (single-file only, may be incompatible with some features); prefer native quantization formats (FP8, AWQ, GPTQ) for production