
LLM Recommendations for On-Premises Deployment

Quick reference for selecting models based on available GPU hardware.

Methodology

Disclaimer: This list does not replace testing and verification on the target hardware.

System Requirements

The inference engine (vLLM) requires:

  • NVIDIA GPU with compute capability 7.0+ (Volta architecture or newer)
  • NVIDIA Driver (requirement based on CUDA 12.9 compatibility):
      • Linux: version 575.51.03 or newer
      • Windows: version 576.02 or newer

Hardware Tiers

| Tier | Example GPUs | Total VRAM | Typical Use |
|------|--------------|------------|-------------|
| Enterprise | 4x H100 | 320 GB | High-volume RAG, multiple concurrent users |
| Professional | 4x L40S | 192 GB | Medium RAG workloads, team usage |
| Workstation | 2x RTX 4090 | 48 GB | Light RAG, single-user scenarios |
| Entry | 1x L4 or 1x 24 GB GPU | 24 GB | Testing, low-volume usage |

TP (Tensor Parallelism): Shards each layer's weight matrices across multiple GPUs, so every GPU cooperates on every token. TP=2 uses 2 GPUs, TP=4 uses 4 GPUs. Higher TP pools more VRAM for weights, context, and concurrency. See GPU Allocation for constraints.

4x H100 (320 GB VRAM)

Best for: High-volume RAG, large context windows, production workloads.

| Model | Quantization | Context | Concurrent Users | HuggingFace |
|-------|--------------|---------|------------------|-------------|
| GPT-OSS 120B (TP=4) | MXFP4 | 128k | 15-20 | Link |
| GPT-OSS 120B (TP=2) | MXFP4 | 32-64k | 8-12 | Link |
| Llama 3.3 70B Instruct | FP8 | 65k | 10-12 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 20-25 | Link |
| Qwen2.5 72B Instruct | Int8 | 65k | 10-12 | Link |
| DeepSeek R1 Distill 70B | FP8 | 32k | 10-12 | Link |

Notes:

  • Recommended: GPT-OSS 120B with TP=4 for full 128k context; MoE architecture (117B total, 5.1B active) delivers fast inference
  • GPT-OSS 120B with TP=2 leaves 2 GPUs free for Whisper/RAG but reduces max context
  • Llama 3.3 70B FP8 is a solid alternative if GPT-OSS compatibility is a concern
  • FP8/Int8 preserves quality while halving memory vs FP16; prefer over AWQ when VRAM allows
  • Qwen2.5 72B shows faster response times than Llama 3.3 70B in testing

4x L40S (192 GB VRAM)

Best for: Team usage, medium RAG workloads.

| Model | Quantization | Context | Concurrent Users | HuggingFace |
|-------|--------------|---------|------------------|-------------|
| GPT-OSS 120B (TP=4) | AWQ | 32-64k | 5-8 | Link |
| Llama 3.3 70B Instruct | FP8 | 65k | 5-6 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 10-12 | Link |
| Qwen2.5 72B Instruct | Int8 | 32k | 5-6 | Link |
| Qwen 3 32B | BNB-4bit | 32k | 8-10 | Link |

Notes:

  • Recommended: GPT-OSS 120B AWQ with TP=4; requires all 4 GPUs (doesn't fit on single L40S)
  • Llama 3.3 70B FP8 is a solid alternative with higher concurrency at 65k context
  • Qwen 3 32B requires /no_think suffix to disable thinking mode
  • L40S has less VRAM per GPU (48GB vs 80GB), limiting context compared to H100

2x RTX 4090 (48 GB VRAM)

Best for: Single-user, light RAG, development.

| Model | Quantization | Context | Concurrent Users | HuggingFace |
|-------|--------------|---------|------------------|-------------|
| GPT-OSS 20B | MXFP4 | 64k | 2-3 | Link |
| Llama 3.1 8B Instruct | FP8 | 128k | 1-2 | Link |
| DeepSeek R1 Distill 8B | AWQ | 32k | 1-2 | Link |
| Mistral 7B Instruct v0.3 | FP16 | 32k | 1-2 | Link |

Notes:

  • Recommended: GPT-OSS 20B for best quality; MoE model (21B total, 3.6B active) fits in ~16GB with MXFP4, leaving room for 64k context
  • 8B models are alternatives if longer context (128k) is needed over quality
  • DeepSeek R1 requires backend handling of thinking tokens

1x L4 / 24 GB GPU

Best for: Testing, demos, low-volume single user.

| Model | Quantization | Context | Concurrent Users | HuggingFace |
|-------|--------------|---------|------------------|-------------|
| GPT-OSS 20B | MXFP4 | 32k | 1 | Link |
| Qwen3 4B Instruct | FP16 | 30k | 1 | Link |
| Llama 3.1 8B Instruct | FP8 | 32k | 1 | Link |
| Mistral 7B Instruct v0.3 | FP8 | 16k | 1 | Link |

Notes:

  • Recommended: GPT-OSS 20B for best quality on 24GB hardware; MoE with only 3.6B active parameters; fits in 16GB with MXFP4
  • Qwen3 4B is an alternative with best throughput (23 tokens/sec) if speed is priority over quality
  • Max practical context around 30k tokens

Quick Selection Guide

| Your Situation | Recommended Model |
|----------------|-------------------|
| Enterprise (4x H100) | GPT-OSS 120B MXFP4, TP=2 or TP=4 |
| Professional (4x L40S) | GPT-OSS 120B AWQ or Llama 3.3 70B FP8 |
| Workstation (2x RTX 4090) | GPT-OSS 20B MXFP4, TP=2 |
| Entry (1x 24 GB GPU) | GPT-OSS 20B MXFP4 |
| Faster response times | Qwen2.5 72B Int8 on 4x H100 |
| Minimum hardware, testing | Qwen3 4B Instruct |
| Need reasoning capabilities | DeepSeek R1 Distill (8B or 70B) |

VRAM Requirements Reference

Approximate VRAM per model (inference only, excludes KV cache overhead):

| Model | FP16 | FP8/Int8 | Int4/AWQ/MXFP4 |
|-------|------|----------|----------------|
| GPT-OSS 120B (MoE) | ~240 GB | ~120 GB | ~60 GB |
| 70-72B | ~140 GB | ~70 GB | ~35 GB |
| 34B | ~68 GB | ~34 GB | ~18 GB |
| 32B | ~64 GB | ~32 GB | ~16 GB |
| GPT-OSS 20B (MoE) | ~42 GB | ~22 GB | ~16 GB |
| 13B | ~26 GB | ~13 GB | ~7 GB |
| 8B | ~16 GB | ~8 GB | ~4 GB |
| 7B | ~14 GB | ~7 GB | ~4 GB |
| 4B | ~8 GB | ~4 GB | ~2 GB |

Note: GPT-OSS models use Mixture-of-Experts (MoE) architecture with significantly fewer active parameters (3.6B for 20B, 5.1B for 120B), resulting in faster inference than dense models of similar total size.

Context length significantly increases memory usage. Longer contexts require additional VRAM for KV cache.
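The per-class figures above follow a simple rule of thumb: weight VRAM ≈ parameter count × bytes per weight (2 bytes at FP16, 1 at FP8/Int8, roughly 0.5 at 4-bit), plus some runtime overhead. A minimal sketch of that arithmetic; the helper name is illustrative, not part of any library:

```python
def estimate_weight_vram(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (excludes KV cache and runtime overhead)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight  # 1e9 params x N bytes ~= N GB per billion

# A dense 70B model at FP16, FP8, and 4-bit quantization:
print(estimate_weight_vram(70, 16))  # ~140 GB
print(estimate_weight_vram(70, 8))   # ~70 GB
print(estimate_weight_vram(70, 4))   # ~35 GB
```

Real deployments land somewhat above these figures once tokenizer, activations, and engine buffers are loaded, which is why the table rounds up in places.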

Context Size vs Concurrent Users

There is a direct trade-off between context window size and the number of concurrent users:

  • KV cache grows with context: Each token in the context requires memory for key-value cache. A 65k context uses roughly 4x the KV cache memory of a 16k context.
  • Concurrent requests multiply memory: Each concurrent user needs their own KV cache allocation.
  • Practical formula: Required VRAM ≈ Model weights + (Context size × Concurrent users × KV cache per token)
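
The formula can be made concrete for the Llama 3.3 70B example below. The per-token KV figure is an assumption derived from the model's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 KV cache ≈ 0.31 MB/token); treat the output as a rough estimate, not a guarantee:

```python
def max_concurrent_users(total_vram_gb: float, weights_gb: float,
                         context_tokens: int, kv_bytes_per_token: int) -> int:
    """Users that fit once weights are loaded, assuming each user holds a full-context KV cache."""
    available_bytes = (total_vram_gb - weights_gb) * 1e9
    return int(available_bytes // (context_tokens * kv_bytes_per_token))

# Llama 3.3 70B FP8 weights (~70 GB) on 4x H100 (320 GB), FP16 KV cache:
# 80 layers x 2 (K and V) x 8 KV heads x 128 head dim x 2 bytes = 327,680 bytes/token
KV_BYTES = 80 * 2 * 8 * 128 * 2
for ctx in (65_536, 32_768, 16_384):
    print(ctx, max_concurrent_users(320, 70, ctx, KV_BYTES))
```

The results (roughly 11, 23, and 46 users) line up with the ranges in the example table; real capacity is lower when requests are not all at full context, which is why vLLM-style paged KV caching often does better than this worst-case bound.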

Example for Llama 3.3 70B FP8 on 4x H100 (320 GB):

| Context | Max Concurrent Users | Use Case |
|---------|----------------------|----------|
| 65k | 10-12 | Large document RAG |
| 32k | 20-25 | Standard RAG, team usage |
| 16k | 40-50 | Short queries, high concurrency |

Recommendation: For RAG workloads, prioritize context size over concurrent users. A 65k context allows processing larger documents and more retrieved chunks, improving answer quality.

KV cache optimization: Most inference engines support KV cache quantization, which can roughly double concurrent user capacity at a slight quality cost. The estimates above assume default (FP16) KV cache.
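
The roughly-2x gain follows directly from the practical formula above: halving the bytes per KV entry (FP16 to FP8) halves each user's cache, so about twice as many users fit. vLLM exposes this via its `kv-cache-dtype` option (e.g. `fp8`); the numbers below are illustrative, reusing the 70B-on-4x-H100 assumptions:

```python
# Illustrative: 250 GB free after weights, 65k context, ~0.31 MB/token KV at FP16.
FREE_VRAM_BYTES = 250e9
CTX = 65_536
KV_FP16 = 327_680           # bytes per token with FP16 KV cache
KV_FP8 = KV_FP16 // 2       # FP8 KV cache halves the per-token footprint

users_fp16 = int(FREE_VRAM_BYTES // (CTX * KV_FP16))
users_fp8 = int(FREE_VRAM_BYTES // (CTX * KV_FP8))
print(users_fp16, users_fp8)  # FP8 KV cache fits roughly 2x the users
```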


GPU Allocation

The basebox stack includes multiple GPU-capable components:

| Component | GPU Usage | Purpose |
|-----------|-----------|---------|
| LLM Server (vLLM) | High VRAM, latency-sensitive | Text generation, chat |
| Audio Server (Whisper) | Low VRAM (~8 GB) | Audio transcription |
| RAG Server | Low-moderate GPU usage | Embedding, reranking, OCR |

Tensor Parallelism Constraints

LLM GPU allocation depends on model tensor parallelism (TP) requirements:

  • TP must divide attention heads evenly - e.g., GPT-OSS 120B (64 heads) supports TP=1, 2, 4, 8 but not TP=3
  • GPT-OSS 120B: TP=2 minimum recommended; Data Parallelism not supported (see Known Issues)
  • 70B models: TP=2 or TP=4 typical

This means not all GPUs may be used for the LLM. Remaining GPUs can run Whisper and RAG.
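
The divisibility constraint can be checked before launching the engine. A minimal sketch; the head count for GPT-OSS 120B (64 attention heads) is taken from the example above:

```python
def valid_tp_degrees(num_attention_heads: int, num_gpus: int) -> list[int]:
    """TP degrees that divide the attention head count evenly and fit the GPU count."""
    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]

# GPT-OSS 120B (64 heads) on a 4-GPU node:
print(valid_tp_degrees(64, 4))  # [1, 2, 4] -- TP=3 is excluded because 64 % 3 != 0
```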

Example: 4x H100 with GPT-OSS 120B

| GPUs | Service | Notes |
|------|---------|-------|
| 0, 1 | LLM (TP=2) | GPT-OSS 120B with tensor parallel |
| 2 | Whisper + RAG | Shared GPU for auxiliary services |
| 3 | Unused | Available for future scaling |

Default: LLM + Whisper Shared GPU

When LLM uses all available GPUs, Whisper shares one GPU with this memory split:

| Service | GPU Memory Share | ~Usage (80 GB GPU) |
|---------|------------------|--------------------|
| LLM | 85% | 68 GB |
| Whisper | 10% | 8 GB |
| Buffer | 5% | 4 GB |
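
The split is expressed as fractions of the card's VRAM; vLLM caps its own share via the `gpu-memory-utilization` option (0.85 here), while Whisper's share depends on how its runtime is configured. A sketch of the arithmetic, with the 80 GB figure assumed from the H100 example:

```python
GPU_VRAM_GB = 80
shares = {"LLM": 0.85, "Whisper": 0.10, "Buffer": 0.05}

for service, fraction in shares.items():
    print(f"{service}: {fraction * GPU_VRAM_GB:.0f} GB")

# Shares must not oversubscribe the GPU, or one service will hit CUDA OOM at load time.
assert abs(sum(shares.values()) - 1.0) < 1e-9
```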

When to Use GPU for RAG

RAG runs on CPU by default but supports GPU acceleration. Enable GPU mode for RAG if:

  1. OCR is required for scanned PDFs or image files
  2. Reranking performance is critical and CPU is too slow
  3. Spare GPUs are available due to TP constraints (as in the 4x H100 example above)

Known Issues

  • Llama 3.3 70B AWQ: May show high perplexity; tune repetition penalty and temperature
  • DeepSeek R1 models: Backend must filter thinking tokens from responses
  • Qwen 3 32B: Only BNB-4bit works with vLLM; append /no_think to user prompts
  • GPT-OSS 120B: Does not work reliably with Data Parallelism (produces garbled output); use Tensor Parallelism instead
  • GPT-OSS models: Require vLLM 0.10.1+; reasoning parser must be configured for proper reasoning token handling
  • GGUF models: vLLM has experimental GGUF support but with limitations (single-file only, may be incompatible with some features); prefer native quantization formats (FP8, AWQ, GPTQ) for production
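
For the DeepSeek R1 note above, the backend filter can be a small post-processing step. A minimal sketch assuming the model wraps its reasoning in `<think>...</think>` tags (the convention DeepSeek R1 models use); adapt the pattern to whatever your deployment actually emits:

```python
import re

# Non-greedy match so multiple reasoning spans in one response are each removed.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(response: str) -> str:
    """Remove <think>...</think> reasoning spans so only the final answer reaches the client."""
    return THINK_BLOCK.sub("", response).strip()

raw = "<think>The user asks for 2+2. That is 4.</think>The answer is 4."
print(strip_thinking(raw))  # -> The answer is 4.
```

Note that when reasoning tokens are streamed, the closing tag may arrive in a later chunk, so a streaming backend needs to buffer until `</think>` is seen rather than applying this regex per chunk.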