Licensed for use in conjunction with basebox only.
LLM Recommendations for On-Premises Deployment
Quick reference for selecting models based on available GPU hardware.
Methodology
Disclaimer: This list does not replace testing and verification on the target hardware.
System Requirements
The inference engine (vLLM) requires:
- NVIDIA GPU with compute capability 7.0+ (Volta architecture or newer)
- NVIDIA Driver (based on CUDA 12.9 compatibility):
  - Linux: version 575.51.03 or newer
  - Windows: version 576.02 or newer
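The driver minimums above can be checked programmatically. The sketch below is an illustration only, assuming driver versions are available as dotted strings (in practice you would obtain them from `nvidia-smi` or the NVML bindings); the function names are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Split a driver version string like '575.51.03' into comparable integers."""
    return tuple(int(part) for part in version.split("."))

# Minimum driver versions from the requirements above (CUDA 12.9 baseline).
MIN_DRIVER = {"linux": "575.51.03", "windows": "576.02"}

def meets_driver_requirement(installed: str, platform: str) -> bool:
    """True if the installed NVIDIA driver satisfies the platform minimum.

    Relies on Python's element-wise tuple comparison, so '576.02' is
    correctly treated as newer than '575.51.03'.
    """
    return parse_version(installed) >= parse_version(MIN_DRIVER[platform])
```

For example, a Linux host on driver 570.x fails the check, while 576.x passes on either platform.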
Hardware Tiers
| Tier | Example GPUs | Total VRAM | Typical Use |
|---|---|---|---|
| Enterprise | 4x H100 | 320 GB | High-volume RAG, multiple concurrent users |
| Professional | 4x L40S | 192 GB | Medium RAG workloads, team usage |
| Workstation | 2x RTX 4090 | 48 GB | Light RAG, single-user scenarios |
| Entry | 1x L4 or 1x 24GB GPU | 24 GB | Testing, low-volume usage |
Recommended Models by Hardware
TP (Tensor Parallelism): Splits model layers across multiple GPUs. TP=2 uses 2 GPUs, TP=4 uses 4 GPUs. Higher TP increases available VRAM for context and concurrency. See GPU Allocation for constraints.
4x H100 (320 GB VRAM)
Best for: High-volume RAG, large context windows, production workloads.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 120B (TP=4) | MXFP4 | 128k | 15-20 | Link |
| GPT-OSS 120B (TP=2) | MXFP4 | 32-64k | 8-12 | Link |
| Llama 3.3 70B Instruct | FP8 | 65k | 10-12 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 20-25 | Link |
| Qwen2.5 72B Instruct | Int8 | 65k | 10-12 | Link |
| DeepSeek R1 Distill 70B | FP8 | 32k | 10-12 | Link |
Notes:
- Recommended: GPT-OSS 120B with TP=4 for full 128k context; MoE architecture (117B total, 5.1B active) delivers fast inference
- GPT-OSS 120B with TP=2 leaves 2 GPUs free for Whisper/RAG but reduces max context
- Llama 3.3 70B FP8 is a solid alternative if GPT-OSS compatibility is a concern
- FP8/Int8 preserves quality while halving memory vs FP16; prefer over AWQ when VRAM allows
- Qwen2.5 72B shows faster response times than Llama 3.3 70B in testing
4x L40S (192 GB VRAM)
Best for: Team usage, medium RAG workloads.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 120B (TP=4) | AWQ | 32-64k | 5-8 | Link |
| Llama 3.3 70B Instruct | FP8 | 65k | 5-6 | Link |
| Llama 3.3 70B Instruct | FP8 | 32k | 10-12 | Link |
| Qwen2.5 72B Instruct | Int8 | 32k | 5-6 | Link |
| Qwen 3 32B | BNB-4bit | 32k | 8-10 | Link |
Notes:
- Recommended: GPT-OSS 120B AWQ with TP=4; requires all 4 GPUs (doesn't fit on single L40S)
- Llama 3.3 70B FP8 is a solid alternative with higher concurrency at 65k context
- Qwen 3 32B requires the /no_think suffix to disable thinking mode
- L40S has less VRAM per GPU (48 GB vs 80 GB), limiting context compared to H100
2x RTX 4090 (48 GB VRAM)
Best for: Single-user, light RAG, development.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 20B | MXFP4 | 64k | 2-3 | Link |
| Llama 3.1 8B Instruct | FP8 | 128k | 1-2 | Link |
| DeepSeek R1 Distill 8B | AWQ | 32k | 1-2 | Link |
| Mistral 7B Instruct v0.3 | FP16 | 32k | 1-2 | Link |
Notes:
- Recommended: GPT-OSS 20B for best quality; MoE model (21B total, 3.6B active) fits in ~16GB with MXFP4, leaving room for 64k context
- 8B models are alternatives when a longer context (128k) matters more than model quality
- DeepSeek R1 requires backend handling of thinking tokens
1x L4 / 24 GB GPU
Best for: Testing, demos, low-volume single user.
| Model | Quantization | Context | Concurrent Users | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 20B | MXFP4 | 32k | 1 | Link |
| Qwen3 4B Instruct | FP16 | 30k | 1 | Link |
| Llama 3.1 8B Instruct | FP8 | 32k | 1 | Link |
| Mistral 7B Instruct v0.3 | FP8 | 16k | 1 | Link |
Notes:
- Recommended: GPT-OSS 20B for best quality on 24GB hardware; MoE with only 3.6B active parameters; fits in 16GB with MXFP4
- Qwen3 4B is an alternative with best throughput (23 tokens/sec) if speed is priority over quality
- Max practical context around 30k tokens
Quick Selection Guide
| Your Situation | Recommended Model |
|---|---|
| Enterprise (4x H100) | GPT-OSS 120B MXFP4, TP=2 or TP=4 |
| Professional (4x L40S) | GPT-OSS 120B AWQ or Llama 3.3 70B FP8 |
| Workstation (2x RTX 4090) | GPT-OSS 20B MXFP4, TP=2 |
| Entry (1x 24GB GPU) | GPT-OSS 20B MXFP4 |
| Faster response times | Qwen2.5 72B Int8 on 4x H100 |
| Minimum hardware, testing | Qwen3 4B Instruct |
| Need reasoning capabilities | DeepSeek R1 Distill (8B or 70B) |
VRAM Requirements Reference
Approximate VRAM per model (inference only, excludes KV cache overhead):
| Model | FP16 | FP8/Int8 | Int4/AWQ/MXFP4 |
|---|---|---|---|
| GPT-OSS 120B (MoE) | ~240 GB | ~120 GB | ~60 GB |
| 70-72B | ~140 GB | ~70 GB | ~35 GB |
| 34B | ~68 GB | ~34 GB | ~18 GB |
| 32B | ~64 GB | ~32 GB | ~16 GB |
| GPT-OSS 20B (MoE) | ~42 GB | ~22 GB | ~16 GB |
| 13B | ~26 GB | ~13 GB | ~7 GB |
| 8B | ~16 GB | ~8 GB | ~4 GB |
| 7B | ~14 GB | ~7 GB | ~4 GB |
| 4B | ~8 GB | ~4 GB | ~2 GB |
Note: GPT-OSS models use Mixture-of-Experts (MoE) architecture with significantly fewer active parameters (3.6B for 20B, 5.1B for 120B), resulting in faster inference than dense models of similar total size.
Context length significantly increases memory usage. Longer contexts require additional VRAM for KV cache.
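The per-model figures in the table follow a simple rule of thumb: total parameters multiplied by bytes per parameter for the chosen quantization. A minimal sketch of that estimate, assuming roughly 1 GB per billion parameters at 1 byte each and ignoring KV cache and activation overhead (the mapping of formats to byte widths is an approximation, not an exact property of each format):

```python
# Approximate bytes per parameter for each quantization family in the table.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "FP8": 1.0,
    "Int8": 1.0,
    "Int4": 0.5,
    "AWQ": 0.5,
    "MXFP4": 0.5,
}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Estimate VRAM for model weights only (no KV cache, no activations)."""
    return params_billion * BYTES_PER_PARAM[quant]
```

This reproduces the table rows: a 70B model at FP16 needs ~140 GB, and GPT-OSS 120B at MXFP4 lands near the ~60 GB shown above.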
Context Size vs Concurrent Users
There is a direct trade-off between context window size and the number of concurrent users:
- KV cache grows with context: Each token in the context requires memory for key-value cache. A 65k context uses roughly 4x the KV cache memory of a 16k context.
- Concurrent requests multiply memory: Each concurrent user needs their own KV cache allocation.
- Practical formula:
Required VRAM ≈ Model weights + (Context size × Concurrent users × KV cache per token)
Example for Llama 3.3 70B FP8 on 4x H100 (320 GB):
| Context | Max Concurrent Users | Use Case |
|---|---|---|
| 65k | 10-12 | Large document RAG |
| 32k | 20-25 | Standard RAG, team usage |
| 16k | 40-50 | Short queries, high concurrency |
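The trade-off in the table can be sketched numerically. The architecture constants below (80 layers, 8 KV heads via GQA, head dimension 128, FP16 KV cache) are assumptions typical of Llama-3-class 70B models, not values taken from this document; the 10% overhead reserve is likewise an illustrative guess.

```python
def kv_cache_bytes_per_token(layers: int = 80, kv_heads: int = 8,
                             head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Per-token KV cache: key + value tensors for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_concurrent_users(total_vram_gb: float, weights_gb: float,
                         context_tokens: int, overhead_frac: float = 0.10) -> int:
    """Users that fit after subtracting weights and a safety overhead."""
    usable_gb = total_vram_gb * (1 - overhead_frac) - weights_gb
    per_user_gb = context_tokens * kv_cache_bytes_per_token() / 1024**3
    return int(usable_gb / per_user_gb)
```

With 320 GB total and ~70 GB of FP8 weights, this yields about 10 users at a 65k context and about 21 at 32k, in line with the ranges above.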
Recommendation: For RAG workloads, prioritize context size over concurrent users. A 65k context allows processing larger documents and more retrieved chunks, improving answer quality.
KV cache optimization: Most inference engines support KV cache quantization, which can roughly double concurrent user capacity at a slight quality cost. The estimates above assume default (FP16) KV cache.
GPU Allocation
The basebox stack includes multiple GPU-capable components:
| Component | GPU Usage | Purpose |
|---|---|---|
| LLM Server (vLLM) | High VRAM, latency-sensitive | Text generation, chat |
| Audio Server (Whisper) | Low VRAM (~8GB) | Audio transcription |
| RAG Server | Low-moderate GPU usage | Embedding, reranking, OCR |
Tensor Parallelism Constraints
LLM GPU allocation depends on model tensor parallelism (TP) requirements:
- TP must divide attention heads evenly - e.g., GPT-OSS 120B (64 heads) supports TP=1, 2, 4, 8 but not TP=3
- GPT-OSS 120B: TP=2 minimum recommended; Data Parallelism not supported (see Known Issues)
- 70B models: TP=2 or TP=4 typical
This means not all GPUs may be used for the LLM. Remaining GPUs can run Whisper and RAG.
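The divisibility constraint above can be expressed as a one-line check. This is a simplified sketch: real engines such as vLLM validate additional properties (KV head divisibility, vocabulary sharding), so treat it as illustrative only.

```python
def valid_tp_sizes(num_attention_heads: int, available_gpus: int) -> list:
    """TP sizes that divide the attention heads evenly and fit the GPU count."""
    return [tp for tp in range(1, available_gpus + 1)
            if num_attention_heads % tp == 0]
```

For GPT-OSS 120B (64 attention heads) on a 4-GPU node this gives TP = 1, 2, or 4; TP = 3 is excluded because 64 is not divisible by 3.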
Example: 4x H100 with GPT-OSS 120B
| GPUs | Service | Notes |
|---|---|---|
| 0, 1 | LLM (TP=2) | GPT-OSS 120B with tensor parallel |
| 2 | Whisper + RAG | Shared GPU for auxiliary services |
| 3 | Unused | Available for future scaling |
Default: LLM + Whisper Shared GPU
When the LLM uses all available GPUs, Whisper shares one GPU with this memory split:
| Service | GPU Memory | ~Usage (80GB GPU) |
|---|---|---|
| LLM | 85% | 68 GB |
| Whisper | 10% | 8 GB |
| Buffer | 5% | 4 GB |
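The split in the table generalizes to any GPU size. A minimal sketch, assuming the same 85/10/5 fractions; the LLM fraction roughly corresponds to an inference-engine memory cap such as vLLM's gpu_memory_utilization setting, though the exact mapping depends on deployment:

```python
# Default memory split on a shared LLM + Whisper GPU (fractions from the table).
SPLIT = {"llm": 0.85, "whisper": 0.10, "buffer": 0.05}

def shared_gpu_split_gb(gpu_vram_gb: float) -> dict:
    """Translate the fractional split into GB for a given GPU size."""
    return {service: round(gpu_vram_gb * frac, 1) for service, frac in SPLIT.items()}
```

On an 80 GB H100 this reproduces the 68 / 8 / 4 GB figures above; on a 48 GB L40S the LLM share shrinks to ~40.8 GB.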
When to Use GPU for RAG
RAG runs on CPU by default but supports GPU acceleration. Enable GPU mode for RAG if:
- OCR is required for scanned PDFs or image files
- Reranking performance is critical and CPU is too slow
- Spare GPUs are available due to TP constraints (as in the 4x H100 example above)
Known Issues
- Llama 3.3 70B AWQ: May show high perplexity; tune repetition penalty and temperature
- DeepSeek R1 models: Backend must filter thinking tokens from responses
- Qwen 3 32B: Only BNB-4bit works with vLLM; append /no_think to user prompts
- GPT-OSS 120B: Does not work reliably with Data Parallelism (produces garbled output); use Tensor Parallelism instead
- GPT-OSS models: Require vLLM 0.10.1+; reasoning parser must be configured for proper reasoning token handling
- GGUF models: vLLM has experimental GGUF support but with limitations (single-file only, may be incompatible with some features); prefer native quantization formats (FP8, AWQ, GPTQ) for production