# Private Cloud / On-Prem Installation Using Open-Weight AI Models

You can also power Tabnine with supported open-weight models installed on-premises or in one of the private clouds mentioned above.

### **Models**

For Self-Hosted (SH) customers, hardware needs depend on whether you already have one of the supported open-weight models in your infrastructure.

<table><thead><tr><th width="58.712890625"></th><th width="383.9716796875">Tabnine-Supported Open-Weight Models</th></tr></thead><tbody><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-2-123B-Instruct-2512**</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FhoC55ogj5zmtKzmt4OqX%2Fminimax-color.png?alt=media&#x26;token=29fe9411-a413-4c87-9766-7b8ec518d133" alt=""></td><td>MiniMax-M2.5</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FpnWd1QQl5f68JN0eCLnq%2FScreenshot%202026-01-07%20at%2012.33.37.png?alt=media&#x26;token=77bc312c-7f66-41e1-b4b7-0388c8cc17d3" alt=""></td><td>GLM-4.7</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-30B <strong>(Chat only)</strong></td></tr></tbody></table>

If you don't already have one of these models, we will install one of the following models on-premises for you:

<table><thead><tr><th width="58.712890625"></th><th width="383.9716796875">Open-Weight Models that Tabnine Offers to Install On-Prem</th></tr></thead><tbody><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td><a href="#devstral-small-2-24b-instruct-2512">Devstral-Small-2-24B-Instruct-2512</a></td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td><a href="#devstral-2-123b-instruct-2512">Devstral-2-123B-Instruct-2512</a>**</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FhoC55ogj5zmtKzmt4OqX%2Fminimax-color.png?alt=media&#x26;token=29fe9411-a413-4c87-9766-7b8ec518d133" alt=""></td><td><a href="#minimax-m2.5">MiniMax-M2.5</a></td></tr></tbody></table>

{% hint style="info" %}
\*\*Devstral 2 (123B parameters) operates under a modified MIT license. If your organization's global consolidated monthly revenue exceeds $20 million, using this model requires Devstral's permission.
{% endhint %}

#### **Hardware Requirements**

Installation requirements are sized to ensure users have an optimal experience with Tabnine, and they differ between agentic workflows (Agent + Chat) and Chat only.

**Agent + Chat**

<table><thead><tr><th width="58.712890625"></th><th width="173.615234375">Agent + Chat</th><th>≤100 Users — Recommended</th><th>≤100 Users — Minimal</th><th>101-500 Users — Recommended</th><th>101-500 Users — Minimal</th><th>501-1000 Users — Recommended</th><th>501-1000 Users — Minimal</th><th>1001-2000 Users — Recommended</th><th>1001-2000 Users — Minimal</th></tr></thead><tbody><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>3 H100</td><td>4 B200</td><td>6 H100</td><td>8 B200</td><td>12 H100</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-2-123B-Instruct-2512</td><td>4 B200</td><td>4 H100</td><td>8 B200</td><td>8 H100</td><td>16 B200</td><td>8 B200</td><td>24 B200</td><td>16 B200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FhoC55ogj5zmtKzmt4OqX%2Fminimax-color.png?alt=media&#x26;token=29fe9411-a413-4c87-9766-7b8ec518d133" alt=""></td><td>MiniMax-M2.5</td><td>2 B200</td><td>2 H200</td><td>4 B200</td><td>4 H200</td><td>8 B200</td><td>8 H200</td><td>16 B200</td><td>16 H200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FpnWd1QQl5f68JN0eCLnq%2FScreenshot%202026-01-07%20at%2012.33.37.png?alt=media&#x26;token=77bc312c-7f66-41e1-b4b7-0388c8cc17d3" alt=""></td><td>GLM-4.7</td><td>2 B200</td><td>8 H100</td><td>4 B200</td><td>2 B200</td><td>8 B200</td><td>4 B200</td><td>16 B200</td><td>8 B200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td><td>2 B200</td><td>8 H100</td><td>4 B200</td><td>2 B200</td><td>8 B200</td><td>4 B200</td><td>16 B200</td><td>8 B200</td></tr></tbody></table>

**Chat Only**

<table><thead><tr><th width="58.9326171875"></th><th width="175.763671875">Chat Only</th><th>≤100 Users — Recommended</th><th>≤100 Users — Minimal</th><th>101-500 Users — Recommended</th><th>101-500 Users — Minimal</th><th>501-1000 Users — Recommended</th><th>501-1000 Users — Minimal</th><th>1001-2000 Users — Recommended</th><th>1001-2000 Users — Minimal</th></tr></thead><tbody><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>4 H100</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-2-123B-Instruct-2512</td><td>2 B200</td><td>4 H100</td><td>2 B200</td><td>4 H100</td><td>4 B200</td><td>8 H100</td><td>8 B200</td><td>16 H100</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FhoC55ogj5zmtKzmt4OqX%2Fminimax-color.png?alt=media&#x26;token=29fe9411-a413-4c87-9766-7b8ec518d133" alt=""></td><td>MiniMax-M2.5</td><td>2 B200</td><td>2 H 200</td><td>2 B200</td><td><p>2 H200 /</p><p>4 H100</p></td><td>2 B200</td><td><p>4 H200 /</p><p>8 H100</p></td><td>3 B200</td><td>8 H200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FpnWd1QQl5f68JN0eCLnq%2FScreenshot%202026-01-07%20at%2012.33.37.png?alt=media&#x26;token=77bc312c-7f66-41e1-b4b7-0388c8cc17d3" alt=""></td><td>GLM-4.7</td><td>2 B200</td><td>8 H100</td><td>2 B200</td><td>2 B200</td><td>4 B200</td><td>2 B200</td><td>6 B200</td><td>4 B200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td><td>2 B200</td><td>8 H100</td><td>2 B200</td><td>8 H100</td><td>4 B200</td><td>4 B200</td><td>8 B200</td><td>8 B200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-30B</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td></tr></tbody></table>

**GPU Availability by Cloud Provider**

<table><thead><tr><th width="96.6875">GPU</th><th>AWS</th><th>Azure</th><th>GCP</th></tr></thead><tbody><tr><td>H100</td><td>p5.4xlarge (H100 80GB)</td><td>NC40ads_H100_v5 (H100 94GB)</td><td>a3-highgpu-1g (H100 80GB)</td></tr><tr><td>H200</td><td>p5en.48xlarge (8×H200 141GB)</td><td>ND96isr_H200_v5 (8×H200 141GB)</td><td>a3-ultragpu-8g (8×H200 141GB)</td></tr><tr><td>B200</td><td>p6-b200.48xlarge (8×B200 HBM3e)</td><td>ND128isr_NDR_GB200_v6 (4×Blackwell 192GB)</td><td>a4-highgpu-8g (8×B200 HBM3e)</td></tr></tbody></table>

{% hint style="info" %}
If the open-weight model you want to use isn't on this list, contact us and our team will work with you.
{% endhint %}

### Open-Weight Model Installation

#### Devstral-2-123B-Instruct-2512

{% tabs %}
{% tab title="Standalone Docker" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fr7yJ4dKMqczlIfYY4n8G%2Fdocker%20logo.png?alt=media&#x26;token=99583e26-1e5a-4efb-90b0-f59103d53b08" alt="" data-size="line"> **Execution Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. System & Docker Setup
apt update
apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
usermod -aG docker $USER

# 2. Local Model Cache
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Launch devstral-2-123B
docker run -d --gpus all \
  --name devstral-123b \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  --ipc=host \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.13.0 \
  mistralai/Devstral-2-123B-Instruct-2512 \
  --port 8000 \
  --max-model-len 128000 \
  --max-num-seqs 64 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --swap-space 16 \
  --trust-remote-code
EOF

```
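
**Verification Commands**

Model download and weight loading can take a while for a 123B model. A quick way to verify progress, using the container name `devstral-123b` from the script above (the `/v1/models` endpoint is part of vLLM's OpenAI-compatible API):

```bash
# Follow the engine logs until vLLM reports that the API server is running
docker logs -f devstral-123b

# Confirm the model is being served on port 8000
curl http://localhost:8000/v1/models
```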

{% endtab %}

{% tab title="Kubernetes" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fltwyft3JLUiTn7mmPqae%2Fkubernetes%20logo.png?alt=media&#x26;token=eb7900fc-82f5-4711-8755-c9cc4d895c97" alt="" data-size="line"> **1. Create Namespace**

```bash
kubectl create namespace devstral-123b
```

**2. Deployment Manifest (`devstral-123b.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devstral-deployment
  namespace: devstral-123b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devstral-123b
  template:
    metadata:
      labels:
        app: devstral-123b
    spec:
      containers:
      - name: vllm-engine
        image: vllm/vllm-openai:v0.13.0
        args:
          - "mistralai/Devstral-2-123B-Instruct-2512"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "128000"
          - "--max-num-seqs"
          - "64"
          - "--tensor-parallel-size"
          - "8"
          - "--tool-call-parser"
          - "mistral"
          - "--enable-auto-tool-choice"
          - "--gpu-memory-utilization"
          - "0.95"
          - "--trust-remote-code"
        env:
          - name: HF_HOME
            value: "/hf_cache"
          - name: VLLM_ATTENTION_BACKEND
            value: "FLASHINFER"
          - name: VLLM_FLASHINFER_MOE_BACKEND
            value: "throughput"
        resources:
          limits:
            nvidia.com/gpu: 8 # Required for 123B weights
          requests:
            nvidia.com/gpu: 8
        volumeMounts:
          - name: model-cache
            mountPath: /hf_cache
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: model-cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "24Gi" # Increased for 8x H100 NCCL comms
---
apiVersion: v1
kind: Service
metadata:
  name: devstral-svc
  namespace: devstral-123b
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: devstral-123b
  type: ClusterIP

```
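
**3. Apply and Verify**

A rollout sketch using the names defined in the manifest above (deployment `devstral-deployment`, service `devstral-svc`, namespace `devstral-123b`):

```bash
# Create the deployment and service
kubectl apply -f devstral-123b.yaml

# Watch the pod start; the first launch downloads the model weights
kubectl get pods -n devstral-123b -w
kubectl logs -f deployment/devstral-deployment -n devstral-123b

# Smoke-test the OpenAI-compatible endpoint through a temporary port-forward
kubectl port-forward svc/devstral-svc 8000:80 -n devstral-123b &
curl http://localhost:8000/v1/models
```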

{% endtab %}
{% endtabs %}

#### Devstral-Small-2-24B-Instruct-2512

{% tabs %}
{% tab title="Standalone Docker" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fr7yJ4dKMqczlIfYY4n8G%2Fdocker%20logo.png?alt=media&#x26;token=99583e26-1e5a-4efb-90b0-f59103d53b08" alt="" data-size="line"> **Execution Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. Environment Setup
apt update && apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
usermod -aG docker $USER

# 2. Cache Setup
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Launch Devstral Container
docker run -d --gpus all \
  --name devstral-vllm \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.14.0 \
  mistralai/Devstral-Small-2-24B-Instruct-2512 \
  --port 8000 \
  --max-model-len 128000 \
  --max-num-seqs 512 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --swap-space 16 \
  --trust-remote-code
EOF
```
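
**Verification Commands**

A quick check using the container name `devstral-vllm` from the script above; the `/v1/models` endpoint is part of vLLM's OpenAI-compatible API:

```bash
# Follow the engine logs until the API server reports it is running
docker logs -f devstral-vllm

# Confirm the model is being served on port 8000
curl http://localhost:8000/v1/models
```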

{% endtab %}

{% tab title="Kubernetes" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fltwyft3JLUiTn7mmPqae%2Fkubernetes%20logo.png?alt=media&#x26;token=eb7900fc-82f5-4711-8755-c9cc4d895c97" alt="" data-size="line"> **1. Namespace Creation**

```bash
kubectl create namespace devstral-apps
```

**2. Deployment Manifest (`devstral-k8s.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devstral-inference
  namespace: devstral-apps
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devstral
  template:
    metadata:
      labels:
        app: devstral
    spec:
      containers:
      - name: vllm-engine
        image: vllm/vllm-openai:v0.14.0
        args:
          - "mistralai/Devstral-Small-2-24B-Instruct-2512"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "128000"
          - "--tensor-parallel-size"
          - "1"
          - "--tool-call-parser"
          - "mistral"
          - "--enable-auto-tool-choice"
          - "--gpu-memory-utilization"
          - "0.95"
          - "--trust-remote-code"
        env:
          - name: HF_HOME
            value: "/hf_cache"
          - name: VLLM_ATTENTION_BACKEND
            value: "FLASHINFER"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
          - name: model-cache
            mountPath: /hf_cache
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: model-cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: devstral-svc
  namespace: devstral-apps
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: devstral
  type: ClusterIP

```
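
**3. Apply and Verify**

A rollout sketch using the names defined in the manifest above (deployment `devstral-inference`, service `devstral-svc`, namespace `devstral-apps`):

```bash
# Create the deployment and service
kubectl apply -f devstral-k8s.yaml

# Follow the engine logs while the weights download and load
kubectl logs -f deployment/devstral-inference -n devstral-apps

# Smoke-test the OpenAI-compatible endpoint through a temporary port-forward
kubectl port-forward svc/devstral-svc 8000:80 -n devstral-apps &
curl http://localhost:8000/v1/models
```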

{% endtab %}
{% endtabs %}

#### MiniMax-M2.5

{% tabs %}
{% tab title="Standalone Docker" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fr7yJ4dKMqczlIfYY4n8G%2Fdocker%20logo.png?alt=media&#x26;token=99583e26-1e5a-4efb-90b0-f59103d53b08" alt="" data-size="line"> **Automated Startup Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. System Preparation & Docker Install
apt update
apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
usermod -aG docker $USER

# 2. Local Cache Directory
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Execution (Optimized for 8x H100)
docker run -d --gpus all \
  --name vllm-minimax \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  --ipc=host \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  vllm/vllm-openai:v0.14.0 \
  MiniMaxAI/MiniMax-M2.5 \
  --port 8000 \
  --max-model-len 128000 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --enable-expert-parallel
EOF
```

**Verification Commands**

```bash
docker logs -f vllm-minimax
```
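
Once the logs show the server is ready, you can send a test request. A minimal smoke test against vLLM's OpenAI-compatible endpoint; the `model` field matches the model name passed to the container above:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```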

{% endtab %}

{% tab title="Kubernetes" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fltwyft3JLUiTn7mmPqae%2Fkubernetes%20logo.png?alt=media&#x26;token=eb7900fc-82f5-4711-8755-c9cc4d895c97" alt="" data-size="line"> **1. Namespace Creation**

```bash
kubectl create namespace llm-inference
```

**2. Deployment Manifest (`minimax-deploy.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minimax-vllm
  namespace: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minimax-vllm
  template:
    metadata:
      labels:
        app: minimax-vllm
    spec:
      containers:
        - name: vllm-container
          image: vllm/vllm-openai:v0.14.0
          args:
            - "--model"
            - "MiniMaxAI/MiniMax-M2.5"
            - "--tensor-parallel-size"
            - "8"
            - "--max-model-len"
            - "128000"
            - "--gpu-memory-utilization"
            - "0.95"
            - "--trust-remote-code"
            - "--enable-expert-parallel"
          env:
            - name: HF_HOME
              value: "/hf_cache"
            - name: VLLM_ATTENTION_BACKEND
              value: "FLASHINFER"
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: cache-volume
              mountPath: /hf_cache
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: cache-volume
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
```

**Service Manifest**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minimax-service
  namespace: llm-inference
spec:
  selector:
    app: minimax-vllm
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
```

***

**Verification Commands**

```bash
kubectl logs -f deployment/minimax-vllm -n llm-inference
```
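
To test the endpoint from outside the cluster, a temporary port-forward to the service works. This sketch assumes the service manifest above has been applied:

```bash
# Forward local port 8000 to the service, then query the models endpoint
kubectl port-forward svc/minimax-service 8000:80 -n llm-inference &
curl http://localhost:8000/v1/models
```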

{% endtab %}
{% endtabs %}
