# Tabnine Deployment Options

Tabnine can be deployed in one of the following ways:

1. Single/Multi-Tenant SaaS
2. Private cloud / On-prem installation using open-weight models
3. Private cloud / On-prem installation using private API endpoints

### Single/Multi-Tenant SaaS

This deployment allows you to utilize Tabnine’s private LLM endpoints to support both Chat and Agentic workflows.

#### **Models**

This deployment utilizes the following families of LLMs for both Chat and Agent workflows:

* GPT
* Claude
* Gemini

#### **Hardware Requirements**

*None.*

### Private Cloud / On-Prem Installation Using Open-Weight AI Models

You can also power Tabnine with supported open-weight models installed on-premises or in your private cloud.

#### **Models**

If you are a Self-Hosted (SH) customer, your hardware needs depend on whether you already have any of the supported open-weight models within your infrastructure.

{% hint style="warning" %}
The following models will no longer be supported after version 6.2.0 (mid-May):

* Tabnine-protected
* Gemma 3 or lower
* Qwen 2.5 or lower

Accounts that use these models won't be able to upgrade to this version.
{% endhint %}

<table><thead><tr><th width="58.712890625"></th><th width="383.9716796875">Open-Weight Models Supported by Tabnine</th></tr></thead><tbody><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td></tr><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td>Devstral-2-123B-Instruct-2512**</td></tr><tr><td><img src="/files/Q3WcyouKrls19ukiF0ma" alt=""></td><td>MiniMax-M2.7</td></tr><tr><td><img src="https://sites.gitbook.com/preview/site_AIYf2/~gitbook/image?url=https%3A%2F%2F3436682446-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FY2qxVf5VTm3fmwP4B4Gx%252Fuploads%252FpnWd1QQl5f68JN0eCLnq%252FScreenshot%25202026-01-07%2520at%252012.33.37.png%3Falt%3Dmedia%26token%3D77bc312c-7f66-41e1-b4b7-0388c8cc17d3&#x26;width=300&#x26;dpr=3&#x26;quality=100&#x26;sign=3ba8b2e1&#x26;sv=2" alt=""></td><td>GLM-4.7</td></tr><tr><td><img src="/files/iIFS7SO03EeJeTeo1nRE" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td></tr><tr><td><img src="/files/iIFS7SO03EeJeTeo1nRE" alt=""></td><td>Qwen-3-30B <strong>(Chat only)</strong></td></tr></tbody></table>

If you don't already have a supported model in your infrastructure, Tabnine provides on-premises installation services for the following models:

<table><thead><tr><th width="58.712890625"></th><th width="383.9716796875">Open-Weight Models that Tabnine Offers to Install On-Prem</th></tr></thead><tbody><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td><a href="#devstral-small-2-24b-instruct-2512">Devstral-Small-2-24B-Instruct-2512</a></td></tr><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td><a href="#devstral-2-123b-instruct-2512">Devstral-2-123B-Instruct-2512</a>**</td></tr><tr><td><img src="/files/Q3WcyouKrls19ukiF0ma" alt=""></td><td><a href="#minimax-m2.7">MiniMax-M2.7</a></td></tr></tbody></table>

{% hint style="info" %}
\*\*Devstral 2 (123B parameters) operates under a modified MIT license. If your organization's global consolidated monthly revenue exceeds $20 million, using this model requires Devstral's permission.
{% endhint %}

#### **Hardware Requirements**

Installation requirements vary based on your specific use case. Please refer to the tables below to ensure optimal performance for both agentic workflows *and* chat features.

**Agent + Chat**

<table><thead><tr><th width="60.376953125"></th><th width="162.20703125">Agent + Chat</th><th width="144.353515625"></th><th>≤100 Users</th><th>101-500 Users</th><th>501-1000 Users</th><th>1001-2000 Users</th></tr></thead><tbody><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H100</td><td>3 H100</td><td>6 H100</td><td>12 H100</td></tr><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td>Devstral-2-123B-Instruct-2512</td><td><em><strong>Recommended</strong></em></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td><td><strong>16 B200</strong></td><td><strong>24 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>4 H100</td><td>8 H100</td><td>8 B200</td><td>16 B200</td></tr><tr><td><img src="/files/Q3WcyouKrls19ukiF0ma" alt=""></td><td>MiniMax-M2.7</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td><td><strong>16 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H200</td><td>4 H200</td><td>8 H200</td><td>16 H200</td></tr><tr><td><img src="/files/XuniF1Qdhhx6EenqAKpF" alt=""></td><td>GLM-4.7</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td><td><strong>16 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>8 H100</td><td>2 B200</td><td>4 B200</td><td>8 B200</td></tr><tr><td><img src="/files/iIFS7SO03EeJeTeo1nRE" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td><td><strong>16 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>8 H100</td><td>2 B200</td><td>4 B200</td><td>8 B200</td></tr></tbody></table>

**Chat Only**

<table><thead><tr><th width="59.955078125"></th><th width="160.408203125">Chat Only</th><th width="144.236328125"></th><th>≤100 Users</th><th>101-500 Users</th><th>501-1000 Users</th><th>1001-2000 Users</th></tr></thead><tbody><tr><td><img src="/files/31ub9jAlMwkkkNctGS1t" alt="">​​</td><td>Devstral-Small-2-24B-Instruct-2512</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H100</td><td>2 H100</td><td>2 H100</td><td>4 H100</td></tr><tr><td>​​<img src="/files/31ub9jAlMwkkkNctGS1t" alt=""></td><td>Devstral-2-123B-Instruct-2512</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>4 H100</td><td>4 H100</td><td>8 H100</td><td>16 H100</td></tr><tr><td>​​<img src="/files/wZ8BlllCsbqFHmUrQGgx" alt=""></td><td>MiniMax-M2.7</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>3 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H200</td><td>2 H200 or<br>4 H100</td><td>4 H200 or<br>8 H100</td><td>8 H200</td></tr><tr><td>​​<img src="/files/6U8z4vcCRc7laOqEBMrq" alt=""></td><td>GLM-4.7</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td>4 B200</td><td><strong>6 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>8 H100</td><td>2 B200</td><td>2 B200</td><td>4 B200</td></tr><tr><td>​​<img src="/files/fASHZp1WiwYiAPoE8upe" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>8 H100</td><td>8 H100</td><td>4 B200</td><td>8 B200</td></tr><tr><td>​​<img src="/files/fASHZp1WiwYiAPoE8upe" alt=""></td><td>Qwen-3-30B</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H100</td><td>2 H100</td><td></td><td></td></tr></tbody></table>

**GPU Availability by Cloud Provider**

<table><thead><tr><th width="96.6875">GPU</th><th>AWS</th><th>Azure</th><th>GCP</th></tr></thead><tbody><tr><td>H100</td><td>p5.4xlarge (H100 80GB)</td><td>NC40ads_H100_v5 (H100 94GB)</td><td>a3-highgpu-1g (H100 80GB)</td></tr><tr><td>H200</td><td>p5en.48xlarge (8×H200 141GB)</td><td>ND96isr_H200_v5 (8×H200 141GB)</td><td>a3-ultragpu-8g (8×H200 141GB)</td></tr><tr><td>B200</td><td>p6-b200.48xlarge (8×B200 HBM3e)</td><td>ND128isr_NDR_GB200_v6 (4×Blackwell 192GB)</td><td>a4-highgpu-8g (8×B200 HBM3e)</td></tr></tbody></table>
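If you are provisioning one of these instance types yourself, the AWS CLI call below is a minimal sketch for launching a single H100 node from the table above; the AMI ID, key pair, subnet, and volume size are placeholders you would replace with values from your own environment.

```bash
# Launch one p5.4xlarge (1x H100 80GB) node (placeholder AMI/key/subnet values)
aws ec2 run-instances \
  --instance-type p5.4xlarge \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --count 1 \
  --key-name YOUR_KEY_PAIR \
  --subnet-id subnet-xxxxxxxx \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=500,VolumeType=gp3}'
```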

{% hint style="info" %}
If you wish to use an open-weight model that is not included on this list, please contact our support team for a custom assessment.
{% endhint %}

### Open-Weight Model Installation

#### Devstral-2-123B-Instruct-2512

{% tabs %}
{% tab title="Standalone Docker" %} <img src="/files/SMU7Zen7drjYkhiSV28X" alt="" data-size="line"> **Execution Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. System & Docker Setup
apt update
apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
usermod -aG docker $USER

# 2. Local Model Cache
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Launch devstral-2-123B
docker run -d --gpus all \
  --name devstral-123b \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  --ipc=host \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.13.0 \
  mistralai/Devstral-2-123B-Instruct-2512 \
  --port 8000 \
  --max-model-len 128000 \
  --max-num-seqs 64 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --swap-space 16 \
  --trust-remote-code
EOF

```
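**Verification Commands**

Once the container is up, a minimal sanity check (assuming the default port mapping from the script above):

```bash
# Follow the vLLM startup logs until the model finishes loading
docker logs -f devstral-123b

# Confirm the OpenAI-compatible endpoint is serving the model
curl http://localhost:8000/v1/models
```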

{% endtab %}

{% tab title="Kubernetes" %} <img src="/files/wBLfANrdbgGUH8bHlB6Y" alt="" data-size="line"> **1. Create Namespace**

```bash
kubectl create namespace devstral-123b
```

**2. Deployment Manifest (`devstral-123b.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devstral-deployment
  namespace: devstral-123b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devstral-123b
  template:
    metadata:
      labels:
        app: devstral-123b
    spec:
      containers:
      - name: vllm-engine
        image: vllm/vllm-openai:v0.13.0
        args:
          - "mistralai/Devstral-2-123B-Instruct-2512"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "128000"
          - "--max-num-seqs"
          - "64"
          - "--tensor-parallel-size"
          - "8"
          - "--tool-call-parser"
          - "mistral"
          - "--enable-auto-tool-choice"
          - "--gpu-memory-utilization"
          - "0.95"
          - "--trust-remote-code"
        env:
          - name: HF_HOME
            value: "/hf_cache"
          - name: VLLM_ATTENTION_BACKEND
            value: "FLASHINFER"
          - name: VLLM_FLASHINFER_MOE_BACKEND
            value: "throughput"
        resources:
          limits:
            nvidia.com/gpu: 8 # Required for 123B weights
          requests:
            nvidia.com/gpu: 8
        volumeMounts:
          - name: model-cache
            mountPath: /hf_cache
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: model-cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "24Gi" # Increased for 8x H100 NCCL comms
---
apiVersion: v1
kind: Service
metadata:
  name: devstral-svc
  namespace: devstral-123b
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: devstral-123b
  type: ClusterIP

```
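**3. Apply and Verify**

A minimal sketch, assuming the manifest above is saved as `devstral-123b.yaml`:

```bash
# Apply the Deployment and Service
kubectl apply -f devstral-123b.yaml

# Watch the pod start, then follow the vLLM startup logs
kubectl get pods -n devstral-123b -w
kubectl logs -f deployment/devstral-deployment -n devstral-123b
```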

{% endtab %}
{% endtabs %}

#### Devstral-Small-2-24B-Instruct-2512

{% tabs %}
{% tab title="Standalone Docker" %} <img src="/files/SMU7Zen7drjYkhiSV28X" alt="" data-size="line"> **Execution Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. Environment Setup
apt update && apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
usermod -aG docker $USER

# 2. Cache Setup
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Launch Devstral Container
docker run -d --gpus all \
  --name devstral-vllm \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.14.0 \
  mistralai/Devstral-Small-2-24B-Instruct-2512 \
  --port 8000 \
  --max-model-len 128000 \
  --max-num-seqs 512 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --swap-space 16 \
  --trust-remote-code
EOF
```
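**Verification Commands**

A minimal sanity check once the container is running (assuming the default port mapping from the script above):

```bash
# Follow the vLLM startup logs until the model finishes loading
docker logs -f devstral-vllm

# Confirm the OpenAI-compatible endpoint is serving the model
curl http://localhost:8000/v1/models
```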

{% endtab %}

{% tab title="Kubernetes" %} <img src="/files/wBLfANrdbgGUH8bHlB6Y" alt="" data-size="line"> **1. Namespace Creation**

```bash
kubectl create namespace devstral-apps
```

**2. Deployment Manifest (`devstral-k8s.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devstral-inference
  namespace: devstral-apps
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devstral
  template:
    metadata:
      labels:
        app: devstral
    spec:
      containers:
      - name: vllm-engine
        image: vllm/vllm-openai:v0.14.0
        args:
          - "mistralai/Devstral-Small-2-24B-Instruct-2512"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "128000"
          - "--tensor-parallel-size"
          - "1"
          - "--tool-call-parser"
          - "mistral"
          - "--enable-auto-tool-choice"
          - "--gpu-memory-utilization"
          - "0.95"
          - "--trust-remote-code"
        env:
          - name: HF_HOME
            value: "/hf_cache"
          - name: VLLM_ATTENTION_BACKEND
            value: "FLASHINFER"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
          - name: model-cache
            mountPath: /hf_cache
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: model-cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: devstral-svc
  namespace: devstral-apps
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: devstral
  type: ClusterIP

```
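**3. Apply and Verify**

A minimal sketch, assuming the manifest above is saved as `devstral-k8s.yaml`:

```bash
# Apply the Deployment and Service
kubectl apply -f devstral-k8s.yaml

# Watch the pod start, then follow the vLLM startup logs
kubectl get pods -n devstral-apps -w
kubectl logs -f deployment/devstral-inference -n devstral-apps
```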

{% endtab %}
{% endtabs %}

#### MiniMax-M2.7

{% tabs %}
{% tab title="Standalone Docker" %} <img src="/files/SMU7Zen7drjYkhiSV28X" alt="" data-size="line"> **Automated Startup Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. System Preparation & Docker Install
apt update
apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
usermod -aG docker $USER

# 2. Local Cache Directory
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Execution (Optimized for 8x H100)
docker run -d --gpus all \
  --name vllm-minimax \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  --ipc=host \
  -e HF_HOME=/hf_cache \
  -e SAFETENSORS_FAST_GPU=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  vllm/vllm-openai:v0.19.0 \
  --model MiniMaxAI/MiniMax-M2.7 \
  --served-model-name MiniMaxAI/MiniMax-M2.7 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 180000 \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.95 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --attention-config '{"backend": "FLASHINFER"}' \
  --trust-remote-code
EOF
```

**Verification Commands**

```bash
docker logs -f vllm-minimax
```
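Once the logs show the server is ready, you can smoke-test the OpenAI-compatible endpoint; the request below is a sketch that assumes the default port mapping from the script above:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2.7",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32
      }'
```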

{% endtab %}

{% tab title="Kubernetes" %} <img src="/files/wBLfANrdbgGUH8bHlB6Y" alt="" data-size="line"> **1. Namespace Creation**

```bash
kubectl create namespace llm-inference
```

**2. Deployment Manifest (`minimax-deploy.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minimax-vllm
  namespace: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minimax-vllm
  template:
    metadata:
      labels:
        app: minimax-vllm
    spec:
      containers:
        - name: vllm-container
          image: vllm/vllm-openai:v0.14.0
          args:
            - "--model"
            - "MiniMaxAI/MiniMax-M2.7"
            - "--tensor-parallel-size"
            - "8"
            - "--max-model-len"
            - "128000"
            - "--gpu-memory-utilization"
            - "0.95"
            - "--trust-remote-code"
            - "--enable-expert-parallel"
          env:
            - name: HF_HOME
              value: "/hf_cache"
            - name: VLLM_ATTENTION_BACKEND
              value: "FLASHINFER"
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: cache-volume
              mountPath: /hf_cache
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: cache-volume
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
```

**Service Manifest**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minimax-service
  namespace: llm-inference
spec:
  selector:
    app: minimax-vllm
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
```
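**Apply the Manifests**

A minimal sketch, assuming the Deployment is saved as `minimax-deploy.yaml` and the Service as `minimax-service.yaml` (an assumed filename):

```bash
kubectl apply -f minimax-deploy.yaml
kubectl apply -f minimax-service.yaml

# Wait for the pod to become Ready
kubectl get pods -n llm-inference -w
```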

***

**Verification Commands**

```bash
kubectl logs -f deployment/minimax-vllm -n llm-inference
```

{% endtab %}
{% endtabs %}

