# Tabnine Deployment Options

Tabnine can be deployed in one of the following ways:

1. Single/Multi-Tenant SaaS
2. Private cloud / On-prem installation using open-weight models
3. Private cloud / On-prem installation using private API endpoints

### Single/Multi-Tenant SaaS

This deployment allows you to utilize Tabnine’s private LLM endpoints to support both Chat and Agentic workflows.

#### **Models**

This deployment utilizes the following families of LLMs for both Chat and Agent workflows:

* GPT
* Claude
* Gemini

#### **Hardware Requirements**

*None.*

### Private Cloud / On-Prem Installation Using Open-Weight AI Models

You can also power Tabnine with supported open-weight models installed on-premises or in your private cloud.

#### **Models**

If you are a Self-Hosted (SH) customer, your hardware needs depend on whether you already have any of the supported open-weight models within your infrastructure.

{% hint style="warning" %}
The following models will no longer be supported after version 6.2.0 (mid-May):

* Tabnine-protected
* Gemma 3 or lower
* Qwen 2.5 or lower

Accounts that use these models won't be able to upgrade to this version.
{% endhint %}

<table><thead><tr><th width="58.712890625"></th><th width="383.9716796875">Open-Weight Models Supported by Tabnine</th></tr></thead><tbody><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td></tr><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td>Devstral-2-123B-Instruct-2512**</td></tr><tr><td><img src="/files/Q3WcyouKrls19ukiF0ma" alt=""></td><td>MiniMax-M2.7</td></tr><tr><td><img src="https://sites.gitbook.com/preview/site_AIYf2/~gitbook/image?url=https%3A%2F%2F3436682446-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FY2qxVf5VTm3fmwP4B4Gx%252Fuploads%252FpnWd1QQl5f68JN0eCLnq%252FScreenshot%25202026-01-07%2520at%252012.33.37.png%3Falt%3Dmedia%26token%3D77bc312c-7f66-41e1-b4b7-0388c8cc17d3&#x26;width=300&#x26;dpr=3&#x26;quality=100&#x26;sign=3ba8b2e1&#x26;sv=2" alt=""></td><td>GLM-4.7</td></tr><tr><td><img src="/files/iIFS7SO03EeJeTeo1nRE" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td></tr><tr><td><img src="/files/iIFS7SO03EeJeTeo1nRE" alt=""></td><td>Qwen-3-30B <strong>(Chat only)</strong></td></tr></tbody></table>

If you don't already have a supported model in your infrastructure, Tabnine provides on-premises installation services for the following models:

<table><thead><tr><th width="58.712890625"></th><th width="383.9716796875">Open-Weight Models that Tabnine Offers to Install On-Prem</th></tr></thead><tbody><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td><a href="#devstral-small-2-24b-instruct-2512">Devstral-Small-2-24B-Instruct-2512</a></td></tr><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td><a href="#devstral-2-123b-instruct-2512">Devstral-2-123B-Instruct-2512</a>**</td></tr><tr><td><img src="/files/Q3WcyouKrls19ukiF0ma" alt=""></td><td><a href="#minimax-m2.7">MiniMax-M2.7</a></td></tr></tbody></table>

{% hint style="info" %}
\*\*Devstral 2 (123B parameters) operates under a modified MIT license. If your organization's global consolidated monthly revenue exceeds $20 million, using this model requires Devstral's permission.
{% endhint %}

#### **Hardware Requirements**

Installation requirements vary based on your specific use case. Please refer to the tables below to ensure optimal performance for both agentic workflows *and* chat features.

**Agent + Chat**

<table><thead><tr><th width="60.376953125"></th><th width="162.20703125">Agent + Chat</th><th width="144.353515625"></th><th>≤100 Users</th><th>101-500 Users</th><th>501-1000 Users</th><th>1001-2000 Users</th></tr></thead><tbody><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H100</td><td>3 H100</td><td>6 H100</td><td>12 H100</td></tr><tr><td><img src="/files/mEL7ixqGTcVfbzg0M88J" alt=""></td><td>Devstral-2-123B-Instruct-2512</td><td><em><strong>Recommended</strong></em></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td><td><strong>16 B200</strong></td><td><strong>24 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>4 H100</td><td>8 H100</td><td>8 B200</td><td>16 B200</td></tr><tr><td><img src="/files/Q3WcyouKrls19ukiF0ma" alt=""></td><td>MiniMax-M2.7</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td><td><strong>16 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H200</td><td>4 H200</td><td>8 H200</td><td>16 H200</td></tr><tr><td><img src="/files/XuniF1Qdhhx6EenqAKpF" alt=""></td><td>GLM-4.7</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td><td><strong>16 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>8 H100</td><td>2 B200</td><td>4 B200</td><td>8 B200</td></tr><tr><td><img src="/files/iIFS7SO03EeJeTeo1nRE" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td><td><strong>16 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>8 H100</td><td>2 B200</td><td>4 B200</td><td>8 B200</td></tr></tbody></table>

**Chat Only**

<table><thead><tr><th width="59.955078125"></th><th width="160.408203125">Chat Only</th><th width="144.236328125"></th><th>≤100 Users</th><th>101-500 Users</th><th>501-1000 Users</th><th>1001-2000 Users</th></tr></thead><tbody><tr><td><img src="/files/31ub9jAlMwkkkNctGS1t" alt="">​​</td><td>Devstral-Small-2-24B-Instruct-2512</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H100</td><td>2 H100</td><td>2 H100</td><td>4 H100</td></tr><tr><td>​​<img src="/files/31ub9jAlMwkkkNctGS1t" alt=""></td><td>Devstral-2-123B-Instruct-2512</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>4 H100</td><td>4 H100</td><td>8 H100</td><td>16 H100</td></tr><tr><td>​​<img src="/files/wZ8BlllCsbqFHmUrQGgx" alt=""></td><td>MiniMax-M2.7</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>3 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H200</td><td>2 H200 or<br>4 H100</td><td>4 H200 or<br>8 H100</td><td>8 H200</td></tr><tr><td>​​<img src="/files/6U8z4vcCRc7laOqEBMrq" alt=""></td><td>GLM-4.7</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td>4 B200</td><td><strong>6 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>8 H100</td><td>2 B200</td><td>2 B200</td><td>4 B200</td></tr><tr><td>​​<img src="/files/fASHZp1WiwYiAPoE8upe" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>4 B200</strong></td><td><strong>8 B200</strong></td></tr><tr><td></td><td></td><td>Minimum</td><td>8 H100</td><td>8 H100</td><td>4 B200</td><td>8 B200</td></tr><tr><td>​​<img src="/files/fASHZp1WiwYiAPoE8upe" alt=""></td><td>Qwen-3-30B</td><td><em><strong>Recommended</strong></em></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td><strong>2 B200</strong></td><td></td></tr><tr><td></td><td></td><td>Minimum</td><td>2 H100</td><td>2 H100</td><td></td><td></td></tr></tbody></table>

**GPU Availability by Cloud Provider**

<table><thead><tr><th width="96.6875">GPU</th><th>AWS</th><th>Azure</th><th>GCP</th></tr></thead><tbody><tr><td>H100</td><td>p5.4xlarge (H100 80GB)</td><td>NC40ads_H100_v5 (H100 94GB)</td><td>a3-highgpu-1g (H100 80GB)</td></tr><tr><td>H200</td><td>p5en.48xlarge (8×H200 141GB)</td><td>ND96isr_H200_v5 (8×H200 141GB)</td><td>a3-ultragpu-8g (8×H200 141GB)</td></tr><tr><td>B200</td><td>p6-b200.48xlarge (8×B200 HBM3e)</td><td>ND128isr_NDR_GB200_v6 (4×Blackwell 192GB)</td><td>a4-highgpu-8g (8×B200 HBM3e)</td></tr></tbody></table>
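If you are provisioning one of these instance types yourself, the AWS CLI call below is a minimal sketch for launching a single H100 node from the table above; the AMI ID, key pair, subnet, and volume size are placeholders you would replace with values from your own environment.

```bash
# Launch one p5.4xlarge (1x H100 80GB) node (placeholder AMI/key/subnet values)
aws ec2 run-instances \
  --instance-type p5.4xlarge \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --count 1 \
  --key-name YOUR_KEY_PAIR \
  --subnet-id subnet-xxxxxxxx \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=500,VolumeType=gp3}'
```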

{% hint style="info" %}
If you wish to use an open-weight model that is not included on this list, please contact our support team for a custom assessment.
{% endhint %}

### Open-Weight Model Installation

#### Devstral-2-123B-Instruct-2512

{% tabs %}
{% tab title="Standalone Docker" %} <img src="/files/SMU7Zen7drjYkhiSV28X" alt="" data-size="line"> **Execution Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. System & Docker Setup
apt update
apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
usermod -aG docker $USER

# 2. Local Model Cache
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Launch devstral-2-123B
docker run -d --gpus all \
  --name devstral-123b \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  --ipc=host \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.13.0 \
  mistralai/Devstral-2-123B-Instruct-2512 \
  --port 8000 \
  --max-model-len 128000 \
  --max-num-seqs 64 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --swap-space 16 \
  --trust-remote-code
EOF

```
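**Verification Commands**

Once the container is up, a minimal sanity check (assuming the default port mapping from the script above):

```bash
# Follow the vLLM startup logs until the model finishes loading
docker logs -f devstral-123b

# Confirm the OpenAI-compatible endpoint is serving the model
curl http://localhost:8000/v1/models
```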

{% endtab %}

{% tab title="Kubernetes" %} <img src="/files/wBLfANrdbgGUH8bHlB6Y" alt="" data-size="line"> **1. Create Namespace**

```bash
kubectl create namespace devstral-123b
```

**2. Deployment Manifest (`devstral-123b.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devstral-deployment
  namespace: devstral-123b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devstral-123b
  template:
    metadata:
      labels:
        app: devstral-123b
    spec:
      containers:
      - name: vllm-engine
        image: vllm/vllm-openai:v0.13.0
        args:
          - "mistralai/Devstral-2-123B-Instruct-2512"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "128000"
          - "--max-num-seqs"
          - "64"
          - "--tensor-parallel-size"
          - "8"
          - "--tool-call-parser"
          - "mistral"
          - "--enable-auto-tool-choice"
          - "--gpu-memory-utilization"
          - "0.95"
          - "--trust-remote-code"
        env:
          - name: HF_HOME
            value: "/hf_cache"
          - name: VLLM_ATTENTION_BACKEND
            value: "FLASHINFER"
          - name: VLLM_FLASHINFER_MOE_BACKEND
            value: "throughput"
        resources:
          limits:
            nvidia.com/gpu: 8 # Required for 123B weights
          requests:
            nvidia.com/gpu: 8
        volumeMounts:
          - name: model-cache
            mountPath: /hf_cache
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: model-cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "24Gi" # Increased for 8x H100 NCCL comms
---
apiVersion: v1
kind: Service
metadata:
  name: devstral-svc
  namespace: devstral-123b
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: devstral-123b
  type: ClusterIP

```
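**3. Apply and Verify**

A minimal sketch, assuming the manifest above is saved as `devstral-123b.yaml`:

```bash
# Apply the Deployment and Service
kubectl apply -f devstral-123b.yaml

# Watch the pod start, then follow the vLLM startup logs
kubectl get pods -n devstral-123b -w
kubectl logs -f deployment/devstral-deployment -n devstral-123b
```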

{% endtab %}
{% endtabs %}

#### Devstral-Small-2-24B-Instruct-2512

{% tabs %}
{% tab title="Standalone Docker" %} <img src="/files/SMU7Zen7drjYkhiSV28X" alt="" data-size="line"> **Execution Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. Environment Setup
apt update && apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
usermod -aG docker $USER

# 2. Cache Setup
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Launch Devstral Container
docker run -d --gpus all \
  --name devstral-vllm \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.14.0 \
  mistralai/Devstral-Small-2-24B-Instruct-2512 \
  --port 8000 \
  --max-model-len 128000 \
  --max-num-seqs 512 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --swap-space 16 \
  --trust-remote-code
EOF
```
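**Verification Commands**

A minimal sanity check once the container is running (assuming the default port mapping from the script above):

```bash
# Follow the vLLM startup logs until the model finishes loading
docker logs -f devstral-vllm

# Confirm the OpenAI-compatible endpoint is serving the model
curl http://localhost:8000/v1/models
```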

{% endtab %}

{% tab title="Kubernetes" %} <img src="/files/wBLfANrdbgGUH8bHlB6Y" alt="" data-size="line"> **1. Namespace Creation**

```bash
kubectl create namespace devstral-apps
```

**2. Deployment Manifest (`devstral-k8s.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devstral-inference
  namespace: devstral-apps
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devstral
  template:
    metadata:
      labels:
        app: devstral
    spec:
      containers:
      - name: vllm-engine
        image: vllm/vllm-openai:v0.14.0
        args:
          - "mistralai/Devstral-Small-2-24B-Instruct-2512"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "128000"
          - "--tensor-parallel-size"
          - "1"
          - "--tool-call-parser"
          - "mistral"
          - "--enable-auto-tool-choice"
          - "--gpu-memory-utilization"
          - "0.95"
          - "--trust-remote-code"
        env:
          - name: HF_HOME
            value: "/hf_cache"
          - name: VLLM_ATTENTION_BACKEND
            value: "FLASHINFER"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
          - name: model-cache
            mountPath: /hf_cache
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: model-cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: devstral-svc
  namespace: devstral-apps
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: devstral
  type: ClusterIP

```
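**3. Apply and Verify**

A minimal sketch, assuming the manifest above is saved as `devstral-k8s.yaml`:

```bash
# Apply the Deployment and Service
kubectl apply -f devstral-k8s.yaml

# Watch the pod start, then follow the vLLM startup logs
kubectl get pods -n devstral-apps -w
kubectl logs -f deployment/devstral-inference -n devstral-apps
```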

{% endtab %}
{% endtabs %}

#### MiniMax-M2.7

{% tabs %}
{% tab title="Standalone Docker" %} <img src="/files/SMU7Zen7drjYkhiSV28X" alt="" data-size="line"> **Automated Startup Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. System Preparation & Docker Install
apt update
apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
usermod -aG docker $USER

# 2. Local Cache Directory
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Execution (Optimized for 8x H100)
docker run -d --gpus all \
  --name vllm-minimax \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  --ipc=host \
  -e HF_HOME=/hf_cache \
  -e SAFETENSORS_FAST_GPU=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  vllm/vllm-openai:v0.19.0 \
  --model MiniMaxAI/MiniMax-M2.7 \
  --served-model-name MiniMaxAI/MiniMax-M2.7 \
  --port 8000 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 180000 \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.95 \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --attention-config '{"backend": "FLASHINFER"}' \
  --trust-remote-code
EOF
```

**Verification Commands**

```bash
docker logs -f vllm-minimax
```
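Once the logs show the server is ready, you can smoke-test the OpenAI-compatible endpoint; the request below is a sketch that assumes the default port mapping from the script above:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2.7",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 32
      }'
```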

{% endtab %}

{% tab title="Kubernetes" %} <img src="/files/wBLfANrdbgGUH8bHlB6Y" alt="" data-size="line"> **1. Namespace Creation**

```bash
kubectl create namespace llm-inference
```

**2. Deployment Manifest (`minimax-deploy.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minimax-vllm
  namespace: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minimax-vllm
  template:
    metadata:
      labels:
        app: minimax-vllm
    spec:
      containers:
        - name: vllm-container
          image: vllm/vllm-openai:v0.14.0
          args:
            - "--model"
            - "MiniMaxAI/MiniMax-M2.7"
            - "--tensor-parallel-size"
            - "8"
            - "--max-model-len"
            - "128000"
            - "--gpu-memory-utilization"
            - "0.95"
            - "--trust-remote-code"
            - "--enable-expert-parallel"
          env:
            - name: HF_HOME
              value: "/hf_cache"
            - name: VLLM_ATTENTION_BACKEND
              value: "FLASHINFER"
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: cache-volume
              mountPath: /hf_cache
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: cache-volume
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
```

**Service Manifest**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minimax-service
  namespace: llm-inference
spec:
  selector:
    app: minimax-vllm
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
```
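**Apply the Manifests**

A minimal sketch, assuming the Deployment is saved as `minimax-deploy.yaml` and the Service as `minimax-service.yaml` (an assumed filename):

```bash
kubectl apply -f minimax-deploy.yaml
kubectl apply -f minimax-service.yaml

# Wait for the pod to become Ready
kubectl get pods -n llm-inference -w
```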

***

**Verification Commands**

```bash
kubectl logs -f deployment/minimax-vllm -n llm-inference
```

{% endtab %}
{% endtabs %}

