# Private Cloud / On-Prem Installation Using Open-Weight AI Models

You can also power Tabnine with supported open-weight models installed on-premises or in one of the private clouds mentioned above.

### **Models**

For Self-Hosted (SH) customers, hardware needs depend on whether you already have one of the supported open-weight models in your infrastructure.

<table><thead><tr><th width="58.712890625"></th><th width="383.9716796875">Tabnine-Supported Open-Weight Models</th></tr></thead><tbody><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-2-123B-Instruct-2512**</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FhoC55ogj5zmtKzmt4OqX%2Fminimax-color.png?alt=media&#x26;token=29fe9411-a413-4c87-9766-7b8ec518d133" alt=""></td><td>MiniMax-M2.5</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FpnWd1QQl5f68JN0eCLnq%2FScreenshot%202026-01-07%20at%2012.33.37.png?alt=media&#x26;token=77bc312c-7f66-41e1-b4b7-0388c8cc17d3" alt=""></td><td>GLM-4.7</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-30B <strong>(Chat only)</strong></td></tr></tbody></table>

If you don't already have one of these models, we will install one of the following models on-premises for you:

<table><thead><tr><th width="58.712890625"></th><th width="383.9716796875">Open-Weight Models that Tabnine Offers to Install On-Prem</th></tr></thead><tbody><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td><a href="#devstral-small-2-24b-instruct-2512">Devstral-Small-2-24B-Instruct-2512</a></td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td><a href="#devstral-2-123b-instruct-2512">Devstral-2-123B-Instruct-2512</a>**</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FhoC55ogj5zmtKzmt4OqX%2Fminimax-color.png?alt=media&#x26;token=29fe9411-a413-4c87-9766-7b8ec518d133" alt=""></td><td><a href="#minimax-m2.5">MiniMax-M2.5</a></td></tr></tbody></table>

{% hint style="info" %}
\*\*Devstral 2 (123B parameters) operates under a modified MIT license. If your organization's global consolidated monthly revenue exceeds $20 million, using this model requires Devstral's permission.
{% endhint %}

#### **Hardware Requirements**

Installation requirements are sized to ensure users have an optimal experience with Tabnine, and they differ between agentic workflows (Agent + Chat) and Chat only.

**Agent + Chat**

<table><thead><tr><th width="58.712890625"></th><th width="173.615234375">Agent + Chat</th><th>≤100 Users — Recommended</th><th>≤100 Users — Minimal</th><th>101-500 Users — Recommended</th><th>101-500 Users — Minimal</th><th>501-1000 Users — Recommended</th><th>501-1000 Users — Minimal</th><th>1001-2000 Users — Recommended</th><th>1001-2000 Users — Minimal</th></tr></thead><tbody><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>3 H100</td><td>4 B200</td><td>6 H100</td><td>8 B200</td><td>12 H100</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-2-123B-Instruct-2512</td><td>4 B200</td><td>4 H100</td><td>8 B200</td><td>8 H100</td><td>16 B200</td><td>8 B200</td><td>24 B200</td><td>16 B200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FhoC55ogj5zmtKzmt4OqX%2Fminimax-color.png?alt=media&#x26;token=29fe9411-a413-4c87-9766-7b8ec518d133" alt=""></td><td>MiniMax-M2.5</td><td>2 B200</td><td>2 H200</td><td>4 B200</td><td>4 H200</td><td>8 B200</td><td>8 H200</td><td>16 B200</td><td>16 H200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FpnWd1QQl5f68JN0eCLnq%2FScreenshot%202026-01-07%20at%2012.33.37.png?alt=media&#x26;token=77bc312c-7f66-41e1-b4b7-0388c8cc17d3" alt=""></td><td>GLM-4.7</td><td>2 B200</td><td>8 H100</td><td>4 B200</td><td>2 B200</td><td>8 B200</td><td>4 B200</td><td>16 B200</td><td>8 B200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td><td>2 B200</td><td>8 H100</td><td>4 B200</td><td>2 B200</td><td>8 B200</td><td>4 B200</td><td>16 B200</td><td>8 B200</td></tr></tbody></table>

**Chat Only**

<table><thead><tr><th width="58.9326171875"></th><th width="175.763671875">Chat Only</th><th>≤100 Users — Recommended</th><th>≤100 Users — Minimal</th><th>101-500 Users — Recommended</th><th>101-500 Users — Minimal</th><th>501-1000 Users — Recommended</th><th>501-1000 Users — Minimal</th><th>1001-2000 Users — Recommended</th><th>1001-2000 Users — Minimal</th></tr></thead><tbody><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-Small-2-24B-Instruct-2512</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>4 H100</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FfxZflxCYxWYvBhj9fOre%2Fm-rainbow.svg?alt=media&#x26;token=03a5f129-52c9-4326-9dc0-028b400462ca" alt=""></td><td>Devstral-2-123B-Instruct-2512</td><td>2 B200</td><td>4 H100</td><td>2 B200</td><td>4 H100</td><td>4 B200</td><td>8 H100</td><td>8 B200</td><td>16 H100</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FhoC55ogj5zmtKzmt4OqX%2Fminimax-color.png?alt=media&#x26;token=29fe9411-a413-4c87-9766-7b8ec518d133" alt=""></td><td>MiniMax-M2.5</td><td>2 B200</td><td>2 H 200</td><td>2 B200</td><td><p>2 H200 /</p><p>4 H100</p></td><td>2 B200</td><td><p>4 H200 /</p><p>8 H100</p></td><td>3 B200</td><td>8 H200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2FpnWd1QQl5f68JN0eCLnq%2FScreenshot%202026-01-07%20at%2012.33.37.png?alt=media&#x26;token=77bc312c-7f66-41e1-b4b7-0388c8cc17d3" alt=""></td><td>GLM-4.7</td><td>2 B200</td><td>8 H100</td><td>2 B200</td><td>2 B200</td><td>4 B200</td><td>2 B200</td><td>6 B200</td><td>4 B200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-Coder-480B-A35B-Instruct</td><td>2 B200</td><td>8 H100</td><td>2 B200</td><td>8 H100</td><td>4 B200</td><td>4 B200</td><td>8 B200</td><td>8 B200</td></tr><tr><td><img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fgit-blob-c800c1f921f0a9179574471daa68624e65ee8aa7%2FTransparent%20Qwen%20logo.png?alt=media" alt=""></td><td>Qwen-3-30B</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td><td>2 B200</td><td>2 H100</td></tr></tbody></table>

**GPU Availability by Cloud Provider**

<table><thead><tr><th width="96.6875">GPU</th><th>AWS</th><th>Azure</th><th>GCP</th></tr></thead><tbody><tr><td>H100</td><td>p5.4xlarge (H100 80GB)</td><td>NC40ads_H100_v5 (H100 94GB)</td><td>a3-highgpu-1g (H100 80GB)</td></tr><tr><td>H200</td><td>p5en.48xlarge (8×H200 141GB)</td><td>ND96isr_H200_v5 (8×H200 141GB)</td><td>a3-ultragpu-8g (8×H200 141GB)</td></tr><tr><td>B200</td><td>p6-b200.48xlarge (8×B200 HBM3e)</td><td>ND128isr_NDR_GB200_v6 (4×Blackwell 192GB)</td><td>a4-highgpu-8g (8×B200 HBM3e)</td></tr></tbody></table>

{% hint style="info" %}
If the open-weight model you want to use isn't on this list, contact us and our team will work with you.
{% endhint %}

### Open-Weight Model Installation

#### Devstral-2-123B-Instruct-2512

{% tabs %}
{% tab title="Standalone Docker" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fr7yJ4dKMqczlIfYY4n8G%2Fdocker%20logo.png?alt=media&#x26;token=99583e26-1e5a-4efb-90b0-f59103d53b08" alt="" data-size="line"> **Execution Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. System & Docker Setup
apt update
apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
usermod -aG docker $USER

# 2. Local Model Cache
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Launch devstral-2-123B
docker run -d --gpus all \
  --name devstral-123b \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  --ipc=host \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP16=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.13.0 \
  mistralai/Devstral-2-123B-Instruct-2512 \
  --port 8000 \
  --max-model-len 128000 \
  --max-num-seqs 64 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --swap-space 16 \
  --trust-remote-code
EOF

```
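
**Verification Commands**

Model download and weight loading can take a while for a 123B model. A quick way to verify progress, using the container name `devstral-123b` from the script above (the `/v1/models` endpoint is part of vLLM's OpenAI-compatible API):

```bash
# Follow the engine logs until vLLM reports that the API server is running
docker logs -f devstral-123b

# Confirm the model is being served on port 8000
curl http://localhost:8000/v1/models
```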

{% endtab %}

{% tab title="Kubernetes" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fltwyft3JLUiTn7mmPqae%2Fkubernetes%20logo.png?alt=media&#x26;token=eb7900fc-82f5-4711-8755-c9cc4d895c97" alt="" data-size="line"> **1. Create Namespace**

```bash
kubectl create namespace devstral-123b
```

**2. Deployment Manifest (`devstral-123b.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devstral-deployment
  namespace: devstral-123b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devstral-123b
  template:
    metadata:
      labels:
        app: devstral-123b
    spec:
      containers:
      - name: vllm-engine
        image: vllm/vllm-openai:v0.13.0
        args:
          - "mistralai/Devstral-2-123B-Instruct-2512"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "128000"
          - "--max-num-seqs"
          - "64"
          - "--tensor-parallel-size"
          - "8"
          - "--tool-call-parser"
          - "mistral"
          - "--enable-auto-tool-choice"
          - "--gpu-memory-utilization"
          - "0.95"
          - "--trust-remote-code"
        env:
          - name: HF_HOME
            value: "/hf_cache"
          - name: VLLM_ATTENTION_BACKEND
            value: "FLASHINFER"
          - name: VLLM_FLASHINFER_MOE_BACKEND
            value: "throughput"
        resources:
          limits:
            nvidia.com/gpu: 8 # Required for 123B weights
          requests:
            nvidia.com/gpu: 8
        volumeMounts:
          - name: model-cache
            mountPath: /hf_cache
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: model-cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "24Gi" # Increased for 8x H100 NCCL comms
---
apiVersion: v1
kind: Service
metadata:
  name: devstral-svc
  namespace: devstral-123b
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: devstral-123b
  type: ClusterIP

```
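
**3. Apply and Verify**

A rollout sketch using the names defined in the manifest above (deployment `devstral-deployment`, service `devstral-svc`, namespace `devstral-123b`):

```bash
# Create the deployment and service
kubectl apply -f devstral-123b.yaml

# Watch the pod start; the first launch downloads the model weights
kubectl get pods -n devstral-123b -w
kubectl logs -f deployment/devstral-deployment -n devstral-123b

# Smoke-test the OpenAI-compatible endpoint through a temporary port-forward
kubectl port-forward svc/devstral-svc 8000:80 -n devstral-123b &
curl http://localhost:8000/v1/models
```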

{% endtab %}
{% endtabs %}

#### Devstral-Small-2-24B-Instruct-2512

{% tabs %}
{% tab title="Standalone Docker" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fr7yJ4dKMqczlIfYY4n8G%2Fdocker%20logo.png?alt=media&#x26;token=99583e26-1e5a-4efb-90b0-f59103d53b08" alt="" data-size="line"> **Execution Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. Environment Setup
apt update && apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh && sh get-docker.sh
usermod -aG docker $USER

# 2. Cache Setup
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Launch Devstral Container
docker run -d --gpus all \
  --name devstral-vllm \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
  vllm/vllm-openai:v0.14.0 \
  mistralai/Devstral-Small-2-24B-Instruct-2512 \
  --port 8000 \
  --max-model-len 128000 \
  --max-num-seqs 512 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.95 \
  --swap-space 16 \
  --trust-remote-code
EOF
```
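
**Verification Commands**

A quick check using the container name `devstral-vllm` from the script above; the `/v1/models` endpoint is part of vLLM's OpenAI-compatible API:

```bash
# Follow the engine logs until the API server reports it is running
docker logs -f devstral-vllm

# Confirm the model is being served on port 8000
curl http://localhost:8000/v1/models
```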

{% endtab %}

{% tab title="Kubernetes" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fltwyft3JLUiTn7mmPqae%2Fkubernetes%20logo.png?alt=media&#x26;token=eb7900fc-82f5-4711-8755-c9cc4d895c97" alt="" data-size="line"> **1. Namespace Creation**

```bash
kubectl create namespace devstral-apps
```

**2. Deployment Manifest (`devstral-k8s.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devstral-inference
  namespace: devstral-apps
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devstral
  template:
    metadata:
      labels:
        app: devstral
    spec:
      containers:
      - name: vllm-engine
        image: vllm/vllm-openai:v0.14.0
        args:
          - "mistralai/Devstral-Small-2-24B-Instruct-2512"
          - "--port"
          - "8000"
          - "--max-model-len"
          - "128000"
          - "--tensor-parallel-size"
          - "1"
          - "--tool-call-parser"
          - "mistral"
          - "--enable-auto-tool-choice"
          - "--gpu-memory-utilization"
          - "0.95"
          - "--trust-remote-code"
        env:
          - name: HF_HOME
            value: "/hf_cache"
          - name: VLLM_ATTENTION_BACKEND
            value: "FLASHINFER"
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
          - name: model-cache
            mountPath: /hf_cache
          - name: dshm
            mountPath: /dev/shm
      volumes:
        - name: model-cache
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "4Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: devstral-svc
  namespace: devstral-apps
spec:
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: devstral
  type: ClusterIP

```
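
**3. Apply and Verify**

A rollout sketch using the names defined in the manifest above (deployment `devstral-inference`, service `devstral-svc`, namespace `devstral-apps`):

```bash
# Create the deployment and service
kubectl apply -f devstral-k8s.yaml

# Follow the engine logs while the weights download and load
kubectl logs -f deployment/devstral-inference -n devstral-apps

# Smoke-test the OpenAI-compatible endpoint through a temporary port-forward
kubectl port-forward svc/devstral-svc 8000:80 -n devstral-apps &
curl http://localhost:8000/v1/models
```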

{% endtab %}
{% endtabs %}

#### MiniMax-M2.5

{% tabs %}
{% tab title="Standalone Docker" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fr7yJ4dKMqczlIfYY4n8G%2Fdocker%20logo.png?alt=media&#x26;token=99583e26-1e5a-4efb-90b0-f59103d53b08" alt="" data-size="line"> **Automated Startup Script:**

```bash
read -r -d '' STARTUP_SCRIPT <<'EOF' || true
#!/bin/bash
set -e

# 1. System Preparation & Docker Install
apt update
apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
usermod -aG docker $USER

# 2. Local Cache Directory
mkdir -p /home/ubuntu/data
chown -R $USER:$USER /home/ubuntu/data

# 3. Execution (Optimized for 8x H100)
docker run -d --gpus all \
  --name vllm-minimax \
  -p 0.0.0.0:8000:8000 \
  -v /home/ubuntu/data:/hf_cache \
  --ipc=host \
  -e HF_HOME=/hf_cache \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  -e VLLM_USE_FLASHINFER_MOE_FP8=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  vllm/vllm-openai:v0.14.0 \
  MiniMaxAI/MiniMax-M2.5 \
  --port 8000 \
  --max-model-len 128000 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --enable-expert-parallel
EOF
```

**Verification Commands**

```bash
docker logs -f vllm-minimax
```
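
Once the logs show the server is ready, you can send a test request. A minimal smoke test against vLLM's OpenAI-compatible endpoint; the `model` field matches the model name passed to the container above:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMaxAI/MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```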

{% endtab %}

{% tab title="Kubernetes" %} <img src="https://3436682446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FY2qxVf5VTm3fmwP4B4Gx%2Fuploads%2Fltwyft3JLUiTn7mmPqae%2Fkubernetes%20logo.png?alt=media&#x26;token=eb7900fc-82f5-4711-8755-c9cc4d895c97" alt="" data-size="line"> **1. Namespace Creation**

```bash
kubectl create namespace llm-inference
```

**2. Deployment Manifest (`minimax-deploy.yaml`)**

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minimax-vllm
  namespace: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minimax-vllm
  template:
    metadata:
      labels:
        app: minimax-vllm
    spec:
      containers:
        - name: vllm-container
          image: vllm/vllm-openai:v0.14.0
          args:
            - "--model"
            - "MiniMaxAI/MiniMax-M2.5"
            - "--tensor-parallel-size"
            - "8"
            - "--max-model-len"
            - "128000"
            - "--gpu-memory-utilization"
            - "0.95"
            - "--trust-remote-code"
            - "--enable-expert-parallel"
          env:
            - name: HF_HOME
              value: "/hf_cache"
            - name: VLLM_ATTENTION_BACKEND
              value: "FLASHINFER"
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: cache-volume
              mountPath: /hf_cache
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: cache-volume
          emptyDir: {}
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "16Gi"
```

**Service Manifest**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: minimax-service
  namespace: llm-inference
spec:
  selector:
    app: minimax-vllm
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
```

***

**Verification Commands**

```bash
kubectl logs -f deployment/minimax-vllm -n llm-inference
```
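
To test the endpoint from outside the cluster, a temporary port-forward to the service works. This sketch assumes the service manifest above has been applied:

```bash
# Forward local port 8000 to the service, then query the models endpoint
kubectl port-forward svc/minimax-service 8000:80 -n llm-inference &
curl http://localhost:8000/v1/models
```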

{% endtab %}
{% endtabs %}
