Monitoring Tabnine

Overview

This document describes how to monitor Tabnine services deployed on-premises and walks through a few examples of monitoring them locally. You can also enable Tabnine telemetry, which uses the same principles described here and reports the data to Tabnine’s servers.

Because Tabnine’s self-hosted solution runs in a Kubernetes cluster, it relies on standard tools for logs and metrics: logs are written to stdout, and metrics are exposed over HTTP endpoints in Prometheus format.

Writing logs to stdout and exposing metrics endpoints for scraping are both industry standards in the Kubernetes ecosystem, so an extensive collection of tools and platforms supports these formats. This document covers the configuration options for scraping metrics and provides examples of setting up a simple Prometheus server to scrape the metrics and FluentBit to collect the logs into a centralized endpoint.

Logs

All Tabnine services write their logs to stdout. Kubernetes picks them up and manages them, which allows integration with standard tools for log management and retention.

In Kubernetes, the standard way to handle logs is to run a collection service, such as FluentD or FluentBit, which collects the logs from the pods and forwards them to a centralized location. Cloud providers usually offer an official integration with their native logging platforms, typically built on FluentD or FluentBit under the hood.

When Tabnine’s telemetry is enabled, we install and use FluentD to forward logs from the cluster to Tabnine’s servers.

Log messages are JSON objects in the following format, where level is one of error, warning, info, or debug:

{
  "timestamp": "2023-01-15T03:46:06.861Z",
  "level": "error/warning/info/debug",
  "message": "msg content"
}
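
Because each log message is a single JSON object, a collector can parse it into structured fields before forwarding it. The following is a minimal sketch of a FluentBit configuration in its YAML format; it is not the configuration used by Tabnine’s telemetry. It assumes that FluentBit runs as a DaemonSet with the node's /var/log directory mounted, that Tabnine is installed in a namespace named tabnine, and that the stdout output is only a placeholder for the output plugin of your log management system.

```yaml
# Sketch only: adjust paths, namespace, and output to your environment.
service:
  flush: 1
  log_level: info

pipeline:
  inputs:
    # Container log files are named <pod>_<namespace>_<container>-<id>.log,
    # so this pattern limits collection to the (assumed) tabnine namespace.
    - name: tail
      tag: kube.*
      path: /var/log/containers/*_tabnine_*.log
      multiline.parser: docker, cri

  filters:
    # Enriches records with pod metadata and, with merge_log enabled,
    # parses the JSON log line into top-level fields
    # (timestamp, level, message).
    - name: kubernetes
      match: kube.*
      merge_log: on
      keep_log: off

  outputs:
    # Placeholder: replace with the output plugin for your destination.
    - name: stdout
      match: kube.*
```

With the JSON merged into the record, downstream systems can filter on the level field directly instead of searching the raw message text.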

How to send logs to an external log management system

Metrics

Tabnine services export Prometheus metrics and rely on Prometheus Operator being installed on the cluster. If you are unfamiliar with installing Prometheus Operator, follow the Prometheus Operator install article.

Enable monitoring of metrics

To enable Tabnine metrics monitoring, edit the following sections in values.yaml.

global:
  monitoring:
    enabled: true
    # labels -- Empty by default. If your Prometheus server requires specific labels on the monitors in order to pick them up, add them here
    labels: {}
    # annotations -- Empty by default. Some platforms require specific annotations; any annotation set here is applied to all monitor objects
    annotations: {}
  tabnine:
    telemetry:
      # enabled -- Send telemetry data to Tabnine backend
      enabled: false

Now that values.yaml is updated, it is time to install the chart on the cluster.

helm upgrade --install -n tabnine --create-namespace tabnine oci://registry.tabnine.com/self-hosted/tabnine-cloud --values values.yaml

Prometheus example

Values file examples

The following example adds a release=prom-example label to all PodMonitors and ServiceMonitors created by Tabnine as part of the installation.

global:
  monitoring:
    enabled: true
    labels:
      release: prom-example
  image:
    imagePullSecrets:
      - name: regcred
  tabnine:
  [...]
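
To illustrate how such a label is used: Prometheus Operator selects monitor objects by the labels in their metadata. The sketch below is a hypothetical PodMonitor, not one of the objects that the Tabnine chart creates; its name, pod selector, and port name are invented for illustration. Only the release: prom-example entry under metadata.labels corresponds to the values above.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-monitor        # hypothetical name
  namespace: tabnine           # assumes the tabnine namespace
  labels:
    release: prom-example      # the label added through global.monitoring.labels
spec:
  selector:
    matchLabels:
      app: example-app         # hypothetical pod label
  podMetricsEndpoints:
    - port: metrics            # hypothetical container port name
      path: /metrics
```

A Prometheus resource whose podMonitorSelector matches release: prom-example picks up this object; monitors without the label are ignored by that Prometheus instance.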

Prometheus configuration file

The following configuration:

  1. Scrapes only PodMonitors and ServiceMonitors with a release=prom-example label

  2. Keeps the data for 14 days

  3. Requests 50Gi of storage

  4. Requests 6G of memory, set as both the request and the limit

For the full list of available settings, see the Prometheus custom resource definition (CRD) documentation.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prom-example
  namespace: monitoring
spec:
  evaluationInterval: 30s
  paused: false
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchLabels:
      release: prom-example
  portName: http-web
  probeNamespaceSelector: {}
  probeSelector:
    matchLabels:
      release: prom-example
  replicas: 1
  resources:
    limits:
      cpu: 1
      memory: 6G
    requests:
      cpu: 1
      memory: 6G
  retention: 14d
  routePrefix: /
  ruleNamespaceSelector: {}
  ruleSelector:
    matchLabels:
      release: prom-example
  scrapeInterval: 30s
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      release: prom-example
  shards: 1
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 50Gi
  version: v2.42.0
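
The ruleSelector in the configuration above means that this Prometheus instance also loads PrometheusRule objects labeled release=prom-example. As a minimal sketch, the rule below alerts when a scrape target in the tabnine namespace is unreachable. It relies only on the up metric that Prometheus records for every target, so it does not assume any Tabnine-specific metric names. The rule name, alert name, namespace, duration, and severity are illustrative assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tabnine-targets-down   # hypothetical name
  namespace: monitoring
  labels:
    release: prom-example      # must match ruleSelector in the Prometheus resource
spec:
  groups:
    - name: tabnine-availability
      rules:
        - alert: TabnineTargetDown   # hypothetical alert name
          # up is 0 when the last scrape of a target failed.
          expr: up{namespace="tabnine"} == 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Scrape target {{ $labels.pod }} in namespace tabnine is down"
```

Without an Alertmanager configured, the alert is still visible on the Alerts page of the Prometheus UI, which makes it a quick check that targets are being scraped.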
