Provenance and Attribution
Minimize IP liability of third-party models
LLMs are trained on vast amounts of data, including code that may have restrictions on how it can be used, introducing the risk of IP infringement.
With Provenance and Attribution, you can drastically reduce the risk of IP liability when using models for software development.
Tabnine checks the code generated within our AI chat against the publicly visible code on GitHub, flags any matches it finds, and references the source repository and its license type. This critical information makes it easier for you to review code suggestions and decide if they meet your specific requirements and policies.
This feature enables you to leverage the performance gains from powerful LLMs while minimizing the likelihood of copyleft-licensed code entering your codebase.
Attribution Service Endpoint
Tabnine’s Provenance and Attribution feature relies on a dedicated attribution service. Enterprise environments that restrict outbound traffic must allow access to the following endpoint:
https://attribution.tabnine.com
This is the endpoint used by all current versions of Tabnine for attribution checks during code generation.
Backwards compatibility
For organizations already using the legacy endpoint, Tabnine continues to support:
https://api.tabnine.com/attribution
This endpoint remains active for backwards compatibility. Existing customers may continue using it, although the new endpoint is recommended for improved stability and performance.
The legacy endpoint will eventually be deprecated.
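As an illustration, an environment that filters outbound traffic through a Squid proxy could allowlist both attribution domains along these lines. This is a sketch only; adapt it to your own proxy or firewall product:

```
# Hypothetical Squid allowlist for Tabnine attribution traffic
acl tabnine_attribution dstdomain attribution.tabnine.com api.tabnine.com
http_access allow tabnine_attribution
```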
Walkthrough
Tabnine performs provenance tracing for all code generated by the AI chat. If a code block has origins in any publicly visible code on GitHub, a code provenance attribution section will appear below the block detailing the source and corresponding licenses.

When drilling down into detail, Provenance groups attributions by license, displays the matching code snippets, and links to the actual matches, like so:

Two Levels of Protection
Tabnine offers two levels of protection to minimize the risk of IP liability:
Training time protection: We have trained the Tabnine Protected 2 model exclusively on code that does not have any restrictions on use. This ensures that when using this model, every recommendation from Tabnine can be accepted without the risk of IP infringement. This level of protection is valuable for companies that have strict policies and zero tolerance for using non-permissive code. This is a critical need, particularly for software companies and others who sell or relicense the code they produce.
Inference time protection: Tabnine informs you if the output of the LLM matches any publicly visible code on GitHub and identifies the source repo and its license type. By adding guardrails around Tabnine’s output, we minimize the risk of IP liability of third-party models.
Availability
Provenance and Attribution is currently in private preview and is available to any Tabnine Enterprise customer. Existing Tabnine Enterprise customers should contact our Support team to request access. Once enabled, the Provenance and Attribution capability works with all available models: Anthropic, OpenAI, Cohere, Llama, Mistral, and Tabnine.
Censorship
Censorship prevents the generation of code snippets originating from non-permissive codebases altogether. To activate it, slide the toggle next to Censor code.

Censorship in Agent
Censorship exists in Agent as a tool called check_provenance. The tool is disabled by default and must be activated in the Admin Console. It is activated by the same toggle that controls Censorship for Chat.
Auto-apply for Agent will not work once Censorship is activated. If the proposed code contains a match of more than 150 characters from non-permissive code, the Agent is instructed to rewrite only the offending portion while still addressing the original task.
After Tabnine Agent edits the code, check_provenance is run again to verify that non-permissive matches have been removed before apply-code can proceed.
For efficiency reasons, check_provenance must also be explicitly enabled per workspace in tool control.
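The check-rewrite-recheck flow described above can be sketched in Python. This is an illustrative stand-in, not Tabnine's implementation: check_provenance, rewrite_offending, apply_code, and agent_apply are all hypothetical names, and the stand-in match detection simply flags marked lines at the documented 150-character threshold.

```python
# Illustrative sketch of the Agent censorship loop. None of these
# functions are Tabnine APIs; they are hypothetical stand-ins.

MATCH_THRESHOLD = 150  # minimum match length, per the documented criteria


def check_provenance(code: str) -> list[str]:
    """Stand-in for the real check_provenance tool: flag lines marked
    as copied that meet the 150-character match threshold."""
    return [ln for ln in code.splitlines()
            if ln.startswith("# COPIED:") and len(ln) >= MATCH_THRESHOLD]


def rewrite_offending(code: str, matches: list[str]) -> str:
    """Stand-in for the Agent rewriting only the offending portion
    while leaving the rest of the proposed code untouched."""
    return "\n".join("# rewritten equivalent" if ln in matches else ln
                     for ln in code.splitlines())


def apply_code(code: str) -> str:
    """Stand-in for the apply-code tool: re-runs the provenance check
    and only proceeds if no non-permissive matches remain."""
    if check_provenance(code):
        raise RuntimeError("non-permissive matches remain")
    return code


def agent_apply(code: str) -> str:
    matches = check_provenance(code)
    if matches:  # auto-apply is blocked; offending portion is rewritten
        code = rewrite_offending(code, matches)
    return apply_code(code)
```

The key design point mirrored here is that check_provenance runs twice: once to block auto-apply, and again inside apply_code to verify the rewrite before the change lands.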

When is attribution and censorship checked?
Attribution and Censorship are always checked as the last step of code generation. For Tabnine Agent, when check_provenance is enabled, this check happens immediately before the apply-code tool executes. Custom behaviors and custom commands cannot override the censorship check.
Limitations
Supported form factors: Provenance and Attribution is currently supported for AI chat. Support for inline actions and code completions will be available in 2025.
System requirements: Provenance and Attribution requires up to 10TB of free storage.
Supported languages:
Python
C
Kotlin
JavaScript
C++
Ruby
TypeScript
C#
Scala
Java
Objective-C
Swift
Rust
Pascal
Groovy
Go (Golang)
F#
PHP
R
Code match criteria: at least 150 characters and spanning multiple lines.
Matching index: GitHub is fully indexed, with periodic updates ensuring coverage extends beyond the model’s training data.
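The two documented match criteria (at least 150 characters, multiline) can be expressed as a simple predicate. This helper is illustrative only and not part of Tabnine:

```python
def meets_match_criteria(snippet: str) -> bool:
    """True if a snippet satisfies the documented match criteria:
    at least 150 characters and spanning more than one line."""
    return len(snippet) >= 150 and "\n" in snippet.strip()
```

In other words, a long one-liner never triggers a match, and neither does a short multiline fragment.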
Attribution Logs
Sample CSV file showing attribution logs (covering the last 14 days by default; the window can be changed via the API):
The logs will have the following fields:
detectionId — Unique identifier for the logged attribution event
timestamp — Date and time of the event
userId — The user ID
organizationId — The organization ID
teamId — The team ID
source — Indicates Chat, Agent, or Test
model — E.g. Claude 3.5 Sonnet, Gemini3-Pro, GPT-5, etc. Available models will change over time.
isNonPermissive — One of two values, TRUE or FALSE
snippet — License information plus the code snippet itself
codeAttribution — JSON data attributing the code to its source
newFile actions count — Number of times a new file was created
apply actions count — Number of times code changes were applied
insert actions count — Number of times new code/changes were inserted into code
copy actions count — Number of times code was manually copied from the Agent's response
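As a sketch, a log export with these fields could be filtered for non-permissive events using Python's csv module. The field names below follow the list above, but the sample values and the exact export format are assumptions:

```python
import csv
import io
import json

# Minimal sample export using a subset of the documented field names
# (all values are made up for illustration).
sample = io.StringIO(
    "detectionId,timestamp,userId,isNonPermissive,codeAttribution\n"
    'd-001,2025-01-01T00:00:00Z,u-42,TRUE,"{""repo"": ""example/repo""}"\n'
    "d-002,2025-01-01T00:05:00Z,u-43,FALSE,{}\n"
)

# isNonPermissive is documented as the strings TRUE or FALSE.
non_permissive = [row for row in csv.DictReader(sample)
                  if row["isNonPermissive"] == "TRUE"]

for row in non_permissive:
    # codeAttribution carries JSON data attributing the code to a source.
    attribution = json.loads(row["codeAttribution"])
```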
Attribution FAQ
What kind of reference database does Tabnine use and how often is it updated?
The reference database is Postgres, containing signatures and their metadata. That metadata includes the license information, commit hash, repo, and number of stars, among other things. It is updated about once every quarter.
How exactly does Tabnine determine the license that a codebase is using?
We infer the license based on the one mentioned on the GitHub repo, as well as any licensing the library has.
Why is it limited to certain languages?
We started with the most popular languages (see above) expecting to expand our coverage over time. If there is a specific language missing that you would like, let us know and we will consider adding it.
Will Tabnine expand its attribution codebase?
Whenever we re-index, we also search for new repos that fit our licensing criteria.
What generated code is sent to the attribution service?
The code snippet is sent to the recitation service, which is installed on-premises. There, we calculate the signature hashes for that snippet, then check a Bloom filter to see if those signature hashes are in the attribution database. Only the signature hashes that are found in the Bloom filter are sent to the attribution service in order to find the references. Note that no plain-text code is ever sent to the attribution service, only the signature hashes.
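The pre-filtering step can be illustrated with a tiny Bloom filter in Python. The per-line hashing scheme, the BloomFilter class, and the helper names below are simplified stand-ins, not Tabnine's actual signature scheme:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: set membership with possible false positives
    but never false negatives."""

    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))


def signature_hashes(snippet: str) -> list[str]:
    """Stand-in signature scheme: hash each non-empty line."""
    return [hashlib.sha256(ln.encode()).hexdigest()
            for ln in snippet.splitlines() if ln.strip()]


def hashes_to_send(snippet: str, bloom: BloomFilter) -> list[str]:
    """Only hashes that pass the local Bloom filter are sent onward;
    the plain-text snippet itself never leaves the local service."""
    return [h for h in signature_hashes(snippet) if bloom.might_contain(h)]
```

The point of the Bloom filter here is privacy and efficiency: hashes that cannot possibly be in the attribution database are discarded locally, so only candidate hashes, and never source text, cross the network.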