Provenance and Attribution
Minimize IP liability of third-party models
LLMs are trained on vast amounts of data, including code that may have restrictions on how it can be used, introducing the risk of IP infringement.
With Provenance and Attribution, you can drastically reduce the risk of IP liability when using models for software development.
Tabnine checks the code generated within our AI chat against the publicly visible code on GitHub, flags any matches it finds, and references the source repository and its license type. This critical information makes it easier for you to review code suggestions and decide if they meet your specific requirements and policies.
This feature enables you to leverage the performance gains from powerful LLMs while minimizing the likelihood of copyleft-licensed code entering your codebase.
Attribution Service Endpoint
Tabnine’s Provenance and Attribution feature relies on a dedicated attribution service. Enterprise environments that restrict outbound traffic must allow access to the following endpoint:
https://attribution.tabnine.com
This is the endpoint used by all current versions of Tabnine for attribution checks during code generation.
Backwards compatibility
For organizations already using the legacy endpoint, Tabnine continues to support:
https://api.tabnine.com/attribution
This endpoint remains active for backwards compatibility. Existing customers may continue using it, although the new endpoint is recommended for improved stability and performance.
The legacy endpoint will eventually be deprecated.
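As an illustration, an environment that filters outbound traffic through a Squid proxy could allowlist both attribution domains along these lines. This is a sketch only; adapt it to your own proxy or firewall product:

```
# Hypothetical Squid allowlist for Tabnine attribution traffic
acl tabnine_attribution dstdomain attribution.tabnine.com api.tabnine.com
http_access allow tabnine_attribution
```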
Walkthrough
Tabnine performs provenance tracing for all code generated by the AI chat. If a code block has origins in any publicly visible code on GitHub, a code provenance attribution section will appear below the block detailing the source and corresponding licenses.

When drilling down into detail, Provenance groups attributions by license, displays the matching code snippets, and links to the actual matches, like so:

Two Levels of Protection
Tabnine offers two levels of protection to minimize the risk of IP liability:
Training time protection: We have trained the Tabnine Protected 2 model exclusively on code that does not have any restrictions on use. This ensures that when using this model, every recommendation from Tabnine can be accepted without the risk of IP infringement. This level of protection is valuable for companies that have strict policies and zero tolerance for using non-permissive code. This is a critical need, particularly for software companies and others who sell or relicense the code they produce.
Inference time protection: Tabnine informs you if the output of the LLM matches any publicly visible code on GitHub and identifies the source repo and its license type. By adding guardrails around Tabnine’s output, we minimize the risk of IP liability of third-party models.
Availability
Provenance and Attribution is currently in private preview and is available to any Tabnine Enterprise customer. Existing Tabnine Enterprise customers should contact our Support team to request access. Once enabled, the Provenance and Attribution capability works with all available models: Anthropic, OpenAI, Cohere, Llama, Mistral, and Tabnine.
Censorship
Censorship prevents the generation of code snippets originating from non-permissive codebases altogether. To activate it, slide the toggle next to Censor code.

Censorship in Agent
Censorship exists in Agent as a tool called check_provenance. The tool is disabled by default and must be activated in the Admin Console. It is activated by the same toggle that controls Censorship for Chat.
Auto-apply for Agent will not work once Censorship is activated. If the proposed code contains a match of more than 150 characters from non-permissive code, the Agent is instructed to rewrite only the offending portion while still addressing the original task.
After Tabnine Agent edits the code, check_provenance is run again to verify that non-permissive matches have been removed before apply-code can proceed.
For efficiency reasons, check_provenance must also be explicitly enabled per workspace in tool control.
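The check-rewrite-recheck flow described above can be sketched in Python. This is an illustrative stand-in, not Tabnine's implementation: check_provenance, rewrite_offending, apply_code, and agent_apply are all hypothetical names, and the stand-in match detection simply flags marked lines at the documented 150-character threshold.

```python
# Illustrative sketch of the Agent censorship loop. None of these
# functions are Tabnine APIs; they are hypothetical stand-ins.

MATCH_THRESHOLD = 150  # minimum match length, per the documented criteria


def check_provenance(code: str) -> list[str]:
    """Stand-in for the real check_provenance tool: flag lines marked
    as copied that meet the 150-character match threshold."""
    return [ln for ln in code.splitlines()
            if ln.startswith("# COPIED:") and len(ln) >= MATCH_THRESHOLD]


def rewrite_offending(code: str, matches: list[str]) -> str:
    """Stand-in for the Agent rewriting only the offending portion
    while leaving the rest of the proposed code untouched."""
    return "\n".join("# rewritten equivalent" if ln in matches else ln
                     for ln in code.splitlines())


def apply_code(code: str) -> str:
    """Stand-in for the apply-code tool: re-runs the provenance check
    and only proceeds if no non-permissive matches remain."""
    if check_provenance(code):
        raise RuntimeError("non-permissive matches remain")
    return code


def agent_apply(code: str) -> str:
    matches = check_provenance(code)
    if matches:  # auto-apply is blocked; offending portion is rewritten
        code = rewrite_offending(code, matches)
    return apply_code(code)
```

The key design point mirrored here is that check_provenance runs twice: once to block auto-apply, and again inside apply_code to verify the rewrite before the change lands.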

When is attribution and censorship checked?
Attribution and Censorship are always checked as the last step of code generation. For Tabnine Agent, when check_provenance is enabled, this check happens immediately before the apply-code tool executes. Custom behaviors and custom commands cannot override the censorship check.
Limitations
Supported form factors: Provenance and Attribution is currently supported for AI chat. Support for inline actions and code completions will be available in 2025.
System requirements: Provenance and Attribution requires up to 10TB of free storage.
Supported languages:
Python
C
Kotlin
JavaScript
C++
Ruby
TypeScript
C#
Scala
Java
Objective-C
Swift
Rust
Pascal
Groovy
Go (Golang)
F#
PHP
R
Code match criteria: at least 150 characters and spanning multiple lines.
Matching index: GitHub is fully indexed, with periodic updates ensuring coverage extends beyond the model’s training data.
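The two documented match criteria (at least 150 characters, multiline) can be expressed as a simple predicate. This helper is illustrative only and not part of Tabnine:

```python
def meets_match_criteria(snippet: str) -> bool:
    """True if a snippet satisfies the documented match criteria:
    at least 150 characters and spanning more than one line."""
    return len(snippet) >= 150 and "\n" in snippet.strip()
```

In other words, a long one-liner never triggers a match, and neither does a short multiline fragment.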
Attribution Logs
Sample CSV file showing attribution logs (covering the last 14 days by default; the window can be changed via the API):
The logs will have the following fields:
detectionId — Unique identifier for the logged attribution event
timestamp — Date and time of the event
userId — The user ID
organizationId — The organization ID
teamId — The team ID
source — Indicates Chat, Agent, or Test
model — E.g. Claude 3.5 Sonnet, Gemini3-Pro, GPT-5, etc. Available models will change over time.
isNonPermissive — One of two values, TRUE or FALSE
snippet — License information plus the code snippet itself
codeAttribution — JSON data attributing the code to its source
newFile actions count — Number of times a new file was created
apply actions count — Number of times code changes were applied
insert actions count — Number of times new code/changes were inserted into code
copy actions count — Number of times code was manually copied from the Agent's response
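As a sketch, a log export with these fields could be filtered for non-permissive events using Python's csv module. The field names below follow the list above, but the sample values and the exact export format are assumptions:

```python
import csv
import io
import json

# Minimal sample export using a subset of the documented field names
# (all values are made up for illustration).
sample = io.StringIO(
    "detectionId,timestamp,userId,isNonPermissive,codeAttribution\n"
    'd-001,2025-01-01T00:00:00Z,u-42,TRUE,"{""repo"": ""example/repo""}"\n'
    "d-002,2025-01-01T00:05:00Z,u-43,FALSE,{}\n"
)

# isNonPermissive is documented as the strings TRUE or FALSE.
non_permissive = [row for row in csv.DictReader(sample)
                  if row["isNonPermissive"] == "TRUE"]

for row in non_permissive:
    # codeAttribution carries JSON data attributing the code to a source.
    attribution = json.loads(row["codeAttribution"])
```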
Attribution FAQ
What kind of reference database does Tabnine use and how often is it updated?
The reference database is Postgres, containing signatures and their metadata. That metadata includes the license information, commit hash, repo, and number of stars, among other things. It is updated about once every quarter.
How exactly does Tabnine determine the license that a codebase is using?
We infer the license based on the one mentioned on the GitHub repo, as well as any licensing the library has.
Why is it limited to certain languages?
We started with the most popular languages (see above) expecting to expand our coverage over time. If there is a specific language missing that you would like, let us know and we will consider adding it.
Will Tabnine expand its attribution codebase?
Whenever we re-index, we also search for new repos that fit our licensing criteria.
What generated code is sent to the attribution service?
The code snippet is sent to the recitation service, which is installed on-premises. There, we calculate the signature hashes for that snippet, then check a Bloom filter to see if those signature hashes are in the attribution database. Only the signature hashes that are found in the Bloom filter are sent to the attribution service in order to find the references. Note that no plain-text code is ever sent to the attribution service, only the signature hashes.
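The pre-filtering step can be illustrated with a tiny Bloom filter in Python. The per-line hashing scheme, the BloomFilter class, and the helper names below are simplified stand-ins, not Tabnine's actual signature scheme:

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: set membership with possible false positives
    but never false negatives."""

    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))


def signature_hashes(snippet: str) -> list[str]:
    """Stand-in signature scheme: hash each non-empty line."""
    return [hashlib.sha256(ln.encode()).hexdigest()
            for ln in snippet.splitlines() if ln.strip()]


def hashes_to_send(snippet: str, bloom: BloomFilter) -> list[str]:
    """Only hashes that pass the local Bloom filter are sent onward;
    the plain-text snippet itself never leaves the local service."""
    return [h for h in signature_hashes(snippet) if bloom.might_contain(h)]
```

The point of the Bloom filter here is privacy and efficiency: hashes that cannot possibly be in the attribution database are discarded locally, so only candidate hashes, and never source text, cross the network.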