# How It Works

## Overview
To lint a codebase, GPTLint takes the following steps:
1. Resolves a set of markdown rule definitions along with optional few-shot examples for each rule (defaults to `.gptlint/**/*.md`)
2. Resolves a set of input source files to lint (defaults to `**/*.{js,ts,jsx,tsx,cjs,mjs}`)
3. For each `[rule, file]` pair, creates a linter task
4. For each `rule` with a gritql pattern, filters the source files to only contain matched lines of text
5. Filters out any linter tasks which are cached from previous runs, based on the contents of the rule and file
6. For each non-cached linter task, runs it through an LLM classifier pipeline with the goal of identifying rule violations
All of the magic happens in step 6, and GPTLint supports two approaches for this core linting logic: single-pass linting and two-pass linting, both of which have their own pros & cons.
The core linting logic lives in `src/lint-file.ts`.
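For orientation, here is a rough sketch of that pipeline in TypeScript. It is a simplified illustration only; every type and helper name below is a hypothetical stand-in rather than gptlint’s actual internals.

```ts
// Simplified sketch of the pipeline above; all names here are hypothetical
// stand-ins for illustration, not gptlint's real internal API.
interface Rule { name: string; gritqlPattern?: string }
interface SourceFile { path: string; content: string }
interface LintTask { rule: Rule; file: SourceFile }
// minimal stand-in; see the full RuleViolation schema later in this document
interface RuleViolation { ruleName: string; codeSnippet: string; violation: boolean }

declare function resolveRules(): Promise<Rule[]>                        // step 1
declare function resolveFiles(): Promise<SourceFile[]>                  // step 2
declare function applyGritQLFilter(task: LintTask): Promise<LintTask>   // step 4
declare function isCached(task: LintTask): boolean                      // step 5
declare function runLintTask(task: LintTask): Promise<RuleViolation[]>  // step 6

async function lint(): Promise<RuleViolation[]> {
  const rules = await resolveRules()
  const files = await resolveFiles()

  // step 3: one linter task per [rule, file] pair
  let tasks: LintTask[] = rules.flatMap((rule) =>
    files.map((file) => ({ rule, file }))
  )

  // step 4: narrow each file to the lines matched by the rule's gritql pattern
  tasks = await Promise.all(tasks.map(applyGritQLFilter))

  // step 5: skip tasks whose rule + file contents haven't changed since the last run
  tasks = tasks.filter((task) => !isCached(task))

  // step 6: run the remaining tasks through the LLM classifier pipeline
  const results = await Promise.all(tasks.map(runLintTask))
  return results.flat()
}
```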
## Single-Pass Linting
In single-pass linting, a single LLM inference call is used both to evaluate a lint task (a `[rule, file]` pair) and to act as a classifier for identifying potential rule violations.
In this approach, every lint task goes through the following steps:
1. Passes the rule and file to an LLM with the task of identifying rule violations
2. Parses the LLM’s markdown output for a JSON code block containing an array of `RuleViolation` objects
3. Retries step 1 if the LLM’s structured output fails to validate
4. Otherwise, adds any valid `RuleViolation` objects to the output
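A rough sketch of that parse-and-retry loop, reusing the hypothetical types from the pipeline sketch above (the actual logic in `src/lint-file.ts` differs in its details):

```ts
// Illustrative only; `invokeLLM` and `ruleViolationsSchema` are hypothetical.
declare function invokeLLM(input: { rule: Rule; file: SourceFile }): Promise<string>
declare const ruleViolationsSchema: { parse(input: unknown): RuleViolation[] }

async function runSinglePassTask(
  rule: Rule,
  file: SourceFile,
  maxRetries = 2
): Promise<RuleViolation[]> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    // 1. ask the LLM to identify violations of `rule` in `file`
    const markdown = await invokeLLM({ rule, file })

    // 2. extract the fenced json code block from the markdown response
    const match = markdown.match(/```json\s*([\s\S]*?)```/)
    if (!match) continue // no JSON block found => retry

    try {
      // 3/4. validate the structured output; keep only valid RuleViolation objects
      return ruleViolationsSchema.parse(JSON.parse(match[1]))
    } catch {
      // invalid JSON or schema mismatch => retry from step 1
    }
  }

  return []
}
```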
## Two-Pass Linting
Everything starts off the same in two-pass linting as in single-pass linting, except that we replace our single LLM with two models: a weaker, cheaper model (`weakModel` in the gptlint config) and a stronger, more expensive model (`model` in the gptlint config).
In the first pass, the weak model is used to generate a set of potential `RuleViolation` candidates. In the second pass, we give these rule violation candidates to the stronger model, which acts as a discriminator to filter out false positives.
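Conceptually, the second pass looks something like the sketch below, again with hypothetical helpers; the exact prompts and filtering live in `src/lint-file.ts`.

```ts
// Illustrative only; `invokeStrongModel` is a hypothetical helper.
declare function invokeStrongModel(input: {
  rule: Rule
  file: SourceFile
  candidates: RuleViolation[]
}): Promise<RuleViolation[]>

async function runTwoPassTask(rule: Rule, file: SourceFile): Promise<RuleViolation[]> {
  // first pass: the cheaper `weakModel` proposes candidate violations
  const candidates = await runSinglePassTask(rule, file)
  if (candidates.length === 0) return []

  // second pass: the stronger `model` re-examines only these candidates and
  // acts as a discriminator, discarding likely false positives
  const verdicts = await invokeStrongModel({ rule, file, candidates })
  return verdicts.filter((v) => v.violation)
}
```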
## Single-Pass vs Two-Pass Linting
Two-pass linting is significantly faster, cheaper, and more accurate than single-pass linting, and it is currently the default strategy used by GPTLint.
Here is a side-by-side comparison of the two strategies using both OpenAI (`gpt-4o-mini` as the weak model and `gpt-4-turbo-preview` as the strong model) and Anthropic Claude (`haiku` as the weak model and `opus` as the strong model).
Note that one potential downside of two-pass linting is that it increases the chance for false negatives (cases where the linter should trigger an error but it doesn’t). Currently, I’m a lot more worried about mitigating false positives because a noisy linter that you end up ignoring isn’t useful to anybody, and I haven’t seen too many false negatives in my testing.
If you’re worried about false negatives, you can always use two-pass linting with the same model for both `weakModel` and `model`.
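For example, a config along these lines would run both passes with the same strong model. This is a sketch only; it assumes `model` and `weakModel` are set under `llmOptions` in `gptlint.config.js`, so check the config documentation for the exact option names.

```js
// gptlint.config.js (sketch; option names assumed, verify against the config docs)
export default [
  {
    llmOptions: {
      // using the same strong model for both passes trades away most of the
      // cost savings of two-pass linting in exchange for fewer false negatives
      model: 'gpt-4-turbo-preview',
      weakModel: 'gpt-4-turbo-preview'
    }
  }
]
```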
## LLM Output Format
In both approaches, the first LLM pass outputs a markdown file with two sections, `EXPLANATION` and `VIOLATIONS`. In the single-pass approach, this is parsed as the final output. In the two-pass approach, the parsed results are used as the candidate rule violations for the second pass.

The `EXPLANATION` section is important to give the LLM time to think. A previous version without this section produced false positives much more frequently.

The `VIOLATIONS` section contains the actual structured JSON output of `RuleViolation` objects.
Example LLM markdown output:

````md
# EXPLANATION
The source code provided is a TypeScript file that includes variable names, function names, and type imports. According to the "consistent-identifier-casing" rule, variable names should use camelCase, global const variable names should use camelCase, PascalCase, or CONSTANT_CASE, type names should use PascalCase, class names should use PascalCase, and function names should use camelCase.
Upon reviewing the source code, the following observations were made:
1. Variable names such as `ast`, `h1RuleNodes`, `headingRuleNode`, `bodyRuleNodes`, and `rule` are all in camelCase, which conforms to the rule.
2. Function names like `parseRuleFile`, `findAllBetween`, `findAllHeadingNodes`, `parseMarkdownAST`, and `parseRuleNode` are in camelCase, which also conforms to the rule.
3. The type import `import type * as types from './types.js'` uses PascalCase for the type alias `types`, which is acceptable since it's an import statement and the rule primarily focuses on the casing of identifiers rather than import aliases.
4. The variable `example_rule_failure` uses snake_case, which violates the rule for consistent identifier casing for variable names.
Based on these observations, the only violation found in the source code is the use of snake_case in the variable name `example_rule_failure`.
# VIOLATIONS
```json
[
  {
    "ruleName": "consistent-identifier-casing",
    "codeSnippet": "let example_rule_failure",
    "codeSnippetSource": "source",
    "reasoning": "The variable name 'example_rule_failure' uses snake_case, which violates the rule that variable names should use camelCase.",
    "violation": true,
    "confidence": "high"
  }
]
```
````
## RuleViolation Schema
```ts
interface RuleViolation {
  ruleName: string
  codeSnippet: string
  codeSnippetSource: 'examples' | 'source' | 'unknown'
  reasoning: string
  violation: boolean
  confidence: 'low' | 'medium' | 'high'
}
```
This `RuleViolation` schema has gone through many iterations, and it’s worth taking a look at the descriptions of each field that are passed to the model as context.

In particular, `codeSnippetSource`, `reasoning`, `violation`, and `confidence` were all added empirically to increase the LLM’s accuracy and to mitigate common issues. Even with these fields, false positives are still possible, but forcing the model to fill out these additional fields has proven very effective at increasing the linter’s accuracy.
## GritQL
GritQL is a query language for matching patterns in code. It is built on top of tree-sitter and is mostly programming language agnostic.
Most of the built-in rules use `gritql` patterns to filter down the source code to only the most relevant portions before using an LLM to analyze the code for potential rule violations. This helps to drastically reduce the number of files and tokens that the linter uses, resulting in cheaper, faster, and more accurate predictions.
Here is a comparison of the linter run with the default rules, both with and without GritQL. GritQL support has been improved further since this benchmark, but even here it uses roughly 25-30% fewer tokens.
Note that `grit` is an optional dependency, and `gptlint` is designed to keep working even if it fails to install.
## Evals
- `bin/generate-evals.ts` is used to generate N synthetic positive and negative example code snippets for each rule under `fixtures/evals`
- `bin/run-evals.ts` is used to evaluate rules for false negatives / false positives across their generated test fixtures
You can think of these evals as integration tests for ensuring that the entire linting pipeline works as intended for all of the built-in rules.
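As a mental model, the eval loop for a single rule looks something like the sketch below, reusing the hypothetical helpers from earlier sketches. The fixture layout and helper names are assumptions; the real scripts differ.

```ts
// Hypothetical sketch: each fixture is a generated code snippet that either
// should ("violation") or should not ("clean") trigger the rule.
declare function loadFixtures(
  rule: Rule
): Promise<Array<{ file: SourceFile; expected: 'violation' | 'clean' }>>

async function evaluateRule(rule: Rule) {
  let falsePositives = 0
  let falseNegatives = 0

  for (const fixture of await loadFixtures(rule)) {
    const violations = await runTwoPassTask(rule, fixture.file)
    const flagged = violations.length > 0

    if (fixture.expected === 'violation' && !flagged) falseNegatives++
    if (fixture.expected === 'clean' && flagged) falsePositives++
  }

  return { falsePositives, falseNegatives }
}
```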
## Fine-Tuning
To improve the cost, speed, accuracy, and robustness of GPTLint, a natural next step would be to fine-tune models specific to this linting task and/or create fine-tuned models for each of the built-in rules.
We’ve made the explicit decision not to do this for a few main reasons:
- One of our goals with GPTLint is to make the project as generally useful as possible as an open source standard. By working with off-the-shelf foundation models and not relying on any specific fine-tuning outputs, we improve the accessibility of the project’s default settings.

- We were curious how far we could get on this task with general purpose LLMs, and it’s been a great learning experience so far.

- KISS 💪

- Fine-tuning seems like a natural place to separate the OSS GPTLint project from any future product offerings built on top of the GPTLint core.
## More Info
For more info on how to get the most out of GPTLint, including help creating custom rules or fine-tuned models that work well with your codebase, feel free to reach out to our consulting partners.