3 tools · 2 agents · 5 scenarios · k=5 · 2026-07-01

A version-control benchmark for coding agents

Coding agents do a growing share of version-control work. This benchmark measures how Claude Code and Codex handle five common version-control tasks with git, Jujutsu, and GitButler.

Each tool is scored on reliability, speed, and efficiency, judged on the resulting Git history rather than the commands used to produce it.

data generated 2026-07-02 · 150 graded runs · 2 source snapshots

Results

Results matrix

Each tool on each scenario, for the selected agent.

| Scenario | git pass | git time | git cmds | git KB | Jujutsu pass | Jujutsu time | Jujutsu cmds | Jujutsu KB | GitButler pass | GitButler time | GitButler cmds | GitButler KB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Selective commit | 5/5 | 59.3s | 19.2 | 14.9 | 5/5 | 94.9s | 19.2 | 36.5 | 5/5 | 18.1s | 2.0 | 7.5 |
| Multi-amend | 5/5 | 160s | 34.0 | 35.3 | 5/5 | 118s | 20.2 | 75.9 | 5/5 | 34.4s | 5.0 | 15.8 |
| Split commit | 5/5 | 116s | 31.4 | 25.2 | 5/5 | 169s | 38.6 | 94.1 | 5/5 | 31.9s | 6.0 | 19.3 |
| Reorder commits | 5/5 | 42.7s | 8.8 | 7.5 | 5/5 | 53.1s | 10.8 | 9.8 | 5/5 | 20.8s | 2.0 | 5.7 |
| Squash commits | 5/5 | 60.8s | 7.4 | 7.5 | 5/5 | 44.7s | 10.2 | 16.5 | 5/5 | 27.7s | 3.0 | 8.0 |
| All scenarios | 25/25 | 87.7s | 20.2 | 18.1 | 25/25 | 95.8s | 19.8 | 46.6 | 25/25 | 26.6s | 3.6 | 11.3 |
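
As a rough consistency check on the matrix, the "All scenarios" time can be recomputed from the per-scenario means. This sketch assumes the aggregate row is a mean over a tool's 25 runs, which with five runs per scenario equals the mean of the five scenario means; the page does not state that explicitly. The helper name `all_scenarios_mean` is illustrative only.

```python
from statistics import mean

# Per-scenario mean times in seconds, copied from the results matrix.
# Order: selective commit, multi-amend, split commit, reorder, squash.
times = {
    "git": [59.3, 160.0, 116.0, 42.7, 60.8],
    "Jujutsu": [94.9, 118.0, 169.0, 53.1, 44.7],
    "GitButler": [18.1, 34.4, 31.9, 20.8, 27.7],
}

# "All scenarios" times as reported in the last row of the matrix.
reported = {"git": 87.7, "Jujutsu": 95.8, "GitButler": 26.6}


def all_scenarios_mean(tool: str) -> float:
    """Mean of the five per-scenario means (assumed aggregation rule)."""
    return mean(times[tool])


for tool in times:
    diff = all_scenarios_mean(tool) - reported[tool]
    print(f"{tool}: recomputed {all_scenarios_mean(tool):.2f}s, "
          f"reported {reported[tool]}s, difference {diff:+.2f}s")
```

The recomputed values land within a few tenths of a second of the reported ones. Small gaps are expected because the per-scenario figures shown on the page are already rounded.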

Agent shown: Codex. Both agents are run to check whether the tool effect holds across them. This is not a Claude-versus-Codex comparison.

Scenarios

Each scenario is a pre-built Git repository (a commit history plus uncommitted changes) and a plain-English instruction describing the intended result. No code is generated during a run; only the version-control operation is measured.

01

Selective commit from a mixed working tree

The working tree mixes an input-validation fix with unrelated logging, configuration, and debug-note changes. The instruction asks for a new branch containing only the validation work, with every other change left uncommitted in the working tree. Leaving the right changes uncommitted is part of the graded outcome.

The difficulty is partitioning: selecting the correct files and hunks (contiguous blocks of changed lines within a file) without sweeping in the rest of the uncommitted changes.

Instruction given to the agent
Commit just the input validation work on a new branch. Leave the logging/config cleanup and debug notes uncommitted.
tasks/pilot-1-selective-validation
[Diagram: a dirty worktree with validation, logging, config, and notes changes; one change committed on a new branch; logging, config, and notes stay dirty]
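
The partitioning problem can be illustrated with a small sketch. The file names and `topic` labels below are hypothetical (a real diff carries no topic tag, and the fixture's actual files are not listed here); the point is only that choosing whole files can sweep in unrelated hunks, while choosing hunks does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hunk:
    path: str   # hypothetical file name
    topic: str  # hypothetical label for what the hunk is about


# An invented mixed working tree: one file holds two unrelated hunks.
changes = [
    Hunk("src/handler.ts", "validation"),
    Hunk("src/handler.ts", "logging"),
    Hunk("config.json", "config"),
    Hunk("NOTES.md", "notes"),
]


def partition_by_hunk(hunks, wanted):
    """Split hunks into (to_commit, left_uncommitted) by topic."""
    to_commit = [h for h in hunks if h.topic == wanted]
    leftover = [h for h in hunks if h.topic != wanted]
    return to_commit, leftover


def select_whole_files(hunks, wanted):
    """Coarser selection: take every hunk in any file touched by the topic."""
    paths = {h.path for h in hunks if h.topic == wanted}
    return [h for h in hunks if h.path in paths]


to_commit, leftover = partition_by_hunk(changes, "validation")
swept = select_whole_files(changes, "validation")
print([h.topic for h in to_commit])  # hunk-level selection
print([h.topic for h in swept])      # file-level selection
```

In this invented tree, file-level selection also picks up the logging hunk, which is the kind of over-inclusion the scenario is designed to catch.
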
02

Amend fixes into multiple earlier commits

The branch contains separate validation, scoring, and documentation commits, and the working tree holds three uncommitted fixes, each corresponding to one of those commits. The instruction asks for each fix to be amended into its matching commit, folded into the existing commit rather than recorded as a new one.

Each fix must be applied to a different existing commit, not combined into a single new commit, so the run rewrites three points in the history instead of adding one commit on top.

Instruction given to the agent
Amend the existing five-commit `amend-series` branch. Do not create a new commit. Route the already-present dirty changes like this:
- Amend the validation helper changes in `src/lead.ts`, the malformed-email test in `tests/lead.test.ts`, and the validation wording in `README.md` into commit `refactor validation helpers`.
- Amend the scoring changes in `src/lead.ts` and the enterprise-domain scoring test in `tests/lead.test.ts` into commit `add lead scoring`.
- Amend the response-behavior documentation changes in `README.md` and `docs/response.md` into commit `document response behavior`.
Leave the config logging change, the debug lead summary helper, and the investigation notes uncommitted.
tasks/pilot-2-multi-amend
[Diagram: dirty validation, scoring, and docs fixes beside the branch's validation, scoring, and docs commits; each fix amended into the commit it belongs to; debug · config notes stay dirty]
03

Split a non-top commit

A commit in the middle of the branch mixes validation, scoring, and documentation changes, plus stray debug edits, and a later commit is built on top of it. The instruction asks for that commit to be split into three ordered single-purpose commits, with the debug edits returned to the working tree as uncommitted changes and the commit above left in place.

The commit is not the most recent one: rewriting it requires rebuilding every commit above it without changing their contents.

Instruction given to the agent
Split the non-top commit `add lead workflow` on the existing `split-workflow` branch. Do not keep the original broad commit. Replace it with these three commits, in this order, below the existing top commit `add handler routing metadata`:
- `refactor validation helpers`: the validation helper changes in `src/lead.ts` and the malformed-email test in `tests/lead.test.ts`.
- `tune lead scoring`: the enterprise-domain scoring changes in `src/lead.ts` and the enterprise-domain scoring test in `tests/lead.test.ts`.
- `document lead workflow`: the workflow documentation changes in `README.md` and `docs/lead-workflow.md`.
Keep `add handler routing metadata` as the top commit after the split. Leave the config logging change, the debug lead summary helper, and the investigation notes uncommitted.
tasks/pilot-3-split-commit
[Diagram: before (top to bottom): top, mixed; after: top, docs, scoring, validation; one commit split into three, top kept; stays uncommitted: debug · config notes]
04

Reorder a block of commits

The branch's contents are correct, but the retry and notification commits appear after commits that logically depend on them. The instruction asks for that block to be moved earlier in the history, with every commit's contents and message unchanged and nothing left uncommitted.

The reordering must preserve each commit's contents and message exactly; an incorrect sequence of moves produces conflicts.

Instruction given to the agent
Reorder the existing commits on the `reorder-series` branch. Move the adjacent delivery-related block (`add retry policy` and `add notification sender`) earlier in the branch. Do not change any file contents and do not create functional changes. Final commit order must be exactly this, oldest to newest:
1. `add app configuration`
2. `add retry policy`
3. `add notification sender`
4. `add customer model`
5. `add email formatter`
6. `document notification flow`
The commit messages and each commit's content should stay attached to the same subject. Leave the worktree clean.
tasks/pilot-4-reorder-commits
[Diagram: before (top to bottom): F, E, D, C, B, A; after: F, C, B, E, D, A; delivery block moved earlier · same contents]
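
The required reordering can be modelled as moving a block within a list of commit subjects. In this sketch the target order is quoted from the instruction, while the starting order is an assumption inferred from the scenario diagram; `move_block_after` is an illustrative helper, not part of the benchmark. It only models the ordering, not the conflicts a real rebase can raise.

```python
# Commit subjects, oldest to newest.
# Starting order: assumed, inferred from the scenario diagram.
before = [
    "add app configuration",
    "add customer model",
    "add email formatter",
    "add retry policy",
    "add notification sender",
    "document notification flow",
]

# Target order: quoted from the instruction given to the agent.
target = [
    "add app configuration",
    "add retry policy",
    "add notification sender",
    "add customer model",
    "add email formatter",
    "document notification flow",
]


def move_block_after(history, block, anchor):
    """Return a new order with `block` placed directly after `anchor`."""
    rest = [c for c in history if c not in block]
    i = rest.index(anchor) + 1
    return rest[:i] + list(block) + rest[i:]


after = move_block_after(
    before,
    ["add retry policy", "add notification sender"],
    "add app configuration",
)
print(after == target)
```

The move changes only positions: the same six subjects appear before and after, which mirrors the requirement that each commit keeps its message and contents.
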
05

Squash commit groups

The branch records the work as many small incremental commits (“extract helper”, “wire helper”, “fix typo”, “actually wire helper”) alongside unrelated commits. The instruction asks for the incremental commits to be squashed (combined) into a small number of semantic commits, with the unrelated commits kept separate.

The grouping must be correct (incremental commits combined into semantic units, unrelated commits left intact) and the run must end with the same final file contents and no uncommitted changes.

Instruction given to the agent
Squash commit groups on the existing `squash-series` branch. Do not change any file contents and do not create functional changes. Keep these commits as separate commits:
- `add parser token model`
- `add export endpoint`
Squash these adjacent commit groups:
- Squash `extract parser helpers` and `wire parser helpers` into one commit named `add parser pipeline`.
- Squash `add retry option`, `test retry option`, and `document retry option` into one commit named `add retry support`.
The final branch order, oldest to newest, should be:
1. `add parser token model`
2. `add parser pipeline`
3. `add export endpoint`
4. `add retry support`
Leave the worktree clean.
tasks/pilot-5-squash-commits
[Diagram: before (top to bottom): G, F, E, D, C, B, A; after: E+F+G, D, B+C, A; noisy steps squashed into two commits]
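
The grouping rule can be sketched as collapsing adjacent commits that belong to the same named group. The group names and members are quoted from the instruction. The starting order is an assumption inferred from the diagram and from the order in which the instruction lists each group; `squash` is an illustrative helper.

```python
# Commit subjects, oldest to newest (assumed starting order).
before = [
    "add parser token model",
    "extract parser helpers",
    "wire parser helpers",
    "add export endpoint",
    "add retry option",
    "test retry option",
    "document retry option",
]

# Squash groups, quoted from the instruction: new subject -> members.
groups = {
    "add parser pipeline": ["extract parser helpers", "wire parser helpers"],
    "add retry support": [
        "add retry option",
        "test retry option",
        "document retry option",
    ],
}


def squash(history, groups):
    """Collapse each run of adjacent group members into one named commit."""
    member_to_name = {m: name for name, ms in groups.items() for m in ms}
    result = []
    for subject in history:
        name = member_to_name.get(subject, subject)
        # Append only when this commit starts a new group or stands alone.
        if not result or result[-1] != name:
            result.append(name)
    return result


after = squash(before, groups)
print(after)
```

Commits outside any group pass through unchanged, so the two unrelated commits stay separate, as the scenario requires.
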

Method

Correctness is scored by a hidden, deterministic grader on the final Git state; two different command sequences pass if they produce the same history. Every tool receives the same task and the same plain-English instruction, the tool name does not appear in the prompt, and setup is excluded from timing.

Disclosure: This benchmark is built and maintained by GitButler, one of the three tools measured; correctness is determined by the grader, not by GitButler, and the task definitions, the grader, and the per-run data are all public.

Identical instruction across tools
Each task ships as one prepared repository (the fixture) with one plain-English instruction ("commit just the input validation work on a new branch, leave the rest uncommitted"). The tool's name does not appear in the prompt. The agent decides how to carry out the instruction.
Deterministic grader
Correctness is checked by a hidden, deterministic grader: a scripted check that returns the same verdict for the same final state. It inspects the resulting Git state: commit boundaries, branch topology (which commits sit on which branch, in what order), and what stayed uncommitted. It is not an LLM judge, and it does not compare the agent's commands against a reference sequence: two different command sequences pass if they produce the same history.
Timing boundary
Building the fixture, preparing the workspace, installing each tool's skill (an instruction file documenting the tool's commands), and placing the uncommitted changes in the working tree all happen before timing begins. The measured figures cover only the agent's work on the task.
Git write restriction
In runs using GitButler or Jujutsu, raw git write commands are blocked, so the agent must use the tool under test. When the tool calls git internally, that is the tool's own work and does not count against the agent.
Jujutsu setup
Jujutsu runs use jj 0.42.0 in a colocated repository (jj and git operating on the same working copy), with the most-used external jj agent skill installed before timing begins.
Five runs per cell (k=5)
Each agent–tool–task combination (a cell) ran five times. The numbers on this page are means over those five runs, not a single run.
k=5 · n=25 per cell · oracle: git-state · jj 0.42.0 · 150 runs
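
The state-based grading described above can be sketched as a pure function over a snapshot of the final repository. This is not the benchmark's grader: the `FinalState` fields, the check order, and the `"PASS"` label are assumptions. Only the failure code names are taken from this page.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class FinalState:
    """Hypothetical snapshot of a repository after a run."""
    commit_subjects: Tuple[str, ...]          # oldest to newest
    tip_files: FrozenSet[Tuple[str, str]]     # (path, content hash) at the tip
    dirty_paths: FrozenSet[str]               # paths left uncommitted


def grade(actual: FinalState, expected: FinalState) -> str:
    """Return a verdict from the final state only; commands are never seen."""
    if actual.tip_files != expected.tip_files:
        return "CONTENT_WRONG"
    if actual.commit_subjects != expected.commit_subjects:
        return "GRAPH_WRONG"
    if actual.dirty_paths != expected.dirty_paths:
        return "DIRTY_STATE_WRONG"
    return "PASS"


# Invented example data for illustration.
expected = FinalState(
    commit_subjects=("add app configuration", "add retry policy"),
    tip_files=frozenset({("src/retry.ts", "abc123")}),
    dirty_paths=frozenset(),
)
same_history = FinalState(
    commit_subjects=("add app configuration", "add retry policy"),
    tip_files=frozenset({("src/retry.ts", "abc123")}),
    dirty_paths=frozenset(),
)
wrong_order = FinalState(
    commit_subjects=("add retry policy", "add app configuration"),
    tip_files=frozenset({("src/retry.ts", "abc123")}),
    dirty_paths=frozenset(),
)

print(grade(same_history, expected))
print(grade(wrong_order, expected))
```

Because the function takes no command log, any two runs that end in the same state receive the same verdict, and repeated calls on the same state always agree.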

Failures

Failed runs

Thirteen of 150 runs failed the grader. Every failure occurred with Claude as the agent: ten with Jujutsu, two with plain git, and one with GitButler.

| Tool | Agent | Scenario | Failure | Runs | What went wrong |
|---|---|---|---|---|---|
| git | claude | Selective commit | DIRTY_STATE_WRONG | 1/5 | The final dirty worktree state differed from the expected leftovers. |
| git | claude | Multi-amend | GRAPH_WRONG | 1/5 | Right file contents, wrong commit order. |
| Jujutsu | claude | Multi-amend | CONTENT_WRONG | 3/5 | Final file contents or leftovers differed from the expected history. |
| Jujutsu | claude | Split commit | CONTENT_WRONG | 3/5 | Final file contents or leftovers differed from the expected history. |
| Jujutsu | claude | Split commit | GRAPH_WRONG | 2/5 | Right file contents, wrong commit order. |
| Jujutsu | claude | Reorder commits | DIRTY_STATE_WRONG | 2/5 | The final dirty worktree state differed from the expected leftovers. |
| GitButler | claude | Selective commit | PARTITION_WRONG | 1/5 | The run committed or left behind the wrong subset of changes. |

Jujutsu had the widest correctness problem: with Claude, split commit failed 5/5 runs, multi-amend 3/5, and reorder 2/5. GitButler had one Claude selective-commit partition miss, and plain git had two Claude misses.
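
The totals quoted in this section follow directly from the failure table. A short tally, with the rows copied from the table and variable names chosen only for illustration:

```python
from collections import Counter

# (tool, scenario, failure code, failed runs), copied from the failure table.
failures = [
    ("git", "Selective commit", "DIRTY_STATE_WRONG", 1),
    ("git", "Multi-amend", "GRAPH_WRONG", 1),
    ("Jujutsu", "Multi-amend", "CONTENT_WRONG", 3),
    ("Jujutsu", "Split commit", "CONTENT_WRONG", 3),
    ("Jujutsu", "Split commit", "GRAPH_WRONG", 2),
    ("Jujutsu", "Reorder commits", "DIRTY_STATE_WRONG", 2),
    ("GitButler", "Selective commit", "PARTITION_WRONG", 1),
]

by_tool = Counter()
by_cell = Counter()
for tool, scenario, _code, runs in failures:
    by_tool[tool] += runs
    by_cell[(tool, scenario)] += runs

total = sum(by_tool.values())
print(total)                                  # 13
print(dict(by_tool))                          # {'git': 2, 'Jujutsu': 10, 'GitButler': 1}
print(by_cell[("Jujutsu", "Split commit")])   # 5
```

The two Jujutsu split-commit rows add up to five failed runs, which is why that cell is reported as failing 5/5.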

About

About this benchmark

The numbers above are derived from the latest full-matrix aggregate. The two source snapshots and the generator command that produced them are listed below.

git · but+skill · 100 runs · 2026-07-01
batch: full-k5-20260701-all-tools
setup_hash: e68d505c5f06
binary_hash: 772f7963b757
gitbutler_head: 3f92f26cdc04
skill_hash: f01fa617eb21
skill_tree_hash: 43f72206cfa4

jj+skill · 50 runs · 2026-07-01
batch: full-k5-20260701-all-tools
setup_hash: b440ab7d0e70
binary_hash: 849c9ab4bbfd
jj_version: jj 0.42.0
skill_package: onevcat/skills@onevcat-jj
skill_source_url: https://raw.githubusercontent.com/onevcat/skills/master/skills/onevcat-jj/SKILL.md
skill_hash: e0364004187a

generator: node scripts/build-web-data.mjs