3 tools · 2 agents · 5 scenarios · k=5 · 2026-07-01

A version-control benchmark for coding agents

Coding agents do a growing share of version-control work. This benchmark measures how Claude Code and Codex handle five common version-control tasks with git, Jujutsu, and GitButler.

Each tool is scored on reliability, speed, and efficiency. Reliability is judged on the resulting Git history, not on the commands used to produce it.

data generated 2026-07-02 · 150 graded runs · 2 source snapshots · provenance ↓

Results

Results matrix

Each tool on each scenario, for one agent at a time. The matrix below shows Codex, which passed every run.

Scenario	git pass	git time	git cmds	git KB	Jujutsu pass	Jujutsu time	Jujutsu cmds	Jujutsu KB	GitButler pass	GitButler time	GitButler cmds	GitButler KB
Selective commit	5/5	59.3s	19.2	14.9	5/5	94.9s	19.2	36.5	5/5	18.1s	2.0	7.5
Multi-amend	5/5	160s	34.0	35.3	5/5	118s	20.2	75.9	5/5	34.4s	5.0	15.8
Split commit	5/5	116s	31.4	25.2	5/5	169s	38.6	94.1	5/5	31.9s	6.0	19.3
Reorder commits	5/5	42.7s	8.8	7.5	5/5	53.1s	10.8	9.8	5/5	20.8s	2.0	5.7
Squash commits	5/5	60.8s	7.4	7.5	5/5	44.7s	10.2	16.5	5/5	27.7s	3.0	8.0
All scenarios	25/25	87.7s	20.2	18.1	25/25	95.8s	19.8	46.6	25/25	26.6s	3.6	11.3

5/5 (all runs passed) = pass rate, read first (reliability first): a wrong history fails regardless of speed
bold = best among tools that passed every run
muted = the tool did not pass all five runs here
KB is comparable within one agent only

Codex: Both agents are run to check whether the tool effect holds across them. This is not a Claude-versus-Codex comparison.
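
The "All scenarios" row can be sanity-checked from the per-scenario rows. The sketch below assumes the summary row is an unweighted mean over the five scenarios, which the page does not state explicitly. The inputs are the rounded table cells, so the result can differ from the published row in the last digit.

```python
from statistics import mean

# Per-scenario means copied from the matrix above: (time in s, cmds, KB).
# These are already rounded, so recomputed averages are approximate.
MATRIX = {
    "git": [(59.3, 19.2, 14.9), (160, 34.0, 35.3), (116, 31.4, 25.2),
            (42.7, 8.8, 7.5), (60.8, 7.4, 7.5)],
    "Jujutsu": [(94.9, 19.2, 36.5), (118, 20.2, 75.9), (169, 38.6, 94.1),
                (53.1, 10.8, 9.8), (44.7, 10.2, 16.5)],
    "GitButler": [(18.1, 2.0, 7.5), (34.4, 5.0, 15.8), (31.9, 6.0, 19.3),
                  (20.8, 2.0, 5.7), (27.7, 3.0, 8.0)],
}


def all_scenarios(rows):
    """Average each metric column over the five scenario rows."""
    return tuple(round(mean(column), 1) for column in zip(*rows))


summary = {tool: all_scenarios(rows) for tool, rows in MATRIX.items()}
for tool, (time_s, cmds, kb) in summary.items():
    print(f"{tool:10s} time={time_s}s cmds={cmds} KB={kb}")
```

The recomputed values land within a few tenths of the published row, which is what rounding of the three-digit time cells (160s, 116s, 118s, 169s) would be expected to produce.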

Scenarios

Each scenario is a pre-built Git repository (a commit history plus uncommitted changes) and a plain-English instruction describing the intended result. No code is generated during a run; only the version-control operation is measured.

01

Selective commit from a mixed working tree

The working tree mixes an input-validation fix with unrelated logging, configuration, and debug-note changes. The instruction asks for a new branch containing only the validation work, with every other change left uncommitted in the working tree. Leaving the right changes uncommitted is part of the graded outcome.

The difficulty is partitioning: selecting the correct files and hunks (contiguous blocks of changed lines within a file) without sweeping in the rest of the uncommitted changes.

Instruction given to the agent

Commit just the input validation work on a new branch. Leave the logging/config cleanup and debug notes uncommitted.

tasks/pilot-1-selective-validation ↗

02

Amend fixes into multiple earlier commits

The branch contains separate validation, scoring, and documentation commits, and the working tree holds three uncommitted fixes, each corresponding to one of those commits. The instruction asks for each fix to be amended into its matching commit, folded into the existing commit rather than recorded as a new one.

Each fix must be applied to a different existing commit, not combined into a single new commit, so the run rewrites three points in the history instead of adding one commit on top.

Instruction given to the agent

Amend the existing five-commit `amend-series` branch. Do not create a new commit. Route the already-present dirty changes like this:
- Amend the validation helper changes in `src/lead.ts`, the malformed-email test in `tests/lead.test.ts`, and the validation wording in `README.md` into commit `refactor validation helpers`.
- Amend the scoring changes in `src/lead.ts` and the enterprise-domain scoring test in `tests/lead.test.ts` into commit `add lead scoring`.
- Amend the response-behavior documentation changes in `README.md` and `docs/response.md` into commit `document response behavior`.
Leave the config logging change, the debug lead summary helper, and the investigation notes uncommitted.

tasks/pilot-2-multi-amend ↗

03

Split a non-top commit

A commit in the middle of the branch mixes validation, scoring, and documentation changes, plus stray debug edits, and a later commit is built on top of it. The instruction asks for that commit to be split into three ordered single-purpose commits, with the debug edits returned to the working tree as uncommitted changes and the commit above left in place.

The commit is not the most recent one: rewriting it requires rebuilding every commit above it without changing their contents.

Instruction given to the agent

Split the non-top commit `add lead workflow` on the existing `split-workflow` branch. Do not keep the original broad commit. Replace it with these three commits, in this order, below the existing top commit `add handler routing metadata`:
- `refactor validation helpers`: the validation helper changes in `src/lead.ts` and the malformed-email test in `tests/lead.test.ts`.
- `tune lead scoring`: the enterprise-domain scoring changes in `src/lead.ts` and the enterprise-domain scoring test in `tests/lead.test.ts`.
- `document lead workflow`: the workflow documentation changes in `README.md` and `docs/lead-workflow.md`.
Keep `add handler routing metadata` as the top commit after the split. Leave the config logging change, the debug lead summary helper, and the investigation notes uncommitted.

tasks/pilot-3-split-commit ↗

04

Reorder a block of commits

The branch's contents are correct, but the retry and notification commits appear after commits that logically depend on them. The instruction asks for that block to be moved earlier in the history, with every commit's contents and message unchanged and nothing left uncommitted.

The reordering must preserve each commit's contents and message exactly; moving the commits in the wrong sequence can produce conflicts.

Instruction given to the agent

Reorder the existing commits on the `reorder-series` branch. Move the adjacent delivery-related block (`add retry policy` and `add notification sender`) earlier in the branch. Do not change any file contents and do not create functional changes. Final commit order must be exactly this, oldest to newest:
1. `add app configuration`
2. `add retry policy`
3. `add notification sender`
4. `add customer model`
5. `add email formatter`
6. `document notification flow`
The commit messages and each commit's content should stay attached to the same subject. Leave the worktree clean.

tasks/pilot-4-reorder-commits ↗

05

Squash commit groups

The branch records the work as many small incremental commits (“extract helper”, “wire helper”, “fix typo”, “actually wire helper”) alongside unrelated commits. The instruction asks for the incremental commits to be squashed (combined) into a small number of semantic commits, with the unrelated commits kept separate.

The grouping must be correct (incremental commits combined into semantic units, unrelated commits left intact) and the run must end with the same final file contents and no uncommitted changes.

Instruction given to the agent

Squash commit groups on the existing `squash-series` branch. Do not change any file contents and do not create functional changes. Keep these commits as separate commits:
- `add parser token model`
- `add export endpoint`
Squash these adjacent commit groups:
- Squash `extract parser helpers` and `wire parser helpers` into one commit named `add parser pipeline`.
- Squash `add retry option`, `test retry option`, and `document retry option` into one commit named `add retry support`.
The final branch order, oldest to newest, should be:
1. `add parser token model`
2. `add parser pipeline`
3. `add export endpoint`
4. `add retry support`
Leave the worktree clean.

tasks/pilot-5-squash-commits ↗
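
The grouping the squash instruction asks for can be modelled as a mapping from old commit subjects to new ones. The sketch below is illustrative only: the starting order in `ORIGINAL` is an assumption (the instruction says the groups are adjacent but does not list the starting history), and `squash` is a hypothetical helper, not part of the benchmark.

```python
# Assumed starting order, oldest to newest. The fixture's real order is not
# given in the instruction; only the adjacency of each group is.
ORIGINAL = [
    "add parser token model",
    "extract parser helpers",
    "wire parser helpers",
    "add export endpoint",
    "add retry option",
    "test retry option",
    "document retry option",
]

# New subject -> the adjacent commits it replaces (from the instruction).
GROUPS = {
    "add parser pipeline": ["extract parser helpers", "wire parser helpers"],
    "add retry support": ["add retry option", "test retry option",
                          "document retry option"],
}

# Final order required by the instruction.
EXPECTED = [
    "add parser token model",
    "add parser pipeline",
    "add export endpoint",
    "add retry support",
]


def squash(subjects, groups):
    """Collapse each adjacent group into one subject; keep other commits."""
    member_of = {old: new for new, olds in groups.items() for old in olds}
    result = []
    for subject in subjects:
        target = member_of.get(subject, subject)
        if subject in member_of and result and result[-1] == target:
            continue  # a later member of a group that was already emitted
        result.append(target)
    return result


final = squash(ORIGINAL, GROUPS)
print(final == EXPECTED)  # True
```

This only models the commit subjects. The graded outcome also requires unchanged file contents and a clean worktree, which a subject list cannot show.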

Method

Correctness is scored by a hidden, deterministic grader on the final Git state; two different command sequences pass if they produce the same history. Every tool receives the same task and the same plain-English instruction, the tool name does not appear in the prompt, and setup is excluded from timing.

Disclosure: This benchmark is built and maintained by GitButler, one of the three tools measured. Correctness is determined by the grader, not by GitButler, and the task definitions, the grader, and the per-run data are all public.

Identical instruction across tools: Each task ships as one prepared repository (the fixture) with one plain-English instruction ("commit just the input validation work on a new branch, leave the rest uncommitted"). The tool's name does not appear in the prompt. The agent decides how to carry out the instruction.
Deterministic grader: Correctness is checked by a hidden, deterministic grader: a scripted check that returns the same verdict for the same final state. It inspects the resulting Git state: commit boundaries, branch topology (which commits sit on which branch, in what order), and what stayed uncommitted. It is not an LLM judge, and it does not compare the agent's commands against a reference sequence: two different command sequences pass if they produce the same history.
Timing boundary: Building the fixture, preparing the workspace, installing each tool's skill (an instruction file documenting the tool's commands), and placing the uncommitted changes in the working tree all happen before timing begins. The measured figures cover only the agent's work on the task.
Git write restriction: In runs using GitButler or Jujutsu, raw git write commands are blocked, so the agent must use the tool under test. When the tool calls git internally, that is the tool's own work and does not count against the agent.
Jujutsu setup: jj 0.42.0, a colocated repository (jj and git operating on the same working copy), and the most-used external jj agent skill, installed before timing begins.
Five runs per cell (k=5): Each agent–tool–task combination (a cell) ran five times. The numbers on this page are means over those five runs, not a single run.
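
A state-based grader of the kind described above could look roughly like the following. This is a minimal sketch, not the benchmark's hidden grader: `RepoState`, `read_state`, and `grade` are hypothetical names, and the mapping from each check to a failure code is an assumption based on the code descriptions in the failure table. It reads only final Git state (`git log`, `git rev-parse`, `git status --porcelain`) and never looks at the commands the agent ran.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoState:
    subjects: tuple   # commit subjects on the branch, oldest to newest
    tree: str         # tree hash of the branch tip (final file contents)
    dirty: frozenset  # paths with uncommitted changes


def _git(repo, *args):
    """Run a read-only git command in `repo` and return its stdout."""
    done = subprocess.run(["git", "-C", repo, *args], check=True,
                          capture_output=True, text=True)
    return done.stdout


def read_state(repo, base, branch):
    """Collect the graded state; `base` is where the branch started."""
    log = _git(repo, "log", "--reverse", "--format=%s", f"{base}..{branch}")
    tree = _git(repo, "rev-parse", f"{branch}^{{tree}}").strip()
    status = _git(repo, "status", "--porcelain").splitlines()
    # Porcelain lines look like "XY path"; the path starts at column 3.
    dirty = frozenset(line[3:] for line in status)
    return RepoState(tuple(log.splitlines()), tree, dirty)


def grade(actual, expected):
    """Return 'PASS' or a failure code, based only on the final state."""
    if actual.tree != expected.tree:
        return "CONTENT_WRONG"      # final file contents differ
    if actual.subjects != expected.subjects:
        return "GRAPH_WRONG"        # right contents, wrong commit order
    if actual.dirty != expected.dirty:
        return "DIRTY_STATE_WRONG"  # wrong uncommitted leftovers
    return "PASS"


# Hypothetical states: same contents, commits in the wrong order.
expected = RepoState(("add app configuration", "add retry policy"),
                     "tree-a", frozenset())
swapped = RepoState(("add retry policy", "add app configuration"),
                    "tree-a", frozenset())
print(grade(expected, expected), grade(swapped, expected))  # PASS GRAPH_WRONG
```

Because `grade` compares states rather than command logs, any two command sequences that end in the same `RepoState` receive the same verdict, which is the property the method relies on.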

k=5 · n=25 per cell · oracle: git-state · jj 0.42.0 · 150 runs
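
The run counts follow from the design. The short sketch below enumerates the agent, tool, and scenario combinations; reading the 25 as the number of runs per agent and tool pair (five scenarios times five runs, as in the "All scenarios" row) is an interpretation, not something the page states.

```python
from itertools import product

TOOLS = ["git", "Jujutsu", "GitButler"]
AGENTS = ["Claude Code", "Codex"]
SCENARIOS = ["Selective commit", "Multi-amend", "Split commit",
             "Reorder commits", "Squash commits"]
K = 5  # runs per agent-tool-scenario combination

cells = list(product(AGENTS, TOOLS, SCENARIOS))
total_runs = len(cells) * K               # every combination, five times
runs_per_agent_tool = len(SCENARIOS) * K  # one agent with one tool
runs_per_tool = len(AGENTS) * runs_per_agent_tool

print(len(cells), total_runs, runs_per_agent_tool, runs_per_tool)
# 30 150 25 50
```

With 50 runs per tool, a snapshot covering two tools holds 100 runs and a snapshot covering one tool holds 50.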

Failures

Failed runs

Thirteen of 150 runs failed the grader, all of them Claude runs: ten with Jujutsu, two with plain git, and one with GitButler.

Tool	Agent	Scenario	Failure	Failed runs	What went wrong
git	claude	Selective commit	DIRTY_STATE_WRONG	1/5	The final dirty worktree state differed from the expected leftovers.
git	claude	Multi-amend	GRAPH_WRONG	1/5	Right file contents, wrong commit order.
Jujutsu	claude	Multi-amend	CONTENT_WRONG	3/5	Final file contents or leftovers differed from the expected history.
Jujutsu	claude	Split commit	CONTENT_WRONG	3/5	Final file contents or leftovers differed from the expected history.
Jujutsu	claude	Split commit	GRAPH_WRONG	2/5	Right file contents, wrong commit order.
Jujutsu	claude	Reorder commits	DIRTY_STATE_WRONG	2/5	The final dirty worktree state differed from the expected leftovers.
GitButler	claude	Selective commit	PARTITION_WRONG	1/5	The run committed or left behind the wrong subset of changes.

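Jujutsu had the widest correctness problem: with Claude, split commit failed 5/5 (three CONTENT_WRONG, two GRAPH_WRONG), multi-amend failed 3/5, and reorder failed 2/5. GitButler's one Claude miss was a selective-commit partition error. Plain git's two Claude misses were one selective commit with the wrong leftovers and one multi-amend with the wrong commit order.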
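
The totals can be recomputed from the failure table. The sketch below copies the table rows as data and tallies them by tool and by failure code. The figure of 25 runs per tool assumes five scenarios times five runs for the one agent (Claude) that produced every failure.

```python
from collections import Counter

# (tool, scenario, failure code, failed runs), copied from the table above.
# Every row is a Claude run.
FAILURES = [
    ("git", "Selective commit", "DIRTY_STATE_WRONG", 1),
    ("git", "Multi-amend", "GRAPH_WRONG", 1),
    ("Jujutsu", "Multi-amend", "CONTENT_WRONG", 3),
    ("Jujutsu", "Split commit", "CONTENT_WRONG", 3),
    ("Jujutsu", "Split commit", "GRAPH_WRONG", 2),
    ("Jujutsu", "Reorder commits", "DIRTY_STATE_WRONG", 2),
    ("GitButler", "Selective commit", "PARTITION_WRONG", 1),
]
RUNS_PER_TOOL = 25  # Claude runs per tool: 5 scenarios x 5 runs

by_tool = Counter()
by_code = Counter()
for tool, _scenario, code, failed in FAILURES:
    by_tool[tool] += failed
    by_code[code] += failed

total = sum(by_tool.values())
passed = {tool: RUNS_PER_TOOL - failed for tool, failed in by_tool.items()}

print(total)          # 13
print(dict(by_tool))  # {'git': 2, 'Jujutsu': 10, 'GitButler': 1}
print(passed)         # {'git': 23, 'Jujutsu': 15, 'GitButler': 24}
```

By failure code, the same rows give six CONTENT_WRONG, three GRAPH_WRONG, three DIRTY_STATE_WRONG, and one PARTITION_WRONG.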

About

About this benchmark

The numbers above are derived from the latest full-matrix aggregate. The two source snapshots and the generator command that produced them are listed below.

git · but+skill · 100 runs · 2026-07-01

batch: full-k5-20260701-all-tools
setup_hash: e68d505c5f06
binary_hash: 772f7963b757
gitbutler_head: 3f92f26cdc04
skill_hash: f01fa617eb21
skill_tree_hash: 43f72206cfa4

jj+skill · 50 runs · 2026-07-01

batch: full-k5-20260701-all-tools
setup_hash: b440ab7d0e70
binary_hash: 849c9ab4bbfd
jj_version: jj 0.42.0
skill_package: onevcat/skills@onevcat-jj
skill_source_url: https://raw.githubusercontent.com/onevcat/skills/master/skills/onevcat-jj/SKILL.md
skill_hash: e0364004187a

generator: node scripts/build-web-data.mjs

derived results.json ↗ · benchmark source ↗