Modern code review: a case study at Google
Dec 7, 2023
Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, Alberto Bacchelli | 2018 | Paper
Benchmarks for Code Review
It’s handy, as we talk about code review, to use some benchmarks that anchor our expectations for review performance in data. The best that I know of are in a 2018 paper from Google: “Modern Code Review: A Case Study at Google”. I find these helpful when breaking down qualitative feedback about reviews: if you can get the data, you can start to get a feel for where improvements in your review process can come from.
One way of seeing code review is that it has a tendency to be wasteful - the best reviews are fast and lightweight, not a blocking step that stops value from being released to production. We should be leaning heavily on automated tools for the small stuff, and instead using code review to educate ourselves, share knowledge, manage complexity and keep that maintenance burden under control.
I think what’s interesting as I reflect on these benchmarks is how fast and light the review should be, and how the number of comments decreases with experience - we want to be making very few, crucial observations only. This implies that most stuff is probably not worth saying - instead, go fix your tools to say it faster for you. Bear in mind that these are output metrics, not input metrics. If you’re commenting a lot, finding reviews slow or relying on a narrow group of reviewers, perhaps that points to a problem elsewhere in your setup that you need to go and solve upstream of the code review. Anyway, enough babble, here are the benchmarks, some implied heuristics and a bit of interpretation.
Reviews requested: median 3 review requests per week, 80th percentile <=7
A key aspect of modern code review is its pace. At Google, the median developer authors about 3 changes per week, with 80% authoring fewer than 7 changes.
Reviews completed: median 4, 80th percentile <= 10. The median dev should review about as often as they submit.
In terms of reviews performed, the median is 4 per week, with 80% reviewing fewer than 10 changes weekly. This suggests that reviews are not (and should not be) concentrated, but distributed across the team so the workload is shared fairly evenly - the median developer should be reviewing code about as often as they author a PR. This might be a nice heuristic for most teams, and one that’s easy to observe and quantify in your own team.
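If you want to observe this balance in your own team, a minimal sketch might look something like the following, assuming you can export a list of changes with an author and a reviewer list (the record shape here is hypothetical, not any particular tool’s API):

```python
# Sketch: compare how often each developer authors changes vs. reviews them.
# Assumes a list of change records with "author" and "reviewers" fields
# (hypothetical field names - adapt to whatever your tooling exports).
from collections import Counter

def review_author_balance(changes):
    """Return {developer: (authored, reviewed)} counts from change records."""
    authored = Counter(c["author"] for c in changes)
    reviewed = Counter(r for c in changes for r in c["reviewers"])
    devs = set(authored) | set(reviewed)
    return {dev: (authored[dev], reviewed[dev]) for dev in devs}

changes = [
    {"author": "alice", "reviewers": ["bob"]},
    {"author": "bob", "reviewers": ["alice", "carol"]},
    {"author": "carol", "reviewers": ["alice"]},
]
for dev, (a, r) in sorted(review_author_balance(changes).items()):
    print(f"{dev}: authored {a}, reviewed {r}")
```

Developers whose authored count far outstrips their reviewed count (or vice versa) are the ones to look at when the workload feels uneven.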
Review speed: median 4 hours, 1 hour for small changes, 5 hours for large ones
The speed of receiving initial feedback on a review at Google is impressive: under an hour for small changes and about 5 hours for larger ones. The overall median time for the entire review process is under 4 hours. This contrasts sharply with (at a guess) most folks’ experience of review today. For example, the paper gives data points from Microsoft, where median approval times range from 14.7 to 24 hours - from experience I’d say this is probably the typical experience at most companies today. This suggests that most places can do a bit better than they are today, and there are some nice productivity tokens to be won (at least in 2018!).
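To see where you sit against that 4-hour median, a rough sketch of the measurement is below, assuming you can pull a “review requested” and “first comment” timestamp per change (the field names and timestamp format are illustrative, not a real API):

```python
# Sketch: median hours from review request to first reviewer feedback.
from datetime import datetime
from statistics import median

def median_hours_to_first_feedback(changes):
    """Median hours between a review being requested and the first comment."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    deltas = [
        (datetime.strptime(c["first_comment"], fmt)
         - datetime.strptime(c["requested"], fmt)).total_seconds() / 3600
        for c in changes
        if c.get("first_comment")  # skip changes still waiting on feedback
    ]
    return median(deltas) if deltas else None

changes = [
    {"requested": "2018-05-01T09:00:00", "first_comment": "2018-05-01T10:30:00"},
    {"requested": "2018-05-01T14:00:00", "first_comment": "2018-05-02T09:00:00"},
    {"requested": "2018-05-02T11:00:00", "first_comment": None},
]
print(median_hours_to_first_feedback(changes))  # hours; 10.25 for this sample
```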
Review size: 35% of changes modify a single file, 90% modify <= 10 files
Smaller review sizes enable quicker turnaround. At Google, over 35% of changes modify just one file, and about 90% modify fewer than 10 files. Notably, over 10% of changes adjust a single line, with a median of 24 lines modified. This is significantly lower than the median change sizes at companies like AMD (44 lines) or Lucent (263 lines). A guess: I think this probably changes a lot with the maturity of the codebase you are working in, so it’s likely to be highly contextual. I’d suggest you use a range here for appropriate review size, adjust it over time, and treat it as a guide, not a stick to beat folks with.
Number of comments: 1 comment:100 lines, scales linearly with review size and number of reviewers up to 1250 LoC
Following intuition, there is a correlation between the number of reviewers and the average number of comments at Google: more reviewers typically mean more comments. Noticeably, the paper describes the 1:100 ratio, or 12.5 comments per 1,250 LoC, as the peak comment rate. Past this change size the rate falls: the paper points out that changes larger than this often contain auto-generated code or large deletions, resulting in a lower average number of comments.
I think this is a bit more of a leap, but this relationship suggests we can use comments as a proxy for review rigor (well, Google can, not sure about the rest of us). If your comment rate feels intuitively like it’s in a good spot (1 per 100 lines) then this might point out some sources of waste - if you have more reviewers on smaller reviews you may generate more comments, but are unlikely to move the needle on quality. Inversely, where you want more rigor, add more reviewers and then make sure you get more comments (if you’re adding reviewers but the comment rate is not changing, this is probably a wasteful interruption). We want our reviews to be fast and light, so perhaps the comment rate is the most useful heuristic here.
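As a sketch of that heuristic, assuming you can get lines changed and comment counts per review (field names are made up for illustration), you might compute something like:

```python
# Sketch: comments per 100 changed lines, capping the size at the ~1,250 LoC
# peak the paper describes so huge generated diffs don't dilute the rate.
def comment_rate_per_100_lines(change, cap=1250):
    """Comments per 100 lines of change, with the line count capped."""
    lines = min(change["lines_changed"], cap)
    return 100 * change["comments"] / max(lines, 1)

reviews = [
    {"lines_changed": 24, "comments": 1},    # around the median Google change
    {"lines_changed": 400, "comments": 2},   # under-commented for its size?
    {"lines_changed": 5000, "comments": 3},  # likely auto-generated code
]
for r in reviews:
    print(round(comment_rate_per_100_lines(r), 2))
```

Rates well below roughly 1 per 100 lines on substantial, hand-written changes are the ones worth a second look, rather than the absolute number on any single review.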
Number of reviewers: median is 1, 75% of changes have a single reviewer, 99% <=5
Google’s data shows that fewer than 25% of changes have more than one reviewer, and over 99% have at most five, with a median of one reviewer per change. Following intuition, larger changes tend to have slightly more reviewers, so reviewer count should probably track change complexity up and down the curve - though it’s an s-curve, and beyond that 1,250-line peak comment rate we should hit our ceiling.
Review comments vs tenure: PR comment rate should match the median once tenure exceeds 1 year, in the first year comment rates are double
There’s a bit of insight towards the end of the paper into the relationship between the number of comments received on code changes and the tenure of the PR author at Google. Google found a steep decline in the average number of comments per change as tenure increases during the first year, stabilizing around 2 comments per change thereafter (down from 6). After the first year the comment rate moves in line with the change size, up to the 1,250-line peak we described earlier. Developers at Google who started within the past year typically receive more than twice as many comments per change. There’s also a relationship between the number of comments containing a question mark and tenure. Following intuition, reviews are used for knowledge transfer to the author less after the first year. Noticeably, this also implies that review rigor should remain stable even for grizzled veterans or review buddies; there should be no free passes through code review.
Review rate vs tenure: as tenure grows you edit more than you review but both rates keep increasing, changing most rapidly in the first year and stabilising thereafter
As tenure increases at Google, three key trends emerge: the breadth of edits (number of files engaged with) increases, developers review more (though, noticeably, files edited increases faster than files reviewed), and cumulative contributions (reviews and edits) keep increasing.