Nov 1, 2024
Nice paper on AI-assisted code review at Google. Three call-outs I thought were interesting (as I imagine we're about to be hit by a tidal wave of commercial applications of this idea):
(1) One of the issues is that the required training dataset varies by best practice - the currency of knowledge really matters. For example, the underlying model was trained on data prior to '22, but the canonical source of Python type definitions has shifted a fair bit from Python 3.9 onwards. If you're generically trying to train on all best practices, you're likely to miss this kind of subtlety, as best practice at the micro level is continually evolving. So you need a good, customisable suppression mechanism for noisy/changing rules to keep this sort of thing helpful, and you need to continually retrain the model on fresh data.
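To make the drift concrete, here's a quick illustration of my own (not from the paper): the same function annotated in the pre-2022 style that dominates older training data, versus the style preferred on modern Python (built-in generics from 3.9 via PEP 585, and `X | None` unions from 3.10 via PEP 604).

```python
# Illustration only (mine, not the paper's). Runs as-is on Python 3.10+.
from typing import Dict, List, Optional


# Pre-3.9 style: what pre-2022 training data overwhelmingly contains.
def lookup_old(index: Dict[str, List[int]], key: str) -> Optional[List[int]]:
    return index.get(key)


# 3.9+/3.10+ style: built-in generics and the | union operator.
def lookup_new(index: dict[str, list[int]], key: str) -> list[int] | None:
    return index.get(key)
```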
(2) There's a nice chart about halfway through the paper that talks through how much better this approach is than a linter. I think this is key: it's all very well building a lovely (and expensive) ML model to do this work, but if we can cheaply and efficiently achieve the same effect with linters that we configure once and apply everywhere, then (if you are not a Google-scale eng team*) why would you bother? The paper finds that about 2/3 of the suggested changes wouldn't be possible with existing (or easy-to-develop) lint rules (based on expert opinion at Google, not prototyping), but there's a noticeable concentration in the interventions AutoCommenter makes (about 90% of review comments posted related to the top 85 rules). An ML-based approach is going to be better in certain scenarios (like suggesting naming changes); however, a lot of those scenarios feel inherently more subjective. So if it's AND rather than XOR - you run the ML reviewer alongside a linter rather than instead of one - I wonder whether what it gives you incrementally over the linter is going to be very noisy in practice. I think it's safe to say that for a relatively long horizon a good linting setup is likely to be a lot more effective - cheaper, faster and available everywhere - and the best place for everyone to start.
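As a toy illustration of that split (mine, not the paper's): the `!= None` comparison below is exactly the sort of mechanical thing a linter flags cheaply and deterministically, whereas "this function and its argument are badly named" is the kind of subjective, context-dependent suggestion only the ML reviewer is going to offer.

```python
# Toy example of the linter vs. ML-reviewer split (not taken from the paper).
def proc(d):
    # A naming suggestion (e.g. "call this process_orders(orders_by_id)") needs
    # context and judgement - classic ML-reviewer territory, and subjective.
    items = []
    for k in d:
        if d[k] != None:  # a linter catches this mechanically (E711: use "is not None")
            items.append(d[k])
    return items
```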
(3) About 40% of the comments made by AutoCommenter were resolved in subsequent changes before merging to main - this is taken as the efficacy of the AI review. The paper doesn't benchmark this against human reviews, but it feels ballpark right to me: comment rate is typically about 1:100 LoC and average PR size is normally in the 200-400 line bracket, so you'd expect to end up acting on roughly 1-2 comments per review.
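A quick back-of-envelope check of that intuition, using my rough numbers above rather than anything from the paper:

```python
# Back-of-envelope sanity check (my assumed rates, not figures from the paper).
comment_rate = 1 / 100       # ~1 review comment per 100 LoC
resolved_fraction = 0.40     # share of AutoCommenter comments resolved before merge

for pr_size in (200, 400):   # typical PR size bracket in LoC
    comments = pr_size * comment_rate
    acted_on = comments * resolved_fraction
    print(f"{pr_size} LoC -> ~{comments:.0f} comments, ~{acted_on:.1f} acted on")
# 200 LoC -> ~2 comments, ~0.8 acted on
# 400 LoC -> ~4 comments, ~1.6 acted on
```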
* This is key: clearly at Google scale this is valuable, and throughout the paper the Google team have carefully measured that it has a positive impact (e.g. using a phased rollout that included A/B testing AutoCommenter across their entire engineering team).