Nov 12, 2024
TL;DR: A study of ~5,000 engineers across Microsoft, Accenture, and a Fortune 100 company finds GitHub Copilot boosts weekly PRs by 26.08% (SE: 10.2%) - but the effect varies widely, with a 95% confidence interval from 5.88% to 46.28%. Adoption patterns show junior and shorter-tenured engineers are more likely to use Copilot (by up to 9.5 percentage points). 30-40% of engineers didn't use it at all.
This is a paper I saw discussed a bit over the summer that looks at the productivity impact of GitHub Copilot for organisations. It has a larger sample size than most papers of a similar ilk that I've seen (~5,000 engineers, across different organisations in different settings, with engineering teams spread across the world).
The study looks at counts of pull requests, commits per pull request and build failure rates. The hypothesis is that if the build failure rate holds constant while the number of PRs increases, then Copilot is having a positive impact on productivity. Counting PRs is a naive measure of productivity (like measuring doctors by the number of prescriptions written instead of looking at patient outcomes), but we all know that measuring eng productivity is hard, so let's park this concern and roll with it.
When looking at the whole population of engineers in the study (roughly 1,500 at Microsoft, 300 at Accenture and 3,000 at an anonymous Fortune 100 company), the effect of Copilot on PRs at the whole-org scale was positive and statistically significant - Copilot-enabled engineers authored about 26.08% more PRs per week than a control group during the study period. However, the study period was short - the experiment lasted for as little as 1 month after initial adoption. The standard error on the estimate was also large at 10.2%, which puts a 95% confidence interval for the productivity increase at 5.88% to 46.28%. That's pretty wide!
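As a quick sanity check on that interval - this is just a plain normal approximation with z ≈ 1.96, not the paper's exact estimator:

```python
# Back-of-envelope check on the quoted 95% confidence interval,
# assuming a normal approximation (estimate +/- 1.96 * SE).
estimate = 26.08  # % increase in weekly PRs
se = 10.2         # standard error, in percentage points

lower = estimate - 1.96 * se
upper = estimate + 1.96 * se
print(f"95% CI: {lower:.2f}% to {upper:.2f}%")  # -> 6.09% to 46.07%
# Close to the paper's quoted 5.88%-46.28%; the small gap presumably
# comes from unrounded inputs or a slightly different critical value.
```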
The paper hypothesises that one reason for this variance is the heterogeneity of the take-up and usage of Copilot. At Microsoft, the researchers were given a bit more metadata about the population and could break out the engineers in the study by tenure and seniority, which gives us a bit more insight than usual into how Copilot performs for different profiles of engineer in the org.
Shorter-tenured engineers were 9.5 percentage points more likely to adopt Copilot (84.3% vs 74.8%), and junior engineers were 5.3 percentage points more likely (82.1% vs 76.8%). Shorter-tenured engineers were also 4.3 percentage points more likely to accept suggestions, though interestingly the gap between junior and senior engineers was smaller here - a 1.8 percentage point difference in acceptance rates. Shorter-tenured engineers were also more likely to still be using Copilot 1 month after adoption. We tend to think of the impact of Copilot as uniform, but it isn't - it varies significantly with experience. We can also speculate that today's tools might decline in usefulness over time as we all become more experienced.
A few thoughts on all this:
- The paper suggests that the take-up, impact and retention of these types of services are uneven across populations of engineers. This matches what we've all seen and heard at the ground level, and it means a per-seat pricing model where we don't adjust what we pay based on usage is likely a poor fit. We shouldn't be paying for every engineer in the org to have a seat. A better approach would be a task-completion payment model (e.g. something along the lines of paying per accepted completion), which seems to be the emergent trend. One to watch perhaps (and in the meantime, trim those seat numbers to save some $* - there's a rough cost sketch just after this list).
- I wonder what longer-term measures of retention are like for Copilot. I imagine revenue retention is strong, as I suspect orgs won't have wised up to this heterogeneous usage profile yet, but what does user engagement look like over longer time periods? For me, the experiment periods in the paper are a bit too short to draw sweeping conclusions about productivity impact.
- Interestingly, 30-40% of engineers offered Copilot in the study didn't use it at all. It would be interesting to understand why; the initial adoption rate in the study feels low given how integrated everything is (there may be timing effects here - the experiments ran through H2 2022 and H1 2023). Anecdotally, it's quite common on the ground to hear engineers say they've turned it off or found it less useful over time.
- A harsh observation, but at least based on the numbers and the measurement method chosen in the paper, the surveyed org at Accenture appears to be noticeably weaker than the others. It would be interesting to know why, as the returns to closing the gap between the Accenture and Microsoft engineering orgs would be quite large. Perhaps those exec jobs are safe for a while yet.
- Speculative, but maybe the UX for this type of Copilot tooling should vary along the experience curve. Thinking out loud: junior engineers/new joiners could work in a very AI-led setting that gradually peels itself back, decreasing the number of suggestions it makes, to help folks accelerate along that curve. Perhaps the pullback could be driven by the model's confidence that a suggested completion will be accepted (a toy sketch of this follows the list). I think what we're seeing at the moment is a one-size-fits-all approach, and that's probably one of the drivers of lower-than-expected adoption and engagement.
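On the seat-trimming point, here's a rough back-of-envelope. The seat price is Copilot Business's list price at the time of writing; the 35% non-usage figure sits in the study's reported 30-40% range; the org size is made up:

```python
# Rough illustration of why flat per-seat pricing fits badly with
# uneven usage. Numbers are illustrative, not from the paper.
seats = 1000           # hypothetical org size
seat_price = 19.0      # USD/seat/month (Copilot Business list price)
non_usage = 0.35       # study reports 30-40% of engineers never used it

monthly_bill = seats * seat_price
active = seats * (1 - non_usage)
print(f"Monthly bill: ${monthly_bill:,.0f}")
print(f"Effective cost per active engineer: ${monthly_bill / active:.2f}")
# -> $29.23/month per engineer actually using the tool,
#    a ~54% markup on the $19 list price.
```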
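And to make the adaptive-UX idea concrete, a toy sketch of confidence-gated throttling. Everything here - the function name, the thresholds, the 24-month ramp - is hypothetical; it illustrates the shape of the idea, not how Copilot actually works:

```python
# Speculative sketch of "peeling back" along the experience curve:
# only surface a completion when model confidence clears a bar that
# rises as the engineer gains experience.
def should_show_suggestion(confidence: float, months_experience: float) -> bool:
    """Gate a completion on model confidence. The threshold ramps
    from 0.2 (new joiner, suggestion-heavy) up to 0.8 (veteran,
    only high-confidence suggestions shown) over 24 months."""
    base, ceiling, ramp_months = 0.2, 0.8, 24.0
    threshold = base + (ceiling - base) * min(months_experience / ramp_months, 1.0)
    return confidence >= threshold

# A 0.5-confidence suggestion is shown to a new joiner, not a veteran:
assert should_show_suggestion(0.5, months_experience=1)       # threshold ~0.23
assert not should_show_suggestion(0.5, months_experience=24)  # threshold 0.80
```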
*It's hard to get detailed stats on usage out of GitHub, but there are a few API endpoints to help.
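For example, a minimal sketch using the Copilot seat-billing endpoint (GET /orgs/{org}/copilot/billing/seats, which reports a last_activity_at per seat) to count seats with no recorded activity. The org name and token are placeholders, and a real version would need to paginate:

```python
import os
import requests

ORG = "your-org"  # placeholder - substitute your organisation

resp = requests.get(
    f"https://api.github.com/orgs/{ORG}/copilot/billing/seats",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    },
    params={"per_page": 100},  # first page only; paginate for larger orgs
)
resp.raise_for_status()
data = resp.json()

# Seats whose last_activity_at is null have never shown Copilot activity.
idle = [s for s in data["seats"] if s.get("last_activity_at") is None]
print(f"{len(idle)} of {data['total_seats']} seats show no activity")
```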
The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers