A few months ago we talked about GitHub CoPilot and the controversy it created in the open source community. Since then a lawsuit has been filed against Microsoft, GitHub, and OpenAI (creators of the underlying technology). OSPOs are increasingly being asked whether AI-assisted code is safe to use. The answer, of course, is an unsatisfying maybe.
Our take here at OSPOCO is that the copyright and license risk associated with using AI-assisted code generation systems has increased, particularly for users of GitHub's CoPilot product. Competing products, such as Amazon's CodeWhisperer, present substantially less risk due to differences in how CodeWhisperer was trained and how it presents its suggestions.
What's going on with the CoPilot lawsuit?
When we wrote our original article about GitHub CoPilot, there had only been an announced "investigation" into CoPilot. Evidently the investigation was concluded quickly, because within a week of our article the same group announced that their investigation had matured into a lawsuit.
The CoPilot lawsuit is odd. The complaint alleges a miscellaneous assortment of causes of action, including breach of contract, various torts, and privacy violations. Conspicuously missing is any allegation of copyright infringement. It's as if a knight went into battle but decided to fight with a bundle of butter knives instead of a sword. The plaintiffs are avoiding the most powerful and effective tool in their arsenal, probably because they don't think they can wield it effectively in this case.
The weakness of the complaint has not been lost on Microsoft and OpenAI's lawyers. In their responses to the lawsuit, they convincingly argue that the case should be dismissed for many reasons, including the plaintiffs' apparent inability to effectively claim copyright infringement. Under a legal doctrine called preemption, plaintiffs can't use a collection of ancillary claims to allege what is effectively copyright infringement without meeting the standard for actual copyright infringement.
Danger on the horizon
You might think that the weakness of the existing CoPilot lawsuit would be good news, but it paradoxically raises the risk of using CoPilot right now. If this were a strong case, anyone similarly situated would probably wait it out to see what the court does. But this lawsuit won't settle the issue. There are copyright owners who are able and willing to make much more powerful and effective arguments. Rather than encouraging them to bide their time, the existing lawsuit may provoke those better-positioned plaintiffs to come forward and make their case.
Overfitting and bad facts
To further complicate the issue, a study was published last week about the likelihood of "memorized" results in models trained on large datasets. The study was specifically about the memorization of images in generative image models, but the findings are likely analogous to what would happen with text (especially code).
The paper made a big splash because the authors were able to provoke the Stable Diffusion model into regenerating almost-perfect copies of about 90 images. On its face, that sounds bad. It looks bad too - see the screenshot of "found" images below.
The nuance that was lost comes from all the caveats in the paper. The researchers spent hundreds of thousands of dollars in compute time to make the model reproduce copies of just 0.0003% of the source images. They were also only able to provoke the regeneration of highly duplicated images, and only by reconstructing the exact known parameters used to train the model. Even with this "head start," the researchers still had to generate hundreds of possible duplicates and use a specialized process to find the reported matches.
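To make that last step concrete, here is a minimal Python sketch of the kind of near-duplicate filtering such a study requires. This is not the paper's actual method (the researchers used far more sophisticated embedding-based comparisons); the perceptual hash, the file-path inputs, and the distance threshold below are all illustrative assumptions.

```python
# Hypothetical sketch: flagging generated images that are near-duplicates
# of training images, using a simple perceptual "difference hash".
from PIL import Image


def dhash(path: str, size: int = 8) -> int:
    """Reduce an image to a 64-bit hash of its horizontal gradients."""
    img = Image.open(path).convert("L").resize((size + 1, size))
    pixels = list(img.getdata())
    bits = 0
    for row in range(size):
        for col in range(size):
            left = pixels[row * (size + 1) + col]
            right = pixels[row * (size + 1) + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def find_matches(generated: list[str], training: list[str], threshold: int = 5):
    """Yield (generated, training) pairs whose hashes nearly collide."""
    train_hashes = {p: dhash(p) for p in training}
    for gen_path in generated:
        h = dhash(gen_path)
        for train_path, th in train_hashes.items():
            if hamming(h, th) <= threshold:
                yield gen_path, train_path
```

Even this toy version shows why the researchers needed a "specialized process": every candidate generation has to be compared against the training set, and the threshold has to be tuned to separate true memorization from coincidental similarity.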
The implication for AI-assisted code generation is obvious: models can sometimes be made to recreate copyrighted material. That is especially true for open source code, whose licenses encourage reuse. Reuse in training data means duplication, making regeneration of copyrighted code that much more likely.
Alternatives and training
CoPilot has received most of the attention, but there are alternatives that have implemented some smart controls. One notable example is Amazon CodeWhisperer. CodeWhisperer differs from CoPilot in that it runs a post-generation filter that checks each suggestion against known open source code. If a suggestion matches, the product raises an alert telling the developer, "This piece of code appears similar to code from [project]." That context lets the developer make an informed decision about whether to use that particular suggestion.
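To illustrate the concept (and only the concept - Amazon's actual matching logic is not public, so every name and detail below is an assumption), here is a sketch of what a post-generation reference filter might look like:

```python
# Hypothetical sketch of a post-generation "reference tracker" in the
# spirit of what CodeWhisperer surfaces to developers. The corpus,
# normalization, and matching here are all illustrative assumptions.
from dataclasses import dataclass


@dataclass
class KnownSnippet:
    code: str
    project: str
    license: str


def normalize(code: str) -> str:
    """Collapse whitespace so trivial formatting differences don't hide matches."""
    return " ".join(code.split())


def check_suggestion(suggestion: str, corpus: list[KnownSnippet]) -> str | None:
    """Return an attribution warning if the suggestion matches known code."""
    needle = normalize(suggestion)
    for snippet in corpus:
        if needle and needle in normalize(snippet.code):
            return (f"This piece of code appears similar to code from "
                    f"{snippet.project} (license: {snippet.license}).")
    return None  # no match: surface the suggestion without a warning
```

A substring match over normalized text is the simplest possible approach; a production filter would need fuzzier matching to catch renamed variables and reordered lines. The point is the workflow: the match produces a warning and attribution, not a silent suggestion.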
Over the long term, the next iterations of all of these models - for images and text alike - will likely include a filter that keeps duplicated inputs out of the training data. When duplicates are removed from the training datasets, the chance of any meaningful memorization drops to almost nothing.
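A hypothetical sketch of that pre-training step, assuming exact hashing of normalized text (production pipelines would more likely use fuzzy techniques such as MinHash to catch near-duplicates too):

```python
# Hypothetical sketch of deduplicating a training corpus before training.
# Exact hashing of normalized text is the simplest possible version.
import hashlib


def content_key(text: str) -> str:
    """Hash whitespace-normalized text so trivial reformatting still dedupes."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(examples: list[str]) -> list[str]:
    """Keep only the first occurrence of each distinct training example."""
    seen: set[str] = set()
    unique = []
    for example in examples:
        key = content_key(example)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique
```

If the memorization research is right that duplication drives regurgitation, this one preprocessing pass removes most of the risk at the source.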
So, what should OSPOs think about CoPilot?
Getting back to the question we asked at the beginning, it appears risky to endorse use of GitHub CoPilot right now. But there are options, like CodeWhisperer, and the whole situation is likely to be temporary. Within a relatively short time we will likely have AI-assisted code generation services that are notably safer. And sometime in the next decade, we might have legal certainty.