
Open Source is Coming for AI

by VanL

AI is the new hot topic for open source program offices. We previously discussed licensing for AI models, and how many models are restricted to non-commercial use. But open source is coming for AI. Thankfully, the lessons learned managing open source apply to managing AI as well.

The non-commercial licensing in AI is contrary to the broader trend in the industry. Open source is part of all modern software development. Developers are trained with open source and they have come to expect it. That is why we predicted that it wouldn't be long before there were open source AI models designed to compete against the closed models provided by current vendors.

We didn't have to wait long.

New permissively licensed models

For a long time, the only large language models with open licensing were the EleutherAI GPT-J series. But in just the past week, two new models have appeared: "Dolly" from Databricks (see also the model card on Hugging Face), and RedPajama from Together AI (see also the data on GitHub). Both of these models have been trained to engage in chat-style interaction like ChatGPT.

What is especially exciting, however, is the licensing and data availability. Dolly is released under CC-BY-SA and RedPajama under the Apache License. These models also include weights and, in the case of RedPajama, terabytes of training data that organizations can use to refine their own internal models.

Recommendations for organizations

Just as with open source, AI is coming in "from the bottom" of the organization. Your developers are likely already using AI tools, at least to explore. This is the time to start working with them to avoid the kinds of risk that come with integrating outside code. A few suggestions for action:

1. Get a policy document out quickly

If you start out by acknowledging AI tool use and its potential, it will help you get into a collaborative frame with developers and others who will want to use these tools. Create a policy document that recognizes that this is a fast-moving area and allows exploratory use. Mark the draft as a living document that will be updated regularly.

For example, OpenAI is close to rolling out a feature similar to Amazon CodeWhisperer: if the generated source code is similar to any input source code, they will alert the user to that fact. They are also reportedly working on providing automatic license and attribution information. This will make it less risky to use Codex/Copilot, but it will also make the end user bear the liability for noncompliance after having been warned of the existence of the licensed code. This upcoming development should prompt an update of some internal AI policies.

2. Prioritize AI-related requests

Right now there are too many unknowns to grant broad exemptions for AI tools, but there are likely many smaller-scale acceptable uses. State in your policy that you will respond quickly to questions, and then do so. Each question you receive will be valuable in helping you identify the real uses for AI inside your organization.

3. Keep your eyes on terms of service

We previously wrote about OpenAI's Terms of Service and how they can restrain your organization. But that is not the only potential problem out there.

For example, at one point we were hearing reports of leaked trade secrets due to employee use of ChatGPT. We now have a public report of a leak of some Samsung data, serious enough that Samsung is reportedly restricting the size of what can be sent to the public ChatGPT site. (See https://mashable.com/article/samsung-chatgpt-leak-details.)

But if you read ChatGPT's terms of service, this is expected. User information provided through an OpenAI API is “opt-in” for use by OpenAI. However, any information provided through a “non-API service,” including ChatGPT, is retained and used for fine-tuning their language models.

4. Structure your code for flexibility

Right now many companies are using OpenAI's APIs to get started with AI. OpenAI's APIs are easy to use and relatively inexpensive. But over the long run, there isn't going to be just one model your company needs. Nor will there be just one organization.

As your teams start to think about how to integrate AI into your offerings, have them also spend time structuring their work so that they can "plug in" and cross-validate different backends. For example, create a service that takes the prompt, processes it using the model, and returns the response. You can have multiple services, one for each backend. Not only is there engineering value in being able to swap between them, but this structure can help you develop your own work. You can use the existing models as training oracles, learning from the observed behavior of an already-trained model.
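One way to picture the "plug in" structure is a small service that routes a prompt to whichever backend is registered and can also run the same prompt through all of them. The Python sketch below is only an illustration: the names `Backend`, `FunctionBackend`, `PromptService`, and `cross_validate` are hypothetical, and the two backends are stand-in functions rather than real model clients. In practice each backend would wrap a vendor API or a locally hosted model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class Backend(Protocol):
    """Anything that can turn a prompt into a response."""

    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class FunctionBackend:
    """Wraps a plain function so it can stand in for a real model client."""

    name: str
    fn: Callable[[str], str]

    def complete(self, prompt: str) -> str:
        return self.fn(prompt)


class PromptService:
    """Routes prompts to named backends so callers never depend on one vendor."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, backend: Backend) -> None:
        self._backends[backend.name] = backend

    def complete(self, prompt: str, backend: str) -> str:
        # Swap models by changing only the backend name.
        return self._backends[backend].complete(prompt)

    def cross_validate(self, prompt: str) -> Dict[str, str]:
        # Send the same prompt to every backend and collect the answers.
        return {name: b.complete(prompt) for name, b in self._backends.items()}


# Stand-in backends (assumptions for illustration only).
service = PromptService()
service.register(FunctionBackend("hosted-stub", lambda p: p.upper()))
service.register(FunctionBackend("local-stub", lambda p: p[::-1]))

results = service.cross_validate("hello")
```

Because callers only ever see `PromptService`, adding a newly released open model means registering one more backend, and the collected `cross_validate` responses can be logged for comparison.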

5. Gather and track information

The key to staying on top of AI, just like open source, will be keeping track of what your organization is using. We don't yet have "AI scanners" like we have code scanners, but even just keeping an internal spreadsheet of AI uses will help you when there is the inevitable security or licensing question.
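Even the spreadsheet can start as a few lines of code. The sketch below assumes a hypothetical record layout (`team`, `tool`, `model_license`, `data_shared`, `terms_reviewed`); these columns are suggestions, not a standard, and the two rows are made-up placeholders. It writes the inventory as CSV and lists the entries whose terms of service have not yet been reviewed.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class AIUse:
    """One row of the internal AI inventory (hypothetical columns)."""

    team: str
    tool: str
    model_license: str
    data_shared: str
    terms_reviewed: bool


def to_csv(records: List[AIUse]) -> str:
    """Serialize the inventory so it can be opened as a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(AIUse)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


def needs_review(records: List[AIUse]) -> List[AIUse]:
    """Return the uses whose terms of service nobody has reviewed yet."""
    return [r for r in records if not r.terms_reviewed]


# Placeholder entries for illustration only.
inventory = [
    AIUse("platform", "internal chat prototype", "CC-BY-SA", "none", True),
    AIUse("support", "hosted chat site", "vendor terms", "ticket text", False),
]

sheet = to_csv(inventory)
pending = needs_review(inventory)
```

When a security or licensing question arrives, `needs_review` gives a starting list of which uses to look at first.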