[Image: Computer laser digitizing an object, futuristic]

Your OSPO Toolkit: Scanning

by VanL

The core administrative function of an Open Source Program Office is making sure you know what open source software your organization is using. Every other function relies on this basic knowledge. If you don't know what software you are using, you can't comply with the licenses, you can't respond to security issues, and you can't engage with the larger community. So how do you get that information? In a word, scanning.

Software Composition Analysis

Software Composition Analysis (SCA) is the process of analyzing your source code to determine and catalog the open source projects you are using. This definition is deliberately broad. There are lots of different kinds of tools that can all be considered "scanning." But they all have a common goal: to provide you with a list of the open source projects that are included in or used by your code.

The best-known SCA tool in the OSS world is Black Duck. Black Duck (now offered by Synopsys) was the first vendor to regularize the process of scanning, and its name has become - especially for some lawyers in the M&A space - a generic term for an open source scan. In the intervening years, however, a number of competitors have sprung up, including open source options like Fossology and Scancode as well as commercial vendors like Fossa, Mend, ScanOSS, and JFrog XRay. Security companies have also started to produce SCA tools, like Snyk and CodeSentry. This is not a complete list. New tools come out frequently.

Because there is so much source code in any product, SCA tools all start with some sort of automated scan. But not all scans are the same. There are several different categories of tools, each with its own strengths and weaknesses. The major varieties are:

License scanning: These tools are engineered to detect copyright statements and license texts. Once legally significant language is found, the tool tries to identify the source code files to which the license or copyright statements apply. The strength of these tools is that they work on any type of source code and don't rely on having seen the code before. However, they can't detect copied-and-pasted code, and they only work if licensing and copyright statements are actually present in the code. These tools produce very few false positives, but humans still need to verify the results because copyright and license statements may be organized in a way that the software doesn't understand. Fossology and Scancode are examples of this kind of tool.
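
To make the approach concrete, here is a toy license scanner in Python. It walks a source tree (the "src" directory is just a placeholder), flags copyright statements, and checks for a handful of illustrative license markers. Real tools like Fossology and Scancode use far larger rule sets and then group the findings by package and project, but the core detection step looks roughly like this:

    # Toy license/copyright scanner - illustrative patterns only, not a real rule set.
    import re
    from pathlib import Path

    COPYRIGHT = re.compile(r"Copyright\s+(?:\(c\)|©)?\s*\d{4}[^\n]*", re.IGNORECASE)
    LICENSE_MARKERS = {
        "Apache-2.0": re.compile(r"Apache License,?\s+Version 2\.0", re.IGNORECASE),
        "GPL": re.compile(r"GNU General Public License", re.IGNORECASE),
        "MIT": re.compile(r"Permission is hereby granted, free of charge", re.IGNORECASE),
        "SPDX tag": re.compile(r"SPDX-License-Identifier:\s*\S+"),
    }

    def scan_file(path: Path) -> dict:
        """Return the copyright statements and license markers found in one file."""
        text = path.read_text(errors="ignore")
        found = {"copyrights": COPYRIGHT.findall(text), "licenses": []}
        for name, pattern in LICENSE_MARKERS.items():
            if pattern.search(text):
                found["licenses"].append(name)
        return found

    if __name__ == "__main__":
        for f in Path("src").rglob("*"):  # walk every file under the placeholder tree
            if f.is_file():
                result = scan_file(f)
                if result["copyrights"] or result["licenses"]:
                    print(f, result)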

Dependency scanning: These tools use various languages' built-in package or module tools to identify code. For example, a scanner might look at Maven files for Java, package.json files for JavaScript, and requirements.txt files for Python. The tool then uses some sort of existing database to map each package to its license. The benefit of these tools is that they are very fast and - for supported languages - they can work recursively, identifying not just your direct dependencies but also the dependencies of those dependencies. However, these tools can't detect copied-and-pasted code, they require existing knowledge of open source library licenses, and they don't work for some languages (notably C and C++). Most modern tools are of this sort, because they are fast enough to run on every commit and they have relatively few false negatives or positives. Fossa and JFrog XRay are examples of this kind of tool.
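
As a rough illustration, the Python sketch below reads the dependencies declared in a requirements.txt or package.json and maps each package to a license using a small hard-coded table. The KNOWN_LICENSES dictionary is a stand-in for the large metadata databases that real tools query, and the sketch skips the recursive resolution of transitive dependencies that commercial tools perform:

    # Toy dependency scanner: direct dependencies only, tiny license table.
    import json
    import re
    from pathlib import Path

    KNOWN_LICENSES = {  # illustrative stand-in for a real license database
        "requests": "Apache-2.0",
        "flask": "BSD-3-Clause",
        "react": "MIT",
    }

    def read_requirements(path: Path) -> list[str]:
        """Extract package names from a requirements.txt file."""
        names = []
        for line in path.read_text().splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments and whitespace
            if line:
                names.append(re.split(r"[<>=!~\[;]", line, maxsplit=1)[0].strip().lower())
        return names

    def read_package_json(path: Path) -> list[str]:
        """Extract direct dependencies from a package.json file."""
        data = json.loads(path.read_text())
        return list(data.get("dependencies", {})) + list(data.get("devDependencies", {}))

    def report(packages: list[str]) -> None:
        for pkg in packages:
            print(f"{pkg}: {KNOWN_LICENSES.get(pkg, 'UNKNOWN - needs review')}")

    if __name__ == "__main__":
        if Path("requirements.txt").exists():
            report(read_requirements(Path("requirements.txt")))
        if Path("package.json").exists():
            report(read_package_json(Path("package.json")))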

Snippet scanning: Snippet scanning tools perform an exhaustive scan of the source code, creating code "fingerprints" that are matched against a knowledge base. Snippet scanning tools can identify both whole files and "fuzzy" matches of three lines or more. These tools are the only ones that can reliably find copied-and-pasted code and the only ones that work on non-package-managed languages like C and C++. However, they can be extraordinarily slow, they require pre-existing databases that associate code fingerprints with known code, and their output is very "noisy," with lots of false positives. Extensive human review is frequently needed to clear the output of these tools. ScanOSS and Black Duck Codeprint/Snippet analysis are examples of this kind of tool.
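
The fingerprinting idea can be sketched in a few lines of Python: hash every normalized three-line window of a file and look the hashes up in an index. The tiny in-memory KNOWLEDGE_BASE below stands in for the billion-line knowledge bases that commercial snippet scanners maintain:

    # Toy snippet fingerprinting: hash sliding three-line windows of normalized code.
    import hashlib

    WINDOW = 3  # matches the "three lines or more" granularity described above

    def normalize(line: str) -> str:
        """Strip whitespace and lowercase so trivial edits don't defeat matching."""
        return "".join(line.split()).lower()

    def fingerprints(source: str) -> set[str]:
        lines = [normalize(l) for l in source.splitlines() if normalize(l)]
        return {
            hashlib.sha1("\n".join(lines[i:i + WINDOW]).encode()).hexdigest()
            for i in range(len(lines) - WINDOW + 1)
        }

    # Pretend knowledge base: fingerprints of one known open source snippet.
    known_snippet = """for (i = 0; i < n; i++) {
        total += values[i];
    }"""
    KNOWLEDGE_BASE = {fp: "example-oss-project" for fp in fingerprints(known_snippet)}

    def scan(source: str) -> set[str]:
        """Return the projects whose fingerprints overlap with this source."""
        return {KNOWLEDGE_BASE[fp] for fp in fingerprints(source) & KNOWLEDGE_BASE.keys()}

    if __name__ == "__main__":
        pasted = "// copied from somewhere\n" + known_snippet
        print(scan(pasted))  # {'example-oss-project'} - the pasted snippet is flagged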

Binary scanning: All of the products above are focused on source code scanning. Binary scanning tools are optimized for searching compiled code, not source code. They may use any of the strategies above - looking for copyright text in the executable files, looking at manifests, or identifying projects based on the patterns the source code leaves in the compiled output. Binary scanners are useful when all you have is a binary file, but they are much less sensitive than source code scanners. CodeSentry and Black Duck binary scanning are examples of this kind of tool.
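
One of those strategies - pulling printable strings out of a compiled file and searching them for copyright or license text, much like the Unix "strings" utility - can be sketched in Python as follows. Real binary scanners also examine symbol tables, embedded manifests, and code-level patterns:

    # Toy binary scanner: find license-related text that survived compilation.
    import re
    import sys

    PRINTABLE = re.compile(rb"[ -~]{8,}")  # runs of 8+ printable ASCII bytes
    INTERESTING = re.compile(r"copyright|license|gpl|apache|bsd", re.IGNORECASE)

    def scan_binary(path: str) -> list[str]:
        with open(path, "rb") as fh:
            data = fh.read()
        strings = (m.group().decode("ascii", "ignore") for m in PRINTABLE.finditer(data))
        return [s for s in strings if INTERESTING.search(s)]

    if __name__ == "__main__":
        # Usage: python binscan.py ./some_executable
        for hit in scan_binary(sys.argv[1]):
            print(hit)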

There are also a number of container-specific scanning tools. These tools usually apply one or more of the scanning strategies above, but perform the scans on the container's multiple layers. Beware, though: containers often bundle base-image packages and other files that never appear in your source tree, so they can include lots of software that would otherwise go unscanned.
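
As a rough sketch of the layer-by-layer idea, the Python snippet below walks an image tarball produced by docker save and lists the files each layer contributes, so that every layer - not just the final filesystem - can be handed to one of the scanners above. It assumes the legacy docker-save layout, where each layer is a nested .tar member; OCI-layout images store layers as content-addressed blobs instead:

    # Toy container-layer walk over a "docker save" style image tarball.
    import sys
    import tarfile

    def walk_layers(image_tar: str):
        with tarfile.open(image_tar) as image:
            for member in image.getmembers():
                if member.isfile() and member.name.endswith(".tar"):  # one filesystem layer
                    layer = tarfile.open(fileobj=image.extractfile(member))
                    files = [m.name for m in layer.getmembers() if m.isfile()]
                    yield member.name, files

    if __name__ == "__main__":
        # Usage: docker save myimage:latest -o image.tar && python layerscan.py image.tar
        for layer_name, files in walk_layers(sys.argv[1]):
            print(f"{layer_name}: {len(files)} files to scan")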

The output of these scanners is a list of files or packages used in your software, sometimes called a Software Bill of Materials (SBOM). This SBOM is one of the concrete artifacts that you can use to identify software vulnerabilities that need mitigation, and that you can provide to others to demonstrate partial compliance with open source licenses.
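
For illustration, here is a Python sketch of turning scan results into a minimal SBOM document. The structure loosely follows the CycloneDX JSON format but is deliberately simplified; real SBOMs carry many more fields (suppliers, hashes, package URLs, and so on), and tools normally generate them directly from the scans described above:

    # Sketch of a minimal, simplified CycloneDX-style SBOM.
    import json

    def make_sbom(components: list[dict]) -> str:
        sbom = {
            "bomFormat": "CycloneDX",
            "specVersion": "1.5",
            "components": [
                {
                    "type": "library",
                    "name": c["name"],
                    "version": c.get("version", "unknown"),
                    "licenses": [{"license": {"id": c["license"]}}] if c.get("license") else [],
                }
                for c in components
            ],
        }
        return json.dumps(sbom, indent=2)

    if __name__ == "__main__":
        # Components as they might come back from one of the scans sketched above.
        print(make_sbom([
            {"name": "requests", "version": "2.31.0", "license": "Apache-2.0"},
            {"name": "react", "version": "18.2.0", "license": "MIT"},
        ]))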

So what is best? It depends upon the programming languages you are using, the level of access you have to the code, the level of risk you are willing to take on, and the amount of time available for review. M&A attorneys, for example, will almost always insist on snippet scanning tools, as those allow for the most comprehensive clearance of IP issues. For other use cases, particularly for languages with package managers, dependency scans offer good integration with many CI/CD (continuous integration/continuous deployment) toolchains.