Jump To Top


Linux Foundation unveils new permissive license for open data collaboration

Elevate your enterprise data technology and strategy at Transform 2021.

The Linux Foundation has announced a new permissive license designed to help foster collaboration around open data for artificial intelligence (AI) and machine learning (ML) projects.

It has often been said that data is the new oil, but for AI and ML projects in particular, having access to expansive and diverse data sets is key to reducing bias and building powerful models capable of all manner of intelligent tasks. To machines, data is a little like “experience” is to humans — the more of it you have, the better decisions you are likely to make.

With CDLA-Permissive-2.0, the Linux Foundation is building on its previous efforts to encourage data-sharing efforts through licensing arrangements that clearly define how the data — and any derivative data sets — can and can’t be used.

Data pools

The Linux Foundation first introduced the Community Data License Agreement (CDLA) back in 2017 to entice organizations to open up their vast pools of (underused) data to third-parties. There were two original licenses, a sharing license with a “copyleft” reciprocal commitment borrowed from the open source software sphere, stipulating that any derivative data sets built from the original data set much be shared under a similar license; and a permissive license (1.0) without any such obligations in place (similar to how someone might define “true” open source software).

Licenses are basically legal documents that outline how a piece of work (in this case data sets) can be used or modified, but oftentimes specific phrases, ambiguities, or exceptions can be enough to make companies run a mile if they think that releasing content under a specific license could cause them problems somewhere down the line. And that is where the CDLA-Permissive-2.0 license comes into play — it’s essentially a rewrite of version 1.0, but it’s shorter and simpler to follow. But more than that, it has removed certain provisions that were deemed unnecessary or burdensome, and which may have hindered broader use of the license.

For example, version 1.0 of the license included obligations that data recipients preserve attribution notices in the data sets. For context, attribution notices or statements are standard in the software sphere, where a company that releases software built on open source components have to credit the creators of these components in their own software license. But according to the Linux Foundation, feedback it received from the community, and lawyers representing companies involved in open data projects, indicated that there were” challenges involved with associating attributions with data (or versions of data sets).

So while data source attribution is still an option, and might make sense for specific projects particularly where transparency is paramount, it is no longer a condition for businesses looking to share data under the new permissive license. The chief obligation that remains is that the main community data license agreement text is included with the new data sets.

Data vs software

This also helps to highlight how transposing a concept from a software license onto a data set license doesn’t always make sense, partly because laws and regulations usually treat data differently to software and other similar creative content.

“Data is different from software,” Linux Foundation’s VP of compliance and legal Steve Winslow told VentureBeat. “Open source software is typically made of copyrightable works, where authorship is important. By contrast, data may frequently have little or no applicable intellectual property rights, and authorship and attribution are often less important.”

But isn’t attribution still desirable, even if it’s not always going to be applicable or relevant? According to Winslow, enforcing data attribution could have some negative consequences in terms of organizations’ willingness to collaborate around data.

“Some data recipients may still choose to attribute the data, to show that the data is trustworthy based on its source,” Winslow said. “But it will be their call, and not a requirement, as it could impose limitations on how to organize and analyze the data or force unintended burdens on data collaboration.”

For example, let’s assume data from multiple contributors — which could run into the thousands — is pooled into a single data set. If the data set is only ever used in that combined form, then attribution isn’t a huge burden. But if the data set is subsequently split into subsets which are redistributed separately or combined with a different data set, then that creates a whole lot of work in terms of determining which attributions apply to the new data set. In short, things can rapidly descend into confusion and chaos.


Several companies have already revealed plans to make their existing open data sets available under the new CDLA-Permissive-2.0 license, including Microsoft’s research arm which will now transition some of its open data sets, including Hippocorpus, Public Perception of Artificial Intelligence, Xbox Avatars Descriptions, Dual Word Embeddings, and GPS Trajectory.

Elsewhere, IBM’s Project CodeNet, NOAA JFK, Airline Reporting Carrier On-Time Performance, PubLayNet, and Fashion-MNIST will also transition to the new license.


  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Source: Read Full Article