Linux Foundation Launches New License Agreement for Open Source Datasets

July 20, 2021


The Linux Foundation has announced a new permissive license to help foster collaboration around open data for AI and machine learning projects. IBM and Microsoft have already made datasets available under CDLA-Permissive-2.0, the foundation said.

“The open source licensing and collaboration model has made AI accessible to everyone, and formalised a two-way street for organisations to use and contribute to projects with others helping accelerate applied AI research. CDLA-Permissive-2.0 is a major milestone in achieving that type of success in the Data domain, providing an open source license specific to data that enables access, sharing and using data among individuals and organisations. The LF AI & Data community appreciates the clarity and simplicity CDLA-Permissive-2.0 provides,” says Dr Ibrahim Haddad, executive director of LF AI & Data, in a press statement.

CDLA-Permissive-2.0

According to LF, it launched CDLA 1.0 in October 2017 to give recipients of data under CDLA clear and explicit rights to use, share and modify the data for any purpose. The license also permitted using the results of analysing the data to create AI and ML models, without sharing the data itself.

CDLA-Permissive-2.0 is said to be short, with a single obligation: anyone sharing data must make the text of the agreement available with the shared data. It is also said to have eliminated confusing terms.

“The IBM Center for Open Source Data and AI Technologies (CODAIT) will begin to re-license its public datasets hosted here using the CDLA-Permissive 2.0, starting with Project CodeNet, a large-scale dataset with 14 million code samples developed to drive algorithmic innovations in AI for code tasks like code translation, code similarity, code classification, and code search,” reads the statement.

LF also listed data sets available from Microsoft Research under the new license:

The Hippocorpus dataset, which comprises diary-like short stories about recalled and imagined events to help examine the cognitive processes of remembering and imagining and their traces in language;
The Public Perception of Artificial Intelligence data set, comprising analyses of text corpora over time to reveal trends in beliefs, interest, and sentiment about a topic;
The Xbox Avatars Descriptions data set, a corpus of descriptions of Xbox avatars created by actual gamers;
A Dual Word Embeddings data set, trained on Bing queries, to facilitate information retrieval about documents; and
A GPS Trajectory data set, containing 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours.