Deep Search Open Source Toolbox From IBM Research

In an effort to accelerate the pace of scientific discovery, we are for the first time making a portion of the IBM Deep Search Experience available through an open source framework. Documents such as financial accounts, technical specifications, legal briefs, research papers, and slide decks form the foundation of every organisation. These documents contain a wealth of useful information, but because they are typically in an unstructured format that cannot easily be converted into a database, their contents are difficult to search. In fact, according to IDC, 80 percent of all data worldwide will be unstructured by 2025.

Scientists and organisations can already search through mounds of unstructured data thanks to IBM Research’s Deep Search. Today, with the launch of Deep Search for Scientific Discovery (DS4SD), an open-source toolbox for scientific research and enterprises, we’re making Deep Search even more adaptable and accessible. Following the release of the Generative Toolkit for Scientific Discovery (GT4SD) in March, DS4SD is the next step toward our ultimate aim of creating an Open Science Hub for Accelerated Discovery.

To help achieve this objective, we are now publicly releasing our automatic document conversion service, a crucial part of the Deep Search Experience. Users can upload documents interactively and inspect the quality of the conversion. The drag-and-drop interface of DS4SD makes this simple for non-experts. Additionally, we’re introducing deepsearch-toolkit, a Python library that enables programmatic bulk document uploading and conversion. Users can instruct the toolkit to upload documents from a folder, convert them, and then analyse the text, tables, and figures inside.
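The folder-based workflow described above can be sketched in a few lines of Python. This is a minimal illustration only and does not call deepsearch-toolkit itself: `find_documents`, `convert_folder`, and `fake_convert` are hypothetical names, and `fake_convert` is a stand-in for whatever conversion call the real toolkit provides.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def find_documents(folder: Path, suffixes: Iterable[str] = (".pdf",)) -> List[Path]:
    """Return all files under `folder` (recursively) with a wanted suffix."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


def convert_folder(folder: Path, convert: Callable[[Path], Dict]) -> Dict[str, Dict]:
    """Run `convert` on every document in `folder`, keyed by relative path."""
    folder = Path(folder)
    results: Dict[str, Dict] = {}
    for path in find_documents(folder):
        key = path.relative_to(folder).as_posix()
        results[key] = convert(path)
    return results


def fake_convert(path: Path) -> Dict:
    """Hypothetical stand-in for a real conversion service call.

    It returns an empty result with the three kinds of content the
    article mentions: text, tables, and figures.
    """
    return {"source": str(path), "text": [], "tables": [], "figures": []}
```

In practice, `fake_convert` would be replaced by a function that uploads the file to the conversion service and returns the parsed result. The text, tables, and figures of each document could then be analysed from the returned dictionary.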

Through our Python package, data scientists and engineers have access to the new toolbox, which integrates with existing services. Because the toolkit is open source, we welcome contributions from the developer community.

Unstructured data has a lot to offer scientific inquiry. Take IBM’s Project Photoresist as an illustration: in 2020, we used Deep Search to discover and create a brand-new photoacid-generating molecule for semiconductor production. We were looking for a more suitable alternative because existing photoacid generators pose environmental risks. By the end of 2020, we had identified three candidate photoacid generators, thanks to Deep Search’s ability to ingest data up to 1,000 times faster and screen it up to 100 times faster than a manual alternative. Our end-to-end, AI-powered workflow significantly sped up the discovery process, working at a scale and pace that human scientists simply cannot match.

Deep Search is an AI-powered method of gathering, converting, curating, and finally searching through large document collections for information that is too specialised for standard search engines to handle. It processes these documents using specialised natural language processing, computer vision, and machine learning techniques to produce searchable knowledge graphs. The generated datasets can help firms develop models and identify significant trends that guide their decisions. For instance, they might examine the CEO turnover and financial performance of a target acquisition over the previous five years. Healthcare, climate science, and materials research are just a few of the fields where Deep Search has exciting applications. Deep Search also makes it simpler to start searching massive document collections.
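To make the idea of a searchable knowledge graph concrete, here is a toy sketch in Python. It is not how Deep Search is implemented: the `KnowledgeGraph` class, the relation names, and the company and people in the example are all hypothetical, chosen only to mirror the CEO turnover example above.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple


class KnowledgeGraph:
    """Toy graph storing (subject, relation, object) facts."""

    def __init__(self) -> None:
        # Maps each subject to its outgoing (relation, object) edges.
        self._edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._edges[subject].append((relation, obj))

    def query(self, subject: str, relation: Optional[str] = None) -> List[str]:
        """Return objects linked to `subject`, optionally filtered by relation."""
        return [
            obj for rel, obj in self._edges.get(subject, [])
            if relation is None or rel == relation
        ]


def build_example_graph() -> KnowledgeGraph:
    """Populate the graph with made-up facts of the kind a converter might extract."""
    kg = KnowledgeGraph()
    kg.add("ExampleCo", "had_ceo", "Person A")
    kg.add("ExampleCo", "had_ceo", "Person B")
    kg.add("ExampleCo", "reported_revenue", "FY-1: 10M")
    return kg
```

Calling `build_example_graph().query("ExampleCo", "had_ceo")` returns the CEOs recorded for the hypothetical company, which is the kind of lookup a keyword search over raw documents cannot answer directly.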

Previously, users had to supply their own data or documents for Deep Search to search. Deep Search now includes more than 364 million public documents, such as academic articles and patents. Commercial Deep Search users can begin searching this data immediately and add their own data incrementally. The public launch of our automatic document conversion service is just the beginning for DS4SD. Future releases will add new features, such as AI models and high-quality data sources.
