DeepMind, in partnership with the European Molecular Biology Laboratory (EMBL), a laboratory for the life sciences, today announced that it will open-source a protein structure dataset. The research firm describes it as the most complete and accurate database of proteins expressed by the human genome. It is said to cover all ~20,000 of these proteins, and the data will be freely and openly available to the scientific community, the organisations said in a press conference.
The database and artificial intelligence system provide structural biologists with powerful new tools for examining a protein’s three-dimensional structure.
The AlphaFold Protein Structure Database builds on AlphaFold and on the discoveries of generations of scientists, including the early pioneers of protein imaging and crystallography. The database expands the accumulated knowledge of protein structures, more than doubling the number of high-accuracy human protein structures available to researchers. Advancing the understanding of these building blocks of life, which underpin every biological process in every living thing, will help researchers across a huge variety of fields accelerate their work, the researchers noted.
“We used AlphaFold to generate the most complete and accurate picture of the human proteome. We believe this represents the most significant contribution AI has made to advancing scientific knowledge to date, and is a great illustration of the sorts of benefits AI can bring to society,” said DeepMind Founder and CEO Demis Hassabis, PhD.
The ability to predict a protein’s shape computationally from its amino acid sequence, rather than determining it experimentally through years of painstaking, laborious and often costly techniques, is already helping scientists.
“AlphaFold was trained using data from public resources built by the scientific community so it makes sense for its predictions to be public. Sharing AlphaFold predictions openly and freely will empower researchers everywhere to gain new insights and drive discovery,” said EMBL Director General Edith Heard.
In addition to the human proteome, the database launches with ~350,000 structures, including those of 20 biologically significant organisms such as E. coli, fruit fly, mouse, zebrafish, the malaria parasite and tuberculosis bacteria.
The research firm said the database and system will be periodically updated as it continues to invest in future improvements to AlphaFold, and that over the coming months it plans to vastly expand coverage to almost every sequenced protein known to science: over 100 million structures covering most of the UniProt reference database.