The Datasets
Living datasets backed by automated experiments and open source methods.
The Open Datasets Initiative is a new paradigm for funding, collecting, and sharing large, high-fidelity datasets in biology.
We are working to accelerate community-driven use of automated labs to pioneer robust data collection methods and curate high-fidelity, AI-ready biological datasets. These datasets are meant to enable scientists to create powerful predictive models that advance the life sciences.
We are doing this by:
1. Establishing collaborative teams of scientists, machine learning specialists, and automation experts to identify the most important datasets that should be collected in life science.
2. Developing peer-reviewed proposals for high-throughput measurement techniques to robustly and scalably collect data.
3. Funding the collection of open datasets and providing project management.
Why we need predictive models
Life science is on the cusp of a monumental transformation. “AI capabilities are growing rapidly, and now is the time to develop broader predictive models that can provide answers to unanswered questions at every size scale of biology: from molecules, to whole cells, to the behavior of cells at the macroscale… the next century will resemble a coordinated, whole-field effort to divide biology into a series of prediction tasks and then solve those tasks, one-by-one.” Read more perspectives from our founder Erika in her essay What Biology Can Learn from Physics.
Learn more → Read the whole article
Why we need datasets
Machine learning and AI hold promise for research, but progress is often hindered by biology datasets that are insufficient, poorly curated, or privately held. Reproducing life science experiments is also difficult: it requires significant resources and time, and workflows vary across labs. This scarcity of data and the lack of standardized experiments impede the creation of large, high-fidelity datasets for machine learning. In this talk, Erika delves into these challenges. She describes the need to identify important datasets to be collected and how we have started to create automated measurement techniques to robustly scale data collection and to fund the collection of open datasets.
Learn more → Watch the whole talk
Want to know more about our dataset creation process?
OUR DATASETS
Our datasets are massive, living, and open.
We bring together scientists, machine learning specialists, and automation experts to identify the most important datasets that should be collected in life science.
We review a multitude of diverse dataset ideas submitted by scientists worldwide, organize working groups to create dataset proposals, and run a peer-review process before developing automated measurement techniques to robustly and scalably collect data.
ACTIVE DATASET
Protein Sequence to Function
ACTIVE DATASET
Protein Sequence to Expression
ACTIVE DATASET (COMING SOON)
Microbe Genome to Phenome
Datasets in Incubation
Questions? Check out our FAQs page.
Interested in getting involved?
Email us at datasets@alignbio.org
Supported by
A philanthropic initiative founded by Eric and Wendy Schmidt.
A civic engagement initiative by Ken Griffin.