The Datasets

Living datasets backed by automated experiments and open source methods.

The Open Datasets Initiative is a new paradigm for funding, collecting, and sharing large, high-fidelity datasets in biology.

We are working to accelerate community-driven use of automated labs to pioneer robust data collection methods, with the goal of curating high-fidelity, AI-ready biological datasets. These datasets are meant to enable scientists to create powerful predictive models that advance the life sciences.

We are doing this by:

1. Establishing collaborative teams of scientists, machine learning specialists, and automation experts to identify the most important datasets that should be collected in life science.

2. Developing peer-reviewed proposals for high-throughput measurement techniques to robustly and scalably collect data.

3. Funding the collection of open datasets and providing project management.

Why we need predictive models

Life science is on the cusp of a monumental transformation. “AI capabilities are growing rapidly, and now is the time to develop broader predictive models that can provide answers to unanswered questions at every size scale of biology: from molecules, to whole cells, to the behavior of cells at the macroscale… the next century will resemble a coordinated, whole-field effort to divide biology into a series of prediction tasks and then solve those tasks, one-by-one.” Read more from our founder Erika in her essay “What Biology Can Learn from Physics.”

Learn more → Read the whole article

Why we need datasets

Machine learning and AI hold promise for research, but progress is often hindered by insufficient, poorly curated, and privately held biology datasets. Reproducing life science experiments is also difficult: it takes significant resources and time, and workflows vary across labs. The scarcity of data and the lack of standardized experiments impede the creation of large, high-fidelity datasets for machine learning. To address this, we identify the most important datasets that should be collected in life science, create automated measurement techniques to robustly scale data collection, and fund the collection of open datasets.

Learn more about our process → Datasets in Detail

Have a dream dataset or predictive model?

OUR DATASETS

Our datasets are massive, living, and open.

We bring together scientists, machine learning specialists, and automation experts to identify the most important datasets that should be collected in life science.

We review a wide range of dataset ideas submitted by scientists worldwide, organize working groups to turn them into dataset proposals, and run a peer-review process before developing automated measurement techniques to robustly and scalably collect data.

ACTIVE DATASET

Protein Sequence to Function

ACTIVE DATASET (COMING SOON)

Protein Sequence to Expression

Datasets in Incubation

Datasets Progress

Interested in getting involved?

Email us at datasets@alignbio.org

Supported by

A philanthropic initiative founded by Eric and Wendy Schmidt.

A civic engagement initiative by Ken Griffin.