The Datasets

Living datasets backed by automated experiments and open source methods.

The Open Datasets Initiative is a new paradigm for funding, collecting, and sharing large, high-fidelity datasets in biology.

We are working to accelerate community-driven use of automated labs to pioneer robust data collection methods, with the aim of curating high-fidelity, AI-ready biological datasets. Our datasets are meant to enable scientists to create powerful predictive models that advance the life sciences.

We are doing this by:

1. Establishing collaborative teams of scientists, machine learning specialists, and automation experts to identify the most important datasets that should be collected in life science.

2. Developing peer-reviewed proposals for high-throughput measurement techniques to robustly and scalably collect data.

3. Funding the collection of open datasets and providing project management.

Why we need predictive models

Life science is on the cusp of a monumental transformation. “AI capabilities are growing rapidly, and now is the time to develop broader predictive models that can provide answers to unanswered questions at every size scale of biology: from molecules, to whole cells, to the behavior of cells at the macroscale… the next century will resemble a coordinated, whole-field effort to divide biology into a series of prediction tasks and then solve those tasks, one-by-one.” Read more perspectives from our founder Erika in her essay “What Biology Can Learn from Physics.”

Learn more → Read the whole article

Why we need datasets

Machine learning and AI hold promise for research, but progress is often hindered by insufficient, poorly curated, and privately held biology datasets. Reproducing life science experiments is also difficult: it requires resources and time, and workflows vary across labs. This scarcity of data and the lack of standardized experiments impede the creation of large, high-fidelity datasets for machine learning. In this talk, Erika delves into these challenges. She describes the need to identify important datasets to be collected, and how we have started to develop automated measurement techniques that scale data collection robustly and to fund the collection of open datasets.

Learn more → Watch the whole talk

Want to know more about our dataset creation process?

Visit our Datasets in Detail page

OUR DATASETS

Our datasets are massive, living, and open.

We bring together scientists, machine learning specialists, and automation experts to identify the most important datasets that should be collected in life science.

We review a multitude of diverse dataset ideas submitted by scientists worldwide, organize working groups to create dataset proposals, and run a peer-review process before developing automated measurement techniques to robustly and scalably collect data.