The Datasets

Living datasets backed by automated experiments and open source methods.

Frequently Asked Questions

  • If you have an idea but aren’t yet sure how to expand it to fit the full 1-2 page proposal, we still want to hear from you! You can fill out the Idea Submission Form or email us directly at datasets@alignbio.org. The ultimate goal is to reach consensus around a set of ideas, then create full proposals and working teams dedicated to each project.

  • We will convene a review panel of experts representing relevant life sciences and machine learning disciplines to review proposals. This panel will work with members of The Open Datasets Initiative, drawing on information gathered from discussions and workshops, to determine the most important datasets that are also feasible to collect using open-source, automated methods.

  • If you submit an idea, or a list of ideas, via our Idea Submission Form, we will see if consensus can be formed around your idea. If it can, we will inform you and invite you to participate in the working group that develops methods for collecting the dataset.

  • We will first surface proposals to folks within Align to Innovate/Open Datasets Initiative and designated review panel members. If we find proposals that complement each other, we may choose to connect the submitters of those proposals early on through email and/or online meetings for initial discussions and further brainstorming. When we find promising proposals, we will also ask submitters to define and refine their proposals in more detail. We will also hold at least one workshop with people from Align to Innovate/Open Datasets Initiative, review panel members, some proposal submitters, and others we think would help further the discussion; each workshop will involve both larger group discussions and smaller breakout sessions about submitted proposal ideas and overarching themes. The workshops will help us define unanswered questions and flesh out details, leading to definitive datasets to collect, methods to implement, and timelines to adhere to.

  • Yes! If you have resources or samples that can be contributed to the datasets, please contact us to contribute to the initiative!

  • Yes! Upon completion of a dataset collection, sample contributors will be given priority access to their samples’ data under a one-year embargo (extendable to two years upon request) before these data are added to the shared public database.

  • Align takes a multi-pronged approach to biosecurity. Instead of a one-size-fits-all biosecurity policy, we tailor the biosecurity approach on a dataset-by-dataset basis.

    Bake biosecurity analysis into the ideation & review process

    Align commits to not developing datasets with outsized dual-use potential. Throughout the ideation and review process, we assess the potential risks of data collection and the resulting predictive models.

    Leverage capabilities of partners and vendors

    Align uses its partners' and vendors' built-in tools to screen DNA sequences for regulated or dangerous pathogen sequences.

    User verification and data use authorization agreements

    Users of our datasets undergo a verification process and must sign a data use authorization agreement, to ensure that prospective users are legitimate researchers engaged in beneficial research.

  • The Open Datasets Initiative is a new paradigm for funding, collecting, and sharing large, high-fidelity datasets in biology. This is a collaborative, community effort meant to determine, collect, and share datasets that are important for both the life sciences and machine learning communities. These datasets will be collected with automation through shared, open protocols, to ensure both transparency of the process and fidelity and repeatability of measurements. Datasets will ultimately be made fully public. The Open Datasets Initiative, through Align to Innovate, facilitates the collection and sharing of these methods and datasets; we do not provide grants for researchers to do independent projects, but instead help bring researchers together to accomplish these goals.

  • The Open Datasets Initiative is currently a two-year program (2023-2025), with the possibility of continuing if the model works. If successful, we will continue to expand upon the initial datasets, making them into truly living datasets, and add new datasets as we are able.

  • An open dataset is a collection of data that is publicly available to all. The goal of having an open dataset is to remove barriers to accessing data for the creation of new machine learning models.

  • For our initial dataset(s) collection (2023-2024), we will likely focus primarily on experiments involving proteins, microbes (in particular, bacteria, microbial-related cell-free systems, and possibly yeast), and prokaryotic genomics. However, what we do is determined both by the ideas we receive and the feasibility of methods development and automatable execution. In addition, while we are doing an initial push for proposals, we will be accepting ideas on a rolling basis for possible future experiments. Thus, we welcome all ideas relating to datasets important to both biology and machine learning, regardless of area, system, or organism.

  • We plan to collect two or three datasets in 2024, though the number of initial datasets depends on the ideas we receive and the methods available. We are doing an initial push for proposals in Q2/Q3 of 2023, but will accept proposals on a rolling basis after this; these ideas may be implemented later in 2024 and at the start of 2025. If this model is successful, we hope to expand these datasets with more funding and add more in years to come!

  • Anyone can contribute their initial ideas through our Idea Submission Form regardless of country or affiliation. We only ask that you be creative, thoughtful, and willing to engage in open collaboration.

  • Yes, you can submit as many dataset ideas as you wish!

Have additional questions? Contact us at datasets@alignbio.org