The Datasets
Living datasets backed by automated experiments and open source methods.
DATASET
Protein Sequence to Function
Protein functions, such as enzymatic activities, binding interactions, and membrane transport, exist as islands in the “archipelago” of the protein function landscape. Machine learning (ML) approaches have attempted to bridge the gaps between these islands, but there is still no general solution for predicting any protein’s function from its DNA sequence.
A general solution for predicting any protein function from sequence would catalyze a transformation in the field of biology.
We propose to develop an experimental platform and unified data ontology for collecting datasets from different functional ‘islands’ to build predictive models for individual protein functions. The experimental strategy uses a pooled, growth-based assay measured with DNA sequencing to create a simple, yet adaptable system that can be easily expanded to encompass new functions.
READ
Growth-Based Assays for Measuring Function (Summary)
Growth-Based Assays for Measuring Function (Full Proposal)
OUR STRATEGY
We view protein functions as islands in an archipelago. Our strategy is to align dataset creation across different protein function ‘islands’ to enable a generalized “sequence → function” predictive model.
Models developed from this data will initially succeed at predicting protein function within a single ‘island’, that is, an individual family of proteins with a single function.
As the datasets grow and more islands are sampled, the models will become more generalized and capable of predicting the function of protein sequences that are increasingly distant from those that have been directly measured.
HOW ARE WE DOING IT?
We use massively pooled, growth-based assays that yield 100,000-500,000 data points per experiment at a cost of ~$0.05 per protein.
Growth-based assays link the function of any protein of interest to the ability of a cell containing it to grow.
A plasmid library of barcoded proteins is created and transformed into host cells. The cells are then pooled, grown, and challenged. The resulting cells are sequenced and the barcodes counted; these counts can be translated back into a quantitative measurement of function.
By simply changing elements of the plasmid’s gene circuit, the same methods and analyses can be used to interrogate new protein functions.
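As a rough illustration of how barcode counts might be turned into a quantitative score, the sketch below compares each barcode's share of sequencing reads before and after the growth challenge. It is not the project's analysis pipeline: the function names, the exact-match barcode lookup, the pseudocount, and the toy reads are all assumptions made for this example.

```python
import math
from collections import Counter


def count_barcodes(reads, known_barcodes):
    """Count reads that exactly match a known barcode (hypothetical helper)."""
    counts = Counter(r for r in reads if r in known_barcodes)
    # Report every known barcode, including those with zero reads.
    return {bc: counts.get(bc, 0) for bc in known_barcodes}


def log2_enrichment(pre_counts, post_counts, pseudocount=0.5):
    """Log2 change in each barcode's relative frequency after selection.

    The pseudocount (an assumed value) avoids division by zero for
    barcodes that drop out of the pool entirely.
    """
    pre_total = sum(pre_counts.values()) + pseudocount * len(pre_counts)
    post_total = sum(post_counts.values()) + pseudocount * len(post_counts)
    scores = {}
    for bc in pre_counts:
        pre_freq = (pre_counts[bc] + pseudocount) / pre_total
        post_freq = (post_counts.get(bc, 0) + pseudocount) / post_total
        scores[bc] = math.log2(post_freq / pre_freq)
    return scores


# Toy data: three made-up barcodes, plus a few unmatched reads that are ignored.
known = {"AAAC", "CCGT", "GGTA"}
pre_reads = ["AAAC"] * 100 + ["CCGT"] * 100 + ["GGTA"] * 100 + ["NNNN"] * 5
post_reads = ["AAAC"] * 240 + ["CCGT"] * 55 + ["GGTA"] * 5

pre = count_barcodes(pre_reads, known)
post = count_barcodes(post_reads, known)
scores = log2_enrichment(pre, post)

# Barcodes whose cells grew well under the challenge score above zero.
for bc in sorted(scores, key=scores.get, reverse=True):
    print(bc, round(scores[bc], 2))
```

A positive score means the barcode became more common after the challenge, which in a growth-based assay would suggest a more functional protein. A real analysis would likely also need to handle sequencing errors in barcodes, replicate experiments, and multiple time points.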
OUR TIMELINE
Our timeline for the first five protein function datasets.
Active Team Members
- Pete Kelly, Program Director, Open Datasets Initiative
- Dana Cortade, Technical Project Manager, Open Datasets Initiative
- David Ross, Proposal Co-Leader, Living Measurement Systems Foundry, National Institute of Standards and Technology (NIST)
- Erika DeBenedictis, Proposal Co-Leader, Biodesign Lab, The Francis Crick Institute and Align to Innovate
- Simon d'Oelsnitz, Proposal Co-Leader, Harvard Medical School, Harvard University
- Anjali Chadha, Protease Specificity, Biodesign Lab, The Francis Crick Institute
- Adam Winnifrith, Protease Specificity, Biodesign Lab, The Francis Crick Institute
- Geoffrey Taghon, Transcription Factor Binding, Living Measurement Systems Foundry, National Institute of Standards and Technology (NIST)
Reviewers
Hassan Kane - Medium Biosciences
Han Spinner - Harvard Medical School Department of Systems Biology
Stephan Lane - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
Chloe Hsu - University of California Berkeley
Ben Lehner FRS FMedSci - Head of Generative and Synthetic Genomics, Wellcome Sanger Institute, Cambridge, UK; ICREA Professor, Systems and Synthetic Biology, CRG, Barcelona, ES; Honorary Professor of Biochemistry, University of Cambridge
Kevin K. Yang - Microsoft
Benjamin Scott - Global Institute for Food Security, University of Saskatchewan, Saskatoon, SK, Canada
Additional Proposal Contributors
Oliver Hayes, Biodesign Lab, The Francis Crick Institute
Mark Dörr, University of Greifswald
Stefan Born, Technische Universität Berlin
Subject Matter Experts
Craig Markin, University of Manchester
Henning Redestig, International Flavors & Fragrances
Tianhao Yu, University of Illinois at Urbana-Champaign
Janet Matsen, Benchling
Amelia Taylor
Talk to us! Here’s how to participate:
Email us at datasets@alignbio.org