The Datasets

Living datasets backed by automated experiments and open source methods.

DATASET

Protein Sequence to Expression

Protein expression is a fundamental process in biotechnology, crucial for academic research, human health applications, and the bioeconomy. However, predicting which proteins will express well in a given organism remains a significant challenge, and current approaches often rely on laborious experimental trial and error.

A generalizable model for predicting protein expression from sequence information would revolutionize fields ranging from basic scientific research to biomanufacturing and pharmaceutical development.

READ

Can protein expression be “solved”? (A review of the current state of protein expression data)

A strategy for scalable data collection of soluble protein expression in diverse hosts (Our proposed experimental approach)

We propose to develop a large-scale, standardized dataset enabling quantitative prediction of expression levels from protein sequence data across multiple organisms. This dataset will serve as a foundation for machine learning models that can accurately predict soluble protein expression, addressing a major bottleneck in biotechnology and protein engineering.

OUR STRATEGY

Protein expression results from a complex interplay of factors intrinsic to the protein sequence and extrinsic factors such as the host organism and experimental conditions. Our strategy is to create a comprehensive dataset linking protein sequences to expression levels in various organisms, enabling the development of accurate, generalizable predictive models for protein expression.

Initially, we will focus on expression data in E. coli and P. pastoris, two widely used expression hosts. As the dataset grows, we will expand to include more diverse organisms, more sequences, and more experimental conditions to capture a broader range of expression behaviors.

HOW ARE WE DOING IT?

We are using a combination of high-throughput pooled methods and targeted validation to collect expression data on over 1 million unique protein open reading frames. Our approach includes:

  1. Pooled methods: Label-free proteomics and SortSeq

  2. Validation methods: HiBiT and mass spectrometry

This strategy allows us to generate a large, diverse dataset while ensuring data quality and reproducibility. Data will be freely available via API access, provided in a standardized format, and designed with ML utility in mind to promote widespread use and collaboration.
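As an illustration only, here is a minimal sketch of one common way to turn pooled SortSeq reads into a per-variant expression estimate: cells are sorted into fluorescence bins, each bin is sequenced, and a variant's score is the read-weighted mean of the bins' log-fluorescence. This page does not specify the project's actual analysis pipeline, so the function name `sortseq_score`, its parameters, and the example numbers are all hypothetical.

```python
from typing import Optional, Sequence


def sortseq_score(
    bin_counts: Sequence[float],
    bin_totals: Sequence[float],
    bin_log_means: Sequence[float],
) -> Optional[float]:
    """Read-weighted mean log-fluorescence for one variant (hypothetical helper).

    bin_counts    -- reads of this variant observed in each sorting bin
    bin_totals    -- total reads sequenced from each bin (sequencing depth)
    bin_log_means -- mean log-fluorescence of the cells sorted into each bin
    """
    if not (len(bin_counts) == len(bin_totals) == len(bin_log_means)):
        raise ValueError("all inputs must have one entry per bin")

    # Normalize by sequencing depth so deeply sequenced bins do not dominate.
    freqs = [c / t if t > 0 else 0.0 for c, t in zip(bin_counts, bin_totals)]
    total = sum(freqs)
    if total == 0:
        return None  # variant was never observed; no estimate possible

    # Weighted average of the bin fluorescence values.
    return sum(f * m for f, m in zip(freqs, bin_log_means)) / total


# Made-up example: a variant found mostly in the brightest of four bins.
score = sortseq_score(
    bin_counts=[0, 10, 30, 60],
    bin_totals=[1000, 1000, 1000, 1000],
    bin_log_means=[1.0, 2.0, 3.0, 4.0],
)
```

A real analysis would typically also account for the fraction of cells sorted into each bin and for sampling noise at low read counts; this sketch omits both for brevity.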

OUR TIMELINE

Initial timeline for the protein expression dataset project:

2024: Establish experimental protocols and data collection pipeline in E. coli
2025: Expand data collection to P. pastoris and begin large-scale data generation
2026+: Introduce additional validation methods and start expanding to additional bacterial and eukaryotic hosts

Project Team

Kasia Baranowski

TECHNICAL PROJECT MANAGER

Team Member Spotlight

Kasia attended graduate school at Harvard, where she studied the mycobacterial cell wall in the lab of Eric Rubin. Since then, she’s worked in synthetic biology as a yeast strain engineer (Zymergen), a CRISPR platform biologist (Inscripta), and a scientific project manager (Cemvita). At Align to Innovate, Kasia spearheads the Protein Sequence to Expression dataset. She wants you to stop wasting time trying to express proteins in the wrong host and knows the expression dataset and subsequent models will help you out!

Proposal Contributors

Eli Bixby - Cradle Bio

Swati Choudhary - Formerly at Shiru, Inc.

Ranjani Varadan - Formerly at Shiru, Inc.

Elise de Reus - Cradle Bio

Aljaž Gaber - University of Ljubljana, Faculty of Chemistry and Chemical Technology

Sebastian Jaaks-Kraatz - Friedrich Miescher Institute for Biomedical Research

Michael C. Jewett - Department of Bioengineering, Stanford University

Ben Lehner - Center for Genomic Regulation

Hector Garcia Martin - Lawrence Berkeley National Laboratory, DOE Joint BioEnergy Institute, DOE Agile BioFoundry

Evangelos-Marios Nikolados - School of Biological Sciences, University of Edinburgh

Diego A. Oyarzún - School of Informatics & School of Biological Sciences, University of Edinburgh

Christopher J. Petzold - Lawrence Berkeley National Laboratory

Christopher R. Reynolds - Eden Genetics Ltd

David Ross - National Institute of Standards and Technology

Howard Salis - Pennsylvania State University

Devin Scannell - Independent

Rachel Sevey - Independent Graphic Design and Data Visualization Specialist

Data Collection Partners

Talk to us! Here’s how to participate:

Email us at datasets@alignbio.org