Mark Gerstein

Yale University

H-index: 193

North America-United States

Professor Information

University

Yale University

Position

Professor of Biomedical Informatics

Citations(all)

213759

Citations(since 2020)

73095

Cited By

168946

hIndex(all)

193

hIndex(since 2020)

106

i10Index(all)

595

i10Index(since 2020)

422

University Profile Page

Yale University

Research & Interests List

Bioinformatics

Top articles of Mark Gerstein

A Variational Graph Partitioning Approach to Modeling Protein Liquid-liquid Phase Separation

Protein Liquid-Liquid Phase Separation (LLPS) plays an essential role in cellular processes and is known to be associated with various diseases. However, our understanding of this enigmatic phenomenon remains limited. In this work, we propose a graph-neural-network (GNN)-based interpretable machine learning approach to study the intricate nature of protein structure-function relationships associated with LLPS. For many protein properties of interest, information relevant to the property is expected to be confined to local domains. For LLPS proteins, the presence of intrinsically disordered regions (IDRs) in the molecule is arguably the most important information; an adaptive GNN model which preferentially shares information within such units and avoids mixing in information from other parts of the molecule may thus enhance the prediction of LLPS proteins. To allow for the accentuation of domain-restricted information, we propose a novel graph-based model with the ability to partition each protein graph into task-dependent subgraphs. Such a model is designed not only to achieve better predictive performance but also to be highly interpretable, and thus have the ability to suggest novel biological insights. In addition to achieving state-of-the-art results on the prediction of LLPS proteins from protein structure for both regulator and scaffold proteins, we examine the properties of the graph partitions identified by our model, showing these to be consistent with the annotated IDRs believed to be largely responsible for LLPS. Moreover, our method is designed in a generic way such that it can be applied to other graph-based predictive tasks with …

Authors

Gaoyuan Wang,Jonathan Warrell,Suchen Zheng,Mark Gerstein

Journal

bioRxiv

Published Date

2024

Disentangled Wasserstein Autoencoder for T-Cell Receptor Engineering

In protein biophysics, the separation between the functionally important residues (forming the active site or binding surface) and those that create the overall structure (the fold) is a well-established and fundamental concept. Identifying and modifying those functional sites is critical for protein engineering but computationally non-trivial, and requires significant domain knowledge. To automate this process from a data-driven perspective, we propose a disentangled Wasserstein autoencoder with an auxiliary classifier, which isolates the function-related patterns from the rest with theoretical guarantees. This enables one-pass protein sequence editing and improves the understanding of the resulting sequences and editing actions involved. To demonstrate its effectiveness, we apply it to T-cell receptors (TCRs), a well-studied structure-function case. We show that our method can be used to alter the function of TCRs without changing the structural backbone, outperforming several competing methods in generation quality and efficiency, and requiring only 10% of the running time needed by baseline models. To our knowledge, this is the first approach that utilizes disentangled representations for TCR engineering.

Authors

Tianxiao Li,Hongyu Guo,Filippo Grazioli,Mark Gerstein,Martin Renqiang Min

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

Single-cell genomics and regulatory networks for 388 human brains

Single-cell genomics is a powerful tool for studying heterogeneous tissues such as the brain. Yet, little is understood about how genetic variants influence cell-level gene expression. Addressing this, we uniformly processed single-nuclei, multi-omics datasets into a resource comprising >2.8M nuclei from the prefrontal cortex across 388 individuals. For 28 cell types, we assessed population-level variation in expression and chromatin across gene families and drug targets. We identified >550K cell-type-specific regulatory elements and >1.4M single-cell expression-quantitative-trait loci, which we used to build cell-type regulatory and cell-to-cell communication networks. These networks manifest cellular changes in aging and neuropsychiatric disorders. We further constructed an integrative model accurately imputing single-cell expression and simulating perturbations; the model prioritized ∼250 disease-risk genes and drug targets with associated cell types.

Authors

Prashant S Emani,Jason J Liu,Declan Clarke,Matthew Jensen,Jonathan Warrell,Chirag Gupta,Ran Meng,Che Yu Lee,Siwei Xu,Cagatay Dursun,Shaoke Lou,Yuhang Chen,Zhiyuan Chu,Timur Galeev,Ahyeon Hwang,Yunyang Li,Pengyu Ni,Xiao Zhou,PsychENCODE Consortium,Trygve E Bakken,Jaroslav Bendl,Lucy Bicks,Tanima Chatterjee,Lijun Cheng,Yuyan Cheng,Yi Dai,Ziheng Duan,Mary Flaherty,John F Fullard,Michael Gancz,Diego Garrido-Martín,Sophia Gaynor-Gillett,Jennifer Grundman,Natalie Hawken,Ella Henry,Gabriel E Hoffman,Ao Huang,Yunzhe Jiang,Ting Jin,Nikolas L Jorstad,Riki Kawaguchi,Saniya Khullar,Jianyin Liu,Junhao Liu,Shuang Liu,Shaojie Ma,Michael Margolis,Samantha Mazariegos,Jill Moore,Jennifer R Moran,Eric Nguyen,Nishigandha Phalke,Milos Pjanic,Henry Pratt,Diana Quintero,Ananya S Rajagopalan,Tiernon R Riesenmy,Nicole Shedd,Manman Shi,Megan Spector,Rosemarie Terwilliger,Kyle J Travaglini,Brie Wamsley,Gaoyuan Wang,Yan Xia,Shaohua Xiao,Andrew C Yang,Suchen Zheng,Michael J Gandal,Donghoon Lee,Ed S Lein,Panos Roussos,Nenad Sestan,Zhiping Weng,Kevin P White,Hyejung Won,Matthew J Girgenti,Jing Zhang,Daifeng Wang,Daniel Geschwind,Mark Gerstein

Journal

bioRxiv

Published Date

2024/3/19

Binding profiles for 954 Drosophila and C. elegans transcription factors reveal tissue specific regulatory relationships

A catalog of transcription factor (TF) binding sites in the genome is critical for deciphering regulatory relationships. Here we present the culmination of the modERN (model organism Encyclopedia of Regulatory Networks) consortium that systematically assayed TF binding events in vivo in two major model organisms, Drosophila melanogaster (fly) and Caenorhabditis elegans (worm). We describe key features of these datasets, comprising 604 TFs identifying 3.6M sites in the fly and 350 TFs identifying 0.9M sites in the worm. Applying a machine learning model to these data identifies sets of TFs with a prominent role in promoting target gene expression in specific cell types. TF binding data are available through the ENCODE Data Coordinating Center and at https://epic.gs.washington.edu/modERNresource, which provides access to processed and summary data, as well as widgets to probe cell type-specific TF-target relationships. These data are a rich resource that should fuel investigations into TF function during development.

Authors

Michelle Kudron,Louis Gewirtzman,Alec Victorsen,Bridget C Lear,Jiahao Gao,Jinrui Xu,Swapna Samanta,Emily Frink,Adri Tran-Pearson,Cau Huynh,Dionne Vafeados,Ann Hammonds,William Fisher,Martha Wall,Greg Wesseling,Vanessa Hernandez,Zhichun Lin,Mary Kasparian,Kevin P White,Ravi Allada,Mark Gerstein,LaDeana Hillier,Susan E Celniker,Valerie Reinke,Robert H Waterston

Journal

bioRxiv

Published Date

2024

Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science

Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, they also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. This position paper fills this gap by conducting a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. We begin by providing a comprehensive overview of the potential risks inherent to scientific LLM agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively.

Authors

Xiangru Tang,Qiao Jin,Kunlun Zhu,Tongxin Yuan,Yichi Zhang,Wangchunshu Zhou,Meng Qu,Yilun Zhao,Jian Tang,Zhuosheng Zhang,Arman Cohan,Zhiyong Lu,Mark Gerstein

Journal

arXiv preprint arXiv:2402.04247

Published Date

2024/2/6

Less-is-more: selecting transcription factor binding regions informative for motif inference

Numerous statistical methods have emerged for inferring DNA motifs for transcription factors (TFs) from genomic regions. However, the process of selecting informative regions for motif inference remains understudied. Current approaches select regions with strong ChIP-seq signal for a given TF, assuming that such strong signal primarily results from specific interactions between the TF and its motif. Additionally, these selection approaches do not account for non-target motifs, i.e. motifs of other TFs; they presume the occurrence of these non-target motifs to be infrequent compared to that of the target motif, and thus assume these have minimal interference with the identification of the target. Leveraging extensive ChIP-seq datasets, we introduced the concept of TF signal ‘crowdedness’, referred to as C-score, for each genomic region. The C-score helps in highlighting TF signals arising from non-specific interactions …

Authors

Jinrui Xu,Jiahao Gao,Pengyu Ni,Mark Gerstein

Journal

Nucleic Acids Research

Published Date

2024/2/28

Transcriptional Determinism and Stochasticity Contribute to the Complexity of Autism Associated SHANK Family Genes

Precision of transcription is critical because transcriptional dysregulation is disease causing. Traditional methods of transcriptional profiling are inadequate to elucidate the full spectrum of the transcriptome, particularly for longer and less abundant mRNAs. SHANK3 is one of the most common autism causative genes. Twenty-four Shank3 mutant animal lines have been developed for autism modeling. However, their preclinical validity has been questioned due to incomplete Shank3 transcript structure. We applied an integrative approach combining cDNA-capture and long-read sequencing to profile the SHANK3 transcriptome in humans and mice. We unexpectedly discovered an extremely complex SHANK3 transcriptome. Specific SHANK3 transcripts were altered in Shank3 mutant mice and postmortem brain tissues from individuals with ASD. The enhanced SHANK3 transcriptome significantly improved the detection rate for potentially deleterious variants from genomics studies of neuropsychiatric disorders. Our findings suggest the stochastic transcription of the genome associated with SHANK family genes.

Authors

Xiaona Lu,Pengyu Ni,Paola Suarez-Meade,Ma Yu,Emily Niemitz Forrest,Guilin Wang,Yi Wang,Alfredo Quinones-Hinojosa,Mark Gerstein,Yong-hui Jiang

Journal

bioRxiv

Published Date

2024

ζ-QVAE: A Quantum Variational Autoencoder utilizing Regularized Mixed-state Latent Representations

A major challenge in near-term quantum computing is its application to large real-world datasets due to scarce quantum hardware resources. One approach to enabling tractable quantum models for such datasets involves compressing the original data to manageable dimensions while still representing essential information for downstream analysis. In classical machine learning, variational autoencoders (VAEs) facilitate efficient data compression, representation learning for subsequent tasks, and novel data generation. However, no model has been proposed that exactly captures all of these features for direct application to quantum data on quantum computers. Some existing quantum models for data compression lack regularization of latent representations, thus preventing direct use for generation and control of generalization. Others are hybrid models with only some internal quantum components, impeding direct training on quantum data. To bridge this gap, we present a fully quantum framework, ζ-QVAE, which encompasses all the capabilities of classical VAEs and can be directly applied for both classical and quantum data compression. Our model utilizes regularized mixed states to attain optimal latent representations. It accommodates various divergences for reconstruction and regularization. Furthermore, by accommodating mixed states at every stage, it can utilize the full-data density matrix and allow for a "global" training objective. Doing so, in turn, makes efficient optimization possible and has potential implications for private and federated learning. In addition to exploring the theoretical properties of ζ-QVAE, we demonstrate its …

Authors

Gaoyuan Wang,Jonathan Warrell,Prashant S Emani,Mark Gerstein

Journal

arXiv preprint arXiv:2402.17749

Published Date

2024/2/27
