Heng Li

Harvard University

H-index: 72

North America-United States

About Heng Li

Heng Li is a distinguished researcher at Harvard University who specializes in Computational Biology, Bioinformatics, and Genomics. He has an h-index of 72 overall and 66 since 2020.

His recent articles reflect a diverse array of research interests and contributions to the field:

Exploring gene content with pangenome gene graphs

Full resolution HLA and KIR genes annotation for human genome assemblies

Protein-to-genome alignment with miniprot

Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph

Evaluation of haplotype-aware long-read error correction with hifieval

Efficient and accurate KIR and HLA genotyping with massively parallel sequencing data

compleasm: a faster and more accurate reimplementation of BUSCO

A draft human pangenome reference

Heng Li Information

University

Harvard University

Position

Dana-Farber Cancer Institute &

Citations(all)

218609

Citations(since 2020)

127603

Cited By

137778

hIndex(all)

72

hIndex(since 2020)

66

i10Index(all)

101

i10Index(since 2020)

92

Heng Li Skills & Research Interests

Computational Biology

Bioinformatics

Genomics

Top articles of Heng Li

Exploring gene content with pangenome gene graphs

Authors

Heng Li,Maximillian Marin,Maha Reda Farhat

Journal

arXiv preprint arXiv:2402.16185

Published Date

2024/2/25

Motivation: The gene content regulates the biology of an organism. It varies between species and between individuals of the same species. Although tools have been developed to identify gene content changes in bacterial genomes, none is applicable to collections of large eukaryotic genomes such as the human pangenome. Results: We developed pangene, a computational tool to identify gene orientation, gene order and gene copy-number changes in a collection of genomes. Pangene aligns a set of input protein sequences to the genomes, resolves redundancies between protein sequences and constructs a gene graph with each genome represented as a walk in the graph. It additionally finds subgraphs that encode gene content changes. Applied to the human pangenome, pangene identifies known gene-level variations and reveals complex haplotypes that have not been well studied before. Pangene also works with high-quality bacterial pangenomes and reports similar numbers of core and accessory genes in comparison to existing tools. Availability and implementation: Source code at https://github.com/lh3/pangene; pre-built pangene graphs can be downloaded from https://zenodo.org/records/8118576 and visualized at http://pangene.liheng.org.

Full resolution HLA and KIR genes annotation for human genome assemblies

Authors

Ying Zhou,Li Song,Heng Li

Journal

bioRxiv

Published Date

2024

The HLA (Human Leukocyte Antigen) genes and the KIR (Killer cell Immunoglobulin-like Receptor) genes are critical to immune responses and are associated with many immune-related diseases. Located in highly polymorphic regions, they are hard to study with traditional short-read alignment-based methods. Although modern long-read assemblers can often assemble these genes, using existing tools to annotate HLA and KIR genes in these assemblies remains a non-trivial task. Here, we describe Immuannot, a new computational tool to annotate the gene structures of HLA and KIR genes and to type the allele of each gene. Applying Immuannot to 56 regional and 212 whole-genome assemblies from previous studies, we annotated 9,931 HLA and KIR genes and found that almost half of these genes, 4,068, had novel sequences compared to the current Immuno Polymorphism Database (IPD). These novel gene sequences were represented by 2,664 distinct alleles, some of which contained non-synonymous variations resulting in 92 novel protein sequences. We demonstrated the complex haplotype structures at the two loci and reported the linkage between HLA/KIR haplotypes and gene alleles. We anticipate that Immuannot will speed up the discovery of new HLA/KIR alleles and enable the association of HLA/KIR haplotype structures with clinical outcomes in the future.

Protein-to-genome alignment with miniprot

Authors

Heng Li

Journal

Bioinformatics

Published Date

2023/1/1

Motivation: Protein-to-genome alignment is critical to annotating genes in non-model organisms. While there are a few tools for this purpose, all of them were developed over 10 years ago and did not incorporate the latest advances in alignment algorithms. They are inefficient and could not keep up with the rapid production of new genomes and quickly growing protein databases. Results: Here, we describe miniprot, a new aligner for mapping protein sequences to a complete genome. Miniprot integrates recent techniques such as k-mer sketch and vectorized dynamic programming. It is tens of times faster than existing tools while achieving comparable accuracy on real data. Availability and implementation: https://github.com/lh3/miniprot.

Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph

Authors

Haoyu Cheng,Mobin Asri,Julian Lucas,Sergey Koren,Heng Li

Journal

arXiv

Published Date

2023/6/6

Despite recent advances in the length and the accuracy of long-read data, building haplotype-resolved genome assemblies from telomere to telomere still requires considerable computational resources. In this study, we present an efficient de novo assembly algorithm that combines multiple sequencing technologies to scale up population-wide telomere-to-telomere assemblies. By utilizing twenty-two human and two plant genomes, we demonstrate that our algorithm is around an order of magnitude cheaper than existing methods, while producing better diploid and haploid assemblies. Notably, our algorithm is the only feasible solution to the haplotype-resolved assembly of polyploid genomes.

Evaluation of haplotype-aware long-read error correction with hifieval

Authors

Yujie Guo,Xiaowen Feng,Heng Li

Journal

Bioinformatics

Published Date

2023/10/18

Summary: The PacBio High-Fidelity (HiFi) sequencing technology produces long reads of 99% in accuracy. It has enabled the development of a new generation of de novo sequence assemblers, which all have sequencing error correction (EC) as the first step. As HiFi is a new data type, this critical step has not been evaluated before. Here, we introduced hifieval, a new command-line tool for measuring over- and under-corrections produced by EC algorithms. We assessed the accuracy of the EC components of existing HiFi assemblers on the CHM13 and the HG002 datasets and further investigated the performance of EC methods in challenging regions such as homopolymer regions, centromeric regions, and segmental duplications. Hifieval will help HiFi assemblers to improve EC and assembly quality in the long run. Availability and implementation: The source code is …

Efficient and accurate KIR and HLA genotyping with massively parallel sequencing data

Authors

Li Song,Gali Bai,X Shirley Liu,Bo Li,Heng Li

Journal

Genome Research

Published Date

2023/6/1

Killer cell immunoglobulin like receptor (KIR) genes and human leukocyte antigen (HLA) genes play important roles in innate and adaptive immunity. They are highly polymorphic and cannot be genotyped with standard variant calling pipelines. Compared with HLA genes, many KIR genes are similar to each other in sequences and may be absent in the chromosomes. Therefore, although many tools have been developed to genotype HLA genes using common sequencing data, none of them work for KIR genes. Even specialized KIR genotypers could not resolve all the KIR genes. Here we describe T1K, a novel computational method for the efficient and accurate inference of KIR or HLA alleles from RNA-seq, whole-genome sequencing, or whole-exome sequencing data. T1K jointly considers alleles across all genotyped genes, so it can reliably identify present genes and distinguish homologous genes, including …

compleasm: a faster and more accurate reimplementation of BUSCO

Authors

Neng Huang,Heng Li

Journal

Bioinformatics

Published Date

2023/10/1

Motivation: Evaluating the gene completeness is critical to measuring the quality of a genome assembly. An incomplete assembly can lead to errors in gene predictions, annotation, and other downstream analyses. Benchmarking Universal Single-Copy Orthologs (BUSCO) is a widely used tool for assessing the completeness of genome assembly by testing the presence of a set of single-copy orthologs conserved across a wide range of taxa. However, BUSCO is slow particularly for large genome assemblies. It is cumbersome to apply BUSCO to a large number of assemblies. Results: Here, we present compleasm, an efficient tool for assessing the completeness of genome assemblies. Compleasm utilizes the miniprot protein-to-genome aligner and the conserved orthologous genes from BUSCO. It is 14 times faster than BUSCO for human assemblies and reports a more accurate …

A draft human pangenome reference

Authors

Wen-Wei Liao,Mobin Asri,Jana Ebler,Daniel Doerr,Marina Haukness,Glenn Hickey,Shuangjia Lu,Julian K Lucas,Jean Monlong,Haley J Abel,Silvia Buonaiuto,Xian H Chang,Haoyu Cheng,Justin Chu,Vincenza Colonna,Jordan M Eizenga,Xiaowen Feng,Christian Fischer,Robert S Fulton,Shilpa Garg,Cristian Groza,Andrea Guarracino,William T Harvey,Simon Heumos,Kerstin Howe,Miten Jain,Tsung-Yu Lu,Charles Markello,Fergal J Martin,Matthew W Mitchell,Katherine M Munson,Moses Njagi Mwaniki,Adam M Novak,Hugh E Olsen,Trevor Pesout,David Porubsky,Pjotr Prins,Jonas A Sibbesen,Jouni Sirén,Chad Tomlinson,Flavia Villani,Mitchell R Vollger,Lucinda L Antonacci-Fulton,Gunjan Baid,Carl A Baker,Anastasiya Belyaeva,Konstantinos Billis,Andrew Carroll,Pi-Chuan Chang,Sarah Cody,Daniel E Cook,Robert M Cook-Deegan,Omar E Cornejo,Mark Diekhans,Peter Ebert,Susan Fairley,Olivier Fedrigo,Adam L Felsenfeld,Giulio Formenti,Adam Frankish,Yan Gao,Nanibaa’A Garrison,Carlos Garcia Giron,Richard E Green,Leanne Haggerty,Kendra Hoekzema,Thibaut Hourlier,Hanlee P Ji,Eimear E Kenny,Barbara A Koenig,Alexey Kolesnikov,Jan O Korbel,Jennifer Kordosky,Sergey Koren,HoJoon Lee,Alexandra P Lewis,Hugo Magalhães,Santiago Marco-Sola,Pierre Marijon,Ann McCartney,Jennifer McDaniel,Jacquelyn Mountcastle,Maria Nattestad,Sergey Nurk,Nathan D Olson,Alice B Popejoy,Daniela Puiu,Mikko Rautiainen,Allison A Regier,Arang Rhie,Samuel Sacco,Ashley D Sanders,Valerie A Schneider,Baergen I Schultz,Kishwar Shafin,Michael W Smith,Heidi J Sofia,Ahmad N Abou Tayoun,Françoise Thibaud-Nissen,Francesca Floriana Tricomi,Justin Wagner,Brian Walenz,Jonathan MD Wood,Aleksey V Zimin,Guillaume Bourque,Mark JP Chaisson,Paul Flicek,Adam M Phillippy,Justin M Zook,Evan E Eichler,David Haussler,Ting Wang,Erich D Jarvis,Karen H Miga,Erik Garrison,Tobias Marschall,Ira M Hall,Heng Li,Benedict Paten

Journal

Nature

Published Date

2023/5/11

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural …

The complete sequence of a human Y chromosome

Authors

Arang Rhie,Sergey Nurk,Monika Cechova,Savannah J Hoyt,Dylan J Taylor,Nicolas Altemose,Paul W Hook,Sergey Koren,Mikko Rautiainen,Ivan A Alexandrov,Jamie Allen,Mobin Asri,Andrey V Bzikadze,Nae-Chyun Chen,Chen-Shan Chin,Mark Diekhans,Paul Flicek,Giulio Formenti,Arkarachai Fungtammasan,Carlos Garcia Giron,Erik Garrison,Ariel Gershman,Jennifer L Gerton,Patrick GS Grady,Andrea Guarracino,Leanne Haggerty,Reza Halabian,Nancy F Hansen,Robert Harris,Gabrielle A Hartley,William T Harvey,Marina Haukness,Jakob Heinz,Thibaut Hourlier,Robert M Hubley,Sarah E Hunt,Stephen Hwang,Miten Jain,Rupesh K Kesharwani,Alexandra P Lewis,Heng Li,Glennis A Logsdon,Julian K Lucas,Wojciech Makalowski,Christopher Markovic,Fergal J Martin,Ann M Mc Cartney,Rajiv C McCoy,Jennifer McDaniel,Brandy M McNulty,Paul Medvedev,Alla Mikheenko,Katherine M Munson,Terence D Murphy,Hugh E Olsen,Nathan D Olson,Luis F Paulin,David Porubsky,Tamara Potapova,Fedor Ryabov,Steven L Salzberg,Michael EG Sauria,Fritz J Sedlazeck,Kishwar Shafin,Valery A Shepelev,Alaina Shumate,Jessica M Storer,Likhitha Surapaneni,Angela M Taravella Oill,Françoise Thibaud-Nissen,Winston Timp,Marta Tomaszkiewicz,Mitchell R Vollger,Brian P Walenz,Allison C Watwood,Matthias H Weissensteiner,Aaron M Wenger,Melissa A Wilson,Samantha Zarate,Yiming Zhu,Justin M Zook,Evan E Eichler,Rachel J O’Neill,Michael C Schatz,Karen H Miga,Kateryna D Makova,Adam M Phillippy

Journal

Nature

Published Date

2023/9/14

The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region. We have combined T2T-Y with a …

Pangenome graph construction from genome alignments with Minigraph-Cactus

Authors

Glenn Hickey,Jean Monlong,Jana Ebler,Adam M Novak,Jordan M Eizenga,Yan Gao,Tobias Marschall,Heng Li,Benedict Paten

Journal

Nature Biotechnology

Published Date

2023/5/10

Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph’s ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and …

Galba: genome annotation with miniprot and AUGUSTUS

Authors

Tomáš Brůna,Heng Li,Joseph Guhlin,Daniel Honsel,Steffen Herbold,Mario Stanke,Natalia Nenasheva,Matthis Ebel,Lars Gabriel,Katharina J Hoff

Journal

BMC Bioinformatics

Published Date

2023/8/31

Background: The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes. Results: Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments. Conclusions: Our pipeline addresses the critical need for accurate gene …

Neotelomeres and Telomere-Spanning Chromosomal Arm Fusions in Cancer Genomes Revealed by Long-Read Sequencing

Authors

Kar-Tong Tan,Michael K Slevin,Mitchell L Leibowitz,Max Garrity-Janger,Heng Li,Matthew Meyerson

Journal

bioRxiv

Published Date

2023/12/1

Alterations in the structure and location of telomeres are key events in cancer genome evolution. However, previous genomic approaches, unable to span long telomeric repeat arrays, could not characterize the nature of these alterations. Here, we applied both long-read and short-read genome sequencing to assess telomere repeat-containing structures in cancers and cancer cell lines. Using long-read genome sequences that span telomeric repeat arrays, we defined four types of telomere repeat variations in cancer cells: neotelomeres where telomere addition heals chromosome breaks, chromosomal arm fusions spanning telomere repeats, fusions of neotelomeres, and peri-centromeric fusions with adjoined telomere and centromere repeats. Analysis of lung adenocarcinoma genome sequences identified somatic neotelomere and telomere-spanning fusion alterations. These results provide a framework for systematic study of telomeric repeat arrays in cancer genomes, that could serve as a model for understanding the somatic evolution of other repetitive genomic elements.

AGC: compact representation of assembled genomes with fast queries and updates

Authors

Sebastian Deorowicz,Agnieszka Danek,Heng Li

Journal

Bioinformatics

Published Date

2023/3/1

Motivation: High-quality sequence assembly is the ultimate representation of complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies of hundreds of gigabytes on disk, greatly impeding the distribution of and access to such rich datasets. Results: Here, we show how to reduce the size of the sequenced genomes by 2–3 orders of magnitude. Our tool compresses the genomes significantly better than the existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or its part) in a fraction of a second and easily append new samples to the compressed collections. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in …

Genome assembly in the telomere-to-telomere era

Authors

Heng Li,Richard Durbin

Published Date

2024/4/22

Genome sequences largely determine the biology and encode the history of an organism, and de novo assembly — the process of reconstructing the genome sequence of an organism from sequencing reads — has been a central problem in bioinformatics for four decades. Until recently, genomes were typically assembled into fragments of a few megabases at best, but now technological advances in long-read sequencing enable the near-complete assembly of each chromosome — also known as telomere-to-telomere assembly — for many organisms. Here, we review recent progress on assembly algorithms and protocols, with a focus on how to derive near-telomere-to-telomere assemblies. We also discuss the additional developments that will be required to resolve remaining assembly gaps and to assemble non-diploid genomes.

De novo reconstruction of satellite repeat units from sequence data

Authors

Yujie Zhang,Justin Chu,Haoyu Cheng,Heng Li

Journal

Genome Research

Published Date

2023/11/1

Satellite DNA are long tandemly repeating sequences in a genome and may be organized as high-order repeats (HORs). They are enriched in centromeres and are challenging to assemble. Existing algorithms for identifying satellite repeats either require the complete assembly of satellites or only work for simple repeat structures without HORs. Here we describe Satellite Repeat Finder (SRF), a new algorithm for reconstructing satellite repeat units and HORs from accurate reads or assemblies without prior knowledge on repeat structures. Applying SRF to real sequence data, we show that SRF could reconstruct known satellites in human and well-studied model organisms. We also find satellite repeats are pervasive in various other species, accounting for up to 12% of their genome contents but are often underrepresented in assemblies. With the rapid progress in genome sequencing, SRF will help the annotation …

The Human Pangenome Project: a global resource to map genomic diversity

Authors

Ting Wang,Lucinda Antonacci-Fulton,Kerstin Howe,Heather A Lawson,Julian K Lucas,Adam M Phillippy,Alice B Popejoy,Mobin Asri,Caryn Carson,Mark JP Chaisson,Xian Chang,Robert Cook-Deegan,Adam L Felsenfeld,Robert S Fulton,Erik P Garrison,Nanibaa’A Garrison,Tina A Graves-Lindsay,Hanlee Ji,Eimear E Kenny,Barbara A Koenig,Daofeng Li,Tobias Marschall,Joshua F McMichael,Adam M Novak,Deepak Purushotham,Valerie A Schneider,Baergen I Schultz,Michael W Smith,Heidi J Sofia,Tsachy Weissman,Paul Flicek,Heng Li,Karen H Miga,Benedict Paten,Erich D Jarvis,Ira M Hall,Evan E Eichler,David Haussler,Human Pangenome Reference Consortium

Published Date

2022/4/21

The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal …

The complete sequence of a human genome

Authors

Sergey Nurk,Sergey Koren,Arang Rhie,Mikko Rautiainen,Andrey V Bzikadze,Alla Mikheenko,Mitchell R Vollger,Nicolas Altemose,Lev Uralsky,Ariel Gershman,Sergey Aganezov,Savannah J Hoyt,Mark Diekhans,Glennis A Logsdon,Michael Alonge,Stylianos E Antonarakis,Matthew Borchers,Gerard G Bouffard,Shelise Y Brooks,Gina V Caldas,Nae-Chyun Chen,Haoyu Cheng,Chen-Shan Chin,William Chow,Leonardo G de Lima,Philip C Dishuck,Richard Durbin,Tatiana Dvorkina,Ian T Fiddes,Giulio Formenti,Robert S Fulton,Arkarachai Fungtammasan,Erik Garrison,Patrick GS Grady,Tina A Graves-Lindsay,Ira M Hall,Nancy F Hansen,Gabrielle A Hartley,Marina Haukness,Kerstin Howe,Michael W Hunkapiller,Chirag Jain,Miten Jain,Erich D Jarvis,Peter Kerpedjiev,Melanie Kirsche,Mikhail Kolmogorov,Jonas Korlach,Milinn Kremitzki,Heng Li,Valerie V Maduro,Tobias Marschall,Ann M McCartney,Jennifer McDaniel,Danny E Miller,James C Mullikin,Eugene W Myers,Nathan D Olson,Benedict Paten,Paul Peluso,Pavel A Pevzner,David Porubsky,Tamara Potapova,Evgeny I Rogaev,Jeffrey A Rosenfeld,Steven L Salzberg,Valerie A Schneider,Fritz J Sedlazeck,Kishwar Shafin,Colin J Shew,Alaina Shumate,Ying Sims,Arian FA Smit,Daniela C Soto,Ivan Sović,Jessica M Storer,Aaron Streets,Beth A Sullivan,Françoise Thibaud-Nissen,James Torrance,Justin Wagner,Brian P Walenz,Aaron Wenger,Jonathan MD Wood,Chunlin Xiao,Stephanie M Yan,Alice C Young,Samantha Zarate,Urvashi Surti,Rajiv C McCoy,Megan Y Dennis,Ivan A Alexandrov,Jennifer L Gerton,Rachel J O’Neill,Winston Timp,Justin M Zook,Michael C Schatz,Evan E Eichler,Karen H Miga,Adam M Phillippy

Journal

Science

Published Date

2022/4/1

Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.

Semi-automated assembly of high-quality diploid human reference genomes

Authors

Erich D Jarvis,Giulio Formenti,Arang Rhie,Andrea Guarracino,Chentao Yang,Jonathan Wood,Alan Tracey,Francoise Thibaud-Nissen,Mitchell R Vollger,David Porubsky,Haoyu Cheng,Mobin Asri,Glennis A Logsdon,Paolo Carnevali,Mark JP Chaisson,Chen-Shan Chin,Sarah Cody,Joanna Collins,Peter Ebert,Merly Escalona,Olivier Fedrigo,Robert S Fulton,Lucinda L Fulton,Shilpa Garg,Jennifer L Gerton,Jay Ghurye,Anastasiya Granat,Richard E Green,William Harvey,Patrick Hasenfeld,Alex Hastie,Marina Haukness,Erich B Jaeger,Miten Jain,Melanie Kirsche,Mikhail Kolmogorov,Jan O Korbel,Sergey Koren,Jonas Korlach,Joyce Lee,Daofeng Li,Tina Lindsay,Julian Lucas,Feng Luo,Tobias Marschall,Matthew W Mitchell,Jennifer McDaniel,Fan Nie,Hugh E Olsen,Nathan D Olson,Trevor Pesout,Tamara Potapova,Daniela Puiu,Allison Regier,Jue Ruan,Steven L Salzberg,Ashley D Sanders,Michael C Schatz,Anthony Schmitt,Valerie A Schneider,Siddarth Selvaraj,Kishwar Shafin,Alaina Shumate,Nathan O Stitziel,Catherine Stober,James Torrance,Justin Wagner,Jianxin Wang,Aaron Wenger,Chuanle Xiao,Aleksey V Zimin,Guojie Zhang,Ting Wang,Heng Li,Erik Garrison,David Haussler,Ira Hall,Justin M Zook,Evan E Eichler,Adam M Phillippy,Benedict Paten,Kerstin Howe,Karen H Miga,Human Pangenome Reference Consortium

Journal

Nature

Published Date

2022/11/17

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal …

CoLoRd: Compressing long reads

Authors

Marek Kokot,Adam Gudyś,Heng Li,Sebastian Deorowicz

Journal

Nature Methods

Published Date

2022/4

The cost of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today’s genomic research. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit a minor advantage over the general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.

Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres

Authors

Kar-Tong Tan,Michael K Slevin,Matthew Meyerson,Heng Li

Journal

Genome Biology

Published Date

2022/8/26

Nanopore long-read sequencing is an emerging approach for studying genomes, including long repetitive elements like telomeres. Here, we report extensive basecalling induced errors at telomere repeats across nanopore datasets, sequencing platforms, basecallers, and basecalling models. We find that telomeres in many organisms are frequently miscalled. We demonstrate that tuning of nanopore basecalling models leads to improved recovery and analysis of telomeric regions, with minimal negative impact on other genomic regions. We highlight the importance of verifying nanopore basecalls in long, repetitive, and poorly defined regions, and showcase how artefacts can be resolved by improvements in nanopore basecalling models.

Heng Li FAQs

What is Heng Li's h-index at Harvard University?

The h-index of Heng Li has been 66 since 2020 and 72 in total.
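
For readers unfamiliar with these metrics, the standard definitions (not stated on this page) are: the h-index is the largest h such that h papers each have at least h citations, and the i10-index is the number of papers with at least 10 citations. The Python sketch below illustrates both. The function names and the toy citation counts are illustrative assumptions, not data from this profile.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have >= rank citations
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Hypothetical per-paper citation counts, for illustration only.
toy_counts = [25, 12, 10, 8, 3, 0]
print(h_index(toy_counts))    # 4
print(i10_index(toy_counts))  # 3
```

A "since 2020" variant of either metric would apply the same functions to citation counts restricted to citations received in that window.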

What are Heng Li's top articles?

The articles with the titles of

Exploring gene content with pangenome gene graphs

Full resolution HLA and KIR genes annotation for human genome assemblies

Protein-to-genome alignment with miniprot

Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph

Evaluation of haplotype-aware long-read error correction with hifieval

Efficient and accurate KIR and HLA genotyping with massively parallel sequencing data

compleasm: a faster and more accurate reimplementation of BUSCO

A draft human pangenome reference

...

are the top articles of Heng Li at Harvard University.

What are Heng Li's research interests?

The research interests of Heng Li are: Computational Biology, Bioinformatics, Genomics

What is Heng Li's total number of citations?

Heng Li has 218,609 citations in total.

Who are the co-authors of Heng Li?

The co-authors of Heng Li are Goncalo Abecasis, Richard Durbin, Benjamin Neale, Xiaoliang Sunney Xie, Nick Patterson.

    Co-Authors

    Goncalo Abecasis
    University of Michigan-Dearborn
    H-index: 207

    Richard Durbin
    University of Cambridge
    H-index: 153

    Benjamin Neale
    Harvard University
    H-index: 146

    Xiaoliang Sunney Xie
    Peking University
    H-index: 128

    Nick Patterson
    Harvard University
    H-index: 118
