Genotyping structural variants in TOPMed using pangenome graphs

Abstract

Structural variants (SVs) are significant components of genetic diversity and have been associated with diseases, but the technological challenges surrounding their representation and identification make them difficult to study relative to point mutations. Still, thousands of SVs have been characterized, and catalogs continue to improve with new technologies. In parallel, variation graphs have been proposed to represent human pangenomes, offering reduced reference bias and better mapping accuracy than linear reference genomes. We contend that variation graphs provide an effective means of leveraging SV catalogs for short-read SV genotyping experiments. We extended vg (a software toolkit for working with variation graphs) to support SV genotyping. We show that it is capable of genotyping insertions, deletions and inversions, even in the presence of small errors in the location of the SV breakpoints. We benchmarked vg against state-of-the-art SV genotypers using three high-quality, sequence-resolved SV catalogs generated by recent studies, the largest containing 97,368 variants. We find that vg consistently produces the best genotype predictions across all datasets. It is also capable of fine-tuning the breakpoints of a large proportion of SVs using graph augmentation from the mapped reads. The genotyping pipeline was optimized and written in the WDL workflow language to facilitate the analysis of large cohorts. The workflow is available in Dockstore and can be run on different cloud computing platforms. Within the DataSTAGE project, we have already genotyped SVs in a subset of the TOPMed dataset to estimate variant frequencies in a diverse population and to provide genotypes for association studies within TOPMed.

Date

Feb 12, 2020

Event

GSP-TOPMed Analysis Workshop

Location

New York City, NY, USA
