Population-based Detection of Structural Variants in Normal and Aberrant Genomes

Abstract

Structural variants (SVs) are key mutational elements in evolution and disease. Their detection from high-throughput sequencing (HTS) data has improved substantially, but the fraction of false-positive predictions remains high. Few efforts have been directed toward problematic regions of the genome, such as low-mappability or repeat-enriched regions, which are known to be rich in SVs. We propose to overcome these limitations by using a large set of experiments and a population-based approach to identify abnormal regions. Even in low-mappability regions of the genome, deviation from the reference samples allows SVs to be identified. Most HTS-based methods for structural variant detection correct only for the main GC-content and mappability biases, and the remaining technical variation may reduce statistical power. By comparing properly mapped read coverage across multiple samples, we demonstrate the presence of a hitherto unknown technical bias in whole-genome sequence data. Because this bias varies between genomic regions and experiments, we show that a general normalization is not sufficient to correct this systematic sample-specific variation, and we develop a targeted normalization approach. After normalization, we test each region in each sample by computing a Z-test-like score adjusted for multiple-hypothesis testing. We validate our approach on a cancer resequencing project, using normal-tumor concordance across 100 whole-genome sample pairs and comparison with SNP-array calls, and on a twin study dataset. Compared to other state-of-the-art methods, our population-based approach consistently detects more concordant events at similar specificity. Very few regions of the genome were excluded from the analysis, and a number of SVs were detected in low-mappability regions. Moreover, many partial events, likely due to normal contamination in tumor samples or tumor heterogeneity, are detected exclusively by our approach. By integrating over multiple datasets, our approach is freed from the constraints of assuming uniform coverage, relying on consecutive bins, or requiring complete copy-number changes.
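
As a rough illustration of the testing step described above, the sketch below scores each genomic region of one sample against a set of reference samples and adjusts the resulting p-values for multiple testing. It is not the authors' implementation: the function names, the use of the reference mean and standard deviation, the normal approximation, and the Benjamini-Hochberg adjustment are all assumptions made for the example, and the coverage values are assumed to be already normalized.

```python
import math
from statistics import mean, stdev


def z_scores(coverage, reference):
    """Z-like score per region: sample coverage vs. reference samples.

    coverage  -- normalized coverage of the tested sample, one value per region
    reference -- for each region, a list of coverages in the reference samples
    """
    scores = []
    for obs, ref in zip(coverage, reference):
        mu = mean(ref)
        sd = stdev(ref)
        # Assumption: regions with no reference variation are scored as 0.
        scores.append((obs - mu) / sd if sd > 0 else 0.0)
    return scores


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_abnormal_regions(coverage, reference, fdr=0.05):
    """Return (region index, z, adjusted p) for regions passing the threshold."""
    z = z_scores(coverage, reference)
    q = benjamini_hochberg([two_sided_p(s) for s in z])
    return [(i, z[i], q[i]) for i in range(len(z)) if q[i] <= fdr]


# Toy data: four regions, five reference samples, one sample under test.
reference = [[98, 100, 102, 101, 99]] * 4
sample = [100, 100, 50, 100]  # region 2 has roughly half the expected coverage
calls = call_abnormal_regions(sample, reference)
```

Because each region is compared only with the same region in other samples, a score of this kind does not require uniform coverage along the genome or support from neighboring bins, and a partial drop in coverage can still stand out if it exceeds the reference variation.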

Location
Vancouver, Canada