Build Status codecov

What is PopSV ?

PopSV is a Copy-Number Variation (CNV) detection method from high-throughput sequencing. Abnormal Read-Depth signal is detected by using a population of samples as reference. Thanks to this population view, the whole genome can be robustly interrogated, including regions of low mappability. Moreover, any divergence from the reference samples are detected, even if the signal is incomplete, e.g. tumoral aberrations or SV involving repeats.

The manuscript presenting the methods and an application to epilepsy patients was published in PLoS Genetics. More details about the publications.

Getting started

Installation

This install command requires devtools package which can be easily installed with :

install.packages("devtools")

Some Bioconductor packages are also necessary and not installed automatically. Running the following command should be sufficient :

source("http://bioconductor.org/biocLite.R")
biocLite(c("BSgenome.Hsapiens.UCSC.hg19", "Rsamtools", "DNAcopy", "rtracklayer"))

To use hg38 instead of hg19 install BSgenome.Hsapiens.UCSC.hg38.

Then, run the following to install the latest development version:

devtools::install_github("jmonlong/PopSV")

If you get an error, you can try the following instead:

devtools::install_git("git://github.com/jmonlong/PopSV.git")

R 3.1 or higher is required.

Usage

PopSV package is used as any R package, by simply loading the library and using the provided funtions.

library(PopSV)
...

There are a few functions to run the CNV analysis and others to explore the results.

I don’t want to think, how can I run it quickly ?

After quickly configuring batchtools for your computing cluster, you can run the automated pipeline (files in the scripts folder of the GitHub repo):

## Load package and wrapper
library(PopSV)
source("automatedPipeline-batchtools.R")
## Prepare file paths and bins
bam.files = read.table("bams.tsv", as.is=TRUE, header=TRUE)
files.df = init.filenames(bam.files, code="example")
save(files.df, file="files.RData")
bin.size = 1e3
bins.df = fragment.genome.hp19(bin.size)
save(bins.df, file="bins.RData")
## Run PopSV
res.GCcounts = autoGCcounts("files.RData", "bins.RData")
cnvs.df = autoNormTest("files.RData", "bins.RData")

In this example bams.tsv is a tab-delimited file with a column sample (with the sample names) and a column bam (with the path to each BAM file). The BAM files must be sorted and indexed.

In practice I run these commands in the login node of our HPC cluster because it sends jobs to the cluster. I also have this in a screen so I can detach the screen, disconnect from the server and let it run on the background.

Workflow

PopSV workflow

First the genome is fragmented and reads mapping in each bin are counted for each sample and GC corrected (1). Next, coverage of the sample is normalized (2) and each bin is tested by computing a Z-score (3), estimating p-values (4) and identifying abnormal regions (5).

A quick description of the different analysis steps and their corresponding functions can be found in this page.

Running PopSV on computing clusters

PopSV workflow uses BatchJobs package or its successor the batchtools package to send computations to a cluster. It needs some configuration but then it saves a lot of time and the pipeline can be run easily. For more information on how to configure these package for your HPC system, go to this page.

Questions ?

Questions about reference samples, bin size, exome or other ? Have a look at the FAQ.

Otherwise, feel free to post an issue on GitHub or to send me an email.