PopSV is a Copy-Number Variation (CNV) detection method from high-throughput sequencing. Abnormal Read-Depth signal is detected by using a population of samples as reference. Thanks to this population view, the whole genome can be robustly interrogated, including regions of low mappability. Moreover, any divergence from the reference samples are detected, even if the signal is incomplete, e.g. tumoral aberrations or SV involving repeats.
This install command requires devtools package which can be easily installed with :
Some Bioconductor packages are also necessary and not installed automatically. Running the following command should be sufficient :
source("http://bioconductor.org/biocLite.R") biocLite(c("BSgenome.Hsapiens.UCSC.hg19", "Rsamtools", "DNAcopy", "rtracklayer"))
To use hg38 instead of hg19 install
Then, run the following to install the latest development version:
If you get an error, you can try the following instead:
R 3.1 or higher is required.
PopSV package is used as any R package, by simply loading the library and using the provided funtions.
There are a few functions to run the CNV analysis and others to explore the results.
## Load package and wrapper library(PopSV) source("automatedPipeline-batchtools.R") ## Prepare file paths and bins bam.files = read.table("bams.tsv", as.is=TRUE, header=TRUE) files.df = init.filenames(bam.files, code="example") save(files.df, file="files.RData") bin.size = 1e3 bins.df = fragment.genome.hp19(bin.size) save(bins.df, file="bins.RData") ## Run PopSV res.GCcounts = autoGCcounts("files.RData", "bins.RData") cnvs.df = autoNormTest("files.RData", "bins.RData")
In this example
bams.tsv is a tab-delimited file with a column sample (with the sample names) and a column bam (with the path to each BAM file). The BAM files must be sorted and indexed.
In practice I run these commands in the login node of our HPC cluster because it sends jobs to the cluster. I also have this in a screen so I can detach the screen, disconnect from the server and let it run on the background.
First the genome is fragmented and reads mapping in each bin are counted for each sample and GC corrected (1). Next, coverage of the sample is normalized (2) and each bin is tested by computing a Z-score (3), estimating p-values (4) and identifying abnormal regions (5).
A quick description of the different analysis steps and their corresponding functions can be found in this page.
PopSV workflow uses BatchJobs package or its successor the batchtools package to send computations to a cluster. It needs some configuration but then it saves a lot of time and the pipeline can be run easily. For more information on how to configure these package for your HPC system, go to this page.
Questions about reference samples, bin size, exome or other ? Have a look at the FAQ.