Cancer genes and CNA hotspots
Apr 3 2019 genome data R
This is an updated version of an old private post in which I prepared some R objects with cancer genes and CNA hotspots. I used them to quickly annotate copy number results in cancer projects. The file was almost 3 years old, so here is an updated version (as of today, Apr 3 2019).
Candidate Cancer Gene Database (CCGD)
The Candidate Cancer Gene Database (CCGD) was developed to disseminate the results of transposon-based forward genetic screens in mice that identify candidate cancer genes.
I downloaded the version available on Apr 3 2019. It contains information about each study that reports a gene as a cancer driver. I’m mostly interested in the list of cancer drivers. As secondary information, I will save the cancer type(s) and the predicted effect for each gene.
For each study, the predicted effect is either Not Determined, Gain or Loss. I define an effect field with the major “determined” effect, i.e. the most supported effect among Gain and Loss, falling back to Not Determined when no study determined one. The effects column contains all the predicted effects with the number of supporting studies. It looks like this:
gene | effect | effects |
---|---|---|
ABCD3 | Gain | Not Determined(3),Gain(1) |
ADAM19 | Gain | Gain(1) |
AAK1 | Loss | Not Determined(6),Loss(1) |
ABHD2 | Loss | Not Determined(7),Loss(1) |
A1CF | Not Determined | Not Determined(3) |
A4GALT | Not Determined | Not Determined(1) |
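As a minimal sketch of how these two fields could be derived in base R (not necessarily the code used for this post): it assumes a hypothetical per-study table `ccgd` with columns `gene` and `effect`, and it breaks ties between Gain and Loss alphabetically, which is an arbitrary choice of mine. The helper name `summarize_effects` is also made up for illustration.

```r
# Toy per-study records; the real CCGD export has more columns and other names.
ccgd <- data.frame(
  gene = c("ABCD3", "ABCD3", "ABCD3", "ABCD3", "AAK1", "AAK1", "A1CF"),
  effect = c("Not Determined", "Gain", "Not Determined", "Not Determined",
             "Not Determined", "Loss", "Not Determined"),
  stringsAsFactors = FALSE
)

summarize_effects <- function(eff) {
  # Count the studies supporting each effect, most supported first
  counts <- sort(table(eff), decreasing = TRUE)
  # Keep only the "determined" effects (Gain or Loss)
  determined <- counts[names(counts) != "Not Determined"]
  major <- if (length(determined) > 0) names(determined)[1] else "Not Determined"
  data.frame(
    effect = major,
    effects = paste0(names(counts), "(", counts, ")", collapse = ","),
    stringsAsFactors = FALSE
  )
}

# One summary row per gene
per_gene <- lapply(split(ccgd$effect, ccgd$gene), summarize_effects)
gene_effects <- data.frame(gene = names(per_gene), do.call(rbind, per_gene),
                           stringsAsFactors = FALSE)
rownames(gene_effects) <- NULL
gene_effects
```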
In total, there are 9488 cancer driver genes in this list. For most of them we don’t know the effect.
effect | genes |
---|---|
Not Determined | 8141 |
Loss | 1151 |
Gain | 196 |
Cancer Gene Census
The gene list can be downloaded from the COSMIC website. Users must register and log in to download it. I’m using version 88 on GRCh38.
The Cancer Gene Census is an ongoing effort to catalogue those genes for which mutations have been causally implicated in cancer. The original census and analysis were published in Nature Reviews Cancer.
Here each gene is annotated as oncogene, TSG (tumor suppressor gene) or fusion (or a combination of those). I’ll also save the tumor types where somatic mutations were observed. It looks like this:
gene | role | cgcTumor |
---|---|---|
A1CF | oncogene | melanoma |
ABI1 | TSG, fusion | AML |
ABL1 | oncogene, fusion | CML, ALL, T-ALL |
ABL2 | oncogene, fusion | AML |
ACKR3 | oncogene, fusion | lipoma |
ACSL3 | fusion | prostate |
In total, there are 723 cancer driver genes in this list. The roles are distributed as follows (a gene with several roles is counted once per role, so the counts sum to more than 723):
role | genes |
---|---|
fusion | 363 |
oncogene | 315 |
TSG | 315 |
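A per-role count like this one can be obtained by splitting the combined role field before tabulating. Here is a small sketch in base R on the six example genes shown earlier; the data frame name `cgc` is an assumption, not the name of a real object from this post.

```r
# Toy subset of the Cancer Gene Census table (the six example genes above)
cgc <- data.frame(
  gene = c("A1CF", "ABI1", "ABL1", "ABL2", "ACKR3", "ACSL3"),
  role = c("oncogene", "TSG, fusion", "oncogene, fusion",
           "oncogene, fusion", "oncogene, fusion", "fusion"),
  stringsAsFactors = FALSE
)

# Split each combined role into its individual roles
roles <- strsplit(cgc$role, ", ", fixed = TRUE)
# One count per (gene, role) pair, so multi-role genes are counted several times
role_counts <- sort(table(unlist(roles)), decreasing = TRUE)
role_counts
```

On this toy subset the counts sum to 10 for 6 genes, for the same reason the counts in the table above exceed the number of genes.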
Merge the gene lists
I merged the two gene lists into a driver.genes data.frame:
gene | effect | effects | ccgdTumor | role | cgcTumor | ccgd | cgc |
---|---|---|---|---|---|---|---|
ACKR3 | NA | NA | NA | oncogene, fusion | lipoma | FALSE | TRUE |
ACSL3 | NA | NA | NA | fusion | prostate | FALSE | TRUE |
A4GALT | Not Determined | Not Determined(1) | Blood Cancer | NA | NA | TRUE | FALSE |
AAAS | Not Determined | Not Determined(1) | Colorectal Cancer | NA | NA | TRUE | FALSE |
A1CF | Not Determined | Not Determined(3) | Liver Cancer | oncogene | melanoma | TRUE | TRUE |
ACSL6 | Not Determined | Not Determined(2) | Blood Cancer | fusion | AML, AEL | TRUE | TRUE |
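For illustration, a merge of this kind could look like the following in base R. This is a sketch on a few toy rows, not the original code; the input names `ccgd.genes` and `cgc.genes` are hypothetical, and it assumes both lists use the same gene symbols.

```r
# Toy versions of the two per-gene tables (hypothetical object names)
ccgd.genes <- data.frame(
  gene = c("A4GALT", "A1CF"),
  effect = c("Not Determined", "Not Determined"),
  effects = c("Not Determined(1)", "Not Determined(3)"),
  ccgdTumor = c("Blood Cancer", "Liver Cancer"),
  stringsAsFactors = FALSE
)
cgc.genes <- data.frame(
  gene = c("ACKR3", "A1CF"),
  role = c("oncogene, fusion", "oncogene"),
  cgcTumor = c("lipoma", "melanoma"),
  stringsAsFactors = FALSE
)

# Full outer join on the gene symbol: genes missing from one list get NA there
driver.genes <- merge(ccgd.genes, cgc.genes, by = "gene", all = TRUE)
# Flag which database(s) each gene comes from
driver.genes$ccgd <- driver.genes$gene %in% ccgd.genes$gene
driver.genes$cgc <- driver.genes$gene %in% cgc.genes$gene
driver.genes
```

The `ccgd` and `cgc` flags make it easy to filter later, for example `subset(driver.genes, ccgd & cgc)` for genes reported by both databases.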
“Effect” vs “Role”?
I would expect a loss for tumor suppressors and a gain of function for oncogenes. Are the two databases consistent? Among the genes with a determined effect in CCGD and a role in the Cancer Gene Census:
effect | role | genes |
---|---|---|
Loss | TSG | 42 |
Loss | TSG, fusion | 21 |
Loss | fusion | 19 |
Loss | oncogene, fusion | 17 |
Gain | oncogene, fusion | 12 |
Loss | oncogene | 12 |
Gain | oncogene | 8 |
Loss | oncogene, TSG | 6 |
Gain | oncogene, TSG, fusion | 5 |
Gain | TSG | 5 |
Gain | fusion | 4 |
Loss | oncogene, TSG, fusion | 4 |
Gain | TSG, fusion | 3 |
Gain | oncogene, TSG | 2 |
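Kind of:
- Most of the genes with a Loss effect are TSGs (73 of 121, counting combined roles).
- Most of the genes with a Gain effect are oncogenes (27 of 39).
- However, a number of Loss genes are also annotated as oncogenes (39 of 121), and a few Gain genes as TSGs.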
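A comparison like this can be sketched in base R by flagging whether each role mentions TSG or oncogene and cross-tabulating against the effect. The genes below are made-up placeholders (`g1` to `g5`), only the column names follow the driver.genes data.frame.

```r
# Toy merged table with placeholder genes
driver.genes <- data.frame(
  gene = c("g1", "g2", "g3", "g4", "g5"),
  effect = c("Loss", "Loss", "Gain", "Gain", "Not Determined"),
  role = c("TSG", "oncogene, fusion", "oncogene", "TSG, fusion", "TSG"),
  stringsAsFactors = FALSE
)

# Keep genes with a determined effect and a known role
known <- subset(driver.genes, effect %in% c("Gain", "Loss") & !is.na(role))
# Does the role mention TSG / oncogene, possibly combined with other roles?
known$is_tsg <- grepl("TSG", known$role, fixed = TRUE)
known$is_oncogene <- grepl("oncogene", known$role, fixed = TRUE)

# Cross-tabulate the effect against each role flag
table(effect = known$effect, TSG = known$is_tsg)
table(effect = known$effect, oncogene = known$is_oncogene)
```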
Known CNA hotspots
Zack et al. identified hotspots of somatic CNA from ~5,000 tumors across 11 cancer types. They called CNAs from the SNP-array data in TCGA. I downloaded Supplementary Table 2, the pan-cancer regions of significant somatic CNA, and cleaned up the xls file into a csv file.
In total there are 140 CNA hotspots.
type | regions |
---|---|
gain | 70 |
loss | 70 |
Caution: the hotspot coordinates are in hg19!
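To give an idea of how hotspots like these can be used to annotate copy number calls, here is a minimal base R sketch that counts the hotspots overlapping each call. Everything in it is a toy stand-in: the coordinates are made up, and `hotspots`, `cnv` and `count_hotspots` are hypothetical names. With a real GRanges object, `findOverlaps` or `countOverlaps` from the GenomicRanges package would be the more natural choice, and the calls would need to be in hg19 as well.

```r
# Toy hotspots and copy number calls with made-up coordinates
hotspots <- data.frame(
  chr = c("chr1", "chr1", "chr2"),
  start = c(1000, 50000, 2000),
  end = c(5000, 60000, 8000),
  type = c("gain", "loss", "gain"),
  stringsAsFactors = FALSE
)
cnv <- data.frame(
  chr = c("chr1", "chr2", "chr3"),
  start = c(4000, 9000, 100),
  end = c(52000, 9500, 900),
  stringsAsFactors = FALSE
)

# Two intervals on the same chromosome overlap if each starts before the other ends
count_hotspots <- function(chr, start, end) {
  sum(hotspots$chr == chr & hotspots$start <= end & hotspots$end >= start)
}
cnv$n_hotspots <- mapply(count_hotspots, cnv$chr, cnv$start, cnv$end,
                         USE.NAMES = FALSE)
cnv
```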
Saving the R objects
I saved the driver.genes data.frame and the cna.zack.hg19.gr GRanges object into a .RData file.
It’s available there.