Hippocamplus My Second Memory

Cancer genes and CNA hotspots

This is an updated version of an old private post where I had prepared some R objects with cancer genes and CNA hotspots. I used this to quickly annotate copy number results in cancer projects. The file was almost 3 years old so here is an updated version (as of today, Apr 3 2019).

Candidate Cancer Gene Database (CCGD)

The Candidate Cancer Gene Database (CCGD) was developed to disseminate the results of transposon-based forward genetic screens in mice that identify candidate cancer genes.

I downloaded the version available on Apr 3 2019. There is information about each study that reports a gene as a cancer driver. I’m mostly interested in the list of cancer drivers. As secondary information, I also save the cancer type(s) and the predicted effect for each gene.

For each study, the predicted effect is either Not Determined, Gain or Loss. I define an effect field with the major “determined” effect, i.e. Gain or Loss if at least one study determined it, and Not Determined otherwise. The effects column contains all the predicted effects with the number of supporting studies. It looks like this:

| gene | effect | effects |
|---|---|---|
| ABCD3 | Gain | Not Determined(3),Gain(1) |
| ADAM19 | Gain | Gain(1) |
| AAK1 | Loss | Not Determined(6),Loss(1) |
| ABHD2 | Loss | Not Determined(7),Loss(1) |
| A1CF | Not Determined | Not Determined(3) |
| A4GALT | Not Determined | Not Determined(1) |
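As a minimal sketch of how these two fields could be derived from per-study records, here is a base R version. The toy `ccgd` data frame, its column names and the `summarize_effects` helper are assumptions for illustration, not the actual CCGD file layout or the code used for this post.

```r
# Toy per-study records (hypothetical genes and values, not the real CCGD download)
ccgd <- data.frame(
  gene   = c("G1", "G1", "G1", "G1", "G2", "G3", "G3"),
  effect = c("Not Determined", "Not Determined", "Not Determined", "Gain",
             "Loss", "Not Determined", "Not Determined"),
  stringsAsFactors = FALSE
)

# Summarize the predicted effects of one gene across studies
summarize_effects <- function(eff) {
  counts <- sort(table(eff), decreasing = TRUE)
  # "effects": every predicted effect with its number of supporting studies
  effects <- paste0(names(counts), "(", counts, ")", collapse = ",")
  # "effect": most frequent effect among the determined ones (Gain/Loss);
  # ties keep the order returned by sort(), an arbitrary choice in this sketch
  determined <- counts[names(counts) != "Not Determined"]
  effect <- if (length(determined) > 0) names(determined)[1] else "Not Determined"
  data.frame(effect = effect, effects = effects, stringsAsFactors = FALSE)
}

per_gene <- lapply(split(ccgd$effect, ccgd$gene), summarize_effects)
ccgd.genes <- data.frame(gene = names(per_gene), do.call(rbind, per_gene),
                         row.names = NULL, stringsAsFactors = FALSE)
print(ccgd.genes)
```

A gene with three Not Determined studies and one Gain study ends up with effect Gain, like ABCD3 in the table above.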

In total, there are 9488 cancer driver genes in this list. For most of them we don’t know the effect.

| effect | genes |
|---|---|
| Not Determined | 8141 |
| Loss | 1151 |
| Gain | 196 |

Cancer Gene Census

The gene list can be downloaded from the COSMIC website. Users must register and log in to download it. I’m using version 88 on GRCh38.

The Cancer Gene Census is an ongoing effort to catalogue those genes for which mutations have been causally implicated in cancer. The original census and analysis were published in Nature Reviews Cancer.

Here each gene is annotated as oncogene, TSG (tumor suppressor gene), fusion, or a combination of those. I’ll also save the tumor types where somatic mutations were observed. It looks like this:

| gene | role | cgcTumor |
|---|---|---|
| A1CF | oncogene | melanoma |
| ABI1 | TSG, fusion | AML |
| ABL1 | oncogene, fusion | CML, ALL, T-ALL |
| ABL2 | oncogene, fusion | AML |
| ACKR3 | oncogene, fusion | lipoma |
| ACSL3 | fusion | prostate |

In total, there are 723 cancer driver genes in this list. A gene can have several roles, so the counts below add up to more than 723. The roles are distributed as follows:

| role | genes |
|---|---|
| fusion | 363 |
| oncogene | 315 |
| TSG | 315 |
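Counting roles when a gene can carry several of them amounts to splitting the comma-separated role string before tabulating. A small base R sketch, using a few rows from the example table above (the `cgc` data frame name and its columns are assumptions):

```r
# A few rows from the example table above
cgc <- data.frame(
  gene = c("A1CF", "ABI1", "ABL1", "ACSL3"),
  role = c("oncogene", "TSG, fusion", "oncogene, fusion", "fusion"),
  stringsAsFactors = FALSE
)

# Split the combined roles and count each role once per gene
roles <- strsplit(cgc$role, ", ", fixed = TRUE)
role.counts <- sort(table(unlist(roles)), decreasing = TRUE)
print(role.counts)

# The per-role counts add up to more than the number of genes
sum(role.counts) > nrow(cgc)
```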

Merge the gene lists

I merged the two gene lists into a driver.genes data.frame:

| gene | effect | effects | ccgdTumor | role | cgcTumor | ccgd | cgc |
|---|---|---|---|---|---|---|---|
| ACKR3 | NA | NA | NA | oncogene, fusion | lipoma | FALSE | TRUE |
| ACSL3 | NA | NA | NA | fusion | prostate | FALSE | TRUE |
| A4GALT | Not Determined | Not Determined(1) | Blood Cancer | NA | NA | TRUE | FALSE |
| AAAS | Not Determined | Not Determined(1) | Colorectal Cancer | NA | NA | TRUE | FALSE |
| A1CF | Not Determined | Not Determined(3) | Liver Cancer | oncogene | melanoma | TRUE | TRUE |
| ACSL6 | Not Determined | Not Determined(2) | Blood Cancer | fusion | AML, AEL | TRUE | TRUE |
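One way to build such a table is a full outer join on the gene name, followed by two membership flags. This is a sketch with base R `merge`, assuming the two per-database tables are called `ccgd.genes` and `cgc.genes` (hypothetical names); the rows are taken from the examples above.

```r
# Per-database tables with a few rows from the examples above
ccgd.genes <- data.frame(
  gene      = c("A4GALT", "A1CF"),
  effect    = c("Not Determined", "Not Determined"),
  effects   = c("Not Determined(1)", "Not Determined(3)"),
  ccgdTumor = c("Blood Cancer", "Liver Cancer"),
  stringsAsFactors = FALSE
)
cgc.genes <- data.frame(
  gene     = c("ACKR3", "A1CF"),
  role     = c("oncogene, fusion", "oncogene"),
  cgcTumor = c("lipoma", "melanoma"),
  stringsAsFactors = FALSE
)

# Full outer join: keep genes present in either list, NA where missing
driver.genes <- merge(ccgd.genes, cgc.genes, by = "gene", all = TRUE)

# Flag which database(s) reported each gene
driver.genes$ccgd <- driver.genes$gene %in% ccgd.genes$gene
driver.genes$cgc  <- driver.genes$gene %in% cgc.genes$gene
print(driver.genes)
```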

“Effect” vs “Role”?

I would expect a loss for tumor suppressors, and a gain of function for oncogenes. Are the two databases consistent? For the genes present in both lists and with a determined effect:

| effect | role | genes |
|---|---|---|
| Loss | TSG | 42 |
| Loss | TSG, fusion | 21 |
| Loss | fusion | 19 |
| Loss | oncogene, fusion | 17 |
| Gain | oncogene, fusion | 12 |
| Loss | oncogene | 12 |
| Gain | oncogene | 8 |
| Loss | oncogene, TSG | 6 |
| Gain | oncogene, TSG, fusion | 5 |
| Gain | TSG | 5 |
| Gain | fusion | 4 |
| Loss | oncogene, TSG, fusion | 4 |
| Gain | TSG, fusion | 3 |
| Gain | oncogene, TSG | 2 |

Kind of:

  • Most of the Loss genes have a TSG role (73 of 121).
  • Most of the Gain genes have an oncogene role (27 of 39).
  • However, 39 Loss genes also have an oncogene role, and 15 Gain genes have a TSG role.
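The comparison above boils down to a cross-tabulation of effect and role, restricted to genes found in both databases with a determined effect. A hedged base R sketch on a toy `driver.genes` (hypothetical genes and values, only the column names follow the merged table):

```r
# Toy merged table (hypothetical genes, not the real data)
driver.genes <- data.frame(
  gene   = paste0("G", 1:6),
  effect = c("Loss", "Loss", "Loss", "Gain", "Gain", "Not Determined"),
  role   = c("TSG", "TSG, fusion", "oncogene", "oncogene", "TSG", "TSG"),
  ccgd   = TRUE,
  cgc    = TRUE,
  stringsAsFactors = FALSE
)

# Keep genes in both databases with a determined effect
both <- subset(driver.genes, ccgd & cgc & effect != "Not Determined")

# Number of genes for each effect/role combination
counts <- aggregate(gene ~ effect + role, data = both, FUN = length)
counts <- counts[order(-counts$gene), ]
print(counts)

# Proportion of Loss genes with a TSG role, and of Gain genes with an oncogene role
loss <- both[both$effect == "Loss", ]
gain <- both[both$effect == "Gain", ]
tsg.in.loss <- mean(grepl("TSG", loss$role, fixed = TRUE))
onc.in.gain <- mean(grepl("oncogene", gain$role, fixed = TRUE))
```

Matching on the role substring rather than the full string counts combined roles such as “TSG, fusion” as TSG, which is how the proportions in the list above are computed.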

Known CNA hotspots

Zack et al. identified hotspots of somatic CNA from ~5,000 tumors across 11 cancer types, using CNA calls from the TCGA SNP arrays. I downloaded Supplementary Table 2, the pan-cancer regions of significant somatic CNA, and cleaned up the xls file into a csv file.

In total there are 140 CNA hotspots.

| type | regions |
|---|---|
| gain | 70 |
| loss | 70 |

Caution: the hotspot coordinates are on hg19, whereas the Cancer Gene Census version above is on GRCh38!
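To illustrate how the hotspots can annotate a CNA call, here is a base R sketch of the interval-overlap test. The `hotspots` data frame, its coordinates and the `overlapping_hotspots` helper are made up for illustration; with a GRanges object, `findOverlaps` from the GenomicRanges package would be the more usual route.

```r
# Toy hotspot table (hypothetical coordinates, assumed to be on hg19)
hotspots <- data.frame(
  chr   = c("chr1", "chr1", "chr8"),
  start = c(1000, 50000, 2000),
  end   = c(5000, 60000, 9000),
  type  = c("gain", "loss", "gain"),
  stringsAsFactors = FALSE
)

# Return the hotspots of the same type that overlap a CNA segment
overlapping_hotspots <- function(chr, start, end, type, hotspots) {
  hit <- hotspots$chr == chr & hotspots$type == type &
    hotspots$start <= end & hotspots$end >= start
  hotspots[hit, ]
}

# A loss on chr1 spanning both chr1 hotspots only matches the "loss" one
hits <- overlapping_hotspots("chr1", 4000, 55000, "loss", hotspots)
print(hits)
```

The segment being annotated must be on the same genome build as the hotspots, so calls made on GRCh38 would need to be lifted over first.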

Saving the R objects

I saved the driver.genes data.frame and the cna.zack.hg19.gr GRanges object into a .RData file. It’s available there.
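For reference, saving and reloading such objects could look like the sketch below. It uses a temporary file and a one-row toy `driver.genes`, both assumptions for illustration; the `cnv.genes` vector is a hypothetical list of genes from a copy number analysis.

```r
# Toy object standing in for the real driver.genes data.frame
driver.genes <- data.frame(gene = "A1CF", ccgd = TRUE, cgc = TRUE,
                           stringsAsFactors = FALSE)

# Save to a .RData file (temporary path used here as a placeholder)
rdata.file <- tempfile(fileext = ".RData")
save(driver.genes, file = rdata.file)

# Load into a separate environment to avoid overwriting existing objects
env <- new.env()
loaded <- load(rdata.file, envir = env)
print(loaded)

# Quick annotation: which genes of interest are known drivers?
cnv.genes <- c("A1CF", "XYZ1")
is.driver <- cnv.genes %in% env$driver.genes$gene
```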