Hippocamplus My Second Memory

Gencode exploration



Focusing on autosomes/X/Y, there are 57,783 “genes” of different types:

I merge the rare types into an “other” class and group some of the RNA types into a single “RNA” class.

gene_type.f n
protein_coding 20332
pseudogene 13931
other 7417
lincRNA 7114
RNA 5934
miRNA 3055
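The lumping step can be sketched as follows. This is a minimal illustration in Python and not the code used for the table above: the function name `lump_rare_types`, the `keep` cutoff, and the toy labels are assumptions made for the example.

```python
from collections import Counter

def lump_rare_types(gene_types, keep=4, other_label="other"):
    """Keep the `keep` most frequent labels and relabel the rest as `other_label`."""
    counts = Counter(gene_types)
    top = {label for label, _ in counts.most_common(keep)}
    return [t if t in top else other_label for t in gene_types]

# Toy input, not real GENCODE counts.
toy_types = (
    ["protein_coding"] * 5
    + ["pseudogene"] * 3
    + ["lincRNA"] * 2
    + ["TR_V_gene", "IG_C_gene"]
)
lumped = Counter(lump_rare_types(toy_types, keep=3))
print(lumped)
```

With `keep=3`, the two singleton types in the toy input end up in the “other” class while the three frequent types are left untouched.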


The largest annotated genes span more than 2 Mbp:

seqnames gene_name gene_type size.Mbp
chr7 CNTNAP2 protein_coding 2.304638
chr9 PTPRD protein_coding 2.298478
chrX DMD protein_coding 2.241765
chr3 LSAMP protein_coding 2.194861
chr11 DLG2 protein_coding 2.172912

The smallest annotated protein-coding genes are about 100 bp or shorter:

seqnames gene_name gene_type size.bp
chr2 AC011308.1 protein_coding 59
chr12 AC055736.1 protein_coding 66
chr16 PIH1 protein_coding 93
chr2 AC012360.2 protein_coding 96
chr5 AC008914.1 protein_coding 102
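Gene sizes like the ones above follow directly from the annotation coordinates. GTF coordinates are 1-based and inclusive, so a size is `end - start + 1`. Here is a rough Python sketch, assuming a GENCODE-style GTF with `gene_name` and `gene_type` attributes; the helper names and the toy records are made up for the example.

```python
import re

def parse_attributes(text):
    """Parse the quoted key "value"; pairs of a GTF attribute column into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)"', text))

def gene_sizes(gtf_lines):
    """Yield (chrom, gene_name, gene_type, size_bp) for each 'gene' record."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # skip transcripts, exons, etc.
        attrs = parse_attributes(fields[8])
        start, end = int(fields[3]), int(fields[4])
        # GTF intervals are 1-based and closed, hence the +1.
        yield fields[0], attrs.get("gene_name"), attrs.get("gene_type"), end - start + 1

# Toy records with made-up coordinates, not real GENCODE entries.
toy_gtf = [
    '##description: toy example',
    'chrT\tTOY\tgene\t101\t159\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "TOY1";',
    'chrT\tTOY\texon\t101\t159\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "TOY1";',
    'chrT\tTOY\tgene\t1000\t3000\t.\t-\t.\tgene_id "G2"; gene_type "pseudogene"; gene_name "TOY2";',
]
sizes = sorted(gene_sizes(toy_gtf), key=lambda rec: rec[3])
for rec in sizes:
    print(rec)
```

Sorting by the last field gives the smallest genes first; reversing the sort gives the largest.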


Using non-overlapping windows of 1 Mbp, the gene density looks like this:

  • Chr 19, 17 and 11 have more protein-coding genes than the rest.
  • Chr Y has more pseudogenes than genes of any other class.
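Counting genes in non-overlapping windows amounts to integer division of a position by the window size. A small Python sketch, assuming each gene is assigned to the window that contains its start position (assigning by midpoint or by any overlap are equally valid choices and would give slightly different counts):

```python
from collections import Counter

WINDOW_BP = 1_000_000  # 1 Mbp windows

def gene_density(genes, window_bp=WINDOW_BP):
    """Count genes per (chrom, window index, gene type).

    `genes` is an iterable of (chrom, start, gene_type) with 1-based starts.
    """
    counts = Counter()
    for chrom, start, gene_type in genes:
        window = (start - 1) // window_bp  # 1-based start -> 0-based window index
        counts[(chrom, window, gene_type)] += 1
    return counts

# Toy genes with made-up positions.
toy_genes = [
    ("chr19", 500_000, "protein_coding"),
    ("chr19", 999_999, "protein_coding"),
    ("chr19", 1_000_001, "pseudogene"),
    ("chrY", 2_500_000, "pseudogene"),
]
density = gene_density(toy_genes)
for key in sorted(density):
    print(key, density[key])
```

Windows that contain no gene never appear in the counter, so they would need to be filled in with zeros before plotting or averaging per chromosome.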



Focusing on autosomes/X/Y, there are 1,196,256 “exons” from different types of genes:

For the rest of the analysis, I use only exons from protein-coding genes.

Number per gene

mean.nb.exon median.nb.exon max.nb.exon
52.9 31 1696
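The summary above is a count of exon records per gene followed by a mean, median and maximum. A minimal Python sketch of that computation, assuming one gene name per exon record; note that it counts every record, so an exon shared by several transcripts is counted once per record, which may or may not match how the table was produced.

```python
from collections import Counter
from statistics import mean, median

def exons_per_gene_summary(exon_gene_names):
    """Summarise the number of exon records per gene."""
    per_gene = Counter(exon_gene_names)
    counts = list(per_gene.values())
    return {"mean": mean(counts), "median": median(counts), "max": max(counts)}

# Toy input: gene A has 1 exon record, B has 3, C has 8.
toy_exons = ["A"] + ["B"] * 3 + ["C"] * 8
summary = exons_per_gene_summary(toy_exons)
print(summary)
```

A mean well above the median, as in the table, points to a long right tail: a few genes with very many exon records.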

The gene with the most exons is the Titin gene:

gene_name exon
TTN 1696
SYNE1 1377
NEB 1225
CACNA1G 1139
CACNA1C 1098


The average exon size is 232 bp.

The first exon seems to be slightly larger than the others. I used genes with at least 10 exons to be sure it’s not due to large single-exon genes.
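The comparison can be sketched in Python as below. This is an illustration under assumptions, not the original analysis: exons are ordered per gene by coordinate, the first exon is taken in the direction of transcription (largest coordinates first on the minus strand), and the function name and toy data are invented for the example.

```python
from statistics import mean

def first_vs_other_exon_sizes(genes, min_exons=10):
    """Return (mean size of first exons, mean size of all other exons).

    `genes` maps a gene name to (strand, [(start, end), ...]) with 1-based
    inclusive coordinates. Genes with fewer than `min_exons` exons are skipped.
    Returns None if no gene passes the filter.
    """
    first, others = [], []
    for strand, exons in genes.values():
        if len(exons) < min_exons:
            continue
        # On the minus strand, transcription starts at the highest coordinate.
        ordered = sorted(exons, reverse=(strand == "-"))
        sizes = [end - start + 1 for start, end in ordered]
        first.append(sizes[0])
        others.extend(sizes[1:])
    if not first:
        return None
    return mean(first), mean(others)

# Toy genes with made-up coordinates; min_exons is lowered to 3 for the demo.
toy = {
    "PLUS": ("+", [(1, 300), (501, 600), (801, 900)]),
    "MINUS": ("-", [(1, 100), (201, 300), (501, 900)]),
    "SINGLE": ("+", [(1, 5000)]),  # large single-exon gene, filtered out
}
result = first_vs_other_exon_sizes(toy, min_exons=3)
print(result)
```

The single-exon toy gene shows why the filter matters: without it, its 5,000 bp exon would count as a “first exon” and inflate that group.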

The largest annotated exons are more than 20 kbp long:

seqnames gene_name size.kb
chr8 TRAPPC9 29.06
chr5 MCC 27.20
chr12 GRIN2B 24.41
chr19 MUC16 21.69
chr2 ABI2 20.55

The smallest are just 1 bp!?

seqnames gene_name size.bp
chr2 ALK 1
chr3 ACAD11 1
chr4 PPA2 1
chr5 PAM 1
chr5 GALNT10 1