Hippocamplus My Second Memory

Gencode exploration

Genes

Number

Focusing on autosomes/X/Y, there are 57,783 “genes” of different types:

I merge the rare types into an “other” class and group some of the RNA types into a single “RNA” class.

gene_type.f n
protein_coding 20332
pseudogene 13931
other 7417
lincRNA 7114
RNA 5934
miRNA 3055
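
To illustrate this kind of grouping, here is a minimal Python sketch. The function name `collapse_gene_types`, the set of types kept as-is and the set folded into “RNA” are assumptions made for the example, not the exact lists behind the table above, and the input is toy data.

```python
from collections import Counter


def collapse_gene_types(gene_types, keep, rna_types):
    """Count genes per class after merging some gene types.

    Types in `keep` stay as they are, types in `rna_types` are pooled
    into "RNA", and everything else goes to "other".
    """
    counts = Counter()
    for gene_type in gene_types:
        if gene_type in keep:
            counts[gene_type] += 1
        elif gene_type in rna_types:
            counts["RNA"] += 1
        else:
            counts["other"] += 1
    return counts


# Toy input (one entry per gene), not real Gencode counts.
toy_types = ["protein_coding", "protein_coding", "snRNA", "rRNA",
             "IG_V_gene", "lincRNA"]

# Assumed grouping, chosen only for this example.
counts = collapse_gene_types(
    toy_types,
    keep={"protein_coding", "lincRNA"},
    rna_types={"snRNA", "rRNA"},
)

for gene_type, n in counts.most_common():
    print(gene_type, n)
```

Sorting with `most_common()` gives the same "largest class first" layout as the table.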

Size

The largest annotated genes span more than 2 Mbp:

seqnames gene_name gene_type size.Mbp
chr7 CNTNAP2 protein_coding 2.304638
chr9 PTPRD protein_coding 2.298478
chrX DMD protein_coding 2.241765
chr3 LSAMP protein_coding 2.194861
chr11 DLG2 protein_coding 2.172912

The smallest annotated protein-coding genes span about 100 bp or less:

seqnames gene_name gene_type size.bp
chr2 AC011308.1 protein_coding 59
chr12 AC055736.1 protein_coding 66
chr16 PIH1 protein_coding 93
chr2 AC012360.2 protein_coding 96
chr5 AC008914.1 protein_coding 102
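
For reference, a gene's size can be derived from its annotated coordinates. The sketch below assumes 1-based, inclusive start/end positions (as in GTF files), so the size is `end - start + 1`. The gene names and coordinates are hypothetical, and this is not necessarily how the sizes above were computed.

```python
def gene_size_bp(start, end):
    """Size in bp of a feature with 1-based, inclusive coordinates."""
    return end - start + 1


# Hypothetical genes as (name, start, end); not real annotation records.
toy_genes = [
    ("geneA", 1_000, 1_058),
    ("geneB", 200_000, 2_450_000),
    ("geneC", 50_000, 62_500),
]

sizes = {name: gene_size_bp(start, end) for name, start, end in toy_genes}

# Largest and smallest gene by size.
largest = max(sizes, key=sizes.get)
smallest = min(sizes, key=sizes.get)

print(largest, sizes[largest] / 1e6, "Mbp")
print(smallest, sizes[smallest], "bp")
```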

Density

Using non-overlapping windows of 1 Mbp, the gene density looks like this:

  • Chr 19, 17 and 11 have more protein-coding genes than the other chromosomes.
  • Chr Y has more pseudogenes than genes of any other class.
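
A minimal sketch of a windowed gene count, under one simplifying assumption: each gene is assigned to the single 1 Mbp window that contains its start position (counting every window a gene overlaps would be another reasonable choice). The function name `window_density` and the input records are hypothetical.

```python
from collections import Counter

WINDOW_BP = 1_000_000  # non-overlapping 1 Mbp windows


def window_density(genes, window=WINDOW_BP):
    """Count genes per (chromosome, window index, gene type).

    `genes` is an iterable of (chromosome, start, gene_type) with 1-based
    start positions. Window 0 covers positions 1..window, window 1 covers
    window+1..2*window, and so on.
    """
    counts = Counter()
    for chrom, start, gene_type in genes:
        window_index = (start - 1) // window
        counts[(chrom, window_index, gene_type)] += 1
    return counts


# Toy records, not real annotation data.
toy_genes = [
    ("chr19", 500_000, "protein_coding"),
    ("chr19", 999_999, "protein_coding"),
    ("chr19", 1_000_001, "protein_coding"),
    ("chrY", 2_500_000, "pseudogene"),
]

density = window_density(toy_genes)
for (chrom, window_index, gene_type), n in sorted(density.items()):
    print(chrom, window_index, gene_type, n)
```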

Exons

Number

Focusing on autosomes/X/Y, there are 1,196,256 “exons” from different types of genes:

For the rest of the analysis, I use only exons from protein-coding genes.

Number per gene

mean.nb.exon median.nb.exon max.nb.exon
52.9 31 1696

The gene with the most exons is the Titin gene:

gene_name exon
TTN 1696
SYNE1 1377
NEB 1225
CACNA1G 1139
CACNA1C 1098
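
A short sketch of how such per-gene summaries can be computed, assuming one record per annotated exon labelled with its gene name. Under that assumption an exon shared by several transcripts is counted once per record, which may or may not match the counting used above. The gene names are hypothetical.

```python
from collections import Counter
from statistics import mean, median

# One entry per exon record, labelled with its gene (toy data).
toy_exon_genes = ["G1"] * 3 + ["G2"] * 1 + ["G3"] * 8

# Number of exon records per gene.
exons_per_gene = Counter(toy_exon_genes)
n_exons = list(exons_per_gene.values())

summary = {
    "mean.nb.exon": mean(n_exons),
    "median.nb.exon": median(n_exons),
    "max.nb.exon": max(n_exons),
}
print(summary)

# Genes ranked by number of exon records, largest first.
top_genes = exons_per_gene.most_common(2)
print(top_genes)
```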

Size

The average exon size is 232 bp.

The first exon seems to be slightly larger than the others. I used genes with at least 10 exons to be sure it’s not due to large single-exon genes.
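
The comparison can be sketched as follows, assuming each exon record carries its gene, its rank within the gene (1 for the first exon) and its size in bp. The function name `first_vs_other_exon_size`, the record layout and the toy data are all hypothetical; only the "at least 10 exons" filter comes from the text.

```python
from collections import defaultdict
from statistics import mean


def first_vs_other_exon_size(exons, min_exons=10):
    """Mean size of first exons vs. all other exons.

    `exons` is an iterable of (gene, exon_rank, size_bp). Genes with fewer
    than `min_exons` exon records are skipped, so single-exon genes cannot
    inflate the first-exon average.
    """
    by_gene = defaultdict(list)
    for gene, exon_rank, size_bp in exons:
        by_gene[gene].append((exon_rank, size_bp))

    first_sizes, other_sizes = [], []
    for records in by_gene.values():
        if len(records) < min_exons:
            continue
        for exon_rank, size_bp in records:
            if exon_rank == 1:
                first_sizes.append(size_bp)
            else:
                other_sizes.append(size_bp)
    return mean(first_sizes), mean(other_sizes)


# Toy data: one 10-exon gene and one large single-exon gene.
toy_exons = [("BIG", 1, 300)] + [("BIG", rank, 100) for rank in range(2, 11)]
toy_exons.append(("SINGLE", 1, 5_000))

first_mean, other_mean = first_vs_other_exon_size(toy_exons)
print(first_mean, other_mean)
```

The single-exon gene is dropped by the filter, so its 5,000 bp exon does not affect the first-exon average.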

The largest annotated exons are more than 20 kbp long:

seqnames gene_name size.kb
chr8 TRAPPC9 29.06
chr5 MCC 27.20
chr12 GRIN2B 24.41
chr19 MUC16 21.69
chr2 ABI2 20.55

The smallest annotated exons are just 1 bp long!?

seqnames gene_name size.bp
chr2 ALK 1
chr3 ACAD11 1
chr4 PPA2 1
chr5 PAM 1
chr5 GALNT10 1