Built by aligning high-quality genomes, saved as paths through the pangenome.
Human Pangenome Reference Consortium (HPRC)
Liao, Asri, Ebler, et al. Nature 2023
How to represent those paths and index them efficiently (doing a minimum amount of work)?
vg, GraphAligner, …
{
"mapping_quality": 60,
"name": ">44878920>44878957",
"path": {
"mapping": [
{"edit": [{"from_length": 1, "to_length": 1}],
"position": {"node_id": "44878920"}, "rank": "1"},
{"edit": [{"from_length": 1, "to_length": 1}],
"position": {"node_id": "44878957"}, "rank": "2"}
]
},
"score": 2,
"sequence": "AC"
}
*JSON representation
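To make the link between the JSON record above and a path string concrete, here is a minimal Python sketch that walks the `mapping` list and emits one oriented step per node. The function name `gaf_path_from_alignment_json` is hypothetical, and the `is_reverse` key is an assumption about how a reverse-strand position would appear in the JSON (it is absent from the record shown, which is treated as forward).

```python
import json


def gaf_path_from_alignment_json(text: str) -> str:
    """Build a '>id>id...' style path string from one alignment record in JSON.

    Assumption: a position without an "is_reverse" key is on the forward strand.
    """
    alignment = json.loads(text)
    steps = []
    for mapping in alignment["path"]["mapping"]:
        position = mapping["position"]
        orientation = "<" if position.get("is_reverse", False) else ">"
        steps.append(f"{orientation}{position['node_id']}")
    return "".join(steps)


# The record shown above, reduced to the fields this sketch reads.
example_record = """
{
  "name": ">44878920>44878957",
  "path": {
    "mapping": [
      {"position": {"node_id": "44878920"}, "rank": "1"},
      {"position": {"node_id": "44878957"}, "rank": "2"}
    ]
  },
  "sequence": "AC"
}
"""

example_path = gaf_path_from_alignment_json(example_record)
```

For this record the resulting string is the same as its `name` field, which is how the annotation was labeled in the example.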
read_name_6 100 0 100 + <394<393<392<391 128 25 124 100 100 60 cs:Z::100
!-----query info----------!-----target info----------------!--------------------!
name size range strand path size range align + opt. tags
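The column layout sketched above can be read with a few lines of Python. This is only an illustrative parser, assuming tab-separated fields in the order shown (query name, size, range, strand, path, path size, path range, alignment summary, then optional tags); the `GafRecord` type and its field names are hypothetical and chosen for readability.

```python
from typing import Dict, NamedTuple


class GafRecord(NamedTuple):
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    path: str
    path_length: int
    path_start: int
    path_end: int
    matches: int
    block_length: int
    mapping_quality: int
    tags: Dict[str, str]


def parse_gaf_line(line: str) -> GafRecord:
    """Split one tab-separated GAF line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    tags = {}
    for tag in fields[12:]:
        # Optional tags look like KEY:TYPE:VALUE; the value may itself contain ':'.
        key, _type, value = tag.split(":", 2)
        tags[key] = value
    return GafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]), tags,
    )


# The record shown above, rebuilt with tab separators for this sketch.
example_line = "\t".join(
    "read_name_6 100 0 100 + <394<393<392<391 128 25 124 100 100 60 cs:Z::100".split()
)
example_record = parse_gaf_line(example_line)
```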
https://github.com/samtools/htslib
chr:start-end.
chr, then start, then end.
Assumes node IDs are sorted integers.
read_name_9 100 0 100 + <3222<3221<3220<3219 128 18 117 100 100 0
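If node IDs are integers, each record can be summarized by the smallest and largest node ID on its path, which plays the role that start and end play on a linear chromosome. The sketch below computes that pair from a path string; `node_id_range` is a hypothetical helper name, and the regular expression assumes paths written as oriented node steps (`>id` or `<id`).

```python
import re
from typing import Tuple


def node_id_range(path: str) -> Tuple[int, int]:
    """Return (min node ID, max node ID) for a path such as '<3222<3221'."""
    node_ids = [int(node_id) for node_id in re.findall(r"[<>](\d+)", path)]
    if not node_ids:
        raise ValueError(f"no oriented node steps found in path: {path!r}")
    return min(node_ids), max(node_ids)


# Path column of the record shown above.
example_range = node_id_range("<3222<3221<3220<3219")
```

For the record above this gives the interval 3219 to 3222, regardless of the orientation in which the read traverses the nodes.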
bgzip/tabix to index based on min/max ID.
vg gamsort to sort GAFs based on min/max ID.
Sorting a GAF file
Indexing sorted bgzipped GAF
Extracting a node interval
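The three steps above (sort, index, extract) can be illustrated in memory with plain Python. This is a toy sketch of the idea only, not how the tools named in this deck work on compressed files: records are sorted by their (min, max) node ID, and a node interval query then scans the sorted records and stops as soon as the minimum node ID passes the end of the interval. The function names are hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def node_id_range(gaf_line: str) -> Tuple[int, int]:
    """(min, max) node ID of the path column (6th tab-separated field)."""
    path = gaf_line.split("\t")[5]
    node_ids = [int(node_id) for node_id in re.findall(r"[<>](\d+)", path)]
    if not node_ids:
        raise ValueError("path column has no oriented node steps")
    return min(node_ids), max(node_ids)


def sort_gaf_lines(lines: Iterable[str]) -> List[str]:
    """Sort records by min node ID, then by max node ID."""
    return sorted(lines, key=node_id_range)


def extract_node_interval(sorted_lines: Iterable[str],
                          first_node: int, last_node: int) -> List[str]:
    """Return records whose node ID range overlaps [first_node, last_node]."""
    hits = []
    for line in sorted_lines:
        lowest, highest = node_id_range(line)
        if lowest > last_node:
            break  # input is sorted by min ID, so no later record can overlap
        if highest >= first_node:
            hits.append(line)
    return hits


# The two records shown in this deck, rebuilt with tab separators.
read_9 = "\t".join(
    "read_name_9 100 0 100 + <3222<3221<3220<3219 128 18 117 100 100 0".split())
read_6 = "\t".join(
    "read_name_6 100 0 100 + <394<393<392<391 128 25 124 100 100 60 cs:Z::100".split())

sorted_records = sort_gaf_lines([read_9, read_6])
hits = extract_node_interval(sorted_records, 3220, 3221)
```

An on-disk index avoids the linear scan by jumping directly to the compressed blocks that can contain the requested interval; the overlap test on min/max node ID is the same.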
Illumina 30x short reads for HG002, aligned to the HPRC pangenome with vg giraffe.
Format | Time (H:M:S) | Max. memory used (Kb) | File size (Gb) |
---|---|---|---|
GAM | 11:46:58 | 6,236.60 | 108 |
GAF | 6:50:28 | 1,904.83 | 52 |
Faster, less memory, and smaller files using GAF.
Input annotation relative to one haplotype (BED or GFF), present in the pangenome (path)
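As a rough illustration of what projecting an annotation onto the pangenome involves, the sketch below maps a BED-style interval on one haplotype to the nodes it covers and to its offsets within them. It is a simplification under stated assumptions: the haplotype path is given as a hypothetical list of `(node_id, node_length)` pairs, every node is traversed in the forward orientation, and coordinates are 0-based and half-open. The function name `project_interval` is made up for this example.

```python
from typing import List, Tuple


def project_interval(path_nodes: List[Tuple[int, int]],
                     start: int, end: int) -> Tuple[str, int, int, int]:
    """Project [start, end) on a haplotype onto the nodes of its path.

    Returns (path string, total length of covered nodes,
             start offset in the covered nodes, end offset in the covered nodes).
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    covered = []
    first_offset = 0
    position = 0  # haplotype coordinate at the start of the current node
    for node_id, length in path_nodes:
        node_start, node_end = position, position + length
        if node_end > start and node_start < end:
            if not covered:
                first_offset = start - node_start
            covered.append((node_id, length))
        position = node_end
    if end > position:
        raise ValueError("interval extends beyond the end of the path")
    path_string = "".join(f">{node_id}" for node_id, _ in covered)
    covered_length = sum(length for _, length in covered)
    return path_string, covered_length, first_offset, first_offset + (end - start)


# Toy haplotype: three nodes of length 10, 5 and 20 (made-up values).
toy_path = [(1, 10), (2, 5), (3, 20)]
projected = project_interval(toy_path, 12, 18)
```

The four returned values correspond to the path, path size and path range columns of a GAF record, so an annotation projected this way can be written out, sorted and indexed like an aligned read.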
Application
Projected the HPRC v1 assemblies’ annotations:
Interactively query a subgraph and aligned reads.
Recently modified to accept indexed GAF files.
Haplotypes: CHM13 (purple), HG00621 (greys). Annotated CDS for HG00621 hap 1 (reds) and 2 (blues).
BandageNG can color nodes by paths, present in the input graph (GFA).
https://github.com/asl/BandageNG
vg chunk or odgi extract
vg augment -BF subgraph.pg annot.gaf.gz
vg convert -f augmented.pg > augmented.gfa
Node color: blue for the reference path, orange for the AluYa5 transposon.
Custom scripts to (try to) find and annotate known variants in the pangenome.
Application
black: Reference path (GRCh38), pale colors: other human haplotypes, reds: GWAS catalog, blues: eQTLs across tissues (GTEx)
E.g. to investigate supporting reads.
yellow/green: annotation paths from vg call genotypes. reds/blues: short sequencing reads
E.g. for epigenomics tracks.
On the linear genome:
Reads from 7 cell types aligned with vg giraffe to the HPRC pangenome.
Compressed view (no sequence shown, node size not to scale).
Reference paths CHM13/GRCh38 (black/grey) and ATAC-seq coverage track for different tissues (colors). Opacity represents coverage level.
GAF files can now be sorted/indexed, and then queried fast.
Limitations
Adam M. Novak, Dickson Chung, Glenn Hickey, Sarah Djebali, Toshiyuki T. Yokoyama, Erik Garrison, Benedict Paten.
Graph sorting aims to find the best node order for a 1D and 2D layout.