End-to-end pipeline walkthrough
Inputs
WGS alignment file in BAM or CRAM format. If input is a CRAM file, reference genome FASTA (along with dictionary and index file) is additionally required.
Note
The input alignment file should be coordinate-sorted and indexed, however if these prerequisities are not met, mitopy will coordinate-sort and index the input file.
Outputs
Alignments
Alignments (.bam) to mitochondrial and shifted mitochondrial reference genomes and their respective index files (.bai)
Variants
Postprocessed variant calls (.vcf) and index file (.vcf.idx)
Annotations
Annotated variants (.vcf) and annotation report (.csv). The annotation CSV report contains following annotation fields:
Field |
Source |
Description |
|---|---|---|
POS |
VCF file |
Postion of the variant |
REF |
VCF file |
Reference allele |
ALT |
VCF file |
Alternate allele |
Heteroplasmy Fraction |
VCF file |
Variant allele fraction (coded as AF in FORMAT field of VCF) |
MT Variant Type |
VCF file |
Type of the mitochondrial variant based on minimal homoplasmy treshold [homoplasmic/heteroplasmic] |
LOCUS |
Mitochondrial locus/gene where variant is located |
|
BIOTYPE |
Locus biotype |
|
phastCons100way |
PhastCons conservation score; range [min 0, max 1] (conserved: score > 0.7) |
|
phyloP100way |
PhyloP conservation score; range [min -20, max 10] (conserved: score > 0) |
|
ClinVar_ID |
ClinVar Variation ID |
|
CLNDN |
Disease name associated with variant |
|
CLNSIG |
Clinical significance for variant |
|
CLNDISDB |
Disease database name and identifier for variant |
|
GNOMAD_AC_HOM |
Gnomad allele count restricted to variants with a heteroplasmy level >= 0.95 |
|
GNOMAD_AF_HOM |
Gnomad allele frequency restricted to variants with a heteroplasmy level >= 0.95 |
|
GNOMAD_AF_HET |
Gnomad allele frequency restricted to variants with a heteroplasmy level >= 0.10 and < 0.95 |
|
GNOMAD_AC_HET |
Gnomad allele count restricted to variants with a heteroplasmy level >= 0.10 and < 0.95 |
|
MitoTIP_Score |
tRNA raw pathogenicity score from Mitotip |
|
MitoTIP_Prediction |
tRNA pathogenicity classification from Mitotip |
|
PONmttRNA_Probability |
tRNA probability of pathogenicity from PON-mt-tRNA |
|
PONmttRNA_Prediction |
tRNA pathogenicity classification from PON-mt-tRNA |
|
SIFT |
Pathogenicity classification from SIFT |
|
SIFT_score |
Pathogenicity prediction score from SIFT |
|
MITOMAP_GENBANK_AC |
Allele count in GenBank out of 61168 full length (FL) human chrM sequences |
|
MITOMAP_GENBANK_AF |
Allele Frequency in full length (FL) Genbank sequence set |
|
MITOMAP_PubmedIDs |
Pubmed IDs |
|
MITOMAP_Disease |
Putative Disease Association |
|
MITOMAP_DiseaseStatus |
Disease Association Status |
|
SnpEff_EFFECT |
Functional effect of the variant (annotated using Sequence Ontology terms) |
|
SnpEff_IMPACT |
A simple estimation of putative impact / deleteriousness: [HIGH, MODERATE, LOW, MODIFIER] |
|
SnpEff_IMPACT |
Functional effect of the variant (annotated using Sequence Ontology terms) |
|
SnpEff_GENE |
Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic” |
|
SnpEff_GENEID |
Gene ID |
|
SnpEff_FEATURE |
Type of feature (e.g. transcript, motif, miRNA, etc.) |
|
SnpEff_FEATUREID |
Depending on the annotation, this may be: Transcript ID, Motif ID, etc. |
|
SnpEff_BIOTYPE |
Transcript biotype |
|
SnpEff_HGVS_C |
Variant using HGVS notation (DNA level) |
|
SnpEff_HGVS_P |
HGVS notation (Protein level) - if variant is coding |
Visualization
An interactive visualization (.html) of variants at their positions in the mitochondrial circular genome. The basic information about variant are included on hover (e.g. postion, reference allele, alternate allele).
The variant is visualized in the inner trace at different levels depending on its heteroplasmy fraction (the closer to outer border, the higher heteroplasmy fraction).
Additionally, the calculated coverage is visualized in the innermost trace.
Example:
Haplogroup
Haplogroup classification report containing information on identified haplogroup (.txt)
Proccess
Preprocessing
The input WGS BAM file is first subsetted to include only reads mapped to the mitochondrial genome using gatk PrintReads tool. Consequently, the extracted mitochondrial reads are unaligned using Picard RevertSam tool, producing unmapped BAM. The unalignment step is necessary, since the following stage of the pipeline performs realignment of reads to mitochondrial reference genome.
Alignment to mitochondrial genome using double alignment strategy
Prior to performing alignment, the unaligned mitochondrial reads in uBAM format have to be converted to FASTQ format required by the aligner.
The unaligned mitochondrial reads (FASTQ) are then realigned to mitochondrial reference and the shifted version of mitochondrial reference using bwa-mem2 aligner. The purpose of double alignment strategy is to increase alignment precision in mitochondrial control region (D-loop). The shifted reference was created by shifting the original reference by 8000 bases, which moves the breakpoint of the mitochondrial genome from the control region to the opposite side of the contig, allowing reads originating from control region to align precisely.
During to conversion of uBAM to FASTQ, certain useful information from the uBAM are lost. In order to preserve this information and pass to downstream steps, the unmapped BAM is merged with SAM file produced by the aligner using Picard MergeBamAlignment tool.
Duplicates are marked and alignment files are sorted by coordinate using gatk MarkDuplicatesSpark.
Variant Calling
Mitochondrial variants are called separately for non-control region using reads aligned to canonical reference and control region using reads aligned to shifted reference. Variants are called using a somatic variant caller gatk Mutect2 in mitochondria mode.
Merging variant calls
After calling variants separately for control and non-control region of mitochondrial genome, the variant calls from both regions have to be merged.
The variants in the control region were called against shifted mitochondrial reference, and thus have to be shifted back to coordinates of canonical reference sequence using Picard LiftoverVcf.
The VCF file and shifted-back VCF file are then merged using Picatd MergeVcfs. Additionally, VCF stats files generated by Mutect2 are merged using gatk MergeMutectStats.
Postprocessing
The raw variant calls are postprocessed, applying several filters to remove potential false positive calls and normalizing variant calls to achieve standardized representation.
The initial filtering phase includes filtering variants based on multiple specified parameters using gatk FilterMutectCalls tool specifically designed for filtering of raw Mutect2 calls. Optionally, the variants are filtered based on estimated contamination level. To estimate the level of contamination in mitochondrial DNA sample, we utilize haplocheck.
The next level of filters eliminates common artifacts, i.e. variants overlapping known artifact-prone mitochondrial sites. Finally, an optional last filtering phase involves filering out potential NuMTs based on median autosomal coverage using gatk NuMTFilterTool. The median autosomal coverage can be estimated from input WGS BAM using Picard CollectWgsmetrics.
The normalization of variant calls includes left alignment and splitting multi-allelic sites using gatk LeftAlignAndTrimVariants.
Calculating coverage
The per-base coverage is calculated using mosdepth from canonical and shifted alignments. The resulting per-base coverage is produced by combining coverage from non-control region (canonical alignment) and control region (shifted alignment).
Annotation
The postprocessed variants are annotated with functional effects using SnpEff and additional information:
conservation scores (PhyloP100way and PhastCons100way)
population frequencies (gnomAD)
in-silico pathogenicity predictions (SIFT, MitoTIP and PON-mt-tRNA)
Additionally the variants are annotated as heteroplasmic (0/1 genotype) or homoplasmic (1/1 genotype) based on specified treshold.
The annotations are exported to human-readable CSV format. See Outputs section for description of individual annotation fields.
Visualization
An interactive HTML visualization of variants is created, visualizing variants at their positions within mitochondrial genome. The calculated coverage is included in the plot.
Identifying haplogroups
The haplogroup of the sample is identified based on detected variants using haplogrep3.