End-to-end pipeline walkthrough

Inputs

WGS alignment file in BAM or CRAM format. If input is a CRAM file, reference genome FASTA (along with dictionary and index file) is additionally required.

Note

The input alignment file should be coordinate-sorted and indexed, however if these prerequisities are not met, mitopy will coordinate-sort and index the input file.

Outputs

Alignments

Alignments (.bam) to mitochondrial and shifted mitochondrial reference genomes and their respective index files (.bai)

Variants

Postprocessed variant calls (.vcf) and index file (.vcf.idx)

Annotations

Annotated variants (.vcf) and annotation report (.csv). The annotation CSV report contains following annotation fields:

Field	Source	Description
POS	VCF file	Postion of the variant
REF	VCF file	Reference allele
ALT	VCF file	Alternate allele
Heteroplasmy Fraction	VCF file	Variant allele fraction (coded as AF in FORMAT field of VCF)
MT Variant Type	VCF file	Type of the mitochondrial variant based on minimal homoplasmy treshold [homoplasmic/heteroplasmic]
LOCUS	NCBI	Mitochondrial locus/gene where variant is located
BIOTYPE	NCBI	Locus biotype
phastCons100way	PhastCons100way	PhastCons conservation score; range [min 0, max 1] (conserved: score > 0.7)
phyloP100way	PhyloP100way	PhyloP conservation score; range [min -20, max 10] (conserved: score > 0)
ClinVar_ID	ClinVar	ClinVar Variation ID
CLNDN	ClinVar	Disease name associated with variant
CLNSIG	ClinVar	Clinical significance for variant
CLNDISDB	ClinVar	Disease database name and identifier for variant
GNOMAD_AC_HOM	gnomAD	Gnomad allele count restricted to variants with a heteroplasmy level >= 0.95
GNOMAD_AF_HOM	gnomAD	Gnomad allele frequency restricted to variants with a heteroplasmy level >= 0.95
GNOMAD_AF_HET	gnomAD	Gnomad allele frequency restricted to variants with a heteroplasmy level >= 0.10 and < 0.95
GNOMAD_AC_HET	gnomAD	Gnomad allele count restricted to variants with a heteroplasmy level >= 0.10 and < 0.95
MitoTIP_Score	MitoTIP	tRNA raw pathogenicity score from Mitotip
MitoTIP_Prediction	MitoTIP	tRNA pathogenicity classification from Mitotip
PONmttRNA_Probability	PON-mt-tRNA	tRNA probability of pathogenicity from PON-mt-tRNA
PONmttRNA_Prediction	PON-mt-tRNA	tRNA pathogenicity classification from PON-mt-tRNA
SIFT	SIFT	Pathogenicity classification from SIFT
SIFT_score	SIFT	Pathogenicity prediction score from SIFT
MITOMAP_GENBANK_AC	MITOMAP	Allele count in GenBank out of 61168 full length (FL) human chrM sequences
MITOMAP_GENBANK_AF	MITOMAP	Allele Frequency in full length (FL) Genbank sequence set
MITOMAP_PubmedIDs	MITOMAP	Pubmed IDs
MITOMAP_Disease	MITOMAP	Putative Disease Association
MITOMAP_DiseaseStatus	MITOMAP	Disease Association Status
SnpEff_EFFECT	SnpEff	Functional effect of the variant (annotated using Sequence Ontology terms)
SnpEff_IMPACT	SnpEff	A simple estimation of putative impact / deleteriousness: [HIGH, MODERATE, LOW, MODIFIER]
SnpEff_IMPACT	SnpEff	Functional effect of the variant (annotated using Sequence Ontology terms)
SnpEff_GENE	SnpEff	Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic”
SnpEff_GENEID	SnpEff	Gene ID
SnpEff_FEATURE	SnpEff	Type of feature (e.g. transcript, motif, miRNA, etc.)
SnpEff_FEATUREID	SnpEff	Depending on the annotation, this may be: Transcript ID, Motif ID, etc.
SnpEff_BIOTYPE	SnpEff	Transcript biotype
SnpEff_HGVS_C	SnpEff	Variant using HGVS notation (DNA level)
SnpEff_HGVS_P	SnpEff	HGVS notation (Protein level) - if variant is coding

Visualization

An interactive visualization (.html) of variants at their positions in the mitochondrial circular genome. The basic information about variant are included on hover (e.g. postion, reference allele, alternate allele). The variant is visualized in the inner trace at different levels depending on its heteroplasmy fraction (the closer to outer border, the higher heteroplasmy fraction). Additionally, the calculated coverage is visualized in the innermost trace.

Example:

Haplogroup

Haplogroup classification report containing information on identified haplogroup (.txt)

Proccess

Preprocessing

The input WGS BAM file is first subsetted to include only reads mapped to the mitochondrial genome using gatk PrintReads tool. Consequently, the extracted mitochondrial reads are unaligned using Picard RevertSam tool, producing unmapped BAM. The unalignment step is necessary, since the following stage of the pipeline performs realignment of reads to mitochondrial reference genome.

Alignment to mitochondrial genome using double alignment strategy

Prior to performing alignment, the unaligned mitochondrial reads in uBAM format have to be converted to FASTQ format required by the aligner.

The unaligned mitochondrial reads (FASTQ) are then realigned to mitochondrial reference and the shifted version of mitochondrial reference using bwa-mem2 aligner. The purpose of double alignment strategy is to increase alignment precision in mitochondrial control region (D-loop). The shifted reference was created by shifting the original reference by 8000 bases, which moves the breakpoint of the mitochondrial genome from the control region to the opposite side of the contig, allowing reads originating from control region to align precisely.

During to conversion of uBAM to FASTQ, certain useful information from the uBAM are lost. In order to preserve this information and pass to downstream steps, the unmapped BAM is merged with SAM file produced by the aligner using Picard MergeBamAlignment tool.

Duplicates are marked and alignment files are sorted by coordinate using gatk MarkDuplicatesSpark.

Variant Calling

Mitochondrial variants are called separately for non-control region using reads aligned to canonical reference and control region using reads aligned to shifted reference. Variants are called using a somatic variant caller gatk Mutect2 in mitochondria mode.

Merging variant calls

After calling variants separately for control and non-control region of mitochondrial genome, the variant calls from both regions have to be merged.

The variants in the control region were called against shifted mitochondrial reference, and thus have to be shifted back to coordinates of canonical reference sequence using Picard LiftoverVcf.

The VCF file and shifted-back VCF file are then merged using Picatd MergeVcfs. Additionally, VCF stats files generated by Mutect2 are merged using gatk MergeMutectStats.

Postprocessing

The raw variant calls are postprocessed, applying several filters to remove potential false positive calls and normalizing variant calls to achieve standardized representation.

The initial filtering phase includes filtering variants based on multiple specified parameters using gatk FilterMutectCalls tool specifically designed for filtering of raw Mutect2 calls. Optionally, the variants are filtered based on estimated contamination level. To estimate the level of contamination in mitochondrial DNA sample, we utilize haplocheck.

The next level of filters eliminates common artifacts, i.e. variants overlapping known artifact-prone mitochondrial sites. Finally, an optional last filtering phase involves filering out potential NuMTs based on median autosomal coverage using gatk NuMTFilterTool. The median autosomal coverage can be estimated from input WGS BAM using Picard CollectWgsmetrics.

The normalization of variant calls includes left alignment and splitting multi-allelic sites using gatk LeftAlignAndTrimVariants.

Calculating coverage

The per-base coverage is calculated using mosdepth from canonical and shifted alignments. The resulting per-base coverage is produced by combining coverage from non-control region (canonical alignment) and control region (shifted alignment).

Annotation

The postprocessed variants are annotated with functional effects using SnpEff and additional information:

conservation scores (PhyloP100way and PhastCons100way)
population frequencies (gnomAD)
in-silico pathogenicity predictions (SIFT, MitoTIP and PON-mt-tRNA)
phenotype associations (MITOMAP and ClinVar)

Additionally the variants are annotated as heteroplasmic (0/1 genotype) or homoplasmic (1/1 genotype) based on specified treshold.

The annotations are exported to human-readable CSV format. See Outputs section for description of individual annotation fields.

Visualization

An interactive HTML visualization of variants is created, visualizing variants at their positions within mitochondrial genome. The calculated coverage is included in the plot.

Identifying haplogroups

The haplogroup of the sample is identified based on detected variants using haplogrep3.