End-to-end pipeline walkthrough

Inputs

WGS alignment file in BAM or CRAM format. If input is a CRAM file, reference genome FASTA (along with dictionary and index file) is additionally required.

Note

The input alignment file should be coordinate-sorted and indexed, however if these prerequisities are not met, mitopy will coordinate-sort and index the input file.

Outputs

Alignments

Alignments (.bam) to mitochondrial and shifted mitochondrial reference genomes and their respective index files (.bai)

Variants

Postprocessed variant calls (.vcf) and index file (.vcf.idx)

Annotations

Annotated variants (.vcf) and annotation report (.csv). The annotation CSV report contains following annotation fields:

Field

Source

Description

POS

VCF file

Postion of the variant

REF

VCF file

Reference allele

ALT

VCF file

Alternate allele

Heteroplasmy Fraction

VCF file

Variant allele fraction (coded as AF in FORMAT field of VCF)

MT Variant Type

VCF file

Type of the mitochondrial variant based on minimal homoplasmy treshold [homoplasmic/heteroplasmic]

LOCUS

NCBI

Mitochondrial locus/gene where variant is located

BIOTYPE

NCBI

Locus biotype

phastCons100way

PhastCons100way

PhastCons conservation score; range [min 0, max 1] (conserved: score > 0.7)

phyloP100way

PhyloP100way

PhyloP conservation score; range [min -20, max 10] (conserved: score > 0)

ClinVar_ID

ClinVar

ClinVar Variation ID

CLNDN

ClinVar

Disease name associated with variant

CLNSIG

ClinVar

Clinical significance for variant

CLNDISDB

ClinVar

Disease database name and identifier for variant

GNOMAD_AC_HOM

gnomAD

Gnomad allele count restricted to variants with a heteroplasmy level >= 0.95

GNOMAD_AF_HOM

gnomAD

Gnomad allele frequency restricted to variants with a heteroplasmy level >= 0.95

GNOMAD_AF_HET

gnomAD

Gnomad allele frequency restricted to variants with a heteroplasmy level >= 0.10 and < 0.95

GNOMAD_AC_HET

gnomAD

Gnomad allele count restricted to variants with a heteroplasmy level >= 0.10 and < 0.95

MitoTIP_Score

MitoTIP

tRNA raw pathogenicity score from Mitotip

MitoTIP_Prediction

MitoTIP

tRNA pathogenicity classification from Mitotip

PONmttRNA_Probability

PON-mt-tRNA

tRNA probability of pathogenicity from PON-mt-tRNA

PONmttRNA_Prediction

PON-mt-tRNA

tRNA pathogenicity classification from PON-mt-tRNA

SIFT

SIFT

Pathogenicity classification from SIFT

SIFT_score

SIFT

Pathogenicity prediction score from SIFT

MITOMAP_GENBANK_AC

MITOMAP

Allele count in GenBank out of 61168 full length (FL) human chrM sequences

MITOMAP_GENBANK_AF

MITOMAP

Allele Frequency in full length (FL) Genbank sequence set

MITOMAP_PubmedIDs

MITOMAP

Pubmed IDs

MITOMAP_Disease

MITOMAP

Putative Disease Association

MITOMAP_DiseaseStatus

MITOMAP

Disease Association Status

SnpEff_EFFECT

SnpEff

Functional effect of the variant (annotated using Sequence Ontology terms)

SnpEff_IMPACT

SnpEff

A simple estimation of putative impact / deleteriousness: [HIGH, MODERATE, LOW, MODIFIER]

SnpEff_IMPACT

SnpEff

Functional effect of the variant (annotated using Sequence Ontology terms)

SnpEff_GENE

SnpEff

Common gene name (HGNC). Optional: use closest gene when the variant is “intergenic”

SnpEff_GENEID

SnpEff

Gene ID

SnpEff_FEATURE

SnpEff

Type of feature (e.g. transcript, motif, miRNA, etc.)

SnpEff_FEATUREID

SnpEff

Depending on the annotation, this may be: Transcript ID, Motif ID, etc.

SnpEff_BIOTYPE

SnpEff

Transcript biotype

SnpEff_HGVS_C

SnpEff

Variant using HGVS notation (DNA level)

SnpEff_HGVS_P

SnpEff

HGVS notation (Protein level) - if variant is coding

Visualization

An interactive visualization (.html) of variants at their positions in the mitochondrial circular genome. The basic information about variant are included on hover (e.g. postion, reference allele, alternate allele). The variant is visualized in the inner trace at different levels depending on its heteroplasmy fraction (the closer to outer border, the higher heteroplasmy fraction). Additionally, the calculated coverage is visualized in the innermost trace.

Example:

../_images/vis.png

Haplogroup

Haplogroup classification report containing information on identified haplogroup (.txt)

Proccess

../_images/mitopy.png

Preprocessing

The input WGS BAM file is first subsetted to include only reads mapped to the mitochondrial genome using gatk PrintReads tool. Consequently, the extracted mitochondrial reads are unaligned using Picard RevertSam tool, producing unmapped BAM. The unalignment step is necessary, since the following stage of the pipeline performs realignment of reads to mitochondrial reference genome.

Alignment to mitochondrial genome using double alignment strategy

Prior to performing alignment, the unaligned mitochondrial reads in uBAM format have to be converted to FASTQ format required by the aligner.

The unaligned mitochondrial reads (FASTQ) are then realigned to mitochondrial reference and the shifted version of mitochondrial reference using bwa-mem2 aligner. The purpose of double alignment strategy is to increase alignment precision in mitochondrial control region (D-loop). The shifted reference was created by shifting the original reference by 8000 bases, which moves the breakpoint of the mitochondrial genome from the control region to the opposite side of the contig, allowing reads originating from control region to align precisely.

During to conversion of uBAM to FASTQ, certain useful information from the uBAM are lost. In order to preserve this information and pass to downstream steps, the unmapped BAM is merged with SAM file produced by the aligner using Picard MergeBamAlignment tool.

Duplicates are marked and alignment files are sorted by coordinate using gatk MarkDuplicatesSpark.

Variant Calling

Mitochondrial variants are called separately for non-control region using reads aligned to canonical reference and control region using reads aligned to shifted reference. Variants are called using a somatic variant caller gatk Mutect2 in mitochondria mode.

Merging variant calls

After calling variants separately for control and non-control region of mitochondrial genome, the variant calls from both regions have to be merged.

The variants in the control region were called against shifted mitochondrial reference, and thus have to be shifted back to coordinates of canonical reference sequence using Picard LiftoverVcf.

The VCF file and shifted-back VCF file are then merged using Picatd MergeVcfs. Additionally, VCF stats files generated by Mutect2 are merged using gatk MergeMutectStats.

Postprocessing

The raw variant calls are postprocessed, applying several filters to remove potential false positive calls and normalizing variant calls to achieve standardized representation.

The initial filtering phase includes filtering variants based on multiple specified parameters using gatk FilterMutectCalls tool specifically designed for filtering of raw Mutect2 calls. Optionally, the variants are filtered based on estimated contamination level. To estimate the level of contamination in mitochondrial DNA sample, we utilize haplocheck.

The next level of filters eliminates common artifacts, i.e. variants overlapping known artifact-prone mitochondrial sites. Finally, an optional last filtering phase involves filering out potential NuMTs based on median autosomal coverage using gatk NuMTFilterTool. The median autosomal coverage can be estimated from input WGS BAM using Picard CollectWgsmetrics.

The normalization of variant calls includes left alignment and splitting multi-allelic sites using gatk LeftAlignAndTrimVariants.

Calculating coverage

The per-base coverage is calculated using mosdepth from canonical and shifted alignments. The resulting per-base coverage is produced by combining coverage from non-control region (canonical alignment) and control region (shifted alignment).

Annotation

The postprocessed variants are annotated with functional effects using SnpEff and additional information:

Additionally the variants are annotated as heteroplasmic (0/1 genotype) or homoplasmic (1/1 genotype) based on specified treshold.

The annotations are exported to human-readable CSV format. See Outputs section for description of individual annotation fields.

Visualization

An interactive HTML visualization of variants is created, visualizing variants at their positions within mitochondrial genome. The calculated coverage is included in the plot.

Identifying haplogroups

The haplogroup of the sample is identified based on detected variants using haplogrep3.