Result files¶

Poirot produces many intermediate and result files, but only the files defined in output_files.json are kept. The rest are temporary and are deleted once no other rule in the pipeline needs them. To retain all files generated by the pipeline, run it with the --no-temp snakemake option, or edit output_files.json to retain the files you need.

Files¶

Successful execution of the pipeline results in the following output files, located in the Results/ folder:

File	Format	Description
multiqc_DNA.html	html	Aggregated QC values for entire sequence run, open in browser
{sample}/{sample}_N.cram	cram	Sorted and deduplicated CRAM alignment file
{sample}/{sample}_N.cram.crai	crai	Index for the CRAM alignment file
{sample}/{sample}_snv_indels.vcf.gz	vcf.gz	Compressed deepvariant VCF file
{sample}/{sample}_snv_indels.vcf.gz.tbi	tbi	Index for the compressed deepvariant VCF file
{sample}/{sample}.contamination.html	html	Contamination report from haplocheck program
{sample}/{sample}.coverage_analysis.xlsx	excel	Excel file summarising the depth of coverage in genes in various gene panels
{sample}/{sample}.expansionhunter_stranger.vcf.gz	vcf.gz	Compressed VCF with repeat expansions called by Expansion Hunter and annotated with stranger
{sample}/{sample}.expansionhunter_stranger.bed	bed	Bed file with the repeat expansions called by Expansion Hunter and annotated with stranger
{sample}/{sample}.hard-filtered.vcf.gz	vcf.gz	Compressed deepvariant VCF file containing rare variants (filtered on VEP annotation max_af > 10%)
{sample}/{sample}.hard-filtered.vcf.gz.tbi	tbi	Index for the compressed deepvariant VCF file containing rare variants
{sample}/{sample}.upd_regions.bed	bed	Bed file of UPD regions called from a Trio VCF. Sample in this case is the sample id of the proband in the trio.
{sample}/automap/{sample}.HomRegions.tsv	tsv	Regions of homozygosity detected by AutoMap
{sample}/automap/{sample}.HomRegions.pdf	pdf	Visualisation of ROH regions detected by AutoMap
{sample}/cnv_sv/{sample}.cnv.vcf.gz	vcf.gz	Filtered CNVpytor VCF file
{sample}/cnv_sv/{sample}.cnvpytor.vcf.gz	vcf.gz	Unfiltered CNVpytor VCF file
{sample}/cnv_sv/{sample}.cnvpytor_filtered.aed	aed	Filtered CNVpytor calls in aed format (Affymetrix Extensible Data format, tab delimited)
{sample}/cnv_sv/{sample}.cnvpytor.aed	aed	CNVpytor calls in aed format (Affymetrix Extensible Data format, tab delimited)
{sample}/cnv_sv/{sample}.manta_diploidSV.vcf.gz	vcf.gz	Compressed VCF for SV calls called by Manta
{sample}/cnv_sv/{sample}.svdb_merged.vcf.gz	vcf.gz	Compressed VCF containing the merge of SV and CNV calls from Manta and CNVpytor
{sample}/cnv_sv/{sample}.cnv_sv.vcf.gz	vcf.gz	A filtered version of the compressed VCF containing the merge of SV and CNV calls from Manta and CNVpytor
{sample}/expansionhunter_reviewer	png	Directory of png image files generated by REViewer
{sample}/mobile_elements/{sample}.melt_ALU.vcf.gz	vcf.gz	Compressed MELT VCF-file with ALU-insertion
{sample}/mobile_elements/{sample}.melt_HERVK.vcf.gz	vcf.gz	Compressed MELT VCF-file with HERVK-insertion
{sample}/mobile_elements/{sample}.melt_LINE1.vcf.gz	vcf.gz	Compressed MELT VCF-file with LINE1-insertion
{sample}/mobile_elements/{sample}.melt_SVA.vcf.gz	vcf.gz	Compressed MELT VCF-file with SVA-insertion
{sample}/peddy/peddy.html	html	Peddy results visualised as an HTML report, open in browser
{sample}/SMNCopyNumberCaller/{sample}.smn_charts.pdf	pdf	SMNCopyNumberCaller charts visualising the CN calls and read depth
{sample}/SMNCopyNumberCaller/{sample}.smn_caller.tsv	tsv	SMNCopyNumberCaller results
{sample}/SMNCopyNumberCaller/{sample}.smn_caller.json	json	SMNCopyNumberCaller results containing additional information not found in the SMNCopyNumberCaller tsv results
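
The path templates in the table can be used to check that a run produced the files you expect. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, and the list covers only a handful of the templates above, so extend it to match the files your own output_files.json retains.

```python
from pathlib import Path

# A subset of the per-sample output templates listed in the table above.
TEMPLATES = [
    "{sample}/{sample}_N.cram",
    "{sample}/{sample}_N.cram.crai",
    "{sample}/{sample}_snv_indels.vcf.gz",
    "{sample}/{sample}_snv_indels.vcf.gz.tbi",
    "{sample}/cnv_sv/{sample}.cnv_sv.vcf.gz",
]


def expected_files(results_dir, sample):
    """Return the expected result paths for one sample (hypothetical helper)."""
    root = Path(results_dir)
    return [root / template.format(sample=sample) for template in TEMPLATES]


def missing_files(results_dir, sample):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_files(results_dir, sample) if not path.is_file()]
```

Calling `missing_files("Results", "my_sample")` returns an empty list when every templated file is present for that sample.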

MultiQC report¶

Poirot produces a MultiQC report for the entire sequencing run to make QC tracking easier. The report starts with a general statistics table showing the most important QC values, followed by additional QC data and diagrams. The MultiQC HTML file is interactive: you can filter, highlight, hide or export data using the toolbox at the right edge of the report.

The report is configured based on a MultiQC config file.

Current MultiQC config.yaml:

title: "Clinical Genomics MultiQC Report"
subtitle: "Reference used: GRCh38"
intro_text: "The MultiQC report summarise analysis results from WGS data that been analysed by the pipeline Poirot_RD-WGS (https://github.com/clinical-genomics-uppsala/poirot_rd_wgs)."

report_header_info:
  - Contact E-mail: "igp-klinsek-bioinfo@lists.uu.se"
  - Application Type: "Bioinformatic analysis of WGS for rare diseases"

show_analysis_paths: True

decimalPoint_format: ','

extra_fn_clean_exts: ##from this until end
    - '.dup'
    - type: regex
      pattern: '_fastq[12]'
#    - '_S'
extra_fn_clean_trim:
  - 'Sample_VE-3297_'

custom_table_header_config:
  general_stats_table:
    raw_total_sequences:
      suffix: ""
      title: "Total seqs M"
    reads_mapped:
      suffix: ""
      title: "Reads mapped M"
    reads_mapped_percent:
      suffix: ""
    reads_properly_paired_percent:
      suffix: ""
    median_coverage:
      suffix: ""
    10_x_pc:
      suffix: ""
    30_x_pc:
      suffix: ""
    PERCENT_DUPLICATION:
      suffix: ""
    summed_mean:
      suffix: ""

module_order:
  - fastqc
  - fastp
  - verifybamid
  - mosdepth
  - peddy
  - samtools
  - picard

table_columns_visible: 
  FastQC:
    percent_duplicates: False
    percent_gc: False
    avg_sequence_length: False
    percent_fails: False
    total_sequences: False
  fastp:
    pct_adapter: True
    pct_surviving: False
    after_filtering_gc_content: False
    filtering_result_passed_filter_reads: False
    after_filtering_q30_bases: False
    after_filtering_q30_rate: False
    pct_duplication: False
  mosdepth:
    median_coverage: True
    mean_coverage: False
    1_x_pc: False
    5_x_pc: False
    10_x_pc: True
    20_x_pc: True
    30_x_pc: True
    50_x_pc: False
  Peddy:
    family_id: False
    ancestry-prediction: False
    ancestry-prob_het_check: False
    sex_het_ratio: False
    error_sex_check: True
    predicted_sex_sex_check: True
  "Picard: HsMetrics":
    FOLD_ENRICHMENT: False
    MEDIAN_TARGET_COVERAGE: False
    PCT_TARGET_BASES_30X: False
  "Picard: InsertSizeMetrics":
    summed_median: False
    summed_mean: True
  "Picard: Mark Duplicates":
    PERCENT_DUPLICATION: True
  "Picard: WGSMetrics":
    STANDARD_DEVIATION: False
    MEDIAN_COVERAGE: False
    MEAN_COVERAGE: False
    SD_COVERAGE: False
    PCT_30X: False
    PCT_TARGET_BASES_30X: False
    FOLD_ENRICHMENT: False
  "Samtools: stats":
    error_rate: False
    non-primary_alignments: False
    reads_mapped: True
    reads_mapped_percent: True
    reads_properly_paired_percent: True
    reads_MQ0_percent: False
    raw_total_sequences: True

# Patrik's plugin: add custom columns to general stats
multiqc_cgs:
  "Picard: HsMetrics":
    FOLD_80_BASE_PENALTY: 
      title: "Fold80"
      description: "Fold80 penalty from picard hs metrics"
      min: 1
      max: 3
      scale: "RdYlGn-rev"
      format: "{:.1f}"
    PCT_SELECTED_BASES:
      title: "Bases on Target"
      description: "On+Near Bait Bases / PF Bases Aligned from Picard HsMetrics"
      format: "{:.2%}"
    ZERO_CVG_TARGETS_PCT:
      title: "Target bases with zero coverage [%]"
      description: "Target bases with zero coverage [%] from Picard"
      min: 0
      max: 100
      scale: "RdYlGn-rev"
      format: "{:.2%}"
  "Samtools: stats":
    average_quality:
      title: "Average Quality"
      description: "Ratio between the sum of base qualities and total length from Samtools stats"
      min: 0
      max: 60
      scale: "RdYlGn"

# All columns independent of module!
table_columns_placement:
  mosdepth:
    median_coverage: 601
    1_x_pc: 666
    5_x_pc: 666
    10_x_pc: 602
    20_x_pc: 603
    30_x_pc: 604
    50_x_pc: 605
  "Samtools: stats":
    raw_total_sequences: 500
    reads_mapped: 501
    reads_mapped_percent: 502
    reads_properly_paired_percent: 503
    average_quality: 504
    error_rate: 555
    reads_MQ0_percent: 555
    non-primary_alignments: 555
  Peddy:
    ancestry-prediction: 777
    ancestry-prob_het_check: 777
    sex_het_ratio: 777
    error_sex_check: 701
    predicted_sex_sex_check: 702
    family_id: 703
  "Picard: HsMetrics":
    PCT_SELECTED_BASES: 801
    FOLD_80_BASE_PENALTY: 802
    PCT_PF_READS_ALIGNED: 888
    ZERO_CVG_TARGETS_PCT: 888
  "Picard: InsertSizeMetrics":
    summed_median: 888
    summed_mean: 804
  "Picard: Mark Duplicates":
    PERCENT_DUPLICATION: 803
  "Picard: WGSMetrics":
    STANDARD_DEVIATION: 805
    MEDIAN_COVERAGE: 888
    MEAN_COVERAGE: 888
    SD_COVERAGE: 888
    PCT_30X: 888
    PCT_TARGET_BASES_30X: 888
    FOLD_ENRICHMENT: 888

mosdepth_config:
  include_contigs:
    - "chr*"
  exclude_contigs:
    - "*_alt"
    - "*_decoy"
    - "*_random"
    - "chrUn*"
    - "HLA*"
    - "chrM"
    - "chrEBV"
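
The `mosdepth_config` patterns at the end of the config are shell-style globs. The sketch below illustrates which contig names such patterns would keep, assuming a contig is kept when it matches an include pattern and no exclude pattern. That combination rule and the helper name `keep_contig` are assumptions made for illustration, not MultiQC's confirmed implementation.

```python
from fnmatch import fnmatchcase

# Patterns copied from mosdepth_config in the config above.
INCLUDE = ["chr*"]
EXCLUDE = ["*_alt", "*_decoy", "*_random", "chrUn*", "HLA*", "chrM", "chrEBV"]


def keep_contig(name, include=INCLUDE, exclude=EXCLUDE):
    """Keep a contig that matches an include pattern and no exclude pattern."""
    included = any(fnmatchcase(name, pattern) for pattern in include)
    excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
    return included and not excluded


contigs = ["chr1", "chrX", "chrM", "chr1_KI270706v1_random", "chrUn_KI270302v1", "chrEBV"]
kept = [name for name in contigs if keep_contig(name)]
print(kept)
```

With these patterns only the primary nuclear chromosomes in the example list (`chr1` and `chrX`) are kept.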

The report contains QC information derived from the following programs:

General Statistics¶

The general statistics table is ordered by the sample order in the Illumina SampleSheet.csv file used during sequencing. This is done by renaming the samples in two steps using the script extract_samples_info.py. To toggle between "Sample Order" and "Sample Name", use the buttons just above the General Stats header.

Column Name	Origin	Comment
M Reads	Samtools stats	Total number of reads in the input BAM file
% Mapped	Samtools stats	Percent reads mapped, anywhere in the reference (no design file used)
% Proper pairs	Samtools stats
Average Quality	Samtools stats	Ratio of the sum of base qualities to the total length. Only reads on target (`config[reference][design_bed]`)
Median	Mosdepth	Median coverage over coding exons in design (`config[reference][exon_bed]`)
>= 30X	Mosdepth	Fraction of coding exons (`config[reference][exon_bed]`) with coverage over 30x
>= 50X	Mosdepth	Fraction of coding exons (`config[reference][exon_bed]`) with coverage over 50x
Bases on Target	Picard HSMetrics	Bases inside the capture design (`config[reference][design_intervals]`)
Fold80	Picard HSMetrics	The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets (`config[reference][design_intervals]`)
% Dups	Picard DuplicationMetrics
Mean Insert Size	Picard InsertSizeMetrics
Target Bases with zero coverage [%]	Picard HSMetrics	Percent target (`config[reference][design_intervals]`) bases with 0 coverage
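
To make the coverage columns concrete, here is a minimal sketch of how a median coverage and the percentage of bases at or above a depth threshold can be computed from a list of per-base depths. It only illustrates the definitions. Mosdepth derives these values from its own output files, the function name is hypothetical, and the dictionary keys simply mirror the `median_coverage` and `*_x_pc` keys in the MultiQC config.

```python
from statistics import median


def coverage_summary(depths, thresholds=(10, 20, 30)):
    """Return median depth and percent of bases at or above each threshold."""
    if not depths:
        raise ValueError("depths must not be empty")
    summary = {"median_coverage": median(depths)}
    for threshold in thresholds:
        covered = sum(1 for depth in depths if depth >= threshold)
        summary[f"{threshold}_x_pc"] = 100.0 * covered / len(depths)
    return summary


# Toy per-base depths for five positions (made-up values for illustration).
summary = coverage_summary([5, 12, 25, 31, 40])
print(summary)
```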

Coverage analysis¶

When drawing clinically relevant conclusions from the data, it is important to know that the regions of interest are actually covered. A gene coverage analysis checks this and produces an Excel file with one tab for each gene panel of interest.

Example of some data from tabs in {sample_id}.coverage_analysis.xlsx¶

Overview¶

{sample_id}
Bed-file from UCSC: refseq_select_mane_20230828.bed

Processing date: January 05, 2025

Created by: Valid from:
Signed by: Document nr:

DNAnr	Avg. coverage (x)	Duplication level (%)	10x (%)	20x (%)	30x (%)
{sample_id}	60.07	3.6	99.6	98.8	94.8

Number of regions not covered by at least 10x:
1177

Sheets:
Positions with coverage lower than 10x
Average coverage of all regions in bed
BRCA
CADASIL
EBS
EDS

Positions with coverage lower than 10x¶

Mosdepth coverage analysis

Sample: {sample_id}

Gene regions with coverage lower than 10x.
Region Name	Chr	Start	Stop	Mean Coverage
OR4F5_NM_001005484.2_[3]	chr1	69438	69439	10
OR4F5_NM_001005484.2_[3]	chr1	69439	69441	9
OR4F5_NM_001005484.2_[3]	chr1	69441	69442	8
OR4F5_NM_001005484.2_[3]	chr1	69442	69444	7

Average coverage of all regions in bed¶

Average coverage and coverage breadth per gene

Sample: {sample_id}

Average coverage and coverage breadth of each gene in exon-bedfile
Gene	Transcript	Avg coverage	10x	20x	30x
A1BG	NM_130786.4	40,6	100	96,6	80,7
A1CF	NM_014576.4	59,28	100	100	97,6
A2M	NM_000014.6	62,82	100	100	99,7
A2ML1	NM_144670.6	50,66	100	100	98,8
A3GALT2	NM_001080438.1	35,55	100	100	78,4
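
The values in this sheet use a decimal comma, so they need converting before any numeric comparison. The sketch below parses a few of the example rows above and lists genes whose 30x breadth falls below a cut-off. The helper names and the 90% cut-off are hypothetical choices for illustration, not thresholds used by the pipeline.

```python
def parse_decimal_comma(value):
    """Convert a decimal-comma string such as '40,6' to a float."""
    return float(value.replace(",", "."))


def parse_gene_row(line):
    """Split one whitespace-separated row into named, numeric fields."""
    gene, transcript, *numbers = line.split()
    avg, pc10, pc20, pc30 = (parse_decimal_comma(n) for n in numbers)
    return {"gene": gene, "transcript": transcript,
            "avg_coverage": avg, "10x": pc10, "20x": pc20, "30x": pc30}


# Rows copied from the example sheet above.
rows = [
    "A1BG NM_130786.4 40,6 100 96,6 80,7",
    "A1CF NM_014576.4 59,28 100 100 97,6",
    "A3GALT2 NM_001080438.1 35,55 100 100 78,4",
]
genes = [parse_gene_row(row) for row in rows]

# Hypothetical cut-off: flag genes with less than 90% of bases at 30x.
low_30x = [g["gene"] for g in genes if g["30x"] < 90.0]
print(low_30x)
```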

EBS¶

Coverage analysis per gene for EBS gene panel

Sample: {sample_id}

Average coverage and coverage breadth of genes in EBS gene panel.
Gene	Transcript	Avg coverage	10x	20x	30x
CAST	NM_001750.7	63,2	100	100	99,6
CD151	NM_004357.5	43,89	100	100	98,2

Regions of exons that are covered below 10x.
Gene name_transcript_exon	Chr	Start	Stop	Coverage (x)
COL7A1_NM_000094.4_[109]	chr3	48567605	48567606	10