Result files
Poirot produces many intermediate and result files, but only the files defined in output_files.json are kept. The rest are temporary and are deleted once no other rule in the pipeline needs them. To retain all files generated by the pipeline, run it with the snakemake option --no-temp, or edit output_files.json to retain the files you need.
Files
Successful execution of the pipeline results in the following output files, located in the Results/ folder:
| File | Format | Description |
|---|---|---|
| multiqc_DNA.html | html | Aggregated QC values for entire sequence run, open in browser |
| {sample}/{sample}_N.cram | cram | Sorted and deduplicated CRAM alignment file |
| {sample}/{sample}_N.cram.crai | crai | Index for the CRAM alignment file |
| {sample}/{sample}_snv_indels.vcf.gz | vcf.gz | Compressed deepvariant VCF file |
| {sample}/{sample}_snv_indels.vcf.gz.tbi | tbi | Index for the compressed deepvariant VCF file |
| {sample}/{sample}.contamination.html | html | Contamination report from haplocheck program |
| {sample}/{sample}.coverage_analysis.xlsx | excel | Excel file summarising the depth of coverage in genes in various gene panels |
| {sample}/{sample}.expansionhunter_stranger.vcf.gz | vcf.gz | Compressed VCF with repeat expansions called by Expansion Hunter and annotated with stranger |
| {sample}/{sample}.expansionhunter_stranger.bed | bed | Bed file with the repeat expansions called by Expansion Hunter and annotated with stranger |
| {sample}/{sample}.hard-filtered.vcf.gz | vcf.gz | Compressed deepvariant VCF file containing rare variants (filtered on VEP annotation max_af > 10%) |
| {sample}/{sample}.hard-filtered.vcf.gz.tbi | tbi | Index for the compressed deepvariant VCF file containing rare variants |
| {sample}/{sample}.upd_regions.bed | bed | Bed file of UPD regions called from a Trio VCF. Sample in this case is the sample id of the proband in the trio. |
| {sample}/automap/{sample}.HomRegions.tsv | tsv | Regions of homozygosity detected by AutoMap |
| {sample}/automap/{sample}.HomRegions.pdf | pdf | Visualisation of ROH regions detected by AutoMap |
| {sample}/cnv_sv/{sample}.cnv.vcf.gz | vcf.gz | Filtered CNVpytor VCF file |
| {sample}/cnv_sv/{sample}.cnvpytor.vcf.gz | vcf.gz | Unfiltered CNVpytor VCF file |
| {sample}/cnv_sv/{sample}.cnvpytor_filtered.aed | aed | Filtered CNVpytor calls in aed format (Affymetrix Extensible Data format, tab delimited) |
| {sample}/cnv_sv/{sample}.cnvpytor.aed | aed | CNVpytor calls in aed format (Affymetrix Extensible Data format, tab delimited) |
| {sample}/cnv_sv/{sample}.manta_diploidSV.vcf.gz | vcf.gz | Compressed VCF for SV calls called by Manta |
| {sample}/cnv_sv/{sample}.svdb_merged.vcf.gz | vcf.gz | Compressed VCF containing the merge of SV and CNV calls from Manta and CNVpytor |
| {sample}/cnv_sv/{sample}.cnv_sv.vcf.gz | vcf.gz | A filtered version of the compressed VCF containing the merge of SV and CNV calls from Manta and CNVpytor |
| {sample}/expansionhunter_reviewer | png | Directory of png image files generated by REViewer |
| {sample}/mobile_elements/{sample}.melt_ALU.vcf.gz | vcf.gz | Compressed MELT VCF-file with ALU-insertion |
| {sample}/mobile_elements/{sample}.melt_HERVK.vcf.gz | vcf.gz | Compressed MELT VCF-file with HERVK-insertion |
| {sample}/mobile_elements/{sample}.melt_LINE1.vcf.gz | vcf.gz | Compressed MELT VCF-file with LINE1-insertion |
| {sample}/mobile_elements/{sample}.melt_SVA.vcf.gz | vcf.gz | Compressed MELT VCF-file with SVA-insertion |
| {sample}/peddy/peddy.html | html | Peddy results, open in browser |
| {sample}/SMNCopyNumberCaller/{sample}.smn_charts.pdf | pdf | SMNCopyNumberCaller charts visualising the CN calls and read depth |
| {sample}/SMNCopyNumberCaller/{sample}.smn_caller.tsv | tsv | SMNCopyNumberCaller results |
| {sample}/SMNCopyNumberCaller/{sample}.smn_caller.json | json | SMNCopyNumberCaller results containing additional information not found in the SMNCopyNumberCaller tsv results |
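After a run it can be handy to check that the expected files are in place. The Python sketch below does this for a handful of the per-sample paths from the table above. The function name, the choice of paths and the sample id in the usage note are illustrative assumptions, not part of Poirot.

```python
from pathlib import Path

# A subset of the per-sample paths from the table above.
# "{sample}" is replaced with the sample id.
PER_SAMPLE_OUTPUTS = [
    "{sample}/{sample}_N.cram",
    "{sample}/{sample}_N.cram.crai",
    "{sample}/{sample}_snv_indels.vcf.gz",
    "{sample}/{sample}_snv_indels.vcf.gz.tbi",
    "{sample}/cnv_sv/{sample}.cnv_sv.vcf.gz",
]


def missing_outputs(results_dir, sample, patterns=PER_SAMPLE_OUTPUTS):
    """Return the expected per-sample paths that do not exist on disk."""
    root = Path(results_dir)
    missing = []
    for pattern in patterns:
        relative = Path(pattern.format(sample=sample))
        if not (root / relative).exists():
            missing.append(relative.as_posix())
    return missing
```

Calling `missing_outputs("Results", "sample1")` (with a hypothetical sample id) returns an empty list when all listed files are present.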
MultiQC report
Poirot produces a MultiQC report for the entire sequencing run to make QC tracking easier. The report starts with a general statistics table showing the most important QC values, followed by additional QC data and diagrams. The entire MultiQC HTML file is interactive, and you can filter, highlight, hide or export data using the toolbox at the right edge of the report.
The report is configured based on a MultiQC config file.
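Among other things, the config decides which columns appear in the general statistics table and in what order. As a rough illustration, the Python sketch below takes the columns that are set to True under table_columns_visible and also have a value under table_columns_placement in the config on this page, and sorts them by that value. It assumes MultiQC's documented behaviour that lower placement values end up further to the left. The function name column_order is made up for this example.

```python
# (module, column): placement value, copied from table_columns_placement
# for columns that are set to True under table_columns_visible.
PLACEMENT = {
    ("mosdepth", "median_coverage"): 601,
    ("mosdepth", "10_x_pc"): 602,
    ("mosdepth", "20_x_pc"): 603,
    ("mosdepth", "30_x_pc"): 604,
    ("Samtools: stats", "raw_total_sequences"): 500,
    ("Samtools: stats", "reads_mapped"): 501,
    ("Samtools: stats", "reads_mapped_percent"): 502,
    ("Samtools: stats", "reads_properly_paired_percent"): 503,
    ("Peddy", "error_sex_check"): 701,
    ("Peddy", "predicted_sex_sex_check"): 702,
    ("Picard: InsertSizeMetrics", "summed_mean"): 804,
    ("Picard: Mark Duplicates", "PERCENT_DUPLICATION"): 803,
}


def column_order(placement):
    """Sort (module, column) pairs left to right by ascending placement value."""
    return [key for key, _ in sorted(placement.items(), key=lambda item: item[1])]


for module, column in column_order(PLACEMENT):
    print(f"{PLACEMENT[(module, column)]:>4}  {module}: {column}")
```

With these values the Samtools columns come first, followed by mosdepth, Peddy and Picard, even though mosdepth is listed first in the config.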
Current MultiQC config.yaml:
title: "Clinical Genomics MultiQC Report"
subtitle: "Reference used: GRCh38"
intro_text: "The MultiQC report summarise analysis results from WGS data that been analysed by the pipeline Poirot_RD-WGS (https://github.com/clinical-genomics-uppsala/poirot_rd_wgs)."
report_header_info:
  - Contact E-mail: "igp-klinsek-bioinfo@lists.uu.se"
  - Application Type: "Bioinformatic analysis of WGS for rare diseases"
show_analysis_paths: True
decimalPoint_format: ','
extra_fn_clean_exts: ##from this until end
  - '.dup'
  - type: regex
    pattern: '_fastq[12]'
  # - '_S'
extra_fn_clean_trim:
  - 'Sample_VE-3297_'
custom_table_header_config:
  general_stats_table:
    raw_total_sequences:
      suffix: ""
      title: "Total seqs M"
    reads_mapped:
      suffix: ""
      title: "Reads mapped M"
    reads_mapped_percent:
      suffix: ""
    reads_properly_paired_percent:
      suffix: ""
    median_coverage:
      suffix: ""
    10_x_pc:
      suffix: ""
    30_x_pc:
      suffix: ""
    PERCENT_DUPLICATION:
      suffix: ""
    summed_mean:
      suffix: ""
module_order:
  - fastqc
  - fastp
  - verifybamid
  - mosdepth
  - peddy
  - samtools
  - picard
table_columns_visible:
  FastQC:
    percent_duplicates: False
    percent_gc: False
    avg_sequence_length: False
    percent_fails: False
    total_sequences: False
  fastp:
    pct_adapter: True
    pct_surviving: False
    after_filtering_gc_content: False
    filtering_result_passed_filter_reads: False
    after_filtering_q30_bases: False
    after_filtering_q30_rate: False
    pct_duplication: False
  mosdepth:
    median_coverage: True
    mean_coverage: False
    1_x_pc: False
    5_x_pc: False
    10_x_pc: True
    20_x_pc: True
    30_x_pc: True
    50_x_pc: False
  Peddy:
    family_id: False
    ancestry-prediction: False
    ancestry-prob_het_check: False
    sex_het_ratio: False
    error_sex_check: True
    predicted_sex_sex_check: True
  "Picard: HsMetrics":
    FOLD_ENRICHMENT: False
    MEDIAN_TARGET_COVERAGE: False
    PCT_TARGET_BASES_30X: False
  "Picard: InsertSizeMetrics":
    summed_median: False
    summed_mean: True
  "Picard: Mark Duplicates":
    PERCENT_DUPLICATION: True
  "Picard: WGSMetrics":
    STANDARD_DEVIATION: False
    MEDIAN_COVERAGE: False
    MEAN_COVERAGE: False
    SD_COVERAGE: False
    PCT_30X: False
    PCT_TARGET_BASES_30X: False
    FOLD_ENRICHMENT: False
  "Samtools: stats":
    error_rate: False
    non-primary_alignments: False
    reads_mapped: True
    reads_mapped_percent: True
    reads_properly_paired_percent: True
    reads_MQ0_percent: False
    raw_total_sequences: True
# Patrik's plugin: add custom columns to general stats
multiqc_cgs:
  "Picard: HsMetrics":
    FOLD_80_BASE_PENALTY:
      title: "Fold80"
      description: "Fold80 penalty from picard hs metrics"
      min: 1
      max: 3
      scale: "RdYlGn-rev"
      format: "{:.1f}"
    PCT_SELECTED_BASES:
      title: "Bases on Target"
      description: "On+Near Bait Bases / PF Bases Aligned from Picard HsMetrics"
      format: "{:.2%}"
    ZERO_CVG_TARGETS_PCT:
      title: "Target bases with zero coverage [%]"
      description: "Target bases with zero coverage [%] from Picard"
      min: 0
      max: 100
      scale: "RdYlGn-rev"
      format: "{:.2%}"
  "Samtools: stats":
    average_quality:
      title: "Average Quality"
      description: "Ratio between the sum of base qualities and total length from Samtools stats"
      min: 0
      max: 60
      scale: "RdYlGn"
# All columns independent of module!
table_columns_placement:
  mosdepth:
    median_coverage: 601
    1_x_pc: 666
    5_x_pc: 666
    10_x_pc: 602
    20_x_pc: 603
    30_x_pc: 604
    50_x_pc: 605
  "Samtools: stats":
    raw_total_sequences: 500
    reads_mapped: 501
    reads_mapped_percent: 502
    reads_properly_paired_percent: 503
    average_quality: 504
    error_rate: 555
    reads_MQ0_percent: 555
    non-primary_alignments: 555
  Peddy:
    ancestry-prediction: 777
    ancestry-prob_het_check: 777
    sex_het_ratio: 777
    error_sex_check: 701
    predicted_sex_sex_check: 702
    family_id: 703
  "Picard: HsMetrics":
    PCT_SELECTED_BASES: 801
    FOLD_80_BASE_PENALTY: 802
    PCT_PF_READS_ALIGNED: 888
    ZERO_CVG_TARGETS_PCT: 888
  "Picard: InsertSizeMetrics":
    summed_median: 888
    summed_mean: 804
  "Picard: Mark Duplicates":
    PERCENT_DUPLICATION: 803
  "Picard: WGSMetrics":
    STANDARD_DEVIATION: 805
    MEDIAN_COVERAGE: 888
    MEAN_COVERAGE: 888
    SD_COVERAGE: 888
    PCT_30X: 888
    PCT_TARGET_BASES_30X: 888
    FOLD_ENRICHMENT: 888
mosdepth_config:
  include_contigs:
    - "chr*"
  exclude_contigs:
    - "*_alt"
    - "*_decoy"
    - "*_random"
    - "chrUn*"
    - "HLA*"
    - "chrM"
    - "chrEBV"
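The mosdepth_config entries at the end of the config restrict which contigs are shown in the mosdepth plots. The Python sketch below illustrates the likely effect of those patterns. It assumes shell-style glob matching and that a contig is kept when it matches an include pattern and no exclude pattern. It does not reproduce MultiQC's own code, and keep_contig is a name invented for this example.

```python
from fnmatch import fnmatchcase

# Patterns copied from mosdepth_config in the config above.
INCLUDE = ["chr*"]
EXCLUDE = ["*_alt", "*_decoy", "*_random", "chrUn*", "HLA*", "chrM", "chrEBV"]


def keep_contig(name, include=INCLUDE, exclude=EXCLUDE):
    """Keep a contig if it matches an include pattern and no exclude pattern."""
    # An empty include list is treated as "include everything" (assumption).
    included = any(fnmatchcase(name, pat) for pat in include) if include else True
    excluded = any(fnmatchcase(name, pat) for pat in exclude)
    return included and not excluded


contigs = ["chr1", "chrX", "chrM", "chr1_KI270706v1_random", "chrUn_KI270302v1", "chrEBV"]
print([c for c in contigs if keep_contig(c)])
```

Under these assumptions only the primary chromosomes such as chr1 and chrX remain, while the mitochondrial, unplaced, random and EBV contigs are dropped.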
The report contains QC information derived from the programs listed under module_order in the config: FastQC, fastp, VerifyBamID, mosdepth, peddy, Samtools and Picard.
General Statistics
The general statistics table is ordered by the sample order in the Illumina SampleSheet.csv file used during sequencing. This is done by renaming the samples in two steps with the script extract_samples_info.py. To toggle between "Sample Order" and "Sample Name", use the buttons just above the General Stats header.
| Column Name | Origin | Comment |
|---|---|---|
| M Reads | Samtools stats | Total number of reads in the input BAM file |
| % Mapped | Samtools stats | Percent reads mapped, anywhere in the reference (no design file used) |
| % Proper pairs | Samtools stats | |
| Average Quality | Samtools stats | Ratio of the sum of base qualities to the total length. Only reads on target (config[reference][design_bed]) |
| Median | Mosdepth | Median coverage over coding exons in design (config[reference][exon_bed]) |
| >= 30X | Mosdepth | Fraction of coding exons (config[reference][exon_bed]) with coverage over 30x |
| >= 50X | Mosdepth | Fraction of coding exons (config[reference][exon_bed]) with coverage over 50x |
| Bases on Target | Picard HSMetrics | Bases inside the capture design (config[reference][design_intervals]) |
| Fold80 | Picard HSMetrics | The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets (config[reference][design_intervals]) |
| % Dups | Picard DuplicationMetrics | |
| Mean Insert Size | Picard InsertSizeMetrics | |
| Target Bases with zero coverage [%] | Picard HSMetrics | Percent target (config[reference][design_intervals]) bases with 0 coverage |
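The sample ordering described above depends on reading the sample order from the sample sheet. The Python sketch below shows one way this could be done. It is not the logic of extract_samples_info.py. It assumes a sample sheet with a [Data] section whose header row contains a Sample_ID column, and the sample_001 style labels are a hypothetical naming scheme chosen only so that names sort in sheet order.

```python
import csv


def sample_order(samplesheet_text, id_column="Sample_ID"):
    """Return sample ids in the order they appear in the [Data] section."""
    lines = samplesheet_text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.strip().startswith("[Data]")),
        None,
    )
    if start is None:
        raise ValueError("no [Data] section found")
    # Collect rows until the next section header, if there is one.
    data_lines = []
    for line in lines[start + 1:]:
        if line.startswith("["):
            break
        data_lines.append(line)
    order = []
    for row in csv.DictReader(data_lines):
        sample = (row.get(id_column) or "").strip()
        if sample and sample not in order:  # a sample can occur on several lanes
            order.append(sample)
    return order


def order_labels(order):
    """Map each sample id to a sortable label such as 'sample_001'."""
    return {sample: f"sample_{i:03d}" for i, sample in enumerate(order, start=1)}
```

Sorting on the generated labels then reproduces the sample sheet order, while the original ids remain available for display.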
Coverage analysis
To draw clinically relevant conclusions from the data, it is important to know that the regions of interest are actually covered. This is checked by a gene coverage analysis, which produces an Excel file with one tab for each gene panel of interest.
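The per-gene numbers in the Excel file are an average coverage and a coverage breadth at 10x, 20x and 30x. The Python sketch below shows how such numbers can be derived from per-base depths. It assumes that breadth at Nx means the percentage of bases with a depth of at least N. The pipeline's own calculation is based on mosdepth output and may differ in detail, and the function name is invented for this example.

```python
def coverage_summary(depths, thresholds=(10, 20, 30)):
    """Return the mean depth and the percentage of bases at or above each threshold."""
    if not depths:
        raise ValueError("depths must not be empty")
    n = len(depths)
    summary = {"avg_coverage": sum(depths) / n}
    for threshold in thresholds:
        covered = sum(1 for depth in depths if depth >= threshold)
        summary[f"{threshold}x"] = 100 * covered / n
    return summary


# Five bases with depths 5, 10, 20, 30 and 35.
print(coverage_summary([5, 10, 20, 30, 35]))
# {'avg_coverage': 20.0, '10x': 80.0, '20x': 60.0, '30x': 40.0}
```

A gene with a high average coverage can still have a low breadth at 30x, which is why both values are reported.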
Example of some data from tabs in {sample_id}.coverage_analysis.xlsx
Overview
{sample_id}
Bed-file from UCSC: refseq_select_mane_20230828.bed
Processing date: January 05, 2025
Created by: Valid from:
Signed by: Document nr:
| DNAnr | Avg. coverage (x) | Duplication level (%) | 10x (%) | 20x (%) | 30x (%) |
|---|---|---|---|---|---|
| {sample_id} | 60.07 | 3.6 | 99.6 | 98.8 | 94.8 |
Number of regions not covered by at least 10x: 1177
Sheets:
Positions with coverage lower than 10x
Average coverage of all regions in bed
BRCA
CADASIL
EBS
EDS
Positions with coverage lower than 10x
Mosdepth coverage analysis
Sample: {sample_id}
Gene regions with coverage lower than 10x.
| Region Name | Chr | Start | Stop | Mean Coverage |
|---|---|---|---|---|
| OR4F5_NM_001005484.2_[3] | chr1 | 69438 | 69439 | 10 |
| OR4F5_NM_001005484.2_[3] | chr1 | 69439 | 69441 | 9 |
| OR4F5_NM_001005484.2_[3] | chr1 | 69441 | 69442 | 8 |
| OR4F5_NM_001005484.2_[3] | chr1 | 69442 | 69444 | 7 |
Average coverage of all regions in bed
Average coverage and coverage breadth per gene
Sample: {sample_id}
Average coverage and coverage breadth of each gene in the exon BED file
| Gene | Transcript | Avg coverage | 10x | 20x | 30x |
|---|---|---|---|---|---|
| A1BG | NM_130786.4 | 40,6 | 100 | 96,6 | 80,7 |
| A1CF | NM_014576.4 | 59,28 | 100 | 100 | 97,6 |
| A2M | NM_000014.6 | 62,82 | 100 | 100 | 99,7 |
| A2ML1 | NM_144670.6 | 50,66 | 100 | 100 | 98,8 |
| A3GALT2 | NM_001080438.1 | 35,55 | 100 | 100 | 78,4 |
EBS
Coverage analysis per gene for EBS gene panel
Sample: {sample_id}
Average coverage and coverage breadth of genes in EBS gene panel.
| Gene | Transcript | Avg coverage | 10x | 20x | 30x |
|---|---|---|---|---|---|
| CAST | NM_001750.7 | 63,2 | 100 | 100 | 99,6 |
| CD151 | NM_004357.5 | 43,89 | 100 | 100 | 98,2 |
Regions of exons that are covered below 10x.
| Gene name_transcript_exon | Chr | Start | Stop | Coverage (x) |
|---|---|---|---|---|
| COL7A1_NM_000094.4_[109] | chr3 | 48567605 | 48567606 | 10 |