Human GIAB Tutorial
This tutorial completes one round of re-training with a Human trio from GIAB.
Warning | Storage Needs
Completing this tutorial will produce roughly 2 TB of intermediate and output data. Ensure you have sufficient space before proceeding!
1. Confirm Successful Configuration
Check | Installation
deepvariant_1.4.0-gpu.sif docs hap.py_v0.3.12.sif miniconda_envs README.md triotrain
deepvariant_1.4.0.sif errors LICENSE mkdocs.yml scripts
- triotrain/ directory contains Python modules for the DV-TrioTrain package
- scripts/ directory contains Bash helper scripts and functions that can be used as templates
2. Activate Environment
Repeat the first two steps of Configuration:
4. Download Pre-Trained Models
Warning | Download Size
Running a local copy of a container requires us to create a local copy of the model.ckpt
files from v1.4.0 of DeepVariant. These checkpoints are the human-trained models produced by Google Genomics Health Group. Details about published models compatible with TrioTrain can be found here.
We need to download (2) model checkpoints:
- the default human model
- the WGS.AF human model
Example | download_models.sh
#!/bin/bash
# scripts/setup/download_models.sh
"""
NOTES: These model ckpts are only required for warm-starting re-training, and are not used with 'run_deepvariant'
"""
echo -e "=== scripts/setup/download_models.sh > start $(date)"
##======= Downloading default WGS model (without AF channel) ===================##
# This model allows for ONLY 7 channels in the example images
# Meaning, it is NOT compatible with examples built with the allele frequency channel!
# To view in web-browser, execute this to obtain a valid link:
# echo "https://console.cloud.google.com/storage/browser/deepvariant/models/DeepVariant/${BIN_VERSION_DV}/DeepVariant-inception_v3-${BIN_VERSION_DV}+data-wgs_standard"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Downloading 7-channel WGS model checkpoint (with InsertSize, but without AlleleFreq) - DeepVariant v${BIN_VERSION_DV}"
NO_AF_MODEL_BUCKET="https://storage.googleapis.com/deepvariant/models/DeepVariant/${BIN_VERSION_DV}/DeepVariant-inception_v3-${BIN_VERSION_DV}+data-wgs_standard"
# Use the same file naming convention and enable restarting of any interrupted downloads
curl --create-dirs --continue-at - "${NO_AF_MODEL_BUCKET}/model.ckpt.{data-00000-of-00001,index,example_info.json,meta}" --output "triotrain/model_training/pretrained_models/v${BIN_VERSION_DV}_withIS_noAF/model.ckpt.#1"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done - Downloading 7-channel WGS model checkpoint (with InsertSize, but without AlleleFreq)"
##========================================================================##
##======= Downloading custom WGS model (with Allele Frequency channel) ======================##
# This model allows for 8 channels in the example images
# Meaning, it IS compatible with examples built with the allele frequency channel!
# To view in web-browser, execute this to obtain a valid link:
# https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WGS/1.4.0;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Downloading 8-channel WGS model checkpoint (with InsertSize, with AlleleFreq) - DeepVariant v${BIN_VERSION_DV}"
AF_MODEL_BUCKET="https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WGS/${BIN_VERSION_DV}"
# Use the same file naming convention and enable restarting of any interrupted downloads
curl --create-dirs --continue-at - "${AF_MODEL_BUCKET}/model.ckpt.{data-00000-of-00001,index,meta,example_info.json}" --output "triotrain/model_training/pretrained_models/v${BIN_VERSION_DV}_withIS_withAF/wgs_af.model.ckpt.#1"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done - Downloading 8-channel WGS model checkpoint (with InsertSize, with AlleleFreq)"
##========================================================================##
echo -e "=== scripts/setup/download_models.sh > end $(date)"
Check | Default human model
Check | WGS.AF human model
5. Download Raw Data
Warning | Download Size
We are using two Human trios from the v4.2.1 GIAB benchmarking data.
TrioNumber | TrioName | CommonID | SampleID | Relationship |
---|---|---|---|---|
1 | AshkenaziJew | HG002 | NA24385 | Son |
1 | AshkenaziJew | HG003 | NA24149 | Father |
1 | AshkenaziJew | HG004 | NA24143 | Mother |
2 | HanChinese | HG005 | NA24631 | Son |
2 | HanChinese | HG006 | NA24694 | Father |
2 | HanChinese | HG007 | NA24695 | Mother |
We need (5) types of raw data:
Number | Description | Extension |
---|---|---|
1. | the GRCh38 reference genome | .fasta, .fai |
2. | 1kGP Population Allele Frequency, with index file | .vcf.gz, .vcf.gz.tbi |
3. | benchmarking files: per-genome truth callsets, with index files | .vcf.gz, .vcf.gz.tbi |
3. | benchmarking files: per-genome truth regions file | .bed |
4. | sample metadata: checksums file | .md5 |
4. | sample metadata: index files | .txt |
5. | the aligned reads files, with index file | .bam, .bai |
Example | download_GIAB.sh
#!/bin/bash
# scripts/setup/download_GIAB.sh
echo -e "=== scripts/setup/download_GIAB.sh > start $(date)"
# NOTES --------------------------------------------------------------#
# The DV developers use the following data in their walk-through docs:
# 1) Reference Genome Version: GRCh38_no_alt_analysis_set.fasta
# 2) Benchmark Sample: HG003 [AshkenazimTrio-Father], typically only Chr20.
# 3) PopVCF (without genotypes): v3_missing2ref
# However, for a direct comparison between our cattle model and previous models, we need to calculate Mendelian Inheritance Errors (MIE). Therefore, we will be downloading the parents in this trio. We'll also need to download the reference PopVCF used to build the humanWGS_AF model.
#-------------------------------------------------------------------#
# GIAB Trio1 #
#-------------------------------------------------------------------#
# Known as the Ashkenazi Jew Trio
# From Personal Genome Project
# AJ Son = HG002_NA24385_son
# AJ Father = HG003_NA24149_father
# AJ Mother = HG004_NA24143_mother
#-------------------------------------------------------------------#
# GIAB Trio2 #
#-------------------------------------------------------------------#
# Known as the Han Chinese Trio
# From Personal Genome Project
# Chinese Son = HG005_NA24631_son
# Chinese Father = HG006_NA24694_father
# Chinese Mother = HG007_NA24695_mother
# A preprint describing these calls is at https://doi.org/10.1101/2020.07.24.212712. The paper(s) above can be cited for use of the benchmark, and please cite http://www.nature.com/articles/sdata201625 (doi:10.1038/sdata.2016.25) when using the corresponding sequencing data.
cd triotrain/variant_calling/data/GIAB
##===================================================================
## General Files
##===================================================================
## --------------- Download Reference Genome ------------------------
## NOTES ----
## Files Downloaded Include:
## 1. GRCh38 reference genome [FASTA]
## 2. GRCh38 index [FAI]
# Create the output dir
install --directory --verbose reference
REFDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
REFFILE=GCA_000001405.15_GRCh38_no_alt_analysis_set
if [ ! -f ./reference/md5checksums.txt ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading GRCh38 checksum now..."
curl -s --continue-at - ${REFDIR}/md5checksums.txt -o ./reference/md5checksums.txt
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | './reference/md5checksums.txt'"
fi
# Define the file extensions to be downloaded:
declare -a Ext=(".fna.gz" ".fna.fai")
if [ ! -f "./reference/GRCh38_no_alt_analysis_set.fasta" ]; then
for e in ${Ext[@]}; do
if [ ! -f "./reference/${REFFILE}${e}" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${REFFILE}${e} now..."
curl -s --continue-at - "${REFDIR}/${REFFILE}${e}" -o "./reference/${REFFILE}${e}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: checking ${REFFILE}${e} for corruption..."
check_sum=$(cat ./reference/md5checksums.txt | grep "${REFFILE}${e}")
old_path="./"
new_path="./reference/"
valid_check_sum="${check_sum/$old_path/$new_path}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: $(echo $valid_check_sum | md5sum -c)"
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${REFFILE}${e}'"
fi
done
fi
if [ -f "./reference/${REFFILE}.fna.gz" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Unzipping GRCh38 and re-naming reference files now..."
gunzip -c "./reference/${REFFILE}.fna.gz" > "./reference/GRCh38_no_alt_analysis_set.fasta"
rm "./reference/${REFFILE}.fna.gz"
mv "./reference/${REFFILE}.fna.fai" ./reference/GRCh38_no_alt_analysis_set.fasta.fai
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: reference re-named already... SKIPPING AHEAD"
fi
## --------------- Download Population VCF --------------------------
## NOTES ----
## Files Downloaded Include:
## 1. One-thousand genomes (1kGP) allele frequencies [VCF]
## 2. Allele-frequencies index [TBI]
## These data are ~940GiB to download completely.
## THERE ARE TWO PAGES OF FILES!
## The newest version of WGS_AF model can be viewed on GCP here: https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
# Create the output dir
mkdir -p allele_freq
# Define where to get the AF data
AF_DIR="https://storage.googleapis.com/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref"
for i in {{1..22},X,Y}
do
if [ ! -f ./allele_freq/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading chr${i} PopVCF..."
curl -s --continue-at - ${AF_DIR}/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz -o ./allele_freq/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | 'chr${i} PopVCF'"
fi
if [ ! -f ./allele_freq/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz.tbi ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading chr${i} index ..."
curl -s --continue-at - ${AF_DIR}/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz.tbi -o ./allele_freq/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz.tbi
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | 'chr${i} index'"
fi
done
# Merge the chr AF into a genome-wide AF
if [ ! -f "./allele_freq/PopVCF.merge.list" ]; then
for i in {{1..22},X,Y}
do
echo "././triotrain/variant_calling/data/GIAB/allele_freq/cohort-chr$i.release_missing2ref.no_calls.vcf.gz" >> ./allele_freq/PopVCF.merge.list
done
fi
if [ ! -f "./allele_freq/concat_PopVCFs.sh" ]; then
echo -e "source ./scripts/setup/modules.sh
bcftools concat --file-list ./triotrain/variant_calling/data/GIAB/allele_freq/PopVCF.merge.list -Oz -o ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz
bcftools index ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz" > ./allele_freq/concat_PopVCFs.sh
fi
##===================================================================
## Trio-Specific Files
##===================================================================
## --------------- Download GIAB Truth Files ------------------------
## NOTES ----
## Currently using v4.2.1
## Files Downloaded Include:
## 1. benchmarking region files [BED]
## 2. benchmarking genotypes [VCF]
## 3. benchmarking genotype index [TBI]
## The benchmarking files from NIST can be found here: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/
# Create output dir
mkdir -p benchmark
# Define where to get the truth data
TRUTHDIR=https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release
benchmark_version="_GRCh38_1_22_v4.2.1_benchmark"
bench_ext=("_noinconsistent.bed" ".vcf.gz" ".vcf.gz.tbi")
#-------------------------------------------------------------------#
# GIAB Trio1 #
#-------------------------------------------------------------------#
declare -A Trio1=(["HG002"]="HG002_NA24385_son" ["HG003"]="HG003_NA24149_father" ["HG004"]="HG004_NA24143_mother")
for t in ${!Trio1[@]}; do
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t}=${Trio1[${t}]} benchmarking files now..."
for e in ${!bench_ext[@]}; do
if [ ! -f "./benchmark/${t}${benchmark_version}${bench_ext[${e}]}" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t}${benchmark_version}${bench_ext[${e}]}..."
curl --continue-at - ${TRUTHDIR}/AshkenazimTrio/${Trio1[${t}]}/NISTv4.2.1/GRCh38/${t}${benchmark_version}${bench_ext[${e}]} -o ./benchmark/${t}${benchmark_version}${bench_ext[${e}]}
# No MD5 listed on NIST FTP SITE!
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}${benchmark_version}${bench_ext[${e}]}'"
fi
done
# Download the README
if [ ! -f "./benchmark/${t}_README_v4.2.1.txt" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t} readme..."
curl -s --continue-at - ${TRUTHDIR}/AshkenazimTrio/${Trio1[${t}]}/NISTv4.2.1/README_v4.2.1.txt -o ./benchmark/${t}_README_v4.2.1.txt
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}_README_v4.2.1.txt'"
fi
done
#-------------------------------------------------------------------#
# GIAB Trio2 #
#-------------------------------------------------------------------#
bench_ext=(".bed" ".vcf.gz" ".vcf.gz.tbi")
declare -A Trio2=(["HG005"]="HG005_NA24631_son" ["HG006"]="HG006_NA24694_father" ["HG007"]="HG007_NA24695_mother")
for t in ${!Trio2[@]}; do
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t}=${Trio2[${t}]} benchmarking files now..."
# Download the MD5
if [ ! -f "./benchmark/${t}_benchmark.md5" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t} MD5..."
curl -s --continue-at - ${TRUTHDIR}/ChineseTrio/${Trio2[${t}]}/NISTv4.2.1/md5.in -o ./benchmark/${t}_benchmark.md5
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}_benchmark.md5'"
fi
for e in ${!bench_ext[@]}; do
if [ ! -f "./benchmark/${t}${benchmark_version}${bench_ext[${e}]}" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t}${benchmark_version}${bench_ext[${e}]}..."
curl --continue-at - ${TRUTHDIR}/ChineseTrio/${Trio2[${t}]}/NISTv4.2.1/GRCh38/${t}${benchmark_version}${bench_ext[${e}]} -o ./benchmark/${t}${benchmark_version}${bench_ext[${e}]}
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: checking ./benchmark/${t}${benchmark_version}${bench_ext[${e}]} for corruption..."
check_sum=$(cat ./benchmark/${t}_benchmark.md5 | grep "${t}${benchmark_version}${bench_ext[${e}]}" | head -n1)
old_path="GRCh38/"
new_path="./benchmark/"
valid_check_sum="${check_sum/$old_path/$new_path}"
msg=$(echo $valid_check_sum | md5sum -c -)
if [ $? -eq 0 ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: ${msg}"
else
error_exit "${msg}" tracker_file.txt
fi
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}${benchmark_version}${bench_ext[${e}]}'"
fi
done
# Download the README
if [ ! -f "./benchmark/${t}_README_v4.2.1.txt" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t} readme..."
curl -s --continue-at - ${TRUTHDIR}/ChineseTrio/${Trio2[${t}]}/NISTv4.2.1/README_v4.2.1.txt -o ./benchmark/${t}_README_v4.2.1.txt
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}_README_v4.2.1.txt'"
fi
done
## --------------- Download GIAB Aligned Reads ----------------------
## NOTES ----
## Files Downloaded Include:
## 1. NIST Illumina 2x250bp sequence reads aligned to GRCh38 [BAM]
## 2. Sequence reads index [BAI]
## 3. MD5 checksum [MD5]
## Indexes for various sequencing data and aligned files can be found here: https://github.com/genome-in-a-bottle/giab_data_indexes
## Function for downloading BAM + BAI files -----------
download () {
file_input=$1
# split the file_path into an array
IFS='_' read -r -a result <<< "$file_input"
trio_name=${result[0]}
if [ ! -f "./bam/${trio_name}.download" ]; then
# echo "source ./scripts/setup/modules.sh" > ./bam/$trio_name.download
# Command Translation:
# For each line after the header row:
# First, use the first column in each row,
# from the .txt file input, and
# extract the BAM file path end, called "filename", using
# the shell cmd called "basename", then
# close that shell command
# Next, if "filename" containes the correct reference genome name,
# print a command to download that file
# with timestamp wrappers before/after
# Next, process the MD5 flexibly...
# For HG002, use the corrected MD5 sum (output)
# For all others, use the original MD5 sum ($2)
# Next, write the correct MD5 + filepath to a new file, and
# print the command for checking the MD5 sum
# Next, use the 3rd column in each row,
# from the .txt file input, and
# extract the BAI file path end, called "secondfile", using
# the shell cmd called "basename", then
# close that shell command
# Next, if "filename" containes the correct reference genome name,
# print a command to download that file
# with timestamp wrappers before/after
# Next, process the MD5 flexibly...
# For HG002, use the corrected MD5 sum (output)
# For all others, use the original MD5 sum ($2)
# Next, write the correct MD5 + filepath to a new file, and
# print the command for checking the MD5 sum.
awk -F '\t' 'NR!=1 {
cmd = "basename " $1
cmd | getline filename
close(cmd)
}
index(filename, "GRCh38") {
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: downloading ["filename"] now..."
print "curl -o ./triotrain/variant_calling/data/GIAB/bam/"filename" -C - "$1" --keepalive-time 300"
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: downloading ["filename"]... done"
}
filename ~ /HG002.GRCh38/ {
cmd = "grep -Fw "filename" ./bam/HG002_corrected_md5sums.feb19upload.txt"
cmd | getline result
close(cmd)
split(result, output, " ")
print "echo "output[1]"\t./triotrain/variant_calling/data/GIAB/bam/"filename" > ./triotrain/variant_calling/data/GIAB/bam/"filename".md5"
}
filename !~ /HG002/ && filename ~ /GRCh38/ {
print "echo "$2"\t./triotrain/variant_calling/data/GIAB/bam/"filename" > ./triotrain/variant_calling/data/GIAB/bam/"filename".md5"
}
index(filename, "GRCh38") {
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: checking ["filename"] for corruption..."
print "md5sum -c ./triotrain/variant_calling/data/GIAB/bam/"filename".md5"
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: checking ["filename"] for corruption... done"
print "echo ----------------------------------------------------"
}
{
cmd = "basename " $3
cmd | getline secondfile
close(cmd)
}
index(secondfile, "GRCh38") {
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: downloading ["secondfile"] now..."
print "curl -o ./triotrain/variant_calling/data/GIAB/bam/"secondfile" -C - "$3
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: downloading ["secondfile"] ... done"
}
secondfile ~ /HG002.GRCh38/ {
cmd = "grep -Fw "secondfile" ./bam/HG002_corrected_md5sums.feb19upload.txt"
cmd | getline result
close(cmd)
split(result, output, " ")
print "echo "output[1]"\t./triotrain/variant_calling/data/GIAB/bam/"secondfile" > ./triotrain/variant_calling/data/GIAB/bam/"secondfile".md5"
}
secondfile !~ /HG002/ && secondfile ~ /GRCh38/ {
print "echo "$4"\t./triotrain/variant_calling/data/GIAB/bam/"secondfile" > ./triotrain/variant_calling/data/GIAB/bam/"secondfile".md5"
}
index(secondfile, "GRCh38") {
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: checking ["secondfile"] for corruption..."
print "md5sum -c ./triotrain/variant_calling/data/GIAB/bam/"secondfile".md5"
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: checking ["secondfile"] for corruption... done"
print "echo ===================================================="
}' "./bam/${file_input}" >> "./bam/${trio_name}.download"
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | './triotrain/variant_calling/data/GIAB/bam/${trio_name}.download'"
fi
}
## Function for calculating AVG COVERAGE -----------
calc_cov () {
file_input=$1
# split the file_path into an array
IFS='_' read -r -a result <<< "$file_input"
trio_name=${result[0]}
if [ ! -f "./bam/${trio_name}.run" ]; then
# echo "source ./scripts/setup/modules.sh" > ./bam/${trio_name}.run
# Command Translation:
# For each line after the header row:
# First, use the first column in each row,
# from the .txt file input, and
# extract the BAM file path end, called "filename", using
# the shell cmd called "basename", then
# close that shell command
# Next, if "filename" containes the correct reference genome name,
# print a if/then statement to test if job file to run coverage
# calculations exists, and create it if it does not.
# print a if/then statement to test if output from avg_coverage
# calculations exists, and create it if it does not.
awk -F '\t' 'NR!=1 {
cmd = "basename " $1
cmd | getline filename
close(cmd)
}
index(filename, "GRCh38") {
cmd = "basename " filename " .bam"
cmd | getline label
close(cmd)
print "if [ ! -f ./triotrain/variant_calling/data/GIAB/bam/"label".coverage.out ]; then"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: Calculating Coverage for ["label"] now...\""
print " samtools coverage ./triotrain/variant_calling/data/GIAB/bam/"filename" --output ./triotrain/variant_calling/data/GIAB/bam/"label".coverage.out"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: Calculating Coverage for ["label"] now... done\""
print "else"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: bash file found | ./triotrain/variant_calling/data/GIAB/bam/"label".coverage.out\""
print "fi"
print "echo ----------------------------------------------------"
split(label, output, ".")
print "if [ ! -f ./triotrain/variant_calling/data/GIAB/bam/"output[1]".avg_coverage.out ]; then"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: Calculating Average Coverage for ["output[1]"] now...\""
print " awk '\''{ total += $6; count++ } END { print \""label" AVERAGE COVERAGE = \" total/count }'\'' ./triotrain/variant_calling/data/GIAB/bam/"label".coverage.out > ./triotrain/variant_calling/data/GIAB/bam/"output[1]".avg_coverage.out"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: Calculating Average Coverage for ["output[1]"] now... done\""
print "else"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: output file found | ./triotrain/variant_calling/data/GIAB/bam/"output[1]".avg_coverage.out\""
print "fi"
print "echo ===================================================="
} ' "./bam/${file_input}" >> "./bam/${trio_name}.run"
# if running on an interactive session with lots of memory,
# uncomment the line below
# . ./bam/$1.run
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | './triotrain/variant_calling/data/GIAB/bam/${trio_name}.run'"
fi
}
# output dir
mkdir -p bam
#-------------------------------------------------------------------#
# GIAB Trio1 #
#-------------------------------------------------------------------#
# BAM and BAI with MD5 checksums can be found here: https://github.com/genome-in-a-bottle/giab_data_indexes/blob/c4d3b95c2ebf14c175151e4723f82e8980722e90/AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_novoalign_GRCh37_GRCh38_NHGRI_06062016
## CHECKSUMS -----------
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading AJ Trio checksum now..."
curl -s -C - https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_novoalign_GRCh37_GRCh38_NHGRI_06062016 -o ./bam/AJtrio_Illumina_2x250bps_novoaligns_GRCh37_GRCh38.txt
# NOTE: 07 JUNE 2023:
# The checksum for HG002 does not match the ^ above MD5,
# therefore, download the correct one.
curl -s -C - https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002_corrected_md5sums.feb19upload.txt -o ./bam/HG002_corrected_md5sums.feb19upload.txt
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading AJ Trio checksum now... done"
download "AJtrio_Illumina_2x250bps_novoaligns_GRCh37_GRCh38.txt"
# calc_cov "AJtrio_Illumina_2x250bps_novoaligns_GRCh37_GRCh38.txt"
#-------------------------------------------------------------------#
# GIAB Trio2 #
#-------------------------------------------------------------------#
# BAM and BAI with MD5 checksums can be found here: https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/ChineseTrio/alignment.index.ChineseTrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38_NHGRI_04062016
## CHECKSUMS -----------
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading HC Trio checksum now..."
curl -s -C - https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.ChineseTrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38_NHGRI_04062016 -o ./bam/HCtrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38.txt
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading HC Trio checksum now... done"
download "HCtrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38.txt"
# calc_cov "HCtrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38.txt"
echo -e "=== scripts/setup/download_humanGIABdata.sh > end $(date)"
Check | Data Directories
Check | allele_freq/
cohort-chr10.release_missing2ref.no_calls.vcf.gz cohort-chr21.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr10.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr22.release_missing2ref.no_calls.vcf.gz
cohort-chr11.release_missing2ref.no_calls.vcf.gz cohort-chr22.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr11.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr2.release_missing2ref.no_calls.vcf.gz
cohort-chr12.release_missing2ref.no_calls.vcf.gz cohort-chr2.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr12.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr3.release_missing2ref.no_calls.vcf.gz
cohort-chr13.release_missing2ref.no_calls.vcf.gz cohort-chr3.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr13.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr4.release_missing2ref.no_calls.vcf.gz
cohort-chr14.release_missing2ref.no_calls.vcf.gz cohort-chr4.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr14.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr5.release_missing2ref.no_calls.vcf.gz
cohort-chr15.release_missing2ref.no_calls.vcf.gz cohort-chr5.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr15.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr6.release_missing2ref.no_calls.vcf.gz
cohort-chr16.release_missing2ref.no_calls.vcf.gz cohort-chr6.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr16.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr7.release_missing2ref.no_calls.vcf.gz
cohort-chr17.release_missing2ref.no_calls.vcf.gz cohort-chr7.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr17.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr8.release_missing2ref.no_calls.vcf.gz
cohort-chr18.release_missing2ref.no_calls.vcf.gz cohort-chr8.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr18.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr9.release_missing2ref.no_calls.vcf.gz
cohort-chr19.release_missing2ref.no_calls.vcf.gz cohort-chr9.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr19.release_missing2ref.no_calls.vcf.gz.tbi cohort-chrX.release_missing2ref.no_calls.vcf.gz
cohort-chr1.release_missing2ref.no_calls.vcf.gz cohort-chrX.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr1.release_missing2ref.no_calls.vcf.gz.tbi cohort-chrY.release_missing2ref.no_calls.vcf.gz
cohort-chr20.release_missing2ref.no_calls.vcf.gz cohort-chrY.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi concat_PopVCFs.sh
cohort-chr21.release_missing2ref.no_calls.vcf.gz PopVCF.merge.list
Check | bam/
Check | benchmark/
HG002_GRCh38_1_22_v4.2.1_benchmark.bed HG005_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz HG005_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi HG005_README_v4.2.1.txt
HG002_README_v4.2.1.txt HG006_benchmark.md5
HG003_GRCh38_1_22_v4.2.1_benchmark.bed HG006_GRCh38_1_22_v4.2.1_benchmark.bed
HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz HG006_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi HG006_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG003_README_v4.2.1.txt HG006_README_v4.2.1.txt
HG004_GRCh38_1_22_v4.2.1_benchmark.bed HG007_benchmark.md5
HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz HG007_GRCh38_1_22_v4.2.1_benchmark.bed
HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi HG007_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG004_README_v4.2.1.txt HG007_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG005_benchmark.md5 HG007_README_v4.2.1.txt
HG005_GRCh38_1_22_v4.2.1_benchmark.bed
Check | reference/
Check | Processing Raw Data Scripts
In addition to the raw data, download_GIAB.sh also creates (3) bash scripts to process the raw data into the formats expected by the tutorial:
- concat_PopVCFs.sh + input file PopVCF.merge.list — merges the per-chromosome PopVCFs into a single file
- AJtrio.download — downloads GIAB Trio1
- HCtrio.download — downloads GIAB Trio2
6. Process Raw Data
Note
These scripts can either be wrapped with SBATCH, or run interactively at the command line if you have enough memory. However, each script can take a while to complete, particularly the .download scripts (1hr+).
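For example, a hedged sketch of wrapping one of the generated scripts with SBATCH (the partition, account, and resource values below are placeholders for your cluster):

# Hypothetical SBATCH wrapper for one of the generated download scripts
sbatch --partition=<partition_name> --account=<account_name> \
  --time=0-08:00:00 --mem=4G --job-name=AJtrio_download \
  --wrap="bash triotrain/variant_calling/data/GIAB/bam/AJtrio.download"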
a. Merge the PopVCF
Warning | Download Size
We need to create a single, genome-wide PopVCF from the raw per-chromosome PopVCF files. The per-chr Population VCFs were produced by the One Thousand Genomes Project (1kGP) and used to train the Human WGS.AF model. You can view the raw files on Google Cloud Platform.
bash triotrain/variant_calling/data/GIAB/allele_freq/concat_PopVCFs.sh
Example | concat_PopVCFs.sh
source ./scripts/setup/modules.sh
bcftools concat --file-list ./triotrain/variant_calling/data/GIAB/allele_freq/PopVCF.merge.list -Oz -o ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz
bcftools index ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz
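Once the concatenation finishes, a quick, hedged sanity check of the merged PopVCF (bcftools index -s reports per-contig record counts from the index):

# Expect one line per contig (chr1-chr22, chrX, chrY) in the merged PopVCF
bcftools index -s ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz | cut -f1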
Check Intermediate Data | allele_freq/
b. Download Sequence Data
Warning | Download Size
Warning
Given the large file sizes, the NIST FTP server can run slowly, causing curl to time out. You may need to run these scripts repeatedly until all data are transferred completely.
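If you would rather not re-launch the script by hand, a minimal retry sketch (the log file name here is hypothetical; the .download scripts themselves are generated in step 5 and run below):

# Re-run the generated script a few times; curl resumes partial files via '-C -',
# and the embedded md5sum checks report 'FAILED' for any corrupted file.
for attempt in {1..5}; do
  bash triotrain/variant_calling/data/GIAB/bam/AJtrio.download | tee aj_download.log
  if ! grep -q "FAILED" aj_download.log; then
    echo "all AJ Trio downloads verified on attempt ${attempt}"
    break
  fi
done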
We need to download the large sequence data files, and confirm they are not corrupted by checking the MD5 checksums. These BAM/BAI
files originate from the GIAB FTP site. An index of GIAB data created with these samples can be found on GitHub.
bash triotrain/variant_calling/data/GIAB/bam/AJtrio.download
Example | AJtrio.download
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG002.GRCh38.2x250.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.GRCh38.2x250.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG002.GRCh38.2x250.bam]... done
echo 56c30eaa4e2f25ff0ac80ef30e09d78e ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam > ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG002.GRCh38.2x250.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG002.GRCh38.2x250.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG002.GRCh38.2x250.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.GRCh38.2x250.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG002.GRCh38.2x250.bam.bai] ... done
echo a3c2b449df6509ca83fbd3fea22b9aee ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG002.GRCh38.2x250.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG002.GRCh38.2x250.bam.bai] for corruption... done
echo ====================================================
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG003.GRCh38.2x250.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG003_NA24149_father/NIST_Illumina_2x250bps/novoalign_bams/HG003.GRCh38.2x250.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG003.GRCh38.2x250.bam]... done
echo 1b1ad2e8f5eba062086890cceb9671ab ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam > ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG003.GRCh38.2x250.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG003.GRCh38.2x250.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG003.GRCh38.2x250.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG003_NA24149_father/NIST_Illumina_2x250bps/novoalign_bams/HG003.GRCh38.2x250.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG003.GRCh38.2x250.bam.bai] ... done
echo a2697b3ebb9b5ac6d9a27734406a211c ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG003.GRCh38.2x250.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG003.GRCh38.2x250.bam.bai] for corruption... done
echo ====================================================
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG004.GRCh38.2x250.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG004_NA24143_mother/NIST_Illumina_2x250bps/novoalign_bams/HG004.GRCh38.2x250.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG004.GRCh38.2x250.bam]... done
echo e084422a267170b550c9704ad61db9b2 ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam > ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG004.GRCh38.2x250.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG004.GRCh38.2x250.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG004.GRCh38.2x250.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG004_NA24143_mother/NIST_Illumina_2x250bps/novoalign_bams/HG004.GRCh38.2x250.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG004.GRCh38.2x250.bam.bai] ... done
echo 4827d6457b202f85482787964a1bc6c7 ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG004.GRCh38.2x250.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG004.GRCh38.2x250.bam.bai] for corruption... done
echo ====================================================
Check Intermediate Data | HG002:
Check Intermediate Data | HG003:
Check Intermediate Data | HG004:
bash triotrain/variant_calling/data/GIAB/bam/HCtrio.download
Example | HCtrio.download
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG005_NA24631_son/HG005_NA24631_son_HiSeq_300x/NHGRI_Illumina300X_Chinesetrio_novoalign_bams/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam]... done
echo 50080d14ba49462cfc6348e1fb41b5b3 ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam > ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG005_NA24631_son/HG005_NA24631_son_HiSeq_300x/NHGRI_Illumina300X_Chinesetrio_novoalign_bams/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai] ... done
echo 2d38162ba61e909d08e4ed438096174e ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai] for corruption... done
echo ====================================================
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG006_NA24694-huCA017E_father/NA24694_Father_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam]... done
echo 3a7c239988862e5a4556eb8781058b1f ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam > ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG006_NA24694-huCA017E_father/NA24694_Father_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] ... done
echo 7bfe1f0988d27898befeaa4673d23a47 ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] for corruption... done
echo ====================================================
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG007_NA24695-hu38168_mother/NA24695_Mother_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam]... done
echo 3b8442105a4a45f71c3c1bda8af238dd ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam > ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG007_NA24695-hu38168_mother/NA24695_Mother_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] ... done
echo 0c218cecd8cde9ad272de6d77bbd110f ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] for corruption... done
echo ====================================================
Check Intermediate Data | HG005:
Check Intermediate Data | HG006:
Check Intermediate Data | HG007:
7. Create TrioTrain Inputs
There are (3) required input files we must create before we can run TrioTrain. Complete details about all required data can be found in the TrioTrain User Guide.
a. Reference Dictionary File (.dict)
Warning
This step is specific to the Human reference genome GRCh38 since cattle-specific input files are pre-packaged with TrioTrain. If you are working with a new species, you will need to create this file for your reference genome.
We need a reference dictionary file in the same directory as the reference genome. This file defines the valid genomic coordinates for TrioTrain's region shuffling.
By default, region shuffling will only use the autosomes and the X chromosome. However, you can expand or contract the shuffling area for a particular trio by providing an existing BED file under the RegionsFile column within the metadata file (.csv).
picard CreateSequenceDictionary \
--REFERENCE ./triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.fasta \
--OUTPUT ./triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.dict \
--SPECIES human
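If picard is not available in your environment, samtools dict produces an equivalent sequence dictionary; a hedged alternative (header metadata such as the species tag may differ slightly):

# Alternative using samtools (assumes samtools is on your PATH)
samtools dict ./triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.fasta \
  -o ./triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.dict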
Check | Dictionary File
b. Metadata file (.csv)
We also need a metadata file to tell TrioTrain where to find all of the previously downloaded Human GIAB data. This file contains pedigree information and the absolute paths for file inputs. Absolute paths are required to help the Apptainer/Singularity containers identify local files. Formatting specifications for this required input can be found in the TrioTrain User Guide.
For the tutorial, we've created a helper script that automatically creates an example of this file. The script assumes the tutorial data were downloaded to the default locations described above and uses those expectations to add local paths. However, outside of the tutorial, this file is a user-created input.
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
python3 ./triotrain/model_training/tutorial/create_metadata.py
Example | create_metadata.py
#!/usr/bin/python3
"""
description: creates a tutorial metadata file from the Human GIAB data downloaded during the walk-through
example:
python3 triotrain/model_training/tutorial/create_metadata.py
"""
import os
from pathlib import Path
from sys import path
import regex
# get the absolute path to the triotrain/ dir
abs_path = Path(__file__).resolve()
module_path = str(abs_path.parent.parent.parent)
path.append(module_path)
from helpers.dictionary import add_to_dict
from helpers.files import WriteFiles
from helpers.utils import get_logger
from helpers.wrapper import Wrapper, timestamp
# Collect start time
Wrapper(__file__, "start").wrap_script(timestamp())
# Create error log
current_file = os.path.basename(__file__)
module_name = os.path.splitext(current_file)[0]
logger = get_logger(module_name)
defaults = {
"RunOrder": 1,
"RunName": "Human_tutorial",
"ChildSampleID": "NA24385",
"ChildLabID": "HG002",
"FatherSampleID": "NA24149",
"FatherLabID": "HG003",
"MotherSampleID": "NA24695",
"MotherLabID": "HG004",
"ChildSex": "M",
"RefFASTA": "/triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.fasta",
"PopVCF": "/triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz",
"RegionsFile": "NA",
"ChildReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam",
"ChildTruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"ChildCallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed",
"FatherReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam",
"FatherTruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"FatherCallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed",
"MotherReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam",
"MotherTruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"MotherCallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed",
"Test1ReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam",
"Test1TruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG005_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"Test1CallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG005_GRCh38_1_22_v4.2.1_benchmark.bed",
"Test2ReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam",
"Test2TruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG006_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"Test2CallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG006_GRCh38_1_22_v4.2.1_benchmark.bed",
"Test3ReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam",
"Test3TruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG007_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"Test3CallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG007_GRCh38_1_22_v4.2.1_benchmark.bed",
}
cwd = Path.cwd()
output_dict = dict()
for k, v in defaults.items():
match = regex.search(r"\/triotrain\/variant_calling\/.*", str(v))
if match:
add_to_dict(
update_dict=output_dict,
new_key=k,
new_val=f"{cwd}{v}",
logger=logger,
logger_msg="[tutorial]",
)
else:
add_to_dict(
update_dict=output_dict,
new_key=k,
new_val=v,
logger=logger,
logger_msg="[tutorial]",
)
output_file = WriteFiles(
path_to_file=str(cwd / "triotrain" / "model_training" / "tutorial"),
file=f"GIAB.Human_tutorial_metadata.csv",
logger=logger,
logger_msg="[tutorial]",
)
output_file.check_missing()
if output_file.file_exists:
logger.info("[tutorial]: tutorial metadata file already exists... SKIPPING AHEAD")
else:
logger.info("[tutorial]: creating a tutorial metadata file now...")
output_file.add_rows(col_names=output_dict.keys(), data_dict=output_dict)
output_file.check_missing()
if output_file.file_exists:
logger.info("[tutorial]: successfully created the tutorial metadata file")
# Collect end time
Wrapper(__file__, "start").wrap_script(timestamp())
Check | Tutorial Metadata File
c. SLURM Resource Config File (.json)
The last required input we need for TrioTrain is the SLURM resource config file (.json). This file tells TrioTrain what resources to request from your HPC cluster when submitting SLURM jobs.
Example | Resource Config File
{
"make_examples": {
"partition": "hpc5,hpc6,BioCompute",
"nodes": 1,
"ntasks": 40,
"mem": 379067,
"CPUmem": 9000,
"time": "0-2:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"beam_shuffle": {
"partition": "hpc5,BioCompute",
"nodes": 1,
"ntasks": 40,
"mem": 379067,
"time": "0-2:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"re_shuffle": {
"partition": "BioCompute,Lewis",
"nodes": 1,
"ntasks": 1,
"mem": "200G",
"time": "0-10:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"train_eval": {
"partition": "gpu3",
"gres": "gpu:2",
"nodes": 1,
"ntasks": 16,
"mem": "0",
"time": "2-00:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"select_ckpt": {
"partition": "BioCompute,Lewis",
"nodes": 1,
"ntasks": 1,
"mem": 500,
"time": "0-00:30:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"call_variants": {
"partition": "hpc5,BioCompute",
"nodes": 1,
"ntasks": 40,
"mem": 379067,
"time": "2-00:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"compare_happy": {
"partition": "BioCompute,hpc5",
"nodes": 1,
"ntasks": 40,
"mem": "300G",
"time": "0-12:00:00",
"account": "biocommunity",
"email": "jakth2@mail.missouri.edu"
},
"convert_happy": {
"partition": "BioCompute,hpc3,hpc5,Lewis",
"nodes": 1,
"ntasks": 24,
"mem": "120G",
"time": "0-05:00:00",
"account": "biocommunity",
"email": "jakth2@mail.missouri.edu"
},
"show_examples": {
"partition": "BioCompute,Lewis",
"nodes": 1,
"mem": "1G",
"time": "0-02:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"summary": {
"partition": "hpc5,hpc6,Lewis",
"nodes": 1,
"ntasks": 4,
"mem": "40G",
"time": "0-10:00:00",
"account": "schnabellab",
"email": "jakth2@mail.missouri.edu"
}
}
The hardware listed above for each phase (e.g. mem, ntasks, gpus, etc.) varies, as some phases are memory intensive. These values should not be interpreted as the minimum or the optimum resources required for each phase. The MU Lewis Research Computing Cluster is heterogeneous, so the example config file requests resources that maximize the number of eligible compute nodes when running memory-intensive phases.
For the tutorial, copy the above example into a new file, then manually edit the SBATCH parameters to match your HPC cluster (i.e. changing the partition list, account, and email address). This new resource config file is passed to TrioTrain via -r </path/to/new_file.json>.
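Before handing the edited file to TrioTrain, it can help to confirm the JSON still parses; a minimal sketch, using only the Python standard library:

# Prints the parsed JSON on success, or the location of a syntax error
python3 -m json.tool </path/to/new_file.json>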
8. Run Shuffling Demo
Shuffling the labeled examples is a critical step in re-training DeepVariant because the model assumes successive training images are independent of one another. DeepVariant includes an Apache Beam pipeline that puts training examples in random genomic order. However, in our experience, getting the Python SDK for Beam, Apache Spark, and SLURM to cooperate is a dependency nightmare.
Our alternative approach, referred to as "Region Shuffling", splits the autosomes and the X chromosome into subset regions before making the labeled examples. Using the .dict file previously created for the reference, TrioTrain will automatically determine how many regions are required.
Warning
For our analyses in bovine genomes, we exclude parts of the genome from region shuffling, including:
- the Y chromosome and the Mitochondrial genome — due to limitations with the ARS-UCD1.2_Btau5.0.1Y reference genome
- the unmapped reads — due to a large number of contigs containing a small number of variants that cannot be easily distributed across 60-150+ shuffling regions
TrioTrain will automatically search the reference genome .dict file for contigs labeled Y and M/MT and will ignore them prior to defining shuffling regions.
Use the --unmapped-reads flag to provide a contig prefix pattern to exclude. The default value (NKLS) is appropriate for the ARS-UCD1.2_Btau5.0.1Y reference genome.
Note | Identifying Contig Labels
A shuffling region is defined in a non-overlapping, 0-based BED
file. The goal is to create a subset of labeled examples which are representative of the whole genome, yet small enough to be shuffled within the memory of a single compute node. After defining the shuffling regions, labeled and shuffled examples are created by running parallel SLURM jobs. In our testing with bovine genomes, constraining each region to produce ~200,000 labeled examples works well.
In our experience, numerous small jobs that run quickly minimize wall time between phases. We also prioritize a large number of regions, as this further increases the randomness during the final re_shuffle process, which randomizes the order in which regions are given to the model.
Example | Computing Power
Given ~370G of memory across 40 cores, DeepVariant will shuffle each region's examples in around 6 minutes; regions can be submitted in parallel to reduce wall time.
Note | Output Volume
A typical cattle genome with ~8 million variants results in around 65 regions, producing (65 regions x 40 cores) tfrecord output files. TrioTrain repeats this process for shuffling, resulting in another (65 regions x 40 cores) shuffled tfrecord output files. This is repeated for all 3 genomes within a trio, resulting in a large volume of individual files (65 regions x 40 cores x 2 phases, per genome).
The final step of our approach, referred to as re_shuffle, condenses the 40 tfrecord files for each region, reducing the number of files fed to DeepVariant per genome back down to 65. To further improve shuffling, we randomize the order in which regions are provided to DeepVariant.
If your species of interest produces a large volume of variants (>8 million per genome), TrioTrain can easily overload your cluster's I/O capabilities, write more than 10,000 files to a single directory, or overwhelm the SLURM scheduler by submitting thousands of jobs simultaneously. Future versions of TrioTrain will address these challenges, but proceed with caution; your cluster's SysAdmin will thank you!
This shuffling demo enables you to adjust TrioTrain's parameters for estimating how many shuffling regions to use for each genome. The two primary flags that control Region Shuffling are:

- `--max-examples` — defines the upper limit on the total number of examples to produce for each region; values greater than the default (200,000) will increase the wall time.
- `--est-examples` — defines the expected number of examples DeepVariant produces per genomic variant; the default value (1.5) is appropriate for bovine genomes, but with the human GIAB samples, we found a value of 1 to be more appropriate.
Additionally, TrioTrain bases the Region Shuffling calculations on the total number of variants found in each genome. Because this value can vary drastically across training genomes, TrioTrain uses the number of PASS variants within the corresponding TruthVCF. A larger number of truth variants results in more shuffling regions per genome.
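As a rough sketch of how those three quantities interact (TrioTrain's internal calculation may differ in detail):

```bash
# Back-of-the-envelope sketch only; not TrioTrain's exact formula.
TOTAL_PASS_VARIANTS=8000000    # PASS variants in the TruthVCF (typical cattle genome)
MAX_EXAMPLES=200000            # --max-examples (default)
EST_EXAMPLES_X10=15            # --est-examples default of 1.5, scaled by 10 for integer math

EXPECTED_EXAMPLES=$(( TOTAL_PASS_VARIANTS * EST_EXAMPLES_X10 / 10 ))
REGIONS=$(( (EXPECTED_EXAMPLES + MAX_EXAMPLES - 1) / MAX_EXAMPLES ))   # ceiling division
echo "approximate shuffling regions: ${REGIONS}"                       # ~60 for this example
```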
Before running TrioTrain on a new mammalian genome or on a new HPC cluster, we strongly recommend completing the region shuffling demo with the smallest chromosome possible, using the `--demo-chr=<region_literal>` flag.
Note | Working in Other Species
The appropriate values for `--max-examples` and `--est-examples` may vary widely across species, as the number of examples DeepVariant produces depends on several factors, including:
- genome size
- the reference genome used
- how many variants are identified in an individual genome
For example, the F1-hybrid offspring in cattle are crosses between divergent lineages, resulting in an abnormally high number of heterozygous genotypes compared to typical cattle genomes. Instead of the typical ~8 million variants per genome, these samples produced 20+ million variants per genome.
For this tutorial, complete the demo for Human genomes by editing the following command to include your new SLURM config file:
# You can add the --dry-run flag to this command to confirm the TrioTrain pipeline runs smoothly before submitting jobs to the queue
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
python3 triotrain/run_trio_train.py \
-g Father \
--unmapped-reads chrUn \
-m triotrain/model_training/tutorial/GIAB.Human_tutorial_metadata.csv \
-n demo \
--demo-mode \
--demo-chr chr21 \
--num-tests 3 \
--custom-checkpoint triotrain/model_training/pretrained_models/v1.4.0_withIS_withAF/wgs_af.model.ckpt \
--output ../TUTORIAL \
-r </path/to/new_file.json>
This demo will produce and submit the following SLURM jobs:
- `make_examples` — for both Father & Child
- `beam_shuffle` — for both Father & Child
- `re_shuffle` — for both Father & Child
- `call_variants` — for just the Father
There are (6) types of output files from running the demo:
| File Extension | Number of Files (Shards) | Genome Used |
| --- | --- | --- |
| `labeled.tfrecords*.gz` | N | Father, Child |
| `labeled.tfrecords*.gz.example_info.json` | N | Father, Child |
| `labeled.shuffled*.tfrecord.gz` | N | Father, Child |
| `labeled.shuffled.dataset_config.pbtxt` | 1 | Father, Child |
| `labeled.shuffled.merged.dataset_config.pbtxt` | 1 | Father, Child |
| `labeled.shuffled.merged.tfrecord.gz` | 1 | Father, Child |

N = number of CPUs requested in your `resources_used.json` file; shards are numbered 0 to (N - 1).
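If you want a quick sanity check that every shard was written before moving on, something like the following works; the demo examples directory shown here is assumed from the tutorial's `--output` layout and may differ on your system:

```bash
# Assumed demo output location; adjust to match your --output path.
EXAMPLES_DIR="../TUTORIAL/demo/Human_tutorial/examples"

# With the tutorial resource config, N=40, so each count below should print 40:
ls "${EXAMPLES_DIR}"/Father.chr21.labeled.tfrecords-*-of-00040.gz | wc -l
ls "${EXAMPLES_DIR}"/Father.chr21.labeled.shuffled-*-of-00040.tfrecord.gz | wc -l
```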
Expected Output | Father Shuffling:
Father.chr21.labeled.shuffled-00000-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00001-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00002-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00003-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00004-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00005-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00006-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00007-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00008-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00009-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00010-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00011-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00012-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00013-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00014-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00015-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00016-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00017-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00018-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00019-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00020-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00021-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00022-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00023-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00024-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00025-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00026-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00027-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00028-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00029-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00030-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00031-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00032-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00033-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00034-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00035-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00036-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00037-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00038-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00039-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled.dataset_config.pbtxt
Father.chr21.labeled.tfrecords-00000-of-00040.gz
Father.chr21.labeled.tfrecords-00000-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00001-of-00040.gz
Father.chr21.labeled.tfrecords-00001-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00002-of-00040.gz
Father.chr21.labeled.tfrecords-00002-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00003-of-00040.gz
Father.chr21.labeled.tfrecords-00003-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00004-of-00040.gz
Father.chr21.labeled.tfrecords-00004-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00005-of-00040.gz
Father.chr21.labeled.tfrecords-00005-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00006-of-00040.gz
Father.chr21.labeled.tfrecords-00006-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00007-of-00040.gz
Father.chr21.labeled.tfrecords-00007-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00008-of-00040.gz
Father.chr21.labeled.tfrecords-00008-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00009-of-00040.gz
Father.chr21.labeled.tfrecords-00009-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00010-of-00040.gz
Father.chr21.labeled.tfrecords-00010-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00011-of-00040.gz
Father.chr21.labeled.tfrecords-00011-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00012-of-00040.gz
Father.chr21.labeled.tfrecords-00012-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00013-of-00040.gz
Father.chr21.labeled.tfrecords-00013-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00014-of-00040.gz
Father.chr21.labeled.tfrecords-00014-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00015-of-00040.gz
Father.chr21.labeled.tfrecords-00015-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00016-of-00040.gz
Father.chr21.labeled.tfrecords-00016-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00017-of-00040.gz
Father.chr21.labeled.tfrecords-00017-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00018-of-00040.gz
Father.chr21.labeled.tfrecords-00018-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00019-of-00040.gz
Father.chr21.labeled.tfrecords-00019-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00020-of-00040.gz
Father.chr21.labeled.tfrecords-00020-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00021-of-00040.gz
Father.chr21.labeled.tfrecords-00021-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00022-of-00040.gz
Father.chr21.labeled.tfrecords-00022-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00023-of-00040.gz
Father.chr21.labeled.tfrecords-00023-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00024-of-00040.gz
Father.chr21.labeled.tfrecords-00024-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00025-of-00040.gz
Father.chr21.labeled.tfrecords-00025-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00026-of-00040.gz
Father.chr21.labeled.tfrecords-00026-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00027-of-00040.gz
Father.chr21.labeled.tfrecords-00027-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00028-of-00040.gz
Father.chr21.labeled.tfrecords-00028-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00029-of-00040.gz
Father.chr21.labeled.tfrecords-00029-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00030-of-00040.gz
Father.chr21.labeled.tfrecords-00030-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00031-of-00040.gz
Father.chr21.labeled.tfrecords-00031-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00032-of-00040.gz
Father.chr21.labeled.tfrecords-00032-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00033-of-00040.gz
Father.chr21.labeled.tfrecords-00033-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00034-of-00040.gz
Father.chr21.labeled.tfrecords-00034-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00035-of-00040.gz
Father.chr21.labeled.tfrecords-00035-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00036-of-00040.gz
Father.chr21.labeled.tfrecords-00036-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00037-of-00040.gz
Father.chr21.labeled.tfrecords-00037-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00038-of-00040.gz
Father.chr21.labeled.tfrecords-00038-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00039-of-00040.gz
Father.chr21.labeled.tfrecords-00039-of-00040.gz.example_info.json
Confirm that SLURM jobs completed successfully
Expected Output | Child Shuffling:
Child.chr21.labeled.shuffled-00000-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00001-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00002-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00003-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00004-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00005-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00006-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00007-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00008-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00009-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00010-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00011-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00012-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00013-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00014-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00015-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00016-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00017-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00018-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00019-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00020-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00021-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00022-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00023-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00024-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00025-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00026-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00027-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00028-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00029-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00030-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00031-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00032-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00033-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00034-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00035-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00036-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00037-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00038-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00039-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled.dataset_config.pbtxt
Child.chr21.labeled.tfrecords-00000-of-00040.gz
Child.chr21.labeled.tfrecords-00000-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00001-of-00040.gz
Child.chr21.labeled.tfrecords-00001-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00002-of-00040.gz
Child.chr21.labeled.tfrecords-00002-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00003-of-00040.gz
Child.chr21.labeled.tfrecords-00003-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00004-of-00040.gz
Child.chr21.labeled.tfrecords-00004-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00005-of-00040.gz
Child.chr21.labeled.tfrecords-00005-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00006-of-00040.gz
Child.chr21.labeled.tfrecords-00006-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00007-of-00040.gz
Child.chr21.labeled.tfrecords-00007-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00008-of-00040.gz
Child.chr21.labeled.tfrecords-00008-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00009-of-00040.gz
Child.chr21.labeled.tfrecords-00009-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00010-of-00040.gz
Child.chr21.labeled.tfrecords-00010-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00011-of-00040.gz
Child.chr21.labeled.tfrecords-00011-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00012-of-00040.gz
Child.chr21.labeled.tfrecords-00012-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00013-of-00040.gz
Child.chr21.labeled.tfrecords-00013-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00014-of-00040.gz
Child.chr21.labeled.tfrecords-00014-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00015-of-00040.gz
Child.chr21.labeled.tfrecords-00015-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00016-of-00040.gz
Child.chr21.labeled.tfrecords-00016-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00017-of-00040.gz
Child.chr21.labeled.tfrecords-00017-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00018-of-00040.gz
Child.chr21.labeled.tfrecords-00018-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00019-of-00040.gz
Child.chr21.labeled.tfrecords-00019-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00020-of-00040.gz
Child.chr21.labeled.tfrecords-00020-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00021-of-00040.gz
Child.chr21.labeled.tfrecords-00021-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00022-of-00040.gz
Child.chr21.labeled.tfrecords-00022-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00023-of-00040.gz
Child.chr21.labeled.tfrecords-00023-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00024-of-00040.gz
Child.chr21.labeled.tfrecords-00024-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00025-of-00040.gz
Child.chr21.labeled.tfrecords-00025-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00026-of-00040.gz
Child.chr21.labeled.tfrecords-00026-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00027-of-00040.gz
Child.chr21.labeled.tfrecords-00027-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00028-of-00040.gz
Child.chr21.labeled.tfrecords-00028-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00029-of-00040.gz
Child.chr21.labeled.tfrecords-00029-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00030-of-00040.gz
Child.chr21.labeled.tfrecords-00030-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00031-of-00040.gz
Child.chr21.labeled.tfrecords-00031-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00032-of-00040.gz
Child.chr21.labeled.tfrecords-00032-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00033-of-00040.gz
Child.chr21.labeled.tfrecords-00033-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00034-of-00040.gz
Child.chr21.labeled.tfrecords-00034-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00035-of-00040.gz
Child.chr21.labeled.tfrecords-00035-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00036-of-00040.gz
Child.chr21.labeled.tfrecords-00036-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00037-of-00040.gz
Child.chr21.labeled.tfrecords-00037-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00038-of-00040.gz
Child.chr21.labeled.tfrecords-00038-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00039-of-00040.gz
Child.chr21.labeled.tfrecords-00039-of-00040.gz.example_info.json
Confirm that SLURM jobs completed successfully
Expected Output | Benchmarking:
less ../TUTORIAL/demo/summary/Human_tutorial.SLURM.job_numbers.csv
AnalysisName,RunName,Parent,Phase,JobList
Baseline-v1.4.0,Human_tutorial,Father,make_examples,27669522
Baseline-v1.4.0,Human_tutorial,Father,beam_shuffle,27669523
Baseline-v1.4.0,Human_tutorial,Father,make_examples,27669524
Baseline-v1.4.0,Human_tutorial,Father,beam_shuffle,27669525
# The JobList column will differ based on SLURM job numbers
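One way to confirm these jobs finished cleanly is to query SLURM accounting with the IDs from the JobList column; `sacct` is standard SLURM tooling, though the job IDs below are just the example values from the CSV above:

```bash
# Replace the IDs with the values from your own JobList column.
sacct --jobs=27669522,27669523,27669524,27669525 \
      --format=JobID,JobName%25,State,Elapsed,MaxRSS

# Every job should report a State of COMPLETED.
```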
The following will provide a conservative estimate for the `--max-examples` and `--est-examples` parameters, to ensure shuffling easily fits within your available memory.
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
python3 triotrain/model_training/tutorial/estimate.py \
--vcf-file ../TUTORIAL/demo/Human_tutorial/test_Father/test1_chr21.vcf.gz \
-g Father \
--demo-mode \
--env-file ../TUTORIAL/demo/envs/run1.env
Expected Output | Estimating TrioTrain Parameters:
===== start of triotrain/model_training/tutorial/estimate.py @ 2023-06-29 11:25:27 =====
2023-06-29 11:25:27 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: number of REF/REF variants found | 44,452
2023-06-29 11:25:27 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: number of PASS variants found | 114,332
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: default maximum examples per region | 200,000
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: default value for --max-examples is appropriate
2023-06-29 11:25:28 AM - [INFO] - adding Demo_TotalVariants='114332'
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: added 'Demo_TotalVariants=114332' to env file
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: number of examples made | 75,902
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: calculated examples per variant | 0.664
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: prevent underestimating which creates too many examples per region by rounding up to the nearest 0.5 | 1.0
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: default examples per variant | 1.5
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: difference between default and calculated examples per variant | 0.836
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: when running TrioTrain outside of this tutorial, please use '--est-examples=1.0'
2023-06-29 11:25:28 AM - [INFO] - adding Est_Examples='1.0'
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: added 'Est_Examples=1.0' to env file
===== end of triotrain/model_training/tutorial/estimate.py @ 2023-06-29 11:25:28 =====
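The suggested `--est-examples` value in that log comes from a simple ratio; below is a minimal sketch of the rounding step, assuming the counts reported above (not TrioTrain's exact code):

```bash
# Reproduce the ratio and the round-up-to-nearest-0.5 step from the log above.
awk 'BEGIN {
  pass_variants = 114332        # PASS variants counted in the demo Truth VCF
  examples_made = 75902         # labeled examples produced for chr21
  ratio = examples_made / pass_variants
  est = int(ratio * 2) / 2
  if (est < ratio) est += 0.5   # round UP to the nearest 0.5
  printf "calculated examples per variant: %.3f\n", ratio   # 0.664
  printf "suggested --est-examples: %.1f\n", est            # 1.0
}'
```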
9. Run TrioTrain with a Human Trio
Now that we know how to tailor TrioTrain for a non-bovine species (human), we can move forward with starting the re-training pipeline.
Complete the GIAB TrioTrain run by editing the following command to include your new SLURM config file:
# You can add the --dry-run flag to this command to confirm the TrioTrain pipeline runs smoothly prior to submitting jobs to the queue
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
python3 triotrain/run_trio_train.py \
-g Father \
--unmapped-reads chrUn \
--est-examples 1 \
-m triotrain/model_training/tutorial/GIAB.Human_tutorial_metadata.csv \
-n GIAB_Trio \
--num-tests 3 \
--custom-checkpoint triotrain/model_training/pretrained_models/v1.4.0_withIS_withAF/wgs_af.model.ckpt \
--output ../TUTORIAL \
--benchmark \
-r </path/to/new_file.json>
Check | Baseline WGS.AF:
Check | Father:
Check | Mother:
Merge Results | Per-Iteration
Each `Test#.total.metrics.csv` output file should contain 57 rows and 2 columns. The metrics within are the raw and proportional performance metrics from hap.py. After all `convert_happy` jobs complete, we will separately merge the results from the baseline iteration and from both training iterations completed during the GIAB tutorial:
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
for start_i in $(seq 0 1); do
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: merging processed results from hap.py for GIAB run#${start_i}"
python3 triotrain/summarize/merge_results.py --env ../TUTORIAL/GIAB_Trio/envs/run${start_i}.env -g Father -m triotrain/summarize/data/tutorial_metadata.csv
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: finished merging processed results from hap.py for GIAB run#${start_i}"
done
Check | Baseline WGS.AF
Check | GIAB Trio1
Clean Up Directories | Per-Iteration
Running TrioTrain produces a large volume of temporary files. Run the following at the command line to free up space:
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
for start_i in $(seq 0 1); do
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: removing temp files for GIAB run#${start_i}"
python3 triotrain/model_training/slurm/clean_tmp.py --env ../TUTORIAL/GIAB_Trio/envs/run${start_i}.env
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: finished removing temp files for GIAB run#${start_i}"
done
Check | Baseline WGS.AF
Check | GIAB Trio1
du -h ../TUTORIAL/GIAB_Trio
Expected Outputs:
141M ../TUTORIAL/GIAB_Trio/Human_tutorial/logs
402M ../TUTORIAL/GIAB_Trio/Human_tutorial/compare_Mother
1.1M ../TUTORIAL/GIAB_Trio/Human_tutorial/train_Mother/eval_Child
23G ../TUTORIAL/GIAB_Trio/Human_tutorial/train_Mother
394M ../TUTORIAL/GIAB_Trio/Human_tutorial/test_Father
298M ../TUTORIAL/GIAB_Trio/Human_tutorial/compare_Father
1.1M ../TUTORIAL/GIAB_Trio/Human_tutorial/train_Father/eval_Child
23G ../TUTORIAL/GIAB_Trio/Human_tutorial/train_Father
243G ../TUTORIAL/GIAB_Trio/Human_tutorial/examples
391M ../TUTORIAL/GIAB_Trio/Human_tutorial/test_Mother
3.1M ../TUTORIAL/GIAB_Trio/Human_tutorial/jobs
290G ../TUTORIAL/GIAB_Trio/Human_tutorial
100K ../TUTORIAL/GIAB_Trio/summary
50K ../TUTORIAL/GIAB_Trio/envs
290G ../TUTORIAL/GIAB_Trio/