Human GIAB Tutorial
This tutorial completes one round of re-training with a Human trio from GIAB.
Warning | Storage Needs
Completing this tutorial will produce roughly 2 TB of intermediate and output data. Ensure you have sufficient space before proceeding!
1. Confirm Successful Configuration
Check | Installation
deepvariant_1.4.0-gpu.sif docs hap.py_v0.3.12.sif miniconda_envs README.md triotrain
deepvariant_1.4.0.sif errors LICENSE mkdocs.yml scripts
- triotrain/ directory contains Python modules for the DV-TrioTrain package
- scripts/ directory contains Bash helper scripts and functions that can be used as templates
2. Activate Environment
Repeat the first two steps of Configuration:
4. Download Pre-Trained Models
Warning | Download Size
Running a local copy of a container requires us to create a local copy of the model.ckpt
files from v1.4.0 of DeepVariant. These checkpoints are the human-trained models produced by Google Genomics Health Group. Details about published models compatible with TrioTrain can be found here.
We need to download (2) model checkpoints:
- the default human model
- the WGS.AF human model
Example | download_models.sh
#!/bin/bash
# scripts/setup/download_models.sh
"""
NOTES: These model ckpts are only required for warm-starting re-training, and are not used with 'run_deepvariant'
"""
echo -e "=== scripts/setup/download_models.sh > start $(date)"
##======= Downloading default WGS model (without AF channel) ===================##
# This model allows for ONLY 7 channels in the example images
# Meaning, it is NOT compatible with examples built with the allele frequency channel!
# To view in web-browser, execute this to obtain a valid link:
# echo "https://console.cloud.google.com/storage/browser/deepvariant/models/DeepVariant/${BIN_VERSION_DV}/DeepVariant-inception_v3-${BIN_VERSION_DV}+data-wgs_standard"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Downloading 7-channel WGS model checkpoint (with InsertSize, but without AlleleFreq) - DeepVariant v${BIN_VERSION_DV}"
NO_AF_MODEL_BUCKET="https://storage.googleapis.com/deepvariant/models/DeepVariant/${BIN_VERSION_DV}/DeepVariant-inception_v3-${BIN_VERSION_DV}+data-wgs_standard"
# Use the same file naming convention and enable restarting of any interrupted downloads
curl --create-dirs --continue-at - "${NO_AF_MODEL_BUCKET}/model.ckpt.{data-00000-of-00001,index,example_info.json,meta}" --output "triotrain/model_training/pretrained_models/v${BIN_VERSION_DV}_withIS_noAF/model.ckpt.#1"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done - Downloading 7-channel WGS model checkpoint (with InsertSize, but without AlleleFreq)"
##========================================================================##
##======= Downloading custom WGS model (with Allele Frequency channel) ======================##
# This model allows for 8 channels in the example images
# Meaning, it IS compatible with examples built with the allele frequency channel!
# To view in web-browser, execute this to obtain a valid link:
# https://console.cloud.google.com/storage/browser/brain-genomics-public/research/allele_frequency/pretrained_model_WGS/1.4.0;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Downloading 8-channel WGS model checkpoint (with InsertSize, with AlleleFreq) - DeepVariant v${BIN_VERSION_DV}"
AF_MODEL_BUCKET="https://storage.googleapis.com/brain-genomics-public/research/allele_frequency/pretrained_model_WGS/${BIN_VERSION_DV}"
# Use the same file naming convention and enable restarting of any interrupted downloads
curl --create-dirs --continue-at - "${AF_MODEL_BUCKET}/model.ckpt.{data-00000-of-00001,index,meta,example_info.json}" --output "triotrain/model_training/pretrained_models/v${BIN_VERSION_DV}_withIS_withAF/wgs_af.model.ckpt.#1"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done - Downloading 8-channel WGS model checkpoint (with InsertSize, with AlleleFreq)"
##========================================================================##
echo -e "=== scripts/setup/download_models.sh > end $(date)"
Check | Default human model
Check | WGS.AF human model
5. Download Raw Data
Warning | Download Size
We are using two Human trios from the v4.2.1 GIAB benchmarking data.
TrioNumber | TrioName | CommonID | SampleID | Relationship |
---|---|---|---|---|
1 | AshkenaziJew | HG002 | NA24385 | Son |
1 | AshkenaziJew | HG003 | NA24149 | Father |
1 | AshkenaziJew | HG004 | NA24143 | Mother |
2 | HanChinese | HG005 | NA24631 | Son |
2 | HanChinese | HG006 | NA24694 | Father |
2 | HanChinese | HG007 | NA24695 | Mother |
We need (5) types of raw data:
Number | Description | Extension |
---|---|---|
1. | the GRCh38 reference genome | .fasta, .fai |
2. | 1kGP Population Allele Frequency, with index file | .vcf.gz, .vcf.gz.tbi |
3. | benchmarking files: per-genome truth callsets, with index files | .vcf.gz, .vcf.gz.tbi |
3. | benchmarking files: per-genome truth regions file | .bed |
4. | sample metadata: checksums file | .md5 |
4. | sample metadata: index files | .txt |
5. | the aligned reads files, with index file | .bam, .bai |
Example | download_GIAB.sh
#!/bin/bash
# scripts/setup/download_GIAB.sh
echo -e "=== scripts/setup/download_GIAB.sh > start $(date)"
# NOTES --------------------------------------------------------------#
# The DV developers use the following data in their walk-through docs:
# 1) Reference Genome Version: GRCh38_no_alt_analysis_set.fasta
# 2) Benchmark Sample: HG003 [AshkenazimTrio-Father], typically only Chr20.
# 3) PopVCF (without genotypes): v3_missing2ref
# However, for a direct comparison between our cattle model and previous models, we need to calculate Mendelian Inheritance Errors (MIE). Therefore, we will be downloading the parents in this trio. We'll also need to download the reference PopVCF used to build the humanWGS_AF model.
#-------------------------------------------------------------------#
# GIAB Trio1 #
#-------------------------------------------------------------------#
# Known as the Ashkenazi Jew Trio
# From Personal Genome Project
# AJ Son = HG002_NA24385_son
# AJ Father = HG003_NA24149_father
# AJ Mother = HG004_NA24143_mother
#-------------------------------------------------------------------#
# GIAB Trio2 #
#-------------------------------------------------------------------#
# Known as the Han Chinese Trio
# From Personal Genome Project
# Chinese Son = HG005_NA24631_son
# Chinese Father = HG006_NA24694_father
# Chinese Mother = HG007_NA24695_mother
# A preprint describing these calls is at https://doi.org/10.1101/2020.07.24.212712. The paper(s) above can be cited for use of the benchmark, and please cite http://www.nature.com/articles/sdata201625 (doi:10.1038/sdata.2016.25) when using the corresponding sequencing data.
cd triotrain/variant_calling/data/GIAB
##===================================================================
## General Files
##===================================================================
## --------------- Download Reference Genome ------------------------
## NOTES ----
## Files Downloaded Include:
## 1. GRCh38 reference genome [FASTA]
## 2. GRCh38 index [FAI]
# Create the output dir
install --directory --verbose reference
REFDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
REFFILE=GCA_000001405.15_GRCh38_no_alt_analysis_set
if [ ! -f ./reference/md5checksums.txt ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading GRCh38 checksum now..."
curl -s --continue-at - ${REFDIR}/md5checksums.txt -o ./reference/md5checksums.txt
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | './reference/md5checksums.txt'"
fi
# Define the file extensions to be downloaded:
declare -a Ext=(".fna.gz" ".fna.fai")
if [ ! -f "./reference/GRCh38_no_alt_analysis_set.fasta" ]; then
for e in ${Ext[@]}; do
if [ ! -f "./reference/${REFFILE}${e}" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${REFFILE}${e} now..."
curl -s --continue-at - "${REFDIR}/${REFFILE}${e}" -o "./reference/${REFFILE}${e}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: checking ${REFFILE}${e} for corruption..."
check_sum=$(cat ./reference/md5checksums.txt | grep "${REFFILE}${e}")
old_path="./"
new_path="./reference/"
valid_check_sum="${check_sum/$old_path/$new_path}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: $(echo $valid_check_sum | md5sum -c)"
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${REFFILE}${e}'"
fi
done
fi
if [ -f "./reference/${REFFILE}.fna.gz" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Unzipping GRCh38 and re-naming reference files now..."
gunzip -c "./reference/${REFFILE}.fna.gz" > "./reference/GRCh38_no_alt_analysis_set.fasta"
rm "./reference/${REFFILE}.fna.gz"
mv "./reference/${REFFILE}.fna.fai" ./reference/GRCh38_no_alt_analysis_set.fasta.fai
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: reference re-named already... SKIPPING AHEAD"
fi
## --------------- Download Population VCF --------------------------
## NOTES ----
## Files Downloaded Include:
## 1. One-thousand genomes (1kGP) allele frequencies [VCF]
## 2. Allele-frequencies index [TBI]
## These data are ~940GiB to download completely.
## THERE ARE TWO PAGES OF FILES!
## The newest version of WGS_AF model can be viewed on GCP here: https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
# Create the output dir
mkdir -p allele_freq
# Define where to get the AF data
AF_DIR="https://storage.googleapis.com/brain-genomics-public/research/cohort/1KGP/cohort_dv_glnexus_opt/v3_missing2ref"
for i in {{1..22},X,Y}
do
if [ ! -f ./allele_freq/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading chr${i} PopVCF..."
curl -s --continue-at - ${AF_DIR}/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz -o ./allele_freq/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | 'chr${i} PopVCF'"
fi
if [ ! -f ./allele_freq/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz.tbi ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading chr${i} index ..."
curl -s --continue-at - ${AF_DIR}/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz.tbi -o ./allele_freq/cohort-chr${i}.release_missing2ref.no_calls.vcf.gz.tbi
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | 'chr${i} index'"
fi
done
# Merge the chr AF into a genome-wide AF
if [ ! -f "./allele_freq/PopVCF.merge.list" ]; then
for i in {{1..22},X,Y}
do
echo "././triotrain/variant_calling/data/GIAB/allele_freq/cohort-chr$i.release_missing2ref.no_calls.vcf.gz" >> ./allele_freq/PopVCF.merge.list
done
fi
if [ ! -f "./allele_freq/concat_PopVCFs.sh" ]; then
echo -e "source ./scripts/setup/modules.sh
bcftools concat --file-list ./triotrain/variant_calling/data/GIAB/allele_freq/PopVCF.merge.list -Oz -o ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz
bcftools index ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz" > ./allele_freq/concat_PopVCFs.sh
fi
##===================================================================
## Trio-Specific Files
##===================================================================
## --------------- Download GIAB Truth Files ------------------------
## NOTES ----
## Currently using v4.2.1
## Files Downloaded Include:
## 1. benchmarking region files [BED]
## 2. benchmarking genotypes [VCF]
## 3. benchmarking genotype index [TBI]
## The benchmarking files from NIST can be found here: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/
# Create output dir
mkdir -p benchmark
# Define where to get the truth data
TRUTHDIR=https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release
benchmark_version="_GRCh38_1_22_v4.2.1_benchmark"
bench_ext=("_noinconsistent.bed" ".vcf.gz" ".vcf.gz.tbi")
#-------------------------------------------------------------------#
# GIAB Trio1 #
#-------------------------------------------------------------------#
declare -A Trio1=(["HG002"]="HG002_NA24385_son" ["HG003"]="HG003_NA24149_father" ["HG004"]="HG004_NA24143_mother")
for t in ${!Trio1[@]}; do
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t}=${Trio1[${t}]} benchmarking files now..."
for e in ${!bench_ext[@]}; do
if [ ! -f "./benchmark/${t}${benchmark_version}${bench_ext[${e}]}" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t}${benchmark_version}${bench_ext[${e}]}..."
curl --continue-at - ${TRUTHDIR}/AshkenazimTrio/${Trio1[${t}]}/NISTv4.2.1/GRCh38/${t}${benchmark_version}${bench_ext[${e}]} -o ./benchmark/${t}${benchmark_version}${bench_ext[${e}]}
# No MD5 listed on NIST FTP SITE!
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}${benchmark_version}${bench_ext[${e}]}'"
fi
done
# Download the README
if [ ! -f "./benchmark/${t}_README_v4.2.1.txt" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t} readme..."
curl -s --continue-at - ${TRUTHDIR}/AshkenazimTrio/${Trio1[${t}]}/NISTv4.2.1/README_v4.2.1.txt -o ./benchmark/${t}_README_v4.2.1.txt
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}_README_v4.2.1.txt'"
fi
done
#-------------------------------------------------------------------#
# GIAB Trio2 #
#-------------------------------------------------------------------#
bench_ext=(".bed" ".vcf.gz" ".vcf.gz.tbi")
declare -A Trio2=(["HG005"]="HG005_NA24631_son" ["HG006"]="HG006_NA24694_father" ["HG007"]="HG007_NA24695_mother")
for t in ${!Trio2[@]}; do
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t}=${Trio2[${t}]} benchmarking files now..."
# Download the MD5
if [ ! -f "./benchmark/${t}_benchmark.md5" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t} MD5..."
curl -s --continue-at - ${TRUTHDIR}/ChineseTrio/${Trio2[${t}]}/NISTv4.2.1/md5.in -o ./benchmark/${t}_benchmark.md5
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}_benchmark.md5'"
fi
for e in ${!bench_ext[@]}; do
if [ ! -f "./benchmark/${t}${benchmark_version}${bench_ext[${e}]}" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t}${benchmark_version}${bench_ext[${e}]}..."
curl --continue-at - ${TRUTHDIR}/ChineseTrio/${Trio2[${t}]}/NISTv4.2.1/GRCh38/${t}${benchmark_version}${bench_ext[${e}]} -o ./benchmark/${t}${benchmark_version}${bench_ext[${e}]}
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: checking ./benchmark/${t}${benchmark_version}${bench_ext[${e}]} for corruption..."
check_sum=$(cat ./benchmark/${t}_benchmark.md5 | grep "${t}${benchmark_version}${bench_ext[${e}]}" | head -n1)
old_path="GRCh38/"
new_path="./benchmark/"
valid_check_sum="${check_sum/$old_path/$new_path}"
msg=$(echo $valid_check_sum | md5sum -c -)
if [ $? -eq 0 ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: ${msg}"
else
error_exit "${msg}" tracker_file.txt
fi
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}${benchmark_version}${bench_ext[${e}]}'"
fi
done
# Download the README
if [ ! -f "./benchmark/${t}_README_v4.2.1.txt" ]; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading ${t} readme..."
curl -s --continue-at - ${TRUTHDIR}/ChineseTrio/${Trio2[${t}]}/NISTv4.2.1/README_v4.2.1.txt -o ./benchmark/${t}_README_v4.2.1.txt
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | '${t}_README_v4.2.1.txt'"
fi
done
## --------------- Download GIAB Aligned Reads ----------------------
## NOTES ----
## Files Downloaded Include:
## 1. NIST Illumina 2x250bp sequence reads aligned to GRCh38 [BAM]
## 2. Sequence reads index [BAI]
## 3. MD5 checksum [MD5]
## Indexes for various sequencing data and aligned files can be found here: https://github.com/genome-in-a-bottle/giab_data_indexes
## Function for downloading BAM + BAI files -----------
download () {
file_input=$1
# split the file_path into an array
IFS='_' read -r -a result <<< "$file_input"
trio_name=${result[0]}
if [ ! -f "./bam/${trio_name}.download" ]; then
# echo "source ./scripts/setup/modules.sh" > ./bam/$trio_name.download
# Command Translation:
# For each line after the header row:
# First, use the first column in each row,
# from the .txt file input, and
# extract the BAM file path end, called "filename", using
# the shell cmd called "basename", then
# close that shell command
# Next, if "filename" containes the correct reference genome name,
# print a command to download that file
# with timestamp wrappers before/after
# Next, process the MD5 flexibly...
# For HG002, use the corrected MD5 sum (output)
# For all others, use the original MD5 sum ($2)
# Next, write the correct MD5 + filepath to a new file, and
# print the command for checking the MD5 sum
# Next, use the 3rd column in each row,
# from the .txt file input, and
# extract the BAI file path end, called "secondfile", using
# the shell cmd called "basename", then
# close that shell command
# Next, if "filename" containes the correct reference genome name,
# print a command to download that file
# with timestamp wrappers before/after
# Next, process the MD5 flexibly...
# For HG002, use the corrected MD5 sum (output)
# For all others, use the original MD5 sum ($2)
# Next, write the correct MD5 + filepath to a new file, and
# print the command for checking the MD5 sum.
awk -F '\t' 'NR!=1 {
cmd = "basename " $1
cmd | getline filename
close(cmd)
}
index(filename, "GRCh38") {
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: downloading ["filename"] now..."
print "curl -o ./triotrain/variant_calling/data/GIAB/bam/"filename" -C - "$1" --keepalive-time 300"
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: downloading ["filename"]... done"
}
filename ~ /HG002.GRCh38/ {
cmd = "grep -Fw "filename" ./bam/HG002_corrected_md5sums.feb19upload.txt"
cmd | getline result
close(cmd)
split(result, output, " ")
print "echo "output[1]"\t./triotrain/variant_calling/data/GIAB/bam/"filename" > ./triotrain/variant_calling/data/GIAB/bam/"filename".md5"
}
filename !~ /HG002/ && filename ~ /GRCh38/ {
print "echo "$2"\t./triotrain/variant_calling/data/GIAB/bam/"filename" > ./triotrain/variant_calling/data/GIAB/bam/"filename".md5"
}
index(filename, "GRCh38") {
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: checking ["filename"] for corruption..."
print "md5sum -c ./triotrain/variant_calling/data/GIAB/bam/"filename".md5"
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: checking ["filename"] for corruption... done"
print "echo ----------------------------------------------------"
}
{
cmd = "basename " $3
cmd | getline secondfile
close(cmd)
}
index(secondfile, "GRCh38") {
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: downloading ["secondfile"] now..."
print "curl -o ./triotrain/variant_calling/data/GIAB/bam/"secondfile" -C - "$3
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: downloading ["secondfile"] ... done"
}
secondfile ~ /HG002.GRCh38/ {
cmd = "grep -Fw "secondfile" ./bam/HG002_corrected_md5sums.feb19upload.txt"
cmd | getline result
close(cmd)
split(result, output, " ")
print "echo "output[1]"\t./triotrain/variant_calling/data/GIAB/bam/"secondfile" > ./triotrain/variant_calling/data/GIAB/bam/"secondfile".md5"
}
secondfile !~ /HG002/ && secondfile ~ /GRCh38/ {
print "echo "$4"\t./triotrain/variant_calling/data/GIAB/bam/"secondfile" > ./triotrain/variant_calling/data/GIAB/bam/"secondfile".md5"
}
index(secondfile, "GRCh38") {
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: checking ["secondfile"] for corruption..."
print "md5sum -c ./triotrain/variant_calling/data/GIAB/bam/"secondfile".md5"
print "echo $(date \"+%Y-%m-%d %H:%M:%S\") INFO: checking ["secondfile"] for corruption... done"
print "echo ===================================================="
}' "./bam/${file_input}" >> "./bam/${trio_name}.download"
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | './triotrain/variant_calling/data/GIAB/bam/${trio_name}.download'"
fi
}
## Function for calculating AVG COVERAGE -----------
calc_cov () {
file_input=$1
# split the file_path into an array
IFS='_' read -r -a result <<< "$file_input"
trio_name=${result[0]}
if [ ! -f "./bam/${trio_name}.run" ]; then
# echo "source ./scripts/setup/modules.sh" > ./bam/${trio_name}.run
# Command Translation:
# For each line after the header row:
# First, use the first column in each row,
# from the .txt file input, and
# extract the BAM file path end, called "filename", using
# the shell cmd called "basename", then
# close that shell command
# Next, if "filename" containes the correct reference genome name,
# print a if/then statement to test if job file to run coverage
# calculations exists, and create it if it does not.
# print a if/then statement to test if output from avg_coverage
# calculations exists, and create it if it does not.
awk -F '\t' 'NR!=1 {
cmd = "basename " $1
cmd | getline filename
close(cmd)
}
index(filename, "GRCh38") {
cmd = "basename " filename " .bam"
cmd | getline label
close(cmd)
print "if [ ! -f ./triotrain/variant_calling/data/GIAB/bam/"label".coverage.out ]; then"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: Calculating Coverage for ["label"] now...\""
print " samtools coverage ./triotrain/variant_calling/data/GIAB/bam/"filename" --output ./triotrain/variant_calling/data/GIAB/bam/"label".coverage.out"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: Calculating Coverage for ["label"] now... done\""
print "else"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: bash file found | ./triotrain/variant_calling/data/GIAB/bam/"label".coverage.out\""
print "fi"
print "echo ----------------------------------------------------"
split(label, output, ".")
print "if [ ! -f ./triotrain/variant_calling/data/GIAB/bam/"output[1]".avg_coverage.out ]; then"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: Calculating Average Coverage for ["output[1]"] now...\""
print " awk '\''{ total += $6; count++ } END { print \""label" AVERAGE COVERAGE = \" total/count }'\'' ./triotrain/variant_calling/data/GIAB/bam/"label".coverage.out > ./triotrain/variant_calling/data/GIAB/bam/"output[1]".avg_coverage.out"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: Calculating Average Coverage for ["output[1]"] now... done\""
print "else"
print " echo \"$(date \"+%Y-%m-%d %H:%M:%S\") INFO: output file found | ./triotrain/variant_calling/data/GIAB/bam/"output[1]".avg_coverage.out\""
print "fi"
print "echo ===================================================="
} ' "./bam/${file_input}" >> "./bam/${trio_name}.run"
# if running on an interactive session with lots of memory,
# uncomment the line below
# . ./bam/$1.run
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: file found | './triotrain/variant_calling/data/GIAB/bam/${trio_name}.run'"
fi
}
# output dir
mkdir -p bam
#-------------------------------------------------------------------#
# GIAB Trio1 #
#-------------------------------------------------------------------#
# BAM and BAI with MD5 checksums can be found here: https://github.com/genome-in-a-bottle/giab_data_indexes/blob/c4d3b95c2ebf14c175151e4723f82e8980722e90/AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_novoalign_GRCh37_GRCh38_NHGRI_06062016
## CHECKSUMS -----------
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading AJ Trio checksum now..."
curl -s -C - https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_novoalign_GRCh37_GRCh38_NHGRI_06062016 -o ./bam/AJtrio_Illumina_2x250bps_novoaligns_GRCh37_GRCh38.txt
# NOTE: 07 JUNE 2023:
# The checksum for HG002 does not match the ^ above MD5,
# therefore, download the correct one.
curl -s -C - https://ftp.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002_corrected_md5sums.feb19upload.txt -o ./bam/HG002_corrected_md5sums.feb19upload.txt
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading AJ Trio checksum now... done"
download "AJtrio_Illumina_2x250bps_novoaligns_GRCh37_GRCh38.txt"
# calc_cov "AJtrio_Illumina_2x250bps_novoaligns_GRCh37_GRCh38.txt"
#-------------------------------------------------------------------#
# GIAB Trio2 #
#-------------------------------------------------------------------#
# BAM and BAI with MD5 checksums can be found here: https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/ChineseTrio/alignment.index.ChineseTrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38_NHGRI_04062016
## CHECKSUMS -----------
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading HC Trio checksum now..."
curl -s -C - https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.ChineseTrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38_NHGRI_04062016 -o ./bam/HCtrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38.txt
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: downloading HC Trio checksum now... done"
download "HCtrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38.txt"
# calc_cov "HCtrio_Illumina300X100X_wgs_novoalign_GRCh37_GRCh38.txt"
echo -e "=== scripts/setup/download_humanGIABdata.sh > end $(date)"
Check | Data Directories
Check | allele_freq/
cohort-chr10.release_missing2ref.no_calls.vcf.gz cohort-chr21.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr10.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr22.release_missing2ref.no_calls.vcf.gz
cohort-chr11.release_missing2ref.no_calls.vcf.gz cohort-chr22.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr11.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr2.release_missing2ref.no_calls.vcf.gz
cohort-chr12.release_missing2ref.no_calls.vcf.gz cohort-chr2.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr12.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr3.release_missing2ref.no_calls.vcf.gz
cohort-chr13.release_missing2ref.no_calls.vcf.gz cohort-chr3.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr13.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr4.release_missing2ref.no_calls.vcf.gz
cohort-chr14.release_missing2ref.no_calls.vcf.gz cohort-chr4.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr14.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr5.release_missing2ref.no_calls.vcf.gz
cohort-chr15.release_missing2ref.no_calls.vcf.gz cohort-chr5.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr15.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr6.release_missing2ref.no_calls.vcf.gz
cohort-chr16.release_missing2ref.no_calls.vcf.gz cohort-chr6.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr16.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr7.release_missing2ref.no_calls.vcf.gz
cohort-chr17.release_missing2ref.no_calls.vcf.gz cohort-chr7.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr17.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr8.release_missing2ref.no_calls.vcf.gz
cohort-chr18.release_missing2ref.no_calls.vcf.gz cohort-chr8.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr18.release_missing2ref.no_calls.vcf.gz.tbi cohort-chr9.release_missing2ref.no_calls.vcf.gz
cohort-chr19.release_missing2ref.no_calls.vcf.gz cohort-chr9.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr19.release_missing2ref.no_calls.vcf.gz.tbi cohort-chrX.release_missing2ref.no_calls.vcf.gz
cohort-chr1.release_missing2ref.no_calls.vcf.gz cohort-chrX.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr1.release_missing2ref.no_calls.vcf.gz.tbi cohort-chrY.release_missing2ref.no_calls.vcf.gz
cohort-chr20.release_missing2ref.no_calls.vcf.gz cohort-chrY.release_missing2ref.no_calls.vcf.gz.tbi
cohort-chr20.release_missing2ref.no_calls.vcf.gz.tbi concat_PopVCFs.sh
cohort-chr21.release_missing2ref.no_calls.vcf.gz PopVCF.merge.list
Check | bam/
Check | benchmark/
HG002_GRCh38_1_22_v4.2.1_benchmark.bed HG005_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz HG005_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi HG005_README_v4.2.1.txt
HG002_README_v4.2.1.txt HG006_benchmark.md5
HG003_GRCh38_1_22_v4.2.1_benchmark.bed HG006_GRCh38_1_22_v4.2.1_benchmark.bed
HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz HG006_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi HG006_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG003_README_v4.2.1.txt HG006_README_v4.2.1.txt
HG004_GRCh38_1_22_v4.2.1_benchmark.bed HG007_benchmark.md5
HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz HG007_GRCh38_1_22_v4.2.1_benchmark.bed
HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi HG007_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG004_README_v4.2.1.txt HG007_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG005_benchmark.md5 HG007_README_v4.2.1.txt
HG005_GRCh38_1_22_v4.2.1_benchmark.bed
Check | reference/
Check | Processing Raw Data Scripts
In addition to the raw data, download_GIAB.sh also creates (3) bash scripts to process the raw data into the formats expected by the tutorial:
- concat_PopVCFs.sh + input file PopVCF.merge.list — merges the per-chromosome PopVCFs into a single file
- AJtrio.download — downloads GIAB Trio1
- HCtrio.download — downloads GIAB Trio2
6. Process Raw Data
Note
These scripts can either be wrapped with SBATCH, or run interactively at the command line if you have enough memory. However, each script can take a while to complete, particularly the .download scripts (1hr+).
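For example, a hedged sketch of wrapping one of the generated scripts with SBATCH (the partition, account, and resource values below are placeholders for your cluster):

# Hypothetical SBATCH wrapper for one of the generated download scripts
sbatch --partition=<partition_name> --account=<account_name> \
  --time=0-08:00:00 --mem=4G --job-name=AJtrio_download \
  --wrap="bash triotrain/variant_calling/data/GIAB/bam/AJtrio.download"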
a. Merge the PopVCF
Warning | Download Size
We need to create a single, genome-wide PopVCF from the raw per-chromosome PopVCF files. The per-chr Population VCFs were produced by the One Thousand Genomes Project (1kGP) and used to train the Human WGS.AF model. You can view the raw files on Google Cloud Platform.
bash triotrain/variant_calling/data/GIAB/allele_freq/concat_PopVCFs.sh
Example | concat_PopVCFs.sh
source ./scripts/setup/modules.sh
bcftools concat --file-list ./triotrain/variant_calling/data/GIAB/allele_freq/PopVCF.merge.list -Oz -o ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz
bcftools index ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz
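Once the concatenation finishes, a quick, hedged sanity check of the merged PopVCF (bcftools index -s reports per-contig record counts from the index):

# Expect one line per contig (chr1-chr22, chrX, chrY) in the merged PopVCF
bcftools index -s ./triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz | cut -f1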
Check Intermediate Data | allele_freq/
b. Download Sequence Data
Warning | Download Size
Warning
Given the large file sizes, the NIST FTP server can run slowly, causing curl to time out. You may need to run these scripts repeatedly until all data are transferred completely.
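If you would rather not re-launch the script by hand, a minimal retry sketch (the log file name here is hypothetical; the .download scripts themselves are generated in step 5 and run below):

# Re-run the generated script a few times; curl resumes partial files via '-C -',
# and the embedded md5sum checks report 'FAILED' for any corrupted file.
for attempt in {1..5}; do
  bash triotrain/variant_calling/data/GIAB/bam/AJtrio.download | tee aj_download.log
  if ! grep -q "FAILED" aj_download.log; then
    echo "all AJ Trio downloads verified on attempt ${attempt}"
    break
  fi
done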
We need to download the large sequence data files, and confirm they are not corrupted by checking the MD5 checksums. These BAM/BAI
files originate from the GIAB FTP site. An index of GIAB data created with these samples can be found on GitHub.
bash triotrain/variant_calling/data/GIAB/bam/AJtrio.download
Example | AJtrio.download
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG002.GRCh38.2x250.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.GRCh38.2x250.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG002.GRCh38.2x250.bam]... done
echo 56c30eaa4e2f25ff0ac80ef30e09d78e ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam > ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG002.GRCh38.2x250.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG002.GRCh38.2x250.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG002.GRCh38.2x250.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.GRCh38.2x250.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG002.GRCh38.2x250.bam.bai] ... done
echo a3c2b449df6509ca83fbd3fea22b9aee ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG002.GRCh38.2x250.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG002.GRCh38.2x250.bam.bai] for corruption... done
echo ====================================================
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG003.GRCh38.2x250.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG003_NA24149_father/NIST_Illumina_2x250bps/novoalign_bams/HG003.GRCh38.2x250.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG003.GRCh38.2x250.bam]... done
echo 1b1ad2e8f5eba062086890cceb9671ab ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam > ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG003.GRCh38.2x250.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG003.GRCh38.2x250.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG003.GRCh38.2x250.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG003_NA24149_father/NIST_Illumina_2x250bps/novoalign_bams/HG003.GRCh38.2x250.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG003.GRCh38.2x250.bam.bai] ... done
echo a2697b3ebb9b5ac6d9a27734406a211c ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG003.GRCh38.2x250.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG003.GRCh38.2x250.bam.bai] for corruption... done
echo ====================================================
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG004.GRCh38.2x250.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG004_NA24143_mother/NIST_Illumina_2x250bps/novoalign_bams/HG004.GRCh38.2x250.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG004.GRCh38.2x250.bam]... done
echo e084422a267170b550c9704ad61db9b2 ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam > ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG004.GRCh38.2x250.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG004.GRCh38.2x250.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG004.GRCh38.2x250.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG004_NA24143_mother/NIST_Illumina_2x250bps/novoalign_bams/HG004.GRCh38.2x250.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG004.GRCh38.2x250.bam.bai] ... done
echo 4827d6457b202f85482787964a1bc6c7 ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG004.GRCh38.2x250.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG004.GRCh38.2x250.bam.bai] for corruption... done
echo ====================================================
Check Intermediate Data | HG002:
Check Intermediate Data | HG003:
Check Intermediate Data | HG004:
bash triotrain/variant_calling/data/GIAB/bam/HCtrio.download
Example | HCtrio.download
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG005_NA24631_son/HG005_NA24631_son_HiSeq_300x/NHGRI_Illumina300X_Chinesetrio_novoalign_bams/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam]... done
echo 50080d14ba49462cfc6348e1fb41b5b3 ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam > ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG005_NA24631_son/HG005_NA24631_son_HiSeq_300x/NHGRI_Illumina300X_Chinesetrio_novoalign_bams/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai] ... done
echo 2d38162ba61e909d08e4ed438096174e ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam.bai] for corruption... done
echo ====================================================
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG006_NA24694-huCA017E_father/NA24694_Father_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam]... done
echo 3a7c239988862e5a4556eb8781058b1f ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam > ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG006_NA24694-huCA017E_father/NA24694_Father_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] ... done
echo 7bfe1f0988d27898befeaa4673d23a47 ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] for corruption... done
echo ====================================================
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG007_NA24695-hu38168_mother/NA24695_Mother_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam --keepalive-time 300
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam]... done
echo 3b8442105a4a45f71c3c1bda8af238dd ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam > ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam] for corruption... done
echo ----------------------------------------------------
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] now...
curl -o ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai -C - ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ChineseTrio/HG007_NA24695-hu38168_mother/NA24695_Mother_HiSeq100x/NHGRI_Illumina100X_Chinesetrio_novoalign_bams/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: downloading [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] ... done
echo 0c218cecd8cde9ad272de6d77bbd110f ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai > ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] for corruption...
md5sum -c ./triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai.md5
echo $(date "+%Y-%m-%d %H:%M:%S") INFO: checking [HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam.bai] for corruption... done
echo ====================================================
Check Intermediate Data | HG005:
Check Intermediate Data | HG006:
Check Intermediate Data | HG007:
7. Create TrioTrain Inputs
There are (3) required input files we must create before we can run TrioTrain. Complete details about all required data can be found in the TrioTrain User Guide.
a. Reference Dictionary File (.dict)
Warning
This step is specific to the Human reference genome GRCh38 since cattle-specific input files are pre-packaged with TrioTrain. If you are working with a new species, you will need to create this file for your reference genome.
We need a reference dictionary file in the same directory as the reference genome. This file defines the valid genomic coordinates for TrioTrain's region shuffling.
By default, region shuffling will only use the autosomes and the X chromosome. However, you can expand or contract the shuffling area for a particular trio by providing an existing BED file under the RegionsFile column within the metadata file (.csv).
picard CreateSequenceDictionary \
--REFERENCE ./triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.fasta \
--OUTPUT ./triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.dict \
--SPECIES human
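If picard is not available in your environment, samtools dict produces an equivalent sequence dictionary; a hedged alternative (header metadata such as the species tag may differ slightly):

# Alternative using samtools (assumes samtools is on your PATH)
samtools dict ./triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.fasta \
  -o ./triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.dict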
Check | Dictionary File
b. Metadata file (.csv)
We also need a metadata file to tell TrioTrain where to find all of the previously downloaded Human GIAB data. This file contains pedigree information and the absolute paths for file inputs. Absolute paths are required to help the Apptainer/Singularity containers identify local files. Formatting specifications for this required input can be found in the TrioTrain User Guide.
For the tutorial, we've created a helper script that automatically creates an example of this file. The script assumes the tutorial data were downloaded to the default locations described above and uses those expectations to add local paths. However, outside of the tutorial, this file is a user-created input.
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
python3 ./triotrain/model_training/tutorial/create_metadata.py
Example | create_metadata.py
#!/usr/bin/python3
"""
description: creates a tutorial metadata file from the Human GIAB data downloaded during the walk-through
example:
python3 triotrain/model_training/tutorial/create_metadata.py
"""
import os
from pathlib import Path
from sys import path
import regex
# get the absolute path to the triotrain/ dir
abs_path = Path(__file__).resolve()
module_path = str(abs_path.parent.parent.parent)
path.append(module_path)
from helpers.dictionary import add_to_dict
from helpers.files import WriteFiles
from helpers.utils import get_logger
from helpers.wrapper import Wrapper, timestamp
# Collect start time
Wrapper(__file__, "start").wrap_script(timestamp())
# Create error log
current_file = os.path.basename(__file__)
module_name = os.path.splitext(current_file)[0]
logger = get_logger(module_name)
defaults = {
"RunOrder": 1,
"RunName": "Human_tutorial",
"ChildSampleID": "NA24385",
"ChildLabID": "HG002",
"FatherSampleID": "NA24149",
"FatherLabID": "HG003",
"MotherSampleID": "NA24695",
"MotherLabID": "HG004",
"ChildSex": "M",
"RefFASTA": "/triotrain/variant_calling/data/GIAB/reference/GRCh38_no_alt_analysis_set.fasta",
"PopVCF": "/triotrain/variant_calling/data/GIAB/allele_freq/cohort.release_missing2ref.no_calls.vcf.gz",
"RegionsFile": "NA",
"ChildReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG002.GRCh38.2x250.bam",
"ChildTruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"ChildCallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed",
"FatherReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG003.GRCh38.2x250.bam",
"FatherTruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"FatherCallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed",
"MotherReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG004.GRCh38.2x250.bam",
"MotherTruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"MotherCallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed",
"Test1ReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG005.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam",
"Test1TruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG005_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"Test1CallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG005_GRCh38_1_22_v4.2.1_benchmark.bed",
"Test2ReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG006.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam",
"Test2TruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG006_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"Test2CallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG006_GRCh38_1_22_v4.2.1_benchmark.bed",
"Test3ReadsBAM": "/triotrain/variant_calling/data/GIAB/bam/HG007.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.100x.bam",
"Test3TruthVCF": "/triotrain/variant_calling/data/GIAB/benchmark/HG007_GRCh38_1_22_v4.2.1_benchmark.vcf.gz",
"Test3CallableBED": "/triotrain/variant_calling/data/GIAB/benchmark/HG007_GRCh38_1_22_v4.2.1_benchmark.bed",
}
cwd = Path.cwd()
output_dict = dict()
for k, v in defaults.items():
match = regex.search(r"\/triotrain\/variant_calling\/.*", str(v))
if match:
add_to_dict(
update_dict=output_dict,
new_key=k,
new_val=f"{cwd}{v}",
logger=logger,
logger_msg="[tutorial]",
)
else:
add_to_dict(
update_dict=output_dict,
new_key=k,
new_val=v,
logger=logger,
logger_msg="[tutorial]",
)
output_file = WriteFiles(
path_to_file=str(cwd / "triotrain" / "model_training" / "tutorial"),
file=f"GIAB.Human_tutorial_metadata.csv",
logger=logger,
logger_msg="[tutorial]",
)
output_file.check_missing()
if output_file.file_exists:
logger.info("[tutorial]: tutorial metadata file already exists... SKIPPING AHEAD")
else:
logger.info("[tutorial]: creating a tutorial metadata file now...")
output_file.add_rows(col_names=output_dict.keys(), data_dict=output_dict)
output_file.check_missing()
if output_file.file_exists:
logger.info("[tutorial]: successfully created the tutorial metadata file")
# Collect end time
Wrapper(__file__, "start").wrap_script(timestamp())
Check | Tutorial Metadata File
c. SLURM Resource Config File (.json)
The last required input we need for TrioTrain is the SLURM resource config file (.json). This file tells TrioTrain what resources to request from your HPC cluster when submitting SLURM jobs.
Example | Resource Config File
{
"make_examples": {
"partition": "hpc5,hpc6,BioCompute",
"nodes": 1,
"ntasks": 40,
"mem": 379067,
"CPUmem": 9000,
"time": "0-2:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"beam_shuffle": {
"partition": "hpc5,BioCompute",
"nodes": 1,
"ntasks": 40,
"mem": 379067,
"time": "0-2:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"re_shuffle": {
"partition": "BioCompute,Lewis",
"nodes": 1,
"ntasks": 1,
"mem": "200G",
"time": "0-10:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"train_eval": {
"partition": "gpu3",
"gres": "gpu:2",
"nodes": 1,
"ntasks": 16,
"mem": "0",
"time": "2-00:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"select_ckpt": {
"partition": "BioCompute,Lewis",
"nodes": 1,
"ntasks": 1,
"mem": 500,
"time": "0-00:30:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"call_variants": {
"partition": "hpc5,BioCompute",
"nodes": 1,
"ntasks": 40,
"mem": 379067,
"time": "2-00:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"compare_happy": {
"partition": "BioCompute,hpc5",
"nodes": 1,
"ntasks": 40,
"mem": "300G",
"time": "0-12:00:00",
"account": "biocommunity",
"email": "jakth2@mail.missouri.edu"
},
"convert_happy": {
"partition": "BioCompute,hpc3,hpc5,Lewis",
"nodes": 1,
"ntasks": 24,
"mem": "120G",
"time": "0-05:00:00",
"account": "biocommunity",
"email": "jakth2@mail.missouri.edu"
},
"show_examples": {
"partition": "BioCompute,Lewis",
"nodes": 1,
"mem": "1G",
"time": "0-02:00:00",
"account": "animalsci",
"email": "jakth2@mail.missouri.edu"
},
"summary": {
"partition": "hpc5,hpc6,Lewis",
"nodes": 1,
"ntasks": 4,
"mem": "40G",
"time": "0-10:00:00",
"account": "schnabellab",
"email": "jakth2@mail.missouri.edu"
}
}
The hardware listed above for each phase (e.g. mem, ntasks, gpus, etc.) varies, as some phases are memory intensive. These values should not be interpreted as the minimum or the optimum resources required for each phase. The MU Lewis Research Computing Cluster is heterogeneous, so the example config file requests resources that maximize the number of eligible compute nodes when running memory-intensive phases.
For the tutorial, copy the above example into a new file, then manually edit the SBATCH parameters to match your HPC cluster (i.e. changing the partition list, account, and email address). This new resource config file is passed to TrioTrain via -r </path/to/new_file.json>.
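Before handing the edited file to TrioTrain, it can help to confirm the JSON still parses; a minimal sketch, using only the Python standard library:

# Prints the parsed JSON on success, or the location of a syntax error
python3 -m json.tool </path/to/new_file.json>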
8. Run Shuffling Demo
Shuffling the labeled examples is a critical step in re-training DeepVariant because the model assumes successive training images are independent of one another. DeepVariant includes an Apache Beam pipeline that puts training examples in random genomic order. However, in our experience, getting the Python SDK for Beam, Apache Spark, and SLURM to cooperate is a dependency nightmare.
Our alternative approach, referred to as "Region Shuffling", splits the autosomes and the X chromosome into subset regions before making the labeled examples. Using the .dict file previously created for the reference, TrioTrain will automatically determine how many regions are required.
Warning
For our analyses in bovine genomes, we exclude parts of the genome from region shuffling, including:
- the Y chromosome and the Mitochondrial genome — due to limitations with the ARS-UCD1.2_Btau5.0.1Y reference genome
- the unmapped reads — due to a large number of contigs containing a small number of variants that cannot be easily distributed across 60-150+ shuffling regions
TrioTrain will automatically search the reference genome .dict file for contigs labeled Y and M/MT and will ignore them prior to defining shuffling regions.
Use the --unmapped-reads flag to provide a contig prefix pattern to exclude. The default value (NKLS) is appropriate for the ARS-UCD1.2_Btau5.0.1Y reference genome.
Note | Identifying Contig Labels
A shuffling region is defined in a non-overlapping, 0-based BED
file. The goal is to create a subset of labeled examples which are representative of the whole genome, yet small enough to be shuffled within the memory of a single compute node. After defining the shuffling regions, labeled and shuffled examples are created by running parallel SLURM jobs. In our testing with bovine genomes, constraining each region to produce ~200,000 labeled examples works well.
In our experience, numerous small jobs that run quickly minimize wall time between phases. We also prioritize a large number of regions, as this further increases the randomness during the final re_shuffle process, which randomizes the order in which regions are given to the model.
Example | Computing Power
Given ~370G of memory across 40 cores, DeepVariant will shuffle each region's examples in around 6 minutes; regions can be submitted in parallel to reduce wall time.
Note | Output Volume
A typical cattle genome with ~8 million variants results in around 65 regions, producing (65 regions x 40 cores) tfrecord output files. TrioTrain repeats this process for shuffling, resulting in another (65 regions x 40 cores) shuffled tfrecord output files. This is repeated for all 3 genomes within a trio, resulting in a large volume of individual files (65 regions x 40 cores x 2 phases, per genome).
The final step of our approach, referred to as re_shuffle, condenses the 40 tfrecord files for each region, reducing the number of files fed to DeepVariant per genome back down to 65. To further improve shuffling, we randomize the order in which regions are provided to DeepVariant.
If your species of interest produces a large volume of variants (>8 million per genome), TrioTrain can easily overload your cluster's I/O capabilities, write more than 10,000 files to a single directory, or overwhelm the SLURM scheduler by submitting thousands of jobs simultaneously. Future versions of TrioTrain will address these challenges, but proceed with caution; your cluster's SysAdmin will thank you!
This shuffling demo enables you to adjust TrioTrain's parameters for estimating how many shuffling regions to use for each genome. The two primary flags that control Region Shuffling are:

- `--max-examples` — defines the upper limit on the total number of examples to produce for each region; values greater than the default (200,000) will increase the wall time.
- `--est-examples` — defines the expected number of examples DeepVariant produces per genomic variant; the default value (1.5) is appropriate for bovine genomes, but with the human GIAB samples, we found a value of 1 to be more appropriate.
Additionally, TrioTrain bases the Region Shuffling calculations on the total number of variants found in each genome. Because this value can vary drastically across training genomes, TrioTrain uses the number of PASS variants within the corresponding TruthVCF. A larger number of truth variants results in more shuffling regions per genome.
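As a rough sketch of how those three quantities interact (TrioTrain's internal calculation may differ in detail):

```bash
# Back-of-the-envelope sketch only; not TrioTrain's exact formula.
TOTAL_PASS_VARIANTS=8000000    # PASS variants in the TruthVCF (typical cattle genome)
MAX_EXAMPLES=200000            # --max-examples (default)
EST_EXAMPLES_X10=15            # --est-examples default of 1.5, scaled by 10 for integer math

EXPECTED_EXAMPLES=$(( TOTAL_PASS_VARIANTS * EST_EXAMPLES_X10 / 10 ))
REGIONS=$(( (EXPECTED_EXAMPLES + MAX_EXAMPLES - 1) / MAX_EXAMPLES ))   # ceiling division
echo "approximate shuffling regions: ${REGIONS}"                       # ~60 for this example
```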
Before running TrioTrain on a new mammalian genome or on a new HPC cluster, we strongly recommend completing the region shuffling demo with the smallest chromosome possible, using the `--demo-chr=<region_literal>` flag.
Note | Working in Other Species
The appropriate values for `--max-examples` and `--est-examples` may vary widely across species, as the number of examples DeepVariant produces depends on several factors, including:
- genome size
- the reference genome used
- how many variants are identified in an individual genome
For example, the F1-hybrid offspring in cattle are crosses between divergent lineages, resulting in an abnormally high number of heterozygous genotypes compared to typical cattle genomes. Instead of the typical ~8 million variants per genome, these samples produced 20+ million variants per genome.
For this tutorial, complete the demo for Human genomes by editing the following command to include your new SLURM config file:
# You can add the --dry-run flag to this command to confirm the TrioTrain pipeline runs smoothly before submitting jobs to the queue
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
python3 triotrain/run_trio_train.py \
-g Father \
--unmapped-reads chrUn \
-m triotrain/model_training/tutorial/GIAB.Human_tutorial_metadata.csv \
-n demo \
--demo-mode \
--demo-chr chr21 \
--num-tests 3 \
--custom-checkpoint triotrain/model_training/pretrained_models/v1.4.0_withIS_withAF/wgs_af.model.ckpt \
--output ../TUTORIAL \
-r </path/to/new_file.json>
This demo will produce and submit the following SLURM jobs:
- `make_examples` — for both Father & Child
- `beam_shuffle` — for both Father & Child
- `re_shuffle` — for both Father & Child
- `call_variants` — for just the Father
There are (6) types of output files from running the demo:
| File Extension | Number of Files (Shards) | Genome Used |
| --- | --- | --- |
| `labeled.tfrecords*.gz` | N | Father, Child |
| `labeled.tfrecords*.gz.example_info.json` | N | Father, Child |
| `labeled.shuffled*.tfrecord.gz` | N | Father, Child |
| `labeled.shuffled.dataset_config.pbtxt` | 1 | Father, Child |
| `labeled.shuffled.merged.dataset_config.pbtxt` | 1 | Father, Child |
| `labeled.shuffled.merged.tfrecord.gz` | 1 | Father, Child |

N = number of CPUs requested in your `resources_used.json` file; shards are numbered 0 to (N - 1).
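If you want a quick sanity check that every shard was written before moving on, something like the following works; the demo examples directory shown here is assumed from the tutorial's `--output` layout and may differ on your system:

```bash
# Assumed demo output location; adjust to match your --output path.
EXAMPLES_DIR="../TUTORIAL/demo/Human_tutorial/examples"

# With the tutorial resource config, N=40, so each count below should print 40:
ls "${EXAMPLES_DIR}"/Father.chr21.labeled.tfrecords-*-of-00040.gz | wc -l
ls "${EXAMPLES_DIR}"/Father.chr21.labeled.shuffled-*-of-00040.tfrecord.gz | wc -l
```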
Expected Output | Father Shuffling:
Father.chr21.labeled.shuffled-00000-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00001-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00002-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00003-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00004-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00005-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00006-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00007-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00008-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00009-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00010-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00011-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00012-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00013-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00014-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00015-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00016-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00017-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00018-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00019-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00020-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00021-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00022-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00023-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00024-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00025-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00026-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00027-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00028-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00029-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00030-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00031-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00032-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00033-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00034-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00035-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00036-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00037-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00038-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled-00039-of-00040.tfrecord.gz
Father.chr21.labeled.shuffled.dataset_config.pbtxt
Father.chr21.labeled.tfrecords-00000-of-00040.gz
Father.chr21.labeled.tfrecords-00000-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00001-of-00040.gz
Father.chr21.labeled.tfrecords-00001-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00002-of-00040.gz
Father.chr21.labeled.tfrecords-00002-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00003-of-00040.gz
Father.chr21.labeled.tfrecords-00003-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00004-of-00040.gz
Father.chr21.labeled.tfrecords-00004-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00005-of-00040.gz
Father.chr21.labeled.tfrecords-00005-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00006-of-00040.gz
Father.chr21.labeled.tfrecords-00006-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00007-of-00040.gz
Father.chr21.labeled.tfrecords-00007-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00008-of-00040.gz
Father.chr21.labeled.tfrecords-00008-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00009-of-00040.gz
Father.chr21.labeled.tfrecords-00009-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00010-of-00040.gz
Father.chr21.labeled.tfrecords-00010-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00011-of-00040.gz
Father.chr21.labeled.tfrecords-00011-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00012-of-00040.gz
Father.chr21.labeled.tfrecords-00012-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00013-of-00040.gz
Father.chr21.labeled.tfrecords-00013-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00014-of-00040.gz
Father.chr21.labeled.tfrecords-00014-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00015-of-00040.gz
Father.chr21.labeled.tfrecords-00015-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00016-of-00040.gz
Father.chr21.labeled.tfrecords-00016-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00017-of-00040.gz
Father.chr21.labeled.tfrecords-00017-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00018-of-00040.gz
Father.chr21.labeled.tfrecords-00018-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00019-of-00040.gz
Father.chr21.labeled.tfrecords-00019-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00020-of-00040.gz
Father.chr21.labeled.tfrecords-00020-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00021-of-00040.gz
Father.chr21.labeled.tfrecords-00021-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00022-of-00040.gz
Father.chr21.labeled.tfrecords-00022-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00023-of-00040.gz
Father.chr21.labeled.tfrecords-00023-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00024-of-00040.gz
Father.chr21.labeled.tfrecords-00024-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00025-of-00040.gz
Father.chr21.labeled.tfrecords-00025-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00026-of-00040.gz
Father.chr21.labeled.tfrecords-00026-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00027-of-00040.gz
Father.chr21.labeled.tfrecords-00027-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00028-of-00040.gz
Father.chr21.labeled.tfrecords-00028-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00029-of-00040.gz
Father.chr21.labeled.tfrecords-00029-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00030-of-00040.gz
Father.chr21.labeled.tfrecords-00030-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00031-of-00040.gz
Father.chr21.labeled.tfrecords-00031-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00032-of-00040.gz
Father.chr21.labeled.tfrecords-00032-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00033-of-00040.gz
Father.chr21.labeled.tfrecords-00033-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00034-of-00040.gz
Father.chr21.labeled.tfrecords-00034-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00035-of-00040.gz
Father.chr21.labeled.tfrecords-00035-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00036-of-00040.gz
Father.chr21.labeled.tfrecords-00036-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00037-of-00040.gz
Father.chr21.labeled.tfrecords-00037-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00038-of-00040.gz
Father.chr21.labeled.tfrecords-00038-of-00040.gz.example_info.json
Father.chr21.labeled.tfrecords-00039-of-00040.gz
Father.chr21.labeled.tfrecords-00039-of-00040.gz.example_info.json
Confirm that SLURM jobs completed successfully
Expected Output | Child Shuffling:
Child.chr21.labeled.shuffled-00000-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00001-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00002-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00003-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00004-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00005-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00006-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00007-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00008-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00009-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00010-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00011-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00012-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00013-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00014-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00015-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00016-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00017-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00018-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00019-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00020-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00021-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00022-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00023-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00024-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00025-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00026-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00027-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00028-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00029-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00030-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00031-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00032-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00033-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00034-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00035-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00036-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00037-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00038-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled-00039-of-00040.tfrecord.gz
Child.chr21.labeled.shuffled.dataset_config.pbtxt
Child.chr21.labeled.tfrecords-00000-of-00040.gz
Child.chr21.labeled.tfrecords-00000-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00001-of-00040.gz
Child.chr21.labeled.tfrecords-00001-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00002-of-00040.gz
Child.chr21.labeled.tfrecords-00002-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00003-of-00040.gz
Child.chr21.labeled.tfrecords-00003-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00004-of-00040.gz
Child.chr21.labeled.tfrecords-00004-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00005-of-00040.gz
Child.chr21.labeled.tfrecords-00005-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00006-of-00040.gz
Child.chr21.labeled.tfrecords-00006-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00007-of-00040.gz
Child.chr21.labeled.tfrecords-00007-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00008-of-00040.gz
Child.chr21.labeled.tfrecords-00008-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00009-of-00040.gz
Child.chr21.labeled.tfrecords-00009-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00010-of-00040.gz
Child.chr21.labeled.tfrecords-00010-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00011-of-00040.gz
Child.chr21.labeled.tfrecords-00011-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00012-of-00040.gz
Child.chr21.labeled.tfrecords-00012-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00013-of-00040.gz
Child.chr21.labeled.tfrecords-00013-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00014-of-00040.gz
Child.chr21.labeled.tfrecords-00014-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00015-of-00040.gz
Child.chr21.labeled.tfrecords-00015-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00016-of-00040.gz
Child.chr21.labeled.tfrecords-00016-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00017-of-00040.gz
Child.chr21.labeled.tfrecords-00017-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00018-of-00040.gz
Child.chr21.labeled.tfrecords-00018-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00019-of-00040.gz
Child.chr21.labeled.tfrecords-00019-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00020-of-00040.gz
Child.chr21.labeled.tfrecords-00020-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00021-of-00040.gz
Child.chr21.labeled.tfrecords-00021-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00022-of-00040.gz
Child.chr21.labeled.tfrecords-00022-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00023-of-00040.gz
Child.chr21.labeled.tfrecords-00023-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00024-of-00040.gz
Child.chr21.labeled.tfrecords-00024-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00025-of-00040.gz
Child.chr21.labeled.tfrecords-00025-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00026-of-00040.gz
Child.chr21.labeled.tfrecords-00026-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00027-of-00040.gz
Child.chr21.labeled.tfrecords-00027-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00028-of-00040.gz
Child.chr21.labeled.tfrecords-00028-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00029-of-00040.gz
Child.chr21.labeled.tfrecords-00029-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00030-of-00040.gz
Child.chr21.labeled.tfrecords-00030-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00031-of-00040.gz
Child.chr21.labeled.tfrecords-00031-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00032-of-00040.gz
Child.chr21.labeled.tfrecords-00032-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00033-of-00040.gz
Child.chr21.labeled.tfrecords-00033-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00034-of-00040.gz
Child.chr21.labeled.tfrecords-00034-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00035-of-00040.gz
Child.chr21.labeled.tfrecords-00035-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00036-of-00040.gz
Child.chr21.labeled.tfrecords-00036-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00037-of-00040.gz
Child.chr21.labeled.tfrecords-00037-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00038-of-00040.gz
Child.chr21.labeled.tfrecords-00038-of-00040.gz.example_info.json
Child.chr21.labeled.tfrecords-00039-of-00040.gz
Child.chr21.labeled.tfrecords-00039-of-00040.gz.example_info.json
Confirm that SLURM jobs completed successfully
Expected Output | Benchmarking:
less ../TUTORIAL/demo/summary/Human_tutorial.SLURM.job_numbers.csv
AnalysisName,RunName,Parent,Phase,JobList
Baseline-v1.4.0,Human_tutorial,Father,make_examples,27669522
Baseline-v1.4.0,Human_tutorial,Father,beam_shuffle,27669523
Baseline-v1.4.0,Human_tutorial,Father,make_examples,27669524
Baseline-v1.4.0,Human_tutorial,Father,beam_shuffle,27669525
# The JobList column will differ based on SLURM job numbers
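One way to confirm these jobs finished cleanly is to query SLURM accounting with the IDs from the JobList column; `sacct` is standard SLURM tooling, though the job IDs below are just the example values from the CSV above:

```bash
# Replace the IDs with the values from your own JobList column.
sacct --jobs=27669522,27669523,27669524,27669525 \
      --format=JobID,JobName%25,State,Elapsed,MaxRSS

# Every job should report a State of COMPLETED.
```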
The following will provide a conservative estimate for the `--max-examples` and `--est-examples` parameters, to ensure shuffling easily fits within your available memory.
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
python3 triotrain/model_training/tutorial/estimate.py \
--vcf-file ../TUTORIAL/demo/Human_tutorial/test_Father/test1_chr21.vcf.gz \
-g Father \
--demo-mode \
--env-file ../TUTORIAL/demo/envs/run1.env
Expected Output | Estimating TrioTrain Parameters:
===== start of triotrain/model_training/tutorial/estimate.py @ 2023-06-29 11:25:27 =====
2023-06-29 11:25:27 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: number of REF/REF variants found | 44,452
2023-06-29 11:25:27 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: number of PASS variants found | 114,332
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: default maximum examples per region | 200,000
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: default value for --max-examples is appropriate
2023-06-29 11:25:28 AM - [INFO] - adding Demo_TotalVariants='114332'
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: added 'Demo_TotalVariants=114332' to env file
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: number of examples made | 75,902
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: calculated examples per variant | 0.664
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: prevent underestimating which creates too many examples per region by rounding up to the nearest 0.5 | 1.0
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: default examples per variant | 1.5
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: difference between default and calculated examples per variant | 0.836
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: when running TrioTrain outside of this tutorial, please use '--est-examples=1.0'
2023-06-29 11:25:28 AM - [INFO] - adding Est_Examples='1.0'
2023-06-29 11:25:28 AM - [INFO] - [DEMO] - [TRIO1] - [count_variants] - [Father]: added 'Est_Examples=1.0' to env file
===== end of triotrain/model_training/tutorial/estimate.py @ 2023-06-29 11:25:28 =====
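The suggested `--est-examples` value in that log comes from a simple ratio; below is a minimal sketch of the rounding step, assuming the counts reported above (not TrioTrain's exact code):

```bash
# Reproduce the ratio and the round-up-to-nearest-0.5 step from the log above.
awk 'BEGIN {
  pass_variants = 114332        # PASS variants counted in the demo Truth VCF
  examples_made = 75902         # labeled examples produced for chr21
  ratio = examples_made / pass_variants
  est = int(ratio * 2) / 2
  if (est < ratio) est += 0.5   # round UP to the nearest 0.5
  printf "calculated examples per variant: %.3f\n", ratio   # 0.664
  printf "suggested --est-examples: %.1f\n", est            # 1.0
}'
```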
9. Run TrioTrain with a Human Trio
Now that we know how to tailor TrioTrain for a non-bovine species (human), we can move forward with starting the re-training pipeline.
Complete the GIAB TrioTrain run by editing the following command to include your new SLURM config file:
# You can add the --dry-run flag to this command to confirm the TrioTrain pipeline runs smoothly prior to submitting jobs to the queue
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
python3 triotrain/run_trio_train.py \
-g Father \
--unmapped-reads chrUn \
--est-examples 1 \
-m triotrain/model_training/tutorial/GIAB.Human_tutorial_metadata.csv \
-n GIAB_Trio \
--num-tests 3 \
--custom-checkpoint triotrain/model_training/pretrained_models/v1.4.0_withIS_withAF/wgs_af.model.ckpt \
--output ../TUTORIAL \
--benchmark \
-r </path/to/new_file.json>
Check | Baseline WGS.AF:
Check | Father:
Check | Mother:
Merge Results | Per-Iteration
Each `Test#.total.metrics.csv` output file should contain 57 rows and 2 columns. The metrics within are the raw and proportional performance metrics from hap.py. After all `convert_happy` jobs complete, we will separately merge the results from the baseline iteration and from both training iterations completed during the GIAB tutorial:
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
for start_i in $(seq 0 1); do
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: merging processed results from hap.py for GIAB run#${start_i}"
python3 triotrain/summarize/merge_results.py --env ../TUTORIAL/GIAB_Trio/envs/run${start_i}.env -g Father -m triotrain/summarize/data/tutorial_metadata.csv
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: finished merging processed results from hap.py for GIAB run#${start_i}"
done
Check | Baseline WGS.AF
Check | GIAB Trio1
Clean Up Directories | Per-Iteration
Running TrioTrain produces a large volume of temporary files. Run the following at the command line to free up space:
source ./scripts/setup/modules.sh
source ./scripts/start_conda.sh # Ensure the previously built conda env is active
for start_i in $(seq 0 1); do
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: removing temp files for GIAB run#${start_i}"
python3 triotrain/model_training/slurm/clean_tmp.py --env ../TUTORIAL/GIAB_Trio/envs/run${start_i}.env
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: finished removing temp files for GIAB run#${start_i}"
done
Check | Baseline WGS.AF
Check | GIAB Trio1
du -h ../TUTORIAL/GIAB_Trio
Expected Outputs:
141M ../TUTORIAL/GIAB_Trio/Human_tutorial/logs
402M ../TUTORIAL/GIAB_Trio/Human_tutorial/compare_Mother
1.1M ../TUTORIAL/GIAB_Trio/Human_tutorial/train_Mother/eval_Child
23G ../TUTORIAL/GIAB_Trio/Human_tutorial/train_Mother
394M ../TUTORIAL/GIAB_Trio/Human_tutorial/test_Father
298M ../TUTORIAL/GIAB_Trio/Human_tutorial/compare_Father
1.1M ../TUTORIAL/GIAB_Trio/Human_tutorial/train_Father/eval_Child
23G ../TUTORIAL/GIAB_Trio/Human_tutorial/train_Father
243G ../TUTORIAL/GIAB_Trio/Human_tutorial/examples
391M ../TUTORIAL/GIAB_Trio/Human_tutorial/test_Mother
3.1M ../TUTORIAL/GIAB_Trio/Human_tutorial/jobs
290G ../TUTORIAL/GIAB_Trio/Human_tutorial
100K ../TUTORIAL/GIAB_Trio/summary
50K ../TUTORIAL/GIAB_Trio/envs
290G ../TUTORIAL/GIAB_Trio/