Configuration
Warning
Steps that require some system-specific edits for your HPC cluster will have a warning label.
To ensure reproducibility, all configuration steps described below are included in a helper script, build.sh.
Example | build.sh
#!/bin/bash
# scripts/setup/build.sh
# NOTE: Begin an interactive session first!
# source scripts/start_interactive.sh
echo -e "=== scripts/setup/build.sh > start $(date)"
# Load cluster-specific modules
# NOTE: You will need to change this bash script to
# match your own system modules available
# Reach out to your cluster's sys admin for
# installation guidelines
source scripts/setup/modules.sh
# NOTE: both containers are required, since the
# GPU version used for training
# cannot run on non-GPU hardware
# Install GPU-specific apptainer container
bash scripts/setup/build_containers.sh DeepVariant-GPU
# Install CPU-specific apptainer container
bash scripts/setup/build_containers.sh DeepVariant-CPU
# Install the hap.py apptainer container
bash scripts/setup/build_happy.sh
# Install the conda env needed for python package 'triotrain'
source scripts/setup/build_beam.sh
# Download the appropriate shuffling script from the Google Genomics Health Group
bash scripts/setup/download_shuffle.sh
# Download pre-trained models
bash scripts/setup/download_models.sh
# Download GIAB trio data v4.2.1 for benchmarking
bash scripts/setup/download_GIAB.sh
# RUN INTERACTIVELY TO MAKE SURE IT WORKS!
bash triotrain/variant_calling/data/GIAB/bam/AJtrio.download
bash triotrain/variant_calling/data/GIAB/bam/HCtrio.download
# then calculate coverage with an SBATCH job
# bash triotrain/scripts/setup/run_jobs.sh
# Create the rtg-tools reference files for the Human ref genome GRCh38
# NOTE: this must be run after download_GIAB!
bash scripts/setup/setup_rtg_tools.sh
echo -e "=== scripts/setup/build.sh > end $(date)"
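If you want an explicit check between steps, so that a failed download or build does not go unnoticed, one option is to wrap each call. The sketch below is an illustration only: `run_step` is a hypothetical function and is not part of TrioTrain, and the script path in the usage comment is simply one of those listed in build.sh.

```shell
#!/bin/bash
# Hypothetical helper (not part of TrioTrain): run one setup step and
# report whether it succeeded, so the caller can stop at the first failure.
run_step() {
    local label="$1"
    shift
    echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Starting step [${label}]"
    if "$@"; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Finished step [${label}]"
    else
        # $? still holds the exit code of the failed command here
        local status=$?
        echo "$(date '+%Y-%m-%d %H:%M:%S') ERROR: Step [${label}] failed with exit code ${status}" >&2
        return "${status}"
    fi
}

# Usage sketch:
#   run_step "GPU container" bash scripts/setup/build_containers.sh DeepVariant-GPU || exit 1
```

The function passes the original exit code back to the caller, so `|| exit 1` (or any other handling) stays under your control.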
1. Begin an interactive session first
Requires Customization
We request resources in a SLURM "interactive session" so that we can run code at the command line without running resource-intensive code on the login node, which could negatively impact other users.
Option 1: Manual
Use the following command template, editing it to match your system's resources (e.g. add a valid partition and fair-share account).
srun --pty -p <partition_name> --time=0-06:00:00 --exclusive --mem=0 -A <account_name> /bin/bash
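To catch a template that still contains its placeholders before it is submitted, the command can be assembled from variables first. This is a sketch only: `build_srun_cmd`, its argument order, and the example partition and account names are hypothetical, and the default time limit simply mirrors the template above.

```shell
#!/bin/bash
# Hypothetical helper: assemble the interactive-session command from a
# partition and an account, refusing empty or unedited <placeholder> values.
build_srun_cmd() {
    local partition="$1"
    local account="$2"
    local time_limit="${3:-0-06:00:00}"   # default mirrors the template
    local value
    for value in "${partition}" "${account}"; do
        case "${value}" in
            ""|*"<"*|*">"*)
                echo "ERROR: supply a valid partition and account" >&2
                return 1
                ;;
        esac
    done
    echo "srun --pty -p ${partition} --time=${time_limit} --exclusive --mem=0 -A ${account} /bin/bash"
}

# Print the command for review; run it yourself once it looks right.
build_srun_cmd "my_partition" "my_account"
```

Printing the command instead of executing it keeps the sketch safe to run on a login node.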
Option 2: Automated
To switch repeatedly between different interactive sessions, use the same syntax as above, but edit the provided template to match your system's resources (e.g. add a valid partition and fair-share account).
Example | start_interactive.sh
#!/bin/bash
# scripts/start_interactive.sh
# An example script of requesting interactive resources for the Lewis SLURM Cluster
# NOTE: You will need to change this to match your own setup, such as
# altering the partition name and qos (e.g. 'Interactive'), or
# altering your account (e.g. 'schnabellab')
# srun --pty -p gpu3 --time=0-04:00:00 -A animalsci /bin/bash
# srun --pty -p hpc6 --time=0-04:00:00 --mem=0 --exclusive -A animalsci /bin/bash
# srun --pty -p Interactive --qos=Interactive --time=0-04:00:00 --mem=0 --exclusive -A animalsci /bin/bash
# srun --pty -p Interactive --qos=Interactive --time=0-04:00:00 --mem=30G -A schnabellab /bin/bash
srun --pty -p Lewis --time=0-04:00:00 --mem=30G -A schnabellab /bin/bash
# srun --pty -p BioCompute --time=0-06:00:00 --exclusive --mem=0 -A schnabellab /bin/bash
2. Load cluster-specific modules
Requires Customization
This script is how TrioTrain finds the required software on your local HPC cluster. TrioTrain will repeatedly use this script to load all modules and the required bash helper functions. Edit the provided template to match your system (e.g. use valid module names).
Example | modules.sh
#!/usr/bin/bash
## scripts/setup/modules.sh
echo "=== scripts/setup/modules.sh start > $(date)"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Wiping modules... "
module purge
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done wiping modules"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Loading modules... "
# Enable loading of pkgs from prior manager
module load rss/rss-2020
# Update to a newer, but still old, version of Curl
module load curl/7.72.0
# Update to a newer version of git,
# Required for Git extensions on VSCode
module load git/2.29.0
# Enable "conda activate" rather than,
# using "source activate"
module load miniconda3/4.9
export CONDA_BASE=$(conda info --base)
# System Requirement to use 'conda activate'
source ${CONDA_BASE}/etc/profile.d/conda.sh
conda deactivate
# Modules required for re-training
module load java/openjdk/java-1.8.0-openjdk
module load singularity/singularity
module load picard/2.26.10
# Modules required for post-processing variants
module load cuda/11.1.0
module load bcftools/1.14
module load htslib/1.14
module load samtools/1.14
module load gcc/10.2.0
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done Loading Modules"
echo -e "$(date '+%Y-%m-%d %H:%M:%S') INFO: Conda Base Environment:\n${CONDA_BASE}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Python Version:"
python3 --version
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Java Version:"
java -version
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Apptainer Version:"
apptainer --version
# Source DeepVariant version and CACHE Dir
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Adding Apptainer variables... "
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: This step is required to build DeepVariant image(s)"
if [ -z "$1" ]
then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Using defaults, DeepVariant version 1.4.0"
export BIN_VERSION_DV="1.4.0"
export BIN_VERSION_DT="1.4.0"
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Using inputs, DeepVariant version $1"
export BIN_VERSION_DV="$1"
export BIN_VERSION_DT="$1"
fi
export APPTAINER_CACHEDIR="${PWD}/APPTAINER_CACHE"
export APPTAINER_TMPDIR="${PWD}/APPTAINER_TMPDIR"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done adding Apptainer variables"
# Confirm that it worked
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: DeepVariant Version: ${BIN_VERSION_DV}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Apptainer Cache: ${APPTAINER_CACHEDIR}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Apptainer Tmp: ${APPTAINER_TMPDIR}"
# Activating the Bash Sub-Routine to handle errors
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Loading bash helper functions... "
source scripts/setup/helper_functions.sh
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done Loading bash helper functions"
echo "=== scripts/setup/modules.sh > end $(date)"
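Because module names differ between clusters, it can help to confirm that the expected tools are actually on `PATH` after you edit modules.sh. The sketch below assumes a hypothetical `require_cmds` function that is not part of TrioTrain. The tool names in the usage comment are inferred from the modules loaded above and may differ on your system.

```shell
#!/bin/bash
# Hypothetical check (not part of TrioTrain): report every command in the
# argument list that cannot be found on PATH.
require_cmds() {
    local missing=0
    local cmd
    for cmd in "$@"; do
        if ! command -v "${cmd}" >/dev/null 2>&1; then
            echo "ERROR: required command not found: ${cmd}" >&2
            missing=1
        fi
    done
    # Return 0 only if every command was found
    return "${missing}"
}

# Usage sketch, after sourcing your edited modules.sh:
#   require_cmds conda java apptainer bcftools samtools || echo "Check your module names"
```

Checking every name before returning means a single run lists all missing tools, not just the first one.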
Alternate Versions of DeepVariant
Providing a valid version number as the first argument to modules.sh will change the version used. Using any version greater than v1.4.0 is untested!
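A small guard can make the "untested" boundary explicit before a different version is passed to modules.sh. This is a minimal sketch, assuming a hypothetical `warn_if_untested_dv` function. It relies on `sort -V` (version sort, as provided by GNU coreutils), and the 1.4.0 ceiling is taken from the note above.

```shell
#!/bin/bash
# Hypothetical helper: succeed when the requested DeepVariant version is
# not newer than 1.4.0; otherwise print a warning and return 1.
warn_if_untested_dv() {
    local requested="${1:-1.4.0}"   # same default as modules.sh
    local ceiling="1.4.0"
    local newest
    # sort -V orders version strings numerically; the last line is the newest
    newest="$(printf '%s\n%s\n' "${requested}" "${ceiling}" | sort -V | tail -n 1)"
    if [ "${newest}" != "${ceiling}" ]; then
        echo "WARNING: DeepVariant ${requested} is newer than ${ceiling} and is untested" >&2
        return 1
    fi
}

warn_if_untested_dv 1.5.0 || echo "Proceeding with an untested version"
# To actually switch versions: source scripts/setup/modules.sh <version>
```

The check only compares version numbers. It does not confirm that the requested version exists as a published container.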
3. Install Apptainer/Singularity containers
We need local copies of two (2) DeepVariant containers and one (1) container for hap.py:
- GPU-specific container used for training
- CPU-specific container used for all other steps
- hap.py: we strongly recommend using a containerized version, as this tool uses the deprecated Python v2.7, making it incompatible with both DeepVariant containers and the TrioTrain conda environment.
# Install GPU-specific DV apptainer container
bash scripts/setup/build_containers.sh DeepVariant-GPU
# Install CPU-specific DV apptainer container
bash scripts/setup/build_containers.sh DeepVariant-CPU
# Install the hap.py apptainer container
bash scripts/setup/build_happy.sh
Example | build_containers.sh
#!/bin/bash
# scripts/setup/build_containers.sh
echo "=== scripts/setup/build_containers.sh > start $(date)" $1
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Creating Apptainer CACHE/ and TMP/, if needed"
install --directory --verbose ${APPTAINER_CACHEDIR}
install --directory --verbose ${APPTAINER_TMPDIR}
# Only want to build these Apptainer Image(s) once!
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Installing Container Image(s), if necessary"
if [[ $1 == 'DeepVariant-CPU' ]]; then
version="${BIN_VERSION_DV}"
image_name="deepvariant_${version}"
docker_name="${version}"
command="run_deepvariant"
elif [[ $1 == 'DeepVariant-GPU' ]]; then
version="${BIN_VERSION_DV}"
image_name="deepvariant_${version}-gpu"
docker_name="${version}-gpu"
command="run_deepvariant"
elif [[ $1 == 'DeepTrio-CPU' ]]; then
version="${BIN_VERSION_DT}"
image_name="deepvariant_deeptrio-${version}"
docker_name="deeptrio-${version}"
command="deeptrio/run_deeptrio"
elif [[ $1 == 'DeepTrio-GPU' ]]; then
version="${BIN_VERSION_DT}"
image_name="deepvariant_deeptrio-${version}-gpu"
docker_name="deeptrio-${BIN_VERSION_DT}-gpu"
command="deeptrio/run_deeptrio"
else
echo -e "$(date '+%Y-%m-%d %H:%M:%S') ERROR: Invalid argument [$1] provided.\n$(date '+%Y-%m-%d %H:%M:%S') INFO: Choices: [ DeepVariant-CPU, DeepTrio-CPU, DeepVariant-GPU, DeepTrio-GPU ]\nExiting... "
exit 1
fi
if test -x ./${image_name}.sif; then
echo -e "$(date '+%Y-%m-%d %H:%M:%S') INFO: Image [${image_name}.sif] has already been installed"
apptainer run -B /usr/lib/locale/:/usr/lib/locale/ "${image_name}.sif" "/opt/deepvariant/bin/${command}" --version
else
echo -e "$(date '+%Y-%m-%d %H:%M:%S') INFO: Image [${image_name}.sif] needs to be installed"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Apptainer Image will go here: ${PWD}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Building Apptainer Image now... "
apptainer pull docker://google/deepvariant:"${docker_name}"
echo "Done: Building Apptainer Image"
fi
echo "=== scripts/setup/build_containers.sh > end $(date)"
Example | build_happy.sh
#!/bin/bash
# build_happy.sh
echo "=== scripts/setup/build_happy.sh > start $(date)"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Creating Apptainer CACHE/ and TMP/, if needed"
install --directory --verbose ${APPTAINER_CACHEDIR}
install --directory --verbose ${APPTAINER_TMPDIR}
# Only want to build these Apptainer Image(s) once!
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Installing Hap.py Container Image(s), if necessary..."
if test -x ./hap.py_v0.3.12.sif; then
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Image [hap.py_v0.3.12.sif] has already been installed"
apptainer run -B /usr/lib/locale/:/usr/lib/locale/ hap.py_v0.3.12.sif /opt/hap.py/bin/hap.py --help
else
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Image [hap.py_v0.3.12.sif] needs to be installed"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Apptainer Image will go here: ${PWD}"
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Building Apptainer Image now... "
apptainer pull docker://jmcdani20/hap.py:v0.3.12
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Done Building Hap.py Apptainer Image"
fi
echo "=== scripts/setup/build_happy.sh > end $(date)"
Container Versions
Both the GPU and CPU containers are required, since the GPU version used for training cannot run on non-GPU hardware.
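The image naming rule in build_containers.sh, combined with the GPU/CPU split, can be sketched as follows. Both function names are hypothetical. Using `nvidia-smi` to detect a GPU is an assumption about your nodes, not something TrioTrain is documented to do.

```shell
#!/bin/bash
# Hypothetical helper: map a DeepVariant version and hardware type to the
# image file name that build_containers.sh checks for.
dv_image_name() {
    local version="$1"
    local hardware="$2"    # "gpu" or "cpu"
    case "${hardware}" in
        gpu) echo "deepvariant_${version}-gpu.sif" ;;
        cpu) echo "deepvariant_${version}.sif" ;;
        *)   echo "ERROR: hardware must be 'gpu' or 'cpu'" >&2; return 1 ;;
    esac
}

# Assumption: a node with a usable NVIDIA GPU has a working nvidia-smi.
node_hardware() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo "gpu"
    else
        echo "cpu"
    fi
}

echo "Image matching this node's hardware: $(dv_image_name "${BIN_VERSION_DV:-1.4.0}" "$(node_hardware)")"
```

On a node without a GPU the sketch falls back to the CPU image name, which matches the container used for every step other than training.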
4. Install the Conda environment
Warning
CAUTION: TrioTrain and DeepVariant require highly specific package versions, and TrioTrain assumes that a pre-built conda environment is located here: ./miniconda_envs/beam_v2.30. At this time, we are unable to support users who opt to make significant changes or who change the conda env path.
This conda environment includes the DeepVariant requirements, such as Apache Beam, TensorFlow, etc. The conda environment can take a while to build. We recommend requesting ample memory for your interactive session before proceeding.
source scripts/setup/build_beam.sh
# `source` is used instead of `bash` to bypass system issues with `conda activate`
# specific to MU Lewis; this may not be required for your system.
Example | build_beam.sh
#!/bin/bash
# scripts/setup/build_beam.sh
echo -e "=== scripts/setup/build_beam.sh > start $(date)"
##--- NOTE: ----##
## You must have an interactive session
## with more mem than defaults to work!
##--------------##
if [ ! -d ./miniconda_envs/beam_v2.30 ] ; then
# If missing an environment called "beam_v2.30",
# initialize an empty env at this prefix
conda create --yes --prefix ./miniconda_envs/beam_v2.30
fi
# Then, activate the new environment
source ${CONDA_BASE}/etc/profile.d/conda.sh
conda deactivate
conda activate ./miniconda_envs/beam_v2.30
##--- Configure an environment-specific .condarc file ---##
## NOTE: Only performed once:
# Changes the (env) prompt to avoid printing the full path
conda config --env --set env_prompt '({name})'
# Put the package download channels in a specific order
conda config --env --add channels defaults
conda config --env --add channels bioconda
conda config --env --add channels conda-forge
# Download packages flexibly
conda config --env --set channel_priority flexible
# Install the project-specific packages
# in the currently active env
conda install -p ./miniconda_envs/beam_v2.30 -y -c conda-forge python=3.8 pandas numpy python-dotenv python-snappy tensorflow=2.5 apache-beam=2.30 regex spython natsort rtg-tools
# Deactivate the conda env to continue with build process
conda deactivate
###===== Notes about Beam specific packages =====###
### Python = Apache Beam Python SDK only supports v3.6-3.8
### Scipy = scientific libraries for Python
### DotEnv = enables environment variable configuration across bash and python
### Snappy = A fast compressor/decompressor (required)
### Apache-Beam = unified programming model for batch and stream processes
### Tensorflow = eval metrics visualizations via TensorBoard for CPU, use v2.5.0 for DV v1.4
### Regex = required for updated regular expression handling
### Spython = interface between Singularity/Apptainer bash commands and Python
### Natsort = enables sorting of file iterators
### RTG-Tools = required for Mendelian Inheritance Error calculations performed for summarize
echo -e "=== scripts/setup/build_beam.sh > end $(date)"
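Since TrioTrain expects the environment at a fixed relative path, a quick check before running later steps can save a failed job. This is a minimal sketch, assuming a hypothetical `check_beam_env` function that takes the repository root as an optional argument:

```shell
#!/bin/bash
# Hypothetical check: confirm the conda env directory exists where
# TrioTrain expects it, relative to the given root (default: current dir).
check_beam_env() {
    local root="${1:-.}"
    local env_path="${root}/miniconda_envs/beam_v2.30"
    if [ -d "${env_path}" ]; then
        echo "INFO: found conda env at ${env_path}"
    else
        echo "ERROR: expected conda env at ${env_path}; run build_beam.sh first" >&2
        return 1
    fi
}

check_beam_env . || echo "Conda env not built yet"
```

The check only confirms that the directory exists. It does not verify that the packages inside it were installed successfully.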
5. Download the Beam shuffling script
Creates a local copy of the appropriate shuffling script from the Google Genomics Health Group.
Example | download_shuffle.sh
#!/bin/bash
# scripts/setup/download_shuffle.sh
echo -e "=== scripts/setup/download_shuffle.sh > start $(date)"
##======= Download Shuffle Script =================================##
export SHUFFLE_VERSION=${BIN_VERSION_DV:0:3}
echo "$(date '+%Y-%m-%d %H:%M:%S') INFO: Downloading Google Beam Shuffling Script - v${SHUFFLE_VERSION}"
curl -C - https://raw.githubusercontent.com/google/deepvariant/r${SHUFFLE_VERSION}/tools/shuffle_tfrecords_beam.py -o triotrain/model_training/prep/shuffle_tfrecords_beam.py
##=================================================================##
echo -e "=== scripts/setup/download_shuffle.sh > end $(date)"
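The substring `${BIN_VERSION_DV:0:3}` keeps the first three characters of the version, so `1.4.0` becomes `1.4`, the name of the release branch in the URL. The sketch below builds the same URL but strips the final patch component instead, which would also cope with a two-digit minor version. `shuffle_url` is a hypothetical name, and this variant is not what the provided script does.

```shell
#!/bin/bash
# Hypothetical helper: derive the release branch (major.minor) from a full
# version string and build the download URL for the shuffling script.
shuffle_url() {
    local version="$1"
    local branch="${version%.*}"   # drop the last ".N", so 1.4.0 becomes 1.4
    echo "https://raw.githubusercontent.com/google/deepvariant/r${branch}/tools/shuffle_tfrecords_beam.py"
}

# Falls back to the documented default version when BIN_VERSION_DV is unset
shuffle_url "${BIN_VERSION_DV:-1.4.0}"
```

For the default version 1.4.0, both approaches produce the same `r1.4` branch name.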