TrioTrain Data

Data Assumptions

TrioTrain and DeepVariant use several input file formats; however, all files must:

exist before the execution of the pipeline
be compatible with the reference genome provided to the pipeline
be sorted and indexed
contain only one sample per file

Required Raw Data

Reference Genome
- must be in FASTA format
- includes the corresponding .fai index file generated with samtools faidx and located in the same directory
- includes the corresponding .dict file generated with picard and located in the same directory
- (OPTIONALLY) includes the corresponding Sequence Data File (SDF) generated with rtg-tools format and located at the same path in a sub-directory called "rtg_tools" — required for calculating Mendelian Inhertiance Errors with testing genomes
Aligned Reads File(s)
- must be aligned to the reference genome above
- can be in either BAM format or CRAM format
- includes the corresponding .bai or .csi index file located in the same directory
Benchmarking Variant File(s)
- also referred to as "truth genotypes", or "gold-standard genotypes"
- must be in in VCF format and compressed with bgzip
- includes a corresponding .tbi index generated with tabix and located in the same directory
- excludes any homozygous reference genotypes and any sites that violate Mendelian inheritance expectations
Benchmarking Region File(s)
- also referred to as "callable regions"
- must be in BED format
- must be compatible with the specified reference genome
- compressed files will be decompressed
- use 0-based coordinates
Starting DeepVariant Model Checkpoint
- used for warm-start a new model initializing weights with a previous model
- can either be downloaded from Google Cloud Platform (GCP) or created previously by a prior TrioTrain iteration
- Checkpoints consists of four (4) files all located in the same directory:
  1. .data-00000-of-00001
  2. .index
  3. .meta
  4. .example_info.json — defines which features to include as channels within the images given to DeepVariant in tfRecord format
  Note
  
  Examples made with different channel(s), a different tfRecord shape, or a different DeepVariant version can be incompatible with your chosen starting model. Get details about model features compatible with TrioTrain, such as shape, version and channels here.
  
  You can check the shape of a model's examples with:
  
  jq '.' <model_name>.example_info.json.
(OPTIONAL) Population Allele Frequencies
- must be in VCF format and compressed with bgzip
- includes a corresponding .tbi index generated with tabix and located in the same directory
- genotypes should be removed

Note

Our automated, cattle-optimized GATK Best Practices workflow used to generate our input files automatically performs realignment and recalibration with Base Quality Score Recalibration (BQSR). BQSR is not required or recommended for using the single-step variant caller from DeepVariant, as it may decrease the accuracy.

However, re-training involves a small proportion of the total genomes processed by UMAG group (55 of 5,500+). Thus, removing BQSR would decrease the quality of the entire cohort's GATK genotypes used in other research. The impact of including BQSR in our truth labels was not evaluated further during TrioTrain's development.

TrioTrain-Specific Inputs

Configuring SLURM Resources

SLURM resources are handled by TrioTrain via a resource configuration file (.json).

Example | Resource Config File

triotrain/model_training/tutorial/resources_used.json

{
    "make_examples": {
        "partition": "hpc5,hpc6,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "CPUmem": 9000,
        "time": "0-2:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "beam_shuffle": {
        "partition": "hpc5,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "time": "0-2:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "re_shuffle": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "ntasks": 1,
        "mem": "200G",
        "time": "0-10:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "train_eval": {
        "partition": "gpu3",
        "gres": "gpu:2",
        "nodes": 1,
        "ntasks": 16,
        "mem": "0",
        "time": "2-00:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "select_ckpt": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "ntasks": 1,
        "mem": 500,
        "time": "0-00:30:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "call_variants": {
        "partition": "hpc5,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "time": "2-00:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "compare_happy": {
        "partition": "BioCompute,hpc5",
        "nodes": 1,
        "ntasks": 40,
        "mem": "300G",
        "time": "0-12:00:00",
        "account": "biocommunity",
        "email": "jakth2@mail.missouri.edu"
    },
    "convert_happy": {
        "partition": "BioCompute,hpc3,hpc5,Lewis",
        "nodes": 1,
        "ntasks": 24,
        "mem": "120G",
        "time": "0-05:00:00",
        "account": "biocommunity",
        "email": "jakth2@mail.missouri.edu"
    },
    "show_examples": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "mem": "1G",
        "time": "0-02:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "summary": {
        "partition": "hpc5,hpc6,Lewis",
        "nodes": 1,
        "ntasks": 4,
        "mem": "40G",
        "time": "0-10:00:00",
        "account": "schnabellab",
        "email": "jakth2@mail.missouri.edu"
    }
}

Resource Config Format

Contains nested dictionaries in the following format:

{"phase_name": {
    "SLURM_SBATCH_PARAMETER": "value",
    "SLURM_SBATCH_PARAMETER": "value",
    "SLURM_SBATCH_PARAMETER": "value",
    }
}

There are (8) required phases within TrioTrain's SLURM config file. Valid phase_names for these include:

make_examples
beam_shuffle
re_shuffle
train_eval
select_ckpt
call_variants
compare_happy
convert_happy

Additionally, there are (3) optional phase names for TrioTrain's supplementary analyes that include:

show_examples — for running TrioTrain in 'demo' mode
summary_stats — for calculating per-VCF stats for each test genome
mie_summary — for calculating Mendelian Inheritance Error rate in trio-binned test genomes

The value for each phase_name is a nested dictionary that contains key:value pairs of parameters for running SBATCH job files. You can view valid SBATCH options in the SLURM documentation.

Example | Resource Config File

triotrain/model_training/tutorial/resources_used.json

{
    "make_examples": {
        "partition": "hpc5,hpc6,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "CPUmem": 9000,
        "time": "0-2:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "beam_shuffle": {
        "partition": "hpc5,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "time": "0-2:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "re_shuffle": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "ntasks": 1,
        "mem": "200G",
        "time": "0-10:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "train_eval": {
        "partition": "gpu3",
        "gres": "gpu:2",
        "nodes": 1,
        "ntasks": 16,
        "mem": "0",
        "time": "2-00:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "select_ckpt": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "ntasks": 1,
        "mem": 500,
        "time": "0-00:30:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "call_variants": {
        "partition": "hpc5,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "time": "2-00:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "compare_happy": {
        "partition": "BioCompute,hpc5",
        "nodes": 1,
        "ntasks": 40,
        "mem": "300G",
        "time": "0-12:00:00",
        "account": "biocommunity",
        "email": "jakth2@mail.missouri.edu"
    },
    "convert_happy": {
        "partition": "BioCompute,hpc3,hpc5,Lewis",
        "nodes": 1,
        "ntasks": 24,
        "mem": "120G",
        "time": "0-05:00:00",
        "account": "biocommunity",
        "email": "jakth2@mail.missouri.edu"
    },
    "show_examples": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "mem": "1G",
        "time": "0-02:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "summary": {
        "partition": "hpc5,hpc6,Lewis",
        "nodes": 1,
        "ntasks": 4,
        "mem": "40G",
        "time": "0-10:00:00",
        "account": "schnabellab",
        "email": "jakth2@mail.missouri.edu"
    }
}

Providing required data to TrioTrain

Input files are handled by the primary input file for TrioTrain, a metadata file in .csv format. This input file includes trio pedigree information, and the absolute file paths for the local data you want to give DeepVariant.

Different metadata files are used to define different re-training approaches. For example, you can alter the order in which trios are given to DeepVariant by varying the row order in two different metadata files.

Metadata Assumptions

The first row includes column headers which will become variable names within TrioTrain
Each row corresponds to one complete family trio resulting in (2) re-training iterations, one for each parent
Row order determines the sequential order of how trios seen by DeepVariant
There are (24) REQUIRED columns that must be in the order specified in the Metadata Format section below

Note

If the data are available, you can perform additional iterations of TrioTrain by adding rows for each additional trio.

Likewise, further test replicates can be achieved by adding columns in sets of three [BAM,TruthVCF,TruthBED] for each additional test genome.

Minimum Data Required

At a minimum, the metadata file must provide absolute paths to the following input files:

TrioTrain performs two iterations of re-training, one for each parent in a trio which requires:
- Three (3) aligned read data .bam files, with the corresponding .bai index.
- Three (3) benchmark .vcf.gz files, with the corresponding .vcf.gz.tbi index.
- Three (3) benchmark region .bed files.
TrioTrain tests the model produced for each iteration using a set of genomes previously unseen by the model. Ideally, these testing samples should consist of individuals outside of the family and requires:
- One or more (1+) aligned read data .bam files, with the corresponding .bai index.
- One or more (1+) benchmark .vcf.gz files, with the corresponding .vcf.gz.tbi index.
- One or more (1+) benchmark .bed files.

Metadata Format

Column Number	Column Name	Description	Data Type
1	RunOrder	Sequential number for each trio	integer
2	RunName	A unique name for the trio's output directory	string without spaces
3	ChildSampleID	A primary, unique identifier for a child; must match the SampleID in the child’s `VCF/BAM/BED` files	alpha-numeric characters
4	ChildLabID	A secondary, unique ID for a child ; `default=ChildSampleID`	alpha-numeric characters
5	FatherSampleID	A primary, unique identifier for a father; must match the SampleID in the father’s `VCF/BAM/BED` files	alpha-numeric characters
6	FatherLabID	A secondary, unique ID for a father; `default=FatherSampleID`	alpha-numeric characters
7	MotherSampleID	A primary, unique identifier for a mother; must match the SampleID in the mother’s `VCF/BAM/BED` files	alpha-numeric characters
8	MotherLabID	A secondary, unique ID for a mother; `default=MotherSampleID`	alpha-numeric characters
9	ChildSex	The sex of the child, where `F=female, M=male, U=unknown`	`F`, `M`, `U`
10	RefFASTA	The absolute path to the reference file	`/path/to/file`
11	PopVCF	The absolute path to the population allele frequency file; if blank, allele frequency information will not be included in the TensorFlow records during example image creation	`/path/to/file`
12	RegionsFile	a `.bed` file where each line represents a genomic region for shuffling; each shuffling region produce a set of file shards which depends upon the number of CPUs requested via SLURM; over-rides RegionShuffling if included	`/path/to/file`
13	ChildReadsBAM	The absolute path to the child's aligned reads	`/path/to/file`
14	ChildTruthVCF	The absolute path to the child's truth genotypes	`/path/to/file`
15	ChildCallableBED	The absolute path to the child's callable regions	`/path/to/file`
16	FatherReadsBAM	The absolute path to the fathers's aligned reads	`/path/to/file`
17	FatherTruthVCF	The absolute path to the father's truth genotypes	`/path/to/file`
18	FatherCallableBED	The absolute path to the father's callable regions	`/path/to/file`
19	MotherReadsBAM	The absolute path to the mother's aligned reads	`/path/to/file`
20	MotherVCF	The absolute path to the mother's truth genotypes	`/path/to/file`
21	MotherCallableBED	The absolute path to the mother's callable regions	`/path/to/file`
22	Test1ReadsBAM	The absolute path to a test genome's aligned reads	`/path/to/file`
23	Test1TruthVCF	The absolute path to a test genome's truth genotypes	`/path/to/file`
24	Test1CallableBED	The absolute path to a test genome's callable regions	`/path/to/file`

Adding more test genomes

Each additional testing genome can be supplied by adding three (3) more columns in the following order:

Column Number	Column Name	Description	Data Type
25	Test#ReadsBAM	The absolute path to a test genome's aligned reads	`/path/to/file`
26	Test#TruthVCF	The absolute path to a test genome's truth genotypes	`/path/to/file`
27	Test#CallableBED	The absolute path to a test genome's callable regions	`/path/to/file`

Note

The # in Test# does not correspond to the order each test is performed, as testing is performed in parallel. However, the number for each test genomes must be sequential to provide a unique label for output files.

TrioTrain Outputs

TODO: add a description here!

Last update: March 8, 2024