TrioTrain Data

Data Assumptions

TrioTrain and DeepVariant use several input file formats; however, all files must:

  • exist before the execution of the pipeline
  • be compatible with the reference genome provided to the pipeline
  • be sorted and indexed
  • contain only one sample per file
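
The one-sample-per-file assumption can be checked for a bgzip-compressed truth VCF before the pipeline starts. The following is a minimal sketch, not part of TrioTrain: the function names `vcf_sample_names` and `has_single_sample` are hypothetical, and it relies only on the VCF convention that sample names follow the nine fixed columns on the `#CHROM` header line.

```python
import gzip


def vcf_sample_names(vcf_gz_path):
    """Return the sample names listed on the #CHROM header line of a compressed VCF."""
    # bgzip output is gzip-compatible, so the standard gzip module can read it.
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                fields = line.rstrip("\n").split("\t")
                # The first nine columns are fixed (CHROM through FORMAT);
                # any remaining columns are sample names.
                return fields[9:]
            if not line.startswith("#"):
                break  # reached variant records without seeing a header line
    raise ValueError(f"no #CHROM header line found in {vcf_gz_path}")


def has_single_sample(vcf_gz_path):
    """True when the VCF header names exactly one sample."""
    return len(vcf_sample_names(vcf_gz_path)) == 1
```

A population allele frequency VCF with genotypes removed has no sample columns, so this check is only meaningful for the per-sample truth VCFs.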

Required Raw Data

  1. Reference Genome

    • must be in FASTA format
    • includes the corresponding .fai index file generated with samtools faidx and located in the same directory
    • includes the corresponding .dict file generated with picard and located in the same directory
    • (OPTIONAL) includes the corresponding Sequence Data File (SDF) generated with rtg-tools and located at the same path in a sub-directory called "rtg_tools"; required for calculating Mendelian Inheritance Errors with testing genomes
  2. Aligned Reads File(s)

    • must be aligned to the reference genome above
    • can be in either BAM format or CRAM format
    • includes the corresponding .bai or .csi index file located in the same directory
  3. Benchmarking Variant File(s)

    • also referred to as "truth genotypes", or "gold-standard genotypes"
    • must be in VCF format and compressed with bgzip
    • includes a corresponding .tbi index generated with tabix and located in the same directory
    • excludes any homozygous reference genotypes and any sites that violate Mendelian inheritance expectations
  4. Benchmarking Region File(s)

    • also referred to as "callable regions"
    • must be in BED format
    • must be compatible with the specified reference genome
    • compressed files will be decompressed
    • use 0-based coordinates
  5. Starting DeepVariant Model Checkpoint

    • used to warm-start a new model by initializing its weights from a previous model
    • can either be downloaded from Google Cloud Platform (GCP) or created previously by a prior TrioTrain iteration
    • a checkpoint consists of four (4) files, all located in the same directory:

      1. .data-00000-of-00001
      2. .index
      3. .meta
      4. .example_info.json — defines which features to include as channels within the images given to DeepVariant in tfRecord format

      Note

      Examples made with different channel(s), a different tfRecord shape, or a different DeepVariant version can be incompatible with your chosen starting model. Get details about model features compatible with TrioTrain, such as shape, version and channels here.

      You can check the shape of a model's examples with:

      jq '.' <model_name>.example_info.json

  6. (OPTIONAL) Population Allele Frequencies

    • must be in VCF format and compressed with bgzip
    • includes a corresponding .tbi index generated with tabix and located in the same directory
    • genotypes should be removed
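
Because each raw input must sit next to its index files, a quick pre-flight check can catch missing companions early. The sketch below is illustrative only and is not TrioTrain's own validation: `missing_companions` is a hypothetical helper, and the index naming patterns it tries (for example `ref.fa.fai`, `ref.dict`, `sample.bam.bai`) are assumptions about common tool defaults rather than names confirmed by this page.

```python
from pathlib import Path


def missing_companions(reference_fasta, reads_bam, truth_vcf_gz):
    """Return labels for companion files that were not found next to each input."""
    ref, bam, vcf = Path(reference_fasta), Path(reads_bam), Path(truth_vcf_gz)
    # Each label maps to acceptable alternatives; one existing file satisfies it.
    # The naming patterns are assumptions, not names confirmed by TrioTrain.
    expected = {
        "reference .fai": [Path(str(ref) + ".fai")],
        "reference .dict": [ref.with_suffix(".dict"), Path(str(ref) + ".dict")],
        "reads index": [
            Path(str(bam) + ".bai"),
            bam.with_suffix(".bai"),
            Path(str(bam) + ".csi"),
        ],
        "truth VCF .tbi": [Path(str(vcf) + ".tbi")],
    }
    return [
        label
        for label, candidates in expected.items()
        if not any(candidate.exists() for candidate in candidates)
    ]
```

An empty list means every expected companion was found under at least one of the assumed names.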

Note

Our automated, cattle-optimized GATK Best Practices workflow, which we used to generate our input files, performs realignment and Base Quality Score Recalibration (BQSR). BQSR is neither required nor recommended when using the single-step variant caller from DeepVariant, as it may decrease accuracy.

However, re-training involves a small proportion of the total genomes processed by the UMAG group (55 of 5,500+). Thus, removing BQSR would decrease the quality of the entire cohort's GATK genotypes used in other research. The impact of including BQSR in our truth labels was not evaluated further during TrioTrain's development.

TrioTrain-Specific Inputs

Configuring SLURM Resources

SLURM resources are handled by TrioTrain via a resource configuration file (.json).

Example | Resource Config File
triotrain/model_training/tutorial/resources_used.json
{
    "make_examples": {
        "partition": "hpc5,hpc6,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "CPUmem": 9000,
        "time": "0-2:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "beam_shuffle": {
        "partition": "hpc5,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "time": "0-2:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "re_shuffle": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "ntasks": 1,
        "mem": "200G",
        "time": "0-10:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "train_eval": {
        "partition": "gpu3",
        "gres": "gpu:2",
        "nodes": 1,
        "ntasks": 16,
        "mem": "0",
        "time": "2-00:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "select_ckpt": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "ntasks": 1,
        "mem": 500,
        "time": "0-00:30:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "call_variants": {
        "partition": "hpc5,BioCompute",
        "nodes": 1,
        "ntasks": 40,
        "mem": 379067,
        "time": "2-00:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "compare_happy": {
        "partition": "BioCompute,hpc5",
        "nodes": 1,
        "ntasks": 40,
        "mem": "300G",
        "time": "0-12:00:00",
        "account": "biocommunity",
        "email": "jakth2@mail.missouri.edu"
    },
    "convert_happy": {
        "partition": "BioCompute,hpc3,hpc5,Lewis",
        "nodes": 1,
        "ntasks": 24,
        "mem": "120G",
        "time": "0-05:00:00",
        "account": "biocommunity",
        "email": "jakth2@mail.missouri.edu"
    },
    "show_examples": {
        "partition": "BioCompute,Lewis",
        "nodes": 1,
        "mem": "1G",
        "time": "0-02:00:00",
        "account": "animalsci",
        "email": "jakth2@mail.missouri.edu"
    },
    "summary": {
        "partition": "hpc5,hpc6,Lewis",
        "nodes": 1,
        "ntasks": 4,
        "mem": "40G",
        "time": "0-10:00:00",
        "account": "schnabellab",
        "email": "jakth2@mail.missouri.edu"
    }
}

Resource Config Format

Contains nested dictionaries in the following format:

{"phase_name": {
    "SLURM_SBATCH_PARAMETER": "value",
    "SLURM_SBATCH_PARAMETER": "value",
    "SLURM_SBATCH_PARAMETER": "value"
    }
}

There are eight (8) required phases within TrioTrain's SLURM config file. Valid phase_names for these are:

  1. make_examples
  2. beam_shuffle
  3. re_shuffle
  4. train_eval
  5. select_ckpt
  6. call_variants
  7. compare_happy
  8. convert_happy
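
A resource config can be checked against this list before submitting any jobs. This is a minimal sketch rather than TrioTrain's own loader: the names `REQUIRED_PHASES`, `missing_phases`, and `load_resource_config` are hypothetical, and the phase names are copied from the list above.

```python
import json

# The eight required phase names, copied from the list above.
REQUIRED_PHASES = [
    "make_examples",
    "beam_shuffle",
    "re_shuffle",
    "train_eval",
    "select_ckpt",
    "call_variants",
    "compare_happy",
    "convert_happy",
]


def missing_phases(config):
    """Return the required phase names that are absent from a parsed config."""
    return [phase for phase in REQUIRED_PHASES if phase not in config]


def load_resource_config(path):
    """Parse the JSON file and raise if any required phase is missing."""
    with open(path) as handle:
        config = json.load(handle)
    missing = missing_phases(config)
    if missing:
        raise KeyError(f"resource config is missing phases: {', '.join(missing)}")
    return config
```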

Additionally, there are three (3) optional phase names for TrioTrain's supplementary analyses:

  1. show_examples — for running TrioTrain in 'demo' mode
  2. summary_stats — for calculating per-VCF stats for each test genome
  3. mie_summary — for calculating Mendelian Inheritance Error rate in trio-binned test genomes

The value for each phase_name is a nested dictionary that contains key:value pairs of parameters for running SBATCH job files. You can view valid SBATCH options in the SLURM documentation.
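
To illustrate how one phase's dictionary could map onto SBATCH directives, here is a rough sketch. It is not how TrioTrain itself writes job files: `sbatch_header` and `SBATCH_LONG_OPTIONS` are hypothetical names, and the sketch only emits keys that match sbatch long option names directly. Keys such as "email" and "CPUmem" from the example config are not sbatch option names, so they are returned separately instead of guessing how TrioTrain translates them.

```python
# Keys from the example config that match sbatch long option names directly.
SBATCH_LONG_OPTIONS = {"partition", "nodes", "ntasks", "mem", "time", "account", "gres"}


def sbatch_header(phase_params):
    """Split one phase's parameters into '#SBATCH' lines and unrecognized keys."""
    lines, skipped = [], []
    for key, value in phase_params.items():
        if key in SBATCH_LONG_OPTIONS:
            lines.append(f"#SBATCH --{key}={value}")
        else:
            # Not a known sbatch long option; left for the caller to handle.
            skipped.append(key)
    return lines, skipped
```

For example, passing the "select_ckpt" entry from the example config yields six directive lines and reports "email" as skipped.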

Providing required data to TrioTrain

Input files are specified in TrioTrain's primary input file, a metadata file in .csv format. This file includes trio pedigree information and the absolute file paths for the local data you want to give DeepVariant.

Different metadata files are used to define different re-training approaches. For example, you can alter the order in which trios are given to DeepVariant by varying the row order in two different metadata files.

Metadata Assumptions

  • The first row includes column headers which will become variable names within TrioTrain
  • Each row corresponds to one complete family trio, resulting in two (2) re-training iterations, one for each parent
  • Row order determines the sequential order in which trios are seen by DeepVariant
  • There are twenty-four (24) REQUIRED columns that must be in the order specified in the Metadata Format section below

Note

If the data are available, you can perform additional iterations of TrioTrain by adding rows for each additional trio.

Likewise, further test replicates can be achieved by adding columns in sets of three [ReadsBAM, TruthVCF, CallableBED] for each additional test genome.

Minimum Data Required

At a minimum, the metadata file must provide absolute paths to the following input files:

  1. TrioTrain performs two iterations of re-training, one for each parent in a trio, which requires:

    • Three (3) aligned read data .bam files, with the corresponding .bai index.
    • Three (3) benchmark .vcf.gz files, with the corresponding .vcf.gz.tbi index.
    • Three (3) benchmark region .bed files.
  2. TrioTrain tests the model produced by each iteration using a set of genomes previously unseen by the model. Ideally, these testing samples should be individuals outside of the family. Testing requires:

    • One or more (1+) aligned read data .bam files, with the corresponding .bai index.
    • One or more (1+) benchmark .vcf.gz files, with the corresponding .vcf.gz.tbi index.
    • One or more (1+) benchmark .bed files.

Metadata Format

| Column Number | Column Name | Description | Data Type |
| --- | --- | --- | --- |
| 1 | RunOrder | Sequential number for each trio | integer |
| 2 | RunName | A unique name for the trio's output directory | string without spaces |
| 3 | ChildSampleID | A primary, unique identifier for a child; must match the SampleID in the child’s VCF/BAM/BED files | alpha-numeric characters |
| 4 | ChildLabID | A secondary, unique ID for a child; default=ChildSampleID | alpha-numeric characters |
| 5 | FatherSampleID | A primary, unique identifier for a father; must match the SampleID in the father’s VCF/BAM/BED files | alpha-numeric characters |
| 6 | FatherLabID | A secondary, unique ID for a father; default=FatherSampleID | alpha-numeric characters |
| 7 | MotherSampleID | A primary, unique identifier for a mother; must match the SampleID in the mother’s VCF/BAM/BED files | alpha-numeric characters |
| 8 | MotherLabID | A secondary, unique ID for a mother; default=MotherSampleID | alpha-numeric characters |
| 9 | ChildSex | The sex of the child, where F=female, M=male, U=unknown | F, M, U |
| 10 | RefFASTA | The absolute path to the reference file | /path/to/file |
| 11 | PopVCF | The absolute path to the population allele frequency file; if blank, allele frequency information will not be included in the TensorFlow records during example image creation | /path/to/file |
| 12 | RegionsFile | A .bed file where each line represents a genomic region for shuffling; each shuffling region produces a set of file shards which depends upon the number of CPUs requested via SLURM; overrides RegionShuffling if included | /path/to/file |
| 13 | ChildReadsBAM | The absolute path to the child's aligned reads | /path/to/file |
| 14 | ChildTruthVCF | The absolute path to the child's truth genotypes | /path/to/file |
| 15 | ChildCallableBED | The absolute path to the child's callable regions | /path/to/file |
| 16 | FatherReadsBAM | The absolute path to the father's aligned reads | /path/to/file |
| 17 | FatherTruthVCF | The absolute path to the father's truth genotypes | /path/to/file |
| 18 | FatherCallableBED | The absolute path to the father's callable regions | /path/to/file |
| 19 | MotherReadsBAM | The absolute path to the mother's aligned reads | /path/to/file |
| 20 | MotherVCF | The absolute path to the mother's truth genotypes | /path/to/file |
| 21 | MotherCallableBED | The absolute path to the mother's callable regions | /path/to/file |
| 22 | Test1ReadsBAM | The absolute path to a test genome's aligned reads | /path/to/file |
| 23 | Test1TruthVCF | The absolute path to a test genome's truth genotypes | /path/to/file |
| 24 | Test1CallableBED | The absolute path to a test genome's callable regions | /path/to/file |
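
Since the 24 required columns must appear in a fixed order, the header row of a metadata file can be compared against the table before running TrioTrain. The following is a minimal sketch, not TrioTrain's own validation: `REQUIRED_COLUMNS` and `header_problems` are hypothetical names, and the column names are copied from the table exactly as written there (including "MotherVCF").

```python
import csv

# Column names copied from the Metadata Format table, in the required order.
REQUIRED_COLUMNS = [
    "RunOrder", "RunName",
    "ChildSampleID", "ChildLabID",
    "FatherSampleID", "FatherLabID",
    "MotherSampleID", "MotherLabID",
    "ChildSex", "RefFASTA", "PopVCF", "RegionsFile",
    "ChildReadsBAM", "ChildTruthVCF", "ChildCallableBED",
    "FatherReadsBAM", "FatherTruthVCF", "FatherCallableBED",
    "MotherReadsBAM", "MotherVCF", "MotherCallableBED",
    "Test1ReadsBAM", "Test1TruthVCF", "Test1CallableBED",
]


def header_problems(csv_path):
    """Return (position, expected, found) for each required column out of place."""
    with open(csv_path, newline="") as handle:
        header = [name.strip() for name in next(csv.reader(handle))]
    problems = []
    for position, expected in enumerate(REQUIRED_COLUMNS, start=1):
        # A header shorter than 24 columns reports the missing ones as None.
        found = header[position - 1] if position <= len(header) else None
        if found != expected:
            problems.append((position, expected, found))
    return problems
```

Positions are 1-based so they line up with the Column Number values in the table; columns beyond the 24th are ignored by this check.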

Adding more test genomes

Each additional testing genome can be supplied by adding three (3) more columns in the following order:

| Column Number | Column Name | Description | Data Type |
| --- | --- | --- | --- |
| 25 | Test#ReadsBAM | The absolute path to a test genome's aligned reads | /path/to/file |
| 26 | Test#TruthVCF | The absolute path to a test genome's truth genotypes | /path/to/file |
| 27 | Test#CallableBED | The absolute path to a test genome's callable regions | /path/to/file |

Note

The # in Test# does not correspond to the order in which each test is performed, as testing is performed in parallel. However, the number for each test genome must be sequential to provide a unique label for output files.
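
The test-genome column rule can be expressed as a short check on the header row. This is a sketch under stated assumptions, not TrioTrain's implementation: `numbered_genomes` is a hypothetical helper, and it assumes "sequential" means the numbers start at 1 in column 22 and increase by one for each set of three columns.

```python
# Suffixes of the three columns supplied for every test genome, in order.
TEST_SUFFIXES = ["ReadsBAM", "TruthVCF", "CallableBED"]


def numbered_genomes(header):
    """Return the test genome numbers found in columns 22 onward of a header row."""
    test_columns = header[21:]
    if not test_columns or len(test_columns) % 3 != 0:
        raise ValueError("test genome columns must come in complete sets of three")
    numbers = []
    for start in range(0, len(test_columns), 3):
        # Assumption: numbering starts at 1 and increments per set of three.
        expected_number = start // 3 + 1
        for name, suffix in zip(test_columns[start:start + 3], TEST_SUFFIXES):
            expected_name = f"Test{expected_number}{suffix}"
            if name != expected_name:
                raise ValueError(f"expected column {expected_name}, found {name}")
        numbers.append(expected_number)
    return numbers
```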

TrioTrain Outputs

TODO: add a description here!


Last update: March 8, 2024