Genomic Data

Danforth Center genomics pipeline

Outlined below are the steps taken to create a raw VCF file from paired-end raw FASTQ files. This was done for each sequenced accession, so an HTCondor DAG workflow was written to streamline the processing of those ~200 accessions. While some CPU and memory parameters are included in the example steps below, those parameters varied from sample to sample, and the workflow has been tuned to accommodate that variation. This pipeline is subject to modification based on software updates and changes to software best practices.
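
A per-accession DAG could be generated along the following lines. This is a minimal sketch, not the workflow actually used: the function name `write_sample_dag`, the job names, and the submit files (`trim.sub`, `align.sub`, `call.sub`) are hypothetical placeholders.

```shell
# Write a small DAGMan file that chains three steps for one accession.
# Job names and .sub files are hypothetical placeholders.
write_sample_dag() {
  sample="$1"
  cat > "${sample}.dag" <<EOF
JOB trim trim.sub
JOB align align.sub
JOB call call.sub
VARS trim sample="${sample}"
VARS align sample="${sample}"
VARS call sample="${sample}"
PARENT trim CHILD align
PARENT align CHILD call
EOF
}

# Usage (requires HTCondor):
#   write_sample_dag accession_001
#   condor_submit_dag accession_001.dag
```

Generating one DAG per accession keeps the per-sample resource requests in the submit files, where they can be varied without touching the dependency graph.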

Software versions:

Preparing reference genome

Download Sorghum bicolor v3.1 from Phytozome

Generate:

BWA index:

bwa index -a bwtsw Sbicolor_313_v3.0.fa

fasta file index:
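
A FASTA index of this kind is commonly built with `samtools faidx`. The sketch below assumes that tool; the wrapper name `index_fasta` is hypothetical.

```shell
# Index the reference FASTA; samtools writes <reference>.fai next to it.
# Assumes samtools is on the PATH.
index_fasta() {
  samtools faidx "$1"
}

# Usage: index_fasta Sbicolor_313_v3.0.fa
```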

Sequence dictionary:
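
GATK expects a `.dict` file named after the reference, which is typically created with Picard. The sketch below assumes Picard's `CreateSequenceDictionary` with the legacy `KEY=value` argument style; the jar path `picard.jar` and the wrapper name are hypothetical.

```shell
# Create <reference basename>.dict beside the FASTA (e.g. ref.fa -> ref.dict).
# picard.jar is a placeholder path.
make_sequence_dict() {
  ref="$1"
  java -jar picard.jar CreateSequenceDictionary R="$ref" O="${ref%.*}.dict"
}

# Usage: make_sequence_dict Sbicolor_313_v3.0.fa
```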

Quality trimming and filtering of paired end reads
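
The trimming tool is not named in this section. As an illustration only, the sketch below assumes Trimmomatic in paired-end mode; the jar path, adapter file `adapters.fa`, file naming scheme, thread count, and thresholds (taken from the generic Trimmomatic manual example) are all assumptions rather than the settings used for this project.

```shell
# Trim one read pair with Trimmomatic PE.
# Argument order: two inputs, then paired/unpaired outputs for R1 and R2.
# All file names and thresholds here are illustrative assumptions.
trim_pair() {
  s="$1"
  java -jar trimmomatic.jar PE -threads 4 \
    "${s}_R1.fastq.gz" "${s}_R2.fastq.gz" \
    "${s}_R1.paired.fastq.gz" "${s}_R1.unpaired.fastq.gz" \
    "${s}_R2.paired.fastq.gz" "${s}_R2.unpaired.fastq.gz" \
    ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 \
    SLIDINGWINDOW:4:15 MINLEN:36
}

# Usage: trim_pair accession_001
```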

Aligning reads to the reference
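
Given the BWA index built earlier, alignment would typically use `bwa mem`. The sketch below is an assumption about the general shape of that command; the read-group fields, thread count, `-M` flag, and file names are placeholders.

```shell
# Align one trimmed read pair and write SAM output.
# -R adds a read group (required later by GATK); -M marks shorter split
# hits as secondary for Picard compatibility. Values are illustrative.
align_pair() {
  ref="$1"; s="$2"; r1="$3"; r2="$4"
  bwa mem -t 4 -M -R "@RG\tID:${s}\tSM:${s}\tPL:ILLUMINA" \
    "$ref" "$r1" "$r2" > "${s}.sam"
}

# Usage:
#   align_pair Sbicolor_313_v3.0.fa accession_001 R1.fastq.gz R2.fastq.gz
```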

Convert and Sort bam
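
One common way to do this is to pipe `samtools view` into `samtools sort`. The sketch below assumes samtools 1.3 or later (for the `-o` option of `sort`); the file naming is hypothetical, and the original pipeline may have used a different tool for this step.

```shell
# Convert SAM to BAM and coordinate-sort it in one pass.
# The trailing "-" tells samtools sort to read from standard input.
sam_to_sorted_bam() {
  s="$1"
  samtools view -b "${s}.sam" | samtools sort -o "${s}.sorted.bam" -
}

# Usage: sam_to_sorted_bam accession_001
```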

Mark Duplicates
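
Duplicate marking is usually done with Picard `MarkDuplicates`. The sketch below assumes that tool with legacy `KEY=value` arguments; the jar path and file names are placeholders.

```shell
# Flag PCR/optical duplicates and write a metrics report.
# picard.jar and the file naming scheme are placeholders.
mark_duplicates() {
  s="$1"
  java -jar picard.jar MarkDuplicates \
    I="${s}.sorted.bam" O="${s}.dedup.bam" M="${s}.dedup.metrics.txt"
}

# Usage: mark_duplicates accession_001
```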

Index bam files
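
BAM indexing is commonly done with `samtools index`, which the sketch below assumes; the wrapper name is hypothetical.

```shell
# Build a .bai index for a coordinate-sorted BAM file.
index_bam() {
  samtools index "$1"
}

# Usage: index_bam accession_001.dedup.bam
```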

Find intervals to analyze
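
This heading, together with the realignment step that follows, suggests the GATK 3.x `RealignerTargetCreator` tool. The sketch below assumes GATK 3.x command-line syntax; the jar path `GenomeAnalysisTK.jar` and the file names are placeholders.

```shell
# List the intervals that likely need local realignment around indels.
# Assumes GATK 3.x; jar path and file names are placeholders.
find_realign_targets() {
  ref="$1"; s="$2"
  java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R "$ref" \
    -I "${s}.dedup.bam" -o "${s}.intervals"
}

# Usage: find_realign_targets Sbicolor_313_v3.0.fa accession_001
```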

Realign
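
Realignment over previously identified target intervals corresponds to the GATK 3.x `IndelRealigner` tool. The sketch below assumes GATK 3.x syntax; the jar path and file names are placeholders.

```shell
# Locally realign reads over the target intervals.
# Assumes GATK 3.x; jar path and file names are placeholders.
realign_indels() {
  ref="$1"; s="$2"
  java -jar GenomeAnalysisTK.jar -T IndelRealigner -R "$ref" \
    -I "${s}.dedup.bam" -targetIntervals "${s}.intervals" \
    -o "${s}.realigned.bam"
}

# Usage: realign_indels Sbicolor_313_v3.0.fa accession_001
```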

Variant Calling with GATK HaplotypeCaller
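
A per-sample gVCF call might look like the following. This sketch assumes GATK 3.x syntax (the version is not stated here); the jar path and file names are placeholders.

```shell
# Call variants for one sample in reference-confidence (gVCF) mode.
# Assumes GATK 3.x; jar path and file names are placeholders.
call_gvcf() {
  ref="$1"; s="$2"
  java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R "$ref" \
    -I "${s}.realigned.bam" --emitRefConfidence GVCF -o "${s}.g.vcf"
}

# Usage: call_gvcf Sbicolor_313_v3.0.fa accession_001
```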

Above this point is the workflow for creating the gVCF files for this project. The following additional steps were used to create the Hapmap file.

Combining gVCFs with GATK CombineGVCFs

NOTE: This project has 363 gVCFs. Multiple instances of CombineGVCFs, each given a unique subset of the gVCF files, were run in parallel to speed up this step. Examples are below.
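
One batch of such a parallel run could be sketched as follows, assuming GATK 3.x syntax. The jar path, batch names, and the wrapper `combine_gvcfs` are hypothetical, and the actual batch sizes are not specified here.

```shell
# Combine a subset of gVCFs into one batch file.
# Assumes GATK 3.x; paths must not contain spaces because the
# --variant arguments are assembled in a plain string.
combine_gvcfs() {
  ref="$1"; out="$2"; shift 2
  variant_args=""
  for g in "$@"; do
    variant_args="$variant_args --variant $g"
  done
  # Unquoted on purpose so each --variant/path pair splits into words.
  java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R "$ref" $variant_args -o "$out"
}

# Usage (one call per batch, each batch run as a separate job):
#   combine_gvcfs Sbicolor_313_v3.0.fa batch1.g.vcf a.g.vcf b.g.vcf
```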

Joint genotyping on gVCF files with GATK GenotypeGVCFs
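
Joint genotyping across the combined batch files might be run as sketched below, assuming GATK 3.x syntax. The jar path, file names, and wrapper name are placeholders.

```shell
# Jointly genotype one or more (combined) gVCF files into a single VCF.
# Assumes GATK 3.x; paths must not contain spaces.
genotype_gvcfs() {
  ref="$1"; out="$2"; shift 2
  variant_args=""
  for g in "$@"; do
    variant_args="$variant_args --variant $g"
  done
  # Unquoted on purpose so each --variant/path pair splits into words.
  java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R "$ref" $variant_args -o "$out"
}

# Usage:
#   genotype_gvcfs Sbicolor_313_v3.0.fa all.raw.vcf batch1.g.vcf batch2.g.vcf
```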

Applying hard SNP filters with GATK VariantFiltration
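
The filter thresholds used for this project are not listed here. The sketch below assumes GATK 3.x syntax and uses GATK's generic recommended hard filters for SNPs as stand-in values; the filter name, jar path, and file names are placeholders.

```shell
# Annotate SNPs that fail any hard-filter criterion in the FILTER column.
# Thresholds are GATK's generic recommendations, not project-specific values.
filter_snps() {
  ref="$1"; in_vcf="$2"; out_vcf="$3"
  java -jar GenomeAnalysisTK.jar -T VariantFiltration -R "$ref" -V "$in_vcf" \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "hard_snp_filter" -o "$out_vcf"
}

# Usage: filter_snps Sbicolor_313_v3.0.fa all.raw.vcf all.filtered.vcf
```

VariantFiltration only labels failing records; they remain in the output until a later step removes them.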

Filter and recode VCF with VCFtools
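
The exact VCFtools filters used are not given here. As a minimal sketch, the call below only drops records that carry a non-PASS FILTER flag and keeps all INFO fields; any additional site or genotype filters would be added to the same command. File names and the wrapper name are placeholders.

```shell
# Keep only PASS records and write <prefix>.recode.vcf.
# Further filters (if any) are project-specific and not shown.
recode_passing() {
  in_vcf="$1"; prefix="$2"
  vcftools --vcf "$in_vcf" --remove-filtered-all \
    --recode --recode-INFO-all --out "$prefix"
}

# Usage: recode_passing all.filtered.vcf all.pass
```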

Adapt VCF for use with Tassel5

Convert VCF to Hapmap with Tassel5
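
With the TASSEL 5 command-line launcher `run_pipeline.pl`, a conversion can be sketched as below. The memory setting, file names, and wrapper name are assumptions.

```shell
# Load a VCF and export it in Hapmap format via the TASSEL 5 pipeline CLI.
# -Xmx8g is an illustrative heap size; adjust to the data set.
vcf_to_hapmap() {
  in_vcf="$1"; out_prefix="$2"
  run_pipeline.pl -Xmx8g -fork1 -vcf "$in_vcf" \
    -export "$out_prefix" -exportType Hapmap -runfork1
}

# Usage: vcf_to_hapmap all.pass.recode.vcf all.pass
```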
