Genomic Protocols
Genomic data includes whole-genome resequencing data from the HudsonAlpha Institute for Biotechnology, Alabama for 384 samples for accessions from the sorghum Bioenergy Association Panel (BAP) and genotyping-by-sequencing (GBS) data from Kansas State University for 768 samples from a population of sorghum recombinant inbred lines (RIL).

Danforth Center genomics pipeline

Outlined below are the steps taken to create a raw vcf file from paired end raw FASTQ files. This was done for each sequenced accession so a HTCondor DAG Workflow was written to streamline the processing of those ~200 accessions. While some cpu and memory parameters have been included within the example steps below those parameters varied from sample to sample and the workflow has been honed to accomodate that variation. This pipeline is subject to modification based on software updates and changes to software best practices.

Software versions:

Preparing reference genome

Download Sorghum bicolor v3.1 from Phytozome
Generate:

BWA index:

1
bwa index –a bwtsw Sbicolor_313_v3.0.fa
Copied!

fasta file index:

1
samtools faidx Sbicolor_313_v3.0.fa
Copied!

Sequence dictionary:

1
java –jar picard.jar CreateSequenceDictionary R=Sbicolor_313_v3.0.fa O=Sbicolor_313_v3.0.dict
Copied!

Quality trimming and filtering of paired end reads

1
bbduk2 in=SampleA_R1.fastq in2=SampleA_R2.fastq out=SampleA_R1.PE.fastq.gz \
2
out2=SampleA_R2.PE.fastq.gz k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=20 \
3
minlen=20 rref=adapters_file.fa lref=adapters_file.fa
Copied!

Aligning reads to the reference

1
bwa mem –M \
2
–R “@RG\tIDSAMPLEA_RG1\tPL:illumina\tPU:FLOWCELL_BARCODE.LANE.SAMPLE_BARCODE_RG_UNIT\tLB:libraryprep-lib1\tSM:SAMPLEA” \
3
Sbicolor_313_v3.0.fa SampleA_R1.PE.fastq.gz SampleA_R2.PE.fastq.gz > SAMPLEA.bwa.sam
Copied!

Convert and Sort bam

1
Samtools view –bS SAMPLEA.bwa.sam | samtools sort - SAMPLEA.bwa.sorted
Copied!

Mark Duplicates

1
java –Xmx8g –jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
2
REMOVE_DUPLICATES=true INPUT=SAMPLEA.bwa.sorted.bam OUTPUT=SAMPLEA.dedup.bam \
3
METRICS_FILES=SAMPLEA.dedup.metrics
Copied!

Index bam files

1
samtools index SAMPLEA.dedup.bam
Copied!

Find intervals to analyze

1
java –Xmx8g –jar GenomeAnalysisTK.jar –T RealignerTargetCreator \
2
–R Sbicolor_313_v3.0.fa –I SAMPLEA.dedup.bam –o SAMPLEA.realignment.intervals
Copied!

Realign

1
java –Xmx8g –jar GenomeAnalysisTK.jar –T IndelRealigner –R Sbicolor_313_v3.0.fa \
2
–I SAMPLEA.dedup.bam –targetIntervals SAMPLEA.realignment.intervals –o SAMPLEA.dedup.realigned.bam
Copied!

Variant Calling with GATK HaplotypeCaller

1
java –Xmx8g –jar GenomeAnalysisTK.jar –T HaplotypeCaller –R Sbicolor_313_v3.0.fa \
2
–I SAMPLEA.dedup.realigned.bam --emitRefConfidence GVCF --pcr_indel_model NONE \
3
-o SAMPLEA.output.raw.snps.indels.g.vcf
Copied!
Above this point is the workflow for the creation of the gVCF files for this project. The following additional steps were used to create the Hapmap file

Combining gVCFs with GATK CombineGVCFs

NOTE: This project has 363 gvcfs: multiple instances of CombineGVCFs, with unique subsets of gvcf files, were run in parallel to speed up this step below are examples
1
java –Xmx8g –jar GenomeAnalysisTK.jar –T CombineGVCFs –R Sbicolor_313_v3.0.fa \
2
-V SAMPLEA.output.raw.snps.indels.g.vcf --variant SAMPLEB.output.raw.snps.indels.g.vcf\
3
-V SAMPLEC.output.raw.snps.indels.g.vcf -o SamplesABC_combined_gvcfs.vcf
4
5
java –Xmx8g –jar GenomeAnalysisTK.jar –T CombineGVCFs –R Sbicolor_313_v3.0.fa \
6
--variant SAMPLED.output.raw.snps.indels.g.vcf -V SAMPLEE.output.raw.snps.indels.g.vcf \
7
-V SAMPLEF.output.raw.snps.indels.g.vcf -o SamplesDEF_combined_gvcfs.vcf
Copied!

Joint genotyping on gVCF files with GATK GenotypeGVCFs

1
java –Xmx8g –jar GenomeAnalysisTK.jar –T GenotypeGVCFs –R Sbicolor_313_v3.0.fa \
2
-V SamplesABC_combined_gvcfs.vcf -V SamplesDEF_combined_gvcfs.vcf -o all_combined_Genotyped_lines.vcf
Copied!

Applying hard SNP filters with GATK VariantFiltration

1
java –Xmx8g –jar GenomeAnalysisTK.jar –T VariantFiltration –R Sbicolor_313_v3.0.fa \
2
-V all_combined_Genotyped_lines.vcf -o all_combined_Genotyped_lines_filtered.vcf \
3
--filterExpression "QD < 2.0" --filterName "QD" --filterExpression "FS > 60.0" \
4
--filterName "FS" --filterExpression "MQ < 40.0" --filterName "MQ" --filterExpression "MQRankSum < -12.5" \
5
--filterName "MQRankSum" --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum"
Copied!

Filter and recode VCF with VCFtools

1
vcftools --vcf all_combined_Genotyped_lines_filtered.vcf --min-alleles 2 --max-alleles 2 \
2
--out all_combined_Genotyped_lines_vcftools.filtered.recode.vcf --max-missing 0.2 --recode
Copied!

Adapt VCF for use with Tassel5

1
tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx265G -SortGenotypeFilePlugin \
2
-inputFile all_combined_Genotyped_lines_vcftools.filtered.recode.vcf \
3
-outFile all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -fileType VCF
Copied!

Convert VCF to Hapmap with Tassel5

1
tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx290G -fork1 -vcf \
2
all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -export -exportType Hapmap -runfork1
Copied!

CoGe genomics pipeline

CoGe has integrated the tools that make up the Danforth Center’s variant calling pipeline into their easy point and click GUI, allowing users to reproduce a majority of the TERRA SNP analysis. Below, we detail how to run sequence data through CoGe’s system.
    If this is your initial attempt, you will need to create a Genome.
      1.
      Under Tools, click Load Genome or use this link.
    Under Tools, click Load Experiment or use this link.
    Select Data: to use the TERRA data click Community Data or choose from CoGe’s other data options.
    Select Options: This outlines CoGe’s choices for data processing and analysis. To reproduce pipeline used to create the TERRA SNPs, you can reference the exact tools and parameters used in the Danforth analysis above and enter the appropriate values into their equivalent drop downs or fields.
For the TERRA SNP the following were used:
FASTQ Format
    Read Type: Paired-end
    Encoding: phred33
Trimming
    Trimmer: BBDuk
    BBDuk parameters: k=23, mink=11, hdist=1, check mark both tpe and tbo, qtrim=rl, trimq=20, minlen=20, set trim adapters to both ends
Alignment
    Aligner: BWA-MEM
    BWA-MEM parameters: check mark -M, fill in Read Groups ID (identifier), PL (sequence platform), LB (library prep), SM (sample name)
SNP Analysis
    Check mark Enable which expands this section
    Method: GATK HaplotypeCaller (single-sample GVCF) using the default parameters but you can choose to use Realign reads around INDELS
General Options
    Checkmark both options to add results to your notebook and receive an email when pipeline has completed.
Describe Experiment: Enter an experiment name (required), your data processing version ie 1 for first time, Source if using TERRA Data, it’s TERRA (required), and Genome (required and if you start typing it will find your loaded genome but be sure to verify version and id .)
Last modified 1yr ago