Danforth Center genomics pipeline
Outlined below are the steps taken to create a raw vcf file from paired end raw FASTQ files. This was done for each sequenced accession so a HTCondor DAG Workflow was written to streamline the processing of those ~200 accessions. While some cpu and memory parameters have been included within the example steps below those parameters varied from sample to sample and the workflow has been honed to accomodate that variation. This pipeline is subject to modification based on software updates and changes to software best practices.
Software versions:
Preparing reference genome
Download Sorghum bicolor v3.1 from Phytozome
Generate:
BWA index:
Copy bwa index –a bwtsw Sbicolor_313_v3.0.fa
fasta file index:
Copy samtools faidx Sbicolor_313_v3.0.fa
Sequence dictionary:
Copy java –jar picard.jar CreateSequenceDictionary R=Sbicolor_313_v3.0.fa O=Sbicolor_313_v3.0.dict
Quality trimming and filtering of paired end reads
Copy bbduk2 in=SampleA_R1.fastq in2=SampleA_R2.fastq out=SampleA_R1.PE.fastq.gz \
out2=SampleA_R2.PE.fastq.gz k= 23 mink= 11 hdist= 1 tpe tbo qtrim=rl trimq= 20 \
minlen= 20 rref=adapters_file.fa lref=adapters_file.fa
Aligning reads to the reference
Copy bwa mem –M \
–R “@RG \t IDSAMPLEA_RG1 \t PL:illumina \t PU:FLOWCELL_BARCODE.LANE.SAMPLE_BARCODE_RG_UNIT \t LB:libraryprep-lib1 \t SM:SAMPLEA” \
Sbicolor_313_v3.0.fa SampleA_R1.PE.fastq.gz SampleA_R2.PE.fastq.gz > SAMPLEA.bwa.sam
Convert and Sort bam
Copy Samtools view –bS SAMPLEA.bwa.sam | samtools sort - SAMPLEA.bwa.sorted
Mark Duplicates
Copy java –Xmx8g –jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP= 1000 \
REMOVE_DUPLICATES= true INPUT=SAMPLEA.bwa.sorted.bam OUTPUT=SAMPLEA.dedup.bam \
METRICS_FILES=SAMPLEA.dedup.metrics
Index bam files
Copy samtools index SAMPLEA.dedup.bam
Find intervals to analyze
Copy java –Xmx8g –jar GenomeAnalysisTK.jar –T RealignerTargetCreator \
–R Sbicolor_313_v3.0.fa –I SAMPLEA.dedup.bam –o SAMPLEA.realignment.intervals
Realign
Copy java –Xmx8g –jar GenomeAnalysisTK.jar –T IndelRealigner –R Sbicolor_313_v3.0.fa \
–I SAMPLEA.dedup.bam –targetIntervals SAMPLEA.realignment.intervals –o SAMPLEA.dedup.realigned.bam
Variant Calling with GATK HaplotypeCaller
Copy java –Xmx8g –jar GenomeAnalysisTK.jar –T HaplotypeCaller –R Sbicolor_313_v3.0.fa \
–I SAMPLEA.dedup.realigned.bam --emitRefConfidence GVCF --pcr_indel_model NONE \
-o SAMPLEA.output.raw.snps.indels.g.vcf
Above this point is the workflow for the creation of the gVCF files for this project. The following additional steps were used to create the Hapmap file
Combining gVCFs with GATK CombineGVCFs
NOTE: This project has 363 gvcfs: multiple instances of CombineGVCFs, with unique subsets of gvcf files, were run in parallel to speed up this step below are examples
Copy java –Xmx8g –jar GenomeAnalysisTK.jar –T CombineGVCFs –R Sbicolor_313_v3.0.fa \
-V SAMPLEA.output.raw.snps.indels.g.vcf --variant SAMPLEB.output.raw.snps.indels.g.vcf \
-V SAMPLEC.output.raw.snps.indels.g.vcf -o SamplesABC_combined_gvcfs.vcf
java –Xmx8g –jar GenomeAnalysisTK.jar –T CombineGVCFs –R Sbicolor_313_v3.0.fa \
--variant SAMPLED.output.raw.snps.indels.g.vcf -V SAMPLEE.output.raw.snps.indels.g.vcf \
-V SAMPLEF.output.raw.snps.indels.g.vcf -o SamplesDEF_combined_gvcfs.vcf
Joint genotyping on gVCF files with GATK GenotypeGVCFs
Copy java –Xmx8g –jar GenomeAnalysisTK.jar –T GenotypeGVCFs –R Sbicolor_313_v3.0.fa \
-V SamplesABC_combined_gvcfs.vcf -V SamplesDEF_combined_gvcfs.vcf -o all_combined_Genotyped_lines.vcf
Applying hard SNP filters with GATK VariantFiltration
Copy java –Xmx8g –jar GenomeAnalysisTK.jar –T VariantFiltration –R Sbicolor_313_v3.0.fa \
-V all_combined_Genotyped_lines.vcf -o all_combined_Genotyped_lines_filtered.vcf \
--filterExpression "QD < 2.0" --filterName "QD" --filterExpression "FS > 60.0" \
--filterName "FS" --filterExpression "MQ < 40.0" --filterName "MQ" --filterExpression "MQRankSum < -12.5" \
--filterName "MQRankSum" --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum"
Filter and recode VCF with VCFtools
Copy vcftools --vcf all_combined_Genotyped_lines_filtered.vcf --min-alleles 2 --max-alleles 2 \
--out all_combined_Genotyped_lines_vcftools.filtered.recode.vcf --max-missing 0.2 --recode
Adapt VCF for use with Tassel5
Copy tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx265G -SortGenotypeFilePlugin \
-inputFile all_combined_Genotyped_lines_vcftools.filtered.recode.vcf \
-outFile all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -fileType VCF
Convert VCF to Hapmap with Tassel5
Copy tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx290G -fork1 -vcf \
all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -export -exportType Hapmap -runfork1