TERRA-REF Documentation
WebsiteGitHubTutorials
revisions
revisions
  • Introduction
  • Data Sources
  • Software
  • Scientific Objectives and Experimental Design
    • Protocols
      • Controlled Environment Protocols
      • Manual Field Data Protocols
      • Phenotractor Protocols
      • Sensor Calibration
      • Template Protocol
      • UAV Protocols
    • Experimental Design
      • Experimental Design Danforth
        • Sorghum Lines Danforth
      • Experimental Design Genomics
        • Sorghum Lines Genomics Year 1
        • Sorghum Lines Genomics Year 1 (continued)
        • Sorghum Lines Genomics Year 2
      • Experimental Design MAC
  • User Manual
    • What Data is Available
    • Data Products
      • Environmental conditions
      • Fluorescence intensity imaging
      • Genomics data
      • Geospatial information
      • Hyperspectral imaging data
      • Infrared heat imaging data
      • Multispectral imaging data
      • Meteorological data
      • Phenotype data
      • Point Cloud Data
    • How to Access Data
      • Using Clowder (Sensor and Genoomics data)
      • Using Globus (Sensor and Genomics data)
      • Using BETYdb (trait data, experimental metadata)
        • Accessing BETYdb via ArcMap and other GIS software
      • Using CoGe (Genomics)
      • Using CyVerse (Genomics)
      • Using Analysis Workbench (all data)
    • Data Use Policy
    • Manuscripts and Authorship Guidelines
    • Release / reprocessing schedule
  • Technical Documentation
    • Data Standards
      • Existing Data Standards
      • Agronomic and Phenotype Data Standards
      • Genomic Data Standards
      • Sensor Data Standards
      • Data Standards Committee
    • Directory Structure
    • Data Storage
    • Data Transfer
    • Data Processing Pipeline
      • Geospatial Time Series Structure
    • Data Backup
    • Data Collection
    • Data Product Creation
      • Genomic Data
      • Hyperspectral Data
    • Quality Assurance and Quality Control
    • Systems Configuration
  • Developer Manual
    • Submitting data to Clowder
    • Submitting data to BETYdb
    • Submitting Data to CoGe
    • Developing Clowder Extractors
  • Tutorials
  • Appendix
    • Code of Conduct
    • Collaboration Tools
    • Glossary
Powered by GitBook
On this page
  • Danforth Center genomics pipeline
  • Software versions:
  • Preparing reference genome
  • Quality trimming and filtering of paired end reads
  • Aligning reads to the reference
  • Convert and Sort bam
  • Mark Duplicates
  • Index bam files
  • Find intervals to analyze
  • Realign
  • Variant Calling with GATK HaplotypeCaller
  • Combining gVCFs with GATK CombineGVCFs
  • Joint genotyping on gVCF files with GATK GenotypeGVCFs
  • Applying hard SNP filters with GATK VariantFiltration
  • Filter and recode VCF with VCFtools
  • Adapt VCF for use with Tassel5
  • Convert VCF to Hapmap with Tassel5
Export as PDF
  1. Technical Documentation
  2. Data Product Creation

Genomic Data

PreviousData Product CreationNextHyperspectral Data

Last updated 7 years ago

Danforth Center genomics pipeline

Outlined below are the steps taken to create a raw vcf file from paired end raw FASTQ files. This was done for each sequenced accession so a HTCondor DAG Workflow was written to streamline the processing of those ~200 accessions. While some cpu and memory parameters have been included within the example steps below those parameters varied from sample to sample and the workflow has been honed to accomodate that variation. This pipeline is subject to modification based on software updates and changes to software best practices.

Software versions:

Preparing reference genome

Download Sorghum bicolor v3.1 from

Generate:

BWA index:

bwa index –a bwtsw Sbicolor_313_v3.0.fa

fasta file index:

samtools faidx Sbicolor_313_v3.0.fa

Sequence dictionary:

java –jar picard.jar CreateSequenceDictionary R=Sbicolor_313_v3.0.fa O=Sbicolor_313_v3.0.dict

Quality trimming and filtering of paired end reads

bbduk2 in=SampleA_R1.fastq in2=SampleA_R2.fastq out=SampleA_R1.PE.fastq.gz \
  out2=SampleA_R2.PE.fastq.gz k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=20 \
  minlen=20 rref=adapters_file.fa lref=adapters_file.fa

Aligning reads to the reference

bwa mem –M \
  –R “@RG\tIDSAMPLEA_RG1\tPL:illumina\tPU:FLOWCELL_BARCODE.LANE.SAMPLE_BARCODE_RG_UNIT\tLB:libraryprep-lib1\tSM:SAMPLEA” \
  Sbicolor_313_v3.0.fa SampleA_R1.PE.fastq.gz SampleA_R2.PE.fastq.gz > SAMPLEA.bwa.sam

Convert and Sort bam

Samtools view –bS SAMPLEA.bwa.sam | samtools sort - SAMPLEA.bwa.sorted

Mark Duplicates

java –Xmx8g –jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
  REMOVE_DUPLICATES=true INPUT=SAMPLEA.bwa.sorted.bam OUTPUT=SAMPLEA.dedup.bam \
  METRICS_FILES=SAMPLEA.dedup.metrics

Index bam files

samtools index SAMPLEA.dedup.bam

Find intervals to analyze

java –Xmx8g –jar GenomeAnalysisTK.jar –T RealignerTargetCreator \
  –R Sbicolor_313_v3.0.fa –I SAMPLEA.dedup.bam –o SAMPLEA.realignment.intervals

Realign

java –Xmx8g –jar GenomeAnalysisTK.jar –T IndelRealigner –R Sbicolor_313_v3.0.fa \
  –I SAMPLEA.dedup.bam –targetIntervals SAMPLEA.realignment.intervals –o SAMPLEA.dedup.realigned.bam

Variant Calling with GATK HaplotypeCaller

java –Xmx8g –jar GenomeAnalysisTK.jar –T HaplotypeCaller –R Sbicolor_313_v3.0.fa \
  –I SAMPLEA.dedup.realigned.bam --emitRefConfidence GVCF --pcr_indel_model NONE \
  -o SAMPLEA.output.raw.snps.indels.g.vcf

Above this point is the workflow for the creation of the gVCF files for this project. The following additional steps were used to create the Hapmap file

Combining gVCFs with GATK CombineGVCFs

NOTE: This project has 363 gvcfs: multiple instances of CombineGVCFs, with unique subsets of gvcf files, were run in parallel to speed up this step below are examples

java –Xmx8g –jar GenomeAnalysisTK.jar –T CombineGVCFs –R Sbicolor_313_v3.0.fa \
  -V SAMPLEA.output.raw.snps.indels.g.vcf --variant SAMPLEB.output.raw.snps.indels.g.vcf\
  -V SAMPLEC.output.raw.snps.indels.g.vcf -o SamplesABC_combined_gvcfs.vcf

java –Xmx8g –jar GenomeAnalysisTK.jar –T CombineGVCFs –R Sbicolor_313_v3.0.fa \
  --variant SAMPLED.output.raw.snps.indels.g.vcf -V SAMPLEE.output.raw.snps.indels.g.vcf \
  -V SAMPLEF.output.raw.snps.indels.g.vcf -o SamplesDEF_combined_gvcfs.vcf

Joint genotyping on gVCF files with GATK GenotypeGVCFs

java –Xmx8g –jar GenomeAnalysisTK.jar –T GenotypeGVCFs –R Sbicolor_313_v3.0.fa \
  -V SamplesABC_combined_gvcfs.vcf -V SamplesDEF_combined_gvcfs.vcf -o all_combined_Genotyped_lines.vcf

Applying hard SNP filters with GATK VariantFiltration

java –Xmx8g –jar GenomeAnalysisTK.jar –T VariantFiltration –R Sbicolor_313_v3.0.fa \
  -V all_combined_Genotyped_lines.vcf -o all_combined_Genotyped_lines_filtered.vcf \
  --filterExpression "QD < 2.0" --filterName "QD" --filterExpression "FS > 60.0" \
  --filterName "FS" --filterExpression "MQ < 40.0" --filterName "MQ" --filterExpression "MQRankSum < -12.5" \
  --filterName "MQRankSum" --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum"

Filter and recode VCF with VCFtools

vcftools --vcf all_combined_Genotyped_lines_filtered.vcf --min-alleles 2 --max-alleles 2 \
  --out all_combined_Genotyped_lines_vcftools.filtered.recode.vcf --max-missing 0.2 --recode

Adapt VCF for use with Tassel5

tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx265G -SortGenotypeFilePlugin \
  -inputFile all_combined_Genotyped_lines_vcftools.filtered.recode.vcf \
  -outFile all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -fileType VCF

Convert VCF to Hapmap with Tassel5

tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx290G -fork1 -vcf \
  all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -export -exportType Hapmap -runfork1
BBDuk2 version 36.67
bwa v 0.7.12-r1039
samtools v 1.3.1
picard-tools-2.0.1
GATK v3.5-0-g36282e4
VCFtools (0.1.14)
Tassel Version: 5.2.27
Phytozome