1 of 3

Data Product Creation

Data Product Levels

Data products are processed at various levels ranging from Level 0 to Level 4. Level 0 products are raw data at full instrument resolution. At higher levels, the data are converted into more useful parameters and formats. These are derived from NASA and NEON

Level

Description

Reconstructed, unprocessed, full resolution instrument data; artifacts and duplicates removed.

Level 0 plus time-referenced and annotated with calibration coefficients and georeferencing parameters (level 0 is fully recoverable from level 1a data).

Level 1a processed to sensor units (level 0 not recoverable)

Derived variables (e. g., NDVI, height, fluorescence) at the level 1 resolution.

Level 2 mapped to uniform grid, missing points gap filled; overlapping images combined

'phenotypes' derived variables associated with a particular plant or genotype rather than a spatial location

1 Earth Observing System Data Processing Levels, NASA 2 National Ecological Observatory Network Data Processing

Genomic Data

Danforth Center genomics pipeline

Outlined below are the steps taken to create a raw vcf file from paired end raw FASTQ files. This was done for each sequenced accession so a HTCondor DAG Workflow was written to streamline the processing of those ~200 accessions. While some cpu and memory parameters have been included within the example steps below those parameters varied from sample to sample and the workflow has been honed to accomodate that variation. This pipeline is subject to modification based on software updates and changes to software best practices.

Software versions:

Preparing reference genome

Generate:

BWA index:

bwa index –a bwtsw Sbicolor_313_v3.0.fa

fasta file index:

samtools faidx Sbicolor_313_v3.0.fa

Sequence dictionary:

java –jar picard.jar CreateSequenceDictionary R=Sbicolor_313_v3.0.fa O=Sbicolor_313_v3.0.dict

Quality trimming and filtering of paired end reads

bbduk2 in=SampleA_R1.fastq in2=SampleA_R2.fastq out=SampleA_R1.PE.fastq.gz \
  out2=SampleA_R2.PE.fastq.gz k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=20 \
  minlen=20 rref=adapters_file.fa lref=adapters_file.fa

Aligning reads to the reference

bwa mem –M \
  –R “@RG\tIDSAMPLEA_RG1\tPL:illumina\tPU:FLOWCELL_BARCODE.LANE.SAMPLE_BARCODE_RG_UNIT\tLB:libraryprep-lib1\tSM:SAMPLEA” \
  Sbicolor_313_v3.0.fa SampleA_R1.PE.fastq.gz SampleA_R2.PE.fastq.gz > SAMPLEA.bwa.sam

Convert and Sort bam

Samtools view –bS SAMPLEA.bwa.sam | samtools sort - SAMPLEA.bwa.sorted

Mark Duplicates

java –Xmx8g –jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
  REMOVE_DUPLICATES=true INPUT=SAMPLEA.bwa.sorted.bam OUTPUT=SAMPLEA.dedup.bam \
  METRICS_FILES=SAMPLEA.dedup.metrics

Index bam files

samtools index SAMPLEA.dedup.bam

Find intervals to analyze

java –Xmx8g –jar GenomeAnalysisTK.jar –T RealignerTargetCreator \
  –R Sbicolor_313_v3.0.fa –I SAMPLEA.dedup.bam –o SAMPLEA.realignment.intervals

Realign

java –Xmx8g –jar GenomeAnalysisTK.jar –T IndelRealigner –R Sbicolor_313_v3.0.fa \
  –I SAMPLEA.dedup.bam –targetIntervals SAMPLEA.realignment.intervals –o SAMPLEA.dedup.realigned.bam

Variant Calling with GATK HaplotypeCaller

java –Xmx8g –jar GenomeAnalysisTK.jar –T HaplotypeCaller –R Sbicolor_313_v3.0.fa \
  –I SAMPLEA.dedup.realigned.bam --emitRefConfidence GVCF --pcr_indel_model NONE \
  -o SAMPLEA.output.raw.snps.indels.g.vcf

Above this point is the workflow for the creation of the gVCF files for this project. The following additional steps were used to create the Hapmap file

Combining gVCFs with GATK CombineGVCFs

NOTE: This project has 363 gvcfs: multiple instances of CombineGVCFs, with unique subsets of gvcf files, were run in parallel to speed up this step below are examples

java –Xmx8g –jar GenomeAnalysisTK.jar –T CombineGVCFs –R Sbicolor_313_v3.0.fa \
  -V SAMPLEA.output.raw.snps.indels.g.vcf --variant SAMPLEB.output.raw.snps.indels.g.vcf\
  -V SAMPLEC.output.raw.snps.indels.g.vcf -o SamplesABC_combined_gvcfs.vcf

java –Xmx8g –jar GenomeAnalysisTK.jar –T CombineGVCFs –R Sbicolor_313_v3.0.fa \
  --variant SAMPLED.output.raw.snps.indels.g.vcf -V SAMPLEE.output.raw.snps.indels.g.vcf \
  -V SAMPLEF.output.raw.snps.indels.g.vcf -o SamplesDEF_combined_gvcfs.vcf

Joint genotyping on gVCF files with GATK GenotypeGVCFs

java –Xmx8g –jar GenomeAnalysisTK.jar –T GenotypeGVCFs –R Sbicolor_313_v3.0.fa \
  -V SamplesABC_combined_gvcfs.vcf -V SamplesDEF_combined_gvcfs.vcf -o all_combined_Genotyped_lines.vcf

Applying hard SNP filters with GATK VariantFiltration

java –Xmx8g –jar GenomeAnalysisTK.jar –T VariantFiltration –R Sbicolor_313_v3.0.fa \
  -V all_combined_Genotyped_lines.vcf -o all_combined_Genotyped_lines_filtered.vcf \
  --filterExpression "QD < 2.0" --filterName "QD" --filterExpression "FS > 60.0" \
  --filterName "FS" --filterExpression "MQ < 40.0" --filterName "MQ" --filterExpression "MQRankSum < -12.5" \
  --filterName "MQRankSum" --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum"

Filter and recode VCF with VCFtools

vcftools --vcf all_combined_Genotyped_lines_filtered.vcf --min-alleles 2 --max-alleles 2 \
  --out all_combined_Genotyped_lines_vcftools.filtered.recode.vcf --max-missing 0.2 --recode

Adapt VCF for use with Tassel5

tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx265G -SortGenotypeFilePlugin \
  -inputFile all_combined_Genotyped_lines_vcftools.filtered.recode.vcf \
  -outFile all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -fileType VCF

Convert VCF to Hapmap with Tassel5

tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx290G -fork1 -vcf \
  all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -export -exportType Hapmap -runfork1

Hyperspectral Data

The TERRA hyperspectral data pipeline processes imagery from hyperspectral camera, and ancillary metadata. The pipeline converts the "raw" ENVI-format imagery into netCDF4/HDF5 format with (currently) lossless compression that reduces their size by ~20%. The pipeline also adds suitable ancillary metadata to make the netCDF image files truly self-describing. At the end of the pipeline, the files are typically [ready for xxx]/[uploaded to yyy]/[zzz].

Installation

Software dependencies

The pipeline currently depends on three pre-requisites: netCDF Operators (NCO). Python netCDF4.

Pipeline source code

Once the pre-requisite libraries above have been installed, the pipeline itself may be installed by checking-out the TERRAREF computing-pipeline repository. The relevant scripts for hyperspectral imagery are:

Main script terraref.sh* JSON metadata->netCDF4 script JsonDealer.py

Setup

The pipeline works with input from any location (directories, files, or stdin). Supply the raw image filename(s) (e.g., meat_raw), and the pipeline derives the ancillary filename(s) from this (e.g., meat_raw.hdr, meat_metadata.json). When specifying a directory without a specifice filename, the pipeline processes all files with the suffix "_raw".

shmkdir ~/terrarefcd ~/terrarefgit clone git@github.com:terraref/computing-pipeline.gitgit clone git@github.com:terraref/documentation.git

Run the Hyperspectral Pipeline

shterraref.sh -i ${DATA}/terraref/foo_raw -O ${DATA}/terrarefterraref.sh -I /projects/arpae/terraref/raw_data/lemnatec_field -O /projects/arpae/terraref/outputs/lemnatec_field