Data Standards

Overview

TERRA’s data standards facilitate the exchange of genomic and phenomic data among teams and with external researchers. Applying common standards makes it easier to exchange analytical methods and data across domains and to leverage existing tools.

When practical, existing conventions and standards have been adopted. Spatial data follows Federal Geographic Data Committee (FGDC) and Open Geospatial Consortium (OGC) data and metadata standards. The Climate and Forecast (CF) variable naming convention was adopted for meteorological and biophysical data. Data formats and variable naming conventions were adapted from NEON and NASA.

Feedback from data creators and users was used to define the data formats, semantics, interfaces, and representations of space, time, and genetic identity, based on existing standards, commonly used file formats, and user needs.

We anticipate that standards and data formats will evolve over time as we clarify use cases, develop new sensors and analytical pipelines, and build tools for data format conversion, feature extraction, and provenance tracking. Each year we will re-convene to assess our standards based on user needs. The Standards Committee will assess the trade-off between the upfront cost of adoption and the long-term value of the data products, algorithms, and tools that will be developed as part of the TERRA program. The specifications for these data products will be developed iteratively over the course of the project in coordination with TERRA funded projects. The focus will be to take advantage of existing tools based on these standards, and to develop data translation interfaces where necessary.

See also

Technical Documentation

This section includes the following:

Agronomic and Phenotype Data Standards
Environmental Data Standards
Genomic Data Standards
Sensor Data Standards


    Quality Assurance and Quality Control

• Logging

• Automated checks

• Visualizations

• Testing and continuous integration framework

• Checking that scans align with plots

See also:

• https://github.com/terraref/computing-pipeline/issues/76
• https://github.com/terraref/computing-pipeline/issues/153

    Data Collection

    Maricopa Agricultural Center, Arizona

• The LemnaTec Scanalyzer Field Gantry System

      • Sensor missions

      • Scientific Motivation

      • What sensors, how often etc.

    • Tractor

    • UAV

    • Manually Collected Field Data

    Automated controlled-environment phenotyping, Missouri

    The Scanalyzer 3D platform consists of multiple digital imaging chambers connected to the Conviron growth house by a conveyor belt system, resulting in a continuous imaging loop. Plants are imaged from the top and/or multiple sides, followed by digital construction of images for analysis.

    • RGB imaging allows visualization and quantification of plant color and structural morphology, such as leaf area, stem diameter and plant height.

    • NIR imaging enables visualization of water distribution in plants in the near infrared spectrum of 900–1700 nm.

• Fluorescent imaging uses red light excitation to visualize chlorophyll fluorescence between 680–900 nm. The system is equipped with a dark adaptation tunnel preceding the fluorescent imaging chamber, allowing the analysis of photosystem II efficiency.

    Capturing images

[Video overview of the capture system]

The LemnaTec software suite is used to program and control the Scanalyzer platform, analyze the digital images, and mine the resulting data. Data and images are saved and stored on a secure server for further review or reanalysis.

You can read more about the Danforth Plant Sciences Center Bellwether Foundation Phenotyping Facility on the DDPSC website.

    Kansas State University

    HudsonAlpha - Genomics

    Genomic Data Standards

    Overview

Genomic data have reached a high level of standardization in the scientific community. Today, high-impact journals typically ask authors to deposit their genomic data in a recognized public database before publication.

    Below are the most widely accepted formats that are relevant to the data and analyses generated in TERRA-REF.

    Raw reads + quality scores

Raw reads + quality scores are stored in FASTQ format. FASTQ files can be manipulated for QC with the FASTX-Toolkit.
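For illustration only, here is a minimal Python sketch (not part of any TERRA-REF tool) that parses FASTQ records and flags reads whose mean Phred quality falls below a threshold, assuming the standard Sanger/Illumina 1.8+ encoding (ASCII offset 33) and the hypothetical filename SampleA_R1.fastq:

    # Minimal FASTQ reader; illustrative, assumes 4-line records and
    # Phred+33 quality encoding.
    def read_fastq(path):
        """Yield (header, sequence, quality) tuples from a FASTQ file."""
        with open(path) as fh:
            while True:
                header = fh.readline().strip()
                if not header:
                    break
                seq = fh.readline().strip()
                fh.readline()  # '+' separator line
                qual = fh.readline().strip()
                yield header, seq, qual

    def mean_phred(qual, offset=33):
        """Mean Phred score of a quality string."""
        return sum(ord(c) - offset for c in qual) / len(qual)

    # Example: flag reads with mean quality below Q20.
    for header, seq, qual in read_fastq("SampleA_R1.fastq"):
        if mean_phred(qual) < 20:
            print(header, "below Q20")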

    Reference genome assembly

Reference genome assemblies (for alignment of reads or BLAST) are in FASTA format. FASTA files generally need indexing and formatting; this can be done by aligners, BLAST, or other applications that provide built-in commands for this purpose.

    Sequence alignment

Sequence alignments are in BAM format; in addition to the nucleotide sequence, the BAM format contains fields that describe mapping and read quality. BAM files are binary but can be visualized with IGV. If needed, BAM can be converted to SAM (a text format) with SAMtools.

BAM is the preferred format for the Sequence Read Archive (SRA) database.

    SNP and genotype variants

SNP and genotype variants are in VCF format. VCF contains all information about read mapping and SNP and genotype calling quality. VCF files are typically manipulated with VCFtools.

VCF is also the format required by dbSNP, the largest public repository of SNPs.

    Genomic coordinates

Genomic coordinates are given in BED format, which gives the start and end positions of a feature in the genome (BED coordinates are zero-based and half-open, so a single nucleotide is represented with end = start + 1). BED files can be edited with bedtools.
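As a sketch of how BED's interval convention is consumed in practice (bedtools remains the standard tool), here is a small illustrative Python reader; the filename is hypothetical:

    # Illustrative BED reader; BED is tab-separated with at least
    # chrom, chromStart, chromEnd (zero-based, half-open).
    def read_bed(path):
        """Yield (chrom, start, end, name) from a BED file."""
        with open(path) as fh:
            for line in fh:
                if line.startswith(("#", "track", "browser")):
                    continue  # skip comments and track definition lines
                fields = line.rstrip("\n").split("\t")
                chrom, start, end = fields[0], int(fields[1]), int(fields[2])
                name = fields[3] if len(fields) > 3 else None
                yield chrom, start, end, name

    for chrom, start, end, name in read_bed("features.bed"):
        print(f"{name or 'feature'}: {chrom}:{start}-{end} ({end - start} bp)")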

    See Also

• Genomics Data Pipeline

• Genomics Data Products

• https://docs.google.com/document/d/1iP8b97kmOyPmETQI_aWbgV_1V6QiKYLblq1jIqXLJ84/edit#heading=h.3w6iuawxkjl6

• https://github.com/terraref/reference-data/issues/45

    Data Product Creation

    Data Product Levels

Data products are processed at various levels ranging from Level 0 to Level 4. Level 0 products are raw data at full instrument resolution. At higher levels, the data are converted into more useful parameters and formats. These levels are derived from the NASA and NEON data processing conventions.

Level | Description
--- | ---
0 | Reconstructed, unprocessed, full-resolution instrument data; artifacts and duplicates removed.
1a | Level 0 plus time-referencing and annotation with calibration coefficients and georeferencing parameters (Level 0 is fully recoverable from Level 1a data).
1b | Level 1a processed to sensor units (Level 0 not recoverable).
2 | Derived variables (e.g., NDVI, height, fluorescence) at the Level 1 resolution.
3 | Level 2 mapped to a uniform grid; missing points gap-filled; overlapping images combined.
4 | 'Phenotypes': derived variables associated with a particular plant or genotype rather than a spatial location.

See also:

• Earth Observing System Data Processing Levels, NASA
• National Ecological Observatory Network Data Processing


    Directory Structure

    The data processing pipeline transmits data from origination sites to a controlled directory structure on the CyberGIS supercomputer.

The data is generally structured as follows:

    /sites
      /ua-mac
        /raw_data
          /sensor1
            /timestamp
              /dataset
          /sensor2
          ...
        /Level_1
          /extractor1_outputs
          /extractor2_outputs
          ...
      /danforth
        /raw_data
          /sensor3
          ...
        /Level_1
          /extractor3_outputs

...where raw outputs from sensors at each site are stored in a raw_data subdirectory, and corresponding outputs from different extractor algorithms are stored in Level_1 (and eventually Level_2, etc.) subdirectories.

    When possible, sensor directories will be divided into days and then into individual datasets.

    This directory structure is visible when accessing data via the Globus interface.
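As an illustration, a short Python sketch (not an official TERRA-REF utility) that walks this layout and lists the datasets for each sensor and day; the root path is an assumption about the local mount point:

    import os

    SITES_ROOT = "/sites"  # assumed mount point of the sites tree

    def list_datasets(site, level="raw_data"):
        """Yield (sensor, day, dataset_path) triples for one site."""
        base = os.path.join(SITES_ROOT, site, level)
        for sensor in sorted(os.listdir(base)):
            sensor_dir = os.path.join(base, sensor)
            if not os.path.isdir(sensor_dir):
                continue
            for day in sorted(os.listdir(sensor_dir)):
                day_dir = os.path.join(sensor_dir, day)
                if not os.path.isdir(day_dir):
                    continue
                for dataset in sorted(os.listdir(day_dir)):
                    yield sensor, day, os.path.join(day_dir, dataset)

    for sensor, day, path in list_datasets("ua-mac"):
        print(sensor, day, path)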

    Sensor Data Standards

    Current Practice

In the TERRA-REF release, sensor metadata is generally stored and exchanged using formats defined by LemnaTec. Sensor metadata is stored in metadata.json files for each dataset. This information is ingested into Clowder and available via the "Metadata" tab.

Manufacturer information about devices and sensors is available via Clowder in the Devices and Sensors Information collection. This collection includes datasets representing each sensor or calibration target, containing specifications/datasheets, calibration certificates, and associated reference data.


    Fixed metadata

Authoritative fixed sensor metadata is available for each of the sensor datasets. This has been extended to include factory-calibrated spectral response and relative spectral response information. For more information, please see the sensor-metadata repository on GitHub.

    Runtime metadata

    Runtime metadata for each sensor run is stored in the metadata.json files in each sensor output directory.

    Reference data

    Additional reference data is available for some sensors:

    • Factory calibration data for the LabSphere and SphereOptics calibration targets.

    • Relative spectral response (RSR) information for sensors

    • Calibration data for the environmental logger

• Dark/white reference data for the SWIR and VNIR sensors.

    Standardization Efforts

The TERRA-REF team is currently investigating available standards for the representation of sensor information. Preliminary work has been done using OGC SensorML vocabularies in a custom JSON-LD context. For more information, please see the sensor-metadata repository on GitHub.

• metadata.jsonld API endpoint
• Devices and Sensors Information

Data Processing Pipeline

Maricopa Agricultural Center, Arizona

Automated controlled-environment phenotyping, Missouri
Data Standards Committee

The Standards Committee is responsible for defining and advising the development of data products and access protocols for the ARPA-E TERRA program. The committee consists of twelve core participants: one representative from each of the six funded projects and six independent experts. The committee meets virtually each month and in person each year to discuss, develop, and revise data products, interfaces, and computing infrastructure.

Roles and responsibilities

TERRA Project Standards Committee representatives are expected to represent the interests of their TERRA team, their research community, and the institutions for which they work. External participants were chosen to represent specific areas of expertise and will provide feedback and guidance to help make the TERRA platform interoperable with existing and emerging sensing, informatics, and computing platforms.

Specific duties
    • Participate in monthly to quarterly teleconferences with the committee.

    • Provide expert advice.

• Provide feedback from other interested parties.

• Participate in, or send a delegate to, annual two-day workshops.

    Annual Meetings

    If we can efficiently agree on and adopt conventions, we will have more flexibility to use these workshops to train researchers, remove obstacles, and identify opportunities. This will be an opportunity for researchers to work with developers at NCSA and from the broader TERRA informatics and computing teams to identify what works, prioritize features, and move forward on research questions that require advanced computing.

    Project Timeline

    • August 2015: Establish committee, form a data plan

    • January 2016: v0 file standards

    • January 2017: v1 file standards, sample data sets

    • January 2018: mock data cube generator, standardized data products, simulated data

    • January 2019: standardized data products, simulated data

    Data Standards Participants

    • TERRA Project Representatives (6)

    • ARPA-E Program Representatives (2)

    • Board of External Advisors (6)

    (numbers in parentheses are targets, for which we have funding)

    People

Name | Institution | Email
--- | --- | ---
Coordinators | |
David Lee | ARPA-E | david.lee2_at_hq.doe.gov
David LeBauer | UIUC / NCSA | dlebauer_at_illinois.edu
TERRA Project Representatives | |
Paul Bartlett | Near Earth Autonomy | paul_at_nearearthautonomy.com
Jeff White | USDA ALARC | Jeffrey.White_at_ars.usda.gov
Melba Crawford | Purdue | melbac_at_purdue.edu
Mike Gore | Cornell | mag87_at_cornell.edu
Matt Colgan | Blue River | matt.c_at_bluerivert.com
Christer Janssen | Pacific Northwest National Laboratory | georg.jansson_at_pnnl.gov
Barnabas Poczos | Carnegie Mellon | bapoczos_at_cs.cmu.edu
Alex Thomasson | Texas A&M University | thomasson_at_tamu.edu
External Advisors | |
Cheryl Porter | ICASA / AgMIP / USDA |
Shawn Serbin | Brookhaven National Lab | sserbin_at_bnl.gov
Shelly Petroy | NEON | spetroy_at_neoninc.org
Christine Laney | NEON | claney_at_neoninc.org
Carolyn J. Lawrence-Dill | Iowa State | triffid_at_iastate.edu
Eric Lyons | University of Arizona / iPlant | ericlyons_at_email.arizona.edu


    Data Storage

    • Blue Waters Nearline: NCSA 300PB+ Tape Archive (2PB Allocation)

    • ROGER: CyberGIS R&D server for GIS applications, 5PB storage + variety of nodes, including large memory. roger.ncsa.illinois.edu (1PB Allocation)

    Hyperspectral Data

The TERRA hyperspectral data pipeline processes imagery from the hyperspectral camera, along with ancillary metadata. The pipeline converts the "raw" ENVI-format imagery into netCDF4/HDF5 format with (currently) lossless compression that reduces file sizes by ~20%. The pipeline also adds suitable ancillary metadata to make the netCDF image files truly self-describing. At the end of the pipeline, the files are typically [ready for xxx]/[uploaded to yyy]/[zzz].
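To illustrate the conversion target, here is a minimal sketch using the Python netCDF4 dependency mentioned under Installation: it writes an image cube with lossless zlib compression. The variable and dimension names are invented for the example and are not the pipeline's actual schema:

    import numpy as np
    from netCDF4 import Dataset

    data = np.zeros((272, 1024, 1024), dtype="uint16")  # stand-in image cube

    nc = Dataset("example_hyperspectral.nc", "w", format="NETCDF4")
    nc.createDimension("wavelength", data.shape[0])
    nc.createDimension("y", data.shape[1])
    nc.createDimension("x", data.shape[2])

    # zlib=True enables lossless DEFLATE compression; chunking by band
    # keeps per-wavelength reads cheap.
    var = nc.createVariable(
        "exposure", "u2", ("wavelength", "y", "x"),
        zlib=True, complevel=4,
        chunksizes=(1, data.shape[1], data.shape[2]),
    )
    var[:] = data
    var.long_name = "raw sensor exposure (illustrative)"
    nc.close()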

Installation

Software dependencies

The pipeline currently depends on three prerequisites, including the netCDF Operators (NCO, http://nco.sf.net) and the Python netCDF4 library.

    Pipeline source code

Once the prerequisite libraries above have been installed, the pipeline itself may be installed by checking out the TERRA-REF computing-pipeline repository. The relevant scripts for hyperspectral imagery are:

• Main script: terraref.sh
• JSON metadata → netCDF4 script: JsonDealer.py

    Setup

The pipeline works with input from any location (directories, files, or stdin). Supply the raw image filename(s) (e.g., meat_raw), and the pipeline derives the ancillary filename(s) from this (e.g., meat_raw.hdr, meat_metadata.json). When a directory is specified without a specific filename, the pipeline processes all files with the suffix "_raw".

    mkdir ~/terraref
    cd ~/terraref
    git clone git@github.com:terraref/computing-pipeline.git
    git clone git@github.com:terraref/documentation.git

    Run the Hyperspectral Pipeline

    terraref.sh -i ${DATA}/terraref/foo_raw -O ${DATA}/terraref
    terraref.sh -I /projects/arpae/terraref/raw_data/lemnatec_field -O /projects/arpae/terraref/outputs/lemnatec_field


    Data Transfer

Maricopa Agricultural Center, Arizona

Environmental Sensors: Log of files transferred from Arizona to NCSA

Transferring images

Data is sent to the gantry-cache server, located inside the main UA-MAC building's telecom room, via FTP over a private 10GbE interface. The path of each file being transferred is logged to /var/log/xferlog. A Docker container running on the gantry-cache reads through this log file, tracking the last line it has read, and rescans the file regularly looking for new lines. File paths are scraped from the log and bundled into groups of 500 to be transferred, via the Globus Python API, to the Spectrum Scale file system that backs the ROGER cluster at NCSA. The log file is rolled daily and compressed to keep its size in check. Sensor directories on the gantry-cache are whitelisted for monitoring, to prevent accidental or junk data from being ingested into the Clowder pipeline.
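A simplified Python sketch of that log-scraping loop follows; the production service runs in a Docker container and submits transfers through the Globus Python API, so globus_submit below is a stand-in for that call, and the field position assumes the standard wu-ftpd xferlog format:

    import time

    XFERLOG = "/var/log/xferlog"
    BATCH_SIZE = 500

    def globus_submit(paths):
        """Stand-in for a Globus transfer submission."""
        print(f"submitting {len(paths)} files")

    def follow(path):
        """Yield lines appended to a log file, polling for new ones."""
        with open(path) as fh:
            while True:
                line = fh.readline()
                if line:
                    yield line
                else:
                    time.sleep(5)

    batch = []
    for line in follow(XFERLOG):
        fields = line.split()
        if len(fields) < 9:
            continue
        batch.append(fields[8])  # filename field in the wu-ftpd xferlog format
        if len(batch) >= BATCH_SIZE:
            globus_submit(batch)
            batch = []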

A Docker container in the terra-clowder VM, running in ROGER's OpenStack environment, is notified of incoming transfers and watches for their completion; once complete, the same files are queued to be ingested into Clowder.

Once files have been successfully received by the ROGER Globus endpoint, they are removed from the gantry-cache server by the Docker container running there. A clean-up script walks the gantry-cache daily, looking for files older than two days that have not been transferred, and queues any it finds.

    Automated controlled-environment phenotyping, Missouri

    Transferring images

    Processes at Danforth monitor the database repository where images captured from the Scanalyzer are stored. After initial processing, files are transferred to NCSA servers for additional metadata extraction, indexing and storage.

At two points in the processing pipeline, metadata derived from collected data is inserted into BETYdb:

• At the start of the transfer process, metadata collected and derived during Danforth's initial processing will be pushed.

• After transfer to NCSA, extractors running in Clowder will derive further metadata that will be pushed. This is a subset of the metadata that will also be stored in Clowder's database. The complete metadata definitions are still being determined, but will likely include:

  • plant identifiers

  • experiment and experimenter

  • plant age, date, growth medium, and treatment

  • camera metadata

The current "beta" Python script can be viewed on GitHub. During transfer tests of data from Danforth's sorghum pilot experiment, 2,725 snapshots containing 10 images each were uploaded in 775 minutes (3.5 snapshots/minute).

    Transfer volumes

    The Danforth Center transfers approximately X GB of data to NCSA per week.

    Kansas State University

    HudsonAlpha - Genomics

    Agronomic and Phenotype Data Standards

    Current Practice

In the TERRA-REF v0 release, agronomic and phenotype data is stored and exchanged using the BETYdb API. Agronomic data is stored in the sites, managements, and treatments tables. Phenotype data is stored in the traits, variables, and methods tables. Data is ingested and accessed via the BETYdb API formats.

    Standardization Efforts

In cooperation with participants from the AgMIP, Crop Ontology, and Agronomy Ontology groups, the TERRA-REF team is pursuing the development of a format to facilitate the exchange of data across systems, based on the ICASA Vocabulary and AgMIP JSON Data Objects. An initial draft of this format is available for comment on Github.

In addition, we plan to enable the TERRA-REF databases to import and export data via the Plant Breeding API (BRAPI).


    Genomic Data

    Danforth Center genomics pipeline

Outlined below are the steps taken to create a raw VCF file from paired-end raw FASTQ files. This was done for each sequenced accession, so an HTCondor DAG workflow was written to streamline the processing of those ~200 accessions. While some CPU and memory parameters have been included in the example steps below, those parameters varied from sample to sample, and the workflow has been tuned to accommodate that variation. This pipeline is subject to modification based on software updates and changes to software best practices.

Software versions:

• GATK v3.5-0-g36282e4
• VCFtools 0.1.14
• Tassel 5.2.27
• BBDuk2 36.67
• bwa 0.7.12-r1039
• samtools 1.3.1
• picard-tools 2.0.1

Preparing reference genome

Download Sorghum bicolor v3.1 from Phytozome.

Generate:

BWA index:

    bwa index -a bwtsw Sbicolor_313_v3.0.fa

fasta file index:

    samtools faidx Sbicolor_313_v3.0.fa

Sequence dictionary:

    java -jar picard.jar CreateSequenceDictionary R=Sbicolor_313_v3.0.fa O=Sbicolor_313_v3.0.dict

Quality trimming and filtering of paired end reads

    bbduk2 in=SampleA_R1.fastq in2=SampleA_R2.fastq out=SampleA_R1.PE.fastq.gz \
      out2=SampleA_R2.PE.fastq.gz k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=20 \
      minlen=20 rref=adapters_file.fa lref=adapters_file.fa

Aligning reads to the reference

    bwa mem -M \
      -R "@RG\tID:SAMPLEA_RG1\tPL:illumina\tPU:FLOWCELL_BARCODE.LANE.SAMPLE_BARCODE_RG_UNIT\tLB:libraryprep-lib1\tSM:SAMPLEA" \
      Sbicolor_313_v3.0.fa SampleA_R1.PE.fastq.gz SampleA_R2.PE.fastq.gz > SAMPLEA.bwa.sam

Convert and Sort bam

    samtools view -bS SAMPLEA.bwa.sam | samtools sort - SAMPLEA.bwa.sorted

Mark Duplicates

    java -Xmx8g -jar picard.jar MarkDuplicates MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 \
      REMOVE_DUPLICATES=true INPUT=SAMPLEA.bwa.sorted.bam OUTPUT=SAMPLEA.dedup.bam \
      METRICS_FILE=SAMPLEA.dedup.metrics

Index bam files

    samtools index SAMPLEA.dedup.bam

Find intervals to analyze

    java -Xmx8g -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
      -R Sbicolor_313_v3.0.fa -I SAMPLEA.dedup.bam -o SAMPLEA.realignment.intervals

Realign

    java -Xmx8g -jar GenomeAnalysisTK.jar -T IndelRealigner -R Sbicolor_313_v3.0.fa \
      -I SAMPLEA.dedup.bam -targetIntervals SAMPLEA.realignment.intervals -o SAMPLEA.dedup.realigned.bam

Variant Calling with GATK HaplotypeCaller

    java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Sbicolor_313_v3.0.fa \
      -I SAMPLEA.dedup.realigned.bam --emitRefConfidence GVCF --pcr_indel_model NONE \
      -o SAMPLEA.output.raw.snps.indels.g.vcf

Above this point is the workflow for the creation of the gVCF files for this project. The following additional steps were used to create the HapMap file.

Combining gVCFs with GATK CombineGVCFs

NOTE: This project has 363 gVCFs; multiple instances of CombineGVCFs, each with a unique subset of gVCF files, were run in parallel to speed up this step. Below are examples:

    java -Xmx8g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R Sbicolor_313_v3.0.fa \
      -V SAMPLEA.output.raw.snps.indels.g.vcf --variant SAMPLEB.output.raw.snps.indels.g.vcf \
      -V SAMPLEC.output.raw.snps.indels.g.vcf -o SamplesABC_combined_gvcfs.vcf

    java -Xmx8g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R Sbicolor_313_v3.0.fa \
      --variant SAMPLED.output.raw.snps.indels.g.vcf -V SAMPLEE.output.raw.snps.indels.g.vcf \
      -V SAMPLEF.output.raw.snps.indels.g.vcf -o SamplesDEF_combined_gvcfs.vcf

Joint genotyping on gVCF files with GATK GenotypeGVCFs

    java -Xmx8g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R Sbicolor_313_v3.0.fa \
      -V SamplesABC_combined_gvcfs.vcf -V SamplesDEF_combined_gvcfs.vcf -o all_combined_Genotyped_lines.vcf

Applying hard SNP filters with GATK VariantFiltration

    java -Xmx8g -jar GenomeAnalysisTK.jar -T VariantFiltration -R Sbicolor_313_v3.0.fa \
      -V all_combined_Genotyped_lines.vcf -o all_combined_Genotyped_lines_filtered.vcf \
      --filterExpression "QD < 2.0" --filterName "QD" --filterExpression "FS > 60.0" \
      --filterName "FS" --filterExpression "MQ < 40.0" --filterName "MQ" --filterExpression "MQRankSum < -12.5" \
      --filterName "MQRankSum" --filterExpression "ReadPosRankSum < -8.0" --filterName "ReadPosRankSum"

Filter and recode VCF with VCFtools

    vcftools --vcf all_combined_Genotyped_lines_filtered.vcf --min-alleles 2 --max-alleles 2 \
      --out all_combined_Genotyped_lines_vcftools.filtered.recode.vcf --max-missing 0.2 --recode

Adapt VCF for use with Tassel5

    tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx265G -SortGenotypeFilePlugin \
      -inputFile all_combined_Genotyped_lines_vcftools.filtered.recode.vcf \
      -outFile all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -fileType VCF

Convert VCF to Hapmap with Tassel5

    tassel-5-standalone/run_pipeline.pl -Xms75G -Xmx290G -fork1 -vcf \
      all_combined_Genotyped_lines_vcftools.filtered.recode.sorted.vcf -export -exportType Hapmap -runfork1

    Geospatial Time Series Structure

    Several extractors push data to the Clowder Geostreams API, which allows registration of data streams that accumulate datapoints over time. These streams can then be queried, visualized and downloaded to get time series of various measurements across plots and sensors.

    TERRA-REF organizes data into three levels:

    • Location (e.g. plot, or a stationary sensor)

      • Information stream (a particular instrument's data, or a subset of one instrument's data)

        • Datapoint (a single observation from the information stream at a particular point in time)
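A hedged Python sketch of pulling one stream's time series follows; the endpoint path and parameters reflect common Clowder Geostreams conventions and should be verified against the running instance (the host and stream id here are assumptions):

    import requests

    BASE = "https://terraref.ncsa.illinois.edu/clowder"  # assumed host

    def get_datapoints(stream_id, since=None, until=None):
        """Fetch datapoints for one stream, optionally bounded in time."""
        params = {"stream_id": stream_id}
        if since:
            params["since"] = since  # e.g. "2016-06-01T00:00:00Z"
        if until:
            params["until"] = until
        r = requests.get(f"{BASE}/api/geostreams/datapoints", params=params)
        r.raise_for_status()
        return r.json()

    for dp in get_datapoints(stream_id=42, since="2016-06-01T00:00:00Z"):
        print(dp.get("start_time"), dp.get("properties"))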

    Sensor destinations

The various streams used in the pipeline and their contents are listed here.

    • Location group

      • Stream name

        • Datapoint property [units / sample value]

    ...
  • Full Field (Environmental Logger)

    • Weather Observations

      • sunDirection [degrees / 358.4948271126]

      • airPressure [hPa / 1014.1764580218]

      • brightness [kilo Lux / 1.0607318339]

      • relHumidity [relHumPerCent / 19.3731498154]

      • temperature [DegCelsuis / 17.5243385113]

      • windDirection [degrees / 176.7864009522]

      • precipitation [mm/h / 0.0559327677]

      • windVelocity [m/s / 3.4772789697]

      • raw values shown here; check if extractor converts to SI units

    • Photosynthetically Active Radiation

      • par [umol/(m^2*s) / 0]

    • co2 Observations

      • co2 [ppm / 493.4684409718]

• Spectrometer Observations

  • maxFixedIntensity [16383]

  • integration time in us [5000]

  • wavelength [long array of decimals]

  • spectrum [long array of decimals]

  • AZMET Maricopa Weather Station

    • Weather Observations

      • wind_speed [1.089077491]

      • eastward_wind [-0.365913231]

      • northward_wind [-0.9997966834]

      • air_temperature [Kelvin/301.1359779]

      • relative_humidity [60.41579336]

      • preciptation_rate [0]

      • surface_downwelling_shortwave_flux_in_air [43.60608856]

      • surface_downwelling_photosynthetic_photon_flux_in_air [152.1498155]

    • Irrigation Observations

      • flow [gallons / 7903]

  • UIUC Energy Farm - CEN

  • UIUC Energy Farm - NE

  • UIUC Energy Farm - SE

    • Energy Farm Observations - CEN/NE/SE

      • wind_speed

      • eastward_wind

      • northward_wind

      • air_temperature

      • relative_humidity

      • preciptation_rate

      • surface_downwelling_shortwave_flux_in_air

      • surface_downwelling_photosynthetic_photon_flux_in_air

      • air_pressure

  • PLOT_ID e.g. Range 51 Pass 2 (each plot gets a separate location group)

    • sensorName - Range 51 Pass 2 (each sensor gets a separate stream within the plot)

      • fov [polygon geometry]

      • centroid [point geometry]

    • canopycover - Range 51 Pass 2

      • canopy_cover [height/0.294124289126]


Existing Data Standards

This page summarizes existing standards, conventions, controlled vocabularies, and ontologies used for the representation of crop physiological traits, agronomic metadata, sensor output, genomics, and other information related to the TERRA-REF project.

Metadata standards

International Consortium for Agricultural Systems Applications (ICASA)

The ICASA Version 2.0 data standard defines an abstract model and data dictionary for the representation of agricultural field experiments. ICASA is explicitly designed to support implementations in a variety of formats, including plain text, spreadsheets, or structured formats. It is important to note that ICASA is both the data dictionary and a format used to describe experiments.

The Agricultural Model Intercomparison and Improvement Project (AgMIP) has developed a JSON-based format for use with the AgMIP Crop Experiment (ACE) database and API.

Currently, the ICASA data dictionary is represented as a Google Spreadsheet and is not suitable for linked-data applications. The next step is to render ICASA in RDF for the TERRA-REF project. This will allow TERRA-REF to produce data that leverages the ICASA vocabulary as well as other external or custom vocabularies in a single metadata format.

The ICASA data dictionary is also being mapped to various ontologies as part of the Agronomy Ontology project. With this, it may be possible in the future to represent ICASA concepts using formal ontologies or to create mappings/crosswalks between them.

See also:

• White et al (2013). Integrated Description of Agricultural Field Experiments and Production: The ICASA Version 2.0 Data Standards. Computers and Electronics in Agriculture.
• AgMIP JSON Data Objects format description
• ICASA Master Variable List

Minimum Information About a Plant Phenotyping Experiment (MIAPPE)

MIAPPE was developed by members of the European Phenotyping Network (EPPN) and the EU-funded transPLANT project. It is intended to define a list of attributes necessary to fully describe a phenotyping experiment.

The MIAPPE standard is available from the transPLANT standards portal and is compatible with the ISA-Tools framework. The transPLANT standards portal also provides an example configuration for the ISA toolset.

MIAPPE is currently the only standard listed in biosharing.org for the phenotyping domain. While several databases claim to support MIAPPE, the standard is still nascent.

MIAPPE is based on the ISA framework, building on earlier "minimum information" standards, such as MIAME (Minimum Information About a Microarray Experiment). If the MIAPPE standard is determined to be useful for TERRA-REF, it would be worth reviewing the MIAME standard and related formats, such as MAGE-TAB, MINiML, and SOFT, accepted by the Gene Expression Omnibus (GEO). GEO is a long-standing repository for genetic research data and might serve as another model for TERRA-REF.

It is worth noting that linked-data methods are supported but optional when depositing data to GEO. The MAGE-TAB format, similar to the MIAPPE ISA-Tab format, does support sources for controlled vocabulary terms or ontologies.

See also:

• Minimum Information about a Plant Phenotyping Experiment

Dublin Core Application Profiles

While some communities define explicit metadata schemas (e.g., Ecological Metadata Language), another approach is the use of "application profiles." An application profile is a declaration of metadata terms adopted by a community or an organization, along with the source of the terms. Application profiles are composed of terms drawn from multiple vocabularies or ontologies to define a "schema" or "profile" for metadata. For example, the Dryad metadata profile draws on the Dublin Core, Darwin Core, and Dryad-specific elements.

See also:

• DCMI Guidelines for Dublin Core Application Profiles
• Example: Dryad Metadata Profile
• DCMI Singapore Framework

Trait Dictionary Format (Crop Ontology)

The Crop Ontology curation tool supports import and export of trait information in a trait dictionary format.

See also:

• The Crop Ontology: Improving the Quality of 18 Crop Trait Dictionaries

Vocabularies and Ontologies

This section reviews related controlled vocabularies, data dictionaries, and ontologies.

Biofuel Ecophysiological Traits and Yields Database (BETYdb)

While BETYdb is not a controlled vocabulary itself, its relational schema models a variety of concepts, including managements, sites, treatments, traits, and yields.

The BETYdb "variables" table defines the variables used to represent traits in the BETYdb relational model. There has been some effort to standardize variable names by adopting Climate and Forecast (CF) standard names where variables overlap. A variable is represented as a name, description, and units, as well as min/max values.

For example:

    "variable": {
        "created_at": "2016-03-07T11:23:58-06:00",
        "description": "",
        "id": 604,
        "label": "",
        "max": "1000",
        "min": "0",
        "name": "NDVI",
        "notes": "",
        "standard_name": "normalized_difference_vegetation_index",
        "standard_units": "ratio",
        "type": "",
        "units": "ratio",
        "updated_at": "2016-03-07T11:26:07-06:00"
    }

See also:

• The full suite of variables supported by BETYdb
• Trait variables used in the TERRA Ref BETYdb instance

DCMI Metadata Terms

A controlled vocabulary for the representation of bibliographic information.

See also:

• DCMI Terms

    Climate and Forecast Standard Name Table

    Standard variable names and naming convention for use with NetCDF. The Climate and Forecast metadata conventions are intended to promote sharing of NetCDF files. The CF conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities.

    Basic conventions include lower-case letters, numbers, underscores, and US spelling.

    Information is encoded in the variable name itself. The basic format is (optional components in []):

    [surface] [component] standard_name [at surface] [in medium] [due to process] [assuming condition]

For example: surface_downwelling_shortwave_flux_in_air.

Standard names have optional canonical units, and AMIP and GRIB (GRidded Binary) codes.

The CF standard names have been converted to RDF by several communities, including the Marine Metadata Interoperability (MMI) project.

Dimensions are time, lat, lon, and others; time is specified first (unlimited), followed by lat, lon (or x, y), with extent set to the field boundaries.
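For illustration, a tiny Python check of the lexical convention; this validates form only, not membership in the CF standard name table:

    import re

    # Lower-case letters, digits, and underscores, per the basic CF convention.
    CF_NAME = re.compile(r"^[a-z][a-z0-9_]*$")

    for name in ["surface_downwelling_shortwave_flux_in_air",
                 "Air_Temperature",       # rejected: upper-case letters
                 "precipitation_rate"]:
        print(name, "ok" if CF_NAME.match(name) else "not CF-style")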

See also:

• CF Conventions
• CF Conventions FAQ (mentions RDF conversions)

ICASA master variable list

Vocabulary and naming conventions for agricultural modeling variables, used by AgMIP. The ICASA master variable list is included, at least in part, in the AgrO ontology. The NARDN-HD Core Harmonized Crop Experiment Data is also drawn from the ICASA vocabulary.

ICASA variables have a number of fields, including name, description, type, and min and max values.

See also:

• White et al (2013). Integrated Description of Agricultural Field Experiments and Production: The ICASA Version 2.0 Data Standards. Computers and Electronics in Agriculture.
• ICASA Master Variable List

NARDN-HD Core Harmonized Crop Experiment Data

A subset of the ICASA data dictionary representing a set of core variables that are commonly collected in field crop experiments. These will be used to harmonize data from USDA experiments as part of a National Agricultural Research Data Network.

CSDMS Standard Names

Variable naming rules and patterns applicable to any domain, developed as part of the CSDMS project as an alternative to CF. CSDMS standard names are considered to have a more flexible community approval mechanism than CF. CSDMS names include object and quantity/attribute parts.

CSDMS names have been converted to RDF as part of the Earth Cube Geosemantic Server project.

See also:

• CSDMS Standard Names

International Plant Names Index (IPNI)

IPNI is a database of the names and associated basic bibliographic details of seed plants, ferns, and lycophytes. Its goal is to eliminate the need for repeated reference to primary sources for basic bibliographic information about plant names. See also: http://www.ipni.org/

NCBI Taxonomy

A curated classification and nomenclature for all of the organisms in the public sequence databases, representing about 10% of the described species of life on the planet. This is the taxonomy recommended by MIAPPE. See also: http://www.ncbi.nlm.nih.gov/taxonomy

Ontologies

Agronomy Ontology (AGRO)

The Agronomy Ontology “describes agronomic practices, agronomic techniques, and agronomic variables used in agronomic experiments.” It is intended as a complementary ontology to the Crop Ontology (CO). Variables are selected from the International Consortium for Agricultural Systems Applications (ICASA) vocabulary, and a mapping between AgrO and ICASA is in progress. AgrO is intended to work with existing ontologies including ENVO, UO, PATO, IAO, and CHEBI. It will be part of an Agronomy Management System and fieldbook modeled on the CGIAR Breeding Management System to capture agronomic data.

See also:

• Agronomy Ontology, OBO Foundry
• Crop Ontology: harmonizing semantics for phenotyping and agronomy data, FAO
• Interest Group on Agricultural Data (IGAD), RDA

Crop Ontology (CO)

The Crop Ontology (CO) contains "Validated concepts along with their inter-relationships on anatomy, structure and phenotype of crops, on trait measurement and methods as well as on Germplasm with the multi-crop passport terms." The ontology is actively used by the CGIAR community and is a central part of the Breeding Management System. MIAPPE recommends the CO (along with TO, PO, PATO, and XEML) for observed variables.

Shrestha et al (2012) describe a method for representing trait data via the CO.

See also:

• Crop Ontology
• Shrestha et al (2012). Bridging the phenotypic and genetic data useful for integrated breeding through a data annotation using the Crop Ontology developed by the crop communities of practice. Front Physiol. 2012 Aug 25;3:326.

Crop Research Ontology (CRO)

Describes the experimental design, environmental conditions, and methods associated with a crop study/experiment/trial and their evaluation. CRO is part of the Crop Ontology platform, originally developed for the International Crop Information System (ICIS). CRO is recommended in the MIAPPE standard for general metadata, environment, treatments, and experimental design fields.

See also:

• Crop Research Ontology
• International Crop Information System

Extensible Observation Ontology (OBOE)

Cited in Kattge et al (2011) as an example of an ontology used in ecology and the environmental sciences to represent measurements and observations. However, the CRO may be better suited for TERRA-REF.

See also:

• Kattge, J. (2011). A generic structure for plant trait databases.

Gene Ontology (GO)

Defines concepts/classes used to describe gene function and relationships between these concepts. GO is a widely adopted ontology in genetics research, supported by databases such as GEO. This ontology is cited in Krajewski et al (2015) and might be relevant for the TERRA genomics pipeline.

See also:

• Gene Ontology
• Krajewski et al (2015). Towards recommendations for metadata and data handling in plant phenotyping. Journal of Experimental Botany, 66(18), 5417–5427.

Information Artifact Ontology (IAO)

Information entities, originally driven by work by OBI (e.g., abstract, author, citation, document, etc.). IAO covers similar territory to the Dublin Core vocabulary.

Ontology for Biomedical Investigations (OBI)

An integrated ontology for the description of biological and clinical investigations. This includes a set of 'universal' terms that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. Recommended by MIAPPE for general metadata, timing and location, and experimental design.

See also:

• Minimum Information about a Plant Phenotyping Experiment

Phenotype and Attribute Ontology (PATO)

Phenotypic qualities (properties). Recommended in MIAPPE for use in the observed values field.

See also:

• Minimum Information about a Plant Phenotyping Experiment

Plant Environment Ontology (EO)

Part of the Plant Ontology (PO): standardized controlled vocabularies to describe the various types of treatments given to an individual plant, a population, or a cultured tissue and/or cell type sample, in order to evaluate the response to the exposure.

Plant Ontology (PO)

Describes plant anatomy and morphology and stages of development for all plants, intended to create a framework for meaningful cross-species queries across gene expression and phenotype data sets from plant genomics and genetics experiments. Recommended by MIAPPE for observed values fields. Along with EO, GO, and TO, it makes up the Gramene database, linking plant anatomy, morphology, growth, and development to plant genomics data.

See also:

• Minimum Information about a Plant Phenotyping Experiment

Plant Trait Ontology (TO)

Along with EO, GO, and PO, TO makes up the Gramene database, linking plant anatomy, morphology, growth, and development to plant genomics data. Recommended by MIAPPE for observed values fields.

Example trait entry:

    [Term]
    id: TO:0000019
    name: seedling height
    def: "Average height measurements of 10 seedlings, in centimeters from the base of the shoot to the tip of the tallest leaf blade." [IRRI:SES]
    synonym: "SH" RELATED []
    is_a: TO:0000207 ! plant height

See also:

• Minimum Information about a Plant Phenotyping Experiment

Statistics Ontology (STATO)

A general-purpose statistics ontology covering processes such as statistical tests, their conditions of application, and information needed by or resulting from statistical methods, such as probability distributions, variables, and spread and variation metrics. Recommended by MIAPPE for experimental design.

See also:

• Minimum Information about a Plant Phenotyping Experiment

Units of Measurement Ontology (UO)

Metric units for PATO. This OBO ontology defines a set of prefixes (giga, hecto, kilo, etc.) and units (area/square meter, volume/liter, rate/count per second, temperature/degree Fahrenheit). The two top-level classes are prefixes and units.

UO is mentioned in relation to the Agronomy Ontology (AGRO), but PATO is also recommended by MIAPPE for observed values fields.

While there are generally standard units, it seems unlikely that these would ever be gathered in a single place. It seems more useful to define a high-level ontology to represent a "unit" and allow domains and communities to publish their own authoritative lists.

    XEML Environment Ontology (XEO)

    Created to help plant scientists in documenting and sharing metadata describing the abiotic environment.

DDI-RDF Discovery Vocabulary

Data Catalog Vocabulary (DCAT)

The Data Catalog Vocabulary (DCAT) is an RDF vocabulary intended to facilitate interoperability between data catalogs published on the Web. DCAT defines a set of classes including Dataset, Catalog, CatalogRecord, and Distribution.

Data Cite Ontology

The DataCite Ontology provides an RDF representation of the DataCite metadata schema for describing and citing data.

Data Cube Vocabulary

The Data Cube Vocabulary is an RDF-based model for publishing multi-dimensional datasets, based in part on the SDMX guidelines. DataCube defines a set of classes including DataSet, Observation, and MeasureProperty that may be relevant to the TERRA project.

Statistical Data and Metadata Exchange (SDMX)

SDMX is an international initiative for the standardization of the exchange of statistical data and metadata among international organizations. Sponsors of the initiative include Eurostat, the European Central Bank, the OECD, the World Bank, and the UN Statistical Division. They have defined a framework and an exchange format, SDMX-ML, for data exchange. Community members have also developed RDF encodings of the SDMX guidelines that are heavily referenced in the Data Cube vocabulary examples.

Related Software, Services, and Databases

Standard formats, ontologies, and controlled vocabularies are typically used in the context of specific software systems.

Agricultural Model Inter-Comparison and Improvement Project (AgMIP) Crop Experiment (ACE) Database

AgMIP "seeks to improve the capability of ecophysiological and economic models to describe the potential impacts of climate change on agricultural systems. AgMIP protocols emphasize the use of multiple models; consequently, data harmonization is essential. This interoperability was achieved by establishing a data exchange mechanism with variables defined in accordance with international standards; implementing a flexibly structured data schema to store experimental data; and designing a method to fill gaps in model-required input data."

The data exchange format is based on a JSON rendering of the ICASA Master Variable List. Data are transferred into and out of the AgMIP Crop Experiment (ACE) and AgMIP Crop Model (ACMO) databases via REST APIs using these JSON objects.
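For a feel of what such an object looks like, here is a hypothetical Python sketch; the field names (exname, pdate, plpop) are drawn from ICASA abbreviations, but the exact ACE structure should be checked against the format description cited below:

    import json

    # Hypothetical experiment object using ICASA-style variable names.
    experiment = {
        "exname": "MAC Sorghum Season 1",   # experiment name (ICASA EXNAME)
        "institution": "University of Arizona",
        "events": [
            {"event": "planting", "pdate": "20160420", "plpop": "30"},
        ],
    }

    print(json.dumps(experiment, indent=2))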

See also:

• Porter et al (2014). Harmonization and translation of crop modeling data to ensure interoperability. Environmental Modelling and Software, 62:495-508.
• AgMIP Data Products presentation
• AgMIP on Github
• AgMIP Crop Experiment Database data variables
• AgMIP API
• AgMIP using ICASA standards

Biofuel Ecophysiological Traits and Yields Database (BETYdb)

BETYdb is used to store TERRA metadata, provenance, and trait information.

BETYdb traits are available as web pages, CSV, JSON, and XML. This can be extended to allow spatial, temporal, and taxonomic/genomic queries. Trait vectors can be queried and rendered in several output formats (HTML, CSV, XML, and JSON-compatible output); examples are available from betydb.org.

A separate instance of BETYdb is maintained for use by TERRA Ref at terraref.ncsa.illinois.edu/bety. The scope of the TERRA Ref database is limited to high-throughput phenotyping data and metadata produced and used by the TERRA program. Users can set up their own instances of BETYdb and import any public data in the distributed BETYdb network.
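A hedged Python example of querying the API follows; the /api/v1 endpoint style and the key parameter follow the BETYdb API documentation, but should be checked against the instance you use:

    import requests

    BASE = "https://terraref.ncsa.illinois.edu/bety"  # or another instance

    def get_variable(name, api_key):
        """Look up a variable definition (e.g., NDVI) by name as JSON."""
        r = requests.get(
            f"{BASE}/api/v1/variables.json",
            params={"name": name, "key": api_key},
        )
        r.raise_for_status()
        return r.json()

    print(get_variable("NDVI", api_key="YOUR_KEY"))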

See also, from the BETYdb documentation:

• BETYdb Data Access: covers accessing data with the web interface, the API, and the R traits package
• BETYdb constraints: see the section "uniqueness constraints"
• BETYdb Data Entry

Gramene

Gramene is a curated, open-source, integrated data resource for comparative functional genomics in crops and model plant species.

Integrated Breeding Platform/Breeding Management System

A system for managing the breeding process, including lists of germplasm, defining crosses, and managing nurseries and trials, as well as ontologies and statistical analysis.

See also:

• BMS Site

TERRA Ref has an instance of BMS hosted by CyVerse (requires login).

International Crop Information System

ICIS is "a database system that provides integrated management of global information on crop improvement and management both for individual crops and for farming systems." ICIS is developed by the Consultative Group on International Agricultural Research (CGIAR).

See also:

• Fox and Skovmand (1996). "The International Crop Information System (ICIS) – connects genebank to breeder to farmer’s field." Plant adaptation and crop improvement, CAB International.

MODAPS NASA MODIS Satellite data

MODAPS provides a library of functions for programmatic data access and processing services against MODIS Level 1 and Atmosphere data products. These routines enable both SOAP- and REST-based web service calls against the data archives maintained by MODAPS, and mirror existing LAADS Web services.

See also:

• MODAPS NASA MODIS Satellite
• NSIDC MODIS Data Summaries

Phenomics Ontology Driven Database (PODD)

An online repository for storage and retrieval of raw and analyzed data from Australian Plant Phenomics Facility (APPF) phenotyping platforms. PODD is based on the Fedora Commons repository software, with data and metadata modeled using OWL/RDFS.

See also:

• http://www.plantphenomics.org.au/projects/podd/
• PODD Project Site

Plant Breeders API

Specifies a standard interface for plant phenotype/genotype databases to serve data for use in crop breeding applications. This is the API used by FieldBook, which allows users to turn spreadsheets into databases. Examples indicate that responses will include values linked to the Crop Ontology.

However, in general BRAPI returns JSON data without linking context (i.e., not JSON-LD), so it is in essence its own data structure.

Other notes:

• The Breeding Management System (BMS) group has implemented a few features to make it compatible with Field Book in its current state without the use of the API.

• BMS and the Genomic & Open-source Breeding Informatics Initiative (GOBII) are both pushing for the API and plan on implementing it when it is complete.

• Read news about the BMS Breeding Management System Standalone Server and genomes2fields migrating to BMS.

See also:

• Plant Breeding API
• https://github.com/plantbreeding/API/blob/master/Specification/Traits/ListAllTraits.md

Plant Genomics and Phenomics Research Data Repository (PGP)

A German repository for plant research data, including image collections from plant phenotyping and microscopy, unfinished genomes, genotyping data, visualizations of morphological plant models, data from mass spectrometry, as well as software and documents.

See also:

• Arend et al (2016). PGP repository: a plant phenomics and genomics data publication infrastructure. Database.
• PGP Repository

USDA Plants

“The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories. It includes names, plant symbols, checklists, distributional data, species abstracts, characteristics, images, crop information, automated tools, onward Web links, and references.”

See also:

• USDA Plants Website

USDA Quick Stats

A web-based application that supports querying the agricultural census and survey statistics. Also available via an API.

See also:

• USDA Quick Stats Website

transPLANT

Infrastructure to support computational analysis of genomic data from crop and model plants. This includes the large-scale analysis of genotype-phenotype associations, a common set of reference plant genomic data, archiving of genomic variation, and a search engine integrating reference bioinformatics databases and physical genetic materials.

See also:

• transPlant Website

Sensor Data

Meteorological data

Multi-scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP) data formats

• One implementation of CF for ecosystem model drivers (met, soil) and outputs (mass and energy dynamics)

  • Standardized met driver data

See also:

• Proposed format for meteorological variables exported from the Lemnatec platform
• Terrestrial Ecosystem Model output

Date-Time:

YYYY-MM-DD hh:mm:ssZ, based on ISO 8601. An offset for local time is optional; precision is determined by the data (e.g., a value could be YYYY-MM-DD only, with fractional seconds specified after a period).
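For example, conforming timestamps can be produced with Python's standard library (a sketch):

    from datetime import datetime, timezone

    t = datetime(2016, 6, 1, 14, 30, 0, tzinfo=timezone.utc)
    print(t.strftime("%Y-%m-%d %H:%M:%SZ"))  # 2016-06-01 14:30:00Z
    print(t.date().isoformat())              # date-only precision: 2016-06-01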


Ontologies recommended by MIAPPE, by section:

Section | Recommended ontologies
--- | ---
General metadata | Ontology for Biomedical Investigations (OBI), Crop Research Ontology (CRO)
Timing and location | OBI, Gazetteer (GAZ)
Biosource | UNIPROT taxonomy, NCBI taxonomy
Environment, treatments | XEO Environment Ontology, Ontology of Environmental Features (ENVO), CRO
Experimental design | OBI, CRO, Statistics Ontology (STATO)
Observed values | Trait Ontology (TO), Plant Ontology (PO), Crop Ontology (CO), Phenotypic Quality Ontology (PATO), XEO/XEML



    Data Backup

Raw data

Backups run nightly on ROGER.

The script is hosted at: /gpfs/smallblockFS/home/malone12/terra_backup

The script uses the Spectrum Scale policy engine to find all files that were modified the day prior and passes that list to a job in the batch system. The job bundles the files into a .tar file, then uses pigz to compress it in parallel across 18 threads. Because the script runs as a batch job, with the date passed as a variable, backups will not preclude one another if the batch system is busy. The .tgz files are then sent to NCSA Nearline using Globus and purged from the file system.
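A simplified Python sketch of the bundle-and-compress step follows (the real script drives this from the Spectrum Scale policy engine and the batch system; the paths are hypothetical):

    import subprocess

    def bundle(file_list, archive):
        """Tar the listed files, then compress with pigz across 18 threads."""
        subprocess.run(["tar", "-cf", archive, "-T", file_list], check=True)
        subprocess.run(["pigz", "-p", "18", archive], check=True)

    bundle("/tmp/modified-yesterday.txt", "/scratch/terra-backup.tar")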

BETYdb

Runs every night at 23:59. View the script.

The script creates a daily backup every day of the month. On Sundays it also creates a weekly backup; on the last day of the month, a monthly backup; and on the last day of the year, a yearly backup. The script overwrites existing backups: for example, every 1st of the month it creates a backup called bety-d-1 that contains the backup of the 1st of that month. See the script for the rest of the file names.

These backups are copied using CrashPlan to a central location and should allow recovery in case of a catastrophic failure.

See Also

• Description of Blue Waters' Nearline storage system: https://bluewaters.ncsa.illinois.edu/data

• Github issues:

  • https://github.com/terraref/computing-pipeline/issues/87
  • https://github.com/terraref/computing-pipeline/issues/384