Introduction

About this book

This book describes the TERRA-REF data collection, computing, and analysis pipelines. The following links provide quick access to

Available Data

About TERRA-REF

The ARPA-E-funded Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform (TERRA-REF) program aims to transform plant breeding by using remote sensing to quantify plant traits such as plant architecture, carbon uptake, tissue chemistry, water use, and other features to predict the yield potential and stress resistance of 300+ diverse Sorghum lines.

The data storage and computing system provides researchers with access to the reference phenotyping data and analytics resources using a high performance computing environment. The reference phenotyping data includes direct measurements and sensor observations, derived plant phenotypes, and genetic and genomic data.

Our objectives are to ensure that the software and data in the reference data and computing pipeline are interoperable, reusable, extensible, and understandable. Providing clear definitions of common formats will make it easier to analyze and exchange data and results.

Versions

The first edition (alpha release) was published November 2016.
The second edition (beta release) will be published November 2018.
The third edition (version 1.0) will be published November 2019.

Data Sources

Field phenotyping research sites

Maricopa Agricultural Center (MAC), Arizona

The Maricopa field site is located at the University of Arizona Maricopa Agricultural Center and USDA Arid Land Research Station in Maricopa, Arizona. At this site, we have deployed the following phenotyping platforms.

The is the largest field crop analytics robot in the world. This high-throughput phenotyping field-scanning robot has a 30-ton steel gantry that autonomously moves along two 200-meter steel rails while continuously imaging the crops growing below it with a diverse array of .
The PhenoTractor is fitted with a sensor frame that supports a real time kinematic (RTK) satellite navigation antenna, a sonar transducer, an infrared temperature (IRT) scanner, and three .

Kansas State University

Tractor - coming 2017
UAV - coming 2017

Controlled-environment phenotyping

Donald Danforth Plant Science Center, Missouri

The is a climate-controlled 70 m² growth house with a conveyor belt system for moving plants to and from fluorescence, color, and near infrared imaging cabinets. This automated, high-throughput platform allows repeated non-destructive time-series image capture and multi-parametric analysis of 1,140 plants in a single experiment.

Genomics

Genomic data includes whole-genome resequencing data from the HudsonAlpha Institute for Biotechnology, Alabama for 384 samples for accessions from the sorghum Bioenergy Association Panel (BAP) and genotyping-by-sequencing (GBS) data from Kansas State University for 768 samples from a population of sorghum recombinant inbred lines (RIL).

Software

TERRA-REF uses a suite of databases and software components that are described below.

Clowder (sensor data and computation management with web user interface)

Clowder is the primary system used to organize, annotate, and process raw data generated by the phenotyping platforms as well as information about sensors. Use Clowder to explore the raw TERRA-REF data, perform exploratory analysis, and develop custom extractors. For more information, see Using Clowder.

Globus Connect (large data transfer)

Raw data is transferred to the primary TERRA-REF compute pipeline using Globus Online. Globus also provides access to TERRA-REF files, but it is not a primary portal, and metadata in Clowder may be required to locate and interpret these files. Use Globus Online when you want to transfer data from the TERRA-REF system for local analysis by accessing the . For more information, see .

BETYdb (phenotype data)

BETYdb is a database and web interface to the trait / phenotype data and agronomic metadata. This is where you can find plant and plot level trait data as well as plot locations and other information associated with agronomic experimental design. Use BETYdb to access derived trait and agronomic data. For more information, see .

Algorithms (a.k.a. 'extractors')

Plant CV

Plant CV is an image processing package specific to plants that is built upon open-source software platforms , , and . Plant CV is used for trait identification; the output is stored in both Clowder and BETYdb.

Other Algorithms

Each step in the pipeline is performed by an algorithm. These are maintained in the TERRA-REF GitHub organization in repositories with names that begin with extractors-* such as .

Analysis Tools

The NDS Workbench enables users to access the large filesystem and databases with familiar development environments. We provide a variety of environments for developing new algorithms and integrating them into the TERRA-REF pipeline. These include and configured for specific use cases such as sensor data processing, trait analysis, database queries, and pipeline development.

CoGe

CoGe contains genomic information and sequence data. For more information, see .

Protocols

The following protocols have been contributed by TERRA-REF team members:

Field Scanner - Coming 2017
Genomics - Coming 2017

A template for documenting protocols .

Controlled Environment Protocols

Abstract

Automated VIS and NIR imaging in a controlled growth environment

Template Protocol

Abstract

Materials

UAV Protocols

Abstract

Materials

Experimental Design

Field phenotyping

Maricopa Agricultural Center (MAC), Arizona
Kansas State University - coming 2017

Controlled-environment phenotyping

Genomics

HudsonAlpha Institute for Biotechnology, Alabama - coming 2017

Experimental Design Danforth

Location: The Automated controlled-environment phenotyping at the Donald Danforth Plant Science Center Bellwether Foundation Phenotyping Facility

The consists of multiple digital imaging chambers connected to the Conviron growth house by a conveyor belt system, resulting in a continuous imaging loop. Plants are imaged from the top and/or multiple sides, followed by digital construction of images for analysis.

RGB imaging allows visualization and quantification of plant color and structural morphology, such as leaf area, stem diameter and plant height.

Sorghum Lines Danforth

Experiment LT1A (TM015)

ATLAS LEOTI PI_144134 PI_145619 PI_145626 PI_145632 PI_145633 PI_146890 PI_147224 PI_152591 PI_152651 PI_152694 PI_152727 PI_152728 PI_152730 PI_152733 PI_152751 PI_152771 PI_152816 PI_152828 PI_152860 PI_152862 PI_152923 PI_152961 PI_152963 PI_152965 PI_152966 PI_152967 PI_152971 PI_153877 PI_154750 PI_154844 PI_154846 PI_154944 PI_154987 PI_154988 PI_155149 PI_155516 PI_155760 PI_155885 PI_156178 PI_156203 PI_156217 PI_156268 PI_156326 PI_156330 PI_156393 PI_156463 PI_156487 PI_156871 PI_156890 PI_157030 PI_157033 PI_157035 PI_157804 PI_167093 PI_170787 PI_175919 PI_176766 PI_179749 PI_180348 PI_181080 PI_181083 PI_195754 PI_196049 PI_196583 PI_196586 PI_196598 PI_197542 PI_19770 PI_213900 PI_217691 PI_218112 PI_221548 PI_221651 PI_226096 PI_22913 PI_229841 PI_251672 PI_253986 PI_255239 PI_255744 PI_257599 PI_257600 PI_266927 PI_267573 PI_273465 PI_273969 PI_276837 PI_297130 PI_297155 PI_297171 PI_302252 PI_303658 PI_329256 PI_329286 PI_329299 PI_329300 PI_329301 PI_329310 PI_329319 PI_329326 PI_329333 PI_329338 PI_329351 PI_329394 PI_329403 PI_329435 PI_329440 PI_329465 PI_329466 PI_329471 PI_329473 PI_329478 PI_329480 PI_329501 PI_329506 PI_329510 PI_329511 PI_329517 PI_329518 PI_329519 PI_329541 PI_329545 PI_329546 PI_329550 PI_329569 PI_329570 PI_329584 PI_329585 PI_329605 PI_329614 PI_329615 PI_329618 PI_329632 PI_329644 PI_329645 PI_329646 PI_329665 PI_329673 PI_329699 PI_329702 PI_329710 PI_329711 PI_329841 PI_329843 PI_329864 PI_329865 PI_330168 PI_330169 PI_330181 PI_330182 PI_330184 PI_330185 PI_330195 PI_330196 PI_330199 PI_330796 PI_330803 PI_330807 PI_330833 PI_330858 PI_337680 PI_337689 PI_35038 PI_365512 PI_452542 PI_452619 PI_452692 PI_453696 PI_455217 PI_455221 PI_455280 PI_455301 PI_455307 PI_505717 PI_505722 PI_505735 PI_506030 PI_506069 PI_506114 PI_506122 PI_508366 PI_510757 PI_511355 PI_513898 PI_514456 PI_521019 PI_521152 PI_521280

Experimental Design Genomics

Whole-genome resequencing

Experimental Design:

were sequenced to an average depth of ~25x.
Shotgun sequencing (127-bp paired-end) was done using an Illumina X10 instrument at the HudsonAlpha Institute for Biotechnology.
Variant calling was done using a at the Danforth Center.
See the page to get access to raw and derived data products.

Genotyping-by-sequencing

Experimental Design:

were sequenced using a GBS approach.

Sorghum Lines Genomics Year 1

see https://gist.github.com/dlebauer/6b7b0e181cc5ae5034b992f725712ba4#file-sorghum-lines-genomics-md

Sorghum Lines Genomics Year 2

User Manual

Overview

This user manual is divided into the following sections:

: A summary of the available data products and the processes used to create them

What Data is Available

Real-time sensor data transfer by file number and size can be viewed here.

See Data Products for more information about individual data products and How to Access Data for instructions to access the data products.

Data Products

The following table lists available TERRA-REF data products. The table will be updated as new datasets are released. Links are provided to pages with detailed information about each data product including sensor descriptions, algorithm (extractor) information, protocols, and data access instructions.

Fluorescence intensity imaging

Summary

Fluorescence intensity data is collected using the PSII camera.

Genomics data

Genomic data includes whole-genome resequencing data from the HudsonAlpha Institute for Biotechnology, Alabama for 384 samples for accessions from the sorghum (BAP) and genotyping-by-sequencing (GBS) data from Kansas State University for 768 samples from a population of sorghum recombinant inbred lines (RIL).

These data are available to Beta Users and require permission to access. The form to sign up for our beta user program is at . Once you have signed up for our beta user program you can access genomics data in one of the following locations:

Download via .

Infrared heat imaging data

Summary

Infrared heat imaging data is collected using the FLIR SC615 thermal sensor. These data are provided as GeoTIFF image raster files as well as plot-level means.

Algorithms are in the repository; see the readme for details.

Multispectral imaging data

Phenotype data

Point Cloud Data

Summary

3D point cloud data is collected using the Fraunhofer 3D laser scanner.

Data access

Data is available via Clowder and Globus.

Clowder:
Globus path: /sites/ua_mac/raw_data/scanner3DTop
Sensor information:

For details about using this data via Clowder or Globus, please see section.

Computational pipeline

Raw sensor output (PLY) is converted to LAS format using the ply2las extractor

Description: PLY data is converted to LAS using the 3D point cloud extractor
Output:
- Clowder: LAS file is added to the dataset

How to Access Data

Overview

TERRA-REF data is available through four different approaches: Globus Connect, Clowder, BETYdb, and CoGe. Raw data is transferred to the primary compute pipeline using Globus Online. Data is ingested into Clowder to support exploratory analysis. The Clowder extractor system is used to transform the data and create derived data products, which are either available via Clowder or published to specialized services, such as BETYdb.

For more information, see the Architecture Documentation.

Clowder

Clowder is the primary system used to organize, annotate, and process raw data generated by the phenotyping platforms as well as information about sensors.

Use Clowder to explore the raw TERRA-REF data, perform exploratory analysis, and develop custom extractors.

For more information, see .

Globus Connect

Raw data is transferred to the primary TERRA-REF compute pipeline on the (ROGER) system using Globus Online. Data is available for Globus transfer via the . Direct access to ROGER is restricted.

Use Globus Online when you want to transfer data from the TERRA-REF system for local analysis.

For more information, see .

BETYdb

BETYdb contains the derived trait data with plot locations and other information associated with agronomic experimental design.

Use BETYdb to access derived trait data.

For more information, see .

CoGe

CoGe contains genomic information and sequence data.

For more information, see .

Other Data

Field protocols
Calibration protocols
Field scanner operational log

Using Clowder (Sensor and Genomics data)

About Clowder

Clowder is an active data repository designed to enable collaboration around a set of shared datasets. TERRA-REF uses Clowder to organize, annotate, and process data generated by phenotyping platforms. Data files are available via the Clowder web interface or API.

See the Clowder documentation for more information about the software and its applications.

Requesting Access

To create an account, sign up at the and wait for your account to be approved. Once access is granted, you can explore collections and datasets.

Data organization

Data is organized into spaces, collections, and datasets.

Spaces contain collections and datasets. TERRA-REF uses one space for each of the phenotyping platforms.
Collections consist of one or more datasets. TERRA-REF collections are organized by acquisition date and sensor. Users can also create their own collections.
Datasets consist of one or more files with associated metadata collected by one sensor at one time point. Users can annotate, download, and use these sensor datasets.
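
The hierarchy described above can be pictured with a small data model. This is only an illustrative sketch: the class names, field names, and example values are assumptions made for this example and do not reflect Clowder's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """One or more files plus metadata from one sensor at one time point."""
    name: str
    files: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Collection:
    """One or more datasets, e.g. grouped by acquisition date and sensor."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Space:
    """Top-level container; one space per phenotyping platform."""
    name: str
    collections: List[Collection] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)

    def all_datasets(self) -> List[Dataset]:
        """Return datasets held directly plus those inside collections."""
        found = list(self.datasets)
        for collection in self.collections:
            found.extend(collection.datasets)
        return found


# Hypothetical example values, not real TERRA-REF names.
scan = Dataset("example scan", files=["left.bin", "right.bin"],
               metadata={"sensor": "example sensor"})
day = Collection("example sensor, example date", datasets=[scan])
platform = Space("example platform", collections=[day])
print([dataset.name for dataset in platform.all_datasets()])
```

Modelling a space as holding both collections and loose datasets mirrors the statement that spaces contain collections and datasets.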

Searching the database

Clowder allows users to search metadata and filter datasets and files with particular attributes. Simply enter your search terms in the search box.

Analyzing data in Clowder

Clowder includes support for launching integrated analysis environments from your browser, including RStudio and Jupyter Notebooks.

After selecting a dataset, under the "Analysis Environment Instances", select the "Launch new instance with dataset" drop-down, select the desired tool, then the "Launch" button. Select the "Environment manager" link to view the list of active instances. Find your instance and select the title link. This will display the tool with the selected dataset mounted. If you have a running instance, you can also "Upload dataset to existing instance".

Clowder Extractors

Through its extractor architecture, Clowder supports automated computational workflows. For more information about developing Clowder extractors, see the documentation.

Using BETYdb (trait data, experimental metadata)

About BETYdb

BETYdb is used to manage and distribute agricultural and ecological data. It contains phenotype and agronomic data including plot locations and other geolocations of interest (e.g. fields, rows, plants).

Requesting access

To request access to BETYdb, register on the. You will be notified once you have been granted access.

Data organization

The primary BETYdb is largely relevant here, noting the following usages:

Genotypes are stored in the cultivars table
Plots are stored in the sites table. Plots are nested hierarchically based on geolocation.

Most tables in BETYdb have search boxes. We describe below how to use the Advanced Search box to query data from these tables and download the results as a CSV file.

The Advanced Search box is the easiest way to download summary datasets designed to have enough information (location, time, species, citations) to be useful for a wide range of use cases.

(For more information about querying data from specific tables, see the BETYdb .)

On the Welcome page of BETYdb there is a search option for trait and yield data (Figure 1). This tool allows users to search the entire collection of trait and yield data for specific sites, citations, species, and traits.

The results page provides a map interface and the option to download a file containing search results. The downloaded file is in CSV format. This file provides meta-data and provenance information, including the SQL query used to extract the data, the date and time the query was made, the citation source of each result row, and a citation for BETYdb itself.

Instructions

Using the search box to search trait and yield data is very simple: Type the site (city or site name), species (scientific or common name), cultivar, citation (author and/or year), or trait (variable name or description) into the search box and the results will show contents of BETYdb that match the search. The number of records per page can be changed to accord with the viewer's preference and the search results can be downloaded in the Excel-compatible CSV format.

The search map may be used in conjunction with search terms to restrict search results to a particular geographical area—or even a specific site—by clicking on a map. Clicking on a particular site will restrict results to that site. Clicking in the vicinity of a group of sites but not on a particular site will restrict the search to the region around the point clicked. Alternatively, if a search using search terms is done without clicking on the map, all sites associated with the returned results are highlighted on the map. Then, to zero in on results for a particular geographic area, click on or near highlighted locations on the map.

Using CyVerse (Genomics)

About CyVerse

CyVerse is a National Science Foundation funded cyberinfrastructure that aims to democratize access to supercomputing capabilities.

Accessing Data via CyVerse

TERRA-REF genomics data is accessible on the CyVerse Data Store and Discovery Environment. Accessing data through the CyVerse Discovery Environment requires signing up for a free CyVerse account. The Discovery Environment gives users access to software and computing resources, so this method has the advantage that TERRA-REF data can be utilized directly without the need to copy the data elsewhere. During the TERRA-REF , users will need to request access to the TERRA-REF CyVerse Community Data folder through the TERRA-REF . The TERRA-REF Community Data folder can be found at /iplant/home/shared/terraref.

Using Analysis Workbench (all data)

About the Analysis Workbench

The Analysis Workbench allows you to launch private Jupyter Notebook and RStudio instances to explore and analyze TERRA-REF data products.

Data Use Policy

Release with Attribution

We plan to make data from the Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform (TERRA-REF) project available for use with attribution. Each type of data will include or point to the appropriate attribution policy.

Release / reprocessing schedule

We will release the data in stages or tiers.

First Tier

The first tier will be an internal release to the TERRA-REF team and the standards committee. The purpose of this first-tier release is to perform initial quality checks and calibration of the data; it will take place as data sets are produced and compiled.

By November 2016, it is an objective of the TERRA-REF team to establish a data release pipeline, wherein the release of data to this first tier will be within 21 days from the date of collection.
Access to the data will be arranged for by the resource producer (i.e. limiting access to selected users).

Second Tier

The second tier will enable the release of the data generated solely by the TERRA-REF team to other TERRA teams as well as non-TERRA entities.

By November 2017, it is an objective of the TERRA-REF team to establish a data release pipeline, wherein the release of data to this second tier will be within 10 days from the date of collection.
It is noted that release of the data to the second tier may occur prior to publication and that access is granted with the understanding that the contributions and interests of the TERRA-REF team should be recognized and respected by the users of the data. The TERRA-REF team reserves the right to analyze and publish its own data, provided that this is done in a timely fashion. Resource users should appropriately cite the source of the data and acknowledge the resource producers. The publication of the data, as suggested in the TERRA-REF Authorship Guidelines, should specify the collaborative nature of the project, and authorship is expected to include all those contributing significantly to the work.

Technical Documentation

This section includes the following:

Data Standards

Overview

TERRA’s data standards facilitate the exchange of genomic and phenomic data across teams and external researchers. Applying common standards makes it easier to exchange analytical methods and data across domains and to leverage existing tools.

When practical, existing conventions and standards have been used to create data standards. Spatial data adopts Federal Geographic Data Committee (FGDC) and Open Geospatial Consortium (OGC) data and meta-data standards. CF variable naming convention was adopted for meteorological data and biophysical data. Data formats and variable naming conventions were adapted from NEON and NASA.

Feedback from data creators and users was used to define the data formats, semantics, interfaces, file formats, and representations of space, time, and genetic identity, based on existing standards, commonly used file formats, and user needs.

We anticipate that standards and data formats will evolve over time as we clarify use cases, develop new sensors and analytical pipelines, and build tools for data format conversion and feature extraction and tracking provenance. Each year we will re-convene to assess our standards based on user needs. The Standards Committee will assess the trade-off between the upfront cost of adoption with the long-term value of the data products, algorithms, and tools that will be developed as part of the TERRA program. The specifications for these data products will be developed iteratively over the course of the project in coordination with TERRA funded projects. The focus will be to take advantage of existing tools based on these standards, and to develop data translation interfaces where necessary.

Agronomic and Phenotype Data Standards

Current Practice

In TERRA-REF v0 release, agronomic and phenotype data is stored and exchanged using the BETYdb API. Agronomic data is stored in the sites, managements, and treatments tables. Phenotype data is stored in the traits, variables, and methods tables. Data is ingested and accessed via the BETYdb API formats.

Standardization Efforts

In cooperation with participants from , the , and groups, the TERRA-REF team is pursuing the development of a format to facilitate the exchange of data across systems based on the ICASA Vocabulary and AgMIP JSON Data Objects. An initial draft of this format is available for comment on

In addition, we plan to enable the TERRA-REF databases to import and export data via the .

Directory Structure

The data processing pipeline transmits data from origination sites to a controlled directory structure on the ROGER CyberGIS supercomputer.

The data is generally structured as follows:

/sites
  /ua-mac
    /raw_data
      /sensor1
        /timestamp
          /dataset
      /sensor2
      ...
    /Level_1
      /extractor1_outputs
      /extractor2_outputs
      ...
  /danforth
    /raw_data
      /sensor3
      ...
    /Level_1
      /extractor3_outputs

...where raw outputs from sensors per site are stored in a raw_data subdirectory and corresponding outputs from different extractor algorithms are stored in Level_1 (and eventually Level_2, etc) subdirectories.

When possible, sensor directories will be divided into days and then into individual datasets.

This directory structure is visible when accessing data via the Globus interface.
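
As an illustration of the layout above, the following sketch builds paths that follow the same pattern. The function names and the default root are assumptions made for this example; the site, sensor, and extractor names are whatever the caller supplies.

```python
from pathlib import PurePosixPath
from typing import Optional


def raw_data_path(site: str, sensor: str, timestamp: Optional[str] = None,
                  dataset: Optional[str] = None,
                  root: str = "/sites") -> PurePosixPath:
    """Build /sites/<site>/raw_data/<sensor>[/<timestamp>[/<dataset>]]."""
    path = PurePosixPath(root) / site / "raw_data" / sensor
    if timestamp is not None:
        path = path / timestamp
        if dataset is not None:
            path = path / dataset
    return path


def level_path(site: str, level: int, extractor_output: str,
               root: str = "/sites") -> PurePosixPath:
    """Build /sites/<site>/Level_<n>/<extractor_output> for derived products."""
    if level < 1:
        raise ValueError("derived products start at Level_1; raw data lives in raw_data")
    return PurePosixPath(root) / site / f"Level_{level}" / extractor_output


print(raw_data_path("ua-mac", "sensor1", "timestamp", "dataset"))
print(level_path("danforth", 1, "extractor3_outputs"))
```

Using PurePosixPath keeps the result independent of the operating system running the script, which matters because the paths describe a remote file system rather than the local one.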

Data Storage

Blue Waters Nearline: NCSA 300PB+ Tape Archive (2PB Allocation)
ROGER: CyberGIS R&D server for GIS applications, 5PB storage + variety of nodes, including large memory. roger.ncsa.illinois.edu (1PB Allocation)

Data Transfer

Maricopa Agricultural Center, Arizona

Environmental Sensors

Transferring ima

Data is sent to the gantry-cache server located inside the main UA-MAC building's telecom room via FTP over a private 10GbE interface. The path to each file being transferred is logged to /var/log/xferlog. A Docker container running on the gantry-cache reads through this log file, tracking the last line it has read, and scans the file regularly for new lines. File paths are scraped from the log and bundled into groups of 500 to be transferred via the Globus Python API to the Spectrum Scale file system that backs the ROGER cluster at NCSA. The log file is rolled daily and compressed to keep its size in check. Sensor directories on the gantry-cache are whitelisted for monitoring to prevent accidental or junk data from being ingested into the Clowder pipeline.
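
The log-following logic described above might look roughly like the sketch below. It is not the project's monitoring code. It assumes the log uses the common wu-ftpd style xferlog layout, in which the file name is the ninth whitespace-separated field and contains no spaces, and all function names are hypothetical. Submitting each batch through Globus is deliberately left out.

```python
import os
from typing import Iterable, Iterator, List, Tuple

BATCH_SIZE = 500  # group size stated in the text


def read_new_lines(log_path: str, offset: int) -> Tuple[List[str], int]:
    """Return complete lines appended since byte `offset`, plus the new offset."""
    if os.path.getsize(log_path) < offset:
        offset = 0  # the log was rolled, so start again from the top
    with open(log_path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Consume only complete lines so a partially written entry is retried later.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def extract_path(line: str) -> str:
    """ASSUMPTION: wu-ftpd style xferlog, file name in the ninth field."""
    fields = line.split()
    return fields[8] if len(fields) > 8 else ""


def select_paths(lines: Iterable[str], allowed_prefixes: List[str]) -> Iterator[str]:
    """Keep only paths that fall under a whitelisted sensor directory."""
    for line in lines:
        path = extract_path(line)
        if path and any(path.startswith(prefix) for prefix in allowed_prefixes):
            yield path


def batches(paths: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Group paths into lists of at most `size` entries."""
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

A monitoring loop would call read_new_lines periodically, persist the returned offset, and hand each batch to the transfer step.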

Data Processing Pipeline

Maricopa Agricultural Center, Arizona

Automated controlled-environment phenotyping, Missouri

At two points in the processing pipeline, metadata derived from collected data is inserted into BETYdb:

At the start of the transfer process, metadata collected and derived during Danforth's initial processing will be pushed.
After transfer to NCSA, extractors running in Clowder will derive further metadata that will be pushed. This is a subset of the metadata that will also be stored in Clowder's database. The complete metadata definitions are still being determined, but will likely include:
- plant identifiers

Kansas State University

HudsonAlpha - Genomics

Data Backup

Raw data

Running nightly on ROGER.

Script is hosted at: /gpfs/smallblockFS/home/malone12/terra_backup

The script uses the Spectrum Scale policy engine to find all files that were modified the day prior, and passes that list to a job in the batch system. The job bundles the files into a .tar file, then uses pigz to compress it in parallel across 18 threads. Because the script runs as a batch job with the date passed in as a variable, a busy batch system does not cause backups for different days to block one another. The .tgz files are then sent to NCSA Nearline using Globus and purged from the file system.
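
The idea of tying each backup job to an explicit date can be sketched as follows. This is a simplified stand-in, not the production script: it walks the directory tree instead of using the Spectrum Scale policy engine, compresses with Python's single-threaded gzip support instead of pigz, and omits the Globus transfer and purge steps. The function names and the archive naming scheme are hypothetical.

```python
import os
import tarfile
from datetime import date, datetime, time, timedelta
from typing import List, Optional


def files_modified_on(root: str, day: date) -> List[str]:
    """List files under `root` whose modification time falls on `day` (local time)."""
    start = datetime.combine(day, time.min).timestamp()
    end = datetime.combine(day + timedelta(days=1), time.min).timestamp()
    matches = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if start <= os.path.getmtime(path) < end:
                matches.append(path)
    return sorted(matches)


def bundle(paths: List[str], archive_path: str) -> str:
    """Write the given files into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            archive.add(path)
    return archive_path


def backup_previous_day(root: str, out_dir: str, today: Optional[date] = None) -> str:
    """Bundle files modified on the day before `today` into <out_dir>/<date>.tgz."""
    today = today or date.today()
    day = today - timedelta(days=1)
    archive_path = os.path.join(out_dir, f"{day.isoformat()}.tgz")
    return bundle(files_modified_on(root, day), archive_path)
```

Because the target date is an argument rather than being derived when the job starts, a job that waits in the queue still bundles the day it was submitted for.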

BETYdb

Runs every night at 23:59. .

This script creates a daily backup for every day of the month. On Sundays it also creates a weekly backup, on the last day of the month a monthly backup, and on the last day of the year a yearly backup. The script overwrites existing backups: for example, on the 1st of every month it creates a backup called bety-d-1 that contains the backup for the 1st of the month. See the script for the rest of the file names.

These backups are copied using crashplan to a central location and should allow recovery in case of a catastrophic failure.
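
The rotation rules for the BETYdb backups can be expressed as a small date check. This is a sketch of the schedule as described, not the backup script itself. The function names are hypothetical, and only the daily file name pattern (bety-d-1 for the 1st) comes from the text; the weekly, monthly, and yearly file names are not reproduced here.

```python
import calendar
from datetime import date
from typing import List


def backups_due(day: date) -> List[str]:
    """Return which kinds of backup the described schedule creates on `day`."""
    due = ["daily"]
    if day.weekday() == 6:  # Monday is 0, so Sunday is 6
        due.append("weekly")
    last_day = calendar.monthrange(day.year, day.month)[1]
    if day.day == last_day:
        due.append("monthly")
        if day.month == 12:
            due.append("yearly")
    return due


def daily_backup_name(day: date) -> str:
    """Daily backups are keyed by day of month, so each slot is reused monthly."""
    return f"bety-d-{day.day}"


print(backups_due(date(2017, 12, 31)))
print(daily_backup_name(date(2017, 3, 1)))
```

Keying the daily name on the day of the month is what makes the script overwrite older backups: the same slot is written again one month later.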

Data Product Creation

Data Product Levels

Data products are processed at various levels ranging from Level 0 to Level 4. Level 0 products are raw data at full instrument resolution. At higher levels, the data are converted into more useful parameters and formats. These level definitions are derived from NASA and NEON.

Hyperspectral Data

The TERRA hyperspectral data pipeline processes imagery from the hyperspectral camera together with ancillary metadata. The pipeline converts the "raw" ENVI-format imagery into netCDF4/HDF5 format with (currently) lossless compression that reduces file size by ~20%. The pipeline also adds suitable ancillary metadata to make the netCDF image files truly self-describing. At the end of the pipeline, the files are typically [ready for xxx]/[uploaded to yyy]/[zzz].

Installation

Software dependencies

The pipeline currently depends on three pre-requisites:

Quality Assurance and Quality Control

Logging

Automated checks

visualizations

testing and continuous integration framework

checking that scans align with plots

Developer Manual

TERRA members may submit data to Clowder, BETYdb, and CoGe.

Clowder contains data related to the field scanner operations and sensor box, including bounding box of each image / dataset as well as location of the sensor, data types and processing level, scanner missions.
BETYdb contains plot locations and other geolocations of interest (e.g. fields, rows, plants) that are associated with agronomic experimental design / meta-data (what was planted where, field boundaries, treatments, etc).

Submitting Data to CoGe

CoGe supports the genomics pipeline required for the TERRA program for Sorghum sequence alignment and analysis. It has a web interface and REST API. CoGe is developed by Eric Lyons and hosted at the University of Arizona, where it is made available for researchers to use. CoGe can be hosted on any server, VM, or Docker container.

Submitting Sequences to the CoGe Pipeline

Tutorials

We are developing a set of tutorials described here.

Note that the tutorials assume that you are using terraref.ndslabs.org which provides all of the software dependencies along with data access.

Appendix

Code of Conduct

As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.

This code of conduct applies both within project spaces and in public spaces when an individual is representing the project or its community.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant, version 1.1.0, available from

Glossary

Accession - plant materials collected from a particular area.

Active reflectance - measurement of light originating from a sensor that reflects off of an object and back to the sensor

Algorithm - a process or set of rules to be followed in calculations or other problem-solving operations

Alignment, sequence - a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

API (application programming interface) - a set of routine definitions, protocols, and tools for building software and applications.

BAM (Binary Alignment/Map) format - binary format for storing sequence data.

BED (Browser Extensible Data) format - format consisting of one line per feature, each containing 3-12 columns of data, plus optional track definition lines.

BETYdb (Biofuel Ecophysiological Traits and Yields database) - a web-based database of plant trait and yield data that supports research, forecasting, and decision making associated with the development and production of cellulosic biofuel crops

BRDF (Bidirectional Reflectance Distribution Function) - a function of four real variables that defines how light is reflected at an opaque surface.

Breeding Management System (BMS) - an information management system developed by the Integrated Breeding Platform to help breeders manage the breeding process, from program planning to decision-making.

Brown Dog - a research project to develop a method for easily accessing historic research data stored in order to maintain the long-term viability of large bodies of scientific research.

BWA - a software package for mapping low-divergent sequences against a large reference genome.

Clowder - a scalable data repository for sharing, organizing and analyzing data

Collections - one or more datasets.

Cultivar - plants selected for desirable characteristics that can be maintained by propagation.

Data product level - relative amount that data products are processed. Level 0 products are raw data at full instrument resolution. At higher levels, the data are converted into more useful parameters and formats.

Data standards - the rules by which data are described and recorded.

Datasets - one or more files with associated metadata collected by one sensor at one time point.

Downwelling spectral irradiance - The component of radiation directed toward the earth's surface per unit frequency or wavelength

Exposure - the amount of light per unit area reaching an electronic image sensor

FASTQ format - a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores.

FASTX-toolkit - a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.

Gantry - a rail-bound crane system that transports a measurement platform (like the Scanalyzer) over a field

GAPIT (Genome Association and Prediction Integrated Tool) – an R package that performs Genome Wide Association Study (GWAS) and genome prediction (or selection).

GATK (Genome Analysis Toolkit) - a software package for analysis of high-throughput sequencing data

Gbrowse - a combination of database and interactive web pages for manipulating and displaying annotations on genomes.

Generic Model Organism Database (GMOD) - a collection of open source software tools for managing, visualizing, storing, and disseminating genetic and genomic data.

Genome annotation - the process of attaching biological information to sequences.

Genomic coordinates - The beginning and ending positions of an annotation along a sequence

Genotype calling - inferring the genotype carried by an individual at each site

GeoDjango - geographic Web framework for building GIS Web applications

Germplasm - the sum total of genetic resources of an organism.

GFF (General Feature Format) - format consisting of one line per feature, each containing 9 columns of data, plus optional track definition lines

GIS (geographic information system) - a system designed to capture, store, manipulate, analyze, manage, and present all types of spatial or geographical data.

Globus - a connected set of data transfer and sharing services for research data management.

Hierarchical Data Format (HDF) - a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data.

Hyperspectral data - information from across the electromagnetic spectrum.

IGV (Integrative Genomics Viewer) - a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.

Integrated Breeding Platform (IBP) - platform providing integrated, high-performing breeding informatics and management system

Jbrowse - an embeddable genome browser

JSON (JavaScript Object Notation) - an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs.

Jupyter Notebook - a web application for creating and sharing documents that contain live code, equations, visualizations and explanatory text.

Lemnatec - supplier of software and automated research platforms for plant phenotyping.

Metadata - data that provides information about other data

MLMM (multi-locus mixed-model) - analysis for genome-wide association studies (GWAS) that uses a forward and backward stepwise approach to select markers as fixed effect covariates in the model.

NetCDF - a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

OpenAlea - a distributed collaborative effort to develop Python libraries and tools that address the needs of current and future works in Plant Architecture modeling.

OpenCV (Open Source Computer Vision Library) - an open source computer vision and machine learning software library.

PAR (Photosynthetically Active Radiation) - the amount of light available for photosynthesis, which is light in the 400 to 700 nanometer wavelength range.

Phenotype - the set of observable characteristics of an individual resulting from the interaction of its genotype with the environment.

Phytozome - a project that facilitates comparative genomic studies amongst green plants.

PlantCV - an image processing package specific to plants that is built upon open-source software

PostGIS - an open source software program that adds support for geographic objects to the PostgreSQL object-relational database.

Python - a programming language

QA (quality assurance) - a planned system of review procedures conducted outside the actual data compilation.

QC (quality control) - a system of checks to assess and maintain the quality of the data.

Quality scores - measure of the probability that a nucleotide base is correctly identified from DNA sequencing

R/qtl - an extensible, interactive environment for mapping quantitative trait loci (QTL) in experimental crosses.

Raw data - unprocessed data collected from an experiment

Reads - sequence of nucleotides of a segment of DNA

Reference data - data that defines the set of permissible values to be used by other data fields.

RESTful API - an application program interface (API) that uses HTTP requests to get, put, post, and delete data.

ROGER - a cluster housed at NCSA that has 13.3 TB of system memory available for computation

RStudio - a set of integrated tools for use with R, a software environment for statistical computing and graphics.

SAMtools - a suite of programs for working with data in the SAM (Sequence Alignment/Map) format, a generic format for storing large nucleotide sequence alignments.

Scanalyzer - instrumentation created by Lemnatec with robotic sensor arm with multiple overhead cameras and sensors

Sequencing - the process of determining the precise order of nucleotides within a DNA molecule.

SNP (single nucleotide polymorphism) - a variation in a single nucleotide that occurs at a specific position in the genome

Spaces - contain collections and datasets. TERRA-REF uses one space for each of the phenotyping platforms.

Spectral exposure - the radiant energy received by a surface, per unit area, per unit frequency or wavelength

Spectral flux - the radiant energy emitted, reflected, transmitted or received, per unit time, per unit frequency

Spectral response function (SRF) - the quantum efficiency of a sensor at specific wavelengths over the range of a spectral band

SQL (Structured Query Language) - a special-purpose programming language designed for managing data held in a relational database management system

SRA (Sequence Read Archive) - a bioinformatics database that provides a public repository for DNA sequencing data

Standards committee - TERRA project representatives and external advisors who work to create clear definitions of data formats, semantics, and interfaces, file formats, and representations of space, time, and genetic identity based on existing standards, commonly used file formats, and user needs to make it easier to analyze and exchange data and results.

Swagger - a set of rules for a format describing REST API. The format can be used to share documentation among product managers, testers and developers, but can also be used by various tools to automate API-related processes.

TASSEL-GBS - software for investigating the relationship between phenotypes and genotypes

TERRA (Transportation Energy Resources from Renewable Agriculture) - a program funded by ARPA-E that facilitates the improvement of advanced biofuel crops by developing and integrating cutting-edge remote sensing platforms, complex data analytics tools, and high-throughput plant breeding technologies.

TERRA-REF (Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform) - a research project focused on developing an integrated phenotyping system for energy sorghum that leverages genetics and breeding, automation, remote plant sensing, genomics, and computational analytics.

Thredds: Geospatial Data server - a web server that provides metadata and data access for scientific datasets, using a variety of remote data access protocols

Trait - the morphological, anatomical, physiological, biochemical and phenological characteristics of plants and their organs

Variants - a nucleotide difference in a genotype compared to a reference genotype

VCF (Variant Call Format) - a text file format, often stored in compressed form. It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.

Vcftools - a program package designed for working with VCF files

White reference, reflectance of - light reflecting off of a white reference object that is used for the calibration of hyperspectral images

Accessing BETYdb via ArcMap and other GIS software

Interested researchers can access BETYdb directly from GIS software such as ESRI ArcMap and QGIS. In some cases direct access can simplify the use of spatial data in BETYdb, but this convenience must be weighed against a more complex setup, the limits of GIS software compatibility, and the additional complexity of extracting data from a PostGIS SQL database.

Overview

Accessing the production BETYdb used by the TERRA-REF program requires creating a secure shell (SSH) tunnel to a remote server. After the tunnel is created, the database is accessed as if it were available on the local machine. A step-by-step process is given below.

Configuration used for these instructions

ArcMap 10.3 or later (Requires Windows operating system)
Instructions for using QGIS and other GIS software are provided below
PuTTY: ssh client for Windows that can be downloaded here:

Setup

Request Access

Request access by following the link. This will take you to the NCSA identity service. If you do not have an NCSA account, you will be asked to create one. This account and password will be used to log in to the database server. Access will generally be granted within 24 hours.

Confirm Access

Use PuTTY or your preferred SSH client and your NCSA account. First open the terminal and then log in to bety6.ncsa.illinois.edu using ssh from the command line:

After confirming access to bety6, log out by typing exit.

Create SSH Tunnel to BETYdb

The following command will create an SSH tunnel from your computer to the BETYdb server:

Note: if you have PostgreSQL running on your desktop computer (using the default port 5432), you will need to stop it first.

The above will bind the local port 5432 (first parameter) to port 5432 (second parameter), the default Postgres listening port, on the remote server. All traffic bound for port 5432 on your local machine will be automatically forwarded to the remote server. As a result, programs such as ArcGIS running on your computer will connect to the remote BETYdb as if it were on your computer.
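As a minimal sketch of the port forwarding described above, the following Python snippet assembles an `ssh -L` command with the local and remote ports both set to 5432. The helper name `build_tunnel_command` and the account name `your_ncsa_user` are placeholders introduced here for illustration; only the host name and the port numbers come from the text.

```python
import subprocess


def build_tunnel_command(user, host="bety6.ncsa.illinois.edu",
                         local_port=5432, remote_port=5432):
    """Return the argument list for an SSH local port forward.

    The option -L local_port:localhost:remote_port forwards connections
    made to local_port on this machine to remote_port on the remote server.
    """
    forward = f"{local_port}:localhost:{remote_port}"
    return ["ssh", "-L", forward, f"{user}@{host}"]


# 'your_ncsa_user' is a placeholder, not a real account name.
command = build_tunnel_command("your_ncsa_user")
print(" ".join(command))

# To actually open the tunnel, uncomment the next line. It blocks until
# the SSH session is closed.
# subprocess.run(command, check=True)
```

If the local port 5432 is already in use, a different `local_port` could be passed, in which case GIS clients would need to connect to that port instead.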

Note you will need to create the SSH connection with the tunnel every time you wish to access BETYdb from your local machine.

To keep the tunnel open, use

Note for PuTTY users: you can configure PuTTY to remember these settings. In the navigation tree on the left-hand side, click Connection > SSH > Tunnels. Enter '5432' under Source port and 'localhost:5432' in the Destination field. Then click Session and save this configuration for future use.

The next section of the guide discusses accessing BETYdb using ArcMap, querying plots, and joining these to the traits and experiments tables. The instructions for setting up an SSH tunnel will also work with psql, pgAdmin3, QGIS, and other clients. Instructions for connecting via QGIS and ArcGIS Pro are provided below.

Using ArcMap

Add BETYdb Layer or Table to ArcMap

BETYdb is configured with PostGIS geometry support. This allows ArcGIS Desktop clients to access geometry layers stored within BETYdb.

Warning: ArcGIS releases prior to 10.3 required you to place the PostgreSQL libpq files in the ArcGIS client's bin directory. This is no longer required for the ArcGIS Desktop clients, but some ESRI tools may still require the library to be installed.

Click on the ArcCatalog icon (on right edge of ArcMap window) to open the ArcCatalog Tree
In the tree, click on 'Database Connections' and then 'Add Database Connection'. A Database Connection dialog window will open.
Within the dialog box:

Warning: ArcMap does not support the big integer format used by BETYdb as primary keys, so those fields will not be visible or available for selection. In most cases you should be able to use other fields as unique identifiers.

Modifying the Query Layer

BETYdb contains one geometry table, called bety.public.sites, containing the boundaries for each plot. Plot boundaries can change each season, and different plot definitions may be used even within a season (e.g., to subset plots or exclude boundary rows). As a result, there is significant overlap that can cause confusion when displayed. In general, you will want to use the query layer to limit plots to a single season and a single definition.

Right click the bety.public.sites layer and choose properties.
Choose the Definition Query tab
Add the line sitename LIKE 'MAC Field Scanner Season 1%' or sitename LIKE 'MAC Field Scanner Season 2%' to limit the layer to Season 1 or Season 2 respectively.

For more advanced selection of sites by experiment or season, you can join the experiments and experiments_sites tables. This is beyond the scope of the present tutorial.
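The Definition Query used above follows a simple pattern, sketched below in Python. The helper `season_definition_query` is hypothetical and only builds the text to paste into the Definition Query box; the site-name prefix is the one given in the steps above.

```python
def season_definition_query(season):
    """Build a Definition Query that keeps plots from one season."""
    if not isinstance(season, int) or season < 1:
        raise ValueError("season must be a positive integer")
    # The trailing % is the SQL wildcard: any site name starting with
    # this prefix is kept.
    return f"sitename LIKE 'MAC Field Scanner Season {season}%'"


query = season_definition_query(2)
print(query)
```

Note that a prefix match such as 'Season 1%' would also match names beginning with 'Season 10' or higher, if such seasons exist in the database, so the pattern may need a more specific suffix in that case.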

Joining Additional BETYdb Tables

Additional tables can be added and joined to the sites table. Tables can be added just like any other layer. In this case, we'll add bety.public.traits_and_yields_view and join it to the bety.public.sites layer.

To create a join with other tables, start by adding the desired table.
Follow instructions above to add the bety.public.traits_and_yields_view
On this table the unique identifier is a group of columns, so select sitename, cultivar, scientificname, trait, date, entity, and method as the unique identifiers.

Creating a Thematic View

The final section describes how to create a thematic view of the bety.public.sites layer based on the mean attribute where the trait is NDVI from the bety.public.traits_and_yields_view. Remove any previous joins from bety.public.sites (right click bety.public.sites --> joins and relates --> remove join) prior to performing this procedure because we will be selecting the NDVI data by creating a query layer from bety.public.traits_and_yields_view prior to the join.

Right click the bety.public.traits_and_yields_view table and select properties
Click on the Definition Query tab
Add the line "trait = 'NDVI'" to the Definition Query box

Connecting to Other GIS Software

The connection instructions below assume that an SSH tunnel exists.

ArcGIS Pro

This assumes you have followed the instructions for ArcMap to create a database connection file.

Open ArcCatalog
- Under database connections, you will find the connection made above, called 'TERRA REF BETYdb.sde'
- right click this and select 'properties'

QGIS

Open QGIS
In left 'browser panel', right-click the PostGIS icon
select 'New Connection'

How to export plots from PostGIS as a Shapefile

This does not require GIS software other than the PostGIS traits database. While connecting directly to the database within GIS software is handy, it is also straightforward to export Shapefiles.

After you have connected via ssh to the PostGIS server, the pgsql2shp utility is available and can be used to dump out all of the plot and site definitions (names and geometries).
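As a hedged illustration, the Python sketch below assembles a pgsql2shp command line. The `-f`, `-h`, `-p`, and `-u` options are standard pgsql2shp options, but the database name `bety`, the table name `sites`, the column names `sitename` and `geometry`, and the output name `terra_plots` are assumptions made for this example and should be checked against the actual database.

```python
def build_export_command(output_name, database, query,
                         host="localhost", port=5432, user=None):
    """Return a pgsql2shp argument list that exports a query as a Shapefile."""
    command = ["pgsql2shp", "-f", output_name, "-h", host, "-p", str(port)]
    if user is not None:
        command += ["-u", user]
    # pgsql2shp takes the database name followed by a table name or a query.
    command += [database, query]
    return command


# Assumed names: database 'bety', table 'sites', columns 'sitename', 'geometry'.
command = build_export_command(
    "terra_plots", "bety", "SELECT sitename, geometry FROM sites")
print(command)
```

The list can be passed to `subprocess.run` on a machine where pgsql2shp is installed and the database is reachable.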

Submitting data to BETYdb

BETYdb is a database used to centralize data from research done in all TERRA projects. (It is also the name of the Web interface to that database.) Uploading data to BETYdb will allow everyone on the team access to research done on the TERRA project.

Preliminary steps

Before submitting data to BETYdb, you must first have an account.

Go to the homepage.
Click the "Register for BETYdb" button to create an account. If you plan to submit data, be sure to request "Creator" page access level when filling out the sign-up form.
Understand how the database is organized and what search options are available. Do this by exploring the data using the Data tab (see next section).

Exploring the data

The Data tab contains a menu for searching the database for different types of data. The Data tab is also the pathway to pages allowing you to add new data of your own. But if you have a sizable amount of trait or yield data you wish to submit, you will likely want to use the Bulk Upload wizard (see below).

As an example, try clicking the Data tab and selecting Citations, the first menu item. A page with a list of citations that have already been uploaded into the system appears.

Citations are listed by the first author's last name. For example, a journal article written by Andrew Davis and Kerri Shaw would have the name "Davis" in the author slot.

Use the search box located in the top right corner of the page to search for citations by author, year, title, journal, volume, page, URL, or DOI. Note that the search string must exactly match a substring of the value of one of these items (though the matching is case-insensitive).
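The matching rule described above (an exact, case-insensitive substring match against any one field) can be sketched in Python as follows. This is only an illustration of the rule, not the BETYdb implementation, and the sample record is invented.

```python
def matches(search, citation):
    """Return True if search is a case-insensitive substring of any field."""
    needle = search.lower()
    return any(needle in str(value).lower() for value in citation.values())


# Invented sample record for illustration only.
citation = {
    "author": "Davis",
    "year": 2015,
    "title": "An example title",
    "journal": "Example Journal",
}

print(matches("davi", citation))         # substring of the author field
print(matches("Davis 2015", citation))   # spans two fields, so no match
```

The second search fails because the string must be found within a single field; "Davis 2015" is not a substring of the author, the year, or any other individual value.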

Each of the other collections listed in the Data menu may be searched similarly. For example, on the Cultivars page you can search cultivars in the system by searching for them by any of several facets pertaining to cultivars, including the name, ecotype, associated species, even the notes. Keep in mind that when switching to a new Data menu item (such as Cultivars), the resulting page will initially show all items of the type selected that are currently on file. (More precisely, since results are paginated, it will show the first twenty-five of those results.)

Preparing for bulk upload of data

The Bulk Upload wizard expects data in CSV format, with one row for each set of associated data items. ("Associated data items" usually means a set of measurements made on the same entity at the same time.) Each trait or yield data item must be associated with a citation, site, species, and treatment and may be associated with a specific cultivar of the associated species. Before you can upload data from a data file, this associated citation, site, species, cultivar, and treatment information must already be in place.

Moreover, if you are uploading trait data, your CSV data file must have one or more trait variable columns (and optionally, one or more covariate variable columns), and the names of these columns must match the names of existing variables. (See the discussion of variables below.)
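To make the expected file layout more concrete, here is a minimal Python sketch that writes a one-row trait file and checks that every column heading is either an association column or a known variable name. The association column headings (`citation_doi`, `site`, `species`, `treatment`, `cultivar`, `date`) and all row values are assumptions made for illustration; consult the Bulk Upload templates for the headings BETYdb actually expects. The variable names are the examples mentioned elsewhere in this guide.

```python
import csv
import io

# Assumed headings for the associated data; verify against the templates.
ASSOCIATION_COLUMNS = ["citation_doi", "site", "species",
                       "treatment", "cultivar", "date"]
# Example variable names; the real list comes from the database.
RECOGNIZED_TRAITS = {"canopy_height", "hull_area", "solidity"}


def unrecognized_columns(header):
    """Return headings that are neither association columns nor known traits."""
    known = set(ASSOCIATION_COLUMNS) | RECOGNIZED_TRAITS
    return [name for name in header if name not in known]


# One row per set of measurements made on the same entity at the same time.
rows = [{
    "citation_doi": "10.0000/example",   # placeholder DOI
    "site": "Example Site",
    "species": "Sorghum bicolor",
    "treatment": "Example treatment",
    "cultivar": "Example cultivar",
    "date": "2016-06-01",
    "canopy_height": 120,
}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer,
                        fieldnames=ASSOCIATION_COLUMNS + ["canopy_height"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

header = next(csv.reader(io.StringIO(csv_text)))
print(unrecognized_columns(header))
```

Running a check like this before uploading can catch misspelled variable names early, although the wizard's own validation remains the authoritative test.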

Details on adding associated data

There is no bulk upload process for adding citations, sites, species, cultivars, treatments, and variables to the database. They must be added one at a time using Web forms. Since a set of dozens or hundreds of traits is most often associated with a single citation, site, or species (and so on), this is usually not an undue burden.

Details on checking that items of each particular type exist (and adding them if they don't) follow:

Citations: To check that the needed citations exist, go to the citations listing by clicking Citations in the Data menu. Search for your citation(s) to determine whether all citations associated with your data already exist. If they don't, create new citations as needed. Author, year, and title are required; if at all possible, also include the journal name, volume, page numbers, and DOI. (You must include the DOI if that is what your data file uses to identify citations.)

Sites: Go to the Data tab and click on Sites to verify that all sites in your data file are listed on the Sites page. If any of your sites are not already in the system, you will need to add them to the database. To do this, first search the citations list for the associated citation, select it (by clicking the checkmark in the row where it is listed), and then click the New Site button. A new site must have a name, but if possible, supply other information: the city, state, and country where the site is located, the latitude, longitude, and altitude of the site, and possibly climate and soil data.

It is possible that sites referenced by your data are already in the database but that they aren't yet associated with the citation associated with that data. To see the set of sites associated with a given citation, find the citation in the citations list and select it by clicking the checkmark in its row. This will take you to the Listing Sites page; all of the sites associated with the selected citation (if any) will be listed at the top. To associate another site with the selected citation, enter its name in the search box, find the row containing it, and click the "link" action in that row.

Treatments: The treatment specified for each of your data items must not only match the name of an existing treatment, it must also be associated with the citation for the data item. To see the list of treatments associated with a particular citation, select the citation as in the instructions for Sites. Then click the Treatments link on the Listing Sites page. The top section of this page lists all treatments associated with the selected citation.

Currently, there is no way to associate an arbitrary treatment with a citation via the Web interface. You will either have to make a new treatment with the desired name (after the desired citation has been selected), or you (or an administrator) will have to modify the database directly.

Species: To check that the needed species entries exist, go to the species listing by clicking Species in the Data menu. Search for each of the species required by your data. The species entry in the CSV file must match the scientific name (Latin name) of the species listed in the database. If necessary, add any species in your data that has not yet been added to the database. When adding a species, scientificname is the only required field, but the genus and species fields should be filled out as well.

Cultivars: If your data lists cultivars, you should check that these are in the database as well. Cultivar names are not necessarily unique, but they are unique within a given species. To check whether a cultivar matching the name and species listed in your CSV file has been added to the database, go to the cultivar listing by clicking Cultivars in the Data menu. Searching either by species name or cultivar name should quickly determine if the needed cultivar exists. If it needs to be added, click the New Cultivar button. Fill in the species search box with enough of the species name to narrow down the result list to a workable size, and then select the correct species from the result list immediately below the search box. Then type the name of the cultivar you wish to add in the Name field. The Ecotype and Notes sections are optional.

Variables: If you are submitting trait data, verify that the variables associated with each trait and each covariate match the names of variables in the system (for example, canopy_height, hull_area, or solidity). To do this, go to the Data tab and click on Variables. If any of your variables are not already in the system, you will need to add them.

For a variable to be recognized as a trait variable or covariate, it is not enough for it simply to be in the variables table; it must also be in the trait_covariate_associations table. To check which variables will be recognized as trait variables or covariates, click on the Bulk Upload tab. Then click the link View List of Recognized Traits. This will bring up a table that lists all names of variables recognized as traits and the names of all variables recognized as required or optional covariates for each trait. If you need to add to this table and do not have direct access to the underlying database to which you are submitting data, you will need to e-mail the site administrator to request additions. (See the "Contact Us" section in the footer of the homepage.)

The Bulk Upload Wizard

Once you have entered all the necessary data to prepare for a bulk data upload, you can then begin the bulk upload process.

There are some key rules for bulk uploading:

Templates: To help you get started, some data file templates are available. There are four different templates to choose from.
- Use this template if you are uploading yields and you wish to specify the citations by author, year, and title.

Troubleshooting data files

Immediately after uploading a data file (or after specifying the citation if this is done interactively), the Bulk Upload Wizard tries to validate the uploaded file and displays the results of this validation.

The types of errors one may encounter at this stage fall into roughly three categories:

Parsing errors
These are errors at the stage of parsing the CSV file, before the header or data values are even checked. An error at this stage returns one to the file-upload page.
Header errors
These are errors caused by having an incongruous set of headings in the header row. Here are some examples:

After successful validation

Global options and values

If there are no errors in the data file, the bulk upload will proceed to a page allowing you to choose rounding options for your data values. You may choose to keep 1, 2, 3, or 4 significant digits, 3 being the default. If your data includes a standard error (SE) column, you may separately specify the amount of rounding for the standard error. Here the default is 2 significant digits.
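Rounding to a number of significant digits differs from rounding to a number of decimal places. The Python sketch below shows one way to do it; `round_significant` is a hypothetical helper, and the wizard's exact behavior (for instance, how ties are handled) is not specified here.

```python
def round_significant(value, digits=3):
    """Round value to the given number of significant digits."""
    if digits < 1:
        raise ValueError("digits must be at least 1")
    if value == 0:
        return 0.0
    # Scientific notation with digits - 1 decimals keeps exactly
    # 'digits' significant digits.
    return float(f"{value:.{digits - 1}e}")


print(round_significant(123.456))        # default of 3 significant digits
print(round_significant(0.04567, 2))     # e.g. a standard error, 2 digits
```

With 3 significant digits, 123.456 becomes 123.0, while with 2 significant digits 0.04567 becomes 0.046; the number of decimal places kept depends on the magnitude of the value.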

If you did not specify all associated-data values or did not specify an access level in the data file itself, this page will also allow you to specify a uniform global value for any association not specified in the file, and a uniform access level if your data file did not have an access_level column.

Verification page

Once you have specified global options and values, you will be taken to a verification page that will summarize the global options you have selected and the associations you specified for your data. The latter will be presented in more detail than any specification in your data file or on the Upload Options and Global Values page. For example, when summarizing the sites associated with your data, not only are the site names listed, but the city, state, country, latitude, longitude, soil type, and soil notes are also displayed. This will help ensure that the citations, sites, species, etc. that you specified are really the ones that you intended.

Once you have verified the data, clicking the Insert Data button will complete the upload. The insertions are done in an SQL transaction: if any insertion fails, the entire transaction is rolled back.
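The all-or-nothing behavior of a transaction can be illustrated with a small, self-contained Python sketch. It uses an in-memory SQLite database from the standard library purely as a stand-in (BETYdb itself runs on PostgreSQL), and the table and column names are invented for the example.

```python
import sqlite3


def insert_all(conn, rows):
    """Insert all rows in one transaction; return True on success."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception is raised.
        with conn:
            conn.executemany(
                "INSERT INTO traits (site, mean) VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traits (site TEXT NOT NULL, mean REAL NOT NULL)")

# The second row violates the NOT NULL constraint, so nothing is kept.
ok = insert_all(conn, [("plot 1", 1.2), ("plot 2", None)])
count = conn.execute("SELECT COUNT(*) FROM traits").fetchone()[0]
print(ok, count)
```

Even though the first row is valid, it is discarded along with the failing one, which is why a single bad value causes the whole upload to be rejected rather than leaving a partial data set behind.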

Existing Data Standards

This page summarizes existing standards, conventions, controlled vocabularies, and ontologies used for the representation of crop physiological traits, agronomic metadata, sensor output, genomics, and other information related to the TERRA-REF project.

Metadata standards

International Consortium for Agricultural Systems Applications (ICASA)

The ICASA Version 2.0 data standard defines an abstract model and data dictionary for the representation of agricultural field experiments. ICASA is explicitly designed to support implementations in a variety of formats, including plain text, spreadsheets, or structured formats. It is important to note that ICASA is both the data dictionary and a format used to describe experiments.

The Agricultural Model Intercomparison Project (AgMIP) has developed a format for use with the AgMIP Crop Experiment (ACE) database and API.

Currently, the ICASA data dictionary is not published in a form suitable for linked-data applications. The next step is to render ICASA in RDF for the TERRA-REF project. This will allow TERRA-REF to produce data that leverages the ICASA vocabulary as well as other external or custom vocabularies in a single metadata format.

The ICASA data dictionary is also being mapped to various ontologies as part of the project. With this, it may be possible in the future to represent ICASA concepts using formal ontologies or to create mappings/crosswalks between them.

See also:

White et al (2013). . Computers and Electronics in Agriculture.
AgMIP

Minimum Information About a Plant Phenotyping Experiment (MIAPPE)

MIAPPE was developed by members of the European Plant Phenotyping Network (EPPN) and the EU-funded transPLANT project. It is intended to define a list of attributes necessary to fully describe a phenotyping experiment.

The MIAPPE standard is available from the transPLANT standards portal and is compatible with the ISA framework. The transPLANT standards portal also provides an example configuration for the ISA toolset.

MIAPPE is currently the only standard listed in for the phenotyping domain. While several databases claim to support MIAPPE, the standard is still nascent.

MIAPPE is based on the ISA framework, building on earlier "minimum information" standards, such as MIAME (Minimum Information about a Microarray Experiment). If the MIAPPE standard is determined to be useful for TERRA-REF, it would be worth reviewing the MIAME standard and related formats such as MAGE-TAB, MINiML, and SOFT accepted by the Gene Expression Omnibus (GEO). GEO is a long-standing repository for genetic research data and might serve as another model for TERRA-REF.

It is worth noting that linked-data methods are supported but optional when depositing data to GEO. The format, similar to the MIAPPE ISA Tab format, does support .

See also:

Dublin Core Application Profiles

While some communities define explicit metadata schemas, another approach is the use of "application profiles." An application profile is a declaration of the metadata terms adopted by a community or an organization, along with the source of the terms. Application profiles are composed of terms drawn from multiple vocabularies or ontologies to define a "schema" or "profile" for metadata. For example, the Dryad metadata profile draws on the Dublin Core, Darwin Core, and Dryad-specific elements.

See also:

DCMI .
Example
DCMI

Trait Dictionary Format (Crop Ontology)

The Crop Ontology curation tool supports import and export of trait information in a trait dictionary format.

See also:

Vocabularies and Ontologies

This section reviews related controlled vocabularies, data dictionaries, and ontologies.

Biofuel Ecophysiological Traits and Yields Database (BETYdb)

While BETYdb is not a controlled vocabulary itself, the relational schema models a variety of concepts including managements, sites, treatments, traits, and yields.

The BETYdb “variables” table defines variables used to represent traits in the BETYdb relational model. There has been some effort to standardize variable names by adopting standard names where variables overlap. A variable is represented as a name, description, units, as well as min/max values.
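A variable record of this shape can be sketched as a small Python data class. This is an illustration of the fields listed above, not the BETYdb schema; the attribute names `min_value` and `max_value`, the description text, the units, and the range used for `canopy_height` are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variable:
    """A variable with a name, description, units, and allowed range."""
    name: str
    description: str
    units: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None

    def in_range(self, value):
        """Return True if value lies within the optional min/max bounds."""
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


# Illustrative values only; the real entry may differ.
canopy_height = Variable(
    name="canopy_height",
    description="Height of the top of the canopy (illustrative)",
    units="cm",
    min_value=0,
    max_value=500,
)

print(canopy_height.in_range(120))
print(canopy_height.in_range(-5))
```

Min/max values of this kind make it possible to flag implausible measurements during data entry or upload.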

For example:

See also:

DCMI Metadata terms

Controlled vocabulary for the representation of bibliographic information. See also:

Climate and Forecast Standard Name Table

Standard variable names and naming convention for use with NetCDF. The Climate and Forecast metadata conventions are intended to promote sharing of NetCDF files. The CF conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities.

Basic conventions include lower-case letters, numbers, underscores, and US spelling.

Information is encoded in the variable name itself. The basic format is (optional components in []):

[surface] [component] standard_name [at surface] [in medium] [due to process] [assuming condition]
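The pattern above can be sketched in code. The following is only an illustration, assuming that the optional components are joined to the base name with underscores; the function names `follows_basic_conventions` and `compose_name` are hypothetical helpers, not part of any CF tooling, and a composed name is only a valid CF standard name if it actually appears in the standard name table.

```python
import re

# Basic conventions: lower-case letters, digits, and underscores only.
_NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def follows_basic_conventions(name: str) -> bool:
    """Check the character-level conventions only (not table membership)."""
    return _NAME_PATTERN.fullmatch(name) is not None


def compose_name(standard_name, surface=None, component=None,
                 at_surface=None, medium=None, process=None, condition=None):
    """Join the optional components around a base name, in pattern order."""
    parts = [surface, component, standard_name]
    if at_surface:
        parts.append(f"at_{at_surface}")
    if medium:
        parts.append(f"in_{medium}")
    if process:
        parts.append(f"due_to_{process}")
    if condition:
        parts.append(f"assuming_{condition}")
    # Drop the components that were not supplied.
    return "_".join(p for p in parts if p)


print(compose_name("air_pressure", surface="surface"))
print(compose_name("mole_fraction_of_ozone", medium="air"))
print(follows_basic_conventions("Air-Temperature"))
```

The first two calls produce `surface_air_pressure` and `mole_fraction_of_ozone_in_air`; the third fails the character check because of the capital letters and the hyphen.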

For example:

Standard names have optional canonical units, AMIP and GRIB (GRidded Binary) codes.

The CF standard names have been converted to RDF by several communities, including the Marine Metadata Interoperability (MMI) project.

Dimensions: time, lat, lon, other. Specify time first (unlimited), then lat, lon or x, y, with the extent set to the field boundaries.

See also:

mentions RDF conversions.

ICASA master variable list

Vocabulary and naming conventions for agricultural modeling variables, used by AgMIP. The ICASA master variable list is included, at least in part, in the AgrO ontology. The NARDN-HD Core Harmonized Crop Experiment Data is also taken from the ICASA vocabulary.

ICASA variables have a number of fields, including name, description, type, min and max values.

See also:

White et al (2013). . Computers and Electronics in Agriculture.

NARDN-HD Core Harmonized Crop Experiment Data

A subset of the ICASA data dictionary representing a set of core variables that are commonly collected in field crop experiments. These will be used to harmonize data from USDA experiments as part of a National Agricultural Research Data Network.

CSDMS Standard Names

Variable naming rules and patterns for any domain, developed as part of the CSDMS project as an alternative to CF. CSDMS standard names are considered to have a more flexible community approval mechanism than CF. CSDMS names include object and quantity/attribute parts.

CSDMS names have been converted to RDF as part of the Earth Cube Geosemantic Server project.

See also:

International Plant Names Index (IPNI)

IPNI is a database of the names and associated basic bibliographical details of seed plants, ferns and lycophytes. Its goal is to eliminate the need for repeated reference to primary sources for basic bibliographic information about plant names.

NCBI Taxonomy

A curated classification and nomenclature for all of the organisms in the public sequence databases that represents about 10% of the described species of life on the planet. Taxonomy recommended by MIAPPE.

Ontologies

Agronomy Ontology (AGRO)

The Agronomy Ontology “describes agronomic practices, agronomic techniques, and agronomic variables used in agronomic experiments.” It is intended as a complementary ontology to the Crop Ontology (CO). Variables are selected out of the International Consortium for Agricultural Systems Applications (ICASA) vocabulary and a mapping between AgrO and ICASA is in progress. AgrO is intended to work with the existing ontologies including ENVO, UO, PATO, IAO, and CHEBI. It will be part of an Agronomy Management System and fieldbook modeled on the CGIAR Breeding Management System to capture agronomic data.

See also:

OBO Foundry.
FAO.
RDA.

Crop Ontology (CO)

The Crop Ontology (CO) contains "Validated concepts along with their inter-relationships on anatomy, structure and phenotype of crops, on trait measurement and methods as well as on Germplasm with the multi-crop passport terms." The ontology is actively used by the CGIAR community and is a central part of the Breeding Management System. MIAPPE recommends the CO (along with TO, PO, PATO, XEML) for observed variables.

Shrestha et al (2012) describe a method for representing trait data via the CO.

See also:

Shrestha et al (2012). . Front Physiol. 2012 Aug 25;3:326.

Crop Research Ontology (CRO)

Describes experimental design, environmental conditions and methods associated with the crop study/experiment/trial and their evaluation. CRO is part of the Crop Ontology platform, originally developed for the International Crop Information System (ICIS). CRO is recommended in the MIAPPE standard for general metadata, environment, treatments, and experimental design fields.

See also:

Extensible Observation Ontology (OBOE)

Cited in Kattge et al (2011) as an example of an ontology used in ecology and environmental sciences to represent measurements and observation. However, the CRO may be better suited for TERRA-REF.

See also:

Kattge, J.(2011).

Gene Ontology (GO)

Defines concepts/classes used to describe gene function, and relationships between these concepts. GO is a widely-adopted ontology in genetics research, supported by databases such as GEO. This ontology is cited in Krajewski et al (2015) and might be relevant for the TERRA genomics pipeline.

See also:

Krajewski et al (2015). . Journal of Experimental Botany, 66(18), 5417–5427.

Information Artifact Ontology (IAO)

Information entities, originally driven by work by OBI (e.g., abstract, author, citation, document etc). IAO covers similar territory to the Dublin Core vocabulary.

Ontology for Biomedical Investigations (OBI)

Integrated ontology for the description of biological and clinical investigations. This includes a set of 'universal' terms, that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. Recommended by MIAPPE for general metadata, timing and location, and experimental design.

See also:

Phenotype and Attribute Ontology (PATO)

Phenotypic qualities (properties).

Recommended in MIAPPE for use in the observed values field.

See also:

Plant Environment Ontology (EO)

Part of the Plant Ontology (PO), standardized controlled vocabularies to describe various types of treatments given to an individual plant / a population or a cultured tissue and/or cell type sample to evaluate the response on its exposure.

Plant Ontology (PO)

Describes plant anatomy, morphology, and stages of development for all plants, and is intended to create a framework for meaningful cross-species queries across gene expression and phenotype data sets from plant genomics and genetics experiments. Recommended by MIAPPE for observed values fields. Along with EO, GO, and TO, PO makes up the Gramene database, which links plant anatomy, morphology, and growth and development to plant genomics data.

See also:

Plant Trait Ontology (TO)

Along with EO, GO, and PO, TO makes up the Gramene database, which links plant anatomy, morphology, and growth and development to plant genomics data. Recommended by MIAPPE for observed values fields.

Example trait entry:

See also:

Statistics Ontology (STATO)

General-purpose statistics ontology covering processes such as statistical tests, their conditions of application, and information needed or resulting from statistical methods, such as probability distributions, variables, spread and variation metrics. Recommended by MIAPPE for experimental design.

See also:

Units of Measurement Ontology (UO)

Metric units for PATO. This OBO ontology defines a set of prefixes (giga, hecto, kilo, etc) and units (area/square meter, volume/liter, rate/count per second, temperature/degree Fahrenheit). The two top-level classes are prefixes and units.

UO is mentioned in relation to the Agronomy Ontology (AGRO), but PATO is also recommended by MIAPPE for observed values fields.

While there are general standard units, it seems unlikely that these would ever be gathered in a single place. It seems more useful to define a high-level ontology to represent a "unit" and allow domains and communities to publish their own authoritative lists.

XEML Environment Ontology (XEO)

Created to help plant scientists in documenting and sharing metadata describing the abiotic environment.

DDI-RDF Discovery Vocabulary

Data Catalog Vocabulary (DCAT)

The Data Catalog Vocabulary (DCAT) is an RDF vocabulary intended to facilitate interoperability between data catalogs published on the Web. DCAT defines a set of classes including Dataset, Catalog, CatalogRecord, and Distribution.
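To make the relationship between these classes concrete, here is a minimal sketch of a catalog description written as a JSON-LD document from Python. The `dcat:` and `dct:` namespace URIs are the published ones; every `https://example.org/...` identifier and title is a placeholder assumption, not a real TERRA-REF resource.

```python
import json

# A Catalog lists Datasets; each Dataset has one or more Distributions
# (concrete downloadable forms of the data).
catalog = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/catalog",  # placeholder identifier
    "@type": "dcat:Catalog",
    "dct:title": "Example data catalog",
    "dcat:dataset": [
        {
            "@id": "https://example.org/dataset/1",  # placeholder
            "@type": "dcat:Dataset",
            "dct:title": "Example sensor dataset",
            "dcat:distribution": [
                {
                    "@id": "https://example.org/dataset/1/netcdf",
                    "@type": "dcat:Distribution",
                    "dcat:downloadURL": {
                        "@id": "https://example.org/files/1.nc"
                    },
                }
            ],
        }
    ],
}

print(json.dumps(catalog, indent=2))
```

CatalogRecord is omitted here for brevity; it would describe the catalog's own entry for a dataset (for example, when the entry was added), as distinct from the dataset itself.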

Data Cite Ontology

The

Data Cube Vocabulary

The Data Cube Vocabulary is an RDF-based model for publishing multi-dimensional datasets, based in part on the SDMX guidelines. Data Cube defines a set of classes including DataSet, Observation, and MeasureProperty that may be relevant to the TERRA project.
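As a rough illustration of how a single trait measurement might map onto these classes, the sketch below builds one observation as a JSON-LD-style dictionary. The `qb:` namespace URI is the published one; the `ex:` namespace, the property names `ex:plot`, `ex:date`, and `ex:canopyHeight`, the helper `make_observation`, and all values are hypothetical.

```python
import json

CONTEXT = {
    "qb": "http://purl.org/linked-data/cube#",
    "ex": "https://example.org/terms/",  # hypothetical project namespace
}


def make_observation(obs_id, dataset_id, dimensions, measures):
    """Build one qb:Observation attached to a qb:DataSet.

    dimensions locate the observation (e.g. plot, date); measures carry
    the observed values. Both map property names to values.
    """
    observation = {
        "@context": CONTEXT,
        "@id": obs_id,
        "@type": "qb:Observation",
        "qb:dataSet": {"@id": dataset_id},
    }
    observation.update(dimensions)
    observation.update(measures)
    return observation


obs = make_observation(
    obs_id="ex:obs/1",
    dataset_id="ex:dataset/height",
    dimensions={"ex:plot": "plot-001", "ex:date": "2016-11-01"},
    measures={"ex:canopyHeight": 1.8},  # would be declared a qb:MeasureProperty
)
print(json.dumps(obs, indent=2))
```

In a complete description, `ex:canopyHeight` would itself be declared as a `qb:MeasureProperty` and the dimension properties as `qb:DimensionProperty` in the data structure definition of the dataset.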

Statistical Data and Metadata Exchange (SDMX)

SDMX is an international initiative for the standardization of the exchange of statistical data and metadata among international organizations. Sponsors of the initiative include Eurostat, European Central Bank, the OECD, World Bank and the UN Statistical Division. They have defined a framework and an exchange format, SDMX-ML, for data exchange. Community members have also developed RDF encodings of the SDMX guidelines that are heavily referenced in the Data Cube vocabulary examples.

Standard formats, ontologies, and controlled vocabularies are typically used in the context of specific software systems.

Agricultural Model Inter-Comparison and Improvement Project (AgMIP) Crop Experiment (ACE) Database

AgMIP "seeks to improve the capability of ecophysiological and economic models to describe the potential impacts of climate change on agricultural systems. AgMIP protocols emphasize the use of multiple models; consequently, data harmonization is essential. This interoperability was achieved by establishing a data exchange mechanism with variables defined in accordance with international standards; implementing a flexibly structured data schema to store experimental data; and designing a method to fill gaps in model-required input data."

The data exchange format is based on a . Data are transferred into and out of the AgMIP Crop Experiment (ACE) and AgMIP Crop Model (ACMO) databases via REST APIs using these JSON objects.

See also

Porter et al (2014). . Environmental Modelling and Software. 62:495-508.
presentation

Biofuel Ecophysiological Traits and Yields Database (BETYdb)

BETYdb is used to store TERRA metadata, provenance, and trait information.

BETYdb traits are available as web pages, CSV, JSON, and XML. This can be extended to allow spatial, temporal, and taxonomic / genomic queries. Trait vectors can be queried and rendered in several output formats. For example:

Here are some examples from betydb.org.

A separate instance of BETYdb is maintained for use by TERRA Ref at . The scope of the TERRA Ref database is limited to high-throughput phenotyping data and metadata produced and used by the TERRA program. Users can set up their own instances of BETYdb and import any public data in the distributed BETYdb network.

See also: BETYdb documentation

includes accessing data with web interface, API, and R traits package
, see section "uniqueness constraints"

Gramene

Gramene is a curated, open-source, integrated data resource for comparative functional genomics in crops and model plant species.

Integrated Breeding Platform/Breeding Management System

System for managing the breeding process including lists of germplasms, defining crosses, managing nurseries, trials, as well as ontologies and statistical analysis.

See also:

TERRA Ref has an instance of (requires login).

International Crop Information System

ICIS is "a database system that provides integrated management of global information on crop improvement and management both for individual crops and for farming systems." ICIS is developed by the Consultative Group for International Agricultural Research (CGIAR).

See also

Fox and Skovmand (1996). "The International Crop Information System (ICIS) - connects genebank to breeder to farmer’s field." Plant adaptation and crop improvement, CAB International.

MODAPS NASA MODIS Satellite data

The data encompasses a library of functions that provides programmatic data access and processing services to MODIS Level 1 and Atmosphere data products. These routines enable both SOAP and REST based web service calls against the data archives maintained by MODAPS. These routines mirror existing LAADS Web services.

See also:

Phenomics Ontology Driven Database (PODD)

Online repository for storage and retrieval of raw and analyzed data from Australian Plant Phenomics Facility (APPF) phenotyping platforms. PODD is based on Fedora Commons repository software with data and metadata modeled using OWL/RDFS.

See also:

Plant Breeders API

Specifies a standard interface for plant phenotype/genotype databases to serve data for use in crop breeding applications. This is the API used by , which allows users to turn spreadsheets into databases. Examples indicate that the responses will include values linked to the Crop Ontology, for example:

However, in general BrAPI returns JSON data without linking context (i.e., not JSON-LD), so it is in essence its own data structure.
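The difference between plain JSON and JSON-LD can be sketched as follows. The record, its field names, and the `https://example.org/vocab/` URIs are hypothetical and are not taken from the API specification; the point is only that JSON-LD adds an `@context` that maps each key to a globally identifiable term.

```python
import json


def add_context(record, context):
    """Return a copy of record with a JSON-LD @context attached."""
    linked = {"@context": context}
    linked.update(record)  # the original record is left unchanged
    return linked


# Plain JSON: the keys only have meaning to clients that already know
# this particular data structure.
plain = {"trait": "plant height", "value": 152.0, "unit": "cm"}

# Hypothetical mapping from each key to a term URI.
context = {
    "trait": "https://example.org/vocab/trait",
    "value": "https://example.org/vocab/value",
    "unit": "https://example.org/vocab/unit",
}

linked = add_context(plain, context)
print(json.dumps(linked, indent=2))
```

With a context in place, a generic linked-data client can interpret the keys without prior knowledge of the response format; without it, each client has to hard-code the meaning of every field.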

Other notes:

The group has implemented a few features to make it compatible with Field Book in its current state without the use of API.
BMS and the are both pushing for the API and plan on implementing it when it's complete.
Read news about the and

See also

Plant Genomics and Phenomics Research Data Repository (PGP)

German repository for plant research data including image collections from plant phenotyping and microscopy, unfinished genomes, genotyping data, visualizations of morphological plant models, data from mass spectrometry as well as software and documents.

See also:

Arend et al (2016). . Database.

USDA Plants

“The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories. It includes names, plant symbols, checklists, distributional data, species abstracts, characteristics, images, crop information, automated tools, onward Web links, and references.”

See also

USDA Quick Stats

Web-based application that supports querying the agricultural census and survey statistics. Also available via API.

See also

transPLANT

Infrastructure to support computational analysis of genomic data from crop and model plants. This includes the large-scale analysis of genotype-phenotype associations, a common set of reference plant genomic data, archiving genomic variation, and a search engine integrating reference bioinformatics databases and physical genetic materials. See also

Sensor Data

Meteorological data

Multi-scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP) data formats

One implementation of CF for ecosystem model driver (met, soil) and output (mass, energy dynamics)
- Standardized Met driver data

Date-Time:

YYYY-MM-DD hh:mm:ssZ: based on ISO 8601. Optional offset for local time; precision is determined by the data (e.g., could be YYYY-MM-DD), and decimals are specified by a period.
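A minimal sketch of writing and reading timestamps in the `YYYY-MM-DD hh:mm:ssZ` pattern with the Python standard library follows. It assumes whole-second precision and a space between date and time, as in the pattern above; the helper names `format_utc` and `parse_utc` are illustrative.

```python
from datetime import datetime, timedelta, timezone

PATTERN = "%Y-%m-%d %H:%M:%SZ"  # trailing Z marks UTC


def format_utc(moment: datetime) -> str:
    """Convert a timezone-aware datetime to UTC and format it."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return moment.astimezone(timezone.utc).strftime(PATTERN)


def parse_utc(text: str) -> datetime:
    """Parse a timestamp in the pattern and mark it as UTC."""
    return datetime.strptime(text, PATTERN).replace(tzinfo=timezone.utc)


# A local observation time with a UTC-7 offset, converted to UTC on output.
local = datetime(2016, 11, 1, 5, 30, 0, tzinfo=timezone(timedelta(hours=-7)))
print(format_utc(local))
print(parse_utc("2016-11-01 12:30:00Z"))
```

Requiring timezone-aware inputs avoids silently treating a local wall-clock time as UTC, which is an easy mistake when sensor logs are recorded in local time.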
