Developing Clowder Extractors

Developing the Computing Pipeline with Clowder Extractors

The TERRA-REF computing pipeline and its data management are coordinated by Clowder. The pipeline consists of 'extractors' that take a file or other piece of information and generate new files or information; each extractor is thus a step in the pipeline.

An extractor 'wraps' an algorithm in code that watches for files it can convert into new data products and phenotypes. Extractors run quietly alongside the Clowder interface and databases; they can be configured to listen for specific file types and to automatically execute operations on those files to process them and extract metadata.
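The following is a minimal sketch of this pattern using the pyClowder library; the class name, the file-type check, and the stand-in processing function are illustrative and not an actual TERRA-REF extractor.

```python
import logging

from pyclowder.extractors import Extractor
from pyclowder.utils import CheckMessage
import pyclowder.files


def run_science_code(input_path):
    """Stand-in for a science-package call that derives a product from the input."""
    output_path = input_path + ".summary.txt"
    with open(input_path, "rb") as f_in, open(output_path, "w") as f_out:
        f_out.write("input size in bytes: %d\n" % len(f_in.read()))
    return output_path


class ExampleExtractor(Extractor):
    """Watches for matching files and uploads a derived product back to Clowder."""

    def __init__(self):
        Extractor.__init__(self)
        self.setup()
        logging.getLogger('pyclowder').setLevel(logging.INFO)

    def check_message(self, connector, host, secret_key, resource, parameters):
        # Only ask Clowder to download files this extractor knows how to handle.
        if resource["name"].lower().endswith(".bin"):
            return CheckMessage.download
        return CheckMessage.ignore

    def process_message(self, connector, host, secret_key, resource, parameters):
        input_path = resource["local_paths"][0]
        output_path = run_science_code(input_path)
        # Upload the derived product to the dataset that contains the input file.
        pyclowder.files.upload_to_dataset(connector, host, secret_key,
                                          resource["parent"]["id"], output_path)


if __name__ == "__main__":
    ExampleExtractor().start()
```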

If you want to add an algorithm to the TERRA-REF pipeline, or use the Clowder software to manage your own pipeline, extractors provide a way of automating and scaling the algorithms that you have. The NCSA Extractor Development wiki provides instructions, including:

  1. Setting up a pipeline development environment on your own computer.

  2. Using the web development interface (currently in beta testing).

  3. Using the Clowder API.

  4. Using pyClowder to add an analytical or technical component to the pipeline.

What does it take to contribute an extractor?

Overview

The purpose of this document is to define the requirements for contributing algorithms to the TERRA-REF pipeline and maintaining them.

How does an extractor developer get from drafting to deploying an extractor?

The stereo-rgb extractor is a good example of a completed extractor: https://github.com/terraref/extractors-stereo-rgb

ISDA has an overview of some common Python conventions for reference: https://docs.google.com/document/d/1n8iQHdb32u0EOkiNQSRK51XlAjGDsxg75W2SUAAMouM/edit#heading=h.wd4g4fd6q72u

Roles

  • Science Developer (e.g. Zongyang, Sean, Patrick)

    • Writes, tests, documents science code

    • Works with pipeline developer to integrate and deploy

    • Works with end users of data to assess quality

  • Pipeline Developer / Operator (e.g. Max, Todd)

    • Develops workflow code

    • Maintains real-time processing

    • Coordinates annual re-processing

  • End User

    • Scientist who will be using the output data

    • Defines specifications

    • Identifies data that can be used for calibration and validation

    • Reviews output during development and continuous operation

The Extractor Lifecycle

Let's define the stages of extractor development. The process is iterative, and there should be open communication among the Science Developer, Pipeline Developer, and End User throughout.

  1. Define the extractor

Create an issue in GitHub to track development (information can later be added to the README file)

    • Inputs (with examples)

    • Outputs

      • Add (or use) a citation, variable, and method in BETYdb

    • Data for ground truthing, testing, validation

  2. Draft the extractor

    • Create a working ‘feature’ branch on GitHub

Update it regularly; this helps collaborators stay up to date

  3. Request feedback on initial draft and sample output

    • From Pipeline Developer

    • From End User

    • Revise based on feedback

  4. Beta Release

Create a Pull Request when the extractor is ready to deploy. The PR should be reviewed by both the Pipeline Operator and the End User, who will either request changes or approve the PR.

    • A complete extractor is defined below

  5. Deployment

    • Extractor deployed

First on the live data stream; output data should indicate the beta status of the extractor

      • Then for reprocessing

Extractor added to the list in the GitBook documentation

Example of how to access actual output generated by the extractor (e.g. a BETYdb API call; a sketch follows this list)

Versioned and pushed to PyPI if a science package was extended

  6. Operation

Output of the extractor is vetted by both the domain expert and the code provider

    • Improvement
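
As an illustration of the kind of output access mentioned in the deployment step above, here is a hedged sketch of querying extractor-produced trait records from BETYdb. It assumes the BETYdb v1 search endpoint on the TERRA-REF instance; the API key and trait name are placeholders.

```python
import requests

BETYDB_URL = "https://terraref.ncsa.illinois.edu/bety"

params = {
    "key": "YOUR_BETYDB_API_KEY",  # placeholder, not a real key
    "trait": "canopy_height",      # example variable recorded by an extractor
    "limit": 5,
}

response = requests.get(BETYDB_URL + "/api/v1/search", params=params)
response.raise_for_status()

# Print the returned records; the exact payload structure depends on the
# BETYdb version, so inspect response.json() when adapting this sketch.
for record in response.json().get("data", []):
    print(record)
```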

When is an extractor ready to be deployed?

All of the following are required for an extractor to be considered ‘complete’:

  1. Expected test input

Expected test input may either be placed in the repository (if roughly 1 MB or smaller) or, for larger inputs, placed on Globus or under the tests/ directory in the Workbench.

    • This should include both real and simulated data representing a range of successful and failure conditions

  2. Expected test output

  3. Implementation

  4. Example of output

  5. Output is vetted by domain expert

  6. Wrapped as extractor

  7. Documentation in README

    • Authors

      • One should be identified as maintainer / point of contact

    • Overview

      • Description

      • Inputs

Outputs

    • Implementation (algorithm details)

      • Libraries used

      • References

      • Rationale (e.g. why method x over y)

    • QA/QC

      • Automated checks done in real time

      • Failure conditions

      • Known issues

    • Further Reading and Citations

Related GitHub issues

      • References

  8. Documentation in extractor_info.json (possibly using @FILE to read a file into the JSON document); a sketch follows this list.
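
For reference, here is a hedged sketch of the general shape of an extractor_info.json file, loosely following the layout used by pyClowder sample extractors; the name, author, repository URL, and MIME type below are placeholders.

```json
{
  "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
  "name": "terra.example.extractor",
  "version": "1.0",
  "description": "One-line summary of the data product this extractor generates.",
  "author": "Maintainer Name <maintainer@example.org>",
  "contributors": [],
  "repository": [
    {"repType": "git", "repUrl": "https://github.com/terraref/example-extractor"}
  ],
  "process": {
    "file": ["image/tiff"]
  },
  "external_services": [],
  "dependencies": [],
  "bibtex": []
}
```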

TERRA-REF Extractor Resources

terrautils

To make working with the TERRA-REF pipeline as easy as possible, the terrautils Python library was written. By importing this library in an extractor script, developers can ensure that code duplication is minimized and standard practices are used for common tasks such as GeoTIFF creation and georeferencing. It also provides modules for managing metadata, downloading and uploading, and BETYdb/Geostreams API wrapping.

Modules include:

  • betydb: BETYdb API wrapper
  • extractors: general extractor tools, e.g. for creating metadata JSON objects and generating folder hierarchies
  • formats: standard methods for creating output files, e.g. images from numpy arrays
  • gdal: GDAL general image tools
  • geostreams: Geostreams API wrapper
  • influx: InfluxDB logging API wrapper
  • lemnatec: LemnaTec-specific data management methods
  • metadata: getting and cleaning metadata
  • products: getting file lists
  • sensors: standard sensor information resources
  • spatial: geospatial metadata management

Science packages

To keep code and algorithms broadly applicable, TERRA-REF is developing a series of science-driven packages that collect methods and algorithms generic to the inputs and outputs of the pipeline. That is, these packages should not refer to Clowder or extraction pipelines, but instead can be used in any application to manipulate data products. They are organized by sensor:

  • stereo_rgb: stereo RGB camera (stereoTop in rawData, rgb prefix elsewhere)
  • flir_ir: FLIR infrared camera (flirIrCamera in rawData, ir prefix elsewhere)
  • scanner_3d: laser 3D scanner (scanner3DTop in rawData, laser3d elsewhere)

These packages will also include test suites to verify that any changes are consistent with previous outputs. The test directories can also act as examples of how to instantiate and use the science packages in actual code (a sketch of such a test follows the repository list below).

Extractor repositories

Extractors can be considered wrapper scripts that call methods in the science packages to do the work, but include the components necessary to communicate with TERRA's RabbitMQ message bus, process incoming data as it arrives, and upload outputs to Clowder. There should be no science-oriented code in the extractor repositories; that code should be implemented in the science packages instead, so it is easier for future developers to leverage. Each repository includes the extractors in the workflow chain corresponding to the named sensor:

  • extractors-stereo-rgb
  • extractors-3dscanner
  • extractors-multispectral
  • extractors-metadata
  • extractors-hyperspectral
  • extractors-environmental
  • extractors-lemnatec-indoor
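
The science-package test suites mentioned above might look roughly like the following sketch; the function, values, and tolerance are hypothetical.

```python
import numpy as np


def canopy_cover_fraction(pixels):
    """Stand-in for a science-package function: fraction of non-zero pixels."""
    return float(np.count_nonzero(pixels)) / pixels.size


def test_canopy_cover_matches_expected_output():
    # In a real package the input and the vetted output would be small files
    # versioned under tests/; here they are generated inline for illustration.
    pixels = np.zeros((10, 10))
    pixels[:5, :] = 1.0
    expected_canopy_cover = 0.5

    result = canopy_cover_fraction(pixels)

    # Any change to the algorithm that shifts results beyond tolerance fails here.
    assert abs(result - expected_canopy_cover) < 1e-6
```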

For extractor development and deployment, including available development environments, see the NCSA Extractor Development wiki referenced above.

Contact:

  • Max Burnette
  • Craig Willis
  • Slack Channel
  • GitHub

Inline documentation

Use docstrings for inline documentation, following PEP 257 (https://www.python.org/dev/peps/pep-0257).
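
As a quick illustration, here is a short sketch of a PEP 257-style docstring on a hypothetical helper function (not part of any TERRA-REF package).

```python
def clip_to_plot(image, bounds):
    """Return the subset of `image` that falls inside `bounds`.

    Arguments:
        image: 2-D numpy array of pixel values.
        bounds: (min_row, min_col, max_row, max_col) tuple in pixel coordinates.

    Returns:
        A view of `image` restricted to the bounding box.
    """
    min_row, min_col, max_row, max_col = bounds
    return image[min_row:max_row, min_col:max_col]
```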