Developing Clowder Extractors
Developing the Computing Pipeline with Clowder Extractors
The TERRA REF computing pipeline and data management are built around Clowder. The pipeline consists of 'extractors' that take a file or other piece of information and generate new files or information. In this way, each extractor is a step in the pipeline.
An extractor 'wraps' an algorithm in code that watches for files it can convert into new data products and phenotypes. Extractors run alongside the Clowder interface and databases, and can be configured to listen for specific file types and automatically execute operations on those files to process them and extract metadata. A minimal sketch of an extractor is shown after the list below.
If you want to add an algorithm to the TERRA REF pipeline, or use the Clowder software to manage your own pipeline, extractors provide a way to automate and scale the algorithms you already have. The NCSA Extractor Development wiki provides instructions, including:
Setting up a pipeline development environment on your own computer.
Using the web development interface (currently in beta testing)
Using the Clowder API
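For orientation, the sketch below shows the general shape of a Clowder extractor written with the pyclowder library: a subclass that decides whether an incoming message is relevant and then processes the downloaded file. This is a minimal illustration, not an actual TERRA REF extractor; the file-type check, metadata content, and extractor name are placeholders.

```python
import logging

from pyclowder.extractors import Extractor
from pyclowder.utils import CheckMessage
import pyclowder.files


class ExampleExtractor(Extractor):
    """Minimal extractor that attaches a piece of metadata to matching files."""

    def __init__(self):
        Extractor.__init__(self)
        self.setup()  # parse standard command-line / environment settings
        logging.getLogger('pyclowder').setLevel(logging.INFO)

    def check_message(self, connector, host, secret_key, resource, parameters):
        # Decide whether this message is relevant; download the file if so.
        if resource['name'].lower().endswith('.tif'):
            return CheckMessage.download
        return CheckMessage.ignore

    def process_message(self, connector, host, secret_key, resource, parameters):
        # resource['local_paths'] holds the downloaded input file(s).
        input_path = resource['local_paths'][0]

        # ... run the science algorithm on input_path here ...

        # Attach simple metadata to the source file in Clowder.
        metadata = {
            '@context': ['https://clowder.ncsa.illinois.edu/contexts/metadata.jsonld'],
            'content': {'status': 'processed'},
            'agent': {'@type': 'cat:extractor',
                      'extractor_id': host + 'api/extractors/example.extractor'}
        }
        pyclowder.files.upload_metadata(connector, host, secret_key,
                                        resource['id'], metadata)


if __name__ == '__main__':
    ExampleExtractor().start()
```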
What does it take to contribute an extractor?
Overview
The purpose of this document is to define the requirements for contributing algorithms to the TERRA REF pipeline and maintaining them.
How does an extractor developer get from drafting to deploying an extractor?
The stereo-rgb extractor is a good example of a completed extractor.
ISDA has an overview of some common Python conventions for reference.
Roles
Science Developer (e.g. Zongyang, Sean, Patrick)
Writes, tests, documents science code
Works with pipeline developer to integrate and deploy
Works with end users of data to assess quality
Pipeline Developer / Operator (e.g. Max, Todd)
Develops workflow code
Maintains real-time processing
Coordinates annual re-processing
End User
Scientist who will be using the output data
Defines specifications
Identifies data that can be used for calibration and validation
Reviews output during development and continuous operation
The Extractor Lifecycle
Let's define the stages of extractor development. The process is iterative, and there should be open communication among the Science Developer, Pipeline Developer, and End User throughout.
Define the extractor
Create an issue in GitHub to track development (this information can later be added to the README file)
Inputs (with examples)
Outputs
Add (or use) a citation, variable, and method in BETYdb
Data for ground truthing, testing, validation
Draft the extractor
Create a working ‘feature’ branch on GitHub
This branch should be updated regularly to help collaborators stay up to date
Use docstrings for inline documentation (https://www.python.org/dev/peps/pep-0257); see the example after this list
Request feedback on initial draft and sample output
From Pipeline Developer
From End User
Revise based on feedback
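As an illustration of the PEP 257 style, a science function might be documented along these lines (the function name, arguments, and behavior here are hypothetical):

```python
def estimate_canopy_cover(image, threshold=0.5):
    """Estimate fractional canopy cover from a plot-level RGB image.

    Arguments:
        image: numpy array (H x W x 3) containing the plot image.
        threshold: classification threshold for vegetation pixels.

    Returns:
        Fraction of pixels classified as canopy, in the range [0, 1].
    """
    ...
```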
Beta Release
Create a Pull Request when the extractor is ready to deploy. The PR should be reviewed by both the Pipeline Operator and the End User, who will either request changes or approve it.
A complete extractor is defined below
Deployment
Extractor deployed
First on the live data stream; output data should indicate the beta status of the extractor
Then for reprocessing
Extractor added to the list in GitBook
Example of how to access actual output generated by the extractor (e.g. a BETYdb API call; see the sketch after this list)
Versioned and pushed to PyPI if a science package was extended
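For example, a reviewer might spot-check trait records produced by an extractor through the BETYdb v1 API. In the sketch below, the URL, API key, trait name, and response handling are placeholders; the exact parameters and response structure should be checked against the BETYdb API documentation.

```python
import requests

# Placeholders: the BETYdb URL, API key, and trait name are illustrative only.
BETYDB_URL = 'https://terraref.ncsa.illinois.edu/bety'
API_KEY = 'your-betydb-api-key'

# Query a few trait records via the BETYdb v1 search endpoint.
response = requests.get(
    BETYDB_URL + '/api/v1/search.json',
    params={'key': API_KEY, 'trait': 'canopy_cover', 'limit': 5})
response.raise_for_status()

# Print whatever records came back (structure may vary by BETYdb version).
for record in response.json().get('data', []):
    print(record)
```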
Operation
Output of the extractor is vetted by both the domain expert and the code provider
Improvement
When is an extractor ready to be deployed?
All of the following are required for an extractor to be considered ‘complete’:
Expected test input
Expected test input may be placed in the repository if it is under 1 MB; otherwise, place it on Globus or under the tests/ directory in the Workbench.
This should include both real and simulated data representing a range of success and failure conditions
Expected test output
Implementation
Example of output
Output is vetted by domain expert
Wrapped as extractor
Inline documentation with docstrings https://www.python.org/dev/peps/pep-0257
Documentation in README
Authors
One author should be identified as the maintainer / point of contact
Overview
Description
Inputs
Outputs
Implementation (algorithm details)
Libraries used
References
Rationale (e.g. why method x over y)
QA/QC
Automated checks done in real time
Failure conditions
Known issues
Further Reading and Citations
Related Github issues
References
Documentation in extractor_info.json (possibly using @FILE to read the README into the JSON document); see the example below
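A typical extractor_info.json carries fields along the following lines; the name, description, author, repository URL, and MIME type shown here are placeholders, not values from an existing TERRA REF extractor.

```json
{
  "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
  "name": "terra.example.extractor",
  "version": "1.0",
  "description": "One-sentence description of what the extractor produces.",
  "author": "Maintainer Name <maintainer@example.edu>",
  "contributors": [],
  "contexts": [],
  "repository": [
    {
      "repType": "git",
      "repUrl": "https://github.com/terraref/example-extractor"
    }
  ],
  "process": {
    "file": ["image/tiff"]
  },
  "external_services": [],
  "dependencies": [],
  "bibtex": []
}
```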
TERRA-REF Extractor Resources
terrautils
To make working with the TERRA-REF pipeline as easy as possible, the terrautils Python library was written. By importing this library in an extractor script, developers can minimize code duplication and follow standard practices for common tasks such as GeoTIFF creation and georeferencing. It also provides modules for managing metadata, downloading and uploading, and wrapping the BETYdb/geostreams APIs. A usage sketch follows the module list below.
Modules include:
betydb: BETYdb API wrapper
extractors: General extractor tools, e.g. for creating metadata JSON objects and generating folder hierarchies
formats: Standard methods for creating output files, e.g. images from numpy arrays
gdal: GDAL general image tools
geostreams: Geostreams API wrapper
influx: InfluxDB logging API wrapper
lemnatec: LemnaTec-specific data management methods
metadata: Getting and cleaning metadata
products: Get file lists
sensors: Standard sensor information resources
spatial: Geospatial metadata management
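As a rough illustration of how these modules fit together, an extractor might clean metadata, derive GPS bounds, and write a georeferenced output. The function names below come from the modules listed above, but the exact signatures and metadata keys are assumptions that should be verified against the terrautils source.

```python
from terrautils.metadata import clean_metadata, get_terraref_metadata
from terrautils.spatial import geojson_to_tuples
from terrautils.formats import create_geotiff


def raw_to_geotiff(raw_metadata, pixels, out_path, sensor='stereoTop'):
    """Clean LemnaTec metadata, derive GPS bounds, and write a GeoTIFF.

    raw_metadata: dict of LemnaTec metadata retrieved from Clowder.
    pixels: numpy array of sensor data to georeference.
    """
    # Normalize the raw metadata and pull out the TERRA-REF fields.
    cleaned = clean_metadata(raw_metadata, sensor)
    terra_md = get_terraref_metadata(cleaned, sensor)

    # Convert the GeoJSON bounding box into the tuple layout expected
    # by the GeoTIFF writer (assumed key layout shown here).
    gps_bounds = geojson_to_tuples(
        terra_md['spatial_metadata'][sensor]['bounding_box'])

    # Write the georeferenced output image.
    create_geotiff(pixels, gps_bounds, out_path)
```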
Science packages
To keep code and algorithms broadly applicable, TERRA-REF is developing a series of science-driven packages that collect methods and algorithms independent of the pipeline's inputs and outputs. That is, these packages should not refer to Clowder or the extraction pipeline, but can instead be used in any application to manipulate data products. They are organized by sensor.
These packages will also include test suites to verify that any changes are consistent with previous outputs. The test directories can also serve as examples of how to instantiate and use the science packages in actual code.
stereo_rgb stereo RGB camera (stereoTop in rawData, rgb prefix elsewhere)
flir_ir FLIR infrared camera (flirIrCamera in rawData, ir prefix elsewhere)
scanner_3d laser 3D scanner (scanner3DTop in rawData, laser3d elsewhere)
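The sketch below illustrates the intent: a science package exposes pure, pipeline-agnostic functions plus a small test, with no reference to Clowder or RabbitMQ. The functions and vegetation index shown are hypothetical examples, not the actual stereo_rgb implementation.

```python
import numpy as np


def vegetation_index(rgb_pixels):
    """Compute a simple excess-green index from an RGB array (H x W x 3)."""
    rgb = rgb_pixels.astype('float64')
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    total = r + g + b
    total[total == 0] = 1.0  # avoid division by zero on black pixels
    return (2 * g - r - b) / total


def canopy_cover(rgb_pixels, threshold=0.1):
    """Fraction of pixels whose excess-green index exceeds the threshold."""
    return float(np.mean(vegetation_index(rgb_pixels) > threshold))


def test_canopy_cover_all_green():
    """Tests like this double as usage examples for the package."""
    green = np.zeros((4, 4, 3), dtype=np.uint8)
    green[..., 1] = 255
    assert canopy_cover(green) == 1.0
```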
Extractor repositories
Extractors can be considered wrapper scripts that call methods in the science packages to do the work, but include the necessary components to communicate with TERRA's RabbitMQ message bus to process incoming data as it arrives and upload outputs to Clowder. There should be no science-oriented code in the extractor repositories; that code should be implemented in the science packages instead, so it is easier for future developers to leverage.
Each repository includes extractors in the workflow chain corresponding to the named sensor.
Contact:
Extractor development and deployment: Max Burnette
Development environments: Craig Willis
On our Slack Channel
On GitHub