Log in with your account
Click 'Datasets' > 'Create'
Provide a name and description
Click 'Select Files' to choose which files to add
Click 'Upload' to save the selected files to the dataset
Click 'View Dataset' to confirm. You can add more content with 'Add Files'.
Add metadata, terms of use, etc.
Some metadata may automatically be generated depending on the types of files uploaded. Metadata can be manually added to files or datasets at any time.
Clowder also includes a RESTful API that allows programmatic interactions such as creating new datasets and downloading files. For example, one can request a list of datasets using: GET _clowder home URL_/api/datasets. The current API schema for a Clowder instance can be accessed by selecting API from the ? Help menu in the upper-right corner of the application.
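As a minimal sketch, the same request can be made from Python. The host URL and API key below are placeholders, and the response fields follow common Clowder conventions that may vary by version:

```python
import requests

CLOWDER_URL = "https://your-clowder-instance/clowder"  # placeholder host
API_KEY = "your-api-key"                               # placeholder key

# GET /api/datasets: list the datasets visible to this key.
resp = requests.get(f"{CLOWDER_URL}/api/datasets", params={"key": API_KEY})
resp.raise_for_status()
for dataset in resp.json():
    print(dataset["id"], dataset["name"])
```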
For typical workflows, the following steps are sufficient to push data into Clowder in an organized fashion (a minimal Python sketch follows the list):
Create a collection to hold relevant datasets (optional): POST /api/collections
provide a name; returns a collection ID
Create a dataset to hold relevant files: POST /api/datasets/createempty
provide a name; returns a dataset ID
Add the dataset to the collection: POST /api/collections/<collection id>/datasets/<dataset id>
Upload files and metadata to the dataset: POST /api/datasets/uploadToDataset/<dataset id>
provide file(s) and metadata
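The sketch below walks through this sequence with the requests library. The endpoints are those listed above; the host URL, API key, payload fields, and response fields are assumptions that may need adjusting for your Clowder instance:

```python
import requests

CLOWDER_URL = "https://your-clowder-instance/clowder"  # placeholder host
KEY = {"key": "your-api-key"}  # placeholder API key, passed as a query parameter

# 1. Create a collection (optional); the response carries its ID.
r = requests.post(f"{CLOWDER_URL}/api/collections", params=KEY,
                  json={"name": "My Collection"})
collection_id = r.json()["id"]

# 2. Create an empty dataset, then attach it to the collection.
r = requests.post(f"{CLOWDER_URL}/api/datasets/createempty", params=KEY,
                  json={"name": "My Dataset"})
dataset_id = r.json()["id"]
requests.post(f"{CLOWDER_URL}/api/collections/{collection_id}/datasets/{dataset_id}",
              params=KEY)

# 3. Upload a file to the dataset.
with open("example.txt", "rb") as f:
    requests.post(f"{CLOWDER_URL}/api/datasets/uploadToDataset/{dataset_id}",
                  params=KEY, files={"file": f})
```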
An extensive API reference is available in the Clowder documentation.
Some files, e.g. those transferred via Globus, are moved to the server without triggering Clowder's normal upload paths. These transfers must be reported to the TERRAREF Globus Monitor API (described below) to ensure every file is handled properly.
Log into Globus and click 'Transfer Files'.
Select your source endpoint and Terraref as the destination. You will need to contact NCSA to ensure you have the necessary credentials and folder space to use Globus; unrecognized Globus accounts will not be trusted.
Transfer your files. You will receive a Task ID when the transfer starts.
Send this Task ID and requisite information about the transfer to the TERRAREF Globus Monitor API as a JSON object:
In addition to username and Task ID, you must also send a "contents" object containing each dataset that should be created in Clowder, and the files that belong to that dataset. This allows Clowder to verify it has handled every file in the Globus task.
The JSON object is sent to the API via an HTTP request: POST 141.142.168.72:5454/tasks
For example, with cURL this would be done with: curl -X POST -u <globus_username>:<globus_password> -d <json_object> 141.142.168.72:5454/tasks
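The same request could be scripted in Python as sketched below. The exact field names in the JSON payload are assumptions based on the description above (a username, the Task ID, and a "contents" object mapping each dataset to its files):

```python
import requests

# Illustrative payload; field names are assumptions, not a verified schema.
task = {
    "user": "globus_username",              # your Globus username
    "globus_id": "TASK-ID-FROM-TRANSFER",   # Task ID returned when the transfer started
    "contents": {
        "my_dataset": {                     # dataset to be created in Clowder
            "files": ["/gantry/raw_data/file1.bin",
                      "/gantry/raw_data/file2.bin"]
        }
    },
}

resp = requests.post("http://141.142.168.72:5454/tasks",
                     auth=("globus_username", "globus_password"),
                     json=task)
resp.raise_for_status()
```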
In this way Clowder indexes a pointer to the file on disk rather than making a new copy of the file; thus the file will still be accessible via Globus, FTP, or other methods directed at the filesystem.
TERRA members may submit data to Clowder, BETYdb, and CoGe.
Clowder contains data related to the field scanner operations and sensor box, including the bounding box of each image or dataset, the location of the sensor, data types and processing levels, and scanner missions.
BETYdb contains plot locations and other geolocations of interest (e.g. fields, rows, plants) that are associated with agronomic experimental design and metadata (what was planted where, field boundaries, treatments, etc.).
CoGe contains genomic data.
Members may also develop extractors: services that run silently alongside Clowder.
The TERRA-REF computing pipeline and data management are built around Clowder. The pipeline consists of 'extractors' that take a file or other piece of information and generate new files or information; in this way, each extractor is a step in the pipeline.
An extractor 'wraps' an algorithm in code that watches for files it can convert into new data products and phenotypes. Extractors run alongside the Clowder interface and databases and can be configured to wait for specific file types, automatically executing operations on those files to process them and extract metadata.
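As a rough illustration, Clowder extractors are typically built on NCSA's pyclowder library. The sketch below follows the pyclowder 2.x pattern (a real extractor also ships an extractor_info.json descriptor alongside the script); method names and the metadata layout are drawn from pyclowder's examples but may differ between versions, and the line-counting logic is a placeholder for real science code:

```python
from pyclowder.extractors import Extractor
from pyclowder.utils import CheckMessage
import pyclowder.files


class LineCountExtractor(Extractor):
    """Placeholder extractor: counts lines in a file and attaches the
    result to that file as metadata."""

    def __init__(self):
        Extractor.__init__(self)  # expects extractor_info.json in the working dir
        self.setup()              # parse standard pyclowder command-line settings

    def check_message(self, connector, host, secret_key, resource, parameters):
        # Ask pyclowder to download the file before process_message runs.
        return CheckMessage.download

    def process_message(self, connector, host, secret_key, resource, parameters):
        local_path = resource["local_paths"][0]
        with open(local_path) as f:
            line_count = sum(1 for _ in f)

        # Clowder expects JSON-LD metadata; a minimal form is shown here.
        metadata = {
            "@context": ["https://clowder.ncsa.illinois.edu/contexts/metadata.jsonld"],
            "file_id": resource["id"],
            "content": {"lines": line_count},
            "agent": {"@type": "cat:extractor",
                      "extractor_id": host + "api/extractors/linecount"},
        }
        pyclowder.files.upload_metadata(connector, host, secret_key,
                                        resource["id"], metadata)


if __name__ == "__main__":
    LineCountExtractor().start()
```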
If you want to add an algorithm to the TERRA-REF pipeline, or use the Clowder software to manage your own pipeline, extractors provide a way of automating and scaling your algorithms. The NCSA Extractor Development wiki provides instructions, including:
Setting up a pipeline development environment on your own computer.
Using the web development interface (currently in beta testing)
Using the Clowder API
To make working with the TERRA-REF pipeline as easy as possible, the terrautils Python library was written. By importing this library in an extractor script, developers can ensure that code duplication is minimized and standard practices are used for common tasks such as GeoTIFF creation and georeferencing. It also provides modules for managing metadata, downloading and uploading, and BETYdb/geostreams API wrapping.
Modules include (a usage sketch follows the list):
betydb: BETYdb API wrapper
extractors: general extractor tools, e.g. for creating metadata JSON objects and generating folder hierarchies
formats: standard methods for creating output files, e.g. images from numpy arrays
gdal: GDAL general image tools
geostreams: Geostreams API wrapper
influx: InfluxDB logging API wrapper
lemnatec: LemnaTec-specific data management methods
metadata: getting and cleaning metadata
products: getting file lists
sensors: standard sensor information resources
spatial: geospatial metadata management
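A hedged usage sketch follows; the function names, argument lists, and environment variables shown are assumptions about the terrautils API rather than verified signatures:

```python
import numpy as np
from terrautils import betydb, formats

# Assumes BETYdb connection settings (e.g. BETYDB_URL, BETYDB_KEY) are
# set in the environment; the call below is illustrative, not verified.
plots = betydb.get_site_boundaries("2017-06-01", city="Maricopa")

# Write a small dummy array to disk as a georeferenced GeoTIFF.
pixels = np.zeros((10, 10), dtype=np.uint8)
gps_bounds = (33.07, 33.08, -111.97, -111.96)  # assumed (lat_min, lat_max, lon_min, lon_max)
formats.create_geotiff(pixels, gps_bounds, "example.tif")
```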
To keep code and algorithms broadly applicable, TERRA-REF is developing a series of science-driven packages that collect methods and algorithms generic to the inputs and outputs of the pipeline. That is, these packages should not refer to Clowder or extraction pipelines; instead they can be used in any application that manipulates the data products. They are organized by sensor.
These packages will also include test suites to verify that any changes are consistent with previous outputs. The test directories can also act as examples on how to instantiate and use the science packages in actual code.
stereo_rgb: stereo RGB camera (stereoTop in raw_data, rgb prefix elsewhere)
flir_ir: FLIR infrared camera (flirIrCamera in raw_data, ir prefix elsewhere)
scanner_3d: laser 3D scanner (scanner3DTop in raw_data, laser3d elsewhere)
Extractors can be considered wrapper scripts that call methods in the science packages to do work, but include the necessary components to communicate with TERRA's RabbitMQ message bus to process incoming data as it arrives and upload outputs to Clowder. There should be no science-oriented code in the extractor repos - this code should be implemented in science packages instead so it is easier for future developers to leverage.
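To illustrate the separation, a science-package routine might be a pure array-in/value-out function with an accompanying test, while the extractor handles only messaging and I/O. The names below are hypothetical, not actual TERRA-REF code:

```python
import numpy as np

# Hypothetical science-package function (e.g. in stereo_rgb): pure
# computation, with no Clowder or pipeline references.
def estimate_canopy_cover(rgb: np.ndarray, threshold: float = 1.0) -> float:
    """Fraction of pixels where green dominates both red and blue."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    green = (g.astype(float) > threshold * r) & (g.astype(float) > threshold * b)
    return float(green.mean())

# A test like this would live in the package's test directory and double
# as a usage example.
def test_estimate_canopy_cover():
    img = np.zeros((4, 4, 3), dtype=np.uint8)
    img[:2, :, 1] = 255  # top half pure green
    assert abs(estimate_canopy_cover(img) - 0.5) < 1e-6
```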
Each repository includes extractors in the workflow chain corresponding to the named sensor.
Extractor development and deployment: Max Burnette
Development environments: Craig Willis
On our Slack Channel
On GitHub
BETYdb is a database used to centralize data from research done in all TERRA projects. (It is also the name of the Web interface to that database.) Uploading data to BETYdb will allow everyone on the team access to research done on the TERRA project.
Before submitting data to BETYdb, you must first have an account.
Go to the homepage.
Click the "Register for BETYdb" button to create an account. If you plan to submit data, be sure to request "Creator" page access level when filling out the sign-up form.
Understand how the database is organized and what search options are available. Do this by exploring the data using the Data tab (see next section).
The Data tab contains a menu for searching the database for different types of data. The Data tab is also the pathway to pages allowing you to add new data of your own. But if you have a sizable amount of trait or yield data you wish to submit, you will likely want to use the Bulk Upload wizard (see below).
As an example, try clicking the Data tab and selecting Citations, the first menu item. A page with a list of citations that have already been uploaded into the system appears.
Citations are listed by the first author's last name. For example, a journal article written by Andrew Davis and Kerri Shaw would have the name "Davis" in the author slot.
Use the search box located in the top right corner of the page to search for citations by author, year, title, journal, volume, page, URL, or DOI. Note that the search string must exactly match a substring of the value of one of these items (though the matching is case-insensitive).
Each of the other collections listed in the Data menu may be searched similarly. For example, on the Cultivars page you can search for cultivars in the system by any of several facets, including the name, ecotype, associated species, or even the notes. Keep in mind that when switching to a new Data menu item (such as Cultivars), the resulting page will initially show all items of the selected type that are currently on file. (More precisely, since results are paginated, it will show the first twenty-five of those results.)
The Bulk Upload wizard expects data in CSV format, with one row for each set of associated data items. ("Associated data items" usually means a set of measurements made on the same entity at the same time.) Each trait or yield data item must be associated with a citation, site, species, and treatment and may be associated with a specific cultivar of the associated species. Before you can upload data from a data file, this associated citation, site, species, cultivar, and treatment information must already be in place.
Moreover, if you are uploading trait data, your CSV data file must have one or more trait variable columns (and optionally, one or more covariate variable columns), and the names of these columns must match the names of existing variables. (See the discussion of variables below.)
Details on adding associated data
There is no bulk upload process for adding citations, sites, species, cultivars, treatments, and variables to the database. They must be added one at a time using Web forms. Since most often a set of dozens or hundreds of traits is associated with a single citation, site, or species (etc.), usually this is not an undue burden.
Details on checking that items of each particular type exist (and adding them if they don't) follow:
Citations: To check that the needed citations exist, go to the citations listing by clicking Citations in the Data menu. Search for your citation(s) to determine if all citations associated with your data already exist. If they don't, then create new citations as needed. Be sure to fill in all the required data: author, year, and title are required; if at all possible, also include the journal name, volume, page numbers, and DOI. (You must include the DOI if that is what your data file uses to identify citations.)
Sites: Go to the Data tab and click on Sites to verify that all sites in your data file are listed on the Sites page. If any of your sites are not already in the system, you will need to add them to the database. To do this, first search the citations list for the associated citation, select it (by clicking the checkmark in the row where it is listed) and then click the New Site button. A new site must have a name, but if possible, supply other information—the city, state, and country where the site is located, the latitude, longitude, and altitude of the site, and possibly climate and soil data.
It is possible that sites referenced by your data are already in the database but that they aren't yet associated with the citation associated with that data. To see the set of sites associated with a given citation, find the citation in the citations list and select it by clicking the checkmark in its row. This will take you to the Listing Sites page; all of the sites associated with the selected citation (if any) will be listed at the top. To associate another site with the selected citation, enter its name in the search box, find the row containing it, and click the "link" action in that row.
Treatments: The treatment specified for each of your data items must not only match the name of an existing treatment, it must also be associated with the citation for the data item. To see the list of treatments associated with a particular citation, select the citation as in the instructions for Sites. Then click the Treatments link on the Listing Sites page. The top section of this page lists all treatments associated with the selected citation.
Currently, there is no way to associate an arbitrary treatment with a citation via the Web interface. You will either have to make a new treatment with the desired name (after the desired citation has been selected), or you (or an administrator) will have to modify the database directly.
Species: To check that the needed species entries exist, go to the species listing by clicking Species in the Data menu. Search for each of the species required by your data. The species entry in the CSV file must match the scientific name (Latin name) of the species listed in the database. If necessary, add any species in your data that has not yet been added to the database. When adding a species, scientificname is the only required field, but the genus and species fields should be filled out as well.
Cultivars: If your data lists cultivars, you should check that these are in the database as well. Cultivar names are not necessarily unique, but they are unique within a given species. To check whether a cultivar matching the name and species listed in your CSV file has been added to the database, go to the cultivar listing by clicking Cultivars in the Data menu. Searching either by species name or cultivar name should quickly determine if the needed cultivar exists. If it needs to be added, click the New Cultivar button. Fill in the species search box with enough of the species name to narrow down the result list to a workable size, and then select the correct species from the result list immediately below the search box. Then type the name of the cultivar you wish to add in the Name field. The Ecotype and Notes sections are optional.
Variables: If you are submitting trait data, verify that the variables associated with each trait and each covariate match the names of variables in the system (for example, canopy_height, hull_area, or solidity). To do this, go to the Data tab and click on Variables. If any of your variables are not already in the system, you will need to add them.
Once you have entered all the necessary data to prepare for a bulk data upload, you can then begin the bulk upload process.
There are some key rules for bulk uploading:
Templates To help you get started, some data file templates are available. There are four different templates to choose from.
Use this template if you are uploading yields and you wish to specify the citations by author, year, and title.
Use this template if you are uploading yields and you wish to specify the citations by DOI.
Use this template if you are uploading traits and you wish to specify the citations by author, year, and title.
Use this template if you are uploading traits and you wish to specify the citations by DOI.
These "templates" consist of a single line of text showing a typical header row for a CSV file. In the traits templates, the headings of the form "[trait variable 1]" or "[covariate 1]" must be replaced with actual variable names corresponding to a trait variable or covariate, respectively.
These templates show all possible columns that may be included. In most cases, fewer columns will be needed and the unneeded column headings should be removed. The only programmatically required headings are "yield" (for uploads of yield data), or, for uploads of trait data, the name of at least one recognized trait variable. All other data required for an upload—the citation, site, species, treatment, access level, and date—may be specified interactively, provided that they have a uniform value for all of the trait or yield data in the file being uploaded. (Specification of a cultivar is not required, but it too may be specified interactively if it has a uniform value for all of the data in the file.)
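For example, a minimal traits file using the author/year/title citation style might look like the following. The values are illustrative only, and canopy_height stands in for any recognized trait variable:

```
citation_author,citation_year,citation_title,site,species,treatment,date,canopy_height,access_level
Davis,2016,Example sorghum trial,Example Site,Sorghum bicolor,control,2016-07-15,1.42,2
```

If the wizard flags the date column during validation, compare your format against the error message; both the format and the range of dates are checked.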
Matching It is important that text values and trait or covariate column names in the data file match records in the database. This includes variable names, site names, species and cultivar names, etc. Note, however, that matching is somewhat lax: the matching is done case-insensitively, and extraneous spaces in values in the data file are ignored.
Some special cases of note: in the case of citation_title, the supplied value need only match an initial substring of the title specified in the database, as long as the combination of author, year, and the initial portion of the title uniquely identifies a citation stored in the database. (The value for citation_title may even be empty if the author and year together uniquely identify a citation!) And in the case of species names, the letter 'x' may be used to match the times symbol '×' used in names of hybrid species.
Column order The order of columns in the data file is immaterial; in making the template files, an arbitrary order was chosen. But because the data in the data file is displayed for review during the bulk upload process, it may be that some orderings are easier to work with than others.
Quotation rules Since commas are used to delineate columns in CSV files, any data value containing a comma must be surrounded by double quotes. (Single quotes are interpreted as part of the value!) If the value itself contains a double-quote, this double-quote must be doubled ("") in addition to surrounding the value with double quotes.
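For example, a site name containing a comma and a note containing a double quote would be written as follows (illustrative values):

```
site,notes
"Maricopa, AZ","the ""north"" subplot"
```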
Character encoding Non-ASCII characters must use UTF-8 encoding.
Blank lines There can be no blank lines in the file, either between data rows or at the end of the file.
Troubleshooting data files
Immediately after uploading a data file (or after specifying the citation if this is done interactively), the Bulk Upload Wizard tries to validate the uploaded file and displays the results of this validation.
The types of errors one may encounter at this stage fall into roughly three categories:
Parsing errors
These are errors at the stage of parsing the CSV file, before the header or data values are even checked. An error at this stage returns one to the file-upload page.
Header errors
These are errors caused by having an incongruous set of headings in the header row. Here are some examples:
There is a citation_author column heading without corresponding citation_year and citation_title headings. It is an error to use one of these headings without the other two.
There is both a citation_doi heading and a citation_author, citation_year, or citation_title heading. If citation_doi is used, none of the other citation-related headings is allowed.
There is an SE heading without an n heading, or vice versa.
There is neither a yield heading nor a heading corresponding to a recognized trait variable.
There is both a yield heading and a heading corresponding to a recognized trait variable. A data file can be used to insert data into the traits table or the yields table, but not both at once.
There is a cultivar heading but no species heading.
If any of these errors occur, validation of data values will not proceed.
There may be other errors associated with the header row that aren't treated as errors as such. For example, if you intend to supply two trait variables per row but misspell one of them, the data in the column headed by the misspelled variable name will simply be ignored. That column will be grayed-out, but the file may still be used to insert data corresponding to the "good" variable (provided there are no other errors). In other words, if you ignore the "ignored column" warning and the gray highlighting, you may end up uploading only a portion of the data you intended to upload.
Value errors
If there are no file-parsing errors or header errors, the Bulk Upload wizard will proceed to validate data values. Valid values will be highlighted in green. Ignored columns will be highlighted in gray. (This will warn you, for example, if you have misspelled the name of a trait variable.) Other colors signify various sorts of errors. A summary of errors is shown at the top of the page with links to rows in which the various errors occur.
Matching value errors
Each row of the CSV file must be associated with a unique citation, site, species, and treatment and may be associated with a unique cultivar. These associations may either be specified in the CSV file or, if a particular association is constant for all rows of the file, it may be specified interactively. If they are specified in the file, problems that may arise include:
The combination of values for citation_author, citation_year, and citation_title does not uniquely identify a citation in the database. (This may be because there are no matches or more than one match. There should never be multiple database rows having the same combination of author, year, and title, but this is not currently enforced.)
The value for citation_doi does not uniquely match a citation in the database. (Again, citation DOIs should be unique, but the database schema doesn't enforce this.)
The value for site does not uniquely match the sitename of a site in the database. (site.sitename should be unique, but this again is not enforced.)
The site specified in a given row is not consistent with the citation specified in that row. (If you visit the "Show" page for the site, you should see the citation listed at the top of the page right under Viewing Site.)
The value for species does not match the value of scientificname for a unique row of the species table. (species.scientificname should be unique, but the database schema doesn't currently enforce this.)
The value for treatment does not match the name of any treatment row in the database.
The value for treatment in a particular row matches one or more treatments in the database, but none are associated with the citation specified by that row.
The value for treatment in a particular row matches more than one treatment in the database that is associated with the citation specified by that row. (This error is rare. Names of treatments associated with a particular citation should be unique, but this is not yet enforced.)
The value for cultivar specified in a particular row is not consistent with the species specified in that row.
Other value errors, not having to do with associated attributes of the data, are as follows:
A value for a trait is out of range. An obvious example would be giving a negative number as the value for annual yield. If a variable value is flagged as being out of range, double check the data. If you determine that the value is indeed correct, you should request to have the range in the database adjusted for that variable.
A value for the measurement date is not in the correct format or is out of range.
A value for the access level is not 1, 2, 3, or 4.
A value of the wrong type is given. Examples would be giving a text value for yield or a floating-point number for n.
Global options and values
If there are no errors in the data file, the bulk upload will proceed to a page allowing you to choose rounding options for your data values. You may choose to keep 1, 2, 3, or 4 significant digits, 3 being the default. If your data includes a standard error (SE) column, you may separately specify the amount of rounding for the standard error. Here the default is 2 significant digits.
If you did not specify all associated-data values or an access level in the data file itself, this page will also allow you to specify a uniform global value for any association not specified in the file, and a uniform access level if your data file did not have an access_level column.
Verification page
Once you have specified global options and values, you will be taken to a verification page that will summarize the global options you have selected and the associations you specified for your data. The latter will be presented in more detail than any specification in your data file or on the Upload Options and Global Values page. For example, when summarizing the sites associated with your data, not only are the site names listed, but the city, state, country, latitude, longitude, soil type, and soil notes are also displayed. This will help ensure that the citations, sites, species, etc. that you specified are really the ones that you intended.
Once you have verified the data, clicking the Insert Data button will complete the upload. The insertions are done in an SQL transaction: if any insertion fails, the entire transaction is rolled back.
For a variable to be recognized as a trait variable or covariate, it is not enough for it simply to be in the variables table; it must also be in the trait_covariate_associations table. To check which variables will be recognized as trait variables or covariates, click on the Bulk Upload tab. Then click the link View List of Recognized Traits. This will bring up a table that lists the names of all variables recognized as traits and the names of all variables recognized as required or optional covariates for each trait. If you need to add to this table and do not have direct access to the underlying database to which you are submitting data, you will need to e-mail the site administrator to request additions. (See the "Contact Us" section in the footer of the homepage.)
CoGe supports the genomics pipeline required for the TERRA program for Sorghum sequence alignment and analysis. It has a web interface and REST API. CoGe is developed by Eric Lyons and hosted at the University of Arizona, where it is made available for researchers to use. CoGe can be hosted on any server, VM, or Docker container.
Upload files to the CyVerse Data Store. The TERRA-REF project has a 2 TB allocation.
Use icommands to transfer files to the Data Store (a minimal sketch follows this list)
project directory: /iplant/home/shared/terraref
Raw data goes in the subdirectory raw_data/, which is writable only by those sending raw reads.
CoGe output can go into output/
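As a sketch, a recursive upload with the iput icommand could be scripted from Python as below. This assumes the icommands are installed and iinit has already been run; the local path is a placeholder:

```python
import subprocess

# Push a local directory of raw reads into the shared TERRA-REF allocation.
subprocess.run(
    ["iput", "-r", "local_raw_reads/",
     "/iplant/home/shared/terraref/raw_data/"],
    check=True,
)
```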
Transferring data from Roger to iplant data store