ESG publication GUI
Overview of the ESG publisher GUI
The ESG Publisher is an application that makes data resident on a local system (the node) visible for search and download from an ESG web portal running on the gateway system. Data may be disk-resident or on archival mass storage accessible from the node. The publisher supports the following functions:
- Specify project and dataset
- Scan data (data extraction)
- Modify metadata
- Publish datasets
- Delete datasets
- Query datasets
There are two equivalent ways to publish data:
- From a graphical user interface, described in this document;
- Using a script.
In both cases the publication steps are as follows:
- Authenticate with a MyProxy server. This is a single sign-on that generates a limited lifetime proxy certificate. Note that this certificate is only used for the final gateway publication step.
- Scan the data files in one or more datasets, and write the scanned metadata to a relational database:
- Specify project and dataset
- Scan the data files and extract metadata
- Optionally modify dataset information.
There are a few basic ideas and assumptions to know about publishing:
- In ESG, datasets are hierarchical collections. Typically a dataset will consist of a set of associated data files. When a dataset is published, two of the important pieces of information passed to the gateway are the Dataset Identifier and the Parent Identifier. This says to the gateway: 'publish dataset X in the parent collection Y'. The Publisher application supports user-defined naming conventions for consistent, automated generation of identifiers. A common convention is fields separated with periods, for example, 'pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly'.
- At the top level datasets are organized into projects. A project is the activity for which the datasets are generated. A prominent example of a project is the IPCC Fifth Assessment Report (AR5). It is up to the data publisher to decide what datasets belong to a project.
- The publisher application is designed to scan large volumes of data. Multiple datasets can be scanned and published in a single operation. In the extreme, if all datasets for a project are stored in a single, consistently-named directory structure, they can all be published at one time.
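The period-separated identifier convention mentioned above can be sketched in Python. The helper name and the field values below are illustrative assumptions; the actual naming convention for each project is defined in the publisher configuration:

```python
# Sketch of a period-separated dataset identifier, following the
# 'pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly' convention above.
# The field order and values are illustrative assumptions; the real
# convention is configured per project in esg.ini.

def make_dataset_id(*fields):
    """Join metadata fields into a period-separated dataset identifier."""
    return ".".join(str(f).strip().lower().replace(" ", "_") for f in fields)

parent_id = make_dataset_id("pcmdi", "ipcc4", "gfdl_cm2_0", "20c3m", "run1")
dataset_id = make_dataset_id("pcmdi", "ipcc4", "gfdl_cm2_0", "20c3m", "run1", "monthly")
print(parent_id)   # pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1
print(dataset_id)  # pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
```

Note how the dataset identifier extends its parent identifier by one field, which is what lets the gateway place the dataset in the correct parent collection.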
Metadata refers to information about data, for example, the name of the model and the experiment identifier. A main function of the publisher is to determine what metadata fields are associated with a dataset and the values of those fields. The set of metadata fields to be scanned is determined by the project, and can vary between projects. This information is stored in a configuration file, esg.ini.
- There are several ways the publisher can read metadata:
- Scanning the headers of the data files. If the data files are online (read-accessible) and in CF-compliant netCDF format, the publisher can read the information during file scanning. To handle other formats, data handlers can be plugged in to the publisher software. If the data is offline (resident on archival storage) a different method is used.
- Some metadata can be determined from the directory structure of the data files, if the data is consistently organized. Data may be resident in more than one top-level (root) directory.
- Metadata can be input directly by the data publisher, on a dataset by dataset basis or as default values for all datasets to be published.
- The publisher stores metadata in a database on the node. Once a collection of files is scanned, it does not need to be rescanned, for instance when a few files are to be added or modified; only the new data must be processed. The publisher supports querying the database for information on what has already been published, and for the status of ongoing publication operations.
- The last step is to contact the gateway to finalize the publication process. At this point the publisher must authenticate to the gateway. This is done by running the 'myproxy-logon' client from the command line to generate a temporary certificate. The location of the certificate is part of the publisher configuration. The lifetime of the certificate is 12 hours by default; during that period the publisher can reuse the certificate.
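The decision to reuse an existing certificate can be sketched as below. The path and the age-based check are illustrative assumptions; the real proxy certificate encodes its own expiration time, and the actual location comes from the hessian_service_certfile configuration parameter:

```python
# Sketch: decide whether an existing proxy certificate can be reused,
# based on the file's age versus the default 12-hour lifetime.
# The age-based check is an illustrative assumption; the real
# certificate carries its own expiration time.
import os
import time

CERT_LIFETIME_SECONDS = 12 * 60 * 60  # default 12-hour lifetime

def certificate_usable(certfile, lifetime=CERT_LIFETIME_SECONDS):
    """Return True if certfile exists and was written within its lifetime."""
    if not os.path.exists(certfile):
        return False
    age = time.time() - os.path.getmtime(certfile)
    return age < lifetime
```

If the check fails, rerun myproxy-logon to obtain a fresh certificate before publishing.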
At this point we assume that the ESG publication clients are installed and configured for your system. Before running the ESG Publisher client for the first time:
- Find the location of the myproxy-logon client, and verify your username and password.
- Find the location of python and the publication clients. Add this directory to your path.
- Create the directory $HOME/.esgcet. Copy the initialization file esg.ini to this directory. You can edit this file to override the system-wide defaults.
- In esg.ini:
- Search for the parameter thredds_root. This is the root directory for TDS catalogs. Ensure that you have read-write permissions to this directory and its contents.
- Find the parameter hessian_service_certfile. This is the location where myproxy-logon should write the proxy certificate.
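A minimal sketch of checking these two parameters with Python's standard configparser module, which reads INI-format files like esg.ini. The section name and the sample values below are assumptions for illustration; consult your installed esg.ini for the real ones:

```python
# Sketch: read thredds_root and hessian_service_certfile from an
# esg.ini-style file. The sample section and values are assumptions.
import configparser

sample_ini = """\
[DEFAULT]
home = /home/publisher
thredds_root = /esg/content/thredds/esgcet
hessian_service_certfile = %(home)s/.globus/certificate-file
"""

config = configparser.ConfigParser()
config.read_string(sample_ini)
# In practice: config.read(os.path.expanduser("~/.esgcet/esg.ini"))

thredds_root = config.get("DEFAULT", "thredds_root")
certfile = config.get("DEFAULT", "hessian_service_certfile")
print(thredds_root)  # /esg/content/thredds/esgcet
print(certfile)      # /home/publisher/.globus/certificate-file
```

Note that configparser performs %(name)s interpolation, so hessian_service_certfile can reference other options such as home.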
To start the ESG publisher GUI:
% esgpublish_gui &
This screen appears:
- The collapsible function menu corresponds to the publication steps.
- The dataset window lists the datasets currently being processed, or the results of a query.
- The output window echoes the logging output.
- The status bar indicates the progress of a dataset scan or publication.
Ensure that the Globus environment is set. For example, in csh:
% setenv GLOBUS_LOCATION <globus-directory>
% source $GLOBUS_LOCATION/etc/globus-user-env.csh
Or in Bourne shell:
% GLOBUS_LOCATION=<globus-directory>
% export GLOBUS_LOCATION
% source $GLOBUS_LOCATION/etc/globus-user-env.sh
Then authenticate:
% myproxy-logon -s <host> -l <username> -p <port> -o <certificate_location>
Enter MyProxy pass phrase:
A credential has been received for user <username> in <certificate_location>.
- If no dataset entries appear in the dataset window, it means that none of the directories specified match the directory templates for the selected project. To correct this, edit the directory_format option in the current project section of the configuration file, adding an entry to match the directory name.
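The matching of directory names against a directory_format template can be sketched as follows. The template string, the field names, and the sample path are illustrative assumptions; the real templates live in the project section of esg.ini:

```python
# Sketch: match a data directory against a directory_format template
# whose %(field)s placeholders extract metadata fields (Python 3.7+).
# The template and path below are illustrative assumptions.
import re

def template_to_regex(template):
    """Turn '%(name)s' placeholders into named regex groups."""
    pattern = re.escape(template)
    # re.escape turns '%(model)s' into '%\(model\)s'; rewrite those
    # into named groups that match one path component each.
    pattern = re.sub(r"%\\\((\w+)\\\)s", r"(?P<\1>[^/]+)", pattern)
    return re.compile(pattern + "$")

directory_format = "/esg/data/%(project)s/%(model)s/%(experiment)s/%(frequency)s"
regex = template_to_regex(directory_format)

match = regex.match("/esg/data/ipcc4/gfdl_cm2_0/20c3m/monthly")
if match:
    print(match.groupdict())
# -> {'project': 'ipcc4', 'model': 'gfdl_cm2_0', 'experiment': '20c3m', 'frequency': 'monthly'}
```

A directory that does not match any template yields no dataset entry, which is why adding an entry to directory_format fixes an empty dataset window.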
- The Fields button allows you to set values for mandatory fields for this project. In general it is better to let the publisher discover metadata values if possible, especially if multiple datasets are processed. The specified values will override any information in the file or directory name, so use this option with caution.
- The options are:
- Create/Replace to create a new dataset or replace an existing one.
- Append/Update to add or replace files in an existing dataset; if the dataset does not exist, it is created.
- During the scan:
- the status bar tracks the scan progress,
- the output window echoes logging information
- the error tab shows all errors
- When the scan is complete:
- the dataset window displays metadata for each dataset scanned. Datasets for which scan errors occurred are highlighted in red.
- the Data Publication menu is enabled.
Select a dataset entry in the dataset window:
- Metadata for that dataset is displayed
- Mandatory fields are highlighted in blue. (Fields are defined as mandatory in the configuration file).
- Field values found during the scan are displayed in red. Fixed fields are shaded red, all other fields can be modified.
- When done, select 'Save Changes'. This saves changes to the database.
Select the Data Publication function menu. The dataset window shows the datasets just scanned. By default all the datasets are selected for publication. Click the check boxes to toggle selection.
When Publish is clicked:
- THREDDS data server (TDS) catalogs are created for each new dataset.
- The TDS is reinitialized.
- If the Live Access Server is implemented, the LAS configuration file is written and LAS reinitialized.
- The gateway server is notified that new datasets have been published.
To delete a dataset (just the metadata, not the files):
- Select the dataset(s) to be deleted by clicking the leftmost button in the dataset window. The datasets indicated by checkboxes will be removed.
- From the Dataset pulldown menu, click 'Remove'.
- When the deletion is complete, the status of the datasets changes to 'GW Deletion' (gateway deletion).
Note: A record of the dataset is left in the database. To remove this record as well, use the esgunpublish script.
Select 'Query Data Information' in the Dataset Query menu. The matching datasets appear in the dataset window. You can select, republish, or delete any of the displayed datasets.