
ESG publication GUI

Overview of the ESG publisher GUI

The ESG Publisher is an application that makes data resident on a local system (the node) visible for search and download from an ESG web portal, running on the gateway system. Data may be disk-resident or on archival mass storage accessible from the node.

Basic Ideas
Getting Started
Publication Steps

There are two equivalent ways to publish data:

  • From a graphical user interface, described in this document;
  • Using a script.

 

In both cases the publication steps are as follows:

 

[Figure: ESG node diagram]

  1. Authenticate with a MyProxy server. This is a single sign-on that generates a limited lifetime proxy certificate. Note that this certificate is only used for the final gateway publication step.
  2. Scan the data files in one or more datasets, and write the scanned metadata to a relational database.
  3. Write the metadata to one or more THREDDS Data Server (TDS) catalogs, and reinitialize the TDS from the catalogs.
  4. If the Live Access Server (LAS) is implemented on the node, generate an LAS configuration file and reinitialize LAS.
  5. Publish the new TDS catalogs to the gateway.


The publisher GUI also supports dataset deletion and queries.

Basic Ideas

There are a few basic ideas and assumptions to know about publishing:

  • In ESG, datasets are hierarchical collections. Typically a dataset consists of a set of associated data files. When a dataset is published, two of the important pieces of information passed to the gateway are the Dataset Identifier and the Parent Identifier. This tells the gateway: 'publish dataset X in the parent collection Y'. The Publisher application supports user-defined naming conventions for consistent, automated generation of identifiers. A common convention is fields separated by periods, for example, 'pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly' (see the configuration example following this list).
  • At the top level datasets are organized into projects. A project is the activity for which the datasets are generated. A prominent example of a project is the IPCC Fifth Assessment Report (AR5). It is up to the data publisher to decide what datasets belong to a project.
  • The publisher application is designed to scan large volumes of data. Multiple datasets can be scanned and published in a single operation. In the extreme, if all datasets for a project are stored in a single, consistently named directory structure, they can all be published at one time.
  • Metadata refers to information about data, for example, the name of the model and the experiment identifier. A main function of the publisher is to determine what metadata fields are associated with a dataset and the values of those fields. The set of metadata fields to be scanned is determined by the project, and can vary between projects. This information is stored in a configuration file, esg.ini.

  • There are several ways the publisher can read metadata:
    • Scanning the headers of the data files. If the data files are online (read-accessible) and in CF-compliant netCDF format, the publisher can read the information during file scanning. To handle other formats, data handlers can be plugged into the publisher software. If the data is offline (resident on archival storage), a different method is used.
    • Some metadata can be determined from the directory structure of the data files, if the data is consistently organized. Data may be resident in more than one top-level (root) directory.
    • Metadata can be input directly by the data publisher, on a dataset by dataset basis or as default values for all datasets to be published.
  • The publisher stores metadata in a database on the node. Once a collection of files is scanned, it does not need to be rescanned when, for instance, a few files are added or modified; only the new data must be processed. The publisher supports querying the database for information on what has already been published, and for the status of ongoing publication operations.
  • The last step is to contact the gateway to finalize the publication process. At this point the publisher must authenticate to the gateway. This is done by running the 'myproxy-logon' client from the command line to generate a temporary certificate. The location of the certificate is part of the publisher configuration. The lifetime of the certificate is 12 hours by default; during that period the publisher can reuse the certificate.
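
For example, the project section of esg.ini might define an identifier template along these lines (a hypothetical fragment: the section header syntax, the dataset_id option, and the field names are illustrative assumptions):

[project:ipcc4]
dataset_id = pcmdi.%(project)s.%(model)s.%(experiment)s.%(run_name)s.%(time_frequency)s

With such a template, a scan of GFDL CM2.0 20c3m run1 monthly data would generate the identifier 'pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly'.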

Getting Started

At this point we assume that the ESG publication clients are installed and configured for your system. Before running the ESG Publisher client for the first time:

  • Find the location of the myproxy-logon client, and verify your username and password.
  • Find the directory containing python and the publication clients, and add it to your path.
  • Create the directory $HOME/.esgcet, and copy the initialization file esg.ini to this directory. You can edit this file to override the system-wide defaults. (Example commands follow this list.)
  • In esg.ini:
    • Search for the parameter thredds_root. This is the root directory for TDS catalogs. Ensure that you have read-write permission for this directory and its contents.
    • Find the parameter hessian_service_certfile. This is the location where myproxy-logon should write the proxy certificate.
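
For example, in csh (a sketch only: the installation directory /usr/local/esgcet and the location of the site-wide esg.ini are assumptions, so substitute your system's paths):

% set path = ( /usr/local/esgcet/bin $path )
% mkdir $HOME/.esgcet
% cp /usr/local/esgcet/etc/esg.ini $HOME/.esgcet/esg.ini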


To start the ESG publisher GUI:

% esgpublish_gui &

The main window appears. Its components are:

  • The collapsible function menu corresponds to the publication steps.
  • The dataset window lists the datasets currently being processed, or the results of a query.
  • The output window echoes the logging output.
  • The status bar indicates the progress of a dataset scan or publication.

 

Step 1: Authenticate with the MyProxy server:

# Ensure that the Globus environment is set. For example, in csh:
% setenv GLOBUS_LOCATION <globus-directory>
% source $GLOBUS_LOCATION/etc/globus-user-env.csh

# Or in Bourne shell:
% GLOBUS_LOCATION=<globus-directory>
% export GLOBUS_LOCATION
% source $GLOBUS_LOCATION/etc/globus-user-env.sh

% myproxy-logon -s <host> -l <username> -p <port> -o <certificate_location>
Enter MyProxy pass phrase:
A credential has been received for user <username> in <certificate_location>.
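
Note: the <certificate_location> argument should match the hessian_service_certfile setting in esg.ini (see Getting Started), so that the publisher can find the proxy certificate during the final gateway publication step.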

Step 2a: Specify Project and Dataset:

  • Select the project from the pulldown menu.
  • If the datasets are offline, on tertiary storage, check the Off-line box. Otherwise, if the data to be scanned is online and accessible, choose On-line.
  • The files to be scanned can be selected from a directory, or from a mapfile generated with the esgscan_directory script (see the example after the notes below). After the files are selected, the datasets are listed in the Dataset Window.

  • Notes:

    • If no dataset entries appear in the dataset window, it means that none of the directories specified match the directory templates for the selected project. To correct this, edit the directory_format option in the current project section of the configuration file, adding an entry that matches the directory name (see the example below).
    • The Fields button allows you to set values for mandatory fields for this project. In general it is better to let the publisher discover metadata values if possible, especially if multiple datasets are processed. The specified values will override any information in the file or directory name, so use this option with caution.
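
For example, a directory_format entry in the project section of esg.ini might look like this (a hypothetical template: the section header and field names are illustrative and must match your directory layout):

[project:ipcc4]
directory_format = %(root)s/%(project)s/%(model)s/%(experiment)s/%(run_name)s/%(time_frequency)s

To scan from a mapfile instead, generate one with esgscan_directory and redirect its output (the option shown is an assumption; check esgscan_directory --help for your installation):

% esgscan_directory --project ipcc4 /data/ipcc4/gfdl_cm2_0 > ipcc4.map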
       

Step 2b: Data Extraction:

  • The options are:
    • Create/Replace, to create a new dataset or replace an existing one.
    • Append/Update, to add or replace files in an existing dataset; if the dataset does not exist, the default is to create one.
  • During the scan:
    • the status bar tracks the scan progress,
    • the output window echoes logging information,
    • the error tab shows all errors.
  • When the scan is complete:
    • the dataset window displays metadata for each dataset scanned. Datasets for which scan errors occurred are highlighted in red.
    • the Data Publication menu is enabled.

Step 2c: Modify Metadata:

Select a dataset entry in the dataset window:

  • Metadata for that dataset is displayed.
  • Mandatory fields are highlighted in blue. (Fields are defined as mandatory in the configuration file.)
  • Field values found during the scan are displayed in red. Fixed fields are shaded red; all other fields can be modified.
  • When done, select 'Save Changes'. This saves the changes to the database.

Steps 3-5: Data Publication:

Select the Data Publication function menu. The dataset window shows the datasets just scanned. By default, all the datasets are selected for publication. Click the check boxes to toggle selection.

When Publish is clicked:

  • THREDDS Data Server (TDS) catalogs are created for each new dataset.
  • The TDS is reinitialized.
  • If the Live Access Server is implemented, the LAS configuration file is written and LAS is reinitialized.
  • The gateway server is notified that new datasets have been published.

The publishing scripts can be used to run the above steps separately if desired.
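
For example, a hypothetical command-line sequence (the esgpublish option names here are assumptions based on common usage of the script and should be verified with esgpublish --help):

% esgpublish --project ipcc4 --map ipcc4.map                      # step 2: scan files and record metadata
% esgpublish --project ipcc4 --map ipcc4.map --noscan --thredds   # step 3: write TDS catalogs and reinitialize the TDS
% esgpublish --project ipcc4 --map ipcc4.map --noscan --publish   # step 5: notify the gateway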

     

Dataset Deletion

To delete a dataset (just the metadata, not the files):

  • Select the dataset(s) to be deleted by clicking the leftmost button in the dataset window. The datasets whose boxes are checked will be deleted.
  • From the Dataset pulldown menu, click 'Remove'.
  • When the deletion is complete, the status of the datasets changes to 'GW Deletion' (gateway deletion).

Note: A record of the dataset is left in the database. To remove this record as well, use the esgunpublish script.
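
For example, using the dataset identifier from Basic Ideas (the exact options vary by version; check esgunpublish --help):

% esgunpublish pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly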

Querying

Select 'Query Data Information' in the Dataset Query menu. The matching datasets appear in the Dataset window. You can select, republish, or delete any of the displayed datasets.

