
ESG publication scripts

Publishing ESG data and managing ESG node services.

Contents

  • An Overview of Data Publication
  • A First Example
  • Node architecture
  • Publication Steps
  • Example: Specifying the dataset with a mapfile
  • Unpublishing
  • Publishing checksums and other metadata
  • Publishing technical notes
  • Querying
  • Authentication
  • Initialization of models and standard names

An Overview of Data Publication

Publication is the process of making data visible for search and download from an ESG gateway. We assume that a person responsible for publication (the data publisher) has a collection of files that are to be made visible via the gateway portal software. Each file belongs to a collection of files called a dataset, and each dataset is part of a hierarchical collection of datasets associated with a project. Datasets and projects have unique ESG names; the exact form of a dataset name is configurable and depends on the project.

The publication utilities operate on one or more datasets at a time.

The scripts covered here are:

  • esgpublish: Publish one or more datasets.
  • esgunpublish: Remove metadata for one or more datasets.
  • esglist_datasets: Query local dataset information.
  • esgscan_directory: Create a mapfile by scanning one or more directories. Mapfiles may be used as input to esgpublish, esgunpublish, and esgpublish_gui.
  • esginitialize: Create database tables and initialize them with model and standard name information.
  • esgupdate_metadata: Add checksums, technical notes, and other metadata to an existing dataset.

A First Example

Suppose we want to publish a dataset consisting of all netCDF files in the directories /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1. (We assume the data publisher has already authenticated with a MyProxy server.) One way is with the esgpublish script:

% esgpublish --project ipcc4 --thredds --publish /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1
INFO       2008-12-24 14:26:07,256 Creating dataset: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
INFO       2008-12-24 14:26:07,260 Scanning /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.186101-196012.nc
INFO       2008-12-24 14:26:08,289 Scanning /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.196101-200012.nc
...
INFO       2008-12-24 14:26:36,171 Adding file info to database
INFO       2008-12-24 14:26:36,711 Aggregating variables
INFO       2008-12-24 14:26:37,197 Writing THREDDS catalog ...
INFO       2008-12-24 14:26:37,245 Writing THREDDS master catalog ...
INFO       2008-12-24 14:26:37,247 Reinitializing THREDDS server
INFO       2008-12-24 14:26:37,627 Publishing: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
INFO       2008-12-24 14:26:39,940   Result: SUCCESSFUL

At this point the dataset and its files are visible from the gateway, and can be searched and downloaded.

Node architecture

Before looking at the example in detail, we need to review the architecture of an ESG node. A node is a host on which a data archive resides, or from which an archive is published. In contrast, an ESG gateway is a site that supports portal services such as search and user authentication. End users interact with a portal to search for and download data.

[Architecture diagram: node components in green, data and configuration files in blue, and the subset of gateway services that interact with the publisher in red.]
The node clients and servers are:

  • esgpublish: the ESG publisher application.
  • TDS: the THREDDS Data Server. TDS serves datasets described in a collection of XML catalogs generated by esgpublish.
  • LAS: the Live Access Server. LAS serves visualization and data products.
  • Postgres: the relational database server. A relational database is used to store metadata scanned from data files. The publication utilities can also interoperate with a MySQL server.


The related gateway components are:

  • gateway server: esgpublish communicates with the gateway server to publish dataset metadata. Executing gateway server operations requires authentication.
  • MyProxy: the authentication component of the gateway, which generates a user proxy certificate.

Publication Steps

In brief the publication steps are:

  1. Authenticate with a MyProxy server. This is a single sign-on that generates a limited lifetime proxy certificate. Note that this certificate is only used for gateway publication (the last step).
  2. Scan the data files in one or more datasets, and write the scanned metadata to a relational database. Optionally modify or update dataset metadata.
  3. Write the metadata to one or more TDS catalogs, and reinitialize the TDS from the catalogs.
  4. If LAS is implemented on the node, generate an LAS configuration file and reinitialize LAS.
  5. Publish the new TDS catalogs to the gateway.
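
Depending on which options are given, esgpublish performs a subset of these steps. For example, using the directories from the first example (a sketch based on the option descriptions below):

% esgpublish --project ipcc4 /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1                      # step 2 only: scan into the database
% esgpublish --project ipcc4 --thredds /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1            # steps 2-3: scan, then write TDS catalogs
% esgpublish --project ipcc4 --thredds --publish /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1  # steps 2, 3, and 5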

Here's what happens when we run esgpublish in the first example. The script:

  • takes as input a project name and one or more datasets in that project to be published.
    • In the example above the project is named "ipcc4", and the dataset consists of all netCDF files contained in the directories /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1. The project name is supplied with the --project option:
      % esgpublish --project ipcc4 ...
      and the directory names are the trailing arguments.
    • Instead of directories, a dataset can also be described with a mapfile. See the next example.
    • The dataset name is generated automatically:
      Creating dataset: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly

      The form of dataset names is defined in the publisher configuration file (see the sketch after this list). Although it is possible to supply the dataset name manually, it is generally simpler and more reliable to let esgpublish generate it.

  • scans the dataset files for metadata, if they are online.
    • Each file is opened, and the file metadata is read and stored in the relational database. It is possible to publish offline datasets for which the files cannot be opened directly, in which case the scan is skipped.
    • For online datasets the files are also aggregated, typically by time. If any two files have the same variable with overlapping time ranges, a warning is printed.
  • adds or modifies dataset metadata. In this example we have not modified any dataset metadata.
  • creates THREDDS catalogs for each dataset, and reinitializes TDS. The option to create TDS catalogs is --thredds:
    % esgpublish --thredds ...
    By default esgpublish scans the data files but does not create the TDS catalogs.
  • publishes the THREDDS catalogs to the gateway. The --publish option tells esgpublish to contact the gateway and publish the newly created catalogs. This is the only step requiring a valid MyProxy certificate.
    % esgpublish --publish ...
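
For illustration, the generated name above could come from a template in the project section of the configuration file along these lines (hypothetical; the actual option value is project- and site-specific):

dataset_id = pcmdi.%(project)s.%(model)s.%(experiment)s.%(run_name)s.%(time_frequency)s

The %(...)s fields are filled in with values extracted during the scan or parsed from the directory structure.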

 

Example: Specifying the dataset with a mapfile

 

A mapfile is a text file that lists the members of a dataset and their file sizes. Each line of the mapfile has the form:

dataset_name | absolute_path | byte_length [ | property=value [ | property=value ...]]

Note: The optional property=value fields apply to Esgcet V2.0 and greater. They may take the form:

  • mod_time=epochal_time where epochal_time is expressed as seconds since January 1, 1970. This field specifies the file modification time.
  • checksum=checksum_value
  • checksum_type=checksum_type, where checksum_type may be MD5 or SHA1

As of Esgcet V2.0 the above fields are generated by esgscan_directory.
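
If a mapfile is assembled by hand instead, the checksum field can be computed with a standard utility such as md5sum; for example, for the first file of the sample mapfile shown below:

% md5sum /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.186101-196012.nc
521f04d8dfd2d5c0b6210484390963a1  /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.186101-196012.nc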

For esgpublish rename operations (--rename-files) the field

  • from_file=path

may be used to specify the file that was replaced or renamed.
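
For example (the old path here is hypothetical), a mapfile line recording a rename might read:

pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.196101-200012.nc | 622104080 | from_file=/ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.196101-200012_old.nc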

A mapfile can specify more than one dataset. An easy way to generate a mapfile is with the esgscan_directory script. By default it prints the mapfile to standard output:

% esgscan_directory --project ipcc4 /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1 > sample.txt
% cat sample.txt
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.186101-196012.nc | 1555241356 | mod_time=1205377080.522737 | checksum=521f04d8dfd2d5c0b6210484390963a1 | checksum_type=MD5
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.196101-200012.nc | 622104080 | mod_time=1205377893.725071 | checksum=db70da20849a8dd18dcf5871b85222ef | checksum_type=MD5
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/clivi/gfdl_cm2_0/run1/clivi_A1.186101-200012.nc | 87140380
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/clt/gfdl_cm2_0/run1/clt_A1.186101-200012.nc | 87140364
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/clwvi/gfdl_cm2_0/run1/clwvi_A1.186101-200012.nc | 87140468

The advantage of using a mapfile is that it can be edited before input to esgpublish. This is useful when some but not all files in a directory are to be published. Mapfiles are also used to publish offline datasets. Lines may be commented with a pound sign (#) in the first column.
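
For example, to exclude the clivi file from the publication above, comment out its line in sample.txt:

# pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/clivi/gfdl_cm2_0/run1/clivi_A1.186101-200012.nc | 87140380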

[Debugging hint: esgscan_directory matches file paths against the directory_format option. If esgscan_directory does not return any output, make sure that the directory name matches the value of directory_format. If directory_format is an absolute pathname, the directory argument to esgscan_directory should also be absolute. Also note that directory_format is interpreted as a Python regular expression, so special regular expression characters can be used.

To see what esgscan_directory is doing (as of Esgcet V2.5), set log_level = DEBUG in the [extract] section of esg.ini.]
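
For illustration, a directory_format consistent with the first example might look like the following (hypothetical; the actual value is set per project in the configuration file):

directory_format = /ipcc/%(experiment)s/%(submodel)s/%(time_frequency)s/%(variable)s/%(model)s/%(run_name)s

This would match /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1 and capture the fields used to build the dataset name.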

To publish data in a mapfile use the --map option instead of directory names:

 

% esgpublish --project ipcc4 --thredds --publish --map sample.txt
INFO       2008-12-24 15:24:55,071 Creating dataset: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
...

 

Unpublishing

 

Unpublishing is the reverse operation of publication. It removes dataset metadata from the relevant ESG components: the gateway, TDS, and the node (local) database. In this example we use esgunpublish to remove the dataset published above:

 

% esgunpublish --database-delete pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
INFO       2009-01-05 14:15:36,391 Deleting: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
INFO       2009-01-05 14:15:38,712   Result: SUCCESSFUL
INFO       2009-01-05 14:15:38,721 Deleting THREDDS catalog: /usr/local/apache-tomcat-6.0.18/content/thredds/esgcet/1/pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly.xml
INFO       2009-01-05 14:15:38,723 Writing THREDDS master catalog /usr/local/apache-tomcat-6.0.18/content/thredds/esgcet/catalog.xml
INFO       2009-01-05 14:15:38,727 Reinitializing THREDDS server
INFO       2009-01-05 14:15:38,919 Deleting existing dataset: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly

Note:

  • The trailing argument is the dataset name. The --map option can also be used, in which case all datasets in the mapfile are deleted (see the example below).
  • Unpublishing removes data in the reverse order of publishing: first the gateway, then TDS, and finally the local database.
  • By default the scan metadata is not removed from the database. The --database-delete option forces the removal.
  • On the gateway there are two flavors of deletion: full and partial. (At the time of writing only full deletion is implemented.) Both remove the dataset from visibility on the portal, but partial deletion leaves metrics and versioning information intact.
  • Unpublishing does not remove or alter the underlying data files.
  • Only the gateway deletion step requires authentication.
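
For example, to remove all datasets listed in the earlier mapfile, including their scan metadata:

% esgunpublish --database-delete --map sample.txt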

Each unpublishing step can be done separately if needed. The previous example is equivalent to:

% esgunpublish --skip-thredds pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
% esgunpublish --skip-gateway pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
% esgunpublish --database-delete --skip-thredds --skip-gateway pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly


Publishing checksums and other metadata

 

In some cases it is necessary or convenient to add metadata to an existing dataset without unpublishing and rescanning it. This can be accomplished with the esgupdate_metadata script (Esgcet V1.9 and later).

 

The process is as follows:

 

  • Create a mapfile with the additional metadata.
  • Run esgupdate_metadata with the mapfile as the argument.

 

For example, if mapfile foo.txt contains:

 

cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/areacella/1/areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49240|checksum=8253d332f4d7a5402b38ae8878fb0908|checksum_type=MD5
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/orog/1/orog_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49732|checksum=dbc0879685d1dc51ee4e7a6c44a75775|checksum_type=MD5
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/sftlf/1/sftlf_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49444|checksum=a406c5f2ac076a3f1754295b7b7badff|checksum_type=MD5

Then running:

 

% esgupdate_metadata foo.txt

 

adds the checksum information to the local database. The THREDDS catalog generation and publication are then run as usual:

 

% esgpublish --noscan --map foo.txt --project cmip5 --thredds --publish

 

Note that if the dataset has not changed since being published, it is not necessary to unpublish the existing dataset. The checksum information will be added to the gateway database, and can be retrieved with esgquery_gateway:

 

% esgquery_gateway --urls cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name                                        | id | size  | capability | url                                                                                                                                                                     | checksum                         | algorithm |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc | 1  | 49240 | HTTPServer | http://pcmdi3.llnl.gov/thredds/fileServer/cmip5_data/cmip5/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/areacella/1/areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc | 8253d332f4d7a5402b38ae8878fb0908 | MD5       |
| orog_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc      | 1  | 49732 | HTTPServer | http://pcmdi3.llnl.gov/thredds/fileServer/cmip5_data/cmip5/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/orog/1/orog_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc           | dbc0879685d1dc51ee4e7a6c44a75775 | MD5       |
| sftlf_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc     | 1  | 49444 | HTTPServer | http://pcmdi3.llnl.gov/thredds/fileServer/cmip5_data/cmip5/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/sftlf/1/sftlf_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc         | a406c5f2ac076a3f1754295b7b7badff | MD5       |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

 

Publishing technical notes

 

A technical note is a file that contains additional information about a data file or dataset. When the URL of a technical note is published, it is made visible in the associated THREDDS catalog.

 

The process of publishing technical notes is similar to publishing checksums, as described in the previous section. The following fields may be added to a mapfile:

  • tech_notes=URL   # URL of technical notes to associate with a data file
  • tech_notes_title=string   # Optional title of file-related tech notes, if tech_notes option is set
  • dataset_tech_notes=URL   # URL of technical notes to associate with a dataset
  • dataset_tech_notes_title=string   # Optional title of dataset-related tech notes, if dataset_tech_notes option is set

 

For example, if file mapfile.foo.text contains:

cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/areacella/1/areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49240|tech_notes=http://pcmdi3.llnl.gov/thredds/areacella_notes.html|tech_notes_title=AREACELLA notes
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/orog/1/orog_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49732|tech_notes=http://pcmdi3.llnl.gov/thredds/orog_notes.html|tech_notes_title=OROG notes
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/sftlf/1/sftlf_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49444|tech_notes=http://pcmdi3.llnl.gov/thredds/sftlf_notes.html|tech_notes_title=SFTLF notes

 

then, assuming the dataset has not yet been published, run:

% esgpublish --map mapfile.foo.text --project myproject

to add the per-file technical notes metadata. Similarly, if the dataset has already been published, then the esgupdate_metadata script can be used to add the notes.
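
For instance, following the pattern of the previous section (same hypothetical mapfile):

% esgupdate_metadata mapfile.foo.text
% esgpublish --noscan --map mapfile.foo.text --project myproject --thredds --publish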

 

For dataset-level technical notes, it is only necessary to set the dataset_tech_notes field (and optionally dataset_tech_notes_title) for one file in the dataset. If more than one file in a dataset has these options set, the last file processed determines the values.
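
For example (the URL and title are hypothetical), adding these fields to a single line of the mapfile attaches the notes to the whole dataset:

cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/areacella/1/areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49240|dataset_tech_notes=http://pcmdi3.llnl.gov/thredds/fx_dataset_notes.html|dataset_tech_notes_title=Fixed-field dataset notes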

 

Querying

 

The esglist_datasets script queries dataset properties from the local database. The argument is the project name:

% esglist_datasets ipcc4
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | name                                           | project | model           | experiment | run_name | time_frequency | publish_time        | publish_status  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 34 | pcmdi.ipcc4.cnrm_cm3.20c3m.run1.3hourly        | ipcc4   | cnrm_cm3        | 20c3m      | run1     | 3hourly        | 2008-12-30 12:11:18 | CREATE_DATASET  |
| 35 | pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.3hourly      | ipcc4   | gfdl_cm2_0      | 20c3m      | run1     | 3hourly        | 2008-12-30 12:11:40 | CREATE_DATASET  |
...

To find all listable properties for datasets in a given project:

% esglist_datasets --list-properties ipcc4
['creator', 'date', 'description', 'experiment', 'format', 'id', 'model', 'name', 'project', 'publish_status', 'publish_time', 'publisher', 'run_name', 'source', 'submodel', 'time_frequency', 'title']

Use the --property option to restrict the listed values or to add another property to the listing. The wildcard character is '%'. For example, to list all models that start with "cn" and add the submodel property:

% esglist_datasets --property model=cn% --property submodel=% ipcc4
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | name                                    | project | model    | experiment | run_name | time_frequency | submodel | publish_time        | publish_status |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 34 | pcmdi.ipcc4.cnrm_cm3.20c3m.run1.3hourly | ipcc4   | cnrm_cm3 | 20c3m      | run1     | 3hourly        | atm      | 2008-12-30 12:11:18 | CREATE_DATASET |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+

 

Authentication

Executing portal functions requires authentication with the gateway MyProxy server:

% myproxy-logon -s <host> -l <username> -p <port> -o <certificate_location>
Enter MyProxy pass phrase:
A credential has been received for user <username> in <certificate_location>.

The generated credential has a default lifetime of 12 hours. The certificate location should match the hessian_service_certfile and hessian_service_keyfile options in the publisher configuration file.
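
For example, if the credential was written to /tmp/x509up_u500 (a hypothetical location), the corresponding esg.ini entries would be as follows; a proxy credential file contains both the certificate and the private key, so both options may point to the same file:

hessian_service_certfile = /tmp/x509up_u500
hessian_service_keyfile = /tmp/x509up_u500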

Initialization of models and standard names

The publication scripts use information stored in the node database about:

  • model names and descriptions
  • standard names

The esginitialize script is used to load this information into the database from text files. There are also options to create the database tables at setup time.

The model information is obtained by default from esgcet_models_table.txt, located in the same directory as the esg.ini configuration file, for example $HOME/.esgcet. When a new project is configured, all models associated with the project should be added to this file. To load new models into the database:

  • In esg.ini, in the [initialize] section, uncomment the line:

initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt

  • Run
% esginitialize -c

Similarly, to load the standard name table from cf-standard-name-table.xml, uncomment

initial_standard_name_table = %(home)s/.esgcet/cf-standard-name-table.xml

and run esginitialize.

 

