ESG publication scripts
Publishing ESG data and managing ESG node services.
An Overview of Data Publication
Publication is the process of making data visible for search and download from an ESG gateway. We assume that a person responsible for publication (the data publisher) has a collection of files that are to be made visible via the gateway portal software. Each file belongs to a collection of files called a dataset, and each dataset is part of a hierarchical collection of datasets associated with a project. Datasets and projects have unique ESG names; the exact form of a dataset name is configurable and depends on the project.
The publication utilities operate on one or more datasets at a time.
The scripts covered here are:
- esgpublish: Publish one or more datasets
- esgunpublish: Remove metadata for one or more datasets
- esglist_datasets: Query local dataset information.
- esgscan_directory: Create a mapfile by scanning one or more directories. Mapfiles may be used as input to esgpublish, esgunpublish and esgpublish_gui.
- esginitialize: Create database tables, and initialize with model and standard name information.
- esgupdate_metadata: Add checksums, technical notes, and other metadata to an existing dataset.
A First Example
Suppose we want to publish a dataset consisting of all netCDF files in directories /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1. (We assume the data publisher has already authenticated with a MyProxy server.) One way is with the esgpublish script:
% esgpublish --project ipcc4 --thredds --publish /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1
INFO 2008-12-24 14:26:07,256 Creating dataset: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
INFO 2008-12-24 14:26:07,260 Scanning /somedata/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.186101-196012.nc
INFO 2008-12-24 14:26:08,289 Scanning /somedata/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.196101-200012.nc
...
INFO 2008-12-24 14:26:36,171 Adding file info to database
INFO 2008-12-24 14:26:36,711 Aggregating variables
INFO 2008-12-24 14:26:37,197 Writing THREDDS catalog ...
INFO 2008-12-24 14:26:37,245 Writing THREDDS master catalog ...
INFO 2008-12-24 14:26:37,247 Reinitializing THREDDS server
INFO 2008-12-24 14:26:37,627 Publishing: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
INFO 2008-12-24 14:26:39,940 Result: SUCCESSFUL
At this point the files and dataset are now visible from the gateway, and can be searched or downloaded.
Node architecture

Before looking at the example in detail, we need to review the architecture of an ESG node. A node is a host on which a data archive resides, or from which an archive is published. In contrast, an ESG gateway is a site that supports portal services such as search and user authentication. End users interact with a portal to search and download data. The above diagram indicates node components in green, and data and configuration files in blue. The red components are the subset of gateway services that interact with the publisher.
The node clients and servers are:
- esgpublish: the ESG publisher application
- TDS: the THREDDS Data Server. TDS serves datasets described in a collection of XML catalogs generated by esgpublish.
- LAS: Live Access Server. LAS serves visualization and data products.
- Postgres: relational database server. A relational database is used to store metadata scanned from data files. The publication utilities can also interoperate with a MySQL server.
The related gateway components are:
- gateway server: esgpublish communicates with the gateway server to publish dataset metadata. Executing gateway server operations requires authentication.
- MyProxy: the authentication server. This is the gateway component that generates a user proxy certificate.
Publication Steps
In brief the publication steps are:
- Authenticate with a MyProxy server. This is a single sign-on that generates a limited lifetime proxy certificate. Note that this certificate is only used for gateway publication (the last step).
- Scan the data files in one or more datasets, and write the scanned metadata to a relational database. Optionally modify or update dataset metadata.
- Write the metadata to one or more TDS catalogs, and reinitialize the TDS from the catalogs.
- If LAS is implemented on the node, generate an LAS configuration file and reinitialize LAS.
- Publish the new TDS catalogs to the gateway.
Here's what happens when we run esgpublish in the first example. The script:
- takes as input a project name and one or more datasets in that project to be published.
- In the example above the project is named "ipcc4", and the dataset consists of all netCDF files contained in the directories /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1. The project name is supplied with the --project option:
% esgpublish --project ipcc4 ...
and the directory names are the trailing arguments.
- Instead of directories, a dataset can also be described with a mapfile. See the next example.
- The name of the dataset was generated:
Creating dataset: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
The form of dataset names is defined in the publisher configuration file. Although it is possible to supply the dataset name manually, it is generally simpler and more reliable to let esgpublish generate it.
- scans the dataset files for metadata, if they are online.
- Each file is opened, and the file metadata is read and stored in the relational database. It is possible to publish offline datasets for which the files cannot be opened directly, in which case the scan is skipped.
- For online datasets the files are also aggregated, typically by time. If any two files have the same variable with overlapping time ranges, a warning is printed.
- adds or modifies dataset metadata. In this example we have not modified any dataset metadata.
- creates THREDDS catalogs for each dataset, and reinitializes TDS. The option to create TDS catalogs is --thredds:
% esgpublish --thredds ...
By default esgpublish scans the data files but does not create the TDS catalogs.
- publishes the THREDDS catalogs to the gateway. The --publish option tells esgpublish to contact the gateway and publish the newly-created catalogs. This is the only step requiring a valid MyProxy certificate.
% esgpublish --publish ...
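The overlap warning mentioned in the scan step can be illustrated with a small sketch. It is not the publisher's own aggregation code: the FileSpan record, its integer time stamps, and the find_overlaps function are assumptions made for illustration. It groups files by variable and reports pairs whose time ranges overlap.

```python
from collections import defaultdict
from typing import NamedTuple


class FileSpan(NamedTuple):
    """Hypothetical record of one variable's time coverage in one file."""
    path: str
    variable: str
    start: int  # first time step covered, e.g. 186101 for January 1861
    stop: int   # last time step covered (inclusive)


def find_overlaps(spans):
    """Return (earlier, later) pairs of spans of one variable that overlap."""
    by_variable = defaultdict(list)
    for span in spans:
        by_variable[span.variable].append(span)

    overlaps = []
    for group in by_variable.values():
        group.sort(key=lambda s: (s.start, s.stop))
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                if second.start > first.stop:
                    # Sorted by start, so no later span can overlap `first`.
                    break
                overlaps.append((first, second))
    return overlaps
```

A caller could print one warning per returned pair; files holding different variables are never compared with each other.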
Example: Specifying the dataset with a mapfile
A mapfile is a text file that lists the members of a dataset and their file sizes. Each line of the mapfile has the form:
dataset_name | absolute_path | byte_length [ | property=value [ | property=value ...]]
Note: The optional property=value fields apply to Esgcet V2.0 and greater. They may take the form:
- mod_time=epochal_time where epochal_time is expressed as seconds since January 1, 1970. This field specifies the file modification time.
- checksum=checksum_value
- checksum_type=checksum_type, where checksum_type may be MD5 or SHA1
As of Esgcet V2.0 the above fields are generated by esgscan_directory.
For esgpublish rename operations (--rename-files) the field
- from_file=path
may be used to specify the file that was replaced or renamed.
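To make the line format concrete, here is a minimal sketch of a parser for a single mapfile line. MapfileEntry and parse_mapfile_line are hypothetical names, not part of Esgcet, and the sketch assumes that no field contains a literal '|' character.

```python
from typing import NamedTuple


class MapfileEntry(NamedTuple):
    """One mapfile line, split into its fields (illustrative only)."""
    dataset: str
    path: str
    size: int
    properties: dict


def parse_mapfile_line(line):
    """Split 'dataset | path | size [| key=value ...]' into a MapfileEntry."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 '|'-separated fields: {line!r}")
    dataset, path, size = fields[:3]

    properties = {}
    for item in fields[3:]:
        # Split on the first '=' only, so values such as URLs may contain '='.
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected property=value, got {item!r}")
        properties[key] = value
    return MapfileEntry(dataset, path, int(size), properties)
```

Property values are kept as strings; a caller that needs the modification time as a number can convert it with float(entry.properties["mod_time"]).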
A mapfile can specify more than one dataset. An easy way to generate a mapfile is with the esgscan_directory script. By default it prints the mapfile to standard output:
% esgscan_directory --project ipcc4 /ipcc/20c3m/atm/mo/cl*/gfdl_cm2_0/run1 > sample.txt
% cat sample.txt
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.186101-196012.nc | 1555241356 | mod_time=1205377080.522737 | checksum=521f04d8dfd2d5c0b6210484390963a1 | checksum_type=MD5
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1/cl_A1.196101-200012.nc | 622104080 | mod_time=1205377893.725071 | checksum=db70da20849a8dd18dcf5871b85222ef | checksum_type=MD5
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/clivi/gfdl_cm2_0/run1/clivi_A1.186101-200012.nc | 87140380
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/clt/gfdl_cm2_0/run1/clt_A1.186101-200012.nc | 87140364
pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly | /ipcc/20c3m/atm/mo/clwvi/gfdl_cm2_0/run1/clwvi_A1.186101-200012.nc | 87140468
The advantage of using a mapfile is that it can be edited before input to esgpublish. This is useful when some but not all files in a directory are to be published. Mapfiles are also used to publish offline datasets. Lines may be commented with a pound sign (#) in the first column.
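Editing a mapfile by hand works for a few files; for larger mapfiles the same effect can be scripted. The sketch below comments out every line whose path fails a caller-supplied test, relying only on the '|' separator and the '#' comment convention described above. The function name comment_out and the predicate argument are illustrative and not part of the publisher.

```python
def comment_out(lines, keep):
    """Prefix '#' to each mapfile line whose path field fails `keep`.

    `lines` is a list of mapfile lines without trailing newlines, and
    `keep` is a function that takes the absolute path and returns a bool.
    Blank lines and lines that are already commented are left unchanged.
    """
    result = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            result.append(line)
            continue
        path = line.split("|")[1].strip()  # second field is the file path
        result.append(line if keep(path) else "#" + line)
    return result
```

For example, comment_out(lines, lambda path: "/cl/" in path) would leave only the files of the cl variable active.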
[Debugging hint: esgscan_directory matches file paths with the directory_format option. If esgscan_directory does not return any output, make sure that the directory name matches the value in directory_format. If directory_format is an absolute pathname, the directory argument for esgscan_directory should also be absolute. Also note that directory_format is interpreted as a Python regular expression, so special regular expression characters can be used.
To see what esgscan_directory is doing (as of Esgcet V2.5), set log_level = DEBUG in the [extract] section of esg.ini.]
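The effect of an absolute pattern can be tried out directly with Python's re module. The pattern below is only a stand-in for a directory_format value: its group names and layout are assumptions, and the real syntax is whatever the publisher configuration defines. The sketch shows why a relative directory argument produces no match against an absolute pattern.

```python
import re

# Hypothetical pattern standing in for a directory_format value; the real
# syntax and field names come from the publisher configuration.
PATTERN = re.compile(
    r"/ipcc/(?P<experiment>[^/]+)/(?P<submodel>[^/]+)/(?P<frequency>[^/]+)"
    r"/(?P<variable>[^/]+)/(?P<model>[^/]+)/(?P<run>[^/]+)$"
)


def match_directory(directory):
    """Return the named fields if `directory` matches PATTERN, else None."""
    match = PATTERN.match(directory)  # match() anchors at the string start
    return match.groupdict() if match else None
```

Calling match_directory with /ipcc/20c3m/atm/mo/cl/gfdl_cm2_0/run1 returns a dictionary of fields, while the same path without the leading slash returns None.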
To publish data in a mapfile use the --map option instead of directory names:
% esgpublish --project ipcc4 --thredds --publish --map sample.txt
INFO 2008-12-24 15:24:55,071 Creating dataset: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
...
Unpublishing
Unpublishing is the reverse operation of publication. It removes dataset metadata from the relevant ESG components: the gateway, TDS, and node (local) database. In this example we use esgunpublish to remove the dataset published above:
% esgunpublish --database-delete pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
INFO 2009-01-05 14:15:36,391 Deleting: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
INFO 2009-01-05 14:15:38,712 Result: SUCCESSFUL
INFO 2009-01-05 14:15:38,721 Deleting THREDDS catalog: /usr/local/apache-tomcat-6.0.18/content/thredds/esgcet/1/pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly.xml
INFO 2009-01-05 14:15:38,723 Writing THREDDS master catalog /usr/local/apache-tomcat-6.0.18/content/thredds/esgcet/catalog.xml
INFO 2009-01-05 14:15:38,727 Reinitializing THREDDS server
INFO 2009-01-05 14:15:38,919 Deleting existing dataset: pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
Note:
- The trailing argument is the dataset name. The --map option can also be used, in which case all datasets in the mapfile are deleted.
- Unpublishing removes data in the reverse order of publishing: first gateway, then TDS, and finally the local database.
- By default the scan metadata is not removed from the database. The --database-delete option forces the removal.
- On the gateway there are two flavors of deletion: full and partial. [At current writing only full deletion is implemented.] Both remove the dataset from visibility on the portal, but partial deletion leaves metrics and versioning information intact.
- Unpublishing does not remove or alter the underlying data files.
- Only the gateway deletion step requires authentication.
Each unpublishing step can be done separately if needed. The previous example is equivalent to:
% esgunpublish --skip-thredds pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
% esgunpublish --skip-gateway pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
% esgunpublish --database-delete --skip-thredds --skip-gateway pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.monthly
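When the steps are driven from a script, it can help to build the three command lines in one place so that the order is preserved. The following sketch only assembles argument lists from the options shown above; the function name unpublish_steps and its delete_from_database parameter are assumptions for illustration.

```python
def unpublish_steps(dataset, delete_from_database=True):
    """Return one esgunpublish argument list per step, in unpublishing order."""
    steps = [
        # Step 1: gateway deletion (THREDDS is skipped).
        ["esgunpublish", "--skip-thredds", dataset],
        # Step 2: THREDDS catalog removal (gateway is skipped).
        ["esgunpublish", "--skip-gateway", dataset],
    ]
    if delete_from_database:
        # Step 3: remove the scan metadata from the local database only.
        steps.append(
            ["esgunpublish", "--database-delete",
             "--skip-thredds", "--skip-gateway", dataset]
        )
    return steps
```

Each list could then be passed to subprocess.run(step, check=True), stopping at the first step that fails.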
Publishing checksums and other metadata
In some cases it is necessary or convenient to add metadata to an existing dataset, without unpublishing and rescanning the dataset. This can be accomplished with the esgupdate_metadata script in V1.9+.
The process is as follows:
- Create a mapfile with the additional metadata
- Run esgupdate_metadata with the mapfile as the argument
For example, if mapfile foo.txt contains:
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/areacella/1/areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49240|checksum=8253d332f4d7a5402b38ae8878fb0908|checksum_type=MD5
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/orog/1/orog_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49732|checksum=dbc0879685d1dc51ee4e7a6c44a75775|checksum_type=MD5
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/sftlf/1/sftlf_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49444|checksum=a406c5f2ac076a3f1754295b7b7badff|checksum_type=MD5
Then running:
% esgupdate_metadata foo.txt
adds the checksum information to the local database. THREDDS catalog generation and publication are then run as usual:
% esgpublish --noscan --map foo.txt --project cmip5 --thredds --publish
Note that if the dataset has not changed since being published, it is not necessary to unpublish the existing dataset. The checksum information will be added to the gateway database, and can be retrieved with esgquery_gateway:
% esgquery_gateway --urls cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0
+--------------------------------------------------------------------------------+
| name | id | size | capability | url | checksum | algorithm |
+--------------------------------------------------------------------------------+
| areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc | 1 | 49240 | HTTPServer | http://pcmdi3.llnl.gov/thredds/fileServer/cmip5_data/cmip5/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/areacella/1/areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc | 8253d332f4d7a5402b38ae8878fb0908 | MD5 |
| orog_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc | 1 | 49732 | HTTPServer | http://pcmdi3.llnl.gov/thredds/fileServer/cmip5_data/cmip5/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/orog/1/orog_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc | dbc0879685d1dc51ee4e7a6c44a75775 | MD5 |
| sftlf_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc | 1 | 49444 | HTTPServer | http://pcmdi3.llnl.gov/thredds/fileServer/cmip5_data/cmip5/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/sftlf/1/sftlf_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc | a406c5f2ac076a3f1754295b7b7badff | MD5 |
+--------------------------------------------------------------------------------+
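The checksum fields in a mapfile like foo.txt can be produced by any tool that computes MD5 digests. As a minimal sketch, assuming the files are readable locally, the following uses Python's hashlib to build one mapfile line per file. The helper names md5_of_file and checksum_mapfile_line are hypothetical and not part of the publisher.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_mapfile_line(dataset, path):
    """Format 'dataset|path|size|checksum=...|checksum_type=MD5'."""
    size = os.path.getsize(path)
    return "|".join([
        dataset,
        path,
        str(size),
        f"checksum={md5_of_file(path)}",
        "checksum_type=MD5",
    ])
```

Reading in chunks keeps memory use flat for large netCDF files. Writing the returned lines to a text file gives a mapfile that can be passed to esgupdate_metadata.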
Publishing technical notes
A technical note is a file that contains additional information about a data file or dataset. When the URL of a technical note is published, it is made visible in the associated THREDDS catalog.
The process of publishing technical notes is similar to publishing checksums, as described in the previous section. The following fields may be added to a mapfile:
- tech_notes=URL # URL of technical notes to associate with a data file
- tech_notes_title=string # Optional title of file-related tech notes, if tech_notes option is set
- dataset_tech_notes=URL # URL of technical notes to associate with a dataset
- dataset_tech_notes_title=string # Optional title of dataset-related tech notes, if dataset_tech_notes option is set
For example, if file mapfile.foo.text contains:
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/areacella/1/areacella_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49240|tech_notes=http://pcmdi3.llnl.gov/thredds/areacella_notes.html|tech_notes_title=AREACELLA notes
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/orog/1/orog_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49732|tech_notes=http://pcmdi3.llnl.gov/thredds/orog_notes.html|tech_notes_title=OROG notes
cmip5.output1.IPSL.IPSL-CM5A-LR.1pctCO2.fx.atmos.fx.r0i0p0|/foo/bar/output1/IPSL/IPSL-CM5A-LR/1pctCO2/fx/atmos/fx/r0i0p0/sftlf/1/sftlf_fx_IPSL-CM5A-LR_1pctCO2_r0i0p0.nc|49444|tech_notes=http://pcmdi3.llnl.gov/thredds/sftlf_notes.html|tech_notes_title=SFTLF notes
then, assuming the dataset has not yet been published, run:
% esgpublish --map mapfile.foo.text --project myproject
to add the per-file technical notes metadata. Similarly, if the dataset has already been published, then the esgupdate_metadata script can be used to add the notes.
For dataset-level technical notes, it is only necessary to set the field dataset_tech_notes - and optionally dataset_tech_notes_title - for one file in the dataset. If more than one file in a dataset has the options set, the last file processed determines the values.
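The "last file processed wins" rule can be pictured with a short sketch. It is only a model of the behavior described above, not the publisher's code. In particular, it treats the URL and the title as independent fields, which is an assumption, and the function name resolve_dataset_notes is hypothetical.

```python
DATASET_FIELDS = ("dataset_tech_notes", "dataset_tech_notes_title")


def resolve_dataset_notes(per_file_properties):
    """Return the dataset-level note fields after processing files in order.

    `per_file_properties` is a sequence of property dictionaries, one per
    mapfile line, in processing order. A later file overrides an earlier one.
    """
    resolved = {}
    for properties in per_file_properties:
        for key in DATASET_FIELDS:
            if key in properties:
                resolved[key] = properties[key]
    return resolved
```

Setting the fields on a single line of the mapfile avoids any dependence on processing order.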
Querying
The esglist_datasets script queries dataset properties from the local database. The argument is the project name:
% esglist_datasets ipcc4
+--------------------------------------------------------------------------------+
| id | name | project | model | experiment | run_name | time_frequency | publish_time | publish_status |
+--------------------------------------------------------------------------------+
| 34 | pcmdi.ipcc4.cnrm_cm3.20c3m.run1.3hourly | ipcc4 | cnrm_cm3 | 20c3m | run1 | 3hourly | 2008-12-30 12:11:18 | CREATE_DATASET |
| 35 | pcmdi.ipcc4.gfdl_cm2_0.20c3m.run1.3hourly | ipcc4 | gfdl_cm2_0 | 20c3m | run1 | 3hourly | 2008-12-30 12:11:40 | CREATE_DATASET |
...
To find all listable properties for datasets in a given project:
% esglist_datasets --list-properties ipcc4
['creator', 'date', 'description', 'experiment', 'format', 'id', 'model', 'name', 'project', 'publish_status', 'publish_time', 'publisher', 'run_name', 'source', 'submodel', 'time_frequency', 'title']
Use the --property option to restrict the listed values or add another property to the listing. The wildcard character is '%'. For example, to list all models that start with "cn", and add the submodel property:
% esglist_datasets --property model=cn% --property submodel=% ipcc4
+--------------------------------------------------------------------------------+
| id | name | project | model | experiment | run_name | time_frequency | submodel | publish_time | publish_status |
+--------------------------------------------------------------------------------+
| 34 | pcmdi.ipcc4.cnrm_cm3.20c3m.run1.3hourly | ipcc4 | cnrm_cm3 | 20c3m | run1 | 3hourly | atm | 2008-12-30 12:11:18 | CREATE_DATASET |
+--------------------------------------------------------------------------------+
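To get a feel for how a '%' pattern selects values, the matching can be imitated in a few lines. This sketch assumes that '%' matches any run of characters, including none, in the manner of an SQL LIKE pattern. How esglist_datasets evaluates the pattern internally is not shown here, and like_to_regex is a hypothetical helper.

```python
import re


def like_to_regex(pattern):
    """Translate a '%'-wildcard pattern into a compiled regular expression.

    Literal text is escaped, and each '%' becomes '.*'. Use the result
    with fullmatch() so that the whole value has to match.
    """
    parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile(".*".join(parts))


def select(values, pattern):
    """Return the values that match the '%'-wildcard pattern."""
    regex = like_to_regex(pattern)
    return [value for value in values if regex.fullmatch(value)]
```

With the model names from the listing above, select(["cnrm_cm3", "gfdl_cm2_0"], "cn%") keeps only cnrm_cm3, while the pattern "%" keeps both.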
Authentication
Executing portal functions requires authentication with the gateway MyProxy server:
% myproxy-logon -s <host> -l <username> -p <port> -o <certificate_location>
Enter MyProxy pass phrase:
A credential has been received for user <username> in <certificate_location>.
The generated credential has a default lifetime of 12 hours. The certificate location should match the hessian_service_certfile and hessian_service_keyfile options in the publisher configuration file.
Initialization of models and standard names
The publication scripts use the following information from the node database:
- model names and descriptions
- standard names
The esginitialize script is used to load this information into the database from text files. There are also options to create the database tables at setup time.
The model information is obtained by default from esgcet_models_table.txt, in the same location as the esg.ini configuration file, for example in directory $HOME/.esgcet. When a new project is configured, all models associated with the project should be added to this file. To load new models in the database:
- In esg.ini, in the [initialize] section, uncomment the line:
initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt
- Run
% esginitialize -c
Similarly, to load the standard name table from cf-standard-name-table.xml, uncomment
initial_standard_name_table = %(home)s/.esgcet/cf-standard-name-table.xml
and run esginitialize.
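The %(home)s syntax in these options is the interpolation form understood by Python's configparser module. The sketch below shows how such values expand, using an in-memory sample rather than a real esg.ini. The [DEFAULT] home value is an assumption made so that the sample is self-contained; how the publisher itself supplies home is not covered here.

```python
import configparser

# Illustrative configuration text; not a complete esg.ini.
SAMPLE = """
[DEFAULT]
home = /home/publisher

[initialize]
initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt
initial_standard_name_table = %(home)s/.esgcet/cf-standard-name-table.xml
"""


def read_initialize_section(text):
    """Return the [initialize] options with %(name)s references expanded."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["initialize"])
```

A line that is still commented out in the configuration simply does not appear in the returned dictionary, which is a quick way to check which tables esginitialize would be pointed at.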