ESG publisher configuration
Configuring ESG publication applications
The ESG configuration file is an ASCII text file that contains fixed information for all ESG publication applications. The file is used to set up:
- relational database connections
- parameters for communication with node servers such as TDS and LAS
- project-specific information, such as names of models and experiments
Overriding configuration parameters
Configuration options are defined in a text file, the 'configuration file'. The first of these files found, in the listed order, is used:
- Value of environment variable ESGINI, if specified.
- $HOME/.esgcet/esg.ini
- <PYTHON>/lib/python2.5/site-packages/esgcet/config/etc/esg.ini, where <PYTHON> is the Python installation directory. Note: If there is no directory <PYTHON>/lib/python2.5/site-packages/esgcet, then esgcet has been installed as a .egg file. To extract the initialization file from the egg, use unzip:
% unzip <PYTHON>/lib/python2.5/site-packages/esgcet-0.1dev_r7306-py2.5.egg esgcet/config/etc/esg.ini
- esg.ini in the current working directory.
Most ESG publication tools have an option (-i) to specify the configuration file location.
The ESG publication source code contains a file, template.ini, that can be used as a starting point when creating esg.ini for the first time.
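The search order above can be sketched in Python. This is only an illustration of the documented order, not the publisher's own code: the function name find_config and its parameters are hypothetical, and the sketch assumes that a location whose file does not exist is simply skipped.

```python
import os

def find_config(site_packages_dir, environ=None, cwd=None):
    """Return the first existing candidate configuration file, or None.

    site_packages_dir stands for <PYTHON>/lib/python2.5/site-packages.
    (Hypothetical helper; the real lookup code may differ.)
    """
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd

    candidates = []
    # 1. Value of the ESGINI environment variable, if specified.
    if environ.get("ESGINI"):
        candidates.append(environ["ESGINI"])
    # 2. $HOME/.esgcet/esg.ini
    home = environ.get("HOME", os.path.expanduser("~"))
    candidates.append(os.path.join(home, ".esgcet", "esg.ini"))
    # 3. The copy installed with the esgcet package.
    candidates.append(
        os.path.join(site_packages_dir, "esgcet", "config", "etc", "esg.ini"))
    # 4. esg.ini in the current working directory.
    candidates.append(os.path.join(cwd, "esg.ini"))

    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

Passing environ and cwd explicitly keeps the function easy to test without touching the real environment.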
Configuration file format
The configuration file has a sequence of sections, each with a [section] header and a set of name = value options. Leading and trailing whitespace is stripped from values. A value can contain format strings of the form %(option)s, which are substituted with other options in the same section or a special DEFAULT section. Lines beginning with '#' or ';' are comments.
Each section corresponds to a project or application:
[DEFAULT]
...
[project:ipcc4]
...
[project:cfmip]
...
[initialize]
...
Each section is a list of options of the form
option = value
where value may be:
- a string on the same line. Leading and trailing white space is stripped:
thredds_authentication_realm = THREDDS Data Server
Multivalued strings may be separated with commas:
thredds_exclude_variables = a, a_bnds, b, b_bnds, bounds_lat, bounds_lon, height, lat_bnds, lev_bnds, lon_bnds, p0, time_bnds, lat, lon, time, lev, depth, depth_bnds, plev, geo_region
- multiple lines. Use a vertical bar (|) to separate fields on each line:
categories =
project | enum | true | true | 0
experiment | enum | true | true | 1
model | enum | true | true | 2
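The two value conventions above (comma-separated lists and multiline, bar-separated records) can be split with a few lines of Python. This is a minimal sketch; the helper names split_commas and split_records are hypothetical and are not part of esgcet.

```python
def split_commas(value):
    """Split a comma-separated option value into stripped items."""
    return [item.strip() for item in value.split(",") if item.strip()]

def split_records(value):
    """Split a multiline option value into records of '|'-separated fields."""
    records = []
    for line in value.splitlines():
        line = line.strip()
        if line:  # ignore blank lines, e.g. the one right after 'option ='
            records.append([field.strip() for field in line.split("|")])
    return records

categories = """
project | enum | true | true | 0
experiment | enum | true | true | 1
model | enum | true | true | 2
"""

for record in split_records(categories):
    print(record)
print(split_commas("a, a_bnds, b, b_bnds"))
```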
Format strings
Format strings have the form %(option)s, where option is defined elsewhere in the section. For example:
root_id = pcmdi
parent_id = %(root_id)s.ipcc4
results in parent_id=pcmdi.ipcc4.
Some format strings are predefined:
%(home)s : home directory
%(here)s : current directory
%(pythonbin)s : Python executables directory
Format strings may also be defined dynamically by the project handler - the software module that defines how project-specific metadata is discovered. For example, the IPCC4Handler is used for ipcc4 datasets, and defines the format strings %(project)s and %(model)s among others. The strings are interpreted when the dataset files are scanned. For example, if:
parent_id = %(root_id)s.%(project)s.%(model)s
then for datasets generated by model ukmo_hadcm3 the parent id will be
parent_id = pcmdi.ipcc4.ukmo_hadcm3
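The %(option)s notation is the same as Python's dictionary-based string formatting, so the substitutions above can be reproduced directly. The sketch below only illustrates the notation; how the publisher itself merges section options with handler-defined values is an assumption here.

```python
# Options defined statically in the configuration section.
options = {"root_id": "pcmdi"}

# Static substitution, as in: parent_id = %(root_id)s.ipcc4
parent_id = "%(root_id)s.ipcc4" % options
print(parent_id)  # pcmdi.ipcc4

# Values a project handler might supply at scan time (assumed merge order:
# handler values are added to the section options).
handler_values = {"project": "ipcc4", "model": "ukmo_hadcm3"}
scan_time = dict(options, **handler_values)

full_id = "%(root_id)s.%(project)s.%(model)s" % scan_time
print(full_id)  # pcmdi.ipcc4.ukmo_hadcm3
```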
DEFAULT section
Options in the default section apply to all other sections. A default option may be overridden in another section. For example, to turn on debug logging during database initialization but only print warnings otherwise:
[DEFAULT]
log_level = WARNING
...
[initialize]
log_level = DEBUG
...
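Python's standard configparser module treats a [DEFAULT] section the same way, which makes the override easy to try out. This sketch uses Python 3 and a cut-down sample file; it is not the publisher's own loading code, and the root_id option in the project section is included only to give that section some content.

```python
import configparser

SAMPLE = """
[DEFAULT]
log_level = WARNING

[initialize]
log_level = DEBUG

[project:ipcc4]
root_id = pcmdi
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# The [initialize] section overrides the default.
print(parser.get("initialize", "log_level"))     # DEBUG
# Sections that do not set log_level fall back to [DEFAULT].
print(parser.get("project:ipcc4", "log_level"))  # WARNING
```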
| Option | Description | Valid values | Example |
|---|---|---|---|
| checksum | If set, calculate and publish checksums of all files. The form is: checksum_client, checksum_type where checksum_client is the absolute or relative path of the checksum client executable, and checksum_type is the type of checksum generated, e.g., MD5. NB: Enabling this option may significantly increase the time required for file scanning. New in esgcet V2.0. | checksum_type: MD5 (recommended), SHA1 | checksum = md5sum, MD5 |
| dburl | Defines the connection to the database. The form is: db_driver://user:password@host:port/db_name where db_driver is mysql for MySQL or postgres for PostgreSQL, and db_name is the database name. Important: the database type (PostgreSQL / MySQL) is determined by the db_driver definition in the *default* configuration file! | | postgres://myuser:secret@myhost.llnl.gov/esgcet |
| gateway_options | Comma-separated list of gateway identifiers. New in esgcet V2.0. | ESG-PCMDI, ESG-NCAR, ... | gateway_options = ESG-PCMDI, ESG-NCAR |
| hessian_service_certfile | X.509 user certificate. Location of the myproxy-logon output. | | %(home)s/.globus/proxycert |
| hessian_service_debug | Print debugging information when the gateway is contacted. | true, false | |
| hessian_service_keyfile | X.509 user key file (may be the same as the cert file). | | %(home)s/.globus/proxycert |
| hessian_service_polling_delay | Seconds to wait after initial publication before checking status. | | |
| hessian_service_polling_iterations | Number of times to poll for publishing status. | | |
| hessian_service_port | SSL gateway port number. | | |
| hessian_service_url | URL of the gateway publishing service endpoint. | | https://host/remote/publishingService |
| log_format | Log entry format. See Python logging format strings. | | %(levelname)-10s %(asctime)s %(message)s |
| log_level | Level at or above which messages are logged. | DEBUG, INFO, WARNING, ERROR, CRITICAL | |
| offline_lister | Program to list offline datasets. The program takes the top-level directory as an argument, and returns a listing of "path size" on each line. The option is a multiline list of the form: service_name \| lister_section where service_name references one of the thredds_offline_services entries and lister_section is the name of the configuration section where the lister client is defined. There may be more than one lister_section for a given executable. This allows srmls to be configured for different storage sites. | | offline_lister =<br>HRMatORNL \| srmls_at_ornl<br>HRMatNERSC \| srmls_at_nersc<br>MSSatNCAR \| msls |
| project_options | Multiline list of projects. There must be a line for each [project:project_name] section. Each line has the form: project_name \| project_description \| search_order. search_order determines the order in which the associated handlers are searched when no project has been specified, with the lowest number searched first. The first handler that successfully opens the data file is used. | | project_options =<br>ipcc4 \| IPCC Fourth Assessment Report \| 1<br>dycore \| CAM Finite Volume Dycore \| 3 |
| root_id | Root dataset ID field. | | root_id = pcmdi |
| thredds_aggregation_services | Multiline description of the THREDDS services that support subsetting of data aggregations. The service_base forms the leading part of the THREDDS URL. Each line has the form: THREDDS serviceType \| service_base \| service_name [ \| compound_service_name]. This option is used to generate a <service> element in THREDDS configuration catalogs. If the optional compound_service_name is specified, then a compound service element is created, and contains all services with that compound name. See the THREDDS tutorial for details. | service_type and service_base values are defined by THREDDS. service_name can be anything; it is used in the TDS configuration catalogs to reference service elements. | thredds_aggregation_services = OpenDAP \| /thredds/dodsC/ \| gridded |
| thredds_authentication_realm | Name of the TDS basic authentication realm. Used for TDS reinitialization. | | thredds_authentication_realm = THREDDS Data Server |
| thredds_catalog_basename | See thredds_root. | | |
| thredds_dataset_roots | Multiline definition of the THREDDS datasetRoot elements. Each line has the form: root_path \| location where root_path is a string and location is the absolute directory name to associate with root_path. Any files published with a path having root_path as the leading string will be mapped to the corresponding location. This is a way to factor out and hide root directory names. This option only applies to THREDDS-related services, i.e., those with base=/thredds/xxx. It does not apply to SRM and gsiftp services. More importantly, the dataset roots also configure the basic THREDDS access control. The rules are:<br>- A file not contained in any dataset root or subdirectory is not accessible (cannot be downloaded).<br>- All published files under a dataset root or subdirectory are accessible and visible through THREDDS and the gateway portal.<br>- All unpublished files contained in a dataset root or its subdirectories are potentially accessible, but are not visible in THREDDS or the gateway portal.<br>- The dataset root should not have any symbolic links to directories outside the dataset root. | | thredds_dataset_roots =<br>model \| /data/mymodel<br>obs \| /data/myobs |
| thredds_error_pattern | Pattern that delimits the TDS error log for the most recent reinitialization. | | thredds_error_pattern = Catalog reinit |
| thredds_fatal_error_pattern | TDS error log fatal error pattern. Searched for in the TDS error log as defined by thredds_reinit_error_url. | | |
| thredds_file_services | Same as thredds_aggregation_services, for file services. | | thredds_file_services = HTTPServer \| /thredds/fileServer/ \| HTTPServer |
| thredds_master_catalog_name | See thredds_root. | | |
| thredds_max_catalogs_per_directory | See thredds_root. | | |
| thredds_offline_services | Same as thredds_aggregation_services, for offline (tertiary storage) services. | | thredds_offline_services = SRM \| srm://host.sample.gov:6288/srm/v2/server?SFN=/archive.sample.gov \| HRMatPCMDI |
| thredds_password | Password for TDS that provides access to the TDS reinitialization URL as defined in thredds_reinit_url. See thredds_username. Typically defined in the Tomcat tomcat-users.xml configuration file. See the Tomcat security document. | | |
| thredds_reinit_error_url | URL of the TDS error log. | | thredds_reinit_error_url = https://localhost:port/thredds/content/logs/catalogErrors.log |
| thredds_reinit_success_pattern | Pattern indicating success of TDS reinitialization. Searched for in the HTML response from reading the TDS reinitialization URL as defined by thredds_reinit_url. Note that this response is different from the error catalog. | | thredds_reinit_success_pattern = reinit ok |
| thredds_reinit_url | URL to reinitialize TDS. | | thredds_reinit_url = https://localhost:port/thredds/debug?catalogs/reinit |
| thredds_restrict_access | Value of the dataset 'restrictAccess' attribute. | | thredds_restrict_access = esg-user |
| thredds_root | These options define the location of THREDDS configuration catalogs as generated by the publisher. Each dataset catalog will be written to a file named <thredds_root>/nnn/<thredds_catalog_basename>, where 'nnn' is an integer. The master catalog is named <thredds_root>/catalog.xml.<br>- thredds_root: TDS top-level directory. An absolute path, not containing any substitution patterns.<br>- thredds_url: URL of the top-level THREDDS directory.<br>- thredds_catalog_basename: basename of the catalog. Should contain the pattern %(dataset_id)s, and have extension .xml.<br>- thredds_max_catalogs_per_directory: maximum number of catalogs in a subdirectory.<br>- thredds_master_catalog_name: description of the TDS master catalog. | | thredds_root = /thredds/root<br>thredds_url = http://hostname/thredds/esgcet<br>thredds_catalog_basename = %(dataset_id)s.xml<br>thredds_max_catalogs_per_directory = 500<br>thredds_master_catalog_name = PCMDI Earth System Grid catalog |
| thredds_service_applications | (Optional) Specify application classes that are associated with THREDDS services. Don't use this option unless you need to override the publisher defaults. Each line has the form: service_name \| application_class. New in esgcet V2.0. | Web Browser, Web Script, DataMover-Lite | thredds_service_applications =<br>GRIDFTPatPCMDI \| DataMover-Lite<br>HTTPServer \| Web Browser<br>HTTPServer \| Web Script<br>gridded \| Web Browser |
| thredds_service_auth_required | (Optional) Specify whether each THREDDS service needs authentication with the gateway. Don't use this option unless you need to override the publisher defaults. Each line has the form: service_name \| true_or_false. New in esgcet V2.0. | true, false | thredds_service_auth_required =<br>GRIDFTPatPCMDI \| true<br>HTTPServer \| true<br>gridded \| false |
| thredds_service_descriptions | (Optional) Textual descriptions of THREDDS services. Each line has the form: service_name \| service_description. New in esgcet V2.0. | | thredds_service_descriptions =<br>GRIDFTPatPCMDI \| PCMDI GridFTP<br>HTTPServer \| PCMDI TDS<br>gridded \| PCMDI OPeNDAP |
| thredds_url | URL of the top-level THREDDS directory. Important: the hostname should reference the data node on which the THREDDS server is running, not the gateway! | | thredds_url = http://hostname/thredds/esgcet |
| thredds_username | Username for TDS that provides access to the TDS reinitialization URL as defined in thredds_reinit_url. See thredds_password. Typically defined in the Tomcat tomcat-users.xml configuration file. See the Tomcat security document. | | |
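The dburl form listed in the table, db_driver://user:password@host:port/db_name, follows ordinary URL syntax, so its parts can be pulled out with Python's urllib.parse. This is a sketch for checking a value by hand; parse_dburl is a hypothetical helper, and the publisher may hand the URL to its database layer unchanged.

```python
from urllib.parse import urlsplit

def parse_dburl(dburl):
    """Split a dburl value into its documented components."""
    parts = urlsplit(dburl)
    return {
        "db_driver": parts.scheme,        # mysql or postgres
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,               # None when no port is given
        "db_name": parts.path.lstrip("/"),
    }

info = parse_dburl("postgres://myuser:secret@myhost.llnl.gov/esgcet")
for key, value in info.items():
    print(key, "=", value)
```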
Project sections
Project-specific section headers have the form:
[project:project_name]
For example, the ipcc4 section header is:
[project:ipcc4]
Each project must also be listed in the default project_options:
project_options =
ipcc4 | IPCC Fourth Assessment Report | 1
...
Each project has a corresponding handler, a Python class that defines the logic of dealing with files from the project.
Categories
Each dataset in a project has a set of associated categories: string-valued attributes. (In the THREDDS context they are called properties.) Some important points about categories:
- Categories are the potential fields used to search for datasets in the gateway.
- They are the fields displayed in the dataset window of esgpublish_gui. The type of widget associated with the category depends on the category_type:
- Enumerated categories are displayed with a pulldown menu. If the field name is xyz, there should also be an option xyz_options that defines the enumerated values as a comma-separated list. For example, the 'publisher' options are defined in 'publisher_options'.
- 'string' categories are displayed in a single-line text box
- 'text' categories are displayed in a multi-line text box.
- 'fixed' categories are not modifiable.
- Categories may be mandatory or optional for a given dataset. Attempting to publish a dataset before all mandatory fields are set raises an exception. Typically the categories used to form the dataset_id should be defined as mandatory.
- A category may be a THREDDS property, in which case the data is stored as a <property> element in the THREDDS catalog.
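As an illustration of the mandatory-category rule, the sketch below parses category definitions written in the form name | category_type | is_mandatory | is_thredds_property | display_order and reports which mandatory categories still lack a value. The function names are hypothetical; the publisher's own check raises an exception instead of returning a list.

```python
def parse_categories(value):
    """Parse category definition lines into dictionaries, sorted by display order."""
    categories = []
    for line in value.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        name, category_type, mandatory, thredds_property, order = fields
        categories.append({
            "name": name,
            "type": category_type,
            "mandatory": mandatory.lower() == "true",
            "thredds_property": thredds_property.lower() == "true",
            "order": int(order),
        })
    return sorted(categories, key=lambda c: c["order"])

def missing_mandatory(categories, values):
    """Return names of mandatory categories that have no value yet."""
    return [c["name"] for c in categories
            if c["mandatory"] and not values.get(c["name"])]

CATEGORIES = """
project | enum | true | true | 0
experiment | enum | true | true | 1
model | enum | true | true | 2
title | string | false | true | 6
"""

cats = parse_categories(CATEGORIES)
print(missing_mandatory(cats, {"project": "ipcc4", "experiment": "20c3m"}))
```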
When datasets are published the category values are filled in by the appropriate project handler in one of several ways:
- From command line arguments in the esgpublish script, or user-input values in the esgpublish_gui client.
- By matching directory names to a template. See directory_format.
- By reading metadata from the 'first file' discovered in the dataset.
- From default values, if the category value cannot be determined otherwise. See category_defaults.
Some categories are predefined:
- project: defined by project_options, mandatory
- experiment: defined by experiment_options, mandatory
- model: defined by initial_models_table in the [initialize] section, mandatory
- run_name: specific model run that generated the dataset
- title: optional string
- creator: optional multiline enumeration, defined by creator_options
- publisher: optional multiline enumeration, defined by publisher_options
- date: optional string publication date in the form yyyy-mm-dd hh:mi:ss
- format: fixed string, file format (e.g., netCDF, CF-1.0)
- source: optional text string
- description: optional text string
Maps
When a set of data files is scanned, the associated dataset ID is generated dynamically by matching the directory_format template. This greatly simplifies publication of large numbers of files, and helps ensure that datasets are identified consistently when published. For example, if:
root_id = pcmdi
directory_format = /ipcc/%(experiment)s/%(submodel)s/%(time_frequency_short)s/%(variable)s/%(model)s/%(run_name)s
dataset_id = %(root_id)s.%(project)s.%(model)s.%(experiment)s.%(run_name)s.%(time_frequency)s
and we are scanning directory /ipcc/20c3m/atm/mo/hfls/cnrm_cm3/run1
then the directory name and directory_format are matched, resulting in:
experiment=20c3m
submodel=atm
time_frequency_short=mo
variable=hfls
model=cnrm_cm3
run_name=run1
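One way to picture this matching is to turn each %(name)s placeholder into a named regular-expression group. The sketch below does that under the assumption that every placeholder matches exactly one path component; the project handlers may use different rules, and the function names are hypothetical.

```python
import re

def template_to_regex(template):
    """Convert a directory_format template into a compiled regular expression."""
    pattern = ""
    pos = 0
    for match in re.finditer(r"%\((\w+)\)s", template):
        # Literal text between placeholders is matched verbatim.
        pattern += re.escape(template[pos:match.start()])
        # Each placeholder captures one path component (assumption).
        pattern += "(?P<%s>[^/]+)" % match.group(1)
        pos = match.end()
    pattern += re.escape(template[pos:])
    return re.compile("^" + pattern + "$")

def match_directory(template, directory):
    """Return the placeholder values for directory, or None if it does not match."""
    match = template_to_regex(template).match(directory)
    return match.groupdict() if match else None

directory_format = ("/ipcc/%(experiment)s/%(submodel)s/%(time_frequency_short)s/"
                    "%(variable)s/%(model)s/%(run_name)s")
fields = match_directory(directory_format, "/ipcc/20c3m/atm/mo/hfls/cnrm_cm3/run1")
for name, value in fields.items():
    print("%s=%s" % (name, value))
```

A template that repeats the same placeholder would need extra handling, since a regular expression cannot define the same group name twice.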
Note that the dataset_id has the format string %(time_frequency)s, which is not part of the directory_format template. What determines the time_frequency? The answer is a map: a table associating a set of string-valued independent variables with a dependent variable:
time_frequency_map = map(time_frequency_short : time_frequency)
3h | 3hourly
da | daily
fixed | monthly
mo | monthly
yr | yearly
Given this mapping the generated dataset_id for this directory will be:
pcmdi.ipcc4.cnrm_cm3.20c3m.run1.monthly
since time_frequency_short=mo maps to time_frequency=monthly.
To define a map:
- Add the map name to the maps option in the project section.
- Add an option of the form:
map_name = map(independent_variable_1[, independent_variable_2[, ...]] : dependent_variable)
value_1 [ | value_2 [...]] | dependent_value
value_1 [ | value_2 [...]] | dependent_value
Each line has the same number of values as the variables listed in the map() definition.
Maps are a convenient way to:
- add flexibility to the generation of option values
- make handler-specific associations explicit and configurable.
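To make the map syntax concrete, the following sketch parses a map option value into a lookup table keyed by the independent values. It follows the documented layout only; parse_map is a hypothetical name and the publisher's own parser may be stricter or more lenient.

```python
import re

def parse_map(value):
    """Parse a map option value into (independent names, dependent name, table)."""
    lines = [line.strip() for line in value.splitlines() if line.strip()]
    header = re.match(r"map\s*\(\s*(.+?)\s*:\s*(.+?)\s*\)\s*$", lines[0])
    if header is None:
        raise ValueError("not a map definition: %r" % lines[0])
    independent = [name.strip() for name in header.group(1).split(",")]
    dependent = header.group(2)
    table = {}
    for line in lines[1:]:
        fields = [field.strip() for field in line.split("|")]
        # Each line carries one value per independent variable plus the result.
        if len(fields) != len(independent) + 1:
            raise ValueError("wrong number of values: %r" % line)
        table[tuple(fields[:-1])] = fields[-1]
    return independent, dependent, table

TIME_FREQUENCY_MAP = """
map(time_frequency_short : time_frequency)
3h | 3hourly
da | daily
fixed | monthly
mo | monthly
yr | yearly
"""

independent, dependent, table = parse_map(TIME_FREQUENCY_MAP)
print(dependent, "=", table[("mo",)])  # time_frequency = monthly
```

Keys are tuples so that maps with several independent variables work the same way as maps with one.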
The project options are:
| Option | Description | Valid values | Example |
|---|---|---|---|
| categories | Multiline option to define the categories (metadata fields) for the project. Each line has the form: name \| category_type \| is_mandatory \| is_thredds_property \| display_order where:<br>- name: category name<br>- category_type: one of enum (enumerated value; there must be a corresponding <name>_options option), text (multi-line text string), string (single-line text string), fixed (a fixed value that can be displayed but not modified)<br>- is_mandatory: the category must be defined<br>- is_thredds_property: the category will be represented in the THREDDS output as a <property> element<br>- display_order: ordering in the dataset display, with the smallest value first | | categories =<br>project \| enum \| true \| true \| 0<br>experiment \| enum \| true \| true \| 1<br>model \| enum \| true \| true \| 2<br>time_frequency \| enum \| true \| true \| 3<br>submodel \| enum \| false \| false \| 4<br>run_name \| string \| true \| true \| 5<br>title \| string \| false \| true \| 6<br>creator \| enum \| false \| false \| 7<br>publisher \| enum \| false \| false \| 8<br>date \| string \| false \| true \| 9<br>format \| fixed \| false \| true \| 11<br>source \| text \| false \| false \| 13<br>description \| text \| false \| false \| 99 |
| category_defaults | Multiline option to define default values for categories. If a category is not specified, it is assumed not to have a default value. Defaults are set only when the value cannot otherwise be determined (e.g., from the command line, directory name, or file metadata). Each line has the form: category_name \| value. Values may contain format strings. | | category_defaults =<br>model \| ncar_ccsm3_0<br>experiment \| %(experiment_major)s.%(experiment_minor)s |
| creator_options | Optional multiline definition of the allowed dataset creators. Each line has the form: creator_name \| creator_email \| creator_url | | creator_options =<br>Contact_1 \| Contact_1@samp.org \| http://sample.samp.org<br>Contact_2 \| Contact_2@foo.net \| |
| dataset_id | Template of generated dataset IDs. More than one dataset_id may be specified for a given project. In this case, the values should be separated by a vertical bar (\|), and the order is significant: when generating a dataset ID, the first template for which all format string values are known is used. This means the most specific templates should be listed first. For example, if dataset_id = %(root_id)s.%(project)s.%(model)s.%(hemisphere)s \| %(root_id)s.%(project)s.%(model)s then the first template will be used if values for root_id, project, model, and hemisphere are known (possibly from matching the directory name). However, if hemisphere is unknown but the other format strings are known, the second template will be used. | | dataset_id = %(root_id)s.%(project)s.%(model)s.%(experiment)s.%(run_name)s.%(time_frequency)s |
| dataset_name_format | Format of THREDDS dataset descriptions. These are the descriptions that appear in the portal. May contain the special format strings %(project_description)s, %(model_description)s, and %(experiment_description)s. | | dataset_name_format = project=%(project_description)s, model=%(model_description)s, experiment=%(experiment_description)s, run=%(run_name)s, time_frequency=%(time_frequency)s |
| directory_format | Format of the data archive directories. May contain format strings, in which case the directory will be matched against the format to determine the values of the format string variables. This option may be multivalued, in which case the values should be separated by vertical bars (\|). Order is significant: the first template that matches a directory is used. | | directory_format = /ipcc/%(experiment)s/%(submodel)s/%(time_frequency_short)s/%(variable)s/%(model)s/%(run_name)s |
| experiment_options | Mandatory multiline option to define the valid experiments. Each line has the form: project \| experiment_name \| experiment_description | | experiment_options =<br>ipcc4 \| 20c3m \| climate of the 20th century<br>ipcc4 \| 1pctto2x \| 1 percent/year CO2 increase experiment (to doubling)<br>ipcc4 \| 1pctto4x \| 1 percent/year CO2 increase experiment (to quadrupling)<br>ipcc4 \| picntrl \| pre-industrial control<br>ipcc4 \| pdcntrl \| present-day control<br>ipcc4 \| 2xco2 \| doubled CO2 equilibrium<br>ipcc4 \| commit \| committed climate change<br>ipcc4 \| sresa1b \| 720 ppm stabilization<br>ipcc4 \| sresa2 \| SRES A2<br>ipcc4 \| sresb1 \| 550 ppm stabilization<br>ipcc4 \| amip \| AMIP<br>ipcc4 \| slabcntl \| slab ocean control |
| handler | Class name of the project handler, of the form: package:class | esgcet.config.ipcc4_handler:IPCC4Handler<br>esgcet.config.netcdf_handler:NetcdfHandler<br>New handlers may also be defined by extending the class esgcet.config.project:ProjectHandler | handler = esgcet.config.ipcc4_handler:IPCC4Handler |
| maps | Comma-separated list of map names. | | maps = time_frequency_map, cmor_table_map, submodel_combined_map |
| parent_id | Format of the parent ID of the dataset. See dataset_id. | | parent_id = %(root_id)s.%(project)s.%(model)s |
| per_time_files_dataset_name | Format of per-time files dataset names. The format strings %(dataset_name)s, %(project_description)s, %(model_description)s, and %(experiment_description)s are available here. | | %(dataset_name)s, All files |
| publisher_options | Optional multiline definition of the allowed dataset publishers. Same format as creator_options. | | publisher_options =<br>Publisher_1 \| Publisher_1@samp.org \| http://sample.samp.org<br>Publisher_2 \| Publisher_2@foo.net \| http://sample.foo.net |
| thredds_aggregations_root_location | Same as thredds_files_root_location, for data aggregations. | | |
| thredds_aggregations_root_path | Same as thredds_files_root_path, for data aggregations. | | |
| thredds_exclude_variables | Comma-separated list of variables to exclude from THREDDS catalogs. They are still added to the database. Typically you want to exclude axes and boundary variables. | | thredds_exclude_variables = a, a_bnds, b, b_bnds, bounds_lat, bounds_lon, height, lat_bnds, lev_bnds, lon_bnds, p0, time_bnds, lat, lon, time, lev, depth, depth_bnds, plev, geo_region |
| thredds_files_root_location | See thredds_files_root_path. | | |
| thredds_files_root_path | Definition of the mapping between THREDDS file URLs (external addresses of data files) and data archive pathnames. The options are:<br>- service_base (defined in the [DEFAULT] section)<br>- thredds_files_root_path: THREDDS URL root path<br>- thredds_files_root_location: corresponding data archive root location<br>THREDDS will map file URLs of the form http://hostname/<service_base><thredds_files_root_path><filepath> to the file location <thredds_files_root_location><filepath>. | | If:<br>thredds_files_root_path = ipcc4/files/<br>thredds_files_root_location = /ipcc<br>then the THREDDS URL http://hostname:8080/thredds/fileServer/ipcc4/files/20c3m/somepath/ps_A1.nc maps to /ipcc/20c3m/somepath/ps_A1.nc |
| variable_aggregation_dataset_name | Format of per-variable aggregation dataset names. The format strings %(variable)s, %(variable_long_name)s, %(variable_standard_name)s, %(project_description)s, %(model_description)s, and %(experiment_description)s are available here. | | %(variable_dataset_name)s, Aggregation |
| variable_dataset_name | Format of per-variable dataset names. The format strings %(variable)s, %(variable_long_name)s, %(variable_standard_name)s, %(project_description)s, %(model_description)s, and %(experiment_description)s are available here. | | %(dataset_name_format)s, %(variable_standard_name)s |
| variable_files_dataset_name | Format of per-variable files dataset names. The format strings %(variable)s, %(variable_long_name)s, %(variable_standard_name)s, %(project_description)s, %(model_description)s, and %(experiment_description)s are available here. | | %(variable_dataset_name)s, All files |
| variable_locate | Single-line option of the form: variable_name,file_prefix \| variable_name,file_prefix \| ... Only scan the variable(s) if they are contained in a file with the corresponding prefix. If a variable is time-dependent, and flagged as a 'duplicate variable', adding it to this list will remove the error. | | To read variable ps from files with basename ps_xxx: variable_locate = ps,ps_ |
| variable_per_file | Boolean option, true if files follow the IPCC standard of one variable per file. If true, the THREDDS metadata is organized as per-variable datasets. Otherwise, the datasets are assumed to be per-time. | true, false | |
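The dataset_id rule in the table above (use the first template for which every format string value is known) can be sketched as follows. The function name choose_dataset_id is hypothetical, and the sketch assumes templates are separated by vertical bars exactly as documented.

```python
import re

def choose_dataset_id(option_value, known):
    """Return the ID built from the first template whose fields are all known."""
    for template in option_value.split("|"):
        template = template.strip()
        needed = re.findall(r"%\((\w+)\)s", template)
        if all(name in known for name in needed):
            return template % known
    return None  # no template could be filled in

DATASET_ID = ("%(root_id)s.%(project)s.%(model)s.%(hemisphere)s | "
              "%(root_id)s.%(project)s.%(model)s")

known = {"root_id": "pcmdi", "project": "ipcc4", "model": "ukmo_hadcm3"}
print(choose_dataset_id(DATASET_ID, known))

known_with_hemisphere = dict(known, hemisphere="nh")
print(choose_dataset_id(DATASET_ID, known_with_hemisphere))
```

Listing the most specific template first matters because a shorter template would otherwise match whenever the longer one does. The value nh for hemisphere is made up for the example.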
Application sections
Some sections apply to specific applications:
- initialize: esginitialize (create and initialize the node database)
| Option | Description | Valid values | Example |
|---|---|---|---|
| initial_models_table | Path of the model definition table. This table defines model-specific metadata for all projects. Each line of the file has the form: project \| model_id \| URL \| description. The default table is named esgcet_models_table.txt, and is located in the same directory as the configuration file. | | initial_models_table = %(here)s/data/esgcet_models_table.txt |
| initial_standard_name_table | Path of the standard name table. This table defines CF standard name metadata, used for optional name validation at scan time. The file is in XML format as defined by the CF standard name table. The default table is named cf-standard-name-table.xml, and is located in the same directory as the configuration file. | | initial_standard_name_table = %(here)s/data/cf-standard-name-table.xml |
- hsils: HSI lister for listing HPSS-resident files. Options in this section are enabled if offline_lister=hsi in the [DEFAULT] section.
| Option | Description | Valid values | Example |
|---|---|---|---|
| hsi | Absolute path of the hsi executable. | | hsi = /usr/local/bin/hsi |
| offline_lister_executable | Absolute path of the hsils.py script. | | offline_lister_executable = %(home)s/work/Esgcet/esgcet/scripts/hsils.py |
- srmls: SRM lister for listing tertiary storage files. Options in this section are enabled if offline_lister=srmls in the [DEFAULT] section.
| Option | Description | Valid values | Example |
|---|---|---|---|
| offline_lister_executable | Absolute path of the srmls.py script. | | offline_lister_executable = %(home)s/work/Esgcet/esgcet/scripts/srmls.py |
| srm_archive | SRM archive. | | srm_archive = /garchive.nersc.gov |
| srm_server | SRM server URL. | | srm_server = srm://somehost.llnl.gov:6288/srm/v2/server |
| srmls | Absolute path of srm-ls. | | srmls = /usr/local/esg/bin/srm-ls |