Building the catalog

The access-nri-intake package includes a command-line script called catalog-build. It builds catalogs with the tools described in the previous sections, driven by Configuration files that specify the paths to the sources and which Builders and Translators to use. It can be used as follows:

$ catalog-build --help
usage: catalog-build [-h] [--build_base_path BUILD_BASE_PATH] [--catalog_base_path CATALOG_BASE_PATH]
   [--catalog_file CATALOG_FILE] [--version VERSION] [--no_update] config_yaml [config_yaml ...]

Build an intake-dataframe-catalog from YAML configuration file(s).

positional arguments:
config_yaml           Configuration YAML file(s) specifying the Intake source(s) to add.

options:
   -h, --help            show this help message and exit
   --build_base_path BUILD_BASE_PATH
                           Directory in which to build the catalog and source(s). A directory
                           with name equal to the version (see the `--version` argument) of
                           the catalog being built will be created here. The catalog file
                           (see the `--catalog_file` argument) will be written into this version
                           directory, and any new intake source(s) will be written into a
                           'source' directory within the version directory.
                           Defaults to the current work directory.
   --catalog_base_path CATALOG_BASE_PATH
                           Directory in which to place the catalog.yaml file. This file is the
                           descriptor of the catalog, and provides references to the data locations
                           where the catalog data itself is stored (build_base_path).
                           Defaults to the current work directory.
   --catalog_file CATALOG_FILE
                           The name of the intake-dataframe-catalog. Defaults to 'metacatalog.csv'
   --version VERSION     The version of the catalog to build/add to. Defaults to the current date.
   --no_update           Set this if you don't want to update the access_nri_intake.data (e.g. if running a test)

The ACCESS-NRI catalog is built using this script by submitting the build_all.sh shell script in the bin/ directory of ACCESS-NRI/access-nri-intake-catalog. See the section on Releases for more details.
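
As a rough illustration of how the options above combine, the sketch below assembles a catalog-build command line from Python. It uses only the flags listed in the help text; the helper name catalog_build_command, the configuration file names, the build directory, and the exact form of the version string are hypothetical placeholders and are not part of access-nri-intake.

```python
import shlex


def catalog_build_command(config_yamls, build_base_path=None,
                          catalog_base_path=None, catalog_file=None,
                          version=None, no_update=False):
    """Assemble an argument list for the catalog-build script.

    Only flags shown in the --help output are used. Options left as None
    are omitted so that the script falls back to its own defaults.
    """
    if not config_yamls:
        raise ValueError("catalog-build needs at least one config_yaml")
    cmd = ["catalog-build"]
    for flag, value in (
        ("--build_base_path", build_base_path),
        ("--catalog_base_path", catalog_base_path),
        ("--catalog_file", catalog_file),
        ("--version", version),
    ):
        if value is not None:
            cmd += [flag, str(value)]
    if no_update:
        cmd.append("--no_update")
    # Positional configuration files go last.
    return cmd + [str(path) for path in config_yamls]


# Hypothetical paths and version string, for illustration only.
cmd = catalog_build_command(
    ["config/cm2.yaml", "config/om2.yaml"],
    build_base_path="/tmp/catalog_build",
    version="v2024-11-29",
    no_update=True,
)
print(shlex.join(cmd))
```

The resulting list could then be handed to subprocess.run(cmd, check=True) on a system where access-nri-intake is installed.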

Configuration files

The catalog-build script reads configuration files like the ones found in ACCESS-NRI/access-nri-intake-catalog (these are the configuration files used to build the ACCESS-NRI catalog). A configuration file should name the Builder and Translator to use and list the sources to process. At a minimum, each source should specify the path(s) to pass to the Builder and the path to the metadata.yaml file for that source. Additional kwargs to pass to the Builder can also be specified. As an example, a configuration file might look something like:

builder: AccessCm2Builder

translator: DefaultTranslator

sources:

  - path:
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944a
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944b
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944c
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944d
    metadata_yaml: /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944/metadata.yaml
    ensemble: true

In most cases, adding a new Intake-ESM datastore to the ACCESS-NRI catalog should be as simple as adding a new entry to the configuration files and rebuilding the catalog.
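
To make the minimum requirements concrete, here is a small sketch that checks a configuration after it has been parsed from YAML into a Python mapping. The function check_config is hypothetical and is not how catalog-build validates its input; treating every source key other than path and metadata_yaml as a Builder kwarg is likewise an assumption based on the description above.

```python
def check_config(config):
    """Return a list of problems found in a parsed configuration mapping."""
    problems = []
    for key in ("builder", "translator", "sources"):
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    for i, source in enumerate(config.get("sources", [])):
        for key in ("path", "metadata_yaml"):
            if key not in source:
                problems.append(f"sources[{i}] is missing '{key}'")
    return problems


# The example configuration above as a parsed mapping (path list abbreviated).
config = {
    "builder": "AccessCm2Builder",
    "translator": "DefaultTranslator",
    "sources": [
        {
            "path": [
                "/g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944",
                "/g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944a",
            ],
            "metadata_yaml": "/g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944/metadata.yaml",
            "ensemble": True,
        }
    ],
}

print(check_config(config))

# Assumption: any remaining keys are the extra kwargs passed to the Builder.
source = config["sources"][0]
builder_kwargs = {
    key: value for key, value in source.items()
    if key not in ("path", "metadata_yaml")
}
print(builder_kwargs)
```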

metadata.yaml files

Each source in the catalog must have an associated metadata.yaml file that includes key high-level metadata about the data product. This is to ensure that there is core metadata associated with all data products in the catalog. Additionally, this core metadata is added to the corresponding Intake-ESM datastore’s metadata attribute, meaning it is available to Translators and to catalog users wanting to know more about a particular product. Ideally this file will live in the base output directory of your model run so that it’s easy for others to find, even if they aren’t using the catalog (but it doesn’t have to).

The contents of each metadata.yaml file are validated against access_nri_intake.catalog.EXP_JSONSCHEMA (see Adding sources) when catalog-build is called, to ensure that all required metadata is available before the catalog is built. The metadata.yaml file should include the following:

schema_version: 1-0-3  # metadata-template automatically gives the correct value
name: <REQUIRED The name of the experiment (string)>
experiment_uuid: <REQUIRED *Unique* uuid for the experiment (string)>
description: <REQUIRED Short description of the experiment (string, < 150 char)>
long_description: <REQUIRED Long description of the experiment (string)>
model:
- <The name(s) of the model(s) used in the experiment (string)>
realm:
- <The realm(s) included in the experiment (string)>
frequency:
- <The frequency(/ies) included in the experiment (string)>
variable:
- <The variable(s) included in the experiment (string)>
nominal_resolution:
- <The nominal resolution(s) of model(s) used in the experiment (string)>
version: <The version of the experiment (number, string)>
contact: <Contact name for the experiment (string)>
email: <Email address of the contact for the experiment (string)>
created: <Initial creation date of experiment (string)>
reference: <Citation or reference information (string)>
license: <License of the experiment (string)>
url: <Relevant url, e.g. github repo for experiment configuration (string)>
parent_experiment: <experiment_uuid for parent experiment if appropriate (string)>
related_experiments:
- <experiment_uuids for any related experiment(s) (string)>
notes: <Additional notes (string)>
keywords:
- <Keywords to associate with the experiment (string)>

Warning

Your experiment UUID must be unique to the experiment. Even if you’re adding multiple related experiments, each experiment must have a unique UUID.

There’s nothing special about UUID values: they’re simply meant to be randomly generated values that are almost guaranteed to be unique. You can get a UUID value easily on most Unix systems by running the uuidgen command:

$ uuidgen
36C2010B-9D65-4066-AB91-CE9D1FAE30B4
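
If uuidgen is not available, Python's standard library offers an equivalent. The following is a minimal sketch; the variable name experiment_uuid simply mirrors the metadata.yaml field.

```python
import uuid

# uuid4() produces a randomly generated (version 4) UUID.
experiment_uuid = str(uuid.uuid4())
print(experiment_uuid)
```

Note that Python prints the UUID in lowercase, whereas some uuidgen implementations print uppercase; both are representations of the same kind of value.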

Note

The access-nri-intake package includes some command-line utility scripts to help with creating and validating metadata.yaml files:

  • To create an empty metadata.yaml template in the current directory:

    $ metadata-template
    

    You’ll then need to replace all the values enclosed in <>. Fields marked as REQUIRED must be filled in. All other fields are encouraged but can be deleted or commented out if they are not relevant.

  • To validate a metadata.yaml file (i.e. to check that required fields are present with required types):

    $ metadata-validate <path/to/metadata.yaml>
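
For a quick pre-check before running metadata-validate, the REQUIRED fields from the template can be tested with a few lines of Python on an already-parsed metadata mapping. This is only a sketch under the assumption that the fields marked REQUIRED in the template above are the complete required set; it is not a substitute for validation against EXP_JSONSCHEMA, and the helper names and example values are hypothetical.

```python
# Fields marked REQUIRED in the metadata.yaml template above.
REQUIRED_FIELDS = ("name", "experiment_uuid", "description", "long_description")


def missing_required(metadata):
    """Return the REQUIRED fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


def description_ok(metadata):
    """Check the template's '< 150 char' guidance for the short description."""
    return len(metadata.get("description", "")) < 150


# Hypothetical metadata, for illustration only.
metadata = {
    "name": "example_experiment",
    "experiment_uuid": "36C2010B-9D65-4066-AB91-CE9D1FAE30B4",
    "description": "A short description of the experiment",
    "long_description": "A longer description of the experiment.",
}

print(missing_required(metadata))
print(description_ok(metadata))
```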
    

Catalog versioning

Note

New in version 0.1.4.

Catalog versions (as distinct from the package version of access_nri_intake_catalog) are date-formatted strings, e.g., v2024-11-29.
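
As a sketch of what such a version string looks like in code, the helpers below format and recognise versions of the form vYYYY-MM-DD. The pattern is inferred from the single example above, and the function names are hypothetical rather than part of the package.

```python
import datetime
import re

# Assumed naming scheme, inferred from the example "v2024-11-29".
VERSION_PATTERN = re.compile(r"^v\d{4}-\d{2}-\d{2}$")


def version_from_date(day):
    """Format a date as a catalog version string."""
    return f"v{day.isoformat()}"


def is_catalog_version(name):
    """Return True if name looks like vYYYY-MM-DD and is a real calendar date."""
    if not VERSION_PATTERN.match(name):
        return False
    try:
        datetime.date.fromisoformat(name[1:])
    except ValueError:
        return False
    return True


print(version_from_date(datetime.date(2024, 11, 29)))
print(is_catalog_version("v2024-11-29"), is_catalog_version("source"))
```

One convenient property of this format is that plain string comparison orders versions chronologically, so min() and max() work directly on the strings.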

When a new catalog version is built (see Releases), the build script will analyze both the catalog storage directory defined by --build_base_path, and the catalog YAML location defined by --catalog_base_path, and then create or update the catalog reference YAML (catalog.yaml) as follows:

  1. If no catalog.yaml exists in --catalog_base_path, then a new one will be created, with the default catalog version set to the new catalog version. The minimum and maximum supported catalog versions will be calculated as follows:

    1. If there are no directories or symlinks in --build_base_path that match the version naming schema, it is assumed that no other catalog versions exist, and the minimum/maximum catalog version will be set to the new version;

    2. If there are existing catalog directories in --build_base_path, the build system will assume that those catalogs are compatible with the new catalog, and will compute a minimum and maximum catalog version to encompass those existing directories (i.e., the minimum and maximum catalog version in catalog.yaml will be the minimum and maximum catalog versions currently in --build_base_path, modulo the new version number).

  2. If a catalog.yaml exists in --catalog_base_path, and the newly-built catalog appears to have a structure and schema consistent with those defined in the existing catalog.yaml, then the existing catalog.yaml will be updated so that its default and maximum versions equal the new catalog version; the previous minimum version will not be altered. The presence or absence of catalog directories in --build_base_path will not be considered.

  3. If a catalog.yaml exists in --catalog_base_path, but the newly-built catalog has a different structure/schema to what’s defined in the existing catalog, then a brand-new catalog.yaml will be created, describing the new catalog structure, and setting all versions (minimum, maximum, default) to the new catalog version. The existing catalog.yaml will be renamed to catalog-<old min version>-<old max version>.yaml, or catalog-<version>.yaml if it only supported a single catalog version. The presence or absence of catalog directories in --build_base_path will not be considered.
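
The three cases above can be summarised in a short decision function. This is a sketch of the rules as described here, not the build script's actual implementation: the function name, the dictionary keys, and the way schema compatibility is passed in as a boolean are all hypothetical, and case 1 reflects one reading of "modulo the new version number" (the new version is included when computing the range).

```python
def plan_catalog_yaml(new_version, existing_yaml=None,
                      schema_compatible=True, built_versions=()):
    """Return (versions, archived_name) for the catalog.yaml to be written.

    existing_yaml: None, or a dict with 'min', 'max' and 'default' versions.
    built_versions: version-named directories found in --build_base_path.
    archived_name: new file name for the old catalog.yaml, or None.
    """
    if existing_yaml is None:
        # Case 1: no catalog.yaml yet; span existing directories plus the new one.
        candidates = set(built_versions) | {new_version}
        versions = {"min": min(candidates), "max": max(candidates),
                    "default": new_version}
        return versions, None
    if schema_compatible:
        # Case 2: extend the existing range; the minimum is left untouched.
        versions = {"min": existing_yaml["min"], "max": new_version,
                    "default": new_version}
        return versions, None
    # Case 3: incompatible schema; start afresh and archive the old file.
    old_min, old_max = existing_yaml["min"], existing_yaml["max"]
    if old_min == old_max:
        archived = f"catalog-{old_min}.yaml"
    else:
        archived = f"catalog-{old_min}-{old_max}.yaml"
    versions = {"min": new_version, "max": new_version, "default": new_version}
    return versions, archived


old = {"min": "v2024-10-01", "max": "v2024-11-01", "default": "v2024-11-01"}
print(plan_catalog_yaml("v2024-11-29", existing_yaml=old))
print(plan_catalog_yaml("v2024-11-29", existing_yaml=old, schema_compatible=False))
```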

access_nri_intake_catalog links only a single catalog.yaml to the entry point intake.cat.access_nri: either the user’s local version or, if that does not exist, the live version on Gadi (see FAQs). To load outdated catalogs from Gadi, we recommend copying the relevant catalog-<old min version>-<old max version>.yaml to ~/.access_nri_intake_catalog/catalog.yaml.
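
The copy step recommended above might be scripted as follows. This is a sketch only: the helper name use_archived_catalog is hypothetical, the location of the archived file on Gadi is not specified here and must be supplied by the caller, and the demonstration runs entirely inside a temporary directory so that no real local catalog is touched.

```python
import shutil
import tempfile
from pathlib import Path


def use_archived_catalog(archived_yaml, home=None):
    """Copy an archived catalog YAML to <home>/.access_nri_intake_catalog/catalog.yaml.

    Any existing local catalog.yaml at that location is overwritten.
    """
    home = Path(home) if home is not None else Path.home()
    target = home / ".access_nri_intake_catalog" / "catalog.yaml"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(archived_yaml, target)
    return target


# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "catalog-v2024-10-01-v2024-11-01.yaml"
    src.write_text("# placeholder contents\n")
    target = use_archived_catalog(src, home=Path(tmp) / "home")
    copied = target.read_text()
    print(target.name)
```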