Building the catalog#
The access-nri-intake package includes a command-line script called catalog-build for building
catalogs using the tools described in the previous sections. The script reads Configuration files that
specify the paths to sources and which Builders and Translators to use. It can be used as follows:
$ catalog-build --help
usage: catalog-build [-h] [--build_base_path BUILD_BASE_PATH] [--catalog_base_path CATALOG_BASE_PATH]
[--catalog_file CATALOG_FILE] [--version VERSION] [--no_update] config_yaml [config_yaml ...]
Build an intake-dataframe-catalog from YAML configuration file(s).
positional arguments:
config_yaml Configuration YAML file(s) specifying the Intake source(s) to add.
options:
-h, --help show this help message and exit
--build_base_path BUILD_BASE_PATH
Directory in which to build the catalog and source(s). A directory
with name equal to the version (see the `--version` argument) of
the catalog being built will be created here. The catalog file
(see the `--catalog_file` argument) will be written into this version
directory, and any new intake source(s) will be written into a
'source' directory within the version directory.
Defaults to the current work directory.
--catalog_base_path CATALOG_BASE_PATH
Directory in which to place the catalog.yaml file. This file is the
descriptor of the catalog, and provides references to the data locations
where the catalog data itself is stored (build_base_path).
Defaults to the current work directory.
--catalog_file CATALOG_FILE
The name of the intake-dataframe-catalog. Defaults to 'metacatalog.csv'
--version VERSION The version of the catalog to build/add to. Defaults to the current date.
--no_update Set this if you don't want to update the access_nri_intake.data (e.g. if running a test)
The ACCESS-NRI catalog is built using this script by submitting the build_all.sh shell script in the
bin/ directory of ACCESS-NRI/access-nri-intake-catalog. See the section on Releases for more details.
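As a rough illustration of how the options above fit together, the following sketch assembles a catalog-build argument list in Python. The helper name build_command and the example paths are hypothetical. Only the flags shown in the help output above are used.

```python
import shlex


def build_command(config_yamls, build_base_path=None, catalog_base_path=None,
                  catalog_file=None, version=None, no_update=False):
    """Assemble an argument list for catalog-build (hypothetical helper)."""
    if not config_yamls:
        raise ValueError("at least one configuration YAML file is required")
    cmd = ["catalog-build"]
    # Optional arguments are only added when given, so the script's own
    # defaults (current directory, current date, ...) apply otherwise.
    if build_base_path is not None:
        cmd += ["--build_base_path", str(build_base_path)]
    if catalog_base_path is not None:
        cmd += ["--catalog_base_path", str(catalog_base_path)]
    if catalog_file is not None:
        cmd += ["--catalog_file", str(catalog_file)]
    if version is not None:
        cmd += ["--version", str(version)]
    if no_update:
        cmd.append("--no_update")
    # Positional configuration files go last.
    cmd += [str(path) for path in config_yamls]
    return cmd


# Example paths below are placeholders, not real locations.
cmd = build_command(
    ["config/access-cm2.yaml"],
    build_base_path="/tmp/catalog_test",
    version="v2024-11-29",
    no_update=True,
)
print(shlex.join(cmd))
```

To actually run the build, the list could be passed to subprocess.run(cmd, check=True) in an environment where access-nri-intake is installed.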
Configuration files#
The catalog-build script reads configuration files like the ones found in
ACCESS-NRI/access-nri-intake-catalog (these are the configuration files used to
build the ACCESS-NRI catalog). Configuration files should include the Builder and Translator to use along
with a list of sources to process. As a minimum, each source should specify the path(s) to pass to the
Builder and the path to the metadata.yaml file for that source. Additional kwargs to pass to the Builder
can also be specified. As an example, a configuration file might look something like:
builder: AccessCm2Builder
translator: DefaultTranslator
sources:
  - path:
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944a
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944b
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944c
      - /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944d
    metadata_yaml: /g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944/metadata.yaml
    ensemble: true
In most cases, adding a new Intake-ESM datastore to the ACCESS-NRI catalog should be as simple as adding a new entry to the configuration files and rebuilding the catalog.
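Before rebuilding, it can be useful to sanity-check a new entry. The sketch below works on an already-parsed configuration (for example the result of loading the YAML into a dictionary) and checks for the keys described above. The helper names check_config and builder_kwargs are hypothetical. Treating every key other than path and metadata_yaml as a Builder kwarg is an assumption based on the example, not a statement about how catalog-build parses the file.

```python
def check_config(config):
    """Return a list of problems found in a parsed configuration mapping."""
    problems = []
    for key in ("builder", "translator", "sources"):
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    for index, source in enumerate(config.get("sources", [])):
        # Each source needs at least the data path(s) and a metadata.yaml path.
        for key in ("path", "metadata_yaml"):
            if key not in source:
                problems.append(f"source {index}: missing {key}")
    return problems


def builder_kwargs(source):
    """Assumption: keys other than path/metadata_yaml are Builder kwargs."""
    return {k: v for k, v in source.items() if k not in ("path", "metadata_yaml")}


# Python equivalent of (part of) the example configuration file above.
config = {
    "builder": "AccessCm2Builder",
    "translator": "DefaultTranslator",
    "sources": [
        {
            "path": [
                "/g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944",
                "/g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944a",
            ],
            "metadata_yaml": "/g/data/p73/archive/non-CMIP/ACCESS-CM2/bx944/metadata.yaml",
            "ensemble": True,
        }
    ],
}

print(check_config(config))
print(builder_kwargs(config["sources"][0]))
```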
metadata.yaml files#
Each source in the catalog must have an associated metadata.yaml file that includes key high-level
metadata about the data product. This is to ensure that there is core metadata associated with all data
products in the catalog. Additionally, this core metadata is added to the corresponding Intake-ESM
datastore’s metadata attribute, meaning it is available to Translators and to catalog users wanting
to know more about a particular product. The contents of the metadata.yaml files are validated against
access_nri_intake.catalog.EXP_JSONSCHEMA (see Adding sources) when the script catalog-build
is called to ensure that all required metadata is available prior to building the catalog. The
metadata.yaml file should include the following:
schema_version: <The version of the schema (string)>
name: <REQUIRED The name of the experiment (string)>
experiment_uuid: <REQUIRED Unique uuid for the experiment (string)>
description: <REQUIRED Short description of the experiment (string, < 150 char)>
long_description: <REQUIRED Long description of the experiment (string)>
model:
- <The name(s) of the model(s) used in the experiment (string)>
realm:
- <The realm(s) included in the experiment (string)>
frequency:
- <The frequency(/ies) included in the experiment (string)>
variable:
- <The variable(s) included in the experiment (string)>
nominal_resolution:
- <The nominal resolution(s) of model(s) used in the experiment (string)>
version: <The version of the experiment (number, string)>
contact: <Contact name for the experiment (string)>
email: <Email address of the contact for the experiment (string)>
created: <Initial creation date of experiment (string)>
reference: <Citation or reference information (string)>
license: <License of the experiment (string)>
url: <Relevant url, e.g. github repo for experiment configuration (string)>
parent_experiment: <experiment_uuid for parent experiment if appropriate (string)>
related_experiments:
- <experiment_uuids for any related experiment(s) (string)>
notes: <Additional notes (string)>
keywords:
- <Keywords to associate with the experiment (string)>
Ideally this file will live in the base output directory of your model run so that it’s easy for others to find, even if they aren’t using the catalog (but it doesn’t have to).
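To make the REQUIRED markers in the template concrete, here is a minimal sketch that checks a parsed metadata mapping for the four required fields. It is a simplified stand-in and not the actual access_nri_intake.catalog.EXP_JSONSCHEMA validation. The helper name check_metadata, the example values, and the use of uuid.uuid4() to generate an experiment_uuid are all assumptions made for illustration.

```python
import uuid

# Fields marked REQUIRED in the template above.
REQUIRED_FIELDS = ("name", "experiment_uuid", "description", "long_description")


def check_metadata(metadata):
    """Return a list of problems with the required fields (simplified check)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: required non-empty string")
    description = metadata.get("description")
    # The template asks for a short description of fewer than 150 characters.
    if isinstance(description, str) and len(description) >= 150:
        problems.append("description: should be shorter than 150 characters")
    return problems


# Placeholder values, for illustration only.
example = {
    "name": "my_experiment",
    "experiment_uuid": str(uuid.uuid4()),
    "description": "A short description of the experiment",
    "long_description": "A longer description of what was run and why.",
}

print(check_metadata(example))
print(check_metadata({"name": "only_a_name"}))
```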
Note
The access-nri-intake package includes some command-line utility scripts to help with creating and
validating metadata.yaml files:
- To create an empty metadata.yaml template in the current directory:
  $ metadata-template
  You’ll then need to replace all the values enclosed in <>. Fields marked as REQUIRED are required. All other fields are encouraged but can be deleted or commented out if they are not relevant.
- To validate a metadata.yaml file (i.e. to check that required fields are present with required types):
  $ metadata-validate <path/to/metadata.yaml>
Catalog versioning#
Note
New in version 0.1.4.
Catalog versions (as distinct from the package version of access_nri_intake_catalog) are date-formatted
strings, e.g., v2024-11-29.
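A small sketch of working with such version strings follows. It assumes every version has the form of the example above (a "v" followed by a zero-padded ISO date). The names VERSION_PATTERN, format_version and parse_version are hypothetical and are not part of the package.

```python
import datetime
import re

# Assumed naming schema, based on the example v2024-11-29.
VERSION_PATTERN = re.compile(r"^v(\d{4})-(\d{2})-(\d{2})$")


def format_version(day):
    """Format a date as a catalog version string."""
    return day.strftime("v%Y-%m-%d")


def parse_version(version):
    """Parse a catalog version string into a date, or raise ValueError."""
    match = VERSION_PATTERN.match(version)
    if match is None:
        raise ValueError(f"not a date-formatted catalog version: {version!r}")
    year, month, day = (int(group) for group in match.groups())
    return datetime.date(year, month, day)


print(parse_version("v2024-11-29"))
print(format_version(datetime.date.today()))
```

Because the date components are zero-padded, sorting these strings alphabetically also sorts them chronologically, which makes minimum and maximum versions easy to compute.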
When a new catalog version is built (see Releases), the build script will analyze both the catalog storage
directory defined by --build_base_path, and the catalog YAML location defined by --catalog_base_path, and
then create or update the catalog reference YAML (catalog.yaml) as follows:
- If no catalog.yaml exists in --catalog_base_path, then a new one will be created, with the default catalog version set to the new catalog version. The minimum and maximum supported catalog versions will be calculated as follows:
  - If there are no directories or symlinks in --build_base_path that match the version naming schema, it is assumed that no other catalog versions exist, and the minimum and maximum catalog versions will both be set to the new version;
  - If there are existing catalog directories in --build_base_path, the build system will assume that those catalogs are compatible with the new catalog, and will compute a minimum and maximum catalog version to encompass those existing directories (i.e., the minimum and maximum catalog versions in catalog.yaml will be the minimum and maximum catalog versions currently in --build_base_path, also taking the new version into account).
- If a catalog.yaml exists in --catalog_base_path, and the newly-built catalog appears to have a structure and schema consistent with that defined in the existing catalog.yaml, then the existing catalog.yaml will be updated to have new default and maximum versions equal to the new catalog version; the previous minimum version will not be altered. The presence/absence of catalog directories in --build_base_path will not be considered.
- If a catalog.yaml exists in --catalog_base_path, but the newly-built catalog has a different structure/schema to what’s defined in the existing catalog, then a brand-new catalog.yaml will be created, describing the new catalog structure, and setting all versions (minimum, maximum, default) to the new catalog version. The existing catalog.yaml will be renamed to catalog-<old min version>-<old max version>.yaml, or catalog-<version>.yaml if it only supported a single catalog version. The presence/absence of catalog directories in --build_base_path will not be considered.
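The three cases above can be summarised as a small decision function. This is a sketch of the rules as described in this section, not the build script's actual code. The function name plan_catalog_yaml, its arguments, and the example version strings are hypothetical. Whether the new catalog is structurally compatible with the existing one is passed in as a flag rather than computed.

```python
def plan_catalog_yaml(new_version, existing=None, build_versions=(), compatible=True):
    """Work out the versions for catalog.yaml after a new build.

    existing: None, or a dict with the "min", "max" and "default" versions
        recorded in the existing catalog.yaml.
    build_versions: version names already present in --build_base_path
        (only consulted when there is no existing catalog.yaml).
    Returns (versions, archived_name), where archived_name is the new file
    name for the old catalog.yaml, or None if nothing is renamed.
    """
    if existing is None:
        # Case 1: no catalog.yaml yet. Span any existing version
        # directories together with the new version.
        candidates = list(build_versions) + [new_version]
        versions = {"min": min(candidates), "max": max(candidates),
                    "default": new_version}
        return versions, None
    if compatible:
        # Case 2: same structure/schema. Bump default and maximum only.
        versions = {"min": existing["min"], "max": new_version,
                    "default": new_version}
        return versions, None
    # Case 3: structure changed. Start afresh and archive the old file.
    if existing["min"] == existing["max"]:
        archived = f"catalog-{existing['min']}.yaml"
    else:
        archived = f"catalog-{existing['min']}-{existing['max']}.yaml"
    versions = {"min": new_version, "max": new_version, "default": new_version}
    return versions, archived


# Hypothetical existing catalog.yaml contents.
old = {"min": "v2024-06-19", "max": "v2024-10-02", "default": "v2024-10-02"}
print(plan_catalog_yaml("v2024-11-29", existing=old, compatible=True))
print(plan_catalog_yaml("v2024-11-29", existing=old, compatible=False))
```

Comparing versions with min and max relies on the date-formatted names sorting chronologically as plain strings.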
access_nri_intake_catalog links only a single catalog.yaml to the entry point intake.cat.access_nri:
either the user’s local version or, if that does not exist, the live version on Gadi (see FAQs). To load
outdated catalogs from Gadi, we recommend copying the relevant
catalog-<old min version>-<old max version>.yaml to ~/.access_nri_intake_catalog/catalog.yaml.
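The recommended copy can be done by hand, or with a few lines of Python as sketched below. The helper name use_archived_catalog is hypothetical. The location of the archived file on Gadi is not given in this section, so the demonstration uses a temporary directory and a placeholder file instead of real paths.

```python
import shutil
import tempfile
from pathlib import Path


def use_archived_catalog(archived_yaml, user_dir=None):
    """Copy an archived catalog YAML to <user_dir>/catalog.yaml.

    user_dir defaults to ~/.access_nri_intake_catalog. Any catalog.yaml
    already there is overwritten.
    """
    if user_dir is None:
        user_dir = Path.home() / ".access_nri_intake_catalog"
    user_dir = Path(user_dir)
    user_dir.mkdir(parents=True, exist_ok=True)
    target = user_dir / "catalog.yaml"
    shutil.copyfile(archived_yaml, target)
    return target


# Demonstration in a throwaway directory, with a placeholder file standing
# in for an archived catalog YAML.
with tempfile.TemporaryDirectory() as tmp:
    archived = Path(tmp) / "catalog-v2024-06-19-v2024-10-02.yaml"
    archived.write_text("# placeholder contents\n")
    copied = use_archived_catalog(archived, user_dir=Path(tmp) / "user")
    print(copied.name)
```

Since the copy replaces any local catalog.yaml, it may be worth backing up an existing file first.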