access_nri_intake.cloud#
Attributes#
Headers to set on the container when we upload it to object storage.
See https://stackoverflow.com/questions/76782018/what-is-actually-meant-when-referring-to-parquet-row-group-size
Classes#
CatalogMirror | Mirror the intake catalog to the datalake.
Functions#
mirror_catalog | CLI entry point for mirroring the intake catalog.
Module Contents#
- access_nri_intake.cloud.logger#
- access_nri_intake.cloud.log_fmt = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'#
- access_nri_intake.cloud.PARTITION_TABLE#
Headers to set on the container when we upload it to object storage. These are required to make sure the files are readable by anyone, and that we can do range requests on them (for efficient querying).
- access_nri_intake.cloud.CONTAINER_HEADERS#
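The range requests mentioned above let a client fetch only part of an object, such as a Parquet footer, rather than the whole file. A minimal sketch of building such a request with the standard library follows; the URL and the `build_range_request` helper are hypothetical and are not part of this module.

```python
from urllib.request import Request


def build_range_request(url: str, start: int, end: int) -> Request:
    """Build a GET request for bytes start..end (inclusive) of url."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # The HTTP Range header asks the server for a slice of the object.
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical object URL, for illustration only; nothing is sent here.
req = build_range_request("https://example.org/container/catalog.parquet", 0, 1023)
print(req.get_header("Range"))
```

A server that honours the request replies with status 206 and only the requested bytes, which is what makes querying large remote files efficient.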
- access_nri_intake.cloud.BUCKET_BASE_URL = 'https://object-store.rc.nectar.org.au/v1/AUTH_685340a8089a4923a71222ce93d5d323/access-nri-intake...#
See https://stackoverflow.com/questions/76782018/what-is-actually-meant-when-referring-to-parquet-row-group-size. We are tuning down row group size here because we use these files to render an interactive UI, so we’re less interested in total throughput and more interested in getting the first few rows as quickly as possible.
- access_nri_intake.cloud.ROW_GROUP_SIZE = 10000#
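To make the trade-off concrete, here is a rough arithmetic sketch. It assumes, as a simplification, that a reader must decode whole row groups, so the cost of showing the first few rows scales with the row group size. The helper names are hypothetical.

```python
import math

ROW_GROUP_SIZE = 10_000  # value documented for this module


def row_group_count(n_rows: int, row_group_size: int = ROW_GROUP_SIZE) -> int:
    """Number of row groups needed to hold n_rows."""
    return math.ceil(n_rows / row_group_size)


def rows_decoded_for_first(n_wanted: int, n_rows: int,
                           row_group_size: int = ROW_GROUP_SIZE) -> int:
    """Rows decoded to return the first n_wanted rows, assuming whole
    row groups are read (a simplifying assumption)."""
    groups = math.ceil(min(n_wanted, n_rows) / row_group_size)
    return min(groups * row_group_size, n_rows)


# Showing the first 100 rows of a 3,000,000 row table:
print(rows_decoded_for_first(100, 3_000_000))             # small row groups
print(rows_decoded_for_first(100, 3_000_000, 1_000_000))  # large row groups
print(row_group_count(3_000_000))
```

Smaller row groups mean more groups per file, which costs some throughput on full scans but keeps the first page of results cheap.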
- class access_nri_intake.cloud.CatalogMirror#
Mirror the intake catalog to the datalake.
Implementation Notes:
Could be improved with:
- Fault tolerance (currently, one file breaking will break the whole thing).
- Batching/async (fetch/post multiple files at once).
- Streaming (is it really necessary to download everything, do the work, and then post it? A smaller memory footprint might be helpful).
- bucket_name = 'access-nri-intake-catalog'#
- local_json_files: list[pathlib.Path] = []#
- local_pq_files: list[pathlib.Path] = []#
- failed_json_files: list[pathlib.Path] = []#
- failed_pq_files: list[pathlib.Path] = []#
- local_mirror_path#
- metacat_path#
- basedir#
- mirror_intake_catalog(catalog_version=None, hidden=False)#
Mirrors the intake catalog to the datalake. Works by scp’ing the specified folder off of Gadi, and then doing a bit of processing to get it into the format we want for this server.
- Parameters:
- catalog_version (date)
The version date of the intake catalog to mirror. Defaults to today’s date.
- hidden (bool)
Whether to mirror a hidden version of the catalog (prefixed with a dot). Defaults to False.
- Returns:
- None
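As an illustration of how the two parameters might combine into a folder name, here is a small sketch. The `v` prefix, the ISO date format, and the `version_dirname` helper are assumptions made for this example; only the default of today's date and the leading dot for hidden versions come from the description above.

```python
from datetime import date
from typing import Optional


def version_dirname(catalog_version: Optional[date] = None,
                    hidden: bool = False) -> str:
    """Return a hypothetical folder name for a catalog version."""
    # Fall back to today's date when no version is given.
    version = catalog_version or date.today()
    name = f"v{version.isoformat()}"  # "v" prefix is an assumption
    # Hidden versions are prefixed with a dot.
    return f".{name}" if hidden else name


print(version_dirname(date(2025, 1, 31)))
print(version_dirname(date(2025, 1, 31), hidden=True))
```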
Notes
This function requires SSH access to Gadi and the Fabric library. As of right now, it will just copy a file structure to a local temp folder - further processing will be needed to integrate it into the datalake structure.
To get access to Gadi and run this command, you will require the credentials for the xp65_ci account. This needs to be configured in your ~/.ssh/config, which should contain something like:
```yaml
Host xp65_ci-dm
    Hostname gadi-dm.nci.org.au
    User xp65_ci
    ForwardAgent yes
    ForwardX11 true
    ForwardX11Trusted yes
    IdentityFile ~/.ssh/id_gadi_xp65_ci
    AddKeysToAgent yes
    UseKeychain yes
```
- restructure_metacat()#
We need to go into the parquet files we’ve just mirrored and make a few changes.
This collapses duplicate names, aggregating list columns together. This effectively removes the “123 entries across 3000 rows” structure in the dataframe catalog. It could be removed in future if users find it unhelpful.
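The collapsing step can be pictured with plain Python dictionaries. This is only a sketch of the idea, not the module's implementation, which presumably works on a dataframe; the column names are hypothetical, and removing duplicate values from the merged lists is an assumption.

```python
def collapse_by_name(rows: list[dict], key: str = "name") -> list[dict]:
    """Merge rows sharing the same key, concatenating their list columns."""
    merged: dict[str, dict] = {}
    for row in rows:
        target = merged.setdefault(row[key], {key: row[key]})
        for col, values in row.items():
            if col == key:
                continue
            bucket = target.setdefault(col, [])
            for value in values:
                # Assumption: keep each value once, in first-seen order.
                if value not in bucket:
                    bucket.append(value)
    return list(merged.values())


# Hypothetical catalog rows: two entries for the same experiment name.
rows = [
    {"name": "expt_a", "variable": ["tas"], "realm": ["atmos"]},
    {"name": "expt_a", "variable": ["pr"], "realm": ["atmos"]},
    {"name": "expt_b", "variable": ["sst"], "realm": ["ocean"]},
]
collapsed = collapse_by_name(rows)
print(collapsed)
```

After collapsing, each name appears exactly once, so the number of rows equals the number of catalog entries.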
- update_esm_datastores()#
We need to go into each of the esm-datastore parquet files and make a few changes. Most importantly, we need to change the catalog_file field to point to the one next door to it.
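A sketch of the repointing step follows. It assumes that `catalog_file` lives in a JSON descriptor sitting beside the data file, as in the intake-esm specification, and that the sibling shares the descriptor's stem with a `.parquet` suffix. The `point_to_sibling` helper and file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def point_to_sibling(json_path: Path, suffix: str = ".parquet") -> dict:
    """Rewrite catalog_file so it names the file next to the descriptor."""
    spec = json.loads(json_path.read_text())
    # A bare file name is a relative reference to the neighbouring file.
    spec["catalog_file"] = json_path.with_suffix(suffix).name
    json_path.write_text(json.dumps(spec, indent=2))
    return spec


with tempfile.TemporaryDirectory() as tmp:
    descriptor = Path(tmp) / "cmip6.json"
    # Hypothetical descriptor pointing at a path that no longer applies.
    descriptor.write_text(
        json.dumps({"id": "cmip6", "catalog_file": "/hypothetical/cmip6.csv.gz"})
    )
    updated = point_to_sibling(descriptor)
    print(updated["catalog_file"])
```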
- create_sidecar_files()#
Create sidecar files for each of the esm-datastore parquet files. These contain a single row, which is a list of all the available values in their corresponding main parquet files.
We also write the number of records into the parquet metadata.
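The sidecar idea can be sketched in plain Python: one row whose cells hold every distinct value seen in the corresponding column, plus a record count. The column names and the `build_sidecar` helper are hypothetical, and sorting the values is an assumption made to keep the output deterministic.

```python
def build_sidecar(rows: list[dict]) -> tuple[dict, int]:
    """Return (single sidecar row, number of records) for the given rows."""
    seen: dict[str, set] = {}
    for row in rows:
        for col, value in row.items():
            seen.setdefault(col, set()).add(value)
    # One row: each column maps to the list of its available values.
    sidecar_row = {col: sorted(values) for col, values in seen.items()}
    return sidecar_row, len(rows)


# Hypothetical datastore rows.
rows = [
    {"variable": "tas", "frequency": "1mon"},
    {"variable": "pr", "frequency": "1mon"},
    {"variable": "tas", "frequency": "1day"},
]
sidecar, n_records = build_sidecar(rows)
print(sidecar, n_records)
```

A user interface can read this small file to populate filter options without opening the main file.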
- partition_parquet_files()#
Take each of the esm-datastore parquet files and partition them according to the PARTITION_TABLE above, before sorting non-partitioned columns using their cardinality.
This should optimise internal file structure for expected access patterns to make it as easy as possible for the interactive catalog to just grab the row groups it needs.
Notes
- Row group sizes are tuned down to 10,000 to optimise for fast page loads in the interactive catalog, rather than total throughput.
- We collect the whole dataframe in memory and then unlink the original file before we write it out, because if we partition, we need to change e.g. FILE.parquet from a file to a folder, which the operating system won’t let us do without unlinking first. This might be able to be optimised if we run into memory issues.
- We sort the data by the 3 lowest-cardinality columns that aren’t partition columns, to try and optimise for common access patterns in the interactive catalog. TL;DR: if we have a column with e.g. 10 values, and one with 100 values, we’re better off sorting by the one with 10 values first, because that will make it more likely that the row groups we need to load for a given query will be contiguous. This means it’s more likely we can skip row groups, partitions, etc., which minimises I/O and fetching, and should speed up page loads.
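The cardinality-based sort described in the notes can be sketched with plain Python. This is an illustration of the ordering rule only, not the module's dataframe code; the column names, the partition column, and both helper functions are hypothetical.

```python
def sort_columns(rows: list[dict], partition_cols: set[str], top_n: int = 3) -> list[str]:
    """Pick the top_n lowest-cardinality columns that are not partition columns."""
    if not rows:
        return []
    candidates = [c for c in rows[0] if c not in partition_cols]
    cardinality = {c: len({r[c] for r in rows}) for c in candidates}
    # sorted() is stable, so ties keep their original column order.
    return sorted(candidates, key=lambda c: cardinality[c])[:top_n]


def sort_rows(rows: list[dict], partition_cols: set[str]) -> list[dict]:
    """Sort rows by the chosen columns, least cardinal first."""
    keys = sort_columns(rows, partition_cols)
    return sorted(rows, key=lambda r: tuple(r[k] for k in keys))


rows = [
    {"realm": "ocean", "frequency": "1mon", "variable": "sst", "path": "d.nc"},
    {"realm": "atmos", "frequency": "1day", "variable": "tas", "path": "c.nc"},
    {"realm": "atmos", "frequency": "1mon", "variable": "pr", "path": "b.nc"},
    {"realm": "ocean", "frequency": "1day", "variable": "sst", "path": "a.nc"},
]
order = sort_columns(rows, {"realm"})
print(order)
print([r["path"] for r in sort_rows(rows, {"realm"})])
```

Sorting by the least cardinal column first keeps rows with equal values in long contiguous runs, which is what lets a reader skip row groups that cannot match a filter.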
- write_to_object_storage()#
Upload the mirrored catalog to Nectar object storage.
## Access Requirements
This method requires credentials for the Nectar Cloud project that hosts the access-nri-intake-catalog object storage container.
### Getting Access
1. Log in to the Nectar Dashboard at https://dashboard.rc.nectar.org.au
2. Agree to the Nectar Terms and Conditions if prompted.
3. Note your username: it is the email address shown in the top-right corner of the dashboard after login.
4. Provide that email address to one of the tenant managers listed below so they can add you to the project.
Tenant managers (any of the following can grant access):
- Jo Basevi
- Aidan Heerdegen
- Romain Beucher
### Configuring Credentials
OpenStack uses a file called clouds.yaml for authentication. Place it at ~/.config/openstack/clouds.yaml. It should contain application credentials for the Nectar Cloud project. The default template names the cloud openstack; rename it to nectar to match the openstack.connect(cloud="nectar") call in this method.
See https://tutorials.rc.nectar.org.au/application-credentials/01-overview for a step-by-step guide to generating and installing application credentials.
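The rename step can be done by hand in a text editor. For illustration, a minimal sketch that performs the same edit on the file's text is shown below; the template contents are placeholders rather than real credentials, and the `rename_cloud` helper is hypothetical.

```python
import re


def rename_cloud(clouds_yaml: str, old: str = "openstack", new: str = "nectar") -> str:
    """Rename the first indented cloud key `old:` to `new:` in clouds.yaml text."""
    pattern = re.compile(rf"^([ \t]+){re.escape(old)}:[ \t]*$", flags=re.MULTILINE)
    # Keep the original indentation (group 1) and swap only the key name.
    return pattern.sub(rf"\g<1>{new}:", clouds_yaml, count=1)


# Placeholder template; real files carry your own credential values.
template = """clouds:
  openstack:
    auth:
      auth_url: https://example.org:5000/v3/
      application_credential_id: "<credential-id>"
      application_credential_secret: "<credential-secret>"
    auth_type: "v3applicationcredential"
"""
renamed = rename_cloud(template)
print(renamed)
```

With the key renamed, the cloud name in the file matches the name passed to `openstack.connect`.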
- access_nri_intake.cloud.mirror_catalog(argv=None)#
CLI entry point for mirroring the intake catalog.