access_nri_intake.cloud
=======================

.. py:module:: access_nri_intake.cloud


Attributes
----------

.. autoapisummary::

   access_nri_intake.cloud.logger
   access_nri_intake.cloud.log_fmt
   access_nri_intake.cloud.PARTITION_TABLE
   access_nri_intake.cloud.CONTAINER_HEADERS
   access_nri_intake.cloud.BUCKET_BASE_URL
   access_nri_intake.cloud.ROW_GROUP_SIZE


Classes
-------

.. autoapisummary::

   access_nri_intake.cloud.CatalogMirror


Functions
---------

.. autoapisummary::

   access_nri_intake.cloud.mirror_catalog


Module Contents
---------------

.. py:data:: logger

.. py:data:: log_fmt
   :value: '%(asctime)s - %(name)s - %(levelname)s - %(message)s'


.. py:data:: PARTITION_TABLE

   
   Headers to set on the container when we upload it to object storage.  These are
   required to make sure the files are readable by anyone, and that we can do range
   requests on them (for efficient querying).


   ..
       !! processed by numpydoc !!

.. py:data:: CONTAINER_HEADERS

.. py:data:: BUCKET_BASE_URL
   :value: 'https://object-store.rc.nectar.org.au/v1/AUTH_685340a8089a4923a71222ce93d5d323/access-nri-intake...


   //stackoverflow.com/questions/76782018/what-is-actually-meant-when-referring-to-parquet-row-group-size
   We are tuning down row group size here because we use this files to render an interactive UI, so we're less interested in
   total throughput and more interested in getting the first few rows as quickly as possible.


   ..
       !! processed by numpydoc !!

   :type: See https

.. py:data:: ROW_GROUP_SIZE
   :value: 10000


.. py:class:: CatalogMirror

   
   Mirror the intake catalog to the datalake.

   Implementation Notes:

   Could be improved with:
   - Fault Tolerance (Currently, one file breaking will break the whole thing).
   - Batching/Async (Fetch/Post multiple files at once)
   - Steaming (Is it totally necessary to download everything, do the work, and then post it? Smaller memory footprint might be helpful.)


   ..
       !! processed by numpydoc !!

   .. py:attribute:: bucket_name
      :value: 'access-nri-intake-catalog'


   .. py:attribute:: local_json_files
      :type:  list[pathlib.Path]
      :value: []


   .. py:attribute:: local_pq_files
      :type:  list[pathlib.Path]
      :value: []


   .. py:attribute:: failed_json_files
      :type:  list[pathlib.Path]
      :value: []


   .. py:attribute:: failed_pq_files
      :type:  list[pathlib.Path]
      :value: []


   .. py:attribute:: local_mirror_path


   .. py:attribute:: metacat_path


   .. py:attribute:: basedir


   .. py:method:: mirror_intake_catalog(catalog_version = None, hidden = False)

      
      Mirrors the intake catalog to the datalake. Works by scp'ing the specified
      folder off of Gadi, and then doing a bit of processing to get it into the format
      we want for this server.


      :Parameters:

          **version** : date
              The version date of the intake catalog to mirror. Defaults to today's date.

          **hidden** : bool
              Whether to mirror a hidden version of the catalog (prefixed with a dot). Defaults to False


      :Returns:

          None
              ..


      .. rubric:: Notes

      This function requires SSH access to Gadi and the Fabric library. As of right now,
      it will just copy a file structure to a local temp folder - further processing
      will be needed to integrate it into the datalake structure.

      To get access to Gadi and run this command, you will require the credentials for
      the `xp65_ci` account. This needs to be configured in your ~/.ssh/config, which should
      contain something like:
      ```yaml
      Host xp65_ci-dm
          Hostname gadi-dm.nci.org.au
          User xp65_ci
          ForwardAgent yes
          ForwardX11 true
          ForwardX11Trusted yes
          IdentityFile ~/.ssh/id_gadi_xp65_ci
          AddKeysToAgent yes
          UseKeychain yes
      ````


      ..
          !! processed by numpydoc !!


   .. py:method:: restructure_metacat()

      
      We need to go into the parquet files we've just mirrrored and make a few
      changes.

      This collapses duplicate names, aggregating lists columns together. This
      effectively removes the `123 entries across 3000 rows` structure in the
      dataframe catalog. It could be removed in future if users find it unhelpful.


      ..
          !! processed by numpydoc !!


   .. py:method:: update_esm_datastores()

      
      We need to go into each of the esm-datastore parquet files and make a few
      changes. Most important, we need to change the `catalog_file` field to
      point to the one next door to it.


      ..
          !! processed by numpydoc !!


   .. py:method:: create_sidecar_files()

      
      Create sidecar files for each of the esm-datastore parquet files. These contain a single row, which is a list of all the available values in their corresponding main parquet files.

      We also write the number of records into the parquet metadata.


      ..
          !! processed by numpydoc !!


   .. py:method:: partition_parquet_files()

      
      Take each of the esm-datastore parquet files and partition them according
      to the PARTITION_TABLE above, before sorting non-partitioned columns using
      their cardinality.

      This should optimise internal file structure for expected access patterns to
      make it as easy as possible for the interactive catalog to just grab the row groups it needs.


      .. rubric:: Notes

      - Row groups sizes are tuned down to 10,000 to optimise for fast page loads in the interactive catalog,
      rather than total throughput.
      - We collect the whole dataframe in memory and then unlink the original file before we write it out,
      because if we partition, we need to change eg. `FILE.parquet` from a file to a folder, which
      the operating system won't let us do without unlinking first. This might be able to be optimised if we run into memory issues.
      - We sort the data by the top 3 least cardinal columns that aren't partition columns, to try and optimise for
      common access patterns in the interactive catalog. TLDR; if we have a column with eg. 10 values, and one with 100 values,
      we're better off sorting by the one with 10 values first, because that will make it more likely that the row groups we
      need to load for a given query will be contiguous. This means it's more likely we can skip row groups, partitions, etc,
      which minimises I/O, fetching, and should speed up page loads.


      ..
          !! processed by numpydoc !!


   .. py:method:: write_to_object_storage()

      
      Upload the mirrored catalog to Nectar object storage.

      ## Access Requirements

      This method requires credentials for the **Nectar Cloud** project that hosts the
      `access-nri-intake-catalog` object storage container.

      ### Getting Access

      1. Log in to the Nectar Dashboard at https://dashboard.rc.nectar.org.au
      2. Agree to the Nectar Terms and Conditions if prompted.
      3. Note your username — it is the email address shown in the top-right corner of the
         dashboard after login.
      4. Provide that email address to one of the tenant managers listed below so they can
         add you to the project.

      **Tenant managers** (any of the following can grant access):
      - Jo Basevi
      - Aidan Heerdegen
      - Romain Beucher

      ### Configuring Credentials

      Openstack uses a file called `clouds.yaml` for authentication. Place it at
      `~/.config/openstack/clouds.yaml`. It should contain application credentials for
      the Nectar Cloud project. The default template names the cloud `openstack` — rename
      it to `nectar` to match the `openstack.connect(cloud="nectar")` call in this method.

      See https://tutorials.rc.nectar.org.au/application-credentials/01-overview for a
      step-by-step guide to generating and installing application credentials.


      ..
          !! processed by numpydoc !!


.. py:function:: mirror_catalog(argv = None)

   
   CLI entry point for mirroring the intake catalog.


   ..
       !! processed by numpydoc !!