The MPCDF Metadata Tools: User Documentation

The MMD Tools (short for MPCDF Metadata Tools) can be used to create and manage metadata in several common metadata schemata.

Introduction

The mmd tools consist of four callable scripts and one module that provides additional functionality. The tools should be available on MPCDF systems, but can also be installed easily on most other Linux systems.

Installing

The mmd tools should run on every system with a reasonably modern version of Python 3. We recommend using the Anaconda Python Distribution, which is also available in MPCDF’s module system on the HPC clusters. The following procedure installs the mmd tools into a new virtual environment using Anaconda Python on one of the HPC clusters.

First, load the Anaconda Python Distribution via the module system:

module load anaconda/3/2021.11

You can check the availability of Anaconda via:

find-module anaconda

Now create a new Python virtual environment for the mmd tools (you can give it another name of course):

conda create --name mmd

Activate the newly created environment:

conda activate mmd

The prompt of your shell should now show the name of the newly created environment.

Next, clone the GitLab repository:

git clone git@gitlab.mpcdf.mpg.de:mmd/mmd-tools.git

Change to the newly created directory and install the mmd tools together with all necessary Python libraries:

cd mmd-tools
pip install .

The mmd tools are now ready to be used and should be accessible via shell completion. Try it by typing “mmd” and pressing the Tab key; the shell should list all available mmd tools:

(mmdtest) thomz@cobra02:~> mmd
mmd          mmd2bagit    mmdCreate    mmdListBags  mmdLoad      mmdPublish   mmdShow
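If shell completion is not available, the same check can be scripted. The following is a minimal sketch, assuming the tool names from the listing above are installed as executables on the PATH. The helper name find_mmd_tools is hypothetical and not part of the mmd tools.

```python
import shutil

# Tool names taken from the shell completion listing above.
MMD_TOOLS = ["mmd", "mmd2bagit", "mmdCreate", "mmdListBags",
             "mmdLoad", "mmdPublish", "mmdShow"]


def find_mmd_tools(names=MMD_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_mmd_tools().items():
        print(f"{name:12} {path or 'NOT FOUND'}")
```

A tool reported as NOT FOUND usually means that the environment into which the tools were installed is not activated in the current shell.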

Let’s take a look at the individual tools of the mmd suite.

mmdCreate

The mmdCreate script can be used to create and edit metadata manually. It requires an output file (option “-o”) and a metadata format specification (option “--format”). The format specifications can be found in the subfolder “formats” of the cloned repository. Specify the path to the format definition as an absolute or relative path:

mmdCreate.py -o /tmp/metadata.mmd --format /data/mmd/formats/dublinCore.json

With the command above, you can create a metadata file in the well-known DublinCore format [1]. The script will guide you step by step through the necessary fields:

Outputfile:  /tmp/metadata.mmd
Metadata format:  /data/mmd/formats/dublinCore.json
Fill in each field. Type "?" for a description of the field.

After the script has guided you through the process of entering the metadata, you can find the result in the output file specified via the “-o” option. The file is in JSON format and can be displayed or further processed with common JSON tools or libraries.
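Because the output is plain JSON, it can be read with Python's standard json module. The following is a minimal sketch, assuming the file holds a single JSON object of key-value pairs. The function name load_metadata is hypothetical, and the field names in the sample are illustrative only, not the actual output of mmdCreate.

```python
import json
import tempfile
from pathlib import Path


def load_metadata(path):
    """Read an mmd metadata file and return its content as a dict."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # Assumption: the top level of the file is a JSON object.
    if not isinstance(data, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")
    return data


if __name__ == "__main__":
    # Illustrative sample; a real file would be written by mmdCreate.
    sample = {"title": "Test dataset", "creator": "Jane Doe"}
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "metadata.mmd"
        path.write_text(json.dumps(sample), encoding="utf-8")
        for key, value in load_metadata(path).items():
            print(f"{key}: {value}")
```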

mmdShow

Once you have created a metadata file via the mmdCreate script, you can display its content via the mmdShow script. The parameter “-i” takes the input file:

python3 mmdShow.py -i /tmp/metadata.mmd

Optionally, the parameter “--outputformat” can be set to “html” so that the output is formatted as simple HTML code.

mmd2bagit

With the mmd2bagit script, you can combine a folder containing your data files and its metadata description in mmd format into a BagIt container [2]:

python3 mmd2bagit.py --folder ~/testdata/ --metadata /tmp/metadata.mmd

Important

Please be aware that the script changes the structure of the input folder! All content of the folder is moved into the “data” subfolder, and the script creates some additional files at the top level.
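Since the conversion rearranges the folder in place, it can be useful to inspect the result afterwards. The sketch below relies only on the layout described above (payload under “data”, additional files at the top level). The helper summarize_bag is hypothetical, and the demo builds a mock folder instead of calling mmd2bagit. The file names bagit.txt and manifest-sha256.txt in the mock follow the BagIt specification; the exact set of files written by mmd2bagit is an assumption.

```python
import tempfile
from pathlib import Path


def summarize_bag(bag_dir):
    """Return (payload files under data/, top-level files) as sorted lists."""
    bag = Path(bag_dir)
    data_dir = bag / "data"
    if not data_dir.is_dir():
        raise FileNotFoundError(f"{bag}: no 'data' subfolder found")
    # Payload paths are reported relative to the data/ subfolder.
    payload = sorted(p.relative_to(data_dir).as_posix()
                     for p in data_dir.rglob("*") if p.is_file())
    top_level = sorted(p.name for p in bag.iterdir() if p.is_file())
    return payload, top_level


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Mock bag layout for illustration only.
        bag = Path(tmp)
        (bag / "data" / "run1").mkdir(parents=True)
        (bag / "data" / "run1" / "result.csv").write_text("a,b\n1,2\n")
        (bag / "bagit.txt").write_text("BagIt-Version: 1.0\n")
        (bag / "manifest-sha256.txt").write_text("")
        payload, top_level = summarize_bag(bag)
        print("payload:", payload)
        print("top level:", top_level)
```

If the original folder layout must be preserved, running mmd2bagit on a copy of the data folder avoids surprises.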

mmdPublish

The mmdPublish script can be used to publish a metadata file to a CKAN instance. Before you can use the script, you need to create an access token in CKAN and store it in an environment variable:

export CKAN_API_KEY=YOUR_CKAN_ACCESS_TOKEN

Important

Without this access token, the script cannot write to the CKAN instance. Please make sure that the environment variable cannot be read by unauthorized people!

The script itself needs several parameters:

  • i: the input file in mmd metadata format

  • c: the URL of the CKAN instance, followed by the path to its API. For example: https://ckanexample.com/api/3/action/

  • t: the field in the metadata corresponding to the “Title” field in CKAN (not the title itself!)

  • o: the CKAN organisation under which the dataset should be stored
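To illustrate how these parameters relate to each other, the following sketch builds (but does not send) a request to CKAN's package_create action. It is not the implementation of mmdPublish: the function name build_publish_request, the simplified rule for deriving the dataset name, and the mapping of the parameters to CKAN fields are assumptions. The argument title_field corresponds to “t”: it names the metadata key whose value becomes the CKAN title.

```python
import json
import os
import re
import urllib.request


def build_publish_request(metadata, api_url, title_field, organisation, token):
    """Build a POST request for CKAN's package_create action (not sent)."""
    if title_field not in metadata:
        raise KeyError(f"metadata has no field {title_field!r}")
    title = str(metadata[title_field])
    # CKAN dataset names are lowercase URL slugs; this rule is a simplification.
    name = re.sub(r"[^a-z0-9_-]+", "-", title.lower()).strip("-")
    payload = {"name": name, "title": title, "owner_org": organisation}
    return urllib.request.Request(
        api_url.rstrip("/") + "/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Check for the token first, so a missing variable is noticed early.
    token = os.environ.get("CKAN_API_KEY")
    if token is None:
        print("CKAN_API_KEY is not set; nothing to do")
    else:
        request = build_publish_request(
            {"title": "Test dataset"},               # illustrative metadata
            "https://ckanexample.com/api/3/action/",  # example URL from above
            "title", "my-organisation", token)
        print(request.get_method(), request.full_url)
```

Passing the field name rather than the title itself keeps the call independent of the metadata schema: a schema that stores its title under a different key only needs a different value for title_field.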

Note

TODO: Screenshots of the whole workflow!

The Metadata Formats

The basic idea of the MMD tools is to be as flexible as possible when it comes to the creation and management of metadata. Therefore, the MMD tools work schemaless, with plain key-value pairs. Additionally, some common metadata formats and schemata used by our users are supported.

Important

If you need support for further metadata schemata, please contact the developers via support@mpcdf.mpg.de.

The integrated metadata formats are stored in a JSON-based format and can be found in the subfolder “formats” of the GitLab repository. So far, the following schemata are included:

  • DublinCore simple

  • DataCite Metadata Format

  • MPCDF default metadata schema

[1] https://www.dublincore.org/

[2] https://en.wikipedia.org/wiki/BagIt