Publishing Data for public access via S3

Projects are free to set public access for download when needed. S3 command line clients such as s3cmd and minio-client support setting objects to be publicly available.

For example, you can set public access by using s3cmd. Make the bucket listable:

s3cmd setacl --acl-public s3://public-bucket

Enable public download of objects in the bucket:

s3cmd setacl --acl-public --recursive s3://public-bucket

Individual objects can also be made public:

s3cmd setacl --acl-public s3://public-bucket/my-public-object

Note: The ACL is not inherited from the bucket. When new objects are added, their access ACL must be set explicitly.
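One way to avoid a separate setacl step is to set the ACL at upload time. The following is a minimal sketch, assuming the bucket name used in the examples above; upload_public is a hypothetical helper name, not part of s3cmd, and it relies only on the --acl-public option of s3cmd put.

```shell
# Hypothetical bucket name, matching the examples above.
BUCKET="s3://public-bucket"

# Upload a file and mark it publicly readable in the same step,
# so that no separate "setacl" call is needed afterwards.
upload_public() {
    s3cmd put --acl-public "$1" "$BUCKET/$(basename "$1")"
}

# Example: upload_public ./results.csv
```

The object is stored under the file's base name. Adjust the target key if the bucket uses a prefix layout.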

Objects may be returned to private access by using:

s3cmd setacl --acl-private --recursive s3://public-bucket

The same option can be applied to the bucket itself (to disable listing) or to individual objects.

We advise against public access for upload since this would open the S3 storage to abuse.

A few things to note:

  1. Be aware that some clients grant both read and write access when you use the “public” option. For example, with the minio-client, mc anonymous set public storage/public-bucket allows anonymous writes as well as reads.

  2. Any content you make publicly readable is likely to be crawled by Google and other search engines.
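For the minio-client case in the first note, a read-only alternative is the download policy. The sketch below assumes a recent mc release that provides the mc anonymous subcommand (older releases used mc policy) and the alias storage from the note above; set_download_only is a hypothetical helper name.

```shell
# Allow anonymous downloads only (no anonymous uploads) for a bucket.
# Argument: ALIAS/BUCKET, e.g. storage/public-bucket
set_download_only() {
    mc anonymous set download "$1"
    mc anonymous get "$1"   # print the resulting policy to verify it
}

# Example: set_download_only storage/public-bucket
```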

Publishing data

When publishing a dataset it is advisable to provide a landing page with basic information about the data. This may be achieved by using the MPCDF MetaStore or by creating a landing page within the public bucket.

When using MetaStore, DataCite-compatible metadata can be associated with the dataset, and the data itself can be made available as links to the S3 objects. MetaStore makes the dataset Findable, as defined by the FAIR data principles.

When creating a standalone landing page within the S3 bucket it is advisable to:

  1. Create an index.html page within the bucket

  2. Describe the dataset within the index.html page (origin, owners, size, etc.)

  3. Add a link to each object (including a checksum or a separate checksum file)

  4. Provide basic information about how the objects can be downloaded (e.g. via curl, wget)
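The steps above can be sketched as a small script that builds the landing page from a local copy of the dataset. This is only an illustration under stated assumptions: make_index is a hypothetical helper, the title and description are placeholders, sha256sum from GNU coreutils is assumed to be available, and file names are assumed to contain no characters that need HTML escaping.

```shell
# Write a minimal index.html into directory $1, linking every regular
# file in it together with its SHA-256 checksum.
# Title and description are placeholders to be replaced by hand.
make_index() {
    dir="$1"
    {
        echo '<!DOCTYPE html>'
        echo '<html><head><meta charset="utf-8"><title>Dataset title</title></head><body>'
        echo '<h1>Dataset title</h1>'
        echo '<p>Origin, owners and size of the dataset go here.</p>'
        echo '<ul>'
        for f in "$dir"/*; do
            [ -f "$f" ] || continue                   # skip subdirectories
            name=$(basename "$f")
            [ "$name" = "index.html" ] && continue    # do not list the page itself
            sum=$(sha256sum "$f" | cut -d' ' -f1)
            echo "<li><a href=\"$name\">$name</a> (sha256: $sum)</li>"
        done
        echo '</ul>'
        echo '<p>Download an object with: curl -O URL or wget URL</p>'
        echo '</body></html>'
    } > "$dir/index.html"
}

# Example: make_index ./my-dataset
```

The generated page uses relative links, so it works when index.html is uploaded next to the objects it describes. Like every other object, it must itself be made public, for example with s3cmd setacl --acl-public as shown earlier.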

Digital Object Identifiers (DOIs) for published data

A Digital Object Identifier is a persistent identifier that makes a dataset addressable. It also allows the underlying dataset to be moved transparently: end users who follow the DOI are simply redirected to the new location.

A DOI may be obtained via MetaStore or directly from the MPDL (MPDL-DOI).

Temporary sharing

You can give temporary access to data via presigned URLs. A presigned URL is short-lived, hard to guess, and has a configurable lifetime. Such URLs may safely be passed to data users to retrieve individual objects.
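As a minimal sketch, a presigned URL can be generated with the signurl command of s3cmd, which takes an object and either an absolute expiry time (epoch seconds) or a relative offset prefixed with "+". The helper name share_temporarily, the bucket name and the one-hour default are assumptions made for illustration.

```shell
# Print a presigned URL for one object. The URL expires after the given
# number of seconds (default: 3600, i.e. one hour).
# Arguments: s3://BUCKET/OBJECT [LIFETIME_SECONDS]
share_temporarily() {
    s3cmd signurl "$1" "+${2:-3600}"
}

# Example: share_temporarily s3://private-bucket/my-object 86400
```

The object itself can stay private; only holders of the printed URL can fetch it until the URL expires.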

More information about temporary file sharing can be found here