Accessing a single file in a Tar archive on S3

If you have an uncompressed (!) Tar archive stored in S3 and you want to retrieve only one or a few of its files, you don’t need to download and extract the whole Tar archive. Because S3 supports downloading only a range of bytes of a stored object, you just need to know where in the Tar archive the wanted file is stored.

Creating a table of contents of a Tar archive

Before you upload your Tar archive to S3, you need to create a table of its contents with the start position and length of each file:

tar -tvf AA.tar -R > toctemp.txt

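Next, you need a little awk script which calculates the start position of each file and prints it together with the file’s length and name. With -R, GNU tar prefixes each entry with the number of the 512-byte block at which its header starts, so a file’s data begins at the next entry’s header block minus the file’s size rounded up to a multiple of 512 bytes. Save the following code as “calculateFiles.sh”: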

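#!/bin/sh
# Reads the output of "tar -tvf ... -R" on stdin and prints "offset size name".
awk '
BEGIN {
  # Remember size and name of the first entry.
  getline;
  f = $8;
  s = $5;
}
{
  # $2 is the header block of the next entry; subtracting the previous
  # size, rounded up to a multiple of 512, gives the start of its data.
  offset = int($2) * 512 - int((s + 511) / 512) * 512;
  print offset, s, f;
  f = $8;
  s = $5;
}'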

Now pipe the listing you created in the first step into that script and save the result as “toc.txt”:

cat toctemp.txt | /root/calculateFiles.sh > toc.txt

toc.txt now contains lines like this:

316492288 3501474 AA/wiki_85

The first number indicates the starting byte of the file, the second one its length and the third column contains the filename.
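
As a cross-check of the offsets, the same table can also be sketched with Python’s standard tarfile module, which exposes the data offset and size of each member. This is only an illustration under assumptions: the helper name build_toc and the tiny in-memory archive (standing in for AA.tar) are made up for the example and are not part of the workflow above.

```python
import io
import tarfile


def build_toc(fileobj):
    """Return a list of (data_offset, size, name) for the regular files in a tar."""
    toc = []
    # Mode "r:" only accepts uncompressed archives, the only kind
    # whose members can be fetched by byte range.
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        for member in archive:
            if member.isfile():
                toc.append((member.offset_data, member.size, member.name))
    return toc


# Small in-memory archive with made-up contents, standing in for AA.tar.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as archive:
    for name, data in [("AA/wiki_00", b"first file\n"), ("AA/wiki_01", b"second file\n")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

raw = buffer.getvalue()
toc = build_toc(io.BytesIO(raw))

# Same column order as toc.txt: offset, length, name.
for offset, size, name in toc:
    print(offset, size, name)

# Slicing the raw archive with an entry of the table yields the file content.
offset, size, name = toc[1]
print(raw[offset:offset + size])
```

If the offsets printed for a real archive differ from those in toc.txt, the listing was probably parsed incorrectly, for example because of filenames containing spaces.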

Once you have uploaded the Tar archive to an S3 storage, you can now download an individual file with the following script. Don’t forget to adjust the S3 configuration, name of the Tar archive and the start position and length of the file you want to retrieve:

import boto3
import botocore
import s3credentials

# Initialize an S3 client:
session = boto3.session.Session()
s3_client = session.client(
    service_name='s3',
    aws_access_key_id=s3credentials.rdo["key"],
    aws_secret_access_key=s3credentials.rdo["secret"],
    endpoint_url=s3credentials.rdo["url"],
    config=botocore.client.Config(signature_version='s3'),
)

# Start position and length of the file we want to get from the Tar archive:
startPos = 316492288
length = 3501474

# HTTP byte ranges include the end position, so the last byte is at startPos + length - 1:
stopPos = startPos + length - 1

# Let's get the file and print it:
resp = s3_client.get_object(Bucket='parttest', Key='AA.tar', Range='bytes={}-{}'.format(startPos, stopPos))
text = resp['Body'].read().decode("utf-8")

print(text, end='')
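
Instead of copying the numbers by hand, the start position and length could be looked up in toc.txt by filename. The following is a minimal sketch, not a finished tool: the helper names find_entry, byte_range and fetch_member are hypothetical, and it assumes a toc.txt in the three-column format shown above and an S3 client created as in the script above.

```python
def find_entry(toc_lines, name):
    """Return (offset, size) for `name` from lines formatted like toc.txt."""
    for line in toc_lines:
        # Columns: offset, size, name (whitespace separated).
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[2].rstrip("\n") == name:
            return int(parts[0]), int(parts[1])
    raise KeyError(name)


def byte_range(offset, size):
    """Build an HTTP Range value; the end position is inclusive."""
    if size <= 0:
        raise ValueError("size must be positive")
    return "bytes={}-{}".format(offset, offset + size - 1)


def fetch_member(s3_client, bucket, key, toc_lines, name):
    """Download the bytes of one archived file via a ranged get_object call."""
    offset, size = find_entry(toc_lines, name)
    resp = s3_client.get_object(Bucket=bucket, Key=key, Range=byte_range(offset, size))
    return resp["Body"].read()


# The sample line from toc.txt shown earlier.
toc_lines = ["316492288 3501474 AA/wiki_85\n"]
print(byte_range(*find_entry(toc_lines, "AA/wiki_85")))
# prints: bytes=316492288-319993761
```

With the client from the script above, a call such as fetch_member(s3_client, 'parttest', 'AA.tar', open('toc.txt'), 'AA/wiki_85') would then return the raw bytes of that file.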