Indexing Tools

Download

Module for indexing actions for downloading a manifest of indexed file objects (against indexd’s API). Supports multiple processes and coroutines using Python’s asyncio library.

The default manifest format created is a Comma-Separated Value file (csv) with rows for every record. A header row is created with field names: guid,authz,acl,file_size,md5,urls,file_name

Fields that are lists (like acl, authz, and urls) separate the values with spaces.
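As a minimal sketch of reading a manifest in this format, the following uses only the standard library. The sample row, the `LIST_FIELDS` name and the `read_manifest` helper are illustrative assumptions and are not part of the module.

```python
import csv
import io

# Hypothetical sample in the default format described above.
SAMPLE = (
    "guid,authz,acl,file_size,md5,urls,file_name\n"
    "guid-1,/programs/DEV,open phs0001,100,md5_hash,s3://bucket/a gs://bucket/a,a.txt\n"
)

# Fields documented as space-separated lists.
LIST_FIELDS = ("acl", "authz", "urls")


def read_manifest(handle):
    """Yield one dict per record, splitting list fields on whitespace."""
    for row in csv.DictReader(handle):
        for field in LIST_FIELDS:
            row[field] = row[field].split()
        yield row


records = list(read_manifest(io.StringIO(SAMPLE)))
```

Scalar fields such as file_size are left as strings here; converting them is up to the caller.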

gen3.tools.indexing.download_manifest.CURRENT_DIR

directory this file is in

Type:

str

gen3.tools.indexing.download_manifest.INDEXD_RECORD_PAGE_SIZE

number of records to request per page

Type:

int

gen3.tools.indexing.download_manifest.MAX_CONCURRENT_REQUESTS

maximum number of desired concurrent requests across processes/threads

Type:

int
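One common way to cap concurrent requests inside a single asyncio event loop is a semaphore. The sketch below illustrates that general technique only; it is not taken from the module, the limit of 3 is an arbitrary illustrative value, and `fetch_page` is a stand-in for a real HTTP request. A semaphore like this limits one process, so a total budget across processes would still have to be divided between them.

```python
import asyncio

MAX_CONCURRENT = 3  # illustrative value, not the module's default


async def fetch_page(page, semaphore, state):
    """Pretend to request one page while holding a semaphore slot."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for an HTTP request
        state["active"] -= 1
        return page


async def fetch_all(num_pages):
    """Request all pages, never running more than MAX_CONCURRENT at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"active": 0, "peak": 0}
    pages = await asyncio.gather(
        *(fetch_page(page, semaphore, state) for page in range(num_pages))
    )
    return pages, state["peak"]


pages, peak = asyncio.run(fetch_all(10))
```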

gen3.tools.indexing.download_manifest.TMP_FOLDER

Folder directory for placing temporary files. NOTE: We have to use a temporary folder because Python’s file writing is not thread-safe, so we can’t have all processes writing to the same file. To work around this, each process writes to its own file and the files are concatenated in post-processing.

Type:

str
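The write-separately-then-concatenate approach described above can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the `.part` suffix, the `concat_parts` helper and the two-column header are hypothetical.

```python
import glob
import os
import shutil
import tempfile


def concat_parts(tmp_folder, output_path, header):
    """Write the header once, then append every per-process part file in name order."""
    with open(output_path, "w", newline="") as out:
        out.write(header + "\n")
        for part in sorted(glob.glob(os.path.join(tmp_folder, "*.part"))):
            with open(part, newline="") as src:
                shutil.copyfileobj(src, out)


# Demonstration with two fake per-process files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for idx, line in enumerate(["guid-1,100\n", "guid-2,200\n"]):
        with open(os.path.join(tmp, f"{idx}.part"), "w", newline="") as part:
            part.write(line)
    output = os.path.join(tmp, "manifest.csv")
    concat_parts(tmp, output, "guid,file_size")
    with open(output, newline="") as result:
        merged = result.read()
```

Because each process owns its own part file, no locking is needed while records are being written; the only serial step is the final concatenation.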

async gen3.tools.indexing.download_manifest.async_download_object_manifest(commons_url, output_filename='object-manifest.csv', num_processes=4, max_concurrent_requests=24, input_manifest=None, python_subprocess_command='python')[source]

Download all file object records into a manifest csv

Parameters:
  • commons_url (str) – root domain for commons where indexd lives

  • output_filename (str, optional) – filename for output

  • num_processes (int, optional) – number of parallel python processes to use for hitting indexd api and processing

  • max_concurrent_requests (int) – the maximum number of concurrent requests allowed. NOTE: This is the TOTAL number across all processes, not just for this process. It is used to help determine how many requests each process should make at one time.

  • input_manifest (str) – Input file. Read available object data from objects in this file instead of reading everything in indexd. This will attempt to query indexd for only the records identified in this manifest.

  • python_subprocess_command (str, optional) – Command used to execute a python process. By default you should not need to change this, but if you are running something like macOS and have only Python 3.x installed, you may need to specify “python3”.
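A minimal usage sketch, based only on the signature shown above. The commons URL is a hypothetical placeholder, and the wrapper name `download_manifest` is not part of the library. The import is deferred so that the sketch can be loaded without the gen3 package installed.

```python
import asyncio

COMMONS_URL = "https://example-commons.org"  # hypothetical commons URL


async def download_manifest(commons_url=COMMONS_URL):
    """Download all indexd records from the commons into object-manifest.csv."""
    # Deferred import: requires the gen3 package at call time only.
    from gen3.tools.indexing.download_manifest import async_download_object_manifest

    await async_download_object_manifest(
        commons_url,
        output_filename="object-manifest.csv",
        num_processes=4,
        max_concurrent_requests=24,  # total across all processes
    )


# To run against a real commons:
# asyncio.run(download_manifest())
```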

Index

Module for indexing object files in a manifest (against indexd’s API).

The default manifest format created is a Tab-Separated Value file (tsv) with rows for every record.

Fields that are lists (like acl, authz, and urls) separate the values with commas or spaces. See the Attributes section for supported column names.

All supported formats of acl, authz and url fields are shown in the below example.

guid md5 size acl authz url
255e396f-f1f8-11e9-9a07-0a80fada099c 473d83400bc1bc9dc635e334faddf33c 363455714 ['Open'] [s3://pdcdatastore/test1.raw]
255e396f-f1f8-11e9-9a07-0a80fada098c 473d83400bc1bc9dc635e334faddd33c 343434344 Open s3://pdcdatastore/test2.raw
255e396f-f1f8-11e9-9a07-0a80fada097c 473d83400bc1bc9dc635e334fadd433c 543434443 phs0001 phs0002 s3://pdcdatastore/test3.raw
255e396f-f1f8-11e9-9a07-0a80fada096c 473d83400bc1bc9dc635e334fadd433c 363455714 ['phs0001', 'phs0002'] ['s3://pdcdatastore/test4.raw']
255e396f-f1f8-11e9-9a07-0a80fada010c 473d83400bc1bc9dc635e334fadde33c 363455714 ['Open'] s3://pdcdatastore/test5.raw
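To illustrate how the list notations in the example (bare values, space-separated values, and bracketed lists with or without quotes) could be normalized into Python lists, here is a minimal sketch. `parse_list_field` is a hypothetical helper, not the module's own parser, and it assumes that individual values contain no commas or spaces.

```python
from typing import List


def parse_list_field(raw: str) -> List[str]:
    """Normalize a list-like manifest value into a list of strings."""
    text = raw.strip()
    # Drop one pair of surrounding brackets, if present.
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    items = []
    # Treat commas and whitespace as equivalent separators.
    for token in text.replace(",", " ").split():
        token = token.strip("'\"")
        if token:
            items.append(token)
    return items


examples = {
    "['Open']": parse_list_field("['Open']"),
    "Open": parse_list_field("Open"),
    "phs0001 phs0002": parse_list_field("phs0001 phs0002"),
    "['phs0001', 'phs0002']": parse_list_field("['phs0001', 'phs0002']"),
    "[s3://pdcdatastore/test1.raw]": parse_list_field("[s3://pdcdatastore/test1.raw]"),
}
```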

gen3.tools.indexing.index_manifest.CURRENT_DIR

directory this file is in

Type:

str

gen3.tools.indexing.index_manifest.GUID

supported file id column names

Type:

list(string)

gen3.tools.indexing.index_manifest.SIZE

supported file size column names

Type:

list(string)

gen3.tools.indexing.index_manifest.MD5

supported md5 hash column names

Type:

list(string)

gen3.tools.indexing.index_manifest.ACLS

supported acl column names

Type:

list(string)

gen3.tools.indexing.index_manifest.URLS

supported url column names

Type:

list(string)

gen3.tools.indexing.index_manifest.AUTHZ

supported authz column names

Type:

list(string)

gen3.tools.indexing.index_manifest.PREV_GUID

supported previous guid column names

Type:

list(string)
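Each attribute above is a list of accepted column names for one field. A sketch of how such a list can be used to locate the matching column in a manifest header follows. The alias values and the `find_column` helper are hypothetical placeholders; the actual accepted names are whatever the module's attributes contain.

```python
# Hypothetical alias lists; the real values live in the module's GUID, SIZE, ... attributes.
GUID_NAMES = ["guid", "GUID"]
SIZE_NAMES = ["size", "file_size"]


def find_column(header, supported_names):
    """Return the first header entry that matches a supported name, or None."""
    for name in header:
        if name in supported_names:
            return name
    return None


header = ["GUID", "md5", "file_size", "acl", "url"]
guid_column = find_column(header, GUID_NAMES)
size_column = find_column(header, SIZE_NAMES)
```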

Usages:

python index_manifest.py --commons_url https://giangb.planx-pla.net --manifest_file path_to_manifest --auth "admin,admin" --replace_urls False --thread_num 10

python index_manifest.py --commons_url https://giangb.planx-pla.net --manifest_file path_to_manifest --api_key ./credentials.json --replace_urls False --thread_num 10

class gen3.tools.indexing.index_manifest.ThreadControl(processed_files=0, num_total_files=0)[source]

Bases: object

Class for thread synchronization

gen3.tools.indexing.index_manifest.delete_all_guids(auth, file)[source]

Delete all GUIDs specified in the object manifest.

WARNING: THIS COMPLETELY REMOVES INDEX RECORDS. USE THIS ONLY IF YOU KNOW THE IMPLICATIONS.

gen3.tools.indexing.index_manifest.index_object_manifest(commons_url, manifest_file, thread_num, auth=None, replace_urls=True, manifest_file_delimiter=None, output_filename='indexing-output-manifest.csv', submit_additional_metadata_columns=False, force_metadata_columns_even_if_empty=True)[source]

Loop through all the files in the manifest and create or update the corresponding records in indexd. An existing record is updated if the url is not in the record’s url list or the acl has changed.

Parameters:
  • commons_url (str) – url of the commons

  • manifest_file (str) – path to the manifest

  • thread_num (int) – number of threads for indexing

  • auth (Gen3Auth) – Gen3 auth or tuple with basic auth name and password

  • replace_urls (bool) – flag indicating whether to replace urls or not

  • manifest_file_delimiter (str) – manifest’s delimiter

  • output_filename (str) – output file name for manifest

  • submit_additional_metadata_columns (bool) – whether to submit additional metadata to the metadata service

  • force_metadata_columns_even_if_empty (bool) – force the creation of a metadata column entry for a GUID even if the value is empty. Enabling this forces the creation of metadata entries for every column. See the illustrative example below.

    Example manifest_file:

        guid, …, columnA, columnB, ColumnC
        1, …, dataA, ,
        2, …, , dataB,

    Resulting metadata if force_metadata_columns_even_if_empty=True:

        "1": {
            "columnA": "dataA",
            "columnB": "",
            "ColumnC": "",
        },
        "2": {
            "columnA": "",
            "columnB": "dataB",
            "ColumnC": "",
        },

    Resulting metadata if force_metadata_columns_even_if_empty=False:

        "1": {
            "columnA": "dataA",
        },
        "2": {
            "columnB": "dataB",
        },
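The effect of force_metadata_columns_even_if_empty can be reproduced with a small standalone sketch. `build_metadata` is a hypothetical helper written for illustration; it mirrors the documented behaviour but is not the library's implementation.

```python
def build_metadata(rows, metadata_columns, force_even_if_empty=True):
    """Build a {guid: {column: value}} mapping from manifest rows."""
    metadata = {}
    for row in rows:
        entry = {}
        for column in metadata_columns:
            value = row.get(column, "")
            # Keep empty values only when forced to.
            if value or force_even_if_empty:
                entry[column] = value
        metadata[row["guid"]] = entry
    return metadata


rows = [
    {"guid": "1", "columnA": "dataA", "columnB": "", "ColumnC": ""},
    {"guid": "2", "columnA": "", "columnB": "dataB", "ColumnC": ""},
]
columns = ["columnA", "columnB", "ColumnC"]

forced = build_metadata(rows, columns, force_even_if_empty=True)
sparse = build_metadata(rows, columns, force_even_if_empty=False)
```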

Returns:

files (list(dict)) – list of file info, for example:

    [
        {
            "guid": "guid_example",
            "filename": "example",
            "size": 100,
            "acl": "['open']",
            "md5": "md5_hash",
        },
    ]

headers (list(str)) – list of fieldnames
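A minimal sketch of calling this function from Python, mirroring the first command-line usage above. The commons URL is a hypothetical placeholder, the wrapper `run_indexing` is not part of the library, and the sketch assumes the two documented return values (files and headers) come back as a pair. The import is deferred so that the sketch can be loaded without the gen3 package installed.

```python
def run_indexing(manifest_path, commons_url="https://example-commons.org"):
    """Index every record in the manifest and return the documented results."""
    # Deferred import: requires the gen3 package at call time only.
    from gen3.tools.indexing.index_manifest import index_object_manifest

    # Assumption: the function returns the (files, headers) pair described above.
    files, headers = index_object_manifest(
        commons_url=commons_url,
        manifest_file=manifest_path,
        thread_num=10,
        auth=("admin", "admin"),  # basic-auth tuple; a Gen3Auth instance is also accepted
        replace_urls=False,
    )
    return files, headers


# To run against a real commons:
# files, headers = run_indexing("path_to_manifest")
```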

Verify

Module for indexing actions for verifying a manifest of indexed file objects (against indexd’s API). Supports multiple processes and coroutines using Python’s asyncio library.

The default manifest format created is a Comma-Separated Value file (csv) with rows for every record. A header row is created with field names: guid,authz,acl,file_size,md5,urls,file_name

Fields that are lists (like acl, authz, and urls) separate the values with spaces.

There is a default mapping for those column names above but you can override it. Fields that expect lists (like acl, authz, and urls) by default assume these values are separated with spaces. If you need alternate behavior, you can simply override the manifest_row_parsers for specific fields and replace the default parsing function with a custom one. For example:

```
from gen3.tools import indexing
from gen3.tools.indexing.verify_manifest import manifest_row_parsers


def _get_authz_from_row(row):
    # Treat the whole authz cell as one value, stripping brackets and quotes
    return [row.get("authz").strip().strip("[").strip("]").strip("'")]


# override default parsers
manifest_row_parsers["authz"] = _get_authz_from_row

# COMMONS is assumed to be defined elsewhere and hold the commons URL
indexing.verify_object_manifest(COMMONS)
```

The output from this verification is a file containing any errors in the following format:

{guid}|{error_name}|expected {value_from_manifest}|actual {value_from_indexd}

ex: 93d9af72-b0f1-450c-a5c6-7d3d8d2083b4|authz|expected ['']|actual ['/programs/DEV/projects/test']
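A line in this format can be split back into its parts with a few lines of Python. `parse_error_line` is a hypothetical helper for illustration, and it assumes the guid, the error name and the expected value contain no `|` characters.

```python
def _strip_prefix(text, prefix):
    """Remove prefix from text if present."""
    return text[len(prefix):] if text.startswith(prefix) else text


def parse_error_line(line):
    """Split one verification error line into its four documented parts."""
    guid, error_name, expected, actual = line.rstrip("\n").split("|", 3)
    return {
        "guid": guid,
        "error_name": error_name,
        "expected": _strip_prefix(expected, "expected "),
        "actual": _strip_prefix(actual, "actual "),
    }


error = parse_error_line(
    "93d9af72-b0f1-450c-a5c6-7d3d8d2083b4|authz|expected ['']"
    "|actual ['/programs/DEV/projects/test']"
)
```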

gen3.tools.indexing.verify_manifest.CURRENT_DIR

directory this file is in

Type:

str

gen3.tools.indexing.verify_manifest.MAX_CONCURRENT_REQUESTS

maximum number of desired concurrent requests across processes/threads

Type:

int

async gen3.tools.indexing.verify_manifest.async_verify_object_manifest(commons_url, manifest_file, max_concurrent_requests=24, manifest_row_parsers={'acl': <function _get_acl_from_row>, 'authz': <function _get_authz_from_row>, 'file_name': <function _get_file_name_from_row>, 'file_size': <function _get_file_size_from_row>, 'guid': <function _get_guid_from_row>, 'md5': <function _get_md5_from_row>, 'urls': <function _get_urls_from_row>}, manifest_file_delimiter=None, output_filename='verify-manifest-errors-1711461363.0524344.log')[source]

Verify all file object records in a manifest csv against indexd

Parameters:
  • commons_url (str) – root domain for commons where indexd lives

  • manifest_file (str) – the file to verify against

  • max_concurrent_requests (int) – the maximum number of concurrent requests allowed

  • manifest_row_parsers (Dict{indexd_field: func_to_parse_row}) – Row parsers

  • manifest_file_delimiter (str) – delimiter in manifest_file

  • output_filename (str) – filename for output logs
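A minimal usage sketch, based only on the signature shown above. The commons URL and the output filename are hypothetical placeholders, and the wrapper name `verify_manifest` is not part of the library. The import is deferred so that the sketch can be loaded without the gen3 package installed.

```python
import asyncio


async def verify_manifest(manifest_path, commons_url="https://example-commons.org"):
    """Verify the records in a manifest against indexd and log any mismatches."""
    # Deferred import: requires the gen3 package at call time only.
    from gen3.tools.indexing.verify_manifest import async_verify_object_manifest

    await async_verify_object_manifest(
        commons_url,
        manifest_file=manifest_path,
        max_concurrent_requests=24,
        output_filename="verify-manifest-errors.log",  # hypothetical log name
    )


# To run against a real commons:
# asyncio.run(verify_manifest("path_to_manifest"))
```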