Indexing Tools

Download

Module for indexing actions for downloading a manifest of indexed file objects (against indexd’s API). Supports multiple processes and coroutines using Python’s asyncio library.

The default manifest format created is a Comma-Separated Value file (csv) with rows for every record. A header row is created with field names: guid,authz,acl,file_size,md5,urls,file_name

Fields that are lists (like acl, authz, and urls) separate the values with spaces.
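A minimal sketch of reading a manifest in the format described above, using only the standard library. The header row is the documented one; the sample record and the helper name `read_manifest_rows` are made up for illustration and are not part of the module.

```python
import csv
import io

# Fields the module description says hold space-separated lists.
LIST_FIELDS = ("acl", "authz", "urls")


def read_manifest_rows(handle):
    """Yield one dict per record, splitting list fields on whitespace."""
    for row in csv.DictReader(handle):
        for field in LIST_FIELDS:
            row[field] = (row.get(field) or "").split()
        yield row


# Made-up sample data using the documented header row.
SAMPLE = (
    "guid,authz,acl,file_size,md5,urls,file_name\n"
    "guid-1,/programs/DEV,open phs0001,42,abc123,s3://bucket/a gs://bucket/a,a.txt\n"
)

rows = list(read_manifest_rows(io.StringIO(SAMPLE)))
print(rows[0]["urls"])
```

Scalar fields such as file_size are left as strings here; convert them as needed.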

gen3.tools.indexing.download_manifest.CURRENT_DIR

directory this file is in

Type

str

gen3.tools.indexing.download_manifest.INDEXD_RECORD_PAGE_SIZE

number of records to request per page

Type

int

gen3.tools.indexing.download_manifest.MAX_CONCURRENT_REQUESTS

maximum number of desired concurrent requests across processes/threads

Type

int

gen3.tools.indexing.download_manifest.TMP_FOLDER

Folder for placing temporary files. NOTE: We have to use a temporary folder because Python’s file writing is not thread-safe, so we can’t have all processes writing to the same file. To work around this, each process writes to its own file and the files are concatenated in post-processing.

Type

str
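The write-then-concatenate approach described for TMP_FOLDER can be sketched as follows. This is an illustration under assumptions, not the module's actual code: the part file names, the two-column header, and the helper `concat_part_files` are hypothetical.

```python
import os
import tempfile


def concat_part_files(part_paths, output_path, header):
    """Write the header once, then append every part file in order."""
    with open(output_path, "w", newline="") as out:
        out.write(header + "\n")
        for path in part_paths:
            with open(path, newline="") as part:
                for line in part:
                    out.write(line)


with tempfile.TemporaryDirectory() as tmp_folder:
    # Simulate two worker processes, each writing its own part file.
    part_paths = []
    for worker_id, line in enumerate(["guid-1,10\n", "guid-2,20\n"]):
        path = os.path.join(tmp_folder, f"{worker_id}.csv")
        with open(path, "w", newline="") as part:
            part.write(line)
        part_paths.append(path)

    output_path = os.path.join(tmp_folder, "manifest.csv")
    concat_part_files(part_paths, output_path, header="guid,file_size")
    with open(output_path, newline="") as merged:
        merged_text = merged.read()

print(merged_text)
```

Because each worker owns its file, no locking is needed while records are being written; the only sequential step is the final merge.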

async gen3.tools.indexing.download_manifest.async_download_object_manifest(commons_url, output_filename='object-manifest.csv', num_processes=4, max_concurrent_requests=24)

Download all file object records into a manifest csv

Parameters
  • commons_url (str) – root domain for commons where indexd lives

  • output_filename (str, optional) – filename for output

  • num_processes (int, optional) – number of parallel python processes to use for hitting indexd api and processing

  • max_concurrent_requests (int) – the maximum number of concurrent requests allowed. NOTE: This is the TOTAL number, not just for this process. It is used to help determine how many requests each process should make at one time.
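One plausible way a total request budget could be shared across processes is sketched below. The even split and the helper names `per_process_limit` and `fetch_pages` are assumptions for illustration; the module may divide the budget differently.

```python
import asyncio


def per_process_limit(max_concurrent_requests, num_processes):
    """Share the total request budget evenly, never dropping below one."""
    return max(1, max_concurrent_requests // num_processes)


async def fetch_pages(pages, limit):
    """Pretend to fetch pages while holding at most `limit` requests open."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fetch(page):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for a real HTTP request
            in_flight -= 1
            return page

    results = await asyncio.gather(*(fetch(page) for page in pages))
    return results, peak


# With the documented defaults (24 total requests, 4 processes).
limit = per_process_limit(max_concurrent_requests=24, num_processes=4)
results, peak = asyncio.run(fetch_pages(range(20), limit))
```

The semaphore caps the coroutines in flight inside one process, so the sum across all processes stays at or below the total.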

Index

Module for indexing object files in a manifest (against indexd’s API).

The default manifest format created is a Tab-Separated Value file (tsv) with rows for every record.

Fields that are lists (like acl, authz, and urls) separate the values with commas or spaces. See the Attributes section for supported column names.

All supported formats of acl, authz and url fields are shown in the below example.

guid md5 size acl authz url
255e396f-f1f8-11e9-9a07-0a80fada099c 473d83400bc1bc9dc635e334faddf33c 363455714 ['Open'] [s3://pdcdatastore/test1.raw]
255e396f-f1f8-11e9-9a07-0a80fada098c 473d83400bc1bc9dc635e334faddd33c 343434344 Open s3://pdcdatastore/test2.raw
255e396f-f1f8-11e9-9a07-0a80fada097c 473d83400bc1bc9dc635e334fadd433c 543434443 phs0001 phs0002 s3://pdcdatastore/test3.raw
255e396f-f1f8-11e9-9a07-0a80fada096c 473d83400bc1bc9dc635e334fadd433c 363455714 ['phs0001', 'phs0002'] ['s3://pdcdatastore/test4.raw']
255e396f-f1f8-11e9-9a07-0a80fada010c 473d83400bc1bc9dc635e334fadde33c 363455714 ['Open'] s3://pdcdatastore/test5.raw
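The value styles shown in the example (bare values, space-separated values, and bracketed lists with or without quotes) can all be normalized by one small parser. The function `parse_list_field` below is a hypothetical sketch, not the module's own parsing code.

```python
import re


def parse_list_field(value):
    """Normalize a list-like manifest cell into a list of strings."""
    # Drop surrounding whitespace and an optional pair of square brackets.
    text = value.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    # Split on commas and/or whitespace, then strip quotes from each item.
    items = re.split(r"[,\s]+", text)
    return [item.strip("'\"") for item in items if item.strip("'\"")]


for cell in ["['Open']", "Open", "phs0001 phs0002", "['phs0001', 'phs0002']"]:
    print(cell, "->", parse_list_field(cell))
```

This sketch assumes individual values never contain commas or spaces, which would not hold for URLs that do.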

gen3.tools.indexing.index_manifest.CURRENT_DIR

directory this file is in

Type

str

gen3.tools.indexing.index_manifest.GUID

supported file id column names

Type

list(string)

gen3.tools.indexing.index_manifest.SIZE

supported file size column names

Type

list(string)

gen3.tools.indexing.index_manifest.MD5

supported md5 hash column names

Type

list(string)

gen3.tools.indexing.index_manifest.ACLS

supported acl column names

Type

list(string)

gen3.tools.indexing.index_manifest.URLS

supported url column names

Type

list(string)

gen3.tools.indexing.index_manifest.AUTHZ

supported authz column names

Type

list(string)

gen3.tools.indexing.index_manifest.PREV_GUID

supported previous guid column names

Type

list(string)

Usages:

python index_manifest.py --commons_url https://giangb.planx-pla.net --manifest_file path_to_manifest --auth "admin,admin" --replace_urls False --thread_num 10

python index_manifest.py --commons_url https://giangb.planx-pla.net --manifest_file path_to_manifest --api_key ./credentials.json --replace_urls False --thread_num 10

class gen3.tools.indexing.index_manifest.ThreadControl(processed_files=0, num_total_files=0)

Bases: object

Class for thread synchronization
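To illustrate what thread synchronization around a processed-files counter might look like, here is a minimal sketch. `ProgressCounter` is a hypothetical stand-in that reuses the constructor arguments shown above; the attributes and methods of the real ThreadControl class may differ.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ProgressCounter:
    """Hypothetical stand-in for a shared, lock-protected progress counter."""

    def __init__(self, processed_files=0, num_total_files=0):
        self.mutex = threading.Lock()
        self.processed_files = processed_files
        self.num_total_files = num_total_files

    def mark_processed(self):
        """Increment the counter atomically and return the new value."""
        with self.mutex:
            self.processed_files += 1
            return self.processed_files


counter = ProgressCounter(num_total_files=100)
with ThreadPoolExecutor(max_workers=10) as pool:
    # Each task stands in for indexing one manifest row.
    list(pool.map(lambda _: counter.mark_processed(), range(counter.num_total_files)))

print(counter.processed_files, "of", counter.num_total_files)
```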

gen3.tools.indexing.index_manifest.get_and_verify_fileinfos_from_manifest(manifest_file, include_additional_columns=False)

Wrapper for get_and_verify_fileinfos_from_tsv_manifest (documented below) that determines the delimiter based on the file extension
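Choosing a delimiter from the file extension could look like the sketch below. The rule used here (tab for .tsv, comma for anything else) and the name `guess_delimiter` are assumptions for illustration; the wrapper's actual rule may differ.

```python
import os


def guess_delimiter(manifest_file):
    """Return a tab for .tsv files and a comma for everything else."""
    extension = os.path.splitext(manifest_file)[1].lower()
    return "\t" if extension == ".tsv" else ","


print(repr(guess_delimiter("manifest.tsv")))
print(repr(guess_delimiter("manifest.csv")))
```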

gen3.tools.indexing.index_manifest.get_and_verify_fileinfos_from_tsv_manifest(manifest_file, manifest_file_delimiter='\t', include_additional_columns=False)

Get and verify file info from a tsv manifest

Parameters
  • manifest_file (str) – the path to the input manifest

  • manifest_file_delimiter (str) – delimiter

Returns

list of file info

[
    {
        "guid": "guid_example",
        "filename": "example",
        "size": 100,
        "acl": "['open']",
        "md5": "md5_hash",
    },
]

headers(list(str)): field names

Return type

list(dict)

gen3.tools.indexing.index_manifest.index_object_manifest(commons_url, manifest_file, thread_num, auth=None, replace_urls=True, manifest_file_delimiter=None, output_filename='indexing-output-manifest.csv')

Loop through all the files in the manifest and create or update the corresponding records in indexd. An existing record is updated if the url is not in the record’s url list or the acl has changed.

Parameters
  • commons_url (str) – commons URL

  • manifest_file (str) – path to the manifest

  • thread_num (int) – number of threads for indexing

  • auth (Gen3Auth) – Gen3 auth or tuple with basic auth name and password

  • replace_urls (bool) – whether to replace the urls of existing records

  • manifest_file_delimiter (str) – manifest’s delimiter

  • output_filename (str) – output file name for manifest

Returns

list of file info

[
    {
        "guid": "guid_example",
        "filename": "example",
        "size": 100,
        "acl": "['open']",
        "md5": "md5_hash",
    },
]

headers(list(str)): list of fieldnames

Return type

files(list(dict))
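The update condition stated in the description of index_object_manifest (a url missing from the record, or a changed acl) can be expressed as a small predicate. This is a sketch of that stated rule only; `needs_update` and the dict shapes are hypothetical and do not reflect how the function talks to indexd.

```python
def needs_update(record, manifest_row):
    """Return True if the manifest row adds a URL or changes the acl."""
    new_urls = set(manifest_row["urls"]) - set(record["urls"])
    acl_changed = set(manifest_row["acl"]) != set(record["acl"])
    return bool(new_urls) or acl_changed


# Made-up record and manifest rows for illustration.
existing = {"urls": ["s3://bucket/a"], "acl": ["open"]}
same = {"urls": ["s3://bucket/a"], "acl": ["open"]}
extra_url = {"urls": ["s3://bucket/a", "gs://bucket/a"], "acl": ["open"]}
new_acl = {"urls": ["s3://bucket/a"], "acl": ["phs0001"]}

print(needs_update(existing, same))
print(needs_update(existing, extra_url))
print(needs_update(existing, new_acl))
```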

Verify

Module for indexing actions for verifying a manifest of indexed file objects (against indexd’s API). Supports multiple processes and coroutines using Python’s asyncio library.

The default manifest format expected is a Comma-Separated Value file (csv) with rows for every record. A header row is expected with field names: guid,authz,acl,file_size,md5,urls,file_name

Fields that are lists (like acl, authz, and urls) separate the values with spaces.

There is a default mapping for those column names above but you can override it. Fields that expect lists (like acl, authz, and urls) by default assume these values are separated with spaces. If you need alternate behavior, you can simply override the manifest_row_parsers for specific fields and replace the default parsing function with a custom one. For example:

```
from gen3.tools import indexing
from gen3.tools.indexing.verify_manifest import manifest_row_parsers

def _get_authz_from_row(row):
    return [row.get("authz").strip().strip("[").strip("]").strip("'")]

# override default parsers
manifest_row_parsers["authz"] = _get_authz_from_row

indexing.verify_object_manifest(COMMONS)
```

The output from this verification is a file containing any errors in the following format:

{guid}|{error_name}|expected {value_from_manifest}|actual {value_from_indexd}

ex: 93d9af72-b0f1-450c-a5c6-7d3d8d2083b4|authz|expected ['']|actual ['/programs/DEV/projects/test']
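Since each error is a single pipe-delimited line, the output file is easy to post-process. The sketch below parses one line in the documented format; `parse_error_line` is a hypothetical helper, not part of the module.

```python
def _strip_label(text, label):
    """Remove a leading label such as 'expected ' if it is present."""
    return text[len(label):] if text.startswith(label) else text


def parse_error_line(line):
    """Split one verification error line into its four parts."""
    # Limit the split so a '|' inside the last value is kept intact.
    guid, error_name, expected, actual = line.rstrip("\n").split("|", 3)
    return {
        "guid": guid,
        "error_name": error_name,
        "expected": _strip_label(expected, "expected "),
        "actual": _strip_label(actual, "actual "),
    }


# The example line from the format description above.
example = (
    "93d9af72-b0f1-450c-a5c6-7d3d8d2083b4|authz|"
    "expected ['']|actual ['/programs/DEV/projects/test']"
)
error = parse_error_line(example)
print(error["error_name"], error["expected"], error["actual"])
```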

gen3.tools.indexing.verify_manifest.CURRENT_DIR

directory this file is in

Type

str

gen3.tools.indexing.verify_manifest.MAX_CONCURRENT_REQUESTS

maximum number of desired concurrent requests across processes/threads

Type

int

async gen3.tools.indexing.verify_manifest.async_verify_object_manifest(commons_url, manifest_file, max_concurrent_requests=24, manifest_row_parsers={'acl': <function _get_acl_from_row>, 'authz': <function _get_authz_from_row>, 'file_name': <function _get_file_name_from_row>, 'file_size': <function _get_file_size_from_row>, 'guid': <function _get_guid_from_row>, 'md5': <function _get_md5_from_row>, 'urls': <function _get_urls_from_row>}, manifest_file_delimiter=None, output_filename='verify-manifest-errors-1630075391.7987797.log')

Verify all file object records in a manifest csv against indexd

Parameters
  • commons_url (str) – root domain for commons where indexd lives

  • manifest_file (str) – the file to verify against

  • max_concurrent_requests (int) – the maximum number of concurrent requests allowed

  • manifest_row_parsers (Dict{indexd_field: func_to_parse_row}) – Row parsers

  • manifest_file_delimiter (str) – delimiter in manifest_file

  • output_filename (str) – filename for output logs