Indexing Tools

Download

Module for downloading a manifest of indexed file objects from indexd’s API. Supports multiple processes and coroutines using Python’s asyncio library.

The default manifest format created is a Comma-Separated Value file (csv) with rows for every record. A header row is created with field names: guid,authz,acl,file_size,md5,urls,file_name

Fields that are lists (like acl, authz, and urls) separate the values with spaces.
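
As a minimal sketch of consuming a manifest in this format, the following reads the CSV with the standard library and splits the list fields on whitespace. The `read_manifest` helper and the sample row are hypothetical illustrations, not part of the gen3 package.

```python
import csv
import io

# Fields that hold space-separated lists in the manifest.
LIST_FIELDS = ("acl", "authz", "urls")


def read_manifest(handle):
    """Yield one dict per manifest row, with list fields split on whitespace."""
    for row in csv.DictReader(handle):
        for field in LIST_FIELDS:
            # A missing or empty cell becomes an empty list.
            row[field] = (row.get(field) or "").split()
        yield row


# Made-up sample data using the header row described above.
sample = io.StringIO(
    "guid,authz,acl,file_size,md5,urls,file_name\n"
    "guid-1,/programs/a /programs/b,*,10,abc123,s3://bucket/f1 gs://bucket/f1,f1.txt\n"
)
records = list(read_manifest(sample))
print(records[0]["urls"])
```

Scalar fields such as file_size are left as strings here; convert them as needed.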

gen3.tools.indexing.download_manifest.CURRENT_DIR

directory this file is in

Type:: str

gen3.tools.indexing.download_manifest.INDEXD_RECORD_PAGE_SIZE

number of records to request per page

Type:: int

gen3.tools.indexing.download_manifest.MAX_CONCURRENT_REQUESTS

maximum number of desired concurrent requests across processes/threads

Type:: int

gen3.tools.indexing.download_manifest.TMP_FOLDER

Folder for placing temporary files. NOTE: A temporary folder is needed because Python’s file writing is not thread-safe, so the processes cannot all write to the same file. To work around this, each process writes to its own file and the files are concatenated in post-processing.

Type:: str
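
The write-then-concatenate pattern described for the temporary folder can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: `write_part`, `concat_parts`, and the `part-N.csv` naming are hypothetical, and the two "workers" run sequentially here for simplicity.

```python
import os
import shutil
import tempfile


def write_part(folder, worker_id, lines):
    """Each worker writes only to its own part file, so no file is shared."""
    path = os.path.join(folder, f"part-{worker_id}.csv")
    with open(path, "w", newline="") as part:
        part.writelines(line + "\n" for line in lines)
    return path


def concat_parts(part_paths, output_path, header):
    """Write the header once, then append every part file in order."""
    with open(output_path, "w", newline="") as out:
        out.write(header + "\n")
        for path in part_paths:
            with open(path, newline="") as part:
                shutil.copyfileobj(part, out)


with tempfile.TemporaryDirectory() as tmp:
    # Two stand-in workers, each producing its own part file.
    parts = [
        write_part(tmp, 0, ["guid-1,10"]),
        write_part(tmp, 1, ["guid-2,20"]),
    ]
    output = os.path.join(tmp, "object-manifest.csv")
    concat_parts(parts, output, "guid,file_size")
    with open(output, newline="") as result:
        merged = result.read()
```

Because the header is written only during concatenation, the part files contain data rows only.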

async gen3.tools.indexing.download_manifest.async_download_object_manifest(commons_url, output_filename='object-manifest.csv', num_processes=4, max_concurrent_requests=24, input_manifest=None, python_subprocess_command='python')

Download all file object records into a manifest csv

Parameters:

commons_url (str) – root domain for commons where indexd lives
output_filename (str, optional) – filename for output
num_processes (int, optional) – number of parallel python processes to use for hitting indexd api and processing
max_concurrent_requests (int) – the maximum number of concurrent requests allowed. NOTE: This is the TOTAL number across all processes, not just for this process. It is used to help determine how many requests each process should make at one time.
input_manifest (str) – Input file. Read available object data from objects in this file instead of reading everything in indexd. This will attempt to query indexd for only the records identified in this manifest.
python_subprocess_command (str, optional) – Command used to execute a Python process. You should not normally need to change this, but on a system such as macOS with only Python 3.x installed you may need to specify "python3".
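
To illustrate how a total request budget could translate into a per-process limit, here is a hedged sketch. The even integer division in `requests_per_process` is an assumption, not the documented behavior of the module, and `fetch_pages` uses `asyncio.sleep(0)` as a stand-in for a real HTTP request.

```python
import asyncio


def requests_per_process(max_concurrent_requests, num_processes):
    """Split the total budget evenly (assumed rule), never dropping below 1."""
    return max(1, max_concurrent_requests // num_processes)


async def fetch_pages(pages, limit):
    """Process every page while keeping at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fetch(page):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for a real HTTP request
            in_flight -= 1
            return page

    results = await asyncio.gather(*(fetch(page) for page in pages))
    return results, peak


# Defaults from the signature above: 24 total requests across 4 processes.
limit = requests_per_process(24, 4)
results, peak = asyncio.run(fetch_pages(range(20), limit))
print(limit, peak)
```

The semaphore bounds concurrency inside one process; the cross-process total stays within the budget only if every process applies the same per-process limit.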

Index

Module for indexing object files in a manifest (against indexd’s API).

The default manifest format created is a Tab-Separated Value file (tsv) with rows for every record.

Fields that are lists (like acl, authz, and urls) separate the values with commas or spaces. See the Attributes section for supported column names.

All supported formats of the acl, authz, and url fields are shown in the example below.

guid md5 size acl authz url
255e396f-f1f8-11e9-9a07-0a80fada099c 473d83400bc1bc9dc635e334faddf33c 363455714 ['Open'] [s3://pdcdatastore/test1.raw]
255e396f-f1f8-11e9-9a07-0a80fada098c 473d83400bc1bc9dc635e334faddd33c 343434344 Open s3://pdcdatastore/test2.raw
255e396f-f1f8-11e9-9a07-0a80fada097c 473d83400bc1bc9dc635e334fadd433c 543434443 phs0001 phs0002 s3://pdcdatastore/test3.raw
255e396f-f1f8-11e9-9a07-0a80fada096c 473d83400bc1bc9dc635e334fadd433c 363455714 ['phs0001', 'phs0002'] ['s3://pdcdatastore/test4.raw']
255e396f-f1f8-11e9-9a07-0a80fada010c 473d83400bc1bc9dc635e334fadde33c 363455714 ['Open'] s3://pdcdatastore/test5.raw
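
The cell formats in the example (bare value, space-separated values, and bracketed lists with or without quotes) can be normalized to a plain list with a small helper. The `parse_list_field` function below is a hypothetical sketch, not the parser used by index_manifest, and it assumes that values never contain commas, spaces, or quotes themselves.

```python
def parse_list_field(value):
    """Normalize one manifest cell to a list of strings."""
    value = value.strip()
    # Drop one pair of surrounding brackets, if present.
    if value.startswith("[") and value.endswith("]"):
        value = value[1:-1]
    # Treat commas and whitespace as equivalent separators.
    items = value.replace(",", " ").split()
    # Remove quotes around each item and skip anything left empty.
    cleaned = (item.strip("'\"") for item in items)
    return [item for item in cleaned if item]


for cell in ("['Open']", "Open", "phs0001 phs0002", "['phs0001', 'phs0002']"):
    print(cell, "->", parse_list_field(cell))
```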

gen3.tools.indexing.index_manifest.CURRENT_DIR

directory this file is in

Type:: str

gen3.tools.indexing.index_manifest.GUID

supported file id column names

Type:: list(string)

gen3.tools.indexing.index_manifest.SIZE

supported file size column names

Type:: list(string)

gen3.tools.indexing.index_manifest.MD5

supported md5 hash column names

Type:: list(string)

gen3.tools.indexing.index_manifest.ACLS

supported acl column names

Type:: list(string)

gen3.tools.indexing.index_manifest.URLS

supported url column names

Type:: list(string)

gen3.tools.indexing.index_manifest.AUTHZ

supported authz column names

Type:: list(string)

gen3.tools.indexing.index_manifest.PREV_GUID

supported previous guid column names

Type:: list(string)

Usages::

python index_manifest.py --commons_url https://giangb.planx-pla.net --manifest_file path_to_manifest --auth "admin,admin" --replace_urls False --thread_num 10
python index_manifest.py --commons_url https://giangb.planx-pla.net --manifest_file path_to_manifest --api_key ./credentials.json --replace_urls False --thread_num 10

class gen3.tools.indexing.index_manifest.ThreadControl(processed_files=0, num_total_files=0)

Bases: object

Class for thread synchronization

gen3.tools.indexing.index_manifest.delete_all_guids(auth, file)

Delete all GUIDs specified in the object manifest.

WARNING: THIS COMPLETELY REMOVES INDEX RECORDS. USE THIS ONLY IF YOU KNOW THE IMPLICATIONS.

gen3.tools.indexing.index_manifest.index_object_manifest(commons_url, manifest_file, thread_num, auth=None, replace_urls=True, manifest_file_delimiter=None, output_filename='indexing-output-manifest.csv', submit_additional_metadata_columns=False, force_metadata_columns_even_if_empty=True)

Loop through all the files in the manifest and create or update the corresponding records in indexd. An existing record is updated if the manifest URL is not in the record’s URL list or if the ACL has changed.

Parameters:

commons_url (str) – URL of the commons
manifest_file (str) – path to the manifest
thread_num (int) – number of threads for indexing
auth (Gen3Auth) – Gen3 auth or tuple with basic auth name and password
replace_urls (bool) – flag indicating whether to replace URLs
manifest_file_delimiter (str) – manifest’s delimiter
output_filename (str) – output file name for manifest
submit_additional_metadata_columns (bool) – whether to submit additional metadata to the metadata service
force_metadata_columns_even_if_empty (bool) – force the creation of a metadata column entry for a GUID even if the value is empty. Enabling this will force the creation of metadata entries for every column. See below for an illustrative example.

Example manifest_file:

guid, …, columnA, columnB, ColumnC
1, …, dataA, ,
2, …, , dataB,

Resulting metadata if force_metadata_columns_even_if_empty=True:

"1": {
    "columnA": "dataA",
    "columnB": "",
    "ColumnC": "",
},
"2": {
    "columnA": "",
    "columnB": "dataB",
    "ColumnC": "",
},

Resulting metadata if force_metadata_columns_even_if_empty=False:

"1": {
    "columnA": "dataA",
},
"2": {
    "columnB": "dataB",
},
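
The effect of force_metadata_columns_even_if_empty can be reproduced with a short standalone sketch. The `build_metadata` helper and the `CORE_COLUMNS` set are hypothetical (the real function decides which columns count as additional metadata), and the sample manifest substitutes an md5 column for the elided columns in the example.

```python
import csv
import io

# Assumed set of core indexing columns; everything else is extra metadata.
CORE_COLUMNS = {"guid", "md5", "size", "acl", "authz", "url"}


def build_metadata(rows, force_even_if_empty=True):
    """Map each GUID to its additional metadata columns."""
    metadata = {}
    for row in rows:
        extra = {
            key: (value or "")
            for key, value in row.items()
            if key not in CORE_COLUMNS
        }
        if not force_even_if_empty:
            # Keep only the columns that actually carry a value.
            extra = {key: value for key, value in extra.items() if value != ""}
        metadata[row["guid"]] = extra
    return metadata


manifest = io.StringIO(
    "guid,md5,columnA,columnB,ColumnC\n"
    "1,abc,dataA,,\n"
    "2,def,,dataB,\n"
)
rows = list(csv.DictReader(manifest))
forced = build_metadata(rows, force_even_if_empty=True)
sparse = build_metadata(rows, force_even_if_empty=False)
print(forced)
print(sparse)
```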

Returns:

files (list(dict)): list of file info

[
    {
        "guid": "guid_example",
        "filename": "example",
        "size": 100,
        "acl": "['open']",
        "md5": "md5_hash",
    },
]

headers (list(str)): list of fieldnames

Verify

Module for verifying a manifest of indexed file objects against indexd’s API. Supports multiple processes and coroutines using Python’s asyncio library.

The default manifest format created is a Comma-Separated Value file (csv) with rows for every record. A header row is created with field names: guid,authz,acl,file_size,md5,urls,file_name

Fields that are lists (like acl, authz, and urls) separate the values with spaces.

There is a default mapping for those column names above but you can override it. Fields that expect lists (like acl, authz, and urls) by default assume these values are separated with spaces. If you need alternate behavior, you can simply override the manifest_row_parsers for specific fields and replace the default parsing function with a custom one. For example:

```
from gen3.tools import indexing
from gen3.tools.indexing.verify_manifest import manifest_row_parsers

def _get_authz_from_row(row):
    return [row.get("authz").strip().strip("[").strip("]").strip("'")]

# override default parsers
manifest_row_parsers["authz"] = _get_authz_from_row

indexing.verify_object_manifest(COMMONS)
```

The output from this verification is a file containing any errors in the following format:

{guid}|{error_name}|expected {value_from_manifest}|actual {value_from_indexd}

ex: 93d9af72-b0f1-450c-a5c6-7d3d8d2083b4|authz|expected ['']|actual ['/programs/DEV/projects/test']
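
For post-processing the error file, a line in this format can be split back into its parts. The `VerifyError` tuple and `parse_error_line` helper below are hypothetical conveniences, not part of verify_manifest, and the sketch assumes that the guid, error name, and expected value contain no "|" characters.

```python
from typing import NamedTuple


class VerifyError(NamedTuple):
    guid: str
    error_name: str
    expected: str
    actual: str


def _strip_prefix(text, prefix):
    """Remove `prefix` from the start of `text` if it is present."""
    return text[len(prefix):] if text.startswith(prefix) else text


def parse_error_line(line):
    """Split one '{guid}|{error_name}|expected ...|actual ...' line."""
    guid, error_name, expected, actual = line.rstrip("\n").split("|", 3)
    return VerifyError(
        guid,
        error_name,
        _strip_prefix(expected, "expected "),
        _strip_prefix(actual, "actual "),
    )


line = (
    "93d9af72-b0f1-450c-a5c6-7d3d8d2083b4|authz|"
    "expected ['']|actual ['/programs/DEV/projects/test']"
)
error = parse_error_line(line)
print(error.error_name, error.expected, error.actual)
```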

gen3.tools.indexing.verify_manifest.CURRENT_DIR

directory this file is in

Type:: str

gen3.tools.indexing.verify_manifest.MAX_CONCURRENT_REQUESTS

maximum number of desired concurrent requests across processes/threads

Type:: int

async gen3.tools.indexing.verify_manifest.async_verify_object_manifest(commons_url, manifest_file, max_concurrent_requests=24, manifest_row_parsers={'acl': <function _get_acl_from_row>, 'authz': <function _get_authz_from_row>, 'file_name': <function _get_file_name_from_row>, 'file_size': <function _get_file_size_from_row>, 'guid': <function _get_guid_from_row>, 'md5': <function _get_md5_from_row>, 'urls': <function _get_urls_from_row>}, manifest_file_delimiter=None, output_filename='verify-manifest-errors-1752160449.5966842.log')

Verify all file object records in a manifest CSV against indexd

Parameters:

commons_url (str) – root domain for commons where indexd lives
manifest_file (str) – the file to verify against
max_concurrent_requests (int) – the maximum number of concurrent requests allowed
manifest_row_parsers (Dict{indexd_field: func_to_parse_row}) – Row parsers
manifest_file_delimiter (str) – delimiter in manifest_file
output_filename (str) – filename for output logs
