Indexing Tools¶
Download¶
Module for indexing actions for downloading a manifest of indexed file objects (against indexd’s API). Supports multiple processes and coroutines using Python’s asyncio library.
The default manifest format created is a Comma-Separated Value file (csv) with rows for every record. A header row is created with field names: guid,authz,acl,file_size,md5,urls,file_name
Fields that are lists (like acl, authz, and urls) separate the values with spaces.
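As a rough illustration of the format described above, the sketch below reads a manifest with Python's standard csv module and splits the list-valued fields on spaces. The helper name `read_manifest` and the sample row are hypothetical and are not part of gen3.

```python
import csv
import io

# Fields documented above as space-separated lists.
LIST_FIELDS = ("acl", "authz", "urls")


def read_manifest(handle):
    """Parse manifest rows, splitting list fields on whitespace."""
    records = []
    for row in csv.DictReader(handle):
        for field in LIST_FIELDS:
            row[field] = (row.get(field) or "").split()
        records.append(row)
    return records


# Hypothetical sample content using the documented header row.
SAMPLE = (
    "guid,authz,acl,file_size,md5,urls,file_name\n"
    "example-guid,/programs/DEV/projects/test,*,"
    "100,473d83400bc1bc9dc635e334faddf33c,"
    "s3://bucket/a.raw gs://bucket/a.raw,a.raw\n"
)

records = read_manifest(io.StringIO(SAMPLE))
print(records[0]["urls"])
```

Scalar fields such as file_size are left as strings here; a caller would convert them as needed.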
- gen3.tools.indexing.download_manifest.CURRENT_DIR¶
directory this file is in
- Type:
str
- gen3.tools.indexing.download_manifest.INDEXD_RECORD_PAGE_SIZE¶
number of records to request per page
- Type:
int
- gen3.tools.indexing.download_manifest.MAX_CONCURRENT_REQUESTS¶
maximum number of desired concurrent requests across processes/threads
- Type:
int
- gen3.tools.indexing.download_manifest.TMP_FOLDER¶
Directory for temporary files. NOTE: A temporary folder is needed because Python’s file writing is not thread-safe, so all processes cannot write to the same file. As a workaround, each process writes to its own file and the files are concatenated in post-processing.
- Type:
str
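The workaround described for TMP_FOLDER can be sketched as follows: each worker writes to its own file and the parts are concatenated afterwards. This is a minimal illustration, not gen3's implementation. It uses threads as a stand-in for separate processes, and the function names, file names, and sample rows are all hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_part(folder, index, rows):
    """Write one worker's rows to its own file and return the path."""
    path = os.path.join(folder, f"part-{index}.csv")
    with open(path, "w", newline="") as part:
        for row in rows:
            part.write(row + "\n")
    return path


def concat_parts(paths, output_path, header):
    """Concatenate the per-worker files into a single output file."""
    with open(output_path, "w", newline="") as output:
        output.write(header + "\n")
        for path in paths:
            with open(path, newline="") as part:
                output.write(part.read())


def run_demo():
    # Hypothetical row chunks, one per worker.
    chunks = [["guid-1,100"], ["guid-2,200"], ["guid-3,300"]]
    with tempfile.TemporaryDirectory() as folder:
        with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
            futures = [
                pool.submit(write_part, folder, i, rows)
                for i, rows in enumerate(chunks)
            ]
            # Collect paths in submission order so output order is stable.
            paths = [future.result() for future in futures]
        output_path = os.path.join(folder, "combined.csv")
        concat_parts(paths, output_path, "guid,file_size")
        with open(output_path) as combined:
            return combined.read()


combined_text = run_demo()
print(combined_text)
```

Because no two workers share a file handle, no locking is needed while writing; the only sequential step is the final concatenation.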
- async gen3.tools.indexing.download_manifest.async_download_object_manifest(commons_url, output_filename='object-manifest.csv', num_processes=4, max_concurrent_requests=24, input_manifest=None, python_subprocess_command='python')[source]¶
Download all file object records into a manifest csv
- Parameters:
commons_url (str) – root domain for commons where indexd lives
output_filename (str, optional) – filename for output
num_processes (int, optional) – number of parallel python processes to use for hitting indexd api and processing
max_concurrent_requests (int) – the maximum number of concurrent requests allowed. NOTE: This is the TOTAL number, not just for this process. It is used to help determine how many requests each process should make at one time.
input_manifest (str) – Input file. Read available object data from objects in this file instead of reading everything in indexd. This will attempt to query indexd for only the records identified in this manifest.
python_subprocess_command (str, optional) – Command used to execute a python process. By default you should not need to change this, but if you are running something like macOS and have only Python 3.x installed, you may need to specify "python3".
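To make the relationship between max_concurrent_requests and num_processes concrete, here is a minimal sketch of one way a total request budget could be divided among processes and enforced inside each one with asyncio.Semaphore. The even split and the names `per_process_limit`, `fetch_page`, and `run_process` are assumptions for illustration. The documentation above does not say how gen3 computes the per-process limit.

```python
import asyncio


def per_process_limit(max_concurrent_requests, num_processes):
    """Split a total request budget evenly; each process gets at least 1."""
    return max(1, max_concurrent_requests // num_processes)


async def fetch_page(page, semaphore, state):
    """Stand-in for an HTTP request; tracks how many run at once."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0)  # yield control, as real I/O would
        state["active"] -= 1
        return page


async def run_process(pages, limit):
    """Run all page requests, never exceeding `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fetch_page(page, semaphore, state) for page in pages)
    )
    return results, state["peak"]


# Defaults from the signature above: 24 total requests, 4 processes.
limit = per_process_limit(24, 4)
results, peak = asyncio.run(run_process(range(20), limit))
print(limit, peak)
```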
Index¶
Module for indexing object files in a manifest (against indexd’s API).
The default manifest format created is a Tab-Separated Value file (tsv) with rows for every record.
Fields that are lists (like acl, authz, and urls) separate the values with commas or spaces. See the Attributes section for supported column names.
All supported formats of the acl, authz and url fields are shown in the example below.
guid	md5	size	acl	authz	url
255e396f-f1f8-11e9-9a07-0a80fada099c	473d83400bc1bc9dc635e334faddf33c	363455714	['Open']		[s3://pdcdatastore/test1.raw]
255e396f-f1f8-11e9-9a07-0a80fada098c	473d83400bc1bc9dc635e334faddd33c	343434344	Open		s3://pdcdatastore/test2.raw
255e396f-f1f8-11e9-9a07-0a80fada097c	473d83400bc1bc9dc635e334fadd433c	543434443	phs0001 phs0002		s3://pdcdatastore/test3.raw
255e396f-f1f8-11e9-9a07-0a80fada096c	473d83400bc1bc9dc635e334fadd433c	363455714	['phs0001', 'phs0002']		['s3://pdcdatastore/test4.raw']
255e396f-f1f8-11e9-9a07-0a80fada010c	473d83400bc1bc9dc635e334fadde33c	363455714	['Open']		s3://pdcdatastore/test5.raw
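The list notations in the example above (bare values, space-separated values, and bracketed lists with or without quotes) can all be reduced to a plain Python list. The function below is a hypothetical sketch of such a normalizer, not the parser gen3 uses. It assumes that individual values never contain commas or whitespace.

```python
import re


def parse_list_field(value):
    """Normalize a list-like manifest cell into a list of strings."""
    # Drop surrounding brackets, then split on commas and/or whitespace.
    inner = value.strip().strip("[]")
    items = re.split(r"[,\s]+", inner)
    # Remove quotes around each item and skip empty pieces.
    return [item.strip("'\"") for item in items if item.strip("'\"")]


for cell in (
    "['Open']",
    "Open",
    "phs0001 phs0002",
    "['phs0001', 'phs0002']",
    "[s3://pdcdatastore/test1.raw]",
):
    print(cell, "->", parse_list_field(cell))
```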
- gen3.tools.indexing.index_manifest.CURRENT_DIR¶
directory this file is in
- Type:
str
- gen3.tools.indexing.index_manifest.GUID¶
supported file id column names
- Type:
list(string)
- gen3.tools.indexing.index_manifest.SIZE¶
supported file size column names
- Type:
list(string)
- gen3.tools.indexing.index_manifest.MD5¶
supported md5 hash column names
- Type:
list(string)
- gen3.tools.indexing.index_manifest.ACLS¶
supported acl column names
- Type:
list(string)
- gen3.tools.indexing.index_manifest.URLS¶
supported url column names
- Type:
list(string)
- gen3.tools.indexing.index_manifest.AUTHZ¶
supported authz column names
- Type:
list(string)
- gen3.tools.indexing.index_manifest.PREV_GUID¶
supported previous guid column names
- Type:
list(string)
- Usages:
python index_manifest.py --commons_url https://giangb.planx-pla.net --manifest_file path_to_manifest --auth "admin,admin" --replace_urls False --thread_num 10
python index_manifest.py --commons_url https://giangb.planx-pla.net --manifest_file path_to_manifest --api_key ./credentials.json --replace_urls False --thread_num 10
- class gen3.tools.indexing.index_manifest.ThreadControl(processed_files=0, num_total_files=0)[source]¶
Bases:
object
Class for thread synchronization
- gen3.tools.indexing.index_manifest.delete_all_guids(auth, file)[source]¶
Delete all GUIDs specified in the object manifest.
- WARNING: THIS COMPLETELY REMOVES INDEX RECORDS. USE THIS ONLY IF YOU KNOW
THE IMPLICATIONS.
- gen3.tools.indexing.index_manifest.index_object_manifest(commons_url, manifest_file, thread_num, auth=None, replace_urls=True, manifest_file_delimiter=None, output_filename='indexing-output-manifest.csv', submit_additional_metadata_columns=False, force_metadata_columns_even_if_empty=True)[source]¶
Loop through all the files in the manifest and create or update the corresponding records in indexd. An existing record is updated if the url is not in the record’s url list or the acl has changed.
- Parameters:
commons_url (str) – commons url
manifest_file (str) – path to the manifest
thread_num (int) – number of threads for indexing
auth (Gen3Auth) – Gen3 auth or tuple with basic auth name and password
replace_urls (bool) – whether to replace urls
manifest_file_delimiter (str) – manifest’s delimiter
output_filename (str) – output file name for manifest
submit_additional_metadata_columns (bool) – whether to submit additional metadata to the metadata service
force_metadata_columns_even_if_empty (bool) – force the creation of a metadata column entry for a GUID even if the value is empty. Enabling this will force the creation of metadata entries for every column. See below for an illustrative example.
- Example manifest_file:
guid, …, columnA, columnB, ColumnC
1, …, dataA, ,
2, …, , dataB,
- Resulting metadata if force_metadata_columns_even_if_empty=True:
"1": {
    "columnA": "dataA",
    "columnB": "",
    "ColumnC": "",
},
"2": {
    "columnA": "",
    "columnB": "dataB",
    "ColumnC": "",
},
- Resulting metadata if force_metadata_columns_even_if_empty=False:
"1": {
    "columnA": "dataA",
},
"2": {
    "columnB": "dataB",
},
- Returns:
list of file info
[
    {
        "guid": "guid_example",
        "filename": "example",
        "size": 100,
        "acl": "['open']",
        "md5": "md5_hash",
    },
]
headers(list(str)): list of fieldnames
- Return type:
files(list(dict))
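The effect of force_metadata_columns_even_if_empty on the example manifest above can be reproduced with a few lines of plain Python. This is only a sketch of the documented behavior; `build_metadata` is a hypothetical helper, not a gen3 function, and it does not contact the metadata service.

```python
def build_metadata(rows, metadata_columns, force_even_if_empty=True):
    """Map each GUID to its metadata columns, optionally dropping empty values."""
    metadata = {}
    for row in rows:
        entry = {}
        for column in metadata_columns:
            value = row.get(column, "")
            # Keep empty values only when the force flag is set.
            if value or force_even_if_empty:
                entry[column] = value
        metadata[row["guid"]] = entry
    return metadata


# Rows mirroring the example manifest (the elided columns are left out).
rows = [
    {"guid": "1", "columnA": "dataA", "columnB": "", "ColumnC": ""},
    {"guid": "2", "columnA": "", "columnB": "dataB", "ColumnC": ""},
]
columns = ["columnA", "columnB", "ColumnC"]

forced = build_metadata(rows, columns, force_even_if_empty=True)
sparse = build_metadata(rows, columns, force_even_if_empty=False)
print(forced)
print(sparse)
```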
Verify¶
Module for indexing actions for verifying a manifest of indexed file objects (against indexd’s API). Supports multiple processes and coroutines using Python’s asyncio library.
The default manifest format created is a Comma-Separated Value file (csv) with rows for every record. A header row is created with field names: guid,authz,acl,file_size,md5,urls,file_name
Fields that are lists (like acl, authz, and urls) separate the values with spaces.
There is a default mapping for those column names above but you can override it. Fields that expect lists (like acl, authz, and urls) by default assume these values are separated with spaces. If you need alternate behavior, you can simply override the manifest_row_parsers for specific fields and replace the default parsing function with a custom one. For example:
```
from gen3.tools import indexing
from gen3.tools.indexing.verify_manifest import manifest_row_parsers

def _get_authz_from_row(row):
    return [row.get("authz").strip().strip("[").strip("]").strip("'")]

# override default parsers
manifest_row_parsers["authz"] = _get_authz_from_row

indexing.verify_object_manifest(COMMONS)
```
The output from this verification is a file containing any errors in the following format:
{guid}|{error_name}|expected {value_from_manifest}|actual {value_from_indexd}
ex: 93d9af72-b0f1-450c-a5c6-7d3d8d2083b4|authz|expected ['']|actual ['/programs/DEV/projects/test']
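Since the error file uses a fixed pipe-delimited layout, it is easy to post-process. The sketch below splits one line into its four parts. `parse_error_line` and `strip_label` are hypothetical helpers, and the approach assumes the guid and error name never contain a `|` character.

```python
def strip_label(text, label):
    """Remove a leading label such as 'expected ' if it is present."""
    return text[len(label):] if text.startswith(label) else text


def parse_error_line(line):
    """Split one error line into guid, error name, expected and actual."""
    guid, error_name, expected, actual = line.rstrip("\n").split("|", 3)
    return {
        "guid": guid,
        "error_name": error_name,
        "expected": strip_label(expected, "expected "),
        "actual": strip_label(actual, "actual "),
    }


# Example line taken from the format description above.
example_line = (
    "93d9af72-b0f1-450c-a5c6-7d3d8d2083b4|authz|"
    "expected ['']|actual ['/programs/DEV/projects/test']"
)
parsed = parse_error_line(example_line)
print(parsed["error_name"], parsed["expected"], parsed["actual"])
```

Grouping the parsed entries by error_name is one way to see at a glance which field disagrees most often between the manifest and indexd.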
- gen3.tools.indexing.verify_manifest.CURRENT_DIR¶
directory this file is in
- Type:
str
- gen3.tools.indexing.verify_manifest.MAX_CONCURRENT_REQUESTS¶
maximum number of desired concurrent requests across processes/threads
- Type:
int
- async gen3.tools.indexing.verify_manifest.async_verify_object_manifest(commons_url, manifest_file, max_concurrent_requests=24, manifest_row_parsers={'acl': <function _get_acl_from_row>, 'authz': <function _get_authz_from_row>, 'file_name': <function _get_file_name_from_row>, 'file_size': <function _get_file_size_from_row>, 'guid': <function _get_guid_from_row>, 'md5': <function _get_md5_from_row>, 'urls': <function _get_urls_from_row>}, manifest_file_delimiter=None, output_filename='verify-manifest-errors-1728061593.9082778.log')[source]¶
Verify all file object records in a manifest csv against indexd
- Parameters:
commons_url (str) – root domain for commons where indexd lives
manifest_file (str) – the file to verify against
max_concurrent_requests (int) – the maximum number of concurrent requests allowed
manifest_row_parsers (Dict{indexd_field: func_to_parse_row}) – Row parsers
manifest_file_delimiter (str) – delimiter in manifest_file
output_filename (str) – filename for output logs