Metadata Tools¶
Ingest¶
Tools for ingesting a CSV/TSV metadata manifest into the Metadata Service.
- gen3.tools.metadata.ingest_manifest.COLUMN_TO_USE_AS_GUID¶
file column containing guid to use
- Type:
str
- gen3.tools.metadata.ingest_manifest.GUID_TYPE_FOR_INDEXED_FILE_OBJECT¶
type to populate in mds when guid exists in indexd
- Type:
str
- gen3.tools.metadata.ingest_manifest.GUID_TYPE_FOR_NON_INDEXED_FILE_OBJECT¶
type to populate in mds when guid does NOT exist in indexd
- Type:
str
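The two GUID type constants above feed a simple selection rule, which the `metadata_type` parameter of `async_ingest_metadata_manifest` is documented to override. The following is a minimal sketch of that rule only, not the library's implementation: the constant values are placeholders (the real string values are not shown in this reference) and `resolve_guid_type` is a hypothetical helper name.

```python
# Placeholder values: the real constants live in
# gen3.tools.metadata.ingest_manifest and their actual values are not shown here.
GUID_TYPE_FOR_INDEXED_FILE_OBJECT = "example_indexed_type"
GUID_TYPE_FOR_NON_INDEXED_FILE_OBJECT = "example_non_indexed_type"


def resolve_guid_type(is_indexed_file_object, metadata_type=None):
    """Hypothetical helper mirroring the documented selection rule."""
    # An explicitly provided metadata_type overrides the per-GUID default.
    if metadata_type is not None:
        return metadata_type
    return (
        GUID_TYPE_FOR_INDEXED_FILE_OBJECT
        if is_indexed_file_object
        else GUID_TYPE_FOR_NON_INDEXED_FILE_OBJECT
    )
```

Under these assumptions, a GUID found in indexd gets the first placeholder type, any other GUID gets the second, and passing `metadata_type` replaces both.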
- manifest_row_parsers (Dict{str: function}): functions for parsing manifest rows, which users can override. The default is:
manifest_row_parsers = {
"guid_for_row": _get_guid_for_row,
"indexed_file_object_guid": _query_for_associated_indexd_record_guid,
}
"guid_for_row" is the function that retrieves the guid from the given file row. "indexed_file_object_guid" is the function that retrieves the guid from elsewhere, such as indexd (by querying).
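Overriding a row parser could look roughly like the sketch below. It is self-contained and does not import gen3. The `(commons_url, row, lock)` parser signature, the `object_id` column name, and both function names are assumptions made for illustration; check the library source for the signature the ingest function actually calls.

```python
def guid_from_custom_column(commons_url, row, lock):
    """Hypothetical parser: read the GUID from an "object_id" column.

    The (commons_url, row, lock) signature is an assumption, not confirmed
    by this reference.
    """
    return row.get("object_id")


def with_custom_guid_parser(default_parsers):
    """Return a copy of the parser mapping with "guid_for_row" replaced."""
    parsers = dict(default_parsers)  # copy so the defaults stay untouched
    parsers["guid_for_row"] = guid_from_custom_column
    return parsers
```

The resulting dictionary would then be passed as the `manifest_row_parsers` argument. Copying first avoids mutating the module-level default that other callers share.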
- gen3.tools.metadata.ingest_manifest.MAX_CONCURRENT_REQUESTS¶
Maximum concurrent requests to mds for ingestion
- Type:
int
- async gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest(commons_url, manifest_file, metadata_source, auth=None, max_concurrent_requests=24, manifest_row_parsers={'guid_for_row': <function _get_guid_for_row>, 'indexed_file_object_guid': <function _query_for_associated_indexd_record_guid>}, manifest_file_delimiter=None, output_filename='ingest-metadata-manifest-errors-1728061594.165499.log', get_guid_from_file=True, metadata_type=None)[source]¶
Ingest all metadata records from a manifest CSV/TSV file into the Metadata Service
- Parameters:
commons_url (str) – root domain for commons where mds lives
manifest_file (str) – the file to ingest against
metadata_source (str) – the name of the source of metadata (used to namespace in the metadata service) ex: dbgap
auth (Gen3Auth) – Gen3 auth or tuple with basic auth name and password
max_concurrent_requests (int) – the maximum number of concurrent requests allowed
manifest_row_parsers (Dict{indexd_field: func_to_parse_row}) – row parsers
manifest_file_delimiter (str) – delimiter in manifest_file
output_filename (str) – filename for output logs
get_guid_from_file (bool) –
whether or not to get the guid for metadata from the file. NOTE: When this is True, the function in
manifest_row_parsers["guid_for_row"] is used to determine the GUID (usually just a specific column in the file row, like "guid")
metadata_type (str) –
the type of metadata to be filled into the _guid_type field. If provided, will override the default logic per GUID: (GUID_TYPE_FOR_INDEXED_FILE_OBJECT
if is_indexed_file_object else GUID_TYPE_FOR_NON_INDEXED_FILE_OBJECT)
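A call to this coroutine might be wired up as in the sketch below. It uses only parameter names from the signature above. The helper names `build_ingest_kwargs` and `run_ingest`, the chosen concurrency limit, and the tab delimiter are illustrative assumptions, and the gen3 import is deferred so the snippet loads without the package installed.

```python
import asyncio


def build_ingest_kwargs(commons_url, manifest_file, metadata_source, auth=None):
    """Hypothetical helper: collect keyword arguments for the ingest call."""
    return {
        "commons_url": commons_url,
        "manifest_file": manifest_file,
        "metadata_source": metadata_source,
        "auth": auth,  # a Gen3Auth instance or basic-auth tuple, per the docs
        "max_concurrent_requests": 8,  # lower than the documented default of 24
        "manifest_file_delimiter": "\t",  # assumes a TSV manifest
        "get_guid_from_file": True,
    }


async def run_ingest(kwargs):
    """Hypothetical wrapper that awaits the documented coroutine."""
    # Imported lazily so this sketch can be loaded without gen3 installed.
    from gen3.tools.metadata.ingest_manifest import async_ingest_metadata_manifest

    await async_ingest_metadata_manifest(**kwargs)
```

With a placeholder commons such as `https://example-commons.org` and a local `manifest.tsv`, running `asyncio.run(run_ingest(build_ingest_kwargs(...)))` would start the ingestion. Supplying real credentials through `auth` is left to the caller.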
- async gen3.tools.metadata.ingest_manifest.async_query_urls_from_indexd(pattern, commons_url, lock)[source]¶
Gets a semaphore then requests a record for the given pattern
- Parameters:
pattern (str) – url pattern to match
commons_url (str) – root domain for commons where mds lives
lock (asyncio.Semaphore) – semaphore used to limit the number of concurrent HTTP connections
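The semaphore pattern described here can be illustrated with plain asyncio. This is a sketch of the general technique rather than the library's code: `limited_fetch` and `fetch_all` are hypothetical names, and a short sleep stands in for the real HTTP request to indexd.

```python
import asyncio


async def limited_fetch(pattern, lock, tracker):
    """Stand-in for a request: acquire the semaphore before 'querying'."""
    async with lock:
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        await asyncio.sleep(0.01)  # simulated network latency, no real HTTP
        tracker["active"] -= 1
        return f"record-for-{pattern}"


async def fetch_all(patterns, max_concurrent_requests):
    """Run one task per pattern, with at most N inside the semaphore at once."""
    lock = asyncio.Semaphore(max_concurrent_requests)
    tracker = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_fetch(p, lock, tracker) for p in patterns)
    )
    return results, tracker["peak"]
```

All tasks are scheduled immediately, but only `max_concurrent_requests` of them hold the semaphore at any moment, which is what bounds the number of open connections.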