Metadata Tools

Ingest

Tools for ingesting a CSV/TSV metadata manifest into the Metadata Service.

gen3.tools.metadata.ingest_manifest.COLUMN_TO_USE_AS_GUID

Name of the manifest file column that contains the GUID to use

Type:

str

gen3.tools.metadata.ingest_manifest.GUID_TYPE_FOR_INDEXED_FILE_OBJECT

type to populate in mds when guid exists in indexd

Type:

str

gen3.tools.metadata.ingest_manifest.GUID_TYPE_FOR_NON_INDEXED_FILE_OBJECT

type to populate in mds when guid does NOT exist in indexd

Type:

str

manifest_row_parsers (Dict{str: function}): functions for parsing rows; users can override them.

manifest_row_parsers = {
    "guid_for_row": _get_guid_for_row,
    "indexed_file_object_guid": _query_for_associated_indexd_record_guid,
}

"guid_for_row" is the function that retrieves the GUID from the given file. "indexed_file_object_guid" is the function that retrieves the GUID from elsewhere, such as indexd (by querying).
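The override mechanism described above can be sketched as follows. This is a minimal, standalone illustration and not the library's code: the parser signature (a single `row` dict), the `submitted_sample_id` column, and the helper names `default_guid_for_row` and `guid_or_sample_id` are all assumptions made for the example. Check the `ingest_manifest` source for the arguments the real parsers receive before writing an override.

```python
import csv
import io

# Hypothetical stand-in for the default parser: read the GUID from a "guid" column.
def default_guid_for_row(row):
    return row.get("guid")

# Hypothetical override: fall back to another column when "guid" is empty.
def guid_or_sample_id(row):
    return row.get("guid") or row.get("submitted_sample_id")

# Same shape as manifest_row_parsers: a dict mapping a key to a function.
row_parsers = {"guid_for_row": default_guid_for_row}

# Override a single entry while leaving the original dict untouched.
custom_parsers = {**row_parsers, "guid_for_row": guid_or_sample_id}

# A tiny in-memory manifest; the first row has no GUID.
manifest = io.StringIO("guid,submitted_sample_id\n,sample-1\nabc-123,sample-2\n")
guids = [custom_parsers["guid_for_row"](row) for row in csv.DictReader(manifest)]
print(guids)
```

Copying the dict before overriding an entry keeps the defaults available for other callers.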

gen3.tools.metadata.ingest_manifest.MAX_CONCURRENT_REQUESTS

Maximum concurrent requests to mds for ingestion

Type:

int

async gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest(commons_url, manifest_file, metadata_source, auth=None, max_concurrent_requests=24, manifest_row_parsers={'guid_for_row': <function _get_guid_for_row>, 'indexed_file_object_guid': <function _query_for_associated_indexd_record_guid>}, manifest_file_delimiter=None, output_filename='ingest-metadata-manifest-errors-1711461363.3879402.log', get_guid_from_file=True, metadata_type=None)[source]

Ingest all metadata records from a manifest CSV into the Metadata Service

Parameters:
  • commons_url (str) – root domain for commons where mds lives

  • manifest_file (str) – the file to ingest against

  • metadata_source (str) – the name of the source of metadata (used to namespace in the metadata service) ex: dbgap

  • auth (Gen3Auth) – Gen3 auth or tuple with basic auth name and password

  • max_concurrent_requests (int) – the maximum number of concurrent requests allowed

  • manifest_row_parsers (Dict{indexd_field: func_to_parse_row}) – Row parsers

  • manifest_file_delimiter (str) – delimiter in manifest_file

  • output_filename (str) – filename for output logs

  • get_guid_from_file (bool) – whether or not to get the GUID for metadata from the file. NOTE: when this is True, the function in manifest_row_parsers["guid_for_row"] is used to determine the GUID (usually just a specific column in the file row, like "guid")

  • metadata_type (str) – the type of metadata to be filled into the _guid_type field. If provided, overrides the default per-GUID logic: GUID_TYPE_FOR_INDEXED_FILE_OBJECT if is_indexed_file_object else GUID_TYPE_FOR_NON_INDEXED_FILE_OBJECT
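The metadata_type override rule can be illustrated with a small standalone sketch. The constant values below are placeholders chosen for the example (the real values are defined in gen3.tools.metadata.ingest_manifest and may differ), and `resolve_guid_type` is a hypothetical helper, not a function in the library. The sketch also assumes that "provided" means "not None".

```python
# Placeholder values for illustration only; the real constants live in
# gen3.tools.metadata.ingest_manifest and may differ.
GUID_TYPE_FOR_INDEXED_FILE_OBJECT = "indexed_file_object"
GUID_TYPE_FOR_NON_INDEXED_FILE_OBJECT = "metadata_object"


def resolve_guid_type(is_indexed_file_object, metadata_type=None):
    """Hypothetical helper mirroring the documented rule for _guid_type."""
    # An explicit metadata_type wins over the per-GUID default.
    if metadata_type is not None:
        return metadata_type
    # Otherwise the type depends on whether the GUID exists in indexd.
    return (
        GUID_TYPE_FOR_INDEXED_FILE_OBJECT
        if is_indexed_file_object
        else GUID_TYPE_FOR_NON_INDEXED_FILE_OBJECT
    )


print(resolve_guid_type(True))
print(resolve_guid_type(False))
print(resolve_guid_type(True, metadata_type="discovery_metadata"))
```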

async gen3.tools.metadata.ingest_manifest.async_query_urls_from_indexd(pattern, commons_url, lock)[source]

Gets a semaphore then requests a record for the given pattern

Parameters:
  • pattern (str) – url pattern to match

  • commons_url (str) – root domain for commons where mds lives

  • lock (asyncio.Semaphore) – semaphore used to limit the number of concurrent HTTP connections
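The "get a semaphore, then make the request" pattern can be sketched with the standard library alone. This is a generic asyncio illustration under stated assumptions, not the body of async_query_urls_from_indexd: `limited_request`, `run`, the `state` counter, and the example URL patterns are hypothetical, and `asyncio.sleep` stands in for the real HTTP call.

```python
import asyncio


async def limited_request(pattern, lock, state):
    """Hypothetical request: hold the semaphore for the duration of the call."""
    async with lock:
        # Track how many requests are in flight to show the limit is respected.
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for the HTTP request
        state["active"] -= 1
        return f"record-for-{pattern}"


async def run(patterns, max_concurrent_requests):
    # One semaphore shared by every request caps the concurrency.
    lock = asyncio.Semaphore(max_concurrent_requests)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_request(p, lock, state) for p in patterns)
    )
    return results, state["peak"]


patterns = [f"s3://bucket/file-{i}" for i in range(10)]
results, peak = asyncio.run(run(patterns, max_concurrent_requests=3))
print(len(results), peak)
```

All ten coroutines are scheduled at once, but only as many as the semaphore allows are inside the guarded section at any moment. MAX_CONCURRENT_REQUESTS plays this role for ingestion.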