Portable Format for Biomedical Data
Portable Format for Biomedical Data
PFB is a serialization file format designed to store bio-medical data, as well as metadata. The format is built on top Avro to make it fast, extensible and interoperable between different systems.
What is an Avro File?
A file with data records (JSON) and a schema (JSON) to describe each data record. Avro files can be serialized into a binary format and compressed.
Read more about Avro.
What is a PFB File?
A PFB file is special kind of Avro file, suitable for capturing and reconstructing biomedical relational data.
A PFB file is an Avro file with a particular Avro schema that represents a relational database. We call this schema the PFB Schema.
The data in a PFB file contains a list of JSON objects called PFB Entity objects. There are 2 types of PFB Entities. One (Metadata) captures information about the relational database and the other (Table Row) captures a row of data from a particular table in the database.
The data records in a PFB file are produced by transforming the original data from a relational database into PFB Entity objects. Each PFB Entity object conforms to its Avro schema.
Vanilla Avro vs PFB
Let’s say a client receives an Avro file. It reads in the Avro data. Now a client has the Avro schema and all of the data that conforms to that schema in a big JSON blob. It can do what it wants. Maybe it wants to construct some data input forms. It has everything it needs to do this since the schema has all of the entities, attributes, and types for those attributes defined.
Now what happens if the client wants to reconstruct a relational database from the data? How does it know what tables to create, and what the relationships are between those tables? Which relationships are required vs not? This is one of the problems PFB addresses.
Schema
PFB file is consist of list of Entities, where Entity is a container type (it stores the actual data) in object field. The id, name and relations fields used for linking each entity to other entities in the PFB file. The relations consists of a destination id and name: so each entity is linked to some other entity (note, there is no backward links). The id and relations is only relevant for clinical data, so for Metadata entity it’s null
.
The only entity which is required in the PFB file is Metadata: it describes the data model linking and stores ontology references to make interoperability with other systems easier. For each node and fields multiple links to ontologies can exists. The ontology is user defined and it’s possible to perform a schema evolution on the PFB file to harmonize all of the data to a single version of a given ontology.
The Avro IDL for Metadata:
record Metadata {
array<Node> nodes;
map<string> misc;
}
With Node
and Property
defined as:
record Node {
string name;
string ontology_reference;
map<string> values;
array<Link> links;
array<Property> properties;
}
record Property {
string name;
string ontology_reference;
map<string> values;
}
In both Node
and Property
the name
is the name of the field in the data model, ontology_reference
stores the name for ontology reference (any ontology, e.g. NCIt, caDSR) and values
stores any additional field regarding ontology: URL, ontology codes, synonyms etc. Same for Property
.
In addition, inside links for Node
it’s important to specify the linkage for the nodes: destination and multiplicity.
The Link
and Multiplicity
enum is presented below:
record Link {
Multiplicity multiplicity;
string dst;
}
enum Multiplicity {
ONE_TO_ONE,
ONE_TO_MANY,
MANY_TO_ONE,
MANY_TO_MANY
}
Types
Enum
Because Avro can’t store anything except alphanumeric and _
symbols, all enums are encoded in such a way, that all other symbols is being encoded with codepoint wrapped in single underscores. For example bpm > 60
will be encoded in: bpm_32__62__32_60
, so space ` ` is encoded as _32_
and greater sign >
into _62_
. Same for Unicode characters: ä
- _228_
, ü
- _252_
. The Avro schema also doesn’t allow for the first character to be a number. So we encode the first character in the way if the character happens to be a number.
Example
The schema on the image below shows the simplest model with only Demographic data.
The Avro IDL for this schema is located in sample.avdl.