Portable Format for Biomedical Data

PFB is a serialization file format designed to store bio-medical data, as well as metadata. The format is built on top Avro to make it fast, extensible and interoperable between different systems.

What is an Avro File?

A file with data records (JSON) and a schema (JSON) to describe each data record. Avro files can be serialized into a binary format and compressed.

Read more about Avro.

What is a PFB File?

A PFB file is special kind of Avro file, suitable for capturing and reconstructing biomedical relational data.

A PFB file is an Avro file with a particular Avro schema that represents a relational database. We call this schema the PFB Schema.

The data in a PFB file contains a list of JSON objects called PFB Entity objects. There are 2 types of PFB Entities. One (Metadata) captures information about the relational database and the other (Table Row) captures a row of data from a particular table in the database.

The data records in a PFB file are produced by transforming the original data from a relational database into PFB Entity objects. Each PFB Entity object conforms to its Avro schema.

Vanilla Avro vs PFB

Let’s say a client receives an Avro file. It reads in the Avro data. Now a client has the Avro schema and all of the data that conforms to that schema in a big JSON blob. It can do what it wants. Maybe it wants to construct some data input forms. It has everything it needs to do this since the schema has all of the entities, attributes, and types for those attributes defined.

Now what happens if the client wants to reconstruct a relational database from the data? How does it know what tables to create, and what the relationships are between those tables? Which relationships are required vs not? This is one of the problems PFB addresses.

Schema

PFB file is consist of list of Entities, where Entity is a container type (it stores the actual data) in object field. The id, name and relations fields used for linking each entity to other entities in the PFB file. The relations consists of a destination id and name: so each entity is linked to some other entity (note, there is no backward links). The id and relations is only relevant for clinical data, so for Metadata entity it’s null.

The only entity which is required in the PFB file is Metadata: it describes the data model linking and stores ontology references to make interoperability with other systems easier. For each node and fields multiple links to ontologies can exists. The ontology is user defined and it’s possible to perform a schema evolution on the PFB file to harmonize all of the data to a single version of a given ontology.

The Avro IDL for Metadata:

record Metadata {
    array<Node> nodes;
    map<string> misc;
}

With Node and Property defined as:

record Node {
    string name;
    string ontology_reference;
    map<string> values;
    array<Link> links;
    array<Property> properties;
}

record Property {
    string name;
    string ontology_reference;
    map<string> values;
}

In both Node and Property the name is the name of the field in the data model, ontology_reference stores the name for ontology reference (any ontology, e.g. NCIt, caDSR) and values stores any additional field regarding ontology: URL, ontology codes, synonyms etc. Same for Property.

In addition, inside links for Node it’s important to specify the linkage for the nodes: destination and multiplicity.

The Link and Multiplicity enum is presented below:

record Link {
    Multiplicity multiplicity;
    string dst;
}

enum Multiplicity {
    ONE_TO_ONE,
    ONE_TO_MANY,
    MANY_TO_ONE,
    MANY_TO_MANY
}

Types

Enum

Because Avro can’t store anything except alphanumeric and _ symbols, all enums are encoded in such a way, that all other symbols is being encoded with codepoint wrapped in single underscores. For example bpm > 60 will be encoded in: bpm_32__62__32_60, so space ` ` is encoded as _32_ and greater sign > into _62_. Same for Unicode characters: ä - _228_, ü - _252_. The Avro schema also doesn’t allow for the first character to be a number. So we encode the first character in the way if the character happens to be a number.

Example

The schema on the image below shows the simplest model with only Demographic data.

schema

The Avro IDL for this schema is located in sample.avdl.