Choosing a Common Data Model
for FAIR biomedical data
Much of The Hyve's day-to-day business can be summarized as helping customers make their biomedical data FAIR. In practice, this means we have now executed dozens of projects harmonizing biomedical data, with healthcare sources ranging from local GP offices to clinical trials run by the largest pharmaceutical companies in the world, and with biological data ranging from whole-genome sequencing to physical activity data. In the course of these projects, we have encountered and worked with a dozen different models for representing clinical data, such as i2b2/tranSMART, OMOP, FHIR, RDF, CDISC SDTM, ODM, etc. Using a standard, instead of just an Excel file and a codebook, greatly helps to implement FAIR principle R1.3: (meta)data meet domain-relevant community standards.
In this blog post I will address some of the questions that people often ask me about common data models for FAIR biomedical data.
"How do you choose which data model applies to your data management process?"
This is not an easy question to answer, and it depends on many factors, such as the nature (observational? interventional?) and source of the data, update frequency, intended usage, current and expected data modalities, and applicable rules, regulations, and best practices. A nice example of a context-based comparison of data models is the recent EMA report comparing OMOP and Sentinel for European healthcare data. But of course there are general patterns: a national health data network will have a different data integration approach than a bench-side cell line experiment. I would like to share a few insights, structured around the questions we frequently get about common data models as a step towards making biomedical healthcare data FAIR.
"Can you give an example?"
Let's start with a generalized use case: looking for a data model and tooling that could help harmonize several biomedical studies (e.g. clinical trials, or investigator-initiated studies) for analytics purposes. An important question here is who the beneficiaries of this data integration are, and what type of analysis they are planning to do with the data (the 'use case' in IT lingo). In general, especially when starting out, you should try to demonstrate and maximize the utility of the approach directly to the end users (e.g. translational medicine scientists). If you approached The Hyve with such a scenario today, you could get routed to a number of different technical integration approaches, for example:
using OHDSI tooling to query observational health datasets in a uniform way, leveraging the OMOP data model
leveraging an interface such as Glowing Bear on top of conformed clinical trial data using the i2b2/tranSMART data model
using cBioPortal to provide oncology researchers with direct access to integrated clinical data (survival status, cancer staging, etc.) and omics data (SNPs, CNVs, etc.)
using PhUSE or a similar RDF-based approach to ensure that the data is queryable using semantic web and search tools such as Disqover
"So what’s The Hyve doing with this?"
We are engaged with many of our academic hospital and big pharma company customers to build out some level of data harmonization to benefit analytics and data reuse per the FAIR principles, from a quick POC around a few studies, to a company-wide strategy for making R&D data FAIR. In this outtake from a recent Pistoia webinar, I went through a few examples (see the video below):
"Could you please explain the difference between OMOP and i2b2?"
This is a question I get often, along with the question of when you would use one versus the other. There is much more to say here than a blog post can really do justice to. On the other hand, it is definitely a valid question and there isn't a lot of practical information out there on these topics, so here is an attempt to briefly summarize some of our findings at The Hyve, where we support all of these models.
Let’s start with OMOP. The OMOP data model is very well suited for observational data and is widely used (with an estimated 1 billion health records worldwide represented in it). Its scope is to model (observational) medical history data in such a way that it enables systematic analysis of associations between interventions (drug exposures, procedures, healthcare policy changes, etc.) and the outcomes of these interventions (condition occurrences, procedures, drug exposures, etc.). The default OHDSI tool ATLAS excels at performing this type of systematic analysis over multiple databases, and the associated tools also allow advanced analyses such as patient-level prediction. There is a ton of material highlighting the use of OHDSI and OMOP, including the ETL and analytics involved, on the OHDSI YouTube channel. The OMOP model can be found here.
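To make the exposure/outcome orientation of the model concrete, here is a minimal, runnable sketch using an in-memory SQLite database. The three tables are a heavily simplified slice of the OMOP CDM (the real model has many more tables and columns), and the concept IDs, dates, and patients are invented purely for illustration:

```python
import sqlite3

# A much-simplified slice of the OMOP CDM. Table and column names follow
# CDM conventions, but this is an illustrative sketch, not the full model.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER,
    gender_concept_id INTEGER
);
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    drug_concept_id INTEGER,          -- standardized vocabulary concept
    drug_exposure_start_date TEXT
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
""")

# Toy data: two patients exposed to a made-up drug concept (1124300),
# one of whom later has a made-up condition concept (437663).
cur.executemany("INSERT INTO person VALUES (?, ?, ?)",
                [(1, 1970, 8507), (2, 1985, 8532)])
cur.executemany("INSERT INTO drug_exposure VALUES (?, ?, ?, ?)",
                [(10, 1, 1124300, "2020-01-01"),
                 (11, 2, 1124300, "2020-02-01")])
cur.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
                [(20, 1, 437663, "2020-01-15")])

# The exposure/outcome question the model is built around: of the persons
# exposed to a drug, how many had the condition after exposure started?
cur.execute("""
SELECT COUNT(DISTINCT de.person_id)
FROM drug_exposure de
JOIN condition_occurrence co
  ON co.person_id = de.person_id
 AND co.condition_start_date >= de.drug_exposure_start_date
WHERE de.drug_concept_id = 1124300
  AND co.condition_concept_id = 437663
""")
n_exposed_with_outcome = cur.fetchone()[0]
print(n_exposed_with_outcome)  # persons with the outcome after exposure
```

Note how every clinical fact is keyed to a concept ID from the standardized vocabularies; it is this uniform coding, rather than the tables themselves, that lets OHDSI tools run the same analysis across many databases.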
"Can you give us a flavour of the OHDSI community?"
The pharma company Janssen, with Patrick Ryan one of the main driving forces behind the OHDSI community, can be seen in these talks leveraging OMOP databases containing several hundred million patient records. The scale is impressive, and the community is truly worldwide; for example, Korea has a very active K-OHDSI community. The meetings are also famous for their great atmosphere. Last year in Rotterdam we even had one of our close collaborators and leaders in OHDSI Europe, prof. Peter Rijnbeek from ErasmusMC, singing on stage! My personal favorite OHDSI gimmick is 'LegendMed Central', which holds millions of automatically generated research papers, showing how real the need is for systematic, large-scale evidence generation in observational research, versus single hypothesis-based studies (background). It’s a great mix of passion for helping patients and exercising scientific rigor, while also enjoying friendship and a sense of community.
"What is The Hyve doing with OMOP/OHDSI?"
We are supporting multiple customers with their enterprise-scale implementations of OMOP, and also provide OMOP mapping and OHDSI tools installation services. In the EHDEN project, where we lead the technical implementation work package together with Janssen, we are even rolling out a European federated health data network based on OMOP! I would highly recommend joining our upcoming OHDSI Europe symposium, 29-31 March in Rotterdam, if you are interested in hearing more about that.
"And what about i2b2/tranSMART and other models such as CDISC?"
The i2b2/tranSMART, FHIR, openEHR, CDISC, and cBioPortal data models and associated applications each enable different use cases and have their own scope. For example, cBioPortal (both data model and application) is specifically built to analyse cancer genomics datasets and study the relation between the genomic make-up of the cancer cells and clinical outcomes such as survival. cBioPortal can be seen in action on its public website (see e.g. a patient view and a cohort view). The CDISC models, in turn, are optimized to represent clinical study and trial data in a way that is transparent for regulators, and the i2b2/tranSMART data model is well suited to represent this type of data alongside EHR data (which could even come from OMOP) and to support exercises such as data browsing, data access requests, and cohort selection and sharing with tools like Glowing Bear and Podium. The i2b2/tranSMART model leverages a star schema; you can check out the overview and loading tools.
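To illustrate what a star schema means in practice, here is a minimal SQLite sketch using i2b2-style table names (a central observation_fact table joined to dimension tables). The columns are a small subset of the real i2b2 schema, and the concept codes, paths, and patients are invented for illustration:

```python
import sqlite3

# Stripped-down sketch of the i2b2 star schema: one central fact table
# surrounded by dimension tables. Names follow i2b2 conventions, but this
# is an illustration, not the full schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE patient_dimension (
    patient_num INTEGER PRIMARY KEY,
    birth_date TEXT,
    sex_cd TEXT
);
CREATE TABLE concept_dimension (
    concept_cd TEXT PRIMARY KEY,
    concept_path TEXT,           -- hierarchical path drives the tree UI
    name_char TEXT
);
CREATE TABLE observation_fact (
    patient_num INTEGER REFERENCES patient_dimension(patient_num),
    concept_cd TEXT REFERENCES concept_dimension(concept_cd),
    start_date TEXT,
    nval_num REAL                -- numeric value, if any
);
""")

cur.executemany("INSERT INTO patient_dimension VALUES (?, ?, ?)",
                [(1, "1970-05-01", "M"), (2, "1985-09-12", "F")])
cur.executemany("INSERT INTO concept_dimension VALUES (?, ?, ?)",
                [("LAB:HGB", "\\Labs\\Hematology\\Hemoglobin\\", "Hemoglobin"),
                 ("DEM:AGE", "\\Demographics\\Age\\", "Age")])
cur.executemany("INSERT INTO observation_fact VALUES (?, ?, ?, ?)",
                [(1, "LAB:HGB", "2021-03-01", 13.2),
                 (2, "LAB:HGB", "2021-03-02", 10.1)])

# Cohort selection, i2b2-style: count patients with any observation under
# a concept path, filtered on the observed value.
cur.execute("""
SELECT COUNT(DISTINCT f.patient_num)
FROM observation_fact f
JOIN concept_dimension c ON c.concept_cd = f.concept_cd
WHERE c.concept_path LIKE '\\Labs\\Hematology\\%' AND f.nval_num < 12
""")
n_low_hgb = cur.fetchone()[0]
print(n_low_hgb)  # patients with hemoglobin below 12
```

Because every observation, whatever its type, lands as a row in the single fact table keyed by a hierarchical concept path, the same generic query pattern supports browsing and cohort selection over trial data, EHR data, or both side by side.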
(Glowing Bear user interface walkthrough)
Again, this quick generalization doesn't really do all these tools and models justice, and there is a lot more to say about, for example, the rise of FHIR, the importance of architecting for scalability and support of multiple modalities, the metadata that should go with these datasets, and our latest research on data model cross-overs, such as representing clinical trials in OMOP. Please reply below with your opinions or suggestions for what you would like to see covered in my next post!