The latest news from The Hyve on Open Source solutions for bioinformatics

TranSMART: Where should the platform go in the coming years?

April 11, 2015 | By Kees van Bochove

The TranSMART community has been growing at a very rapid pace. It’s seen more success in the last two years than its creators at J&J and Recombinant would have imagined in 2012. In this article, I would like to lay out some perspectives on where tranSMART should go, from a community, code and content perspective. Not incidentally, these perspectives are also the names of the ‘3C’ streams of the tranSMART Foundation. I will start with a small introduction on tranSMART and translational research data science.

 

Why is tranSMART so popular?

TranSMART has been a huge boon for pharmaceutical companies, hospitals, patient organisations and especially collaborative projects between such entities, like CTMM TraIT and IMI eTRIKS. Its main strength, in my opinion, is that it provides an out-of-the-box toolset to deal with the very complex, often boring and sometimes frustrating challenges of translational data management.

Translational research, the chasm of science in between clinical research, bioinformatics and epidemiology that is so crucial to precision medicine, is still in its infancy when it comes to adequate tooling. Good data science in this area is for the most part still a pretty mundane activity: it consists of finding and cleaning data, parsing file formats, mapping clinical and biological concepts and identifiers, (re)normalizing data and the like. After all this hard and repetitive work, maybe 10% of the productive time of a data scientist is left for the actual relevant activity: interpreting the data in its clinical, biological and epidemiological context. And that’s assuming you’ve already dealt with all the legal, ethical and social aspects that come for free with patient data!

TranSMART offers at least some tools to deal with these challenges. I admit these tools are far from perfect (see Code below), but given the tremendous popularity of the product, it’s clear that it meets a serious need. So where should tranSMART go in the coming years to capitalise on this momentum?

 

The Community perspective

The current tranSMART community is probably the best reason to become involved with the platform. The fact that tranSMART somehow brought together a worldwide community of scientists across the academic, industry and non-profit sectors who are all committed to improving the state of translational research using one open source platform is remarkable in itself. It was also promising to see how fast the tranSMART Foundation was able to secure stable funding from across those sectors after its start in 2013, without relying on any upfront grants or funds.

However, that doesn’t mean that there is no work to be done. Apart from even more collaboration in the current community (case in point: the tranSMART Foundation could really use a community manager in Europe, anyone?) and alignment with i2b2, I think that a lot of value and sustainability could be added by more integration and interfacing with neighbouring communities and organizations.

Born out of - amongst others - the Pistoia Alliance, tranSMART tends to be a bit industry-focused, and could do very well by interfacing more with important public institutes and funding agencies, such as EBI and NIH. I would also love to see the tranSMART community become more involved with the Global Alliance for Genomics & Health, to bring the translational angle into this important alignment effort in the area of genomic data.

 

The Code perspective

From a code perspective, tranSMART clearly needs to move up a level. TranSMART 1.2 is saturated with all kinds of features, which is great, but from a code maintenance perspective, we have reached a threshold of technical debt that calls for a major rewrite. Also, the full stack approach using Grails might have been great 7 years ago, but today it just doesn’t fit the way we think about software development and lifecycle management in R&D informatics anymore - or in the software development field in general, for that matter. Finally, the user interface is in need of a fundamental overhaul.

Luckily, this rewrite is already underway, as illustrated by the recent tranSMART 2.0 Architecture Recommendation that was co-authored by all core developers of the platform worldwide. The implementation of this proposal should launch tranSMART in a direction that will facilitate and accommodate the growth of the platform that is anticipated for the coming years.

In the proposed architecture, the core function, the data warehouse for clinical and omics data, is exposed by a RESTful API (which by the way is already part of tranSMART 1.2, so gradual transitions to 2.0 will be possible). On top of this API, a whole range of applications can be built: a re-implementation of the ‘classic’ tranSMART cohort explorer GUI as a web app in AngularJS (a prototype is already in the works), but also workflows in R / Shiny, Jupyter / IPython, and Scala / Spark Notebook, focused downstream analysis portals such as cBioPortal, enterprise visualisation platforms such as Spotfire, etc.
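To make this layering a bit more tangible, here is a minimal sketch of the kind of helper a downstream application might put on top of such an API. The resource path (`studies/<id>/observations`), the JSON field names (`subject`, `label`, `value`), the host name and the study identifier are all assumptions for illustration, not the documented tranSMART API, and no network call is made: the response body is a hard-coded stand-in.

```python
import json
from urllib.parse import quote, urljoin


def observations_url(base_url: str, study_id: str) -> str:
    """Build the URL for a study's observations (hypothetical path layout)."""
    path = f"studies/{quote(study_id, safe='')}/observations"
    return urljoin(base_url.rstrip("/") + "/", path)


def pivot_observations(payload: str) -> dict:
    """Turn a flat JSON list of observations into one record per subject."""
    table = {}
    for obs in json.loads(payload):
        table.setdefault(obs["subject"], {})[obs["label"]] = obs["value"]
    return table


# Stand-in for an HTTP response body; the field names are illustrative assumptions.
sample = json.dumps([
    {"subject": "S1", "label": "Age", "value": 54},
    {"subject": "S1", "label": "Sex", "value": "F"},
    {"subject": "S2", "label": "Age", "value": 61},
])

print(observations_url("https://transmart.example.org/api", "DEMO_STUDY"))
print(pivot_observations(sample))
```

The point of the sketch is that a cohort explorer, a notebook or a visualisation tool would only need to agree on the shape of such responses, not on the database behind them.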

We can also expect scalable genomics data storage and processing backends to be added to the stack soon, such as ADAM (based on Spark). Horizontal scalability is crucial for storing thousands of genomes. Also, columnar storage formats such as Parquet and Cassandra’s SSTable are much better suited for many genomics data retrieval use cases than their RDBMS counterparts.
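The argument for columnar layouts can be illustrated without any particular storage engine. The sketch below is plain Python, not Parquet, ADAM or Cassandra, and the variant records are made up. It only shows the access pattern: a query that needs one field has to walk every full record in a row layout, while in a column layout it reads a single list.

```python
# Row-oriented layout: one full record per variant call, as an RDBMS row would hold it.
rows = [
    {"sample": "S1", "chrom": "1", "pos": 1000, "genotype": "0/1"},
    {"sample": "S2", "chrom": "1", "pos": 1000, "genotype": "1/1"},
    {"sample": "S3", "chrom": "1", "pos": 1000, "genotype": "0/0"},
]


def to_columns(records):
    """Transpose a list of records into one list per field (column-oriented)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def count_genotype_rows(records, genotype):
    """Row layout: every full record is visited to inspect a single field."""
    return sum(1 for record in records if record["genotype"] == genotype)


def count_genotype_columns(cols, genotype):
    """Column layout: only the 'genotype' column is read."""
    return cols["genotype"].count(genotype)


columns = to_columns(rows)
print(count_genotype_rows(rows, "0/1"), count_genotype_columns(columns, "0/1"))
```

Both functions give the same answer; the difference that matters at the scale of thousands of genomes is how much data has to be read to get it.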

Another important advantage of aligning the platform around one data API is that there is no need for any full-scale enterprise federation architecture. Instead, as long as we provide sufficient metadata and security in the API, we can create applications on top of multiple tranSMART data sources. This would allow interaction on a combination of content from in-house and public data repositories.
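As a rough sketch of what an application on top of multiple data sources could look like, assume each source can be wrapped in a function that answers the same question in the same record format. The two sources below are in-memory stand-ins with made-up data, and names such as `query_all` and the `source` field are hypothetical; authentication and metadata handling, which the API would have to provide, are left out.

```python
from typing import Callable, Dict, Iterable, List

# A "source" is any callable that returns observation records for a concept.
Source = Callable[[str], Iterable[dict]]


def query_all(sources: Dict[str, Source], concept: str) -> List[dict]:
    """Ask every source for the same concept and tag each record with its origin."""
    combined = []
    for name, fetch in sources.items():
        for record in fetch(concept):
            combined.append({**record, "source": name})
    return combined


def in_house(concept: str) -> List[dict]:
    """Stand-in for an in-house instance (made-up data)."""
    return [{"subject": "A1", "concept": concept, "value": 5.2}]


def public(concept: str) -> List[dict]:
    """Stand-in for a public repository (made-up data)."""
    return [{"subject": "P7", "concept": concept, "value": 4.8}]


results = query_all({"in_house": in_house, "public": public}, "HbA1c")
print(results)
```

Because every record keeps its origin, the application can combine in-house and public content while still being able to tell them apart.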

 

The Content perspective

From a content perspective, there is also still a world to conquer for tranSMART. Part of the attractiveness of tranSMART as a ready-to-go solution is that it forces users to put in their data in a particular format, allowing downstream analysis tools to work with some assumptions about the provided data. On a scale between complete flexibility of the data model versus a completely rigid data model, tranSMART is pretty much at the side of rigid for omics data and more flexible on the clinical data side, where it uses the observation-centric i2b2 star model. However, the downside of that is that it’s not trivial to get data into tranSMART, and it’s left to the data management process of the organisation whether any ontologies are used to map clinical data, and how these standards are enforced.
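For readers unfamiliar with the observation-centric star model, the following is a heavily simplified sketch using an in-memory SQLite database. The table and column names follow the i2b2 naming convention (`observation_fact`, `patient_dimension`, `concept_dimension`), but the schema is reduced to a handful of columns, the keys are simplified and the data is made up, so it should not be read as the actual tranSMART schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY, sex_cd TEXT);
CREATE TABLE concept_dimension (concept_cd TEXT PRIMARY KEY, name_char TEXT);
-- The fact table: one row per observation about one patient.
CREATE TABLE observation_fact (
    patient_num INTEGER REFERENCES patient_dimension(patient_num),
    concept_cd  TEXT REFERENCES concept_dimension(concept_cd),
    nval_num    REAL
);
INSERT INTO patient_dimension VALUES (1, 'F'), (2, 'M');
INSERT INTO concept_dimension VALUES ('AGE', 'Age at baseline'), ('BMI', 'Body mass index');
INSERT INTO observation_fact VALUES (1, 'AGE', 54), (1, 'BMI', 23.1), (2, 'AGE', 61);
""")


def values_for(concept_name: str):
    """Return (patient_num, value) pairs for one concept, joined via its dimension table."""
    return conn.execute(
        """SELECT f.patient_num, f.nval_num
           FROM observation_fact f
           JOIN concept_dimension c ON c.concept_cd = f.concept_cd
           WHERE c.name_char = ?
           ORDER BY f.patient_num""",
        (concept_name,),
    ).fetchall()


print(values_for("Age at baseline"))
```

Adding a new clinical variable only means adding a concept and more fact rows, which is where the flexibility on the clinical side comes from. Nothing in the schema itself enforces that `'AGE'` is mapped to a shared ontology, which is exactly the part left to the data management process.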

 

If you ask early adopters of tranSMART (e.g. Pfizer, Sanofi, Takeda, TraIT, eTRIKS etc.) for success stories around tranSMART, you will quickly find out that the primary impact is improving availability of data. In fact, data sharing seems to be the primary business need that tranSMART fulfils. Only it’s an incredibly complex kind of data sharing: translational research data is only really useful with the right informed consent from the patients, clinical data that is sufficiently mapped and harmonised with research-focused clinical data ontologies, biomarker data that is accompanied by the right provenance, mapped to universal biomolecule identifiers and correctly normalised (but not so much that the original data entropy is lost), and sample metadata that accurately reflects the properties and conditions of the source biosamples the data was acquired from. By these standards, there are still very few real translational research datasets available.

 

Data Sharing

A whole next level could be reached if the tranSMART community not only shared and exchanged curated and annotated datasets, but also actively created a global marketplace where supply and demand of translational research data meet. If tranSMART data repositories were, for example, to implement the FAIR Guiding Principles, making sure that tranSMART datasets are globally findable, accessible, interoperable and re-usable, this would open up a whole new world for translational data science.

 

Data from large data archives such as dbGaP, EGA, TCGA etc. often remain inaccessible to researchers in practice, simply because they don’t get around to the access procedures. However, it’s very hard to alleviate the fundamental tension between scientific interests and patient privacy rights. Many leading scientists have already dedicated their thinking and resources to this quest. Maybe we should look to patient organisations to lead the way here, as demonstrated by the planned Michael J. Fox Foundation tranSMART datathon.

 

In short, there is a tremendous opportunity for tranSMART to become the de facto platform for exchanging translational research data, leading by example, but this requires a joint effort from the current community to position and improve tranSMART as such!

 

Kees van Bochove is CEO of The Hyve, a company based out of Utrecht, Netherlands and Cambridge, MA, USA, that provides commercial support for open source software in translational research. He leads the Architecture Working Group of the tranSMART Foundation.