The latest news from The Hyve on Open Source solutions for bioinformatics

A business case for semantic data integration in Bioinformatics

August 02, 2011 | By Kees van Bochove

What is the major bottleneck in your bioinformatics research when it comes to data and software? I put this question to a number of bioinformatics researchers from the pharma and food industries at the ISMB conference. It's a very general question, but it never failed to provoke an answer. "Are you kidding? Where do you want me to start?" And then the story comes, about the challenges of managing the in-house data, connecting it with public data and annotations, and especially, interpreting what it all means in the context of their target biology. Because that is where the real bottleneck resides - in the data interpretation.

It's no coincidence that the theme of the EMBL Programme 2012-2016 is 'Information Biology'. It's also no coincidence that the first keynote talk of ISMB was titled 'Computational biology in the 21st century: making sense out of massive data'. Yes, we still have a lot of challenges in data management - my favorite ISMB quote came from Dominic Clark, who presented the results of an EBI database user survey and showed that the most common activity of database users in industry is retrieving records from the databases. With a total of 10 petabytes of storage at EMBL (of which I suspect EBI databases make up a significant part, see the ELIXIR business case) it makes sense that smaller businesses cannot afford a local mirror. They will have to use online database searches to carry out their research. However, keeping in sync with EBI databases is what I would view as a small, solvable infrastructural problem. Interpreting the data, finding relevant literature, finding relevant databases, and then coming up with a sensible, testable hypothesis of the biological mechanisms at hand, that's where the real time and money goes.

So what is the 'rate limiting step' in this process? I would say there are a number of them, as always in biology. One of them has to do with finding relevant literature, and extracting the information you need from that, turning it into knowledge.

I remember that during my Bioinformatics Master project, which took only a few months, I printed a pile of scientific journal articles that was literally half a meter high, and I read most of it - with the single goal of finding relevant knowledge and datasets for the lipoprotein metabolism processes I was studying. There simply was no other way - I could deduce from the abstract whether a paper would be relevant at all, but not whether sentences in it would provide valuable hypotheses for the mechanisms I was studying, whether statistics from the summary tables would contain clues for my research, whether the datasets used might also be useful for me, and so on. OK, now it's 2011 - enter Mendeley, but back in 2008, such a tool did not exist, at least not to my knowledge.

Another important bottleneck, however, is semantic integration of the data. Let's suppose you have been studying cholesterol metabolism for quite some years, and that you have tons of old microarray data from experiments performed in both mouse and human, along with a fair amount of metabolomics analyses of the same samples. Currently, you are investigating the role of the PPAR-alpha receptor, and you come across an interesting paper whose authors report a relation between the presence of eicosapentaenoic acid (EPA) and the activity of PPAR-alpha. Immediately, a lot of questions come to your mind. Do I see this relation in our in-house data as well? Do I have some datasets in which PPARA mRNA expression was changed, and where EPA levels were also measured? Did the mice or humans in these experiments happen to use any PPARA-targeting drugs, or did they happen to have changed their EPA intake, maybe because of a diet intervention? Which diets might cause such an intake change? Do I have enough data to do a mouse model - human comparison on this? Lately, you have also been studying diabetes as a group - is there also an interaction with T2DM, or is there no significant difference in mRNA level change between T2DM and 'healthy' subgroups? Is there maybe any public microarray data from GEO or ArrayExpress with PPAR-alpha or EPA-related annotations which might be of interest to compare with? Oh, and by the way, how does the data look for related molecules, other omega-3 fatty acids such as DPA? Good luck answering these questions when you have to look up these kinds of annotations or knowledge by going through PDFs and merging Excel datasets! It will cost you days, and when you finally have an answer, a new publication comes out which forces you to re-arrange and re-evaluate the data.

Imagine you could see the answers to all these questions with the click of a button - that would really be a disruptive technology! Maybe, in 5 years, we really can. It would be interesting to evaluate what exactly we would need on a technical level to make that true for this use case, but that is another lengthy blog post. At the core of it is that you need semantic annotations of your data - you need to link concepts like 'EPA', 'PPARA' and 'T2DM' throughout both literature assertions and datasets. Interestingly, at ISMB, I spoke at length with some people from Biomax AG, a German company which sells a solution for exactly that. They invented a proprietary semantic annotation and inference system in the nineties, long before Semantic Web standards such as RDF came around. This shows that the demand for this kind of semantic integration has been there for a long time - it's just that now, knowledge and data are generated at such an incredible rate that we really need to let the computer do some of the semantics for us, or else we cannot keep up anymore. This is illustrated by pharma companies teaming up to launch joint semantic annotation efforts, such as the Pistoia Alliance and IMI projects such as OpenPHACTS. If competitors join forces like this, you know there is a real demand - and hence, a business case. Semantic Web technology and ontology development are becoming mature right now, and once the community at large starts adopting them, we are looking at a paradigm change in the way we will do research.
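
To make the idea of linking concepts a bit more concrete, here is a toy sketch in Python. It is only an illustration under made-up assumptions, not how Biomax or any of the projects mentioned above actually work: the synonym table, the `concept:` identifiers, the dataset names and the helper functions are all hypothetical. The point is merely that once every dataset carries annotations that resolve to the same canonical concepts, a question like "which datasets cover both PPARA and EPA?" becomes a simple set query instead of days of going through PDFs and Excel sheets.

```python
# Toy sketch of concept-based dataset lookup.
# Everything here (IDs, dataset names, synonyms) is invented for illustration.

# Hypothetical synonym table: free-text labels -> canonical concept IDs.
# The IDs are placeholders, not real ontology accessions.
SYNONYMS = {
    "epa": "concept:EPA",
    "eicosapentaenoic acid": "concept:EPA",
    "ppara": "concept:PPARA",
    "ppar-alpha": "concept:PPARA",
    "t2dm": "concept:T2DM",
    "type 2 diabetes": "concept:T2DM",
}


def normalize(label):
    """Map a free-text label to a canonical concept ID, or None if unknown."""
    return SYNONYMS.get(label.strip().lower())


def annotate(labels):
    """Return the set of canonical concepts for the labels we recognize."""
    return {c for c in map(normalize, labels) if c is not None}


# Invented in-house datasets, each annotated with whatever labels
# the original experimenter happened to use.
DATASETS = {
    "mouse_array_01": annotate(["PPAR-alpha", "Eicosapentaenoic acid"]),
    "human_array_07": annotate(["PPARA", "T2DM"]),
    "human_metab_03": annotate(["EPA", "type 2 diabetes", "PPARA"]),
}


def find_datasets(*labels):
    """Return the names of datasets annotated with all of the given concepts."""
    wanted = set()
    for label in labels:
        concept = normalize(label)
        if concept is None:
            # Refuse to silently ignore a label we cannot resolve.
            raise ValueError(f"unknown concept label: {label!r}")
        wanted.add(concept)
    return sorted(name for name, concepts in DATASETS.items()
                  if wanted <= concepts)


if __name__ == "__main__":
    print(find_datasets("EPA", "PPARA"))
    print(find_datasets("T2DM", "ppar-alpha"))
```

In real life, the hard part is of course not the query but building and maintaining the mapping from free text to shared concepts, which is exactly what ontologies and Semantic Web standards such as RDF are meant to help with.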