The latest news from The Hyve on Open Source solutions for bioinformatics

Data warehouses for life sciences: more than storage alone

July 12, 2017 | By Wibo Pipping

Sharing, visualizing and analyzing your data

Your institute or company is executing top-level research. You know that the generated data is available somewhere, but it is spread over many disparate databases and (shared) file drives, in many different formats and terminologies. How do you bring this all together? That is where data warehouses come in.

 

But where do you start when looking for a data warehouse that suits you? What should the data warehouse do for you? Do you only want storage and harmonisation, or also ways to analyze the data?

The first step is defining the precise problem you want to solve.

 

Problems with organizing collected data

As a researcher in life sciences, you gather many different types of data over time: medical data, genomics and other “omics” data, measurements from wearables, etcetera. You want to combine the data from these sources and get answers to your research questions.

That is when you are faced with the following problems:

  • You do not have a clear overview of the data that is available to you. It is stored in different locations (e.g. an Access database, an SPSS file on a PhD student’s computer) and managed by different people
  • The data reside in different file formats
  • Different terminologies and definitions of variables were used in the different studies
  • Some data is at a level where it doesn’t make sense to integrate it yet (e.g. images, raw sequencing reads)
  • Codebooks and metadata on how data was collected are missing

 

Trying to solve these problems all at once can be overwhelming, even if you are an expert in your field. A data warehouse can be the solution to these problems.

 

What does a data warehouse do?

A data warehouse does more than just store your data. It forces you to store all your data in predefined formats. This makes your data more interoperable and easier to integrate. It also reduces the amount of manual work when creating reports or visualisations based on your data.

Typically, once you have loaded your data, a data warehouse allows you to:

  • Organize
  • Query
  • Share
  • Analyze
  • Visualize
  • Create reports
  • Add metadata on how and where the data were collected

 

What does a data warehouse not do?

A data warehouse does not magically harmonise all your data. It will still take some work to bring your data to the same format and terminologies. What it does give you is one single format, the means to define company-wide or institute-wide standards, and a single point of truth for study data for your users. It lets you and your users make assumptions about how the data is stored, which is crucial for providing standardised visualisations and analyses!
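To give an idea of what that harmonisation work can look like, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the study names, the variable names, the value codes and the `harmonise` function are assumptions, not part of any particular data warehouse. It renames study-specific variables to one shared standard, converts units and recodes values.

```python
# Hypothetical per-study mappings from local variable names to a shared standard.
STUDY_MAPPINGS = {
    "study_a": {"sex": "gender", "wt_kg": "weight_kg"},
    "study_b": {"Gender": "gender", "weight_lb": "weight_kg"},
}

# Hypothetical unit conversions, keyed by (study, local variable name).
UNIT_CONVERSIONS = {
    ("study_b", "weight_lb"): lambda pounds: round(pounds * 0.45359237, 1),
}

# Hypothetical value codings, mapped to one shared vocabulary.
VALUE_CODES = {
    "gender": {"m": "male", "f": "female", "1": "male", "2": "female"},
}


def harmonise(study, record):
    """Rename variables, convert units and recode values for one record."""
    mapping = STUDY_MAPPINGS[study]
    result = {}
    for local_name, value in record.items():
        if local_name not in mapping:
            continue  # variables without an agreed mapping are left out
        standard_name = mapping[local_name]
        convert = UNIT_CONVERSIONS.get((study, local_name))
        if convert is not None:
            value = convert(value)
        codes = VALUE_CODES.get(standard_name)
        if codes is not None:
            # Fall back to the original value when no code is defined.
            value = codes.get(str(value).lower(), value)
        result[standard_name] = value
    return result


record_a = harmonise("study_a", {"sex": "M", "wt_kg": 70.0})
record_b = harmonise("study_b", {"Gender": 2, "weight_lb": 154.0})
print(record_a)
print(record_b)
```

In this sketch both records end up with the same variable names, units and value codes, so they can be queried and visualised together. In practice, agreeing on the mappings is usually the hard part, not the code.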

 

So what is the best data warehouse for me?

Before you can start to answer this question, you have to ask: 'What should the data warehouse do for me?' Different applications have been developed specifically for life science data, each with its own area of expertise, advantages and disadvantages. Some focus on a very flexible data model that allows you to store many different kinds of data. Others focus more on visualisations and make more assumptions about the data types that can be stored. The decision on which data warehouse to choose comes down to what data you have and what you want to do with it.

 

Some open source data warehouse examples are:

 

If you want to learn more about data warehousing, consider subscribing to our mailing list, leave a comment with your question, or get in touch directly.