The latest news from The Hyve on Open Source solutions for bioinformatics

Optimizing scientific data sharing

September 13, 2018 | By Ward Weistra

In the biomedical sciences, sharing research and patient data is becoming increasingly important. But how do you make sure that a scientist receives the information they need while protecting the privacy of patients and study participants? Working with the Netherlands Twin Register (NTR), The Hyve has tackled this dilemma and found some clever solutions.


The Netherlands Twin Register has been collecting data on Dutch twins, their parents and siblings for over 30 years, mainly via questionnaires. The Register is run by the Psychology Department of the Vrije Universiteit (VU) in Amsterdam and contains data from hundreds of twins. The VU scientists collaborate nationally and internationally with a number of research groups, and they wish to share the valuable data collected over the past decades with these colleagues. At the same time, they want to exchange the data in a controlled manner, protecting the privacy of participants and without disclosing the entire dataset.

 

Figure 1: Schematic overview of the scientific data sharing workflow for the Netherlands Twin Register. 1 – With low-barrier access to non-identifiable data, researchers can easily select variables of interest. 2 – Following an online request flow, the researcher can request access to the selected dataset. The data owner can configure the access approval process according to institute policies. 3 – With the Glowing Bear user interface, data managers can easily extract the requested set of data and export it to a ready-to-use package for the researcher.

 

1. Data Catalogue: Preview your data

The exchange of research data can be rather labour-intensive. At the moment, it usually starts with a scientist sending an e-mail with a request for specific data. This is often followed up by a series of e-mails to clarify which subset of data is wanted, whether the information has actually been collected, and whether the number of selected patients is large enough to answer a certain research question.
To speed up this process, The Hyve built a Data Showcase for the NTR. The Showcase taps into the data warehouse that contains all the research data (tranSMART in this case), but it displays only information that can be shared with third parties.

Figure 2: The NTR Data Showcase. With this data catalogue researchers can search for NTR variables of interest, see their description, aggregate counts and demographic breakdowns, with a low barrier to entry. The selected variables can be exported to a list to be used for requesting access.


In this way, a researcher browsing the Data Showcase gets insight into the variables that have been measured, such as physical characteristics like height and weight, socio-demographic factors, and physical and psychological health. At the same time, disclosure of person-specific information is prevented. For example, participant counts are available, but low counts are reported only as ‘less than 20’, and information on gender and age groups is pooled.
In the Data Showcase, the external researcher can select the variables that are relevant to their research, export these as a file and attach it to the request for data. This makes it easier for the research group to select the requested dataset and increases the chance that the fellow researcher receives the right subset.
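
The masking of low counts described above can be illustrated with a minimal Python sketch. This is not the Data Showcase implementation; the function names `mask_count` and `mask_counts` are hypothetical, and the threshold of 20 is taken only from the ‘less than 20’ wording in the text.

```python
from typing import Dict, Union

# Assumption: the threshold of 20 comes from the 'less than 20' label
# mentioned in the article; the real system may be configured differently.
THRESHOLD = 20


def mask_count(count: int, threshold: int = THRESHOLD) -> Union[int, str]:
    """Return the count itself, or a masked label if it is below the threshold."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count < threshold:
        return f"less than {threshold}"
    return count


def mask_counts(
    counts: Dict[str, int], threshold: int = THRESHOLD
) -> Dict[str, Union[int, str]]:
    """Apply mask_count to every variable in a {variable name: count} mapping."""
    return {name: mask_count(n, threshold) for name, n in counts.items()}
```

With this sketch, a variable measured in 150 participants would be shown with its count, while one measured in 3 participants would only be shown as ‘less than 20’.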

 

2. Data Request flow: Request access to data and samples online

Data requests are commonly handled via a paper form or e-mail, as at NTR, to ensure that the researcher has the right background to access the data and will handle it in an appropriate manner.
To streamline the request process, The Hyve has also developed Podium, a request portal that can be installed for a single customer but is also integrated into the BBMRI-NL website. Dutch biobanks and biomolecular research institutes collaborate in BBMRI-NL for the exchange of samples, images, and data. Moving from e-mail to an online form makes it easier to standardize the request process and gives clear insight into the status of each request. NTR is currently considering joining the BBMRI-NL Podium Request Portal.

Figure 3: The Podium request portal. Data owners can define the necessary fields and full request workflow to make sure the right information is provided by the researcher for the data owner to handle the access request.


The way a request is handled might differ across institutions and research groups, but the general request workflow, which is fully configurable, is detailed below. Together with the request for data, images or sample material, the online form might ask the external researcher to provide details on their affiliation, research question, and/or research protocol. At this stage, the file with variables exported from the Data Showcase may be added.

Figure 4: The general Podium request workflow. This fully configurable flow allows the data owner to ensure the right governing bodies approve the request and allows both researcher and data owner to see the status of the requests.


After submission of the request, a coordinator will assess whether the request is valid. If so, the coordinator will pass it on to a review board or data access committee, which evaluates whether the request complies with the institution’s guidelines on sharing scientific data, images and samples. After approval by the board, the coordinator or a data manager will gather the requested information, and possibly even physical samples, and send it to the external researcher. Upon receipt, the researcher can confirm the arrival and close the case.
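
The general workflow described above can be thought of as a small state machine. The sketch below is only an illustration in Python and does not reflect Podium's actual data model or API: the status names, the `Request` class and the `REJECTED` status are assumptions made for this example (the text only implies that a request can fail validation or review).

```python
from enum import Enum


class Status(Enum):
    # Hypothetical status names, chosen to mirror the steps in the text.
    SUBMITTED = "submitted"    # researcher has sent the request
    VALIDATED = "validated"    # coordinator found the request valid
    APPROVED = "approved"      # review board / data access committee approved
    DELIVERED = "delivered"    # data or samples sent to the researcher
    CLOSED = "closed"          # researcher confirmed arrival
    REJECTED = "rejected"      # assumption: failed validation or review


# Allowed transitions; in a configurable workflow this table would be the
# part that differs per institution.
TRANSITIONS = {
    Status.SUBMITTED: {Status.VALIDATED, Status.REJECTED},
    Status.VALIDATED: {Status.APPROVED, Status.REJECTED},
    Status.APPROVED: {Status.DELIVERED},
    Status.DELIVERED: {Status.CLOSED},
    Status.CLOSED: set(),
    Status.REJECTED: set(),
}


class Request:
    """A data request that records every status it has passed through."""

    def __init__(self, researcher: str) -> None:
        self.researcher = researcher
        self.status = Status.SUBMITTED
        self.history = [Status.SUBMITTED]

    def advance(self, new_status: Status) -> None:
        """Move to new_status, refusing transitions the workflow does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot go from {self.status.value} to {new_status.value}"
            )
        self.status = new_status
        self.history.append(new_status)
```

Keeping the allowed transitions in a single table is one way to make the flow configurable, and the recorded history is what would let both researcher and data owner see the status of a request.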

 

3. Detailed cohort selection and analysis

Bringing together the data subset that the researcher has selected in the Data Showcase can be a daunting task for data managers, who have to combine data from many different files and databases and decode them with codebooks. That is why, together with NTR, The Hyve has built a continuous pipeline from the NTR source data systems and codebooks to the tranSMART data warehouse. With the Glowing Bear user interface, data managers can easily define inclusion and exclusion criteria for a patient cohort and select the variables of interest to the researcher.

Figure 5: The Glowing Bear data selection interface. The interface allows for easy patient cohort selection with complex inclusion and exclusion criteria. Since both the Data Showcase and Glowing Bear are served from the same tranSMART data warehouse the selected variables from the Data Showcase can directly be used to limit the selected variables in the Glowing Bear dataset selection.


Since both the Data Showcase and Glowing Bear data selection interface are served from the same tranSMART database, data managers can directly use the set of exported variables from the Data Showcase to select the precise variables the researcher needs.
Glowing Bear allows the data manager to directly export the requested dataset as a text file that can be used in Excel, or as a SAS data package, to be shared with the researcher. The external researcher benefits as well, receiving all the information in one file instead of a dozen or more separate files.
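
To make the idea of cohort selection and single-file export concrete, here is a minimal Python sketch. It does not use the Glowing Bear or tranSMART APIs; the functions `select_cohort` and `export_tsv`, and the representation of participants as dictionaries, are assumptions made purely for illustration.

```python
import csv
import io
from typing import Callable, Dict, List, Sequence

Record = Dict[str, object]            # one participant: variable name -> value
Criterion = Callable[[Record], bool]  # returns True if the record matches


def select_cohort(
    records: Sequence[Record],
    include: Sequence[Criterion],
    exclude: Sequence[Criterion],
) -> List[Record]:
    """Keep records that meet all inclusion and no exclusion criteria."""
    return [
        r
        for r in records
        if all(c(r) for c in include) and not any(c(r) for c in exclude)
    ]


def export_tsv(cohort: Sequence[Record], variables: Sequence[str]) -> str:
    """Write only the requested variables to one tab-separated text block."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=list(variables),
        delimiter="\t",
        extrasaction="ignore",  # drop variables that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(cohort)
    return buf.getvalue()
```

In this sketch, the list passed as `variables` plays the role of the variable list exported from the Data Showcase: only those columns end up in the single file that is sent to the researcher.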


If you are interested in seeing how The Hyve can leverage its expertise and open source toolset to facilitate your data sharing, please get in touch!