The tranSMART data warehouse was originally designed for use in individual clinical studies with hundreds or thousands participants in which maybe tens of thousands observations were gathered. However, today tranSMART is also being adopted by hospitals and large population studies. Upscaling the data warehouse to many thousands of participants and millions of observations is easier said than done.
An excellent example of a large population study is the Netherlands Twin Register (NTR), run by the Psychology Department of the Vrije Universiteit (VU) in Amsterdam. Their data warehouse needs to work with a dataset of 200 thousand subjects and 500 million observations. Besides, these figures will continue to grow as participant information is still being collected.
Upscaling tranSMART and its user interface, Glowing Bear, for the NTR research data we quickly encountered the biggest problem: performing a query or creating an export could take up to hours or never finish at all.
To solve this performance problem, we tried and tested a range of solutions. Some proved helpful, other not so much. Here an overview of the tricks that IT specialists have up their sleeves to speed up processes:
One way to improve query speed, is by adding indexes: a well-organized list of variables referring to the location where the data can be found. Anyone who’s ever used the index at the back of a book know how it works: a keyword refers to the pages containing information on that topic. In case of the NTR database, the keywords might be participant characteristics such as weight, height, smoking, and other health and lifestyle aspects. In this case, the index does not refer to pages but to studies with observations on these characteristics. The Hyve’s colleagues tested several observation indexes and eventually found one that significantly improved query speed.
Another way to speed up queries is by creating partitions: breaking the dataset up in manageable compartments. The NTR-dataset, however, is complex and a lot of the information is interconnected. The register has not only a large number of participants, per person a massive amount of data is collected, and these data are collected in multiple studies. We noticed that partitioning the database did indeed improve queries from one specific study, but when exporting data from multiple studies it slowed down the process significantly. Therefore, this idea was abandoned for these data.
- Bit strings
What did prove helpful was the addition of bit strings. A bit string gives a quick answer per variable for how many participants there is a record in the database, say the number of subjects that smoke, like to read, or participate in sports. The bit string consists of a long line of 1’s and 0’s. If the bit at position n is 1, it means that subject n is in the set.
Using bit strings also speeds up queries that combine two or more variables, for example when combining the results for the number of twins with the number of smokers. This is useful if a researcher is only interested in twins that smoke.
- Saving in between
Saving subject sets for single and combined queries proved to speed up the next query asking for the same subset. For example, when the result for ‘smoking’ and ‘twins’ is saved, a later search for ‘smoking twins’ will be much faster. Also, the data for these subsets will be retrieved significantly quicker when the combined search has been saved in the database.
- Query splitting
Splitting a query in smaller portions speeds up the process as well. This might be achieved by splitting a search involving a number of variables in searches per variable. Breaking up a query in small sections means that the database is better able to execute the query in memory, instead of performing it on disk. Writing information to a disk and retrieving it makes the process much slower.
The same trick can be applied when exporting data: splitting a data export with 100.000 subjects into 10 sections of 10.000 participants and then combining those into one export file should speed up the export process.
- Database replication
Another way of making optimum use of a database’s computing power and reducing the need for time-consuming on disk analysis, is by creating two or three copies of the database: database replication. These duplicates are automatically synchronised, so they all contain the same information. When performing a query the computing power of all three databases is used. A load balancer makes sure the tasks are distributed evenly across the databases.
Whereas this approach seems very promising, and in practice it does indeed improve performance there is a big disadvantage: the system as a whole becomes less stable. A second downside is that a setup with multiple databases increases hardware costs. Our conclusion therefore was that the NTR would not benefit sufficiently from database replication.
- Increase memory
On the other hand, running the database on a machine with more memory, a relatively easy solution, did prove useful.
The achieved improvement
For testing the effects of each intervention, we did not use the complete NTR-dataset but a subset containing ~235,000 subjects and ~65,000,000 observations. At first, just requesting the number of subjects and getting a list of all the variables in the database took more than 12 minutes. With more memory, using bit strings, and saving subsets, this query just takes 75 seconds.
Selecting all twins participating in one or two specified studies, resulting in a subset of about 100.000 subjects, used to take over 10 minutes. Now, the result is presented in 31 seconds. Retrieving the list of variables available for these subjects takes less than 10 seconds and an export of 500 variables is completed in under 7 minutes.
What’s perhaps most important: when performing these analyses on the full dataset (~345,000 subjects and ~505,000,000), the results are comparable. This proves that tranSMART and Glowing Bear are indeed capable of working with large and complex datasets, when appropriate measures are taken.