For about a year now, I have been the technical project manager of the Nutritional Phenotype Database (dbNP) project, which was initiated by The European Nutrigenomics Organisation (NuGO). Nutrigenomics is what I would call ‘applied bioinformatics’. It is different from bioinformatics flavours such as next generation sequencing, metabolomics or transcriptomics in the sense that it has no data format of its own. The same goes for toxicogenomics or pharmacogenomics. Those bioinformatics disciplines just use whatever data they can use to answer biological research questions. Of course, you can still publish a nutrigenomics paper in which only metabolomics data is analyzed under, for example, two different diet conditions. But you won’t get a paper through that is just about ‘peak picking’.
So if one thing is important for nutrigenomics studies, it’s a clear description of your study design. Who or what did you measure, under which conditions, which events took place, when did you take your samples, etc. We also need such a description of the study design to link all the data gathered over the course of such an experiment. These data sets can be huge in terms of diversity: in one NuGO study, over thirty different ‘omics assays’ were performed on thousands of samples, all with different time points, groups and diet conditions. At that point, an Excel sheet simply can’t do the job anymore. So we decided to start out with this sorely needed part of the ‘database’. (Why do biologists call every computer application they need a database, especially when they don’t know exactly what they need?)
We want to build a fully open source solution. I was confident that there were many open source tools available which we could use for the description of study designs. We started out with the ISA tools from EBI – specifically created for this type of study description. However, we soon encountered limitations, especially because the formats are tab-delimited. ISATAB is great for exchanging general metadata about a study between different platforms, but it falls short when it comes to storing a full-blown study design such as the NuGO study I referred to earlier.
The same was true for all the other formats I reviewed, and there were many: ISATAB, MAGETAB, MAGE-ML, FuGE, GEN2PHEN, Pheno-OM, SimpleTox… They were either too platform specific (like MAGE, which is, as the name already says, meant for microarray experiments) or too general when it comes to study design. I also checked out a large number of (mostly) open source tools which try to realize this goal: ISACreator, Annotare, SetupX, and the as-of-yet proprietary ArrayTrack and WikiLIMS… And finally, I read quite a number of standards and work group documents, some of them complete, others never really finished: MIBBI, MIAME, MIAME-NUT, OpenTox, MSI, BioXSD, DAS, ontologies such as EFO, SWAN, EXPO, EDAM…
I realized that in the end, there are only a limited number of mindsets represented in all those documents and tools. There is the mindset of the lab analyst, who sees test tubes, protocols, machines and sample runs. Then there is the mindset of the bioinformatician, who sees omics data types, algorithms and file formats. And finally, there is the mindset of the biologist, who sees phenotypes and genotypes, species and environmental conditions. To unite all these mindsets and users in one network of open source tools, and still retain an active user base, is a very big project. It is something that I haven’t seen accomplished anywhere yet (except maybe in caBIG), despite all the publications about ‘comprehensive’ tools with a promise along those lines.
So, where do we, the dbNP project team, stand in this? Do we want to repeat caBIG, but better? Yes and no. We certainly want to be able to do multi-omics. That’s what the involved nutrigenomics researchers do now anyway, but we want to make life easier for them. Running the same R scripts over and over, keeping track of your data in Excel sheets on shared servers, and mailing around results as PowerPoint presentations may get the job done in the end, but it is not very efficient and it certainly doesn’t contribute to the repeatability of the experiments and findings. So we will do what it takes to get there. On the other hand, we don’t want to become a stone in the extensive graveyard of nice bioinformatics tools and projects which, by the time the funding ran out, were 80% complete, 40% documented and had zero users. Our key to that is community involvement: a tight cooperation between our team of developers and a small group of very active users who constantly give them feedback. It is also the reason that over this last year, we did only one thing – implement a tool for the description of study designs, because that is an immediate need for all involved consortia, and it will serve as a starting point for our further activities. The tool is called GSCF.