The latest news from The Hyve on Open Source solutions for bioinformatics

Archiving in a FAIR way, an Overview of Data Archive Costs

September 01, 2017 | By Jarno van Erp

Archiving is one of the most underestimated parts of a FAIR data management plan. How do you store your data so that people can access, download and share it? And what does archiving it that way cost? Even the choice of where to store your data involves many variables. Hopefully this overview gives you a better idea of what is needed to archive your data in a FAIR way.

 

Where?

One of the first questions you need to answer is: where are you going to store your data? This could be some cloud storage you rent, an open data repository or maybe a specific repository for your domain. Each one has its own pros and cons, which are described in the table below.


So you have decided to use a repository, but how do you find a relevant one? A quick Google search can already lead to relevant results, and if a discipline-specific repository is available, then that should be the default choice. Ask around: maybe your colleagues or friends in the field know a repository they always use. You can also consult websites like FAIRsharing, or take a look at Nature’s recommended data repositories, which lists options for many domain-specific repositories.

Preservation

Do you need to prepare your data to ensure preservation? Is the original file format unstable or not commonly used in your field? Do you need to anonymize the data? This of course depends heavily on the kind of research you do. Do you work with people? In that case you almost certainly need to anonymize your data before sharing. If you are not sure which parts of your data should be anonymized, look up Personally Identifiable Information (PII). PII is information that can be used to identify a person, on its own or combined with other information. Most examples are straightforward, like date of birth and home address.
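As a first pass at spotting PII, you could scan the column names of a dataset for tell-tale fragments. The sketch below is only an illustration: the hint list and the example column names are hypothetical, and matching on names will miss identifiers that are labelled differently, so it does not replace a manual review.

```python
# Hypothetical name fragments that often indicate PII; extend for your own data.
PII_HINTS = ("birth", "dob", "address", "name", "email", "phone", "postcode")


def flag_possible_pii(column_names):
    """Return the column names that contain one of the PII hint fragments."""
    flagged = []
    for column in column_names:
        # Normalize so "Date of Birth" and "date_of_birth" are treated alike.
        normalized = column.lower().replace(" ", "_")
        if any(hint in normalized for hint in PII_HINTS):
            flagged.append(column)
    return flagged


# Example column names (made up for illustration).
columns = ["patient_id", "Date of Birth", "home_address", "glucose_mmol_l"]
print(flag_possible_pii(columns))
```

Note that a column such as `patient_id` is not flagged here even though it may identify a person when combined with other information, which is exactly why a name-based scan is only a starting point.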


Another easily overlooked question is: in which file formats do you want to store your data? Some file formats are non-standard or even unstable. Converting your data to standard, stable formats improves the quality of your data preservation and increases the number of ways the data can be shared and reused.

Software

What software or tool did you use for analysing your data? Is it publicly available, and where can it be found? There is little point in archiving and sharing your data if others cannot use it because it depends on a single tool. Converting your data to more standard and stable formats can help in such cases. Storing your data in a domain- or discipline-specific tool or repository can also help mitigate these problems.


This table gives you a list of options you have for storing data, their pros and cons, as well as how FAIR such archiving is.

 

Types compared: cloud storage, open data repository, domain-specific repository/tool

Pros
  • Cloud storage: You are the owner of the place where you store your data. This means there is not much data processing needed to store your data.
  • Cloud storage: There are no other standards to be applied to your data. So you can just store your data there and leave it as is, or maybe link to it from a public place.
  • Open data repository: An open and already easily accessible way to store data without much processing needed for your data.
  • Open data repository: Your dataset is already easy to find in the repository, as the repository is built for this purpose.
  • Domain-specific repository/tool: The biggest benefit of storing your data in a repository or tool specific to your research domain is interoperability with other datasets. The studies inside the repository/tool share the same subject, and with a standardized data format it is easy to get datasets to work together.

Cons
  • Cloud storage: You are responsible for making sure your data is not lost. If your server crashes for some reason, you should be able to restore your data.
  • Cloud storage: Your data is not stored in a useful way: if another researcher wants to use your dataset, they will probably need to do some form of preprocessing first.
  • Open data repository: There are currently a lot of initiatives for open data repositories. If one goes offline, your data could be lost if you did not make a backup of it.
  • Open data repository: Extra functionality, such as publishing your data, comes with added costs, and costs also grow with the amount of data you store.
  • Domain-specific repository/tool: Your dataset will probably need a lot of processing before you can add it to a domain-specific repository/tool.
  • Domain-specific repository/tool: You will have to curate your data to make sure it is not lost at some point, does not lose relevance and stays up to date.

FAIR
  • Cloud storage: This is the least FAIR way of archiving your data. It is not easily findable and accessing the data could be a big hassle. You need to configure whether people can just download it or whether they need to ask your permission. One of the easiest ways to increase the level of FAIRness is to publish the download link on an easily accessible page with a detailed description of what is inside your dataset.
  • Open data repository: Storing your data in an open data repository is already a lot more FAIR than storing it in the cloud. There is a website in place to access it, which was developed to make your data findable. Storing your data in an open data repository with a FAIR data point increases the level of FAIRness significantly.
  • Domain-specific repository/tool: This is the FAIRest way of storing your data out of these three. Most domain-specific repositories/tools are known in their community, and linking and contributing is always highly appreciated in these communities. However, keep in mind that, to be as FAIR as you can be, there should be an easy way to download only the dataset.

 

How do I decide if the repository of my choice is FAIR?

If you can answer “yes” to all of these questions, the repository is considered FAIR:

  • Do the datasets have a globally unique and persistent identifier?
  • Can you upload metadata about essential information such as the origin of your dataset and submitter-defined metadata (ontologies used, naming of variables etc)?
  • Is it stated which license the repository uses (if not, can you choose one yourself)?
  • Is the metadata always publicly available?
  • Does the repository request a specific format to ensure machine readability?
  • Does the repository have a long-term plan for preserving archived data?
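The checklist above can be written down as a small script, which is handy when you compare several candidate repositories. This is a minimal sketch: the dictionary keys and the example answers are made up for illustration, and the rule "FAIR only if every answer is yes" simply mirrors the sentence introducing the list.

```python
# The six questions from the checklist, keyed by short (hypothetical) names.
FAIR_QUESTIONS = {
    "persistent_identifier": "Do the datasets have a globally unique and persistent identifier?",
    "rich_metadata": "Can you upload metadata about origin and submitter-defined details?",
    "license_stated": "Is the license stated, or can you choose one yourself?",
    "metadata_public": "Is the metadata always publicly available?",
    "machine_readable_format": "Does the repository request a specific format for machine readability?",
    "long_term_plan": "Does the repository have a long-term preservation plan?",
}


def assess_repository(answers):
    """Return (is_fair, unmet): unmet lists the questions not answered 'yes'.

    Questions missing from `answers` count as 'no'.
    """
    unmet = [key for key in FAIR_QUESTIONS if not answers.get(key, False)]
    return (len(unmet) == 0, unmet)


# Example answers for an imaginary repository.
example_answers = {
    "persistent_identifier": True,
    "rich_metadata": True,
    "license_stated": True,
    "metadata_public": True,
    "machine_readable_format": False,
    "long_term_plan": True,
}
is_fair, unmet = assess_repository(example_answers)
print(is_fair, unmet)
for key in unmet:
    print("Not met:", FAIR_QUESTIONS[key])
```

Returning the unmet questions, rather than only a yes/no verdict, shows which points to raise with the repository before committing to it.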

 

Costs

Ideally, you would archive your data forever. However, if you do follow-up research, you could decide after X years to archive your new and old data together in another place. Or maybe you plan to move the data elsewhere after publishing your article. Keep these variables in mind when calculating the total cost of archiving your data.

 

Costs of Cloud data storage:

Costs per 5 years

Provider     50 GB   500 GB   1 TB    5 TB
Amazon       $73     $735     $1470   $7350
Google       $69     $690     $1380   $6900
Microsoft    $127    $688     $1278   $6454
Sia          $6      $60      $120    $600

Costs for Amazon and Google were calculated by taking the price of the standard storage option and multiplying it by the amount of data and the storage period. For Microsoft (Azure) and Sia, their price calculators were used.
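The multiplication behind the Amazon and Google figures can be sketched as follows. The per-GB monthly rate is not stated in this post; the value used here is an assumption, back-calculated from the Amazon 1 TB figure in the table by taking 1 TB as 1000 GB and 5 years as 60 months. Actual prices vary by region and change over time.

```python
MONTHS_IN_FIVE_YEARS = 60


def five_year_storage_cost(size_gb, price_per_gb_month):
    """Flat-rate estimate: size x monthly price per GB x number of months."""
    return size_gb * price_per_gb_month * MONTHS_IN_FIVE_YEARS


# Assumed rate in USD per GB per month, derived from the table: 1470 / (1000 * 60).
ASSUMED_RATE = 0.0245

for size_gb in (50, 500, 1000, 5000):
    cost = five_year_storage_cost(size_gb, ASSUMED_RATE)
    print(f"{size_gb} GB: ${cost:.2f}")
```

This covers storage only. Requests, data transfer out of the cloud and redundancy options are billed separately and are not included in this estimate.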

 

Which provider do you use? Amazon, Google and Microsoft all have different prices based on the type of storage, the storage location and the redundancy options (in how many different locations the data is stored). The type of storage matters as well: standard, infrequent access or “cold” storage? “Cold” storage is the cheapest option, intended for data you expect to be accessed rarely. That makes it not very FAIR and therefore undesirable.

 

Sia is a special case: it competes with Amazon, Google and Microsoft directly on price. However, keep in mind that with a provider like Sia you still need to configure and set things up yourself, so these are not the final costs of using Sia.

 

As you can see, the prices go up very quickly, and with the techniques available to us in the world of Life Sciences, a dataset that exceeds 5 TB is not unheard of.

While looking at these prices, also keep in mind that this is the laziest way of storing your data, and it is not FAIR at all unless you do additional work yourself.

 

Repository   Standard costs          50 GB           500 GB           1 TB             5 TB
DataDryad    $120 publishing costs   $150 per year   $2400 per year   $4900 per year   $24900 per year

Calculation for DataDryad: $50 for every 10 GB above the standard 20 GB.
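The DataDryad calculation can be reproduced with a few lines of code. The function and parameter names are illustrative, and rounding a partially used 10 GB block up to a full block is an assumption, since the rule above does not say how partial blocks are charged. The figures cover only the charge for extra storage, not the $120 publishing costs.

```python
import math


def dryad_overage_cost(size_gb, included_gb=20, block_gb=10, fee_per_block=50):
    """Charge for storage above the included amount, in USD.

    Assumption: a partially used block is charged as a full block.
    """
    extra_gb = max(0, size_gb - included_gb)
    return math.ceil(extra_gb / block_gb) * fee_per_block


# Sizes from the table, taking 1 TB as 1000 GB.
for size_gb in (50, 500, 1000, 5000):
    print(f"{size_gb} GB: ${dryad_overage_cost(size_gb)}")
```

For the four sizes in the table this gives $150, $2400, $4900 and $24900, matching the values listed there.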

 

As you can see, the costs of storing your data are high, and these costs only cover the storage. They do not include other expenses, such as curating the data after the research, setting up the server and maintaining the storage server. Even once you have factored in these costs, your archiving at this point is not very FAIR, and a lot of the potential value of your data will be lost. So why not get the most out of your money and make it as FAIR as possible?

 

And if you need guidance with making your data FAIR, selecting a FAIR data storage that fits your needs, or have any other questions on FAIRifying your data, do not hesitate to contact us. We are happy to help with making the world a FAIRer place, together with you.