One of the most underestimated parts of a FAIR data management plan is archiving your data. How are you going to store your data so that people can access, download and share it? And what archiving costs come with storing it that way? Even for storage alone, many variables come into play when deciding which option to go for. Hopefully this gives you a clearer view of what is needed to archive your data in a FAIR way.
One of the first questions you need to answer is: where are you going to store your data? This could be cloud storage you rent, an open data repository, or perhaps a repository specific to your domain. Each has its own pros and cons, which are described in the table below.
So you have decided to use a repository, but how do you find a relevant one? A quick Google search can already surface relevant results, and if a discipline-specific repository is available, that should be the default go-to. Ask around: colleagues or friends in the field may know a repository they always use. You can also consult websites like FAIRsharing, or take a look at Nature’s recommended data repositories, which lists options across many domain-specific repositories.
Do you need to prepare your data to ensure preservation? Is the original file format unstable, or not commonly used in your field? Do you need to anonymize the data? This depends heavily on the kind of research you do. Do you work with people? If so, you almost certainly need to anonymize your data before sharing. If you have no idea which parts of your data should be anonymized, look up Personally Identifiable Information (PII): information that can be used to identify a person, on its own or combined with other information. Most of it is straightforward, such as date of birth and home address.
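As a minimal sketch of what such anonymization can look like in practice, the snippet below strips a fixed set of PII fields from a tabular record before sharing. The field names (`date_of_birth`, `home_address`, and so on) are hypothetical; adapt the set to your own dataset, and remember that combinations of non-obvious fields can also identify a person.

```python
# Hypothetical PII field names; adapt this set to your own dataset.
PII_FIELDS = {"name", "date_of_birth", "home_address", "email", "phone"}

def anonymize(record: dict) -> dict:
    """Return a copy of the record with known PII fields removed."""
    return {k: v for k, v in record.items() if k.lower() not in PII_FIELDS}

record = {
    "participant_id": "P-042",          # pseudonym, kept
    "date_of_birth": "1987-03-14",      # PII, dropped
    "home_address": "12 Example Street",# PII, dropped
    "measurement": 5.7,                 # research data, kept
}
print(anonymize(record))  # only participant_id and measurement remain
```

For real studies, dropping columns is only the first step; consider pseudonymization and checking for indirect identifiers as well.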
Another easily overlooked question is: in which file formats do you want to store your data? Some file formats are non-standard or even unstable. Converting your data to standard, stable formats improves the quality of your data preservation and widens the ways in which the data can be shared and reused.
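For example, a tool-specific JSON export can be converted to plain CSV, a stable and almost universally supported format, with nothing but the Python standard library. The file name and record structure below are placeholders for illustration.

```python
import csv
import json

# Placeholder data standing in for a tool-specific JSON export.
records = json.loads('[{"sample": "A", "value": 1.2}, {"sample": "B", "value": 3.4}]')

# Write the same records as CSV, a stable and widely supported format.
with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample", "value"])
    writer.writeheader()
    writer.writerows(records)
```

The same idea applies to other formats: prefer open, text-based, well-documented formats over proprietary binary ones wherever your data allows it.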
What software or tool did you use for analysing your data? Is it publicly available, and where can it be found? There is little use in archiving and sharing your data if others cannot use it because it depends on a single tool. Converting your data to more standard and stable formats can help in such cases, as can storing your data in a domain- or discipline-specific tool or repository.
This table gives you a list of options you have for storing data, their pros and cons, as well as how FAIR such archiving is.
|Type|Cloud storage|Open data repository|Domain-specific repository/tool|
|---|---|---|---|
|FAIR|This is the least FAIR way of archiving your data. It is not easily findable, and accessing the data can be a hassle: you need to configure whether people can simply download it or must ask your permission first. One of the easiest ways to increase the level of FAIRness is to publish the download link on an easily accessible page with a detailed description of what is inside your dataset.|Storing your data in an open data repository is already a lot more FAIR than storing it in the cloud. A website is in place to access it, built to make your data findable. Storing your data in an open data repository with a FAIR Data Point increases the level of FAIRness significantly.|This is the FAIRest of these three ways of storing your data. Most domain-specific repositories/tools are well known in their community, where linking and contributing are highly appreciated. However, keep in mind that, to be as FAIR as possible, there should be an easy way to download only the dataset.|
How do I decide whether the repository of my choice is FAIR?
If you can answer “yes” to all of the following questions, the repository can be considered FAIR:
- Do the datasets have a globally unique and persistent identifier?
- Can you upload metadata about essential information such as the origin of your dataset, as well as submitter-defined metadata (ontologies used, naming of variables, etc.)?
- Is it stated which license the repository uses (if not, can you choose one yourself)?
- Is the metadata always publicly available?
- Does the repository request a specific format to ensure machine readability?
- Does the repository have a long-term plan for preserving archived data?
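The checklist above can also be expressed as a small helper, where a repository counts as FAIR only if every criterion is met. The criterion labels below are shorthand paraphrases of the questions, not an official vocabulary.

```python
# Shorthand labels for the checklist questions above (paraphrased, not official).
FAIR_CRITERIA = [
    "persistent identifier for datasets",
    "essential and submitter-defined metadata supported",
    "license stated or selectable",
    "metadata always publicly available",
    "specific format requested for machine readability",
    "long-term preservation plan",
]

def is_fair(answers: dict) -> bool:
    """answers maps each criterion to True/False; all must hold."""
    return all(answers.get(criterion, False) for criterion in FAIR_CRITERIA)

answers = {criterion: True for criterion in FAIR_CRITERIA}
print(is_fair(answers))  # True
answers["long-term preservation plan"] = False
print(is_fair(answers))  # False
```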
The ideal would be to archive your data forever. However, if you do follow-up research, you could decide to archive your new and old data together in another place after X years. Or perhaps you plan to move the data elsewhere after publishing your article. Keep these variables in mind when calculating the total costs of archiving your data.
Costs of cloud data storage:

|Amount|50 GB|500 GB|1 TB|5 TB|
|---|---|---|---|---|
|Costs per 5 years, Amazon|$73|$735|$1470|$7350|
|Costs per 5 years, Google|$69|$690|$1380|$6900|
|Costs per 5 years, Microsoft|$127|$688|$1278|$6454|
|Costs per 5 years, Sia|$6|$60|$120|$600|

Costs for Amazon and Google are calculated by multiplying the price of the standard storage option; for Azure and Sia, their price calculators were used.
Which provider do you use? Amazon, Google and Microsoft all price differently based on the type of storage, the storage location and the redundancy options (how many different locations the data is stored in). The costs of cloud data storage also depend on the storage class you want: standard, infrequent access or “cold” storage. “Cold” storage is the cheapest option and is meant for data you expect to be rarely accessed, so it is not very FAIR and therefore undesirable.
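To make the arithmetic behind the 5-year figures above explicit, the sketch below estimates a total from a flat per-GB-per-month rate. The rates used are illustrative approximations that happen to reproduce two of the table entries; real pricing varies by region, storage class and redundancy, so check the providers’ price calculators for actual quotes.

```python
# Illustrative per-GB-per-month rates (approximations, not official quotes).
RATES_PER_GB_MONTH = {
    "google_standard": 0.023,  # assumed flat rate reproducing the $69 / 50 GB figure
    "sia": 0.002,              # assumed flat rate reproducing the $6 / 50 GB figure
}

def five_year_cost(gigabytes: float, rate_per_gb_month: float) -> float:
    """Storage cost over 5 years (60 months) at a flat per-GB-per-month rate."""
    return gigabytes * rate_per_gb_month * 12 * 5

print(round(five_year_cost(50, RATES_PER_GB_MONTH["google_standard"]), 2))  # 69.0
print(round(five_year_cost(5000, RATES_PER_GB_MONTH["sia"]), 2))            # 600.0
```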
Sia is a special case: it competes directly with Amazon, Google and Microsoft on price. Keep in mind, however, that with a provider like Sia you still need to configure and set things up yourself, so these are not the final costs of using Sia.
As you can see, the prices go up very quickly, and with the techniques available to us in the world of Life Sciences, a dataset exceeding 5 TB is not unheard of.
While looking at these prices, also keep in mind that this is the laziest way of storing your data, and it is not FAIR at all unless you do additional work yourself.
|What|Standard costs|50 GB|500 GB|1 TB|5 TB|
|---|---|---|---|---|---|
|DataDryad|(not listed)|$150|$2400|$4900|$24,900|

Calculation DataDryad: $50 for every 10 GB above the standard 20 GB. The per-size figures above cover only this overage; the standard charge comes on top.
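The DataDryad rule stated above ($50 for every started 10 GB above the standard 20 GB) can be turned into a small function. Note that the standard charge itself is not given here, so only the overage is computed.

```python
import math

def dryad_overage(gigabytes: float) -> int:
    """Overage in USD: $50 for every started 10 GB above the included 20 GB.

    The DataDryad standard charge is not included, since it is not given here.
    """
    excess = max(0.0, gigabytes - 20)
    return math.ceil(excess / 10) * 50

print(dryad_overage(50))    # 150
print(dryad_overage(1000))  # 4900
```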
As you can see, the costs of storing your data are high, and they only cover the storage itself. They do not include the other expenses, such as curating the data after the research, setting up the server and maintaining it. And even once you have factored in those costs, your archiving at this point is still not very FAIR, and much of the potential value of your data is lost. So why not get the best out of your money and make it as FAIR as possible?
And if you need guidance in making your data FAIR, in selecting a FAIR data store that fits your needs, or have any other questions about FAIRifying your data, do not hesitate to contact us. We are happy to help make the world a FAIRer place, together with you.