The view of a data scientist - BONUS BIO-C3 & BONUS INSPIRE
The delicate business of sharing data
We recently published a new study of the zooplankton in the Baltic Sea, which was in many ways a big milestone for us. Zooplankton means all those tiny animals in the surface waters of seas and lakes that have one thing in common - they drift along with the water. The special thing about this study was that it used probably the largest zooplankton database that I know of.
This database contains thousands of zooplankton observations from the Baltic Sea, going back 60 years. And it took us (me and ten other scientists) almost three years to build!
What’s the big whoop?
Large and easily usable observational datasets are rare in marine science! This is partly because sampling the sea is expensive - no single institute could sample frequently enough for the data to be usable for all kinds of research questions. And if several institutes are involved, the ways the data is collected and organised start to vary. Before any big science can be done with data that comes from many sources, a lot of work needs to be done to make sure all the numbers in the data mean the same thing.
HELCOM in the Baltic Sea has done a lot of work in harmonising the sampling methods, and that is a good thing. But the most valuable information - the raw data itself - is often still scattered across many computers and Excel worksheets around the Baltic Sea.
Why is marine plankton data so expensive?
It is hard to get, and you need a lot of it! The sea is not very easy to access - you need a vessel to get out there, as well as proper sampling tools. And once you have the sample, you need a highly qualified person who knows the species to sit for hours behind a microscope, counting and identifying all the little creatures in the sample.
And then, one sample is only a drop in the sea. Not much can be concluded from one, ten or even hundreds of samples. The variation caused by unknown factors is simply too vast, and the information we might be looking for is hidden in there. But with thousands of samples, you can already do something. And the good news is that (at least as far as the Baltic Sea and zooplankton are concerned) those thousands of samples already exist, somewhere in the hundreds of computers and Excel spreadsheets!
For example, with only ten partners we already managed to pool 25,000 samples!
Although, now, having explained why plankton data is so expensive, I must correct myself. It is expensive, but (borrowing the words of Jim Cloern): “long-term monitoring (of coastal seas) is cheap given the value of information we gain”.
Building up the network
So sharing the data is the way to better science. But since data is expensive, it is also very valuable to the researcher, and sharing it comes with the risk of losing your investment - that someone else will take credit for your work.
Our co-operation started from long discussions and agreements, and re-agreements, between me and all the researchers who had some data. These agreements included a detailed data policy, a set of rules for using the data, ensuring that nobody would take advantage of anyone else. This process took quite a long time, before any data was actually involved.
My job as a data manager is to make sure that everyone’s data is safe and properly used. And that is the cornerstone of successful data co-operation in the long run - build a very firm foundation of trust and clear agreements, and stick to them.
On the other hand, my job is also to make sure that the data is advertised, and used as much as possible. Because, as expensive as plankton data is, it does not compare to the cost of data that is never used.
Serving the society and science
Long-term observations from the sea have become a valuable source of knowledge, especially over the last decade or two. Yet, long-term monitoring programs are always endangered when money is limited. Sustaining those monitoring programs is important if we want to document and understand what's happening to our environment. To keep those programs going, we need to demonstrate their value over and over again.
Although the reason for long-term monitoring is to keep an eye on what's changing, and to figure out the role of human activity or climate in those changes, there is no reason to stop there. The dirty work is done, the data is tidy and ready, and its power formally tested and proved. Now comes the fun part!
The first question we asked of our new database, to test its power, was: how often should you go out there on the sea and take a sample, to make sure you won't miss important biological events? For example, many of those tiny animals have their high seasons, when they are 10-100 times more abundant in the water than during most of the year - kind of like the blossoming of dandelions in spring.
This question was also ironic, because we used data we believed was sub-optimally sampled to learn what would have been optimal.
The answer is 2-3 weeks; if you want to know why, check our paper.
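To make the question concrete, here is a toy sketch of the general idea - my own illustration, not the actual method from our paper. It simulates a year of daily abundance with one short "bloom", then asks: if a monitoring program only samples every k days, how much of the peak does it catch in the worst case, depending on where its sampling days happen to fall?

```python
import numpy as np

# Toy illustration (NOT the method from the paper): simulate a daily
# abundance series with a 3-week seasonal bloom, then check the worst-case
# peak a sparser sampling program could observe.

rng = np.random.default_rng(42)
days = np.arange(365)

# Baseline abundance plus a Gaussian-shaped bloom around day 150,
# roughly 50x the baseline - like the dandelion blossoming in spring.
baseline = 10.0
bloom = 500.0 * np.exp(-0.5 * ((days - 150) / 7.0) ** 2)
abundance = baseline + bloom + rng.normal(0, 2, size=days.size)

true_peak = abundance.max()

# For each sampling interval, take the worst case over all possible
# starting days (phases) and record the fraction of the peak observed.
results = {}
for interval in (7, 14, 21, 42, 84):
    worst_peak = min(abundance[off::interval].max() for off in range(interval))
    results[interval] = worst_peak / true_peak
    print(f"sampling every {interval:2d} days: "
          f"worst case catches {100 * results[interval]:5.1f}% of the peak")
```

With a bloom this narrow, weekly sampling still catches most of the peak even with the worst timing, while sampling every couple of months can miss the event almost entirely - which is the intuition behind asking how sparse a monitoring program can afford to be.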
And these are those ten scientists that took the leap of faith and placed their precious numbers into our common basket:
Henn, Maiju, Piotr, Gunta, Anda, Anna, Evelina, Katja, Mart & Arno – thank you all for being so easy to work with!