Hakan Olsson is a professor and specialist in forestry remote sensing at the Swedish University of Agricultural Sciences, which is the nation’s main center of research in forestry. In a 2009 Nature article, Professor Olsson was quoted as saying Sweden’s archive for data collected during the International Polar Year (IPY) is helping house smaller independent projects’ data that never would reach large international databases.

What is your experience with data sharing?

I am a forest remote sensing scientist. I became a full professor in 1994, and since then I have initiated a large number of research projects in which we have evaluated remote sensing data (for example, different types of satellite data) and developed methods for estimating forest and vegetation variables. In our research, we compare remote sensing data with large sets of field plots, often on the order of 1000 field-measured plots. Those field plot data sets have often been of great interest to other research groups as well, and we have shared them on a project-by-project basis, between research groups, on the condition that the data be used for the agreed purpose only. I cannot recall when this started, but it has been ongoing since at least 1996, and maybe longer. This has worked very well, and our experiences have been only positive.

Since 2002, we have also been producing nationwide raster databases with forest variables, obtained by training Landsat or SPOT satellite data with the database of sample plots collected by the National Forest Inventory. These datasets can also be downloaded free of charge (http://skogskarta.slu.se/). Since it is the only database of its kind on the Swedish forests, it is used by a multitude of users, from authorities to ecological researchers doing spatial modeling. We had to be careful about spreading this data around in the beginning, since the national land survey tried to sell satellite data classifications (there are still several authorities that charge for data, even charging researchers). The second reason we had to be careful about advertising this forest database very widely in the beginning was that some of the large forest companies were not really used to the idea that their forest resources should be openly published, or at least open to those who could intersect our database with real estate boundaries. However, after a period of selling the data and keeping a low profile in advertising the data set, all players got used to our database, and we can now distribute it freely over the internet. We get a lot of goodwill in this way and also substantial funding from authorities who like what we are doing.

I was also a member of the international, as well as the Swedish national, data committee within the International Polar Year (IPY). This was a very special case where the willingness to share data could be studied, since sharing data with a short delay was one of the key motivations of the whole IPY. All IPY projects therefore had to sign that they would live up to this requirement. In Sweden, we therefore started a small national IPY data repository at the Met-agency SMHI (www.smhi.se), but my understanding is that the flow of data to that repository has been slow.

What type of data have you shared and what sort of work load and costs did it impose on you?

As indicated above, we have mainly shared:

1) Field plot data bases from our remote sensing test areas. This has mainly caused us some limited work time for communication and data distribution.

2) Final raster data bases with forest data. In the nationwide raster databases, each 25 m × 25 m grid cell carries information about timber volume and tree species for that location.

Again, time has been spent on communication and data distribution, but since the product now is quite well known and possible to download freely, the interaction with those using it has decreased.

Has your data sharing experience been successful?

We made one collection of about 2000 sample plots in 1996 for one specific remote sensing project, but after a while about 30 other projects had used the same sample plot data for slightly different purposes. In another project, we collected about 800 field plots, which we did not use much ourselves. However, another remote sensing group of radar scientists (who had not had the possibility to collect similar datasets themselves) later used our data set for research on forest stem volume with C-band radar coherence, and those results were a bit of a breakthrough when they arrived. We have continued working like this in other projects. I cannot recall all the times we have shared plot data, but I guess it must be with over 100 external projects.

The nationwide freely downloadable raster data base of forest resources has provided an excellent data source — among other things as a basis for spatial ecological models, and it can be considered a great success.

As an example of the use of the raster database with forest variables, we are getting more and more species observations that are associated with coordinates: for example, observations of rare plants that are given GPS coordinates, or wild animals that have been captured and equipped with a GPS and cell-phone collar that reports coordinates back to the scientists. We have a large number of moose equipped that way. In order to understand the habitat requirements behind those species observations, there is also a need for a description of the habitat for comparison purposes, and this is where the ecologists use our nationwide raster databases of the Swedish forest vegetation.
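The lookup described here, going from an observation's coordinates to the forest variables of the grid cell it falls in, can be sketched roughly as follows. This is only an illustration, not the method used by the ecologists or by the database itself: it assumes the coordinates have already been converted from GPS latitude/longitude to the raster's projected system in metres, that the raster is a plain nested list with its origin at the upper-left corner, and all names and values (cell_index, sample_raster, volume_grid, the corner coordinates) are hypothetical.

```python
import math
from typing import List, Optional, Tuple

CELL_SIZE_M = 25.0  # grid cell size stated in the interview


def cell_index(x: float, y: float, x_min: float, y_max: float,
               cell_size: float = CELL_SIZE_M) -> Tuple[int, int]:
    """Return (row, col) of the grid cell containing the point (x, y).

    Assumes projected coordinates in metres, with (x_min, y_max) the
    upper-left corner of the raster and rows counted southwards.
    """
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y_max - y) / cell_size)
    return row, col


def sample_raster(grid: List[List[float]], x: float, y: float,
                  x_min: float, y_max: float,
                  cell_size: float = CELL_SIZE_M) -> Optional[float]:
    """Return the raster value at (x, y), or None if outside the grid."""
    row, col = cell_index(x, y, x_min, y_max, cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return grid[row][col]
    return None


# Hypothetical 2 x 3 raster of timber volume (made-up values).
volume_grid = [
    [120.0, 180.0, 95.0],
    [210.0, 160.0, 140.0],
]
X_MIN, Y_MAX = 500_000.0, 6_500_000.0  # made-up upper-left corner

# A made-up animal position, already converted from GPS to metres.
volume_at_position = sample_raster(
    volume_grid, 500_030.0, 6_499_990.0, X_MIN, Y_MAX
)
```

In practice a geospatial raster library would handle the georeferencing; the point of the sketch is only that each observation reduces to a row and column in the 25 m grid.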

What problems/hurdles have you encountered personally or what problems/hurdles have you observed generally in the scientific realm?

The problems are not many, but the following potential problems could be mentioned:

1) Much of Swedish research is done as PhD projects, and it is a bit sensitive to give away data that a PhD student is dependent on to another researcher; this must build on trust that the other researcher will not use the data for the same purpose as the PhD student (this aspect is valid for other projects as well, but is extra sensitive for PhD students). Thus researcher-to-researcher communication works better than instances in which a researcher stores the data in a national data center.

Before a research group has analyzed the data and used it for the science it was planned for, the same data is often also of interest to other groups that are doing slightly different science. The data center approach is more suited to the long-term preservation of data that is no longer critical for the research group.

2) Part of the data we are using belongs to sample plot systems that are used for producing national statistics, for example sample plots from the National Forest Inventory. Since we are producing the NFI ourselves at our department, we can use those data to train remote sensing products. However, we cannot disclose the location of those plots to other organizations, since the plots are permanent and their location in the terrain is secret.

3) With remote sensing and NFI data in combination, we can produce products that might be perceived as a competitive threat both for map-producing authorities and private forest companies. This is, however, not a large problem.

How did you tackle those hurdles?

1) Don’t be too quick to give away data from ongoing PhD projects, and make careful written agreements when any data is being shared.

2) When the locations of sample plots are secret, we can still give away a data vector with forest attributes and associated remote sensing data, but with the coordinates truncated. These data vectors can then be used by others as reference for their remote sensing data.

3) With the passage of time, authorities and companies have become used to the new products we have made and offered for free. Surprisingly often, there is an initial resistance that will disappear quite quickly.
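The second point above, sharing attribute vectors without revealing plot locations, might look something like the sketch below. It takes one possible reading of "truncated", namely that the coordinate fields are dropped entirely before the records are handed over. The field names, the sample values and the strip_coordinates helper are all hypothetical and are not taken from the National Forest Inventory.

```python
from typing import Any, Dict, List

# Hypothetical names of the fields that reveal where a plot is located.
COORDINATE_FIELDS = ("easting", "northing")


def strip_coordinates(plots: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the plot records without their coordinate fields."""
    return [
        {key: value for key, value in plot.items()
         if key not in COORDINATE_FIELDS}
        for plot in plots
    ]


# Made-up plot records: forest attributes plus associated satellite values.
plots = [
    {"plot_id": 1, "easting": 512345.0, "northing": 6654321.0,
     "stem_volume": 210.0, "band_red": 0.041},
    {"plot_id": 2, "easting": 498765.0, "northing": 6612345.0,
     "stem_volume": 95.0, "band_red": 0.067},
]

# The records that could be shared; the originals are left untouched.
shareable = strip_coordinates(plots)
```

The receiving group still gets the pairing of field-measured attributes and remote sensing values that it needs as reference data, but cannot place the plots in the terrain.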

Do you see a need for a national data sharing repository or smaller repositories for specialized arenas?

Yes, I think there is a need for some type of central facility in each country. Such a facility could also be a link to more specialized facilities as well as to the central repositories in other countries. In fact, I was involved in getting such a central facility in Sweden funded. It is funded by the National Research Council (www.vr.se), and hosted by our national Met-agency (www.smhi.se). The name in English should be something like “The Swedish National Data Center for Climate and Environmental Research Data”. However, it is so new that I don’t know whether they have an official English name yet. The abbreviation in Swedish is SND-KM. In addition to being a contact link to other local repositories at research institutions and specialized national or large international data centers, a national center could be a competence hub, as well as a holder of certain data sets of critical importance or data sets that have no other natural home. By this, I mean both that a national center should have information about decentralized and specialized centers AND (and maybe more importantly) that a national center should have knowledge that could be communicated to the specialized data repositories. A very concrete example is the implementation of standards, like ISO standards for geographical data.

Given my own experiences of data sharing, I think that a researcher-to-researcher mode of sharing data works best as long as a project is ongoing, since it involves trust and control. In addition, there is a need in the later stages of a project to deposit valuable data on a more permanent and institutional level, above the research group level, and this might be one role for more centralized repositories. The repositories could also come into the loop earlier if control and trust can be built into the technical systems in such a way that researchers can control how their data is distributed, at least in the beginning. A widespread adoption of data citation might help as well.

The situation in Sweden is that most research is done at universities, which formally are authorities and are obligated to archive their data. However, this is seldom done in a way that makes the data easy to retrieve and the documentation is often not sufficient for making the data useful for other researchers.

What agency should run it/them and why? If you advocate a system of smaller repositories organized around specific disciplines, should there be an overview agency that coordinates/supervises interactions among databases? If so, describe the agency’s proposed function.

In Sweden, the National Research Council “Vetenskapsrådet” organized a “beauty contest” where research-oriented organizations could bid for hosting the national data center. The bids were then evaluated by a panel of international experts. The bid winner was the Met-agency (www.smhi.se), which has a lot of experience in the technical and organizational aspects of data handling and data sharing.

I believe a national center should be a core hub of competence and information about who is doing what etc. But I don’t really see how a national center could become a steering body for all other local or specialized data repositories. A national center should know who is doing what, and it should also disseminate knowledge and practices to the other centers.

What are your suggestions on the most powerful ways to combat researchers’ resistance to data sharing? How effective do you see the requirements now in place by the NIH and high-profile journals requiring data sharing as condition of funding? Do you believe more teeth should be put into these requirements and, if so, how?

I think funding authorities and research institutions could help by requiring that data be stored in more centralized repositories after the fulfillment of projects. Also, it will help to build rewards like data citation into the research culture, and to build trust by letting a researcher know who is requesting his data and giving him the possibility to communicate, or maybe to stop a request as long as the research group that produced the data has not yet fully used it for its intended purpose.

Research councils in some countries require researchers who have obtained grants to deliver their data to centralized data repositories when the project is finalized. This is also being discussed in other countries, including Sweden. The discussion in Sweden, however, has not yet led to any firm decision by the national research councils.