Jean Bernard Minster is a University of California-San Diego distinguished professor of geophysics and chair of the International Council for Science’s World Data System Scientific Committee (ICSU WDSSC). He has been involved with data issues for over two decades, through the National Research Council’s Committee on Geophysical and Environmental Data, which he chaired for a while. In that capacity he chaired a number of review committees of NOAA and USGS National Data Centers and the eight NASA Distributed Active Archive Centers (DAACs). He also chaired or served on several NASA advisory committees on data issues, was chair of the NASA Global Change Master Directory Users Working Group, and was a founding member of the International GPS Service (now the International GNSS Service), a member of the Federation of Astronomical and Geophysical Data Analysis Services.
He also has worked for many years with the International Council for Science (ICSU) – in 2008 helping establish a prototype World Data Center for Biodiversity and Human Health in Pretoria, South Africa. He says when this center is fully operational, it will be the first WDC located in the southern hemisphere dealing with such data sets. (The majority of WDCs store and distribute data in geophysics, solar physics and space sciences.) Most recently, he chaired the American Geophysical Union (AGU) panel on data policy and helped draft the current AGU data policy statement. He is now chair of the recently established ICSU World Data System Scientific Committee.
What has your discipline’s experience been with data sharing?
In my discipline, geophysical and environmental data, there is a long history and culture of data sharing, mostly because the science itself depends critically on such sharing. As a result the majority of data producers do share their data products. For the past half century, the World Data Centers have implemented a “full and open” data exchange policy, with no restriction on the distribution of data sets in their holdings, at no charge or at a modest charge not exceeding the Cost of Filling User Requests (COFUR).
What are the main hurdles to data sharing?
The main problem is to secure broad participation of the scientific community and to give proper credit for data curation and data sharing efforts. This raises issues of long-term funding of such efforts. Other hurdles arise when data acquire commercial value. For instance, when American commercial weather prediction companies started marketing weather predictions in Europe, using satellite data distributed freely by European countries, there was pushback within the century-old World Meteorological Organization (WMO), and the WMO passed “resolution 40,” which restricted the distribution of certain data sets by setting up a “two-tiered” data system. More recently a similar discussion arose in oceanography, but the ocean science community successfully avoided going down that road.
Of course there have been, and will continue to be issues of national security raised by most countries. These issues provide legitimate reasons to limit the distribution of certain data sets.
Intellectual property rights are always an issue, especially with data sets collected under non-government (commercial) funding. The protection of intellectual property rights is the purpose of a variety of laws worldwide. For example, the “European Database Directive” is a somewhat extreme case of how such concerns may impact data sharing.
A perennial issue is that of long-term data curation. Now that we have largely done away with paper archives, and now that digital data sets are being accumulated at the rate of many terabytes/day, the problem of “migrating” these data sets to even newer storage technologies is becoming very onerous, and a major concern of data center directors and managers. The National Archives and Records Administration (NARA) is funding research to develop ways to deal with this and related issues.
How have you tackled the hurdles you have faced?
Each case is different. For instance, in the case of the WMO resolution 40 that I mentioned earlier, I helped draft a letter from the NRC to Vice President Gore that ultimately led to the US taking a strong position to curb the proposed sharing restrictions. In general, in the US, I found that a National Academy study is a rather effective way to develop a balance between different interests.
Do you see a need for a national data sharing repository or several smaller repositories for specialized arenas?
It is essential to the effective operation of any data repository that it have on-site researchers who actually use the data. This is the only effective way that was ever identified to ensure data integrity, metadata correctness, etc. Consequently, I am very much in favor of smaller repositories that are discipline oriented.
I have chaired numerous reviews of data centers over the years. Every single time the question arises when we ask the director of a center, “Do you have any resources on site?” If there are no resources on site, that is a sure sign someone is just going through the mechanics of saving data and has no way of detecting flaws in the data system. The only sure way of deciding whether a data system has errors is if a professional user catches it or recognizes that something is wrong. Otherwise, errors would go undetected and this would happen no matter how careful you are. Errors always happen. An instrument may be miscalibrated or someone copied the metadata erroneously and didn’t correct the data fields.
On the other hand, multi-disciplinary and interdisciplinary science has grown tremendously in the past decades, and that requires access to many diverse data sets.
This means that interoperability of the smaller repositories is becoming ever more important. One way (but not the only way) to achieve that is to federate the depositories and to provide a forum where interoperability issues are raised, discussed and resolved. This is an area of very active research supported by almost all US agencies.
What agency should run it/them and why?
Again, each agency should shoulder the responsibility for long-term curation, distribution and sharing of data that it collects. In some cases, this might be achieved by transferring the data sets to another agency that is better equipped to handle the task.
To mention an example: the NSF-sponsored Ocean Observatories Initiative is being developed using the Amazon “cloud” as an environment to handle its data system. This is a very convenient and flexible approach that turns out to be quite cost-effective for managing and exchanging very inhomogeneous data sets. However, this strategy does not address the long-term management and curation of these data. No single, obvious solution has been identified to handle that. The same difficulty is being faced by international scientists who participated in the International Polar Year (IPY). This is where the new ICSU WDS might (and should) play a useful role.
Elaborate on the opposition to a national repository?
I am not “opposed” to the notion of a national repository. NARA manages such a repository. However, it would be a stretch to imagine, say, climate scientists doing research with data stored and managed by the National Archives. Nor would you want genomic data sets to be managed by the National Climatic Data Center, which has a phenomenally large collection of weather data covering many decades, and therefore stored on many different analog and digital media. Similarly, a single experiment at a major particle accelerator facility will generate many terabytes of data in a fraction of a second, and capturing, verifying, storing and distributing such data absolutely requires the combined talents of physicists, computer scientists and engineers. In contrast, ecological data sets are typically minuscule, extremely inhomogeneous, and collected almost “by hand.” I cannot imagine a single organization, no matter how large and well-funded, that would be able to deal equally well with such different data collections.
At the same time, I can certainly imagine a forum where central issues, such as metadata standards, can be debated in a way that makes all the disciplines advance. Such forums do exist, and make good use of the rapid development of the web in order to function on a global scale.
Do you have suggestions on the most powerful ways to combat researchers’ resistance to data sharing?
The most straightforward way to promote data sharing is to give proper credit and recognition to data sharers. Look at the AGU data policy and the CODATA Task Group on data citation. Such recognition can be achieved nowadays through fully peer-reviewed publication of fully vetted data sets.
The day when sharing data leads to a researcher being able to add a publication to his/her resume, get promoted, get tenure, receive honors and prizes, etc., will be the day when data sharing becomes a generally accepted, and expected, behavioral norm. This may not happen at the same time in all disciplines, because the hurdles to be overcome are very discipline-specific.
Certainly, when there is enormous monetary value to a data set (e.g. a new chemical product, a specific genomic sequence, a mineral or oil deposit image, etc) these hurdles will only be overcome if sharing is accompanied by rigidly enforced protections. Of course, national security is yet another issue. For instance, the 30-meter resolution data from the Shuttle Radar Topography Mission (SRTM) has not yet been released for most of the world, after 15 years! Google Earth still uses lower resolution data in most places.
Describe what you mean when you speak of the growth of the interdisciplinary sciences and what that means for data collection.
Climate science, for example, is by essence very much interdisciplinary. If you want to study something like climate and look at predictions of the evolution of the climate, you need data on ocean color, ocean temperature, remote sensing on land, someone who understands seasonal migrations of animal species, changes in land cover, someone who understands the technological underpinnings of remote sensing. It goes on and on and on and after all that is done, you still want to save the data from the simulations you create in addition to the observations: huge volumes of data are generated by computer programs.
In fact, it should be recognized that nowadays, in many disciplines, the bulk of data accesses have very little to do with humans. Most data are in fact requested and accessed by computer programs “on the fly” through computer-to-computer communications. This requires a repository, or in fact a set of repositories, capable of delivering data sets with very small latencies (otherwise the requesting program is most likely to crash). This requirement is particularly obvious in the case of emergencies: for instance, in the event of a large earthquake, it would not do to wait for a human operator to fumble with accessing seismic data sets “by hand”. In the event of a hurricane, very large volumes of data get piped automatically through many repositories and assimilation systems running on supercomputers. This, more and more, calls for very high bandwidth communications (we are talking about 10-60 Gbits/sec, which is about 3 orders of magnitude faster than fast Ethernet connections). Again, even if a single national repository were equipped with such capability, it would still have to deal with 10s or 100s of simultaneous users, most of which are computer programs running on other computers.
Interdisciplinary data sets raise additional complexities: these data sets are not homogeneous. You may need something that models the topography of the land because this is how you decide how climate change will affect things such as floods. You may need to understand gravity data (from both ground and space measurements) because this will tell you about water table changes. If you are interested in the effect of climate change, you need to recognize the growth of swamps and the temperatures that may cause the spreading of disease vectors (e.g. anopheles and malaria). If you want to look at the effects of exploitation of rain forests, you need to be able to combine ground observation data sets – millions of them – with data sets that come from remote sensing from airplanes and satellites. That in turn may require you to process kinematic GPS and Inertial Navigation System (INS) data. Satellite data sets may be relatively few but can be very, very large. They will involve multi-spectral optical data, lidar data, and radar interferometric data, each of which calls for a different processing chain, and yet all of which have to be integrated into the modeling. And so on!
So you have this terrible multidisciplinary problem that cannot be handled in a single data set. What you want is a mechanism by which you can seamlessly access those other data sets and then you can place the same sets in the same geographical reference frame. Where the raw data actually reside is not the relevant concern. That these data sets and the associated metadata are managed and maintained by disciplinary professionals who understand the data —instead of merely knowing how to index them and placing them in some storage unit—is what really matters.
You mentioned the possibility of a move “to federate the depositories and to provide a forum where interoperability issues are discussed and resolved.” What’s going on in this area now?
The forums (that have been established in an attempt to oversee interdisciplinary approaches) tend to be ad hoc because they are often not very well funded: No one in government is going to make a career by funding long-term data management. You don’t get a Nobel Prize for doing that. … You go to Congress and say we need money to manage data and they might stare at you and say, “You have your data, what more do you want?”
You say there is an increased need for recognition of scientists’ data collection and curation skills?
You know that professors and researchers tend to get promotion or tenure based only on research and publications. If a professor spends a lot of time collecting data sets, there is no easy or standard way to publish the outcome of his/her efforts at this moment. Even if you find a way to publish it, most of the time it is not something that you put in your resume. Data collection is not usually considered a major peer-reviewed publication. This is a mind set in academia that dates back to the late 19th century or early 20th century. We need to change that mind set. Some of my colleagues are devoting their lives to making sure data are correct and well-maintained and shared without restrictions with other colleagues in their discipline. That career path should be as rewarding as the more standard ones.
