Bio: Jeannette M. Wing is a computer science professor and department head at Carnegie Mellon University. She is also a former assistant director for computer and information science and engineering at the NSF. She earned her Ph.D. in computer science at MIT. She serves on nine journal editorial boards, including the Journal of the ACM and the Communications of the ACM. She also has served on numerous industry, academic, and government advisory boards.
Recently in her role at the NSF, she participated in the formulation of a new foundation data sharing policy, which took effect in October 2010. It requires that all proposals NSF receives include a data management plan – a supplementary document of not more than two pages spelling out how the researcher will provide others access to data collected or generated in the course of the research project.
When the policy was announced, Wing stated: “The change reflects a move to the Digital Age, where scientific breakthroughs will be powered by advanced computing techniques that help researchers explore and mine datasets.”
Wing recently left her NSF post to return to Carnegie Mellon.
What have your experiences in the data sharing world taught you about the field?
The experience I had at the National Science Foundation was enlightening because I learned that each scientific discipline has its own culture with respect to sharing data. In computer science, academics share everything because we have a culture of openness. Of course companies, such as Google and Yahoo, which collect a lot of data on individuals, rightly don’t share that information. Thus, academics who do not have access to such data have difficulty validating their theories on large-scale realistic data sets. It’s an interesting dilemma the computer science community faces more so now than ever before.
More generally, the issue is not really whether scientists share data or not, but when. It is common practice in science to share data after a publication because reproducibility of scientific results is part of doing science. Sharing data before a publication runs the risk of someone else scooping you.
In some sciences, e.g., subareas of physics, there is not the culture to share data openly early because the data you collect may lead to a scientific breakthrough, and you would like to be the first to claim that breakthrough. But in astronomy, for example through the Sloan Digital Sky Survey, scientists share data openly and immediately. And in biology, we have seen a trend toward more openness early, partly because biologists are realizing that sharing data early can expedite scientific discovery.
Different communities have different data challenges, and so far, handle them in different ways.
In October 2010 the NSF adopted a new data management policy. At the time, in your capacity as assistant director for computer and information science and engineering, you stated: “The change reflects a move to the Digital Age, where scientific breakthroughs will be powered by advanced computing techniques that help researchers explore and mine datasets.” Tell us about the thinking behind this new policy.
At the NSF we struggled with how to come up with a single policy that could reflect and accommodate the different cultures among the different scientific disciplines. We decided that we could not. Instead, we decided to ask each researcher to send us a statement on what his or her intention is regarding access to data. That’s what we are requiring – a data management plan. We thought that this solution was a reasonable compromise – it acknowledges that different scientific communities have different cultures and we weren’t about to mandate a single policy. Through the peer review process the specific scientific community of an individual researcher could use the norms of that community to judge whether the data management plan makes sense. For example, one community might expect data to be released immediately; another might accept that data be released six months after the acceptance of a publication based on that data.
Was that a major step toward increased data sharing in the scientific world?
That was really a baby step toward tackling the bigger questions about openness: What and when? Scientists benefit from open and shared access to data, the tools and software that analyze the data, the instruments and systems that help produce the data, and the publications based on the data collected and analyzed. When should access to each of these resources be given to whom? One needs to balance being fair to the individual scientist and the interests of the broader scientific world.
What has the reception been to NSF’s new policy?
I don’t hear any grumbling. I think the only issue is that the PIs are not sure what to put in the data management plan. A lot of projects don’t generate data, so there is concern that those projects will not win approval. PIs for those projects simply need to assert that their research won’t produce any data. The data management plan is not meant to be an onerous requirement.
PIs should be thoughtful about what they put in the data management plan – they shouldn’t just put down a link to where the data is stored. They need to describe the data and meta-data. They have to think about how to make their data usable by others and by software tools and systems.
Overall, I don’t think there has been a negative reaction – people understand this move makes sense in the interest of science.
Is NSF’s action a sign of the times in the data sharing arena?
It is a sign of the times not only with the NSF, but also with the Obama Administration through its open government initiative. That’s why you see government agencies posting different data sets on their websites available for public viewing. It’s the spirit of openness overall that is sweeping DC.
Even without that spirit, though, NSF would have been addressing this issue. There is so much data out there that the more people who can read the data and help do the analysis on large data sets, the more likely there will be an expedited or unexpected discovery in science.
So it makes perfect sense to put the data out there. Scholars, and even citizen scientists, can use that data to try out their own techniques, test out their theories, identify relevant information, and find interesting patterns, all of which can lead to innovation and discovery.
That is what the NSF and science are all about.
Is resistance to raw data sharing ebbing?
I think it is, but I also think that some communities will still be quite protective of their data and maybe it will take them longer to realize a culture shift is happening.
What is the basis for the widespread resistance to data sharing among researchers?
The root of the resistance is historical. It used to be that data was gold – that it took years and years for, let’s say, a biologist to produce one data point. Much time, energy, and research money went into producing that one data point, making it extremely valuable. It was your blood and sweat and tears to do the experiment and repeat the experiment, perhaps multiple times. While it still takes a lot of time and effort to run experiments, today’s scientific instruments and computational simulations are generating data at a rate faster than we can store. Ironically, because we can’t store everything, we are in danger of throwing away the very data point that would give evidence to a scientific phenomenon we hope to discover or prove. We are drowning in data. We now rely on data mining algorithms to extract knowledge from the data we are drowning in. Data is dirt. Knowledge is gold.
Do you see a need for a national data sharing repository in the medical arena or smaller discipline-based repositories for specialized arenas?
If you think more broadly, such a repository is not just about medical data sharing. It is about all scientific data. We could be thinking of a national digital archive where that data is stored and managed on a national scale. It would be akin to the Library of Congress. The problem is deciding what to store, organizing that data for easy access by different communities, and ensuring what you store can be read years later. Who should be in charge of such a national digital archive? What is the role of the federal government? What is the role of university libraries? For example, maybe the repositories should be distributed so there is no one central repository. Then you need to provide open access to all those repositories. Do you have discipline-specific repositories? These are the kinds of questions the broad scientific community should be asking. We might throw in the digital humanities and the arts while we’re at it. It’s a real problem and a real challenge.
What’s the NIH’s role here?
The NIH at least needs to be thinking about what it should do with the data its scientists gather – whether it’s in a single repository or a set of repositories – what to keep, how to retrieve from it, how to manage it, how to provide access to it. Ideally, the NIH would work with other federal agencies on these issues since their data sharing problems are similar and the solutions can be shared. Medical data, especially patient data, of course, raise privacy concerns that other scientific data do not.
In general, federal science agencies should have a broader scientific discussion about data sharing. Should there be a collective strategy on scientific and medical data? You can leverage one infrastructure built for use by all. Rather than each community spending time and money on its own, a concerted effort across communities can be a more effective use of resources.
What agencies are key in this arena?
The Office of Science and Technology Policy plays an important role. The National Science and Technology Council, which is chaired by the President and the Director of OSTP, reaches all the federal agencies that fund research and development in science and engineering; they in turn reach all the scientific and engineering communities. In fact, NSTC’s Committee on Science has an interagency working group specifically on digital data. So the Administration is very engaged and aware of the challenges presented by the deluge of digital data in science and engineering.
Do emerging technologies sometimes exacerbate the data sharing problem?
Two trends: cell phones and the cloud.
Consider cell phones. There are 4.6 billion cell phone subscribers in the world. Imagine if we decided to collect all the streams of real-time data that cell phones can collect. Imagine that each person with a cell phone is a sensor in a network the size of the population of the world. That’s a lot of data! With this data, we have massive amounts of information not just about many individuals but also about populations of people.
Now consider the cloud. The cloud provides the illusion to users of an infinite data store. People keep their email, photos, videos, and all personal information and communications in the cloud.
Now put the two together. Your cell phone is your portal to cyberspace. All the data you generate and collect through your cell phone can potentially be stored in the cloud, accessible anytime, anywhere, by anyone. The enormity of all this produced and stored data exacerbates the data sharing problem. Who has access to this data? Who gets to analyze it? Suppose you want to delete some data? How are citizens’ privacy concerns addressed?
These questions raise technical, cultural, and legal challenges. As technology evolves, it helps in terms of framing solutions, but it also is opening and raising new problems.