When did you create the Dataverse Network and what prompted you to do so?
When I came to Harvard in 1987, what is now the Harvard-MIT Data Center was nothing more than a big pile of tapes and a long line of people. Whoever was at the front of the line was always screaming at our employees about why they needed their data now. Working in the Center was a pretty miserable experience, and it often took more than six weeks to access data. So well before the creation of the Web, we worked hard trying to automate the process. This particular incarnation, the Dataverse Network, was created in 2005.
Which institutions are now making use of the network?
There are several hundred universities, governments, scholars, and corporations that have installed the Dataverse Network software or have a virtual archive (a dataverse) served by an existing network.
Is the network especially suited for particular fields or does it have potentially uniform application throughout the sciences?
We developed it for the social sciences, but we’re branching out. Social science data comes in highly diverse types and so it’s a good place to start. We’ve had interest from many other fields, and we’re working with them.
Tell us more about Dataverse.
Dataverse has a variety of access rules built in. You can make the data completely open access; you can make it open access after a standardized click-through licensing agreement (negotiated in great detail with Harvard’s lawyers); you can have a customized click-through licensing agreement; you can give access to only a specific set of people you identify; we could even arrange for you to require a credit card number; or you can make only the metadata publicly available with the rest in an air-gapped secure facility.
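As a rough illustration only, the access options listed above could be modeled along the following lines. This is a minimal sketch, not the Dataverse Network's actual implementation or API: every name in it (`AccessRule`, `Dataset`, `Request`, `can_download`, `can_view_metadata`) is hypothetical and chosen just to mirror the options described in the answer.

```python
from dataclasses import dataclass
from enum import Enum, auto


class AccessRule(Enum):
    """Hypothetical labels for the access options named in the interview."""
    OPEN = auto()                  # completely open access
    STANDARD_LICENSE = auto()      # standardized click-through agreement
    CUSTOM_LICENSE = auto()        # customized click-through agreement
    NAMED_USERS = auto()           # only a specific set of identified people
    CREDIT_CARD_REQUIRED = auto()  # requester must supply a credit card number
    METADATA_ONLY = auto()         # metadata public, data kept in a secure facility


@dataclass(frozen=True)
class Request:
    """A hypothetical download request."""
    user: str
    accepted_license: bool = False
    gave_credit_card: bool = False


@dataclass(frozen=True)
class Dataset:
    """A hypothetical dataset with one access rule attached."""
    title: str
    rule: AccessRule
    allowed_users: frozenset = frozenset()


def can_download(dataset: Dataset, request: Request) -> bool:
    """Decide whether the request may download the data files."""
    rule = dataset.rule
    if rule is AccessRule.OPEN:
        return True
    if rule in (AccessRule.STANDARD_LICENSE, AccessRule.CUSTOM_LICENSE):
        return request.accepted_license
    if rule is AccessRule.NAMED_USERS:
        return request.user in dataset.allowed_users
    if rule is AccessRule.CREDIT_CARD_REQUIRED:
        return request.gave_credit_card
    # METADATA_ONLY: the data files are never served online.
    return False


def can_view_metadata(dataset: Dataset) -> bool:
    """Under every rule above, the metadata stays publicly visible."""
    return True
```

In this sketch, a dataset created as `Dataset("survey", AccessRule.NAMED_USERS, frozenset({"alice"}))` would be downloadable by a request from `"alice"` but not from anyone else, while its metadata would remain visible to all.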
A growing number of universities and other institutions (including private companies) have installed Dataverse Network software. And you can use it without any installations at all if you like. See http://TheData.org.
Is an incentive approach useful in trying to get scientists to share raw data?
We’re social scientists and so are interested in incentives. If investigators make data available from their web site or on request, that doesn’t last. And investigators are not professional archivists. If they don’t make the data completely available, they also can’t help but get the incentives all wrong. (Sure, you can see my data if at least implicitly you promise not to criticize me!). Alternatively, the investigators can send the data to a professional archive, but then the archive gets the credit when a new scholar analyzes the data.
With the Dataverse Network, by contrast, you as an investigator (or your project) get your own dataverse, which is a virtual archive on your web site that gives you all the web visibility and scholarly credit, but it requires no installation, backups, or maintenance. We also have developed a standard for data citation that gives credit to the author, with plenty left for the archive too. This provides lots of incentives for data sharing.
Think of the continuum of data sharing as (1) not sharing, (2) sharing under difficult, annoying, or complicated circumstances, or in hard-to-use formats, etc., but preserved and (3) open access, easy-to-use, no restrictions, and preserved. My view is that the difference between (1) and (2) is so much larger than the difference between (2) and (3) that it’s essential to first get data preserved (most, after all, just vanish) and then to worry about access rules, ease of use and other issues later on. If all we can do is to make the metadata and documentation public, fine. But we can always make at least the metadata available, and we can almost always do a lot more.
Talk about data sharing’s proponents and opponents.
Obviously, a political movement like this needs supporters, and so the more the better. But the argument against often comes from younger or less established investigators who are afraid of being scooped. They need to understand that the danger isn’t being scooped; it’s being ignored. In fact, those who share data are cited much more frequently than those who do not, and so it is in everyone’s interest to share data.
How should data sharing work?
Not only should there be sharing of data, but there should be sharing of the systems that make data available. Dataverse Network software is open source and legally owned by the community rather than us.
Is security of the data much of an issue?
As data in the social sciences become exponentially more informative (think about Google street views, credit card transactions, continuous-time information from your cell phone, electronic medical records, and so on), security concerns become more and more essential in our area too. The community of those who work on Dataverse Network software thus continuously upgrades its security features as well.
Do you see a need for a national data sharing repository or smaller repositories for specialized arenas?
Other countries do this through the government. The US is more decentralized. The idea of the Dataverse Network is to keep control and credit local and distributed everywhere, but to keep the responsible archival features associated with large institutions capable of making long term commitments to preservation.
If you see such a need, what agency should run it/them and why? If you advocate a system of smaller repositories organized around specific disciplines, should there be an overview agency that coordinates/supervises interactions among databases? If so, describe the agency’s proposed function.
I don’t think there should be a czar of data. Lots of organizations have tried to put themselves in this position, but the incentives are all wrong and it doesn’t work. It’s not even necessarily good if it does. Instead, we need to make absolutely sure that individual researchers get scholarly credit and web visibility, and at the same time that preservation, backups, disaster recovery, formatting, networking, discoverability, and all the other difficult features are handled by institutions that can provide those services. Individual scholars aren’t going to use the system unless it benefits them, fits with their goals, and is easy enough to use that it doesn’t distract from their research enterprise.
What are your suggestions on the most powerful ways to combat researchers’ resistance to data sharing?
Make it incentive compatible. If you have a dataverse and share your research data, you will be cited more and have more web visibility. In empirical studies, these are associated with higher rates of promotion and higher salaries. The risk, we need to explain, isn’t being scooped; it’s being ignored.
How effective do you find the data sharing requirements now put in place by the NIH, the NSF and high-profile journals? Do you believe more teeth should be put into these requirements and, if so, how?
They’ve been helpful but not sufficient. Most of the world’s really large data sets funded by these organizations are well preserved. But most of the data sets, subsets, code, and other replication information associated with individual publications remain unavailable. This means that the bulk of work in the scientific literature is difficult or impossible to replicate.
Are there other incentives you believe would enhance scientists’ acceptance of data sharing?
More web visibility and citations to data, not only citations to the published article associated with the data. With dataverse — or with some other system using the protocols we have created — you get both.
Twenty years hence, how do you see data sharing operating within the scientific community in the U.S. and worldwide? Will there be a national repository for raw data in the medical arena or in any other major disciplines?
A national repository seems like a difficult political challenge. But the scientists among us should not expect government to solve our problems. We need to get our act together on our own. It’s our job, it’s our mission, it’s our chosen task. Our goal is to create, distribute, and preserve knowledge, and sharing data enhances all three.
Are there observations about the realm of data sharing that you would like to make in addition to the questions I have posed?
We need more forums like yours to spread the word. Thanks!
Gary King bio – At Harvard University, Gary King is the Albert J. Weatherhead III University Professor; along with 21 other distinguished faculty at Harvard, he holds the title of university professor. He also is the director of the Institute for Quantitative Social Science. His work focuses on applying empirical methods across the social sciences. He is particularly interested in individual projects that span the range from innovative statistical theory to new practical applications. He is also a member of six honorary societies, including the National Academy of Sciences, the American Statistical Association and the American Association for the Advancement of Science. He is the recipient of more than 30 “best of” awards for his work, including the Career Achievement Award in 2010.
In 1980, King received a B.A. from SUNY New Paltz and in 1984 received his Ph.D. from the University of Wisconsin-Madison. Agencies supporting his work include the National Science Foundation, the Centers for Disease Control and Prevention, the World Health Organization, the National Institute on Aging and the Global Forum for Health Research.

