What have your experiences been in the data sharing world and what do they convey to you about that arena’s benefits and problems? Did your experiences prompt you to reach larger conclusions about how data sharing should be organized in the macro picture?

Data sharing has come up in a variety of contexts in my experience. These fall into the following general categories:

(a) Data collections: These are data sets, often represented as a database, made available to a disciplinary community or sub-community. Primarily these have been focused on collections used in the context of computational science in some way, but spanning many disciplines (e.g. protein structure databases, social sciences statistics, instrument and observational data).

(b) Sharing research results: In the world in which I most often work, copious amounts of simulation data are produced that many researchers would like to share, but they have not done so very effectively. Some teams are attempting to adapt data sharing services built for other (primarily observational) data relevant to their community in order to share their simulation results. An example is IRIS, which holds seismological observational data; researchers from the Southern California Earthquake Center (SCEC) would like their simulation results incorporated. These efforts are non-trivial and meet with many hurdles.

(c) Projects producing data products: I think of this set slightly differently than described in (a). These are projects like the LSST, DES and others with a stated goal of collecting observational data and generating data products to be used by researchers. These are long-term projects with significant funding to support what they are doing. Unfortunately, these activities occur largely in isolation from one another from the perspective of how data is managed and shared with their respective communities.

(d) Maintaining research results: This is the set of data that must be preserved either because it is a requirement of publications of research work that the supporting data be maintained for some period of time, or due to a requirement of the funding agency supporting the research that any research data be maintained and perhaps shared. The impending NSF requirements on what they call a “data management plan” for all proposals is of particular concern for many of the researchers I work with (given that they are primarily funded via NSF).

My observation through these experiences is that there has not been anything close to a holistic approach to the broad set of needs of the research community with respect to data, data management, data preservation, curation, provenance, etc. My view is that the majority of these share similar underlying needs, with more specialized services layered on top of primitives in order to support the range of capabilities required. There is interesting work at various levels of this stack (e.g. NSF’s DataNet projects), but there is no larger view and no national strategy that addresses the full picture, from base data hardware infrastructure, through low-level data services, to support for developing higher-level, specialized services targeted to particular communities.

In your data sharing work to date and in the work of your colleagues, can you point to specific breakthroughs that occurred because raw data was freely and readily shared among researchers? Be as specific as possible in this answer.

Here are some examples of impact by making data more accessible:

http://www.ncsa.illinois.edu/News/Stories/INDICATOR/

http://www.ncsa.illinois.edu/News/Stories/18thConnect/

http://www.ncsa.illinois.edu/News/Video/2010/psp10_minsker.html

http://www.ncsa.illinois.edu/News/Stories/big_data/

http://www.cct.lsu.edu/site.php?pageID=63&newsID=1009

How did you tackle any hurdles you encountered that made data sharing more difficult to accomplish?

At NCSA and within the TeraGrid project, there are many examples of jumping such hurdles.  The bit that concerns me is that only in such larger infrastructure and support projects has there been much hope that solutions found might be shared and applied in different contexts.  So my concern is not so much how a solution was found—there are lots of smart people out there who can find innovative solutions to problems.  My concern is the leveraging of the solutions more broadly.

What is the basis for the widespread resistance to data sharing among researchers, and is any of their criticism of data sharing based on valid concerns?

The concerns expressed by researchers that I am aware of include:

  1. Desire not to lose competitive advantage in publishing research based on the data: Many researchers have expressed that they wish to restrict access to data until they have published their research work first.  In most cases this is valid and reasonable.
  2. Research data is a competitive advantage in research:  Some researchers view the data they have collected as proprietary and wish to have an ongoing restriction on access in anticipation of possible future publications.  While one can have some sympathy for this, data that languishes unused is a waste of the resources originally expended to collect it.

Do you see a need for a national data sharing repository in the medical arena or smaller discipline-based repositories for specialized arenas?

Some communities have created or have begun to create data repositories and, even though some are much more usable than others, in general they are a Good Thing.  It is difficult for me to speak too specifically regarding the medical arena since that term encompasses a rather broad set of possible data.  I firmly believe there are some specific areas within the medical arena where such repositories could be highly beneficial to furthering various medical sciences, diagnosis and treatment.  It is not clear to me that we have sufficiently defined the services and developed the technologies to support a broad range of data types and use modalities.  This is an area that should be actively supported as we develop more specific data resources.

What agency should run it/them and why? If you advocate a system of smaller repositories organized around specific disciplines, should there be an overview agency that coordinates/supervises interactions among databases? If so, describe the agency’s proposed function.

This is an interesting question, but perhaps is too narrow in my mind.  I see a range of repositories that should be developed.  Within this range there will be subsets that are distributed/federated.  This will likely be driven along disciplinary lines but also strongly influenced by interoperability (or lack thereof) amongst various resources.  There is a strong “organic” nature to the bringing together of existing data that must be recognized and addressed.

As such, various agencies must be involved in pushing this forward.  If the focus of this discussion is limited to the medical arena, clearly the NIH should be a major player here.  I tend to advocate the establishment of community bodies that are supported by agencies to facilitate the coordination/interactions amongst data resource representatives.  Each body needs to have a defined scope which will indicate who should participate.

In any case, a number of efforts should be supported for several reasons.

  1. A variety of more focused areas could benefit immediately from the accessibility of data.  These should be moved forward and we should reap the significant direct benefits of these efforts.
  2. Development of capabilities in a variety of areas engages a broader set of smart people to develop good solutions to problems in making data more easily shared.  While there will be some repetition of effort, a much larger cadre of individuals knowledgeable and expert in this field needs to be developed.
  3. There is a real need for a much better understanding of the broad set of needs of those interested in using these various types of data.  On-the-ground efforts will help to illuminate this rather dark space.  Those needs can then be used by this larger cadre of data professionals to understand the solutions required and, of critical importance, to begin to define the standards needed to construct the solution stacks that implement those capabilities.

This then indicates the need for a capability of the broader community to develop and establish standards.  There are some standards bodies that can be leveraged, but it will require having a sufficient community of folks to drive this process.

Some have advocated an international agency that would serve as a clearing house – knowledgeable about the work being done by data sharing projects worldwide. Do you see any value in such an agency, and if so, how hard would it be to win the international cooperation needed to create it?

It would seem this is quite dependent on the purpose and scope of such a thing.  In the academic world, international boundaries are less relevant than elsewhere.  My experience is that communities are very good at developing international organizations to support such things.  We can see this a lot in the standards space, as well as in other areas.  These, again, are something that agencies in the US and other countries need to support until such time as they can become self-supporting, if they can. I have not looked recently, but I suspect there are efforts already afoot in this space; they are not organized, however, and require some leadership to marshal them.

What are your suggestions on the most powerful ways to combat researchers’ resistance to data sharing? How effective do you see the requirements now in place by the NIH and high-profile journals requiring data sharing as a condition of funding in some instances? Do you believe more teeth should be put into these requirements and, if so, how?

NIH and other agencies have been moving in this direction, and such requirements are certainly an effective means to induce data sharing.  New requirements from NIH and NSF are Good Things, but enforcement is where it really matters.  Historically, NSF has been effective at enforcing requirements when it chooses to do so.  I do not have enough experience with NIH to know how effective they are, but those cases I do know seem to indicate there is compliance.  Unfortunately, these are all “stick” methods and it would be good to develop “carrot” methods to encourage data sharing.  On the other hand, there is at least a significant subset of researchers who see the derived/indirect benefits of sharing their data and are thus inclined to do so.  In fiercely competitive fields, however, this is more difficult.

Other incentives to enhance scientists’ acceptance of data sharing?

As mentioned, I would like to see incentives for sharing as opposed to punishments for not sharing data. Sorry… no brilliant ideas here.

If you could look down the road to a point maybe 20 years from now, how will the field of data sharing look? Will scientists be more willing to share their raw data quickly — as the Genome Project did, for example?

I do believe that there will be a general trend toward the sharing of data.  Unfortunately, this will be working against the current of increasing value of data.  In the end, owners of data typically will need to find a reason to share data in order for them to take the action to do so.  Even those that are not opposed often do not share data since it requires them to do something in order to make the data accessible.  That something can be quite onerous in some cases.  People rarely do things that require effort without having a reason for doing it.

In some communities there is a general recognition that they all can benefit from the sharing of their data (e.g. the seismologists and IRIS) – there is a sense that it “raises all boats” if you will.  In others, there are established senior researchers who share their data as a philanthropic act for their community, knowing that they themselves would never be able to extract the research results that others could. It has been my observation that there are typically such trends in most communities, but some take longer than others to develop.  In 20 years, we will still be discussing the need to share data, but at that point much more data will be shared.  The hard part is that there will be many newer sources of data and communities that are less mature with respect to this issue.

Are there specific laws and/or technological realities/hurdles that will continue to hamper or stop rapid sharing of raw data until they are addressed?

I think there are certainly the obvious legal limiters such as HIPAA and others of that ilk.  If we are to share data affected by those regulations, we must either find a way around them or have those laws changed.  From a technological perspective, the further development of standards is the key to addressing limiters.  We will continue to suffer from the simple inability to communicate our data if we do not both find means by which we can make legacy data accessible and move communities toward representations that are accessible to others.  We should learn from the experiences of communities that have been collecting data for long periods of time without standards to support sharing of the data (e.g. ecologists) and not repeat them.

John Towns bio – At the University of Illinois John Towns is director of the Persistent Infrastructure Directorate at the National Center for Supercomputing Applications (NCSA). In addition, he is Chair of the TeraGrid Forum—the TeraGrid project’s leadership body. He also serves as principal investigator on the NCSA Resource Provider/HPCOPS award for the TeraGrid project and principal investigator for the eXtreme Digital (XD) Technology Insertion Service. He comes from a background in computational astrophysics with a focus on application performance analysis. In his position at NCSA, he works to provide support to a wide range of projects across a range of science and engineering fields, using advanced computing, data, and visualization resources to do so.

In 1987 he received a bachelor’s degree in physics from the University of Missouri-Rolla and in 1990 and 1991 he received master’s degrees in physics and astronomy respectively at the University of Illinois.