Gardner is a professor of physiology and biophysics and a professor of neuroscience at Cornell University’s Weill Medical College. His recent work has been funded by the National Institutes of Health Blueprint for Neuroscience Research and by the Human Brain Project/Neuroinformatics through the National Institute of Mental Health, together with the National Institute of Neurological Disorders and Stroke, the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and the National Science Foundation.

Gardner led the initial phases of the Neuroscience Information Framework, an NIH-funded effort that created a central resource through which neuroscientists could explore the large number of web-accessible neurodatabases, computational tool sites, and portals providing neuron and brain information and materials. This resource was created for the NIH by a multi-institution consortium directed by Weill Cornell’s Laboratory of Neuroinformatics. In addition, the laboratory is engaged in a multi-year project looking into two fundamental questions of neuroscience: neuronal identity and the coding of neuronal signals.

What difficulties are posed by data sharing?

I have noted in my published work that there are several major difficulties associated with establishment of domain-specific and technique-related databases, and that these are both technical and human.

  • There is little incentive for experimentalists to make their data available to their peers, and this is likely to be the case until either journals or funding agencies require it. The barriers include those that can be overcome: With no funding or incentive, even experimenters sympathetic to the idea of data sharing are unlikely to take the time and effort. Others, to whom datasets are rare and valuable currency, are understandably reluctant to make these available before they have drawn as many conclusions from them as they can.
  • There is similarly little incentive to develop new databases for scientific data, other than in rare cases where projects are funded to do so, and this is unlikely to change soon.
  • In many fields–and I offer neuroscience as the one I know best–a very large number of individual labs generate very large amounts of data. It is certainly possible but very difficult to build a large archive to accept terabytes and petabytes of data. It is even more difficult to get the data to such an archive, given that most lab data is not presently web-accessible (this is the ‘hidden web problem’ that has been discussed a lot).
  • An archive is no use unless it is searchable, indexed, or otherwise easily accessible to directed queries. Again, the datasets in many fields are large numeric vectors or matrices that are of no use unless accompanied by large amounts of metadata, both numeric and text. Such metadata can be quite complex in meaning, and requires serious standardization efforts that particular experimental communities must agree on. Then the creators or submitters of such data must also become familiar with these standards and then take the time to annotate their datasets with the metadata.
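
The last point can be illustrated with a minimal sketch, assuming a toy in-memory archive. The class name, field names, and metadata keys below are hypothetical and do not come from neurodatabase.org or any agreed standard. The sketch only shows why a numeric vector without agreed metadata is invisible to a directed query.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedDataset:
    """A numeric recording plus the metadata that makes it findable (hypothetical schema)."""
    samples: list[float]
    text_metadata: dict[str, str] = field(default_factory=dict)
    numeric_metadata: dict[str, float] = field(default_factory=dict)


def find_datasets(archive, **required_text):
    """Return datasets whose text metadata matches every required key/value pair."""
    return [
        d for d in archive
        if all(d.text_metadata.get(k) == v for k, v in required_text.items())
    ]


# Toy archive: two annotated datasets and one submitted without metadata.
archive = [
    AnnotatedDataset([0.1, 0.4, 0.2],
                     {"technique": "intracellular", "region": "cortex"},
                     {"sampling_rate_hz": 10000.0}),
    AnnotatedDataset([1.2, 0.9],
                     {"technique": "extracellular", "region": "cortex"},
                     {"sampling_rate_hz": 20000.0}),
    AnnotatedDataset([3.3, 3.1, 2.8]),  # unannotated: no directed query can reach it
]

matches = find_datasets(archive, technique="extracellular", region="cortex")
print(len(matches))
```

In this sketch the unannotated dataset is stored but never returned by any query that names a criterion, which is the practical cost of skipping the annotation step.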

How did you tackle these difficulties?

The teams I headed and I needed a great deal of time and extramural funding to develop the databases–such as neurodatabase.org–and the meta-databases–such as the Neuroscience Information Framework (NIF). These were not easy to build, and their successful development drew on my personal background in both neuroscience and computing, and on the similar backgrounds of my colleagues. It is possible to build upon these efforts–for example, we have built a neurodatabase construction kit–to reduce the energy barrier for others.

One of the barriers to data sharing is that there do not exist suitable repositories for many types of data, either because there are not yet standards for how such data are to be represented or archived, or because they require metadata that have not yet been agreed to, or because incentives and support for the needed development effort are lacking.

Also, PubMed has a link-out capability, which we utilized for the NIF project, and this allows it to serve its primary purpose and to be used to point the way to databases that readers can then access. This is an excellent way to increase data sharing with minimal changes to an existing infrastructure. PubMed itself can’t be a repository, for reasons described below.

Because of the scale, the many different types and formats of data just in the neurosciences, and the need for technique-, level-, and domain-specific metadata, the idea of a single repository for all data is unworkable even here. Front ends such as the NIF can serve to direct both users and queries to the appropriate repository, and that is the reason they were conceived. For all science, the idea is completely unworkable.
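
The front-end idea can be sketched roughly as follows. This is an illustrative assumption, not a description of how the NIF is implemented: the registry contents, the `.example` host names, and the `route_query` function are all hypothetical.

```python
# Hypothetical registry mapping a data type to the repository that holds it.
REPOSITORIES = {
    "electrophysiology": "neurophys-archive.example",
    "imaging": "imaging-archive.example",
}


def route_query(data_type, terms):
    """Direct a query to the repository for its data type, or None if none is registered."""
    repository = REPOSITORIES.get(data_type)
    if repository is None:
        return None
    return {"repository": repository, "terms": list(terms)}


routed = route_query("electrophysiology", ["cortex", "spike train"])
unrouted = route_query("genomics", ["cortex"])
print(routed)
print(unrouted)
```

The front end stores no data itself. It only knows which specialized repository to point to, so each repository remains free to use its own formats and metadata.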

You present several questions dealing with hypothetical agencies and agency control, but I see no need for any of this. From the start, we have used a publication metaphor for databases, and there is no ‘agency’ regulating journals.

What methods have you pursued to address concerns about data sharing?

In addition to recognizing the reluctance, or the practical objections that many have to sharing their data, it is important for resource or archive maintainers to appropriately recognize the ethical and intellectual property concerns associated with re-use of data. At neurodatabase.org, which we developed in the Laboratory of Neuroinformatics, users cannot access data without agreeing to the following, which we offer as a model to the data sharing community:

To enter the database, please read and acknowledge the following conditions. Clicking on the link below indicates your assurance that you will comply with these rules.

Each dataset and metadata description archived in this database remains the intellectual property of the individuals, laboratories, or organizations responsible for the recording, processing, annotation, and submission of the attributed data.

Use of these data requires recognition of contributions of the above parties. For published datasets, this must include citation of literature references accompanying datasets. For unpublished datasets, this should include a citation of the form: (investigator(s) name(s), databased dataset(s)). Extensive re-use requires explicit permission of the submitter; in some cases, an agreed-upon collaboration may be appropriate.

We also ask that re-use of any data from this site include as well an acknowledgment such as: “Data used in this study were delivered via neurodatabase.org, a neuroinformatics resource funded by the Human Brain Project.”

I acknowledge these conditions and my use of any data from this database will be compliant.

Many of the ideas I lay out above second those of your other correspondents, and so probably represent current consensus.

I note in particular those from MSKCC’s Vickers:

It is difficult for me to see how one national resource could meet the needs of everyone. …I think the way it is organized and the rules for what kind of data is stored is going to have to be decided by disciplines separately…. I don’t see the value of an agency to co-ordinate the small registries.

He proposes another interesting idea:

…journals should refuse to publish re-analyses of raw data without an author of the original data being a co-author, or offered a right of reply.

One of my comments above was also addressed by Parsons:

… the agency that funded the grant that produced the data should also be responsible for funding a repository for the data.

As I noted, this is not always easy.

I also appreciated the suggestion by Minster:

… I can certainly imagine a forum where central issues, such as metadata standards, can be debated in a way that makes all the disciplines advance.

© 2010 Daniel Gardner