Rees is a principal scientist at Science Commons’ Neurocommons Project. A computer scientist, he worked in Millennium Pharmaceuticals’ computational biology group before joining Science Commons. At the company, his work focused on large-scale curated protein interaction networks, specifically on the use of such networks in the analysis of high-throughput experimental data. He has engaged in similar work at Science Commons. Rees’ interests include the Semantic Web, biological knowledge representation, software technology and open access publishing. Rees is also an officer of the Cambridge Entomological Club and has brought to the Web 100 volumes of the journal Psyche, one of America’s oldest natural history journals. He received his PhD in electrical engineering and computer science from MIT.
Please describe in brief the goals of your Neurocommons project at Science Commons.
The Neurocommons project tests the idea that “knowledge bases” can be usefully integrated in an open, modular, extendable, manner using RDF and OWL. We have developed a particular platform philosophy around extensibility and reproducibility and used it to combine information from NCBI (National Center for Biotechnology Information), NLM (National Library of Medicine), EBI (European Bioinformatics Institute), OBO (Open Biological and Biomedical Ontologies), and a variety of other sources.
What drew you initially to this field of endeavor?
Life sciences provide a good demonstration of how “open source knowledge management” can be done. Alan Ruttenberg (Rees’ colleague at Science Commons) and I both had experience doing this kind of thing inside of a pharma company and wanted to do the same thing open source.
Briefly summarize the key efforts under way in the scientific community worldwide to create the technological steps to facilitate widespread sharing of raw data on the web.
We have not worked with raw data very much. The hardest part is ontologies – formalizing what a row of a spreadsheet means. In biology, look at OBI, PATO, and the GO Consortium.
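As a rough illustration of what formalizing a spreadsheet row can involve, the sketch below maps each column to a named property and emits subject-predicate-object statements in the style of RDF. The namespace, column names, and property names are all hypothetical; they are not drawn from OBI, PATO, or GO, and this is not the Neurocommons project's own code.

```python
from typing import Dict, List, Tuple

# Hypothetical namespace and column-to-property mapping, for illustration only.
EX = "http://example.org/vocab/"
COLUMN_TO_PROPERTY: Dict[str, str] = {
    "gene": EX + "measuresGene",
    "tissue": EX + "sampledFromTissue",
    "expression": EX + "hasExpressionLevel",
}

Triple = Tuple[str, str, str]


def row_to_triples(row_id: str, row: Dict[str, str]) -> List[Triple]:
    """Turn one spreadsheet row into (subject, predicate, object) triples."""
    subject = "http://example.org/observation/" + row_id
    triples: List[Triple] = []
    for column, value in row.items():
        predicate = COLUMN_TO_PROPERTY.get(column)
        if predicate is None:
            # No agreed meaning for this column, so it is skipped.
            continue
        triples.append((subject, predicate, value))
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples as N-Triples-style lines with plain string literals."""
    lines = []
    for s, p, o in triples:
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    row = {"gene": "BRCA1", "tissue": "breast", "notes": "rerun"}
    print(to_ntriples(row_to_triples("r1", row)))
```

The "notes" column is dropped because nothing in the mapping says what it means, which is the gap an ontology is meant to close.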
In your data sharing work to date and in the work of your colleagues, can you point to specific breakthroughs that occurred because raw data was freely shared among researchers? Be as specific as possible please.
No. Most science is not organized into “breakthroughs”.
What problems/hurdles have you encountered personally or what problems/hurdles have you observed generally in the scientific realm in regard to data sharing?
Data being offline (broken links). Poor documentation. Confusing license terms.
How did you tackle those hurdles?
The hard way. Poor documentation requires reverse engineering. Licensing problems sometimes mean jumping through hoops to comply, sometimes contacting author/publisher to change license terms. Often it means just not using the data (e.g. HPRD).
What is the basis for the widespread resistance to data sharing among researchers, and is any of their criticism of data sharing based on valid concerns?
People who are good at gathering data may not be good at analyzing it. So they want to hoard the data while they analyze it, and if possible, keep others from doing better analyses that might invalidate their results or produce new results that they didn’t see.
Do you see a need for a national data sharing repository or smaller discipline-based repositories?
There is a need for archival data repositories, of as many different kinds as possible. Traditionally libraries (now called “institutional repositories”) would perform the service of “holding” materials produced by any source that their users might need. I don’t know why this practice is neglected.
Ideally, materials are deposited in multiple mutually independent repositories, just as in the old days copies of a book would be sent to multiple libraries.
What agency should run it/them and why? If you advocate a system of smaller repositories organized around specific disciplines, should there be an overview agency that coordinates/supervises interactions among databases? If so, describe the agency’s proposed function.
In my view repositories should be institutional, not discipline-based, since many disciplines will be unable to sustain their own repository.
I’m not sure why libraries don’t house material generated from outside their institution; could be licensing worries, or not knowing where to start, or worries about volume, or not knowing where to look for stuff.
Do you see PubMed as a future workable repository for the storage of raw data developed by researchers in the realm of medicine? If not, why not? And if so, what hurdles would such a plan confront and how long would it take to develop such a repository?
I don’t know about PubMed, but the National Center for Biotechnology Information (NCBI) has been doing a good job with GenBank and GEO, and perhaps that role could be expanded.
Some scientists have advocated an international agency that would serve as a clearing house – knowledgeable about the work being done by data sharing projects worldwide. Do you see any value in such an agency, and if so, how hard would it be to win the international cooperation needed to create it?
I don’t see how you could say no to such a project. But I can’t imagine how it would be funded and sustained. Cooperation and coordination are expensive and difficult. Better for someone to just start doing it, and let others catch up.
What are your suggestions on the most powerful ways to combat researchers’ resistance to data sharing? How effective do you find the requirements now imposed by the NIH and high-profile journals that require data sharing, in some instances as a condition of funding? Do you believe more teeth should be put into these requirements and, if so, how?
Do everything possible to make data preparation and publication respectable in the eyes of tenure committees. See <http://neurocommons.org/report/data-publication.pdf>
My understanding is that data sharing requirements have no teeth. In one case I know of, the journal editor had to threaten an author with a retraction before the author gave in and shared the data. In many cases the data is archived on a lab web site and the link is likely to go stale in a year or two. And the integrity of the data set is not guaranteed – the investigator may tamper with it after publication. Authors should not be allowed to host their own data.
Other incentives to enhance scientists’ acceptance of data sharing?
Recognition for doing so. Require sharing in order to get published.
