Featured Articles

DataCite Summer Meeting highlights issues and advancements within data sharing communities

Late this August, California became a haven for proponents of data-sharing as the California Digital Library played host to the annual DataCite meeting in Berkeley. DataCite is a non-profit organization which aims to promote the sharing and re-use of research data by helping to provide tools to support a global infrastructure for data archiving, access, and citation. DataCite is composed [...]

Interview with Stuart Shieber

Professor Stuart Shieber directs the Office for Scholarly Communication at Harvard University. He is also a professor of computer science in Harvard’s School of Engineering and Applied Sciences. As a strong open-access advocate, he has led a multi-year effort to shape Harvard’s policies in this arena.

Interview with John Towns

We will continue to suffer from the simple inability to communicate our data if we do not both find means by which we can make legacy data accessible and move communities toward representations that are accessible to others.

Interview with Gary King

Scientists should not expect government to solve our problems. We need to get our act together on our own. It’s our job, it’s our mission, it’s our chosen task. Our goal is to create, distribute, and preserve knowledge, and sharing data enhances all three.

Interview with Jeannette Wing

Bio: Jeannette M. Wing is a computer science professor and department head at Carnegie Mellon University. She is also a former assistant director for computer and information science and engineering at the NSF. She earned her Ph.D. in computer science at MIT. She serves on nine journal editorial boards, including the Journal of the ACM and the Communications of the ACM. [...]

Sharing data from clinical trials: where we are and what lies ahead

Published: July 30, 2013

Elizabeth Loder, associate editor

The drive to make clinical trial data more accessible has garnered widespread international support, but rearguard actions by the drug industry could delay substantial change. Elizabeth Loder looks at international developments in the sharing of clinical trial data.

 

BMJ Volume 347:f4794, July 30 2013
Reprint Full text
By susan with 3 comments
Not by Metadata Alone: The Use of Diverse Forms of Knowledge to Locate Data for Reuse

Ann Zimmerman
School of Information
University of Michigan
105 S. State Street
3438 North Quad
Ann Arbor, MI 48109-1285
USA

email: asz@umich.edu

An important set of challenges for eScience initiatives and digital libraries concerns the need to provide scientists with the ability to access data from multiple sources. This paper argues that an analysis of scientists’ reuse of data prior to the advent of eScience can illuminate the requirements and design of digital libraries and cyberinfrastructure. As part of a larger study on data sharing and reuse, I investigated the processes by which ecologists locate data that were initially collected by others. Ecological data are unusually complex and present daunting problems of interpretation and analysis that must be considered in the design of cyberinfrastructure. The ecologists that I interviewed found ways to overcome many of these difficulties. One part of my results shows that ecologists use formal and informal knowledge that they have gained through disciplinary training and through their own data-gathering experiences to help them overcome hurdles related to finding, acquiring, and validating data collected by others. A second part of my findings reveals that ecologists rely on formal notions of scientific practice that emphasize objectivity to justify the methods they use to collect data for reuse. I discuss the implications of these findings for digital libraries and eScience initiatives.

Keywords Data reuse · Data sharing · Ecology

full article

By Editor with 2 comments
New Knowledge from Old Data: The Role of Standards in the Sharing and Reuse of Ecological Data

Ann Zimmerman
School of Information
University of Michigan
105 S. State Street
3438 North Quad
Ann Arbor, MI 48109-1285

email: asz@umich.edu

In this paper, I analyze the experiences of ecologists who used data they did not collect themselves. Specifically, I examine the processes by which ecologists understand and assess the quality of the data they reuse, and I investigate the role that standard methods of data collection play in these processes. Standardization is one means by which scientific knowledge is transported from local to public spheres. While standards can be helpful, my results show that knowledge of the local context is critical to ecologists’ reuse of data. Yet, this information is often left behind as data move from the private to the public world. The knowledge that ecologists acquire through fieldwork enables them to recover the local details that are so critical to their comprehension of data collected by others. Social processes also play a role in ecologists’ efforts to judge the quality of data they reuse.

Keywords: data sharing; data reuse; ecology; objectivity; standardization

full article

By Editor with 2 comments
Interview with Susanna-Assunta Sansone

Susanna-Assunta Sansone is a Team Leader at the University of Oxford e-Research Centre, UK. There her work is focused on standards and software development to facilitate the data annotation, sharing and meta-analysis of biological, biomedical and environmental studies. She is the co-founder of MIBBI and the BioSharing initiatives.

Susanna-Assunta Sansone is a team leader at the University of Oxford e-Research Centre. There her work is focused on ontology, standards and software development.

Before her work at Oxford, she worked as a coordinator of international collaborative projects at the European Bioinformatics Institute in Cambridge.

In addition, she is the co-founder of the Minimum Information for Biological and Biomedical Investigations (MIBBI) and the BioSharing initiatives. She received her PhD in Molecular Biology from Imperial College of Science, Technology and Medicine in London.

Have you had specific experiences with data annotation and sharing, and if so, what is your experience?

When I was a ‘wet bench experimentalist’, my data were of low volume and shared mainly by email or on a disk, as text, images or in some machine-specific, proprietary format. With the rise of high-throughput experiments in the genetics, genomics and functional genomics domains, I moved into bioinformatics and developed significant experience in the area of standardization for the purpose of enabling data reporting and sharing. An increasing variety of ‘standard’ minimal information checklists, terminologies and exchange formats are being developed by international grassroots communities, such as the Genomic Standards Consortium (GSC), to enable the description of biological, biomedical and environmental studies in an unambiguous manner. If annotated in a standard manner, these studies will be comprehensible and (in principle) can be reproduced, a principle supported by the rising number of data sharing policies developed by funding agencies and large consortia.

With my team and international collaborators, I contribute to the development of some of these standards, and collaboratively we build software to empower researchers to adopt these community-defined standards.

What type of data have your collaborators shared; what sort of workload and costs did the data annotation and sharing impose on them?

I collaborate with a variety of communities, working in biological, biomedical and environmental domains. Their studies often run source material through several kinds of assays in parallel, such as genomic sequencing, protein-protein interaction assays, or the measurement of metabolite concentrations and fluxes. However, these studies are often shared only internally, or within a consortium or a set of close collaborators; in general, a subset of the studies is released into the public domain, mainly upon publication.

When these studies are shared, the main workload is the annotation, or reporting, phase. Data must be shared together with enough contextual information (i.e., metadata such as sample characteristics, technology and measurement types, instrument parameters and sample-to-data relationships) to make the resulting data comprehensible and reusable, and standards should be used to harmonize the description. Accomplishing this, however, takes time and expertise, which in many cases the researcher does not have or is not paid to provide. Standards are just ‘a means to an end’, but we need to develop (easy to use) tools to educate and empower researchers to perform basic curation tasks, by enabling them to access the emerging portfolio of community-defined standards to annotate their data in a timely and effective manner.

What problems/hurdles have you encountered personally in data annotation and sharing or what problems/hurdles have you observed generally in the scientific realm?

In addition to ethical and security issues and the concern that others will exploit the data, the barriers to sharing remain significant for three more reasons. First, there is an increasing variety of standards, and the evolving landscape is still quite unstable. Second, there is a lack of (easy to use) tools that enable researchers to access the emerging portfolio of standards. Lastly, there is the difficulty of using shared data, which in turn can only further discourage the willingness to share. Shared data is of little value if it is not sufficiently well annotated in a standard manner.

How did you tackle those hurdles?

With my team and collaborators we work to tackle both standards and tools-related hurdles, in parallel.

Dr. Dawn Field and I have founded BioSharing (http://biosharing.org) to expedite the communication and the production of an integrated, standards-based framework for the capture and sharing of high-throughput genomics and functional genomic bioscience data in particular. This project stems from i) the initial work published in Science in collaboration with a range of representatives from US, UK and European funding agencies (Field, Sansone et al. 2009) and ii) the MIBBI project (Taylor, Field, Sansone, 2008), which we established with Chris Taylor in 2006. BioSharing works at the global level to build stable linkages, in particular between journals and funders implementing data sharing policies and well-constituted standardization efforts in the biosciences domain. This objective is achieved via the creation of web-based catalogues of policies and standards (minimal information checklists, terminologies and exchange formats) and a communication forum. In this first phase we are working on prototypes of the catalogues, which will be enriched and enhanced iteratively. As these become increasingly stable, we will move into the next phase to promote and coordinate interactions among what otherwise might be an increasing variety of non-interoperable standards. The BioSharing catalogues aim at:

  • Providing a “one-stop shop” for those seeking data sharing policy documents and information about the standards and technologies that support them;
  • Exposing core information on well-constituted, community-driven standardization efforts and linking to their reporting standards;
  • Linking to existing complementary portals, such as MIBBI (http://mibbi.org) and BioPortal, as well as open access resources, such as BMC Research Notes and Nature Precedings, with documents or publications on standards, and to standards-compliant systems and research data.

With my team, we also work on the ISA software suite (http://isa-tools.org; Rocca et al, 2010), an open source effort in collaboration with many international groups that helps researchers annotate and share their data. The tools are targeted at curators and experimentalists and:

  • assist in the reporting and local management of experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) from studies employing one or a combination of technologies;
  • empower users to adopt community-defined minimum information checklists and terminologies, where required;
  • format studies for submission to a growing number of international public repositories.
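
As an illustration only, the idea of checking experimental metadata against a minimum information checklist could be sketched as follows. This is not the ISA tools' actual format or API: the StudyRecord class, the field names and the missing_fields helper are all hypothetical, chosen to mirror the kinds of metadata listed above (sample characteristics, technology and measurement types, sample-to-data relationships).

```python
from dataclasses import dataclass, field

# Hypothetical checklist: these field names are illustrative only and are not
# taken from MIBBI, the ISA tools, or any published standard.
REQUIRED_FIELDS = (
    "sample_characteristics",
    "technology_type",
    "measurement_type",
    "data_files",
)


@dataclass
class StudyRecord:
    """Experimental metadata for one sample (assumed, simplified structure)."""
    sample_id: str
    sample_characteristics: dict = field(default_factory=dict)
    technology_type: str = ""
    measurement_type: str = ""
    data_files: list = field(default_factory=list)  # sample-to-data relationship


def missing_fields(record: StudyRecord) -> list:
    """Return the checklist fields that are still empty in this record."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


# A partially annotated record: two checklist fields have not been filled in.
record = StudyRecord(
    sample_id="S1",
    sample_characteristics={"organism": "Mus musculus"},
    technology_type="DNA microarray",
)
print(missing_fields(record))  # ['measurement_type', 'data_files']
```

A tool built along these lines could flag incomplete annotation before a study is formatted for submission to a repository, which is the point at which missing context becomes hardest to recover.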

Do you see a need for a national data sharing repository or smaller repositories for specialized arenas?

In addition to the main institutes, such as NCBI (http://www.ncbi.nlm.nih.gov/), there are many groups that have strong expertise in a specific area of science and are also skilled at developing specialized systems. Our collaborators, for example, have successfully deployed the ISA software components to enable data reporting and sharing for stem cell data. The Harvard Stem Cell Discovery Engine (SCDE, http://discovery.hsci.harvard.edu) brings together stem cell-based experimental systems and high-throughput data from the Harvard Stem Cell Institute and other research communities, including data from public repositories, in a common ‘standardized’ manner. Their re-annotation and harmonization work, using the community-defined standards served via ISA tools, is of pivotal importance to those researchers working with stem cells, in particular, but also to the scientific community at large, working on the meta-analysis of related datasets.

Do you see value in a centralized repository of data?

The whole argument of centralized vs. federated databases has been discussed at length; I believe a central system cannot cater for everybody’s needs, and there is expertise in the community that should also be leveraged. So often the best solution is a mixed approach. Obtaining rolling funds to maintain each database is, of course, the main issue, and the other is the adoption of widely accepted common standards. If the latter issue were solved, then it would be easy to move information from one system to another.

The agency’s proposed function in this specific case can be two-fold: to support, and progressively enforce, the use of these community-defined standards in data management and in grant applications, and to ensure that applicants evaluate the reuse of open source tools prior to developing a new system. However, only a few agencies actively monitor adherence to the proposed plans, and even in these cases the execution of such plans is rarely scored. Unfortunately, it is often a prerequisite of a grant proposal to develop something, and it is often easier for a developer to create something de novo in order to have full control of what can be done. The result is today’s problem: an unnecessary duplication of effort in many cases. We have to deal with a variety of (arbitrarily) different and incompatible standards, even in the same domain, which limits the development of interoperable tools to enable data sharing.

From a technical perspective it will be necessary to both remove redundancies and fill gaps between standards. These are difficult but not insurmountable tasks. By contrast, the sociological barriers involved in these kinds of large-scale collaborations can be far more challenging, and extensive liaison is necessary between communities. Managing this process of consensus-building from start to finish takes time, resources, and expertise. Because of a lack of resources, very little time is often invested in these efforts to build commonalities and synergies among projects. The massively collaborative nature of this undertaking requires frequent face-to-face workshops to create the necessary conditions for the building of consensus.

Utilizing publicly funded data is a right, but sharing data produced through public funding should be a duty. Scientists will need a combination of incentives and enforcement, or ‘carrot and stick’ as it is often said, but also a lot of help from those like me and my collaborators, working in the service area of science. Data curation and the development and harmonization of standards must be recognized as indispensable means to data sharing and therefore properly funded.

By Editor with 10 comments
Dryad: an international repository of data

DRYAD is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences. DRYAD enables scientists to validate published findings, explore new analysis methodologies, repurpose data for research questions unanticipated by the original authors, and perform synthetic studies. DRYAD is governed by a consortium of journals that collaboratively promote data archiving and ensure the sustainability of the repository.

As of Jan 30, 2011, Dryad contains 440 data packages and 1093 data files, published in 57 journals.

DRYAD Home Page

DRYAD Partners list

YouTube Video: How to Deposit Data in DRYAD

By Editor with 1 comment
Beyond the Data Deluge: A Research Agenda for Large-Scale Data Sharing and Reuse

The purpose of this paper is to develop a research agenda for scientific data sharing and reuse that considers these three areas: broader participation in data sharing and reuse, increases in the number and types of intermediaries, and more digital data products.
by Ixchel M. Faniel and Ann Zimmerman

Ixchel M. Faniel and Ann Zimmerman,
School of Information,
University of Michigan

December 2010

Abstract
There is almost universal agreement that scientific data should be shared for use beyond the purposes for which they were initially collected. Access to data enables system-level science, expands the instruments and products of research to new communities, and advances solutions to complex human problems. While demands for data are not new, the vision of open access to data is increasingly ambitious. The aim is to make data accessible and usable to anyone, anytime, anywhere, and for any purpose. Until recently, scholarly investigations related to data sharing and reuse were sparse. They have become more common as technology and instrumentation have advanced, policies that mandate sharing have been implemented, and research has become more interdisciplinary.  Each of these factors has contributed to what is commonly referred to as the “data deluge.” Most discussions about increases in the scale of sharing and reuse have focused on growing amounts of data.  There are other issues related to open access to data that also concern scale that have not been as widely discussed: broader participation in data sharing and reuse, increases in the number and types of intermediaries, and more digital data products. The purpose of this paper is to develop a research agenda for scientific data sharing and reuse that considers these three areas.

pdf of full paper

By Editor with 12 comments