Before her work at Oxford, she worked as a coordinator of international collaborative projects at the European Bioinformatics Institute in Cambridge.
In addition, she is the co-founder of the Minimum Information for Biological and Biomedical Investigations (MIBBI) and the BioSharing initiatives. She received her PhD in Molecular Biology from Imperial College of Science, Technology and Medicine in London.
Have you had specific experiences with data annotation and sharing, and if so, what were they?
When I was a ‘wet bench experimentalist’, data were low-volume and shared mainly by email or on a disk, as text, images or in some machine-specific, proprietary format. With the rise of high-throughput experiments in the genetics, genomics and functional genomics domains, I moved into bioinformatics and gained significant experience in standardization for the purpose of enabling data reporting and sharing. An increasing variety of ‘standard’ minimal information checklists, terminologies and exchange formats are being developed by international grassroots communities, such as the Genomic Standards Consortium (GSC), to enable the unambiguous description of biological, biomedical and environmental studies. If annotated in a standard manner, these studies will be comprehensible and (in principle) reproducible, a principle supported by the rising number of data sharing policies developed by funding agencies and large consortia.
With my team and international collaborators, I contribute to the development of some of these standards, and together we build software that helps researchers adopt these community-defined standards.
What type of data have your collaborators shared; what sort of workload and costs did the data annotation and sharing impose on them?
I collaborate with a variety of communities working in the biological, biomedical and environmental domains. Their studies often run source material through several kinds of assays in parallel, such as genomic sequencing, protein-protein interaction assays, or the measurement of metabolite concentrations and fluxes. However, these studies are often shared only internally, within a consortium or among a set of close collaborators; in general, only a subset of the studies is released into the public domain, mainly upon publication.
When these studies are shared, the main workload is the annotation, or reporting, phase. Data must be accompanied by enough contextual information (i.e., metadata: sample characteristics, technology and measurement types, instrument parameters and sample-to-data relationships) to make the resulting data comprehensible and reusable, and standards should be used to harmonize the description. Accomplishing this, however, takes time and expertise, which the researcher does not necessarily have and, in many cases, is not paid to acquire. Standards are just ‘a means to an end’: we also need to develop easy-to-use tools that educate and empower researchers to perform basic curation tasks, by giving them access to the emerging portfolio of community-defined standards so they can annotate their data in a timely and effective manner.
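To make the idea of checklist-driven annotation concrete, the sketch below validates a metadata record against a hypothetical ‘minimum information’ checklist. The field names are illustrative only, not drawn from any specific published standard:

```python
# Minimal sketch: checking a sample's metadata against a hypothetical
# "minimum information" checklist. Field names are illustrative only.

REQUIRED_FIELDS = {
    "sample_name",          # sample characteristics
    "organism",
    "technology_type",      # e.g. "DNA microarray"
    "measurement_type",     # e.g. "transcription profiling"
    "instrument",           # instrument parameters
    "raw_data_file",        # sample-to-data relationship
}

def missing_fields(record: dict) -> set:
    """Return the checklist fields absent from a metadata record."""
    return REQUIRED_FIELDS - set(record)

record = {
    "sample_name": "liver_rep1",
    "organism": "Mus musculus",
    "technology_type": "DNA microarray",
    "raw_data_file": "liver_rep1.CEL",
}

# Fields still needed before the study is comprehensible to others
print(sorted(missing_fields(record)))
```

A curation tool built on such a checklist can flag incomplete records before submission, which is exactly the kind of basic task that researchers need support with.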
What problems/hurdles have you encountered personally in data annotation and sharing or what problems/hurdles have you observed generally in the scientific realm?
In addition to ethical and security issues and the concern that others may exploit the data, the barriers to sharing remain significant for three more reasons. First, there is an increasing variety of standards, and the evolving landscape is still quite unstable. Second, there is a lack of easy-to-use tools that enable researchers to access the emerging portfolio of standards. Lastly, there is the difficulty of utilizing shared data, which in turn can only further discourage the will to share: shared data is of little value if it is not sufficiently well annotated in a standard manner.
How did you tackle those hurdles?
With my team and collaborators, we work to tackle both the standards-related and the tools-related hurdles in parallel.
Dr. Dawn Field and I founded BioSharing (http://biosharing.org) to expedite communication and the production of an integrated, standards-based framework for the capture and sharing of high-throughput genomic and functional genomic bioscience data in particular. This project stems from i) the initial work published in Science in collaboration with a range of representatives from US, UK and European funding agencies (Field, Sansone et al. 2009) and ii) the MIBBI project (Taylor, Field, Sansone, 2008), which we established with Chris Taylor in 2006. BioSharing works at the global level to build stable linkages, in particular between journals and funders implementing data sharing policies and well-constituted standardization efforts in the biosciences domain. This objective is achieved via the creation of web-based catalogues of policies and standards (minimal information checklists, terminologies and exchange formats) and a communication forum. In this first phase we are working on prototypes of the catalogues, which will be enriched and enhanced iteratively. As these become increasingly stable, we will move into the next phase: promoting and coordinating interactions among what might otherwise become an increasing variety of non-interoperable standards. The BioSharing catalogues aim at:
- Providing a “one-stop shop” for those seeking data sharing policy documents and information about the standards and technologies that support them;
- Exposing core information on well-constituted, community-driven standardization efforts and linking to their reporting standards;
- Linking to existing complementary portals, such as MIBBI (http://mibbi.org) and BioPortal, as well as open access resources, such as BMC Research Notes and Nature Precedings, that host documents or publications on standards, standards-compliant systems and research data.
With my team, I also work on the ISA software suite (http://isa-tools.org; Rocca-Serra et al., 2010), an open source effort, in collaboration with many international groups, that works to help researchers annotate and share their data. The tools are targeted at curators and experimentalists and:
- assist in the reporting and local management of experimental metadata (i.e. sample characteristics, technology and measurement types, sample-to-data relationships) from studies employing one or a combination of technologies;
- empower users to adopt community-defined minimum information checklists and terminologies, where required;
- format studies for submission to a growing number of international public repositories.
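The experimental metadata handled by tools of this kind is typically tabular, with one row per sample. As a rough illustration, the fragment below writes and reads back a simplified, hypothetical study table in the spirit of a tab-delimited format, using only the standard library; the column names are illustrative and this is not the actual ISA tools API:

```python
import csv
import io

# Hypothetical, simplified sketch of a tab-delimited study table: one row
# per sample, columns for sample characteristics and the raw data file each
# sample maps to. Column names are illustrative only.

rows = [
    {"Sample Name": "patient1", "Characteristics[organism]": "Homo sapiens",
     "Raw Data File": "patient1.fastq"},
    {"Sample Name": "patient2", "Characteristics[organism]": "Homo sapiens",
     "Raw Data File": "patient2.fastq"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Reading the table back preserves the sample-to-data relationships.
parsed = list(csv.DictReader(io.StringIO(buf.getvalue()), delimiter="\t"))
print(parsed[0]["Raw Data File"])
```

Because the table carries the sample-to-data relationships explicitly, a downstream repository can reconstruct which raw files belong to which sample without contacting the submitter.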
Do you see a need for a national data sharing repository or smaller repositories for specialized arenas?
In addition to the main institutes, such as NCBI (http://www.ncbi.nlm.nih.gov/), there are many groups that have strong expertise in a specific area of science and are also skilled at developing specialized systems. Our collaborators, for example, have successfully deployed the ISA software components to enable data reporting and sharing for stem cell data. The Harvard Stem Cell Discovery Engine (SCDE, http://discovery.hsci.harvard.edu) brings together stem cell-based experimental systems and high-throughput data from the Harvard Stem Cell Institute and other research communities, including data from public repositories, in a common ‘standardized’ manner. Their re-annotation and harmonization work, using the community-defined standards served via the ISA tools, is of pivotal importance to researchers working with stem cells in particular, but also to the scientific community at large working on the meta-analysis of related datasets.
Do you see value in a centralized repository of data?
The whole argument of centralized vs. federated databases has been discussed at length; I believe a central system cannot cater for everybody’s needs, and there is expertise in the community that should also be leveraged, so often the best solution is a mixed approach. Obtaining rolling funds to maintain each database is, of course, the main issue; the other is the adoption of widely accepted common standards. If the latter issue were solved, it would be easy to move information from one system to another.
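The point about common standards can be made concrete: once two repositories agree on the same field vocabulary, moving a record between them reduces to a mechanical format conversion. The sketch below converts a tab-delimited record into JSON; the field names are illustrative, not taken from any specific standard:

```python
import json

# Sketch: with an agreed field vocabulary, exchanging a record between a
# tab-delimited system and a JSON-based one is purely mechanical.
# Field names below are illustrative only.

tab_record = "sample_name\torganism\tassay\nliver_rep1\tMus musculus\tRNA-seq"

header, values = (line.split("\t") for line in tab_record.splitlines())
as_json = json.dumps(dict(zip(header, values)), indent=2)
print(as_json)
```

The hard part, as discussed above, is not the conversion itself but getting communities to converge on the shared vocabulary in the first place.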
The agency’s proposed function in this specific case could be two-fold: to support, and progressively enforce, the use of these community-defined standards in data management plans and in grant applications, and to ensure that applicants evaluate the reuse of open source tools prior to developing a new system. However, only a few agencies actively monitor adherence to the proposed plans, and even in these cases the execution of such plans is rarely scored. Unfortunately, it is often a pre-requisite of a grant proposal to develop something new, and it is often easier for a developer to create something de novo in order to have full control over what can be done. The result is today’s problem: an unnecessary duplication of efforts in many cases. We have to deal with a variety of (arbitrarily) different and incompatible standards, even in the same domain, which limits the development of interoperable tools to enable data sharing.
From a technical perspective it will be necessary both to remove redundancies and to fill gaps between standards. These are difficult but not insurmountable tasks. By contrast, the sociological barriers involved in these kinds of large-scale collaborations can be far more challenging, and extensive liaison between communities is necessary. Managing this process of consensus-building from start to finish takes time, resources and expertise, yet the time invested in building commonalities and synergies among projects is often very little, owing to a lack of resources. The massively collaborative nature of this undertaking requires frequent face-to-face workshops to create the conditions necessary for building consensus.
Utilizing publicly funded data is a right, but sharing data produced through public funding should be a duty. Scientists will need a combination of incentives and enforcement, ‘carrot and stick’ as it is often said, but also a lot of help from those, like me and my collaborators, working in the service area of science. Data curation and the development and harmonization of standards must be recognized as indispensable means to data sharing and therefore properly funded.