Andrew Vickers is an associate attending research methodologist at Sloan-Kettering Cancer Center in New York City. Through his published studies, commentary and leadership role in his own research field, he has become known worldwide as one of the foremost advocates of scientific data sharing. In this high-profile position, he has called repeatedly for widespread sharing of raw data and simultaneously has been outspoken in his criticism of researchers who refuse to share their data even though they received grants predicated on the sharing of their raw data.

Dr. Vickers’ research focuses on three main areas – randomized trials, surgical outcome research and molecular marker studies. He is especially interested in the detection and initial treatment of prostate cancer. He is now working on demonstrating that a single measure of prostate specific antigen taken in middle age can be predictive of cancer up to 25 years later. In his methodological research, he looks for new ways of analyzing the clinical value of predictive tools. In addition to his research work, Dr. Vickers has a strong interest in teaching statistics, and is the author of “What is a p value anyway? 34 stories to help you actually understand statistics”.

Have you had specific experiences with data sharing, and if so, when did you begin your data sharing effort?

You can see in my Trials paper published in 2006 (http://www.trialsjournal.com/content/7/1/15) some experiences of data sharing. I have also had numerous cases where I have set up large international collaborations where we share data by common agreement amongst several investigators.

What type of data have you shared; what sort of work load and costs did the data sharing impose on you and your colleagues?

Data from clinical trials and prospective cohort studies. Costs of data sharing were trivial: we already keep well-annotated data sets for our own analyses, so it is just a case of de-identifying them. This might take, say, 30 minutes at most.

Was your data sharing effort successful? Please be as specific as possible about any benefits derived from the data sharing.

Sure. I have several major papers published on the basis of shared data. I am also the chair of the Acupuncture Trialists’ Collaboration (http://www.acupuncturetrialistscollaboration.org). There have been dozens of trials throughout the world looking at whether acupuncture helps. The trials’ results differ. What we are doing is putting all the data together as one data set – with about 18,000 patients – and we are analyzing that data set to see whether it works for patients.

I am chair of the Prostate Biopsy Collaborative Group, looking at the results of 25,000 prostate biopsies. The overall aim of the collaboration is to understand the relationship between PSA levels in blood and prostate cancer risk.

These large collaborative groups are reasonably common. They are difficult to set up, but they are examples of how shared data are extremely valuable. This is going on right now and is working quite well. I am well known in prostate cancer and can go to people and say “let’s pool our data”. That works slowly but surely. What isn’t working so well is when one researcher wants to look at another’s data and doesn’t know the person.

In science generally, there are many high-profile examples where researchers have come together and shared their data. Look at the Early Breast Cancer Trialists’ Collaborative Group study begun in the early 1980s at Oxford. It has collected data from, I believe, a quarter of a million women. It is from these data that guidelines were developed for adjuvant therapy for breast cancer (adjuvant therapy is chemotherapy and hormones after surgery to help stop cancer from coming back). That is routine for women now and it’s based on these large data sets.

What problems/hurdles have you encountered personally or what problems/hurdles have you observed generally in the scientific realm?

Scientists believe that if they control the data, they can control the science: Don’t give anyone our data, they say, it might be misused.

I think scientists have to trust science to look after itself. I work with investigators who ask, “What if someone misuses our data?” I say that if someone takes our data and misuses it, we will write a letter to the editor of the journal where the author’s paper appeared or, for egregious cases, to the chair of the researcher’s department, saying that our data was misused and we were never contacted.

Scientists should not act as gatekeepers for data. Scientists often say they wouldn’t want someone to analyze their data and come to a different conclusion. Let’s say that I do a trial showing that Drug X works. One point of view is that I shouldn’t share my data in case someone uses it to show that Drug X doesn’t work. My view is that I should share the data, and then we can have an open scientific debate about the value of Drug X.

Colleagues have asked me what happens if they give their data to another researcher and the data is misused. I say that if the other researcher gets it wrong, it is their career that suffers.

How did you tackle those hurdles?

Writing papers and articles.

Do you see a need for a national data sharing repository or smaller repositories for specialized arenas?

Data sets are too diverse to be in a single data repository. There should be a variety of different ones.

I don’t see value in a PubMed-type master operation. (PubMed is a free resource operated by the National Center for Biotechnology Information at the National Library of Medicine; it is a repository for journal citations and abstracts in the fields of medicine, nursing, dentistry, veterinary medicine, the health care system and preclinical sciences.) PubMed has papers. Scientific papers are all pretty much in the same form, irrespective of the discipline. Data sets, on the other hand, come in enormously different forms. In a medical data set you might have one patient per row. In other data sets, a person’s data would be on multiple rows. Trying to set up a system that would accommodate the data for all the different types of science is just about impossible. In anthropology, part of the data set might be a digital version of a song sung by a tribe. An archeologist might have many photos as part of the data set. In astronomy you have huge lists of numbers. It is difficult for me to see how one national resource could meet the needs of everyone. To have one rule to work for everyone would be very, very tough.
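As an illustration of the row-layout point above, here is a minimal sketch of the same measurements stored one patient per row and then one measurement per row. The field names and values are invented for the example and do not come from any data set mentioned in this interview.

```python
# Hypothetical example data: one row per patient ("wide" layout).
# Field names and values are invented for illustration only.
wide_rows = [
    {"patient_id": 1, "psa_visit1": 1.2, "psa_visit2": 1.5},
    {"patient_id": 2, "psa_visit1": 0.8, "psa_visit2": 0.9},
]


def wide_to_long(rows, id_field="patient_id"):
    """Convert one-row-per-patient records to one-row-per-measurement records."""
    long_rows = []
    for row in rows:
        for field, value in row.items():
            if field == id_field:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_field: row[id_field], "measure": field, "value": value}
            )
    return long_rows


# The same information, now spread over multiple rows per person.
long_rows = wide_to_long(wide_rows)
for r in long_rows:
    print(r)
```

Both layouts carry identical information, yet a repository that expects one of them cannot read the other without a conversion step like this, and that is before considering songs, photographs or other non-tabular material.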

If the repository were just a dump, then of course you could dump whatever you want. But there would have to be some control over it and that would be really tough. I don’t agree with just dumping stuff. It has to be organized in some way. I think the way it is organized and the rules for what kind of data is stored are going to have to be decided by each discipline separately.

If you advocate the repository system, what agency should run it/them and why? If you advocate a system of smaller repositories organized around specific disciplines, should there be an overview agency that coordinates/supervises interactions among databases? If so, describe the agency’s proposed function.

I don’t see the value of an agency to co-ordinate the small registries.

What are your suggestions on the most powerful ways to combat researchers’ resistance to data sharing? How effective do you find the requirements now in place at the NIH and high-profile journals that make data sharing a condition of funding? Do you believe more teeth should be put into these requirements and, if so, how?

I have published a paper in PLOS (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2739314/?tool=pmcentrez) showing that journal policies just saying “you have to make your data available if asked” have no teeth. The only way to overcome scientists’ reluctance would be for journals to refuse to publish research papers unless they could confirm that the raw data were available somewhere (e.g. a repository).

Are there other incentives you believe would enhance scientists’ acceptance of data sharing?

First, embargoes. For example, scientists should have to deposit data before submitting a paper, but they could opt to embargo the data for, say, a year or two. The data repository should be set up so that someone could check that the data were there (e.g. they could see one complete row of data and one complete column), but not be able to access the whole data set until the embargo was over. This would give scientists the first shot to exploit their data.

Second, guidelines like those I published in Trials. For example, journals should refuse to publish re-analyses of raw data unless an author of the original data is a co-author or has been offered a right of reply. This would make scientists happy because they could debate what they might see as unfair analyses of their data; it would also improve their publication record.
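The embargo idea described above (a visitor can confirm the data exist by seeing one complete row and one complete column, but cannot get the whole data set until the embargo ends) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `view_dataset`, the field names and the dates are hypothetical, and no existing repository is claimed to work this way.

```python
from datetime import date


def view_dataset(rows, embargo_until, today, preview_column):
    """Return the full data after the embargo; before it, only a preview.

    rows is a list of dicts, one per record (a hypothetical layout).
    """
    if today >= embargo_until:
        return {"embargoed": False, "rows": rows}
    return {
        "embargoed": True,
        "n_rows": len(rows),
        # One complete row, so a checker can see every field is present.
        "first_row": dict(rows[0]) if rows else {},
        # One complete column, so a checker can see every record is present.
        "column": [row[preview_column] for row in rows],
    }


# Invented sample data for illustration only.
sample = [
    {"patient_id": 1, "age": 52, "psa": 1.2},
    {"patient_id": 2, "age": 61, "psa": 0.8},
    {"patient_id": 3, "age": 47, "psa": 2.1},
]
embargo_until = date(2026, 1, 1)

before = view_dataset(sample, embargo_until, date(2025, 6, 1), "psa")
after = view_dataset(sample, embargo_until, date(2026, 6, 1), "psa")
print(before)
print(after["embargoed"], len(after["rows"]))
```

The preview is enough for a journal to confirm that a deposit is real and complete in shape, while the depositing scientists keep exclusive use of the full data until the embargo date passes.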