Wilbanks is the executive director of the Science Commons project at Creative Commons. Before taking on this role, he founded a bioinformatics company that developed semantic graph networks that could be used in the research and development of pharmacological products. Earlier, Wilbanks worked at the Berkman Center for Internet and Society at Harvard Law School, where he was that institution’s first assistant director. He was also previously a legislative aide to U.S. Rep. Pete Stark, D-Calif. Currently, Wilbanks serves on several advisory boards, including the National Library of Medicine’s PubMed Central, and on the board of directors of the Fedora Commons digital repository organization.

What drew you to the Creative Commons/Science Commons endeavor in the first place?

I knew the founders of Creative Commons. Also, I had started a bioinformatics company and ran it for a while. It showed me the importance of having some foundational infrastructure. I had seen the utility of open data from the entrepreneurial perspective. Then there was the Human Genome Project. You could build on the experience of the project. But in science, we keep being required to pay for access to articles and to publicly funded databases. When Creative Commons decided to get Science Commons off the ground, it was a no-brainer for me to become involved.

If you were going to summarize the mission of Science Commons in a few sentences, what would you say?

In lay terms, we are trying to make the web work for scientists in the way it works for commerce. But the web hasn’t yet had that sort of transformative effect on science. We work with publishers of science, we work with database providers, universities, libraries, pharmaceutical companies, individual scientists. These users use the tools we create, but the tools themselves are free. We are a non-profit so they aren’t really our customers. They are our users. We provide both technical and legal standards that aid in the movement of information on the web. Probably our biggest users are the funders of disease research. One of our biggest users is the foundation that invests in Huntington’s disease research. There are quite a few neurodegenerative diseases and all have different foundations invested in brain research. If you take them together, it means hundreds of millions in research dollars, and if you can help make what they do interoperable, that increases the possibility of scientific breakthroughs.

Do you see any increase in the acceptance of data sharing within the scientific community?

There is some movement in that direction. But simply putting raw data online without curating it isn’t that useful. If you put observations of the night sky online but don’t tell what portion of the sky you’re looking at, it’s not that useful. You need to make the data understandable. What has hampered movement is that people post data and no one does anything with it. One of the reasons we want open stuff is that we look at the efforts of something like Wikipedia, where the efforts of thousands come together, and it’s more than the sum of its parts. It’s hard to do that with scientific research. If parameters, standards and infrastructure aren’t published with the data, it becomes difficult to understand the data.

You need infrastructure to support the data – server space and network availability. At a higher level, you need database software that serves the data when you query it. You need systems that allow you to visualize the data. You need statistical packages that you can run across the data. Most data isn’t automatically readable by humans. If you begin to share data on a broad scale, you need infrastructure to integrate it.

The Human Genome has a standardized base. It has an organization with government funding to store, preserve and archive the data. It has an enormous amount of software that lets you search for similarities among genes, and that makes it a very open piece of data.

One of the reasons there aren’t generic demands for data now is because currently there aren’t places to put the data and know it will be useful. Where you have infrastructure, the data becomes very powerful.

There was a brutal process – a standardization effort that I was part of in the late 1990s and early 2000s – on how to capture all the information. The result is called MIAME – Minimum Information About a Microarray Experiment. It allows you to have some kind of data that comes off these machines. It’s useful to be shared with people because it can be understood. It was the standards plus infrastructure that made it worthwhile to mandate data sharing.

How would you rank the acceptance of raw data sharing within the scientific community?

There is a gradient of desire for data sharing. It is hard to make a blanket statement because it is discipline specific. Within each discipline it depends on the kind of data.

In places where there is a desire, the degree of its importance tends to depend on standards and infrastructure. In chemistry there is very little data sharing because if data is published, that precludes the ability to get a patent on a chemical: If you release the structure of the chemical first, you can’t get a patent. In astronomy and physics, it already works. In biology they share the results of the Genome Project but don’t share the results of what they found in the lab last week. The biologists want to write a paper on it first because that’s how they get tenure. The places where you see data sharing working are the ones that have in place the infrastructure, the standards and the incentives.

Do you think a national repository would be a good idea?

The more you dig into national policies, the more you find that it is very difficult to integrate clinical data because every nation has different requirements. A lot of scientists from other countries hate the U.S. policy: The Patriot Act allows reverse identification of people. The re-identification people are really good, and they demonstrate that you can re-identify someone you have de-identified. So take Canada and the United Kingdom. As a government policy, they say they won’t put their citizens’ data in a place where it is subject to another government’s disclosure.

You won’t get an international data repository anytime soon because of the complexity of privacy laws. The Gates Foundation has done most of the work on this. They have worked at gathering tissue samples related to malaria. It has taken them six to ten years just to negotiate rights to maintain a bank of samples of malaria. That’s because nations’ privacy laws cut across data sharing, especially regarding experiments about people.

When it comes to foundations, they can create their own databases because they aren’t trying to create a public database. Companies can share inside themselves. Another way it works is in malaria and TB: You can have institutional review boards build a network among themselves. The IRB is the body at each university that reviews research on humans. They write standards that govern the research. When you do multi-national research, all institutions involved have to sign off on an IRB agreement. So if you want to do international data sharing, you need an IRB for the work.

What do you think about PubMed being a repository for raw medical data?

There is a massive privacy problem. You own your data, not the government. The law is so screwy that it is difficult to give consent to have your medical records released. So it is very hard to share medical information about people. But it is relatively easy to share it in the context of a really difficult disease. Healthy people, on the other hand, are more comfortable with keeping their medical data unless they get in a crisis situation. With a rare disease, you are more likely to get the patient’s data because these are people who want to change the system because the system doesn’t work for them. (The pharmaceutical companies aren’t going to invest heavily in finding a cure for a disease that few people have.)

I have heard people say that it will take a legislative, judicial or executive action to change the system’s privacy laws. PubMed storage is not going to happen until there is a change in privacy law.

In the meantime, there will be pockets of data sharing where people think it is worth giving consent.

ADNI is a great example. It has standardized hardware images and it provides infrastructure. They thought of standards, infrastructure and the legal part of it, but scaling that out to the public – and healthy people where consent is more difficult – is a lot harder.

If you have a rare disease, there’s no return on a pharmaceutical company looking for a cure. If you have a rare disease, the only thing you care about is a cure, and anything that accelerates that process is on the table. That is why so many people with rare diseases in their families start their own foundations. They are going to look at collaboration to move faster. If there are 10 foundations researching a rare disease, the odds of one finding some breakthrough increase if they combine efforts.

Should standards requiring data sharing on government-funded projects be tougher?

There is an entire open access movement for access to research. First the NIH policy covering NIH-funded articles was voluntary, and there was only 4 per cent compliance (in making the raw data available). When NIH turned it into a mandate, compliance shot up to 60 per cent. This is for any research articles published with NIH money: They now must be deposited in PubMed Central in full text within 12 months of publication. NIH moved from a request to a mandate with a place for the research to go; now researchers’ cooperation with this requirement is actually used to evaluate their ability to secure future grants. That creates all the incentives needed to comply.

What we learned is you have to have a mandate. You have to have follow-up and a measurement of success and use it to evaluate the next grant. Nothing less than that works.

Are there other incentives?

Along with creating incentives through funding, another place is the tenure process. University communities are beginning to measure whether faculty copied their research into a university’s digital library. Harvard, for example, mandates that if you are writing an article, a copy of it has to go into the university’s digital repository, and if you don’t do it over time, it gets taken into account when you are evaluated for tenure.