Monday, 1 June 2015

How do we make data count?

Below is a copy of a post I recently wrote for The Sydney Conference: Scholarly Communication Beyond Paywalls blog. I'm really looking forward to this event, which promises an unconference-style, participant-driven format. I'll be co-facilitating Thread 5: Dive in and out of communications (multi dimensional).

Data generated through the course of research is as valuable an asset as research publications. Access to research data enables the validation and verification of published results, allows the data to be reused in different ways, helps to prevent duplication of research effort, enables expansion on prior research and therefore increases the returns from investment. Yet the quality and quantity of a researcher’s publications continue to provide the key measure of their research productivity. Sharing data, it seems, still does not count for nearly enough.

In recent years there has been a proliferation of policies strongly encouraging, and sometimes even requiring, researchers to share their data for the reasons outlined above. These include policies from governments (e.g. USA, Australia), publishers (e.g. PLOS, Nature), and research funders (e.g. NIH, ARC). These policies are certainly opening up more data, but even more research data remains locked away and therefore undiscoverable. So how do we unlock more data? One way is to figure out how to make data count, so that researchers have more incentive to undertake the extra (and in the main, unfunded) work required to share their data.

A 2013 study by Heather Piwowar and Todd Vision looked into the link between open data and citation counts. They found that the citation benefit intensified over time: publications from 2004 and 2005 were cited 30 per cent more often if their data was freely available. They also found that every 100 papers with open data prompted 150 "data reuse papers" within five years, and that original authors tended to use their data for only two years, while others reused it for up to six years. More studies like this one are needed to demonstrate, and track over time, the link between opening up data and making it count, in this case in the form of citations, which (like it or not) are still the primary measure of research impact.

Counting data citations, whether to gather citation metrics or alternative metrics (altmetrics), is challenging in and of itself because data is cited very differently to publications. Data can be cited within an article's text rather than in the references section, which means the article must be open access in order for the citation to be discovered. Sometimes the article that referenced the data is cited rather than the data itself, even where the reference applies only to the data. Reference managers don’t tend to recognise datasets and therefore don’t record the Digital Object Identifier (DOI), which creates difficulties since DOIs make it so much easier to track citations. There are also many self-citations, where researchers cite their own data, and so it is difficult to pick out the articles that have cited another person’s data. And there are likely to be differences between how data is cited in the sciences as compared to the humanities.
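To make two of these challenges concrete, here is a minimal Python sketch of how one might look for dataset DOIs mentioned in an article's body text and then separate self-citations from reuse by others. It is an illustration only, not how any existing citation service works: the function names, the loose DOI pattern, the example DOIs and the author names are all made up for this post, and real-world author matching is far messier than comparing name strings.

```python
import re

# Loose DOI pattern. This is an assumption for illustration; real DOIs
# can contain characters and shapes that this simple pattern misses.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return the unique DOI-like strings in text, in order of first appearance."""
    found = []
    for match in DOI_PATTERN.findall(text):
        doi = match.rstrip(".,;)")  # drop trailing sentence punctuation
        if doi not in found:
            found.append(doi)
    return found


def classify_citations(dois, citing_authors, dataset_authors):
    """Split DOIs into (self_cited, reused_by_others) by naive name matching."""
    citing = {name.lower() for name in citing_authors}
    self_cited, reused = [], []
    for doi in dois:
        creators = {name.lower() for name in dataset_authors.get(doi, [])}
        if citing & creators:
            self_cited.append(doi)
        else:
            reused.append(doi)
    return self_cited, reused


# Hypothetical article text: the data is cited in the body, not the reference list.
body_text = (
    "We reanalysed the survey data (doi:10.1234/example.dataset.1) and "
    "compared it with our earlier measurements, available at "
    "https://doi.org/10.1234/example.dataset.2."
)

# Hypothetical metadata: who created each dataset and who wrote the article.
dataset_authors = {
    "10.1234/example.dataset.1": ["A. Other"],
    "10.1234/example.dataset.2": ["B. Researcher"],
}
citing_authors = ["B. Researcher"]

dois = find_dois(body_text)
self_cited, reused = classify_citations(dois, citing_authors, dataset_authors)
print("Self-citations:", self_cited)
print("Reuse of other people's data:", reused)
```

Note that a search like this only works when the full text of the article is available to scan, which is exactly why in-text data citations in paywalled articles are so hard to count.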

Fortunately, the California Digital Library, PLOS and DataONE have partnered in an NSF-funded project called Make Data Count. The project will “design and develop metrics that track and measure data use, i.e. data-level metrics”. The findings promise to be highly valuable and may also shape future recommendations for the way data should be cited in order for it to be counted.

Sharing impact stories of data reuse is perhaps another way to help make data count. A number of organisations around the world that promote better data management have been collecting data reuse stories (e.g. DataONE, ANDS). Some researchers may see these stories as a negative because they show that “someone else might get the scoop on ‘my’ data”. But these stories can also inspire researchers to spend the extra effort to make their data available when they feel ready to. The rewards may lie not only in the metrics but also in the unexpected ‘buzz’ of seeing ‘your’ data have a longer life and be reused in ways you had not even imagined. Are there other ways that we can help make data count? It’s worth thinking about because "data sharing is good for science, good for you".