A digital tracker dog for datasets

“Why do researchers have to personally dig through all these abstracts?” wondered statistician Rens van de Schoot. Image: still from a UU YouTube video.

He can see it, crystal clear: a Tinder-like app in which, as you swipe left or right a few times, you teach an algorithm what you’re looking for in large sets of data. The computer can then do the heavy lifting.

It might just prove to be the result of a project led by the recently appointed professor of Statistics, Rens van de Schoot, and UU programmers from Information & Technology Services (ITS). Together, they developed software that manages to pluck relevant scientific articles out of libraries’ large databases.

The UU Library and Digital Humanities researchers have since joined the team. “We’re still in the testing stage, checking whether the algorithm works as well as we think it does,” Van de Schoot says.

Faster assessment of abstracts
The software could save scientists a lot of time-consuming work. Currently, if people wish to know what’s been published about any given topic, they generally have to sift through thousands of abstracts or summaries. Far from all of the articles that show up in such a search are actually relevant to the research question. Last year, 283 searches like these were conducted in Utrecht. Van de Schoot: “Imagine you’re able to assess about forty of these abstracts an hour, and you have to go through ten thousand of them. That’ll take you a while.”
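
How long is “a while”? The figures in the quote can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and the eight-hour working day is an assumption that does not come from Van de Schoot.

```python
def screening_hours(n_abstracts: int, per_hour: float) -> float:
    """Hours needed to read n_abstracts at a steady pace of per_hour."""
    return n_abstracts / per_hour


# Figures from the quote: ten thousand abstracts, about forty an hour.
hours = screening_hours(10_000, 40)

# Assumption (not from the article): a working day of eight hours.
workdays = hours / 8

print(f"{hours:.0f} hours, or roughly {workdays:.0f} working days")
```

At that pace, a single search would keep one researcher busy for 250 hours.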

Van de Schoot believes that he and the university’s programmers are also contributing to a higher-quality scientific process: “Every PhD candidate should really do a systematic review of existing literature at the beginning of his or her PhD track. That’s the only way to find out what’s already been published about your subject. Hopefully, it’ll take only a day or so to do this in the future.”

Recipe for success
The concept isn’t entirely new. However, the UU software scores much better than two other systems currently in use, Van de Schoot says. In test cases, the prototype correctly identified 80 percent of the articles in a database of 10,000 as irrelevant, and missed only five relevant articles. “That’s probably no worse than if a human being had done the same work – after all, people make mistakes too.”
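
What those test figures mean in practice can be sketched in Python. The article does not say how many of the 10,000 articles were actually relevant, so the totals in the loop below are purely hypothetical and serve only to show how the weight of “five missed” depends on that number. The sketch also assumes that “80 percent” refers to the whole database.

```python
TOTAL_ARTICLES = 10_000   # from the article
EXCLUDED_SHARE = 0.80     # share identified as irrelevant (from the article)
MISSED_RELEVANT = 5       # relevant articles the prototype missed (from the article)

# Articles the researcher no longer has to read, and those still left.
excluded = round(TOTAL_ARTICLES * EXCLUDED_SHARE)
left_to_read = TOTAL_ARTICLES - excluded


def recall(total_relevant: int, missed: int) -> float:
    """Share of the relevant articles that were found."""
    return (total_relevant - missed) / total_relevant


print(f"Left to read: {left_to_read} of {TOTAL_ARTICLES}")

# Hypothetical totals: the real number of relevant articles is not reported.
for total_relevant in (50, 100, 200):
    share = recall(total_relevant, MISSED_RELEVANT)
    print(f"{total_relevant} relevant articles: {share:.1%} found")
```

The more relevant articles a database contains, the less five misses weigh, which is why the raw number alone says little about how the software compares with a human reader.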

The recipe for the model’s success, Van de Schoot says, is that it is ‘self-learning’. It shows researchers five articles at a time and repeats this a few times; each time, the scientist indicates whether or not the articles are relevant. The success rate keeps rising. “The goal is now to perfect that process. It has to be as easy as possible and with as little effort from scientists as possible.”
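
The loop described here (show a handful of articles, ask for a verdict, learn, repeat) can be illustrated with a toy example in Python. This is not the UU software: the article does not say which model it uses, so the simple word-count scoring below is a stand-in, and the function names, the sixteen made-up abstracts and the simulated researcher are all assumptions for illustration.

```python
import math
from collections import Counter


def tokens(text):
    """Split an abstract into lowercase words."""
    return text.lower().split()


def score(text, rel_counts, irr_counts):
    """Log-odds with add-one smoothing; higher means more likely relevant."""
    n_rel = sum(rel_counts.values())
    n_irr = sum(irr_counts.values())
    vocab = len(set(rel_counts) | set(irr_counts)) + 1
    total = 0.0
    for word in tokens(text):
        p_rel = (rel_counts[word] + 1) / (n_rel + vocab)
        p_irr = (irr_counts[word] + 1) / (n_irr + vocab)
        total += math.log(p_rel / p_irr)
    return total


def screen(abstracts, ask_researcher, batch_size=5, rounds=2):
    """Show the most promising abstracts in batches and learn from each verdict."""
    rel_counts, irr_counts = Counter(), Counter()
    labels = {}  # index of abstract -> True (relevant) or False
    for _ in range(rounds):
        unlabeled = [i for i in range(len(abstracts)) if i not in labels]
        if not unlabeled:
            break
        # Rank what has not been read yet by the current model's score.
        unlabeled.sort(
            key=lambda i: score(abstracts[i], rel_counts, irr_counts),
            reverse=True,
        )
        for i in unlabeled[:batch_size]:
            labels[i] = ask_researcher(abstracts[i])
            target = rel_counts if labels[i] else irr_counts
            target.update(tokens(abstracts[i]))
    return labels


# Toy data: 12 irrelevant abstracts followed by 4 relevant ones.
abstracts = [f"soil bacteria growth study {i}" for i in range(12)]
abstracts += [f"depression therapy trial study {i}" for i in range(4)]

# Stand-in for the researcher swiping left or right.
labels = screen(abstracts, ask_researcher=lambda text: "therapy" in text)

found = sum(labels.values())
print(f"Read {len(labels)} of {len(abstracts)} abstracts, found {found} relevant")
```

In this toy run the first batch contains only irrelevant abstracts, yet that is enough for the second batch to surface all four relevant ones: the simulated researcher has read ten of the sixteen abstracts instead of all of them.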

The software is already available on the open-source platform GitHub, but users do have to be familiar with the programming language Python. Still, Van de Schoot has already received a lot of positive feedback from colleagues. “Everyone who’s ever done a review study like this has said: if only this had existed back then.”

Team effort
Van de Schoot emphasises that it’s a team effort, in which he – as a researcher from the focus area Applied Data Sciences – collaborated with the ITS department. The testing stage, as mentioned, now also includes employees from the University Library and Humanities. “I could say what I’d like to do based on the content, but I need the knowledge of other people at the university to realise it. This project has shown me just how valuable this collaboration is.”

Along with employees of Utrecht Holdings, who try to find a market for scientific knowledge, Van de Schoot is looking at ways to give the new open source software a nice interface that could make his programme more user-friendly to a broader audience. Perhaps it will actually become a Tinder-style app. “That would be awesome,” Van de Schoot says with glee.

For this project, Rens van de Schoot received a 25,000-euro grant from the university’s innovation fund for IT applications in research projects. Utrecht University wants to use the fund to support small-scale, high-risk projects that aim to provide IT innovations that strengthen research. So far, 14 projects have received financial support, including two projects by sociology professor Arnout van de Rijt.

In one of these projects, Van de Rijt is working with economist Dirk Gerritsen. They’re studying gambling markets, specifically whether earlier estimates of the chances of winning still influence later estimates. They built a system that places early bets on a candidate in internet gambling matches. One of the questions is whether large-scale gambling markets, for instance in cases of presidential elections or referendums, may be wrong as a result of social influence. This could partially explain why the Brexit and Trump results came as a surprise.

Together with engineer Erik Jan van Leeuwen, Van de Rijt is studying the circumstances in which fake news can easily spread throughout a population. To do this, they’re setting up a large-scale experiment that simulates real social networks. These projects also involve employees from the ITS department: programme manager Menno Rasch and IT engineer Martin Schukman.