
ZenCrowd - Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking


Abstract

We tackle the problem of entity linking for large collections of online pages. Our system, ZenCrowd, identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the Linked Open Data cloud. We show how one can take advantage of human intelligence to improve the quality of the links by dynamically generating micro-tasks on an online crowdsourcing platform. We develop a probabilistic framework to make sensible decisions about candidate links and to identify unreliable human workers. We evaluate ZenCrowd in a real deployment and show how a combination of probabilistic reasoning and crowdsourcing techniques can significantly improve the quality of the links while limiting the amount of work performed by the crowd.

Contact: gianluca [dot] demartini [at] unifr [dot] ch

Paper [PDF, 2MB]:
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux: ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. 21st International World Wide Web Conference (WWW2012), Lyon (France), April 16-20, 2012 [acceptance rate 12%].

Datasets

On this page we make available the data we collected via Amazon MTurk, which can serve as human intelligence input to entity linking approaches.

Specifically, we provide the links to the original news articles from which entities were extracted, and the outcome of the Human Intelligence Tasks when run with US workers and when run with Indian workers. We also provide our ground-truth linking information, which is in the MTurk output format as well.

US workers MTurk files

Indian workers MTurk files

Ground Truth MTurk files

News Article Links
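
As a rough illustration of how the worker files and the ground-truth file might be used together, the sketch below aggregates worker answers per task by simple majority vote and scores each worker against the ground truth. This is a plain baseline, not the probabilistic framework described in the paper. The column names (HITId, WorkerId, Answer.link) and the sample rows are assumptions made for this example; the actual files may be laid out differently.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample data: the column names and values are assumptions,
# not the actual layout of the files linked above.
WORKER_CSV = """HITId,WorkerId,Answer.link
h1,w1,dbpedia:Paris
h1,w2,dbpedia:Paris
h1,w3,dbpedia:Paris_Hilton
h2,w1,dbpedia:Lyon
h2,w2,dbpedia:Lyon
h2,w3,dbpedia:Lyon_County
"""

GROUND_TRUTH_CSV = """HITId,Answer.link
h1,dbpedia:Paris
h2,dbpedia:Lyon
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def majority_vote(rows):
    """Return the most frequent answer for each task (HITId)."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row["HITId"]][row["Answer.link"]] += 1
    return {hit: counts.most_common(1)[0][0] for hit, counts in votes.items()}


def worker_accuracy(rows, truth):
    """Return the fraction of each worker's answers that match the ground truth."""
    correct = Counter()
    total = Counter()
    for row in rows:
        hit = row["HITId"]
        if hit not in truth:
            continue  # skip tasks without a ground-truth label
        total[row["WorkerId"]] += 1
        if row["Answer.link"] == truth[hit]:
            correct[row["WorkerId"]] += 1
    return {worker: correct[worker] / total[worker] for worker in total}


if __name__ == "__main__":
    rows = read_rows(WORKER_CSV)
    truth = {r["HITId"]: r["Answer.link"] for r in read_rows(GROUND_TRUTH_CSV)}
    print(majority_vote(rows))
    print(worker_accuracy(rows, truth))
```

Per-worker accuracy of this kind is one simple way to flag potentially unreliable workers; a reliability-weighted vote would be a natural next step beyond the unweighted majority used here.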