Crowdsourcing vocabulary

Crowd-Voc is a vocabulary for describing data that has been produced through crowdsourcing. Crowdsourcing is understood broadly, as any type of collective activity to solve a problem or achieve a goal. Data refers to the contributions of the crowd in their consolidated form, which are released in form of a 'dataset' which can be used in other contexts. The first goal of the vocabulary is to allow crowdsourcing projects to describe what the data is about and how it came about to facilitate reuse and sharing across projects. The vocabulary should also help repeat and reproduce research that involves crowdsourcing by capturing the key design parameters of a crowdsourcing task and the way crowd answers have been validated, aggregated and otherwise processed before their release as a dataset. Finally, the vocabulary sets the ground for acknowledging contributions by the crowd, towards a fairer model for licensing crowdsourced data.

Table 1: Namespaces used in the document
crowd-voc	<http://qrowd-project.eu/def/crowd-voc#>
owl	<http://www.w3.org/2002/07/owl#>
rdfs	<http://www.w3.org/2000/01/rdf-schema#>
xsd	<http://www.w3.org/2001/XMLSchema#>
dcterms	<http://purl.org/dc/terms/#>
org	<http://www.w3.org/ns/org#>
s	<http://schema.org/>
skos	<http://www.w3.org/2004/02/skos/core#>
foaf	<http://xmlns.com/foaf/0.1/>
ex	<http://example.com/>

The core of the vocabulary is shown in the figure below. We reuse as much as possible existing vocabularies. The main entity is Task, a subclass of prov:Activity that describes the crowdsourcing task, ultimately generating a dcat:Dataset. An instance of Task is linked to a Crowd, which is a subclass of prov:Agent. Crowds consists of instances of Crowdworker, a type of prov:Person. A crowdworker may be part of many crowds, assembled to run different tasks. To establish the link between crowds and crowdworkers, we reuse the Organization ontology, making Crowd an instance of org:Organization, allowing the usage of the org:hasMember property. Crowdworkers are identified by ids, depending on the platform where the crowdsourcing task is being run. The aim is not to store personal data about contributors, which is not accessible when running a crowdsourcing task, but to be able to distinguish between multiple contributors - this is needed when validating and aggregating the data submitted by the crowd. The crowd of a task refers to the set of contributors to the task rather than all users registered on the crowdsourcing platform.

A task is comprised by an rdf:list of Questions that models the order on which questions are asked to crowdworkers. Each question has a hasText property, and a rdf:seeAlso that links to a more complex representation of the question, if available.

A task is a collection of different activities that ultimately results in a dataset. It can consist of subtasks of the same or different types. Our aim is to describe the design of the crowdsourcing effort, including standard classes of tasks, which are approached in a particular way in crowdsourcing (e.g., classification, sentiment analysis, transcription, review etc.) and activities the crowd was asked to carry out (e.g. answer questions, check a website, translate from one language to another etc.). The reader is provided with a broad overview of the workflow used, rather than a fully specified data or control flow, as in a workflow language. Each task (and all its subtasks) is linked to an input dataset (e.g. a collection of images) and an output dataset, which the crowd has helped produce. The output dataset is a consolidated version of the answers provided by the crowd, including validation, aggregation etc. A task can contain several questions, which refer to the exact wording used to seek contributions (e.g. "Identify all people, places, and organisations in the following paragraph").

The below figure describes the subdivision of a task in subtasks, to model more complex situations on which different tasks (possibly with different crowds) are applied. The subtask list is modeled as a rdf:list of Tasks. Each subtask takes as input the dataset generated by the previous task.

Vocabulary description

Classes

Task
ContentVerificationTask
ClassificationTask
MetadataFindingTask
MediaTranscriptionTask
RankingTask
ModerationTask
FeedbackTask
SentimentAnalysisTask
ContentCreationTask
Question
QCMechanism
AggregationMethod

Task

^c back to ToC or Class ToC

IRI: TaskIRI

A crowdsourcing task

characteristics (is disjoint, in range of, has subclasses, etc): class or property name ^c

ContentVerificationTask

^c back to ToC or Class ToC

IRI: TaskIRI

A task where the crowdworkers are asked to check predefined properties of an item by carrying out specific checks or consulting external sources

characteristics (is disjoint, in range of, has subclasses, etc): class or property name ^c

ClassificationTask

^c back to ToC or Class ToC

IRI: TaskIRI

A generic classification task on text or other media

characteristics (is disjoint, in range of, hassubclasses, etc): class or property name ^c

MetadataFindingTask

^c back to ToC or Class ToC

IRI: TaskIRI

A task of finding a set of predefined characteristics of an item, for instance the author or year of a book

characteristics (is disjoint, in range of, hassubclasses, etc): class or property name ^c

MediaTranscriptionTask

^c back to ToC or Class ToC

IRI: TaskIRI

A generic task of transcription for different types of media

characteristics (is disjoint, in range of, has subclasses, etc): class or property name ^c

RankingTask

^c back to ToC or Class ToC

IRI: TaskIRI

A task ranking different items

characteristics (is disjoint, in range of, has subclasses, etc): class or property name ^c

ModerationTask

^c back to ToC or Class ToC

IRI: TaskIRI

A task moderating content from other tasks

characteristics (is disjoint, in range of, has subclasses, etc): class or property name ^c

ContentCreationTask

^c back to ToC or Class ToC

IRI: TaskIRI

A task producing new content, for instance free labels or descriptions for items, shortened versions of text, media etc.

characteristics (is disjoint, in range of, has subclasses, etc): class or property name ^c

SentimentAnalysisTask

^c back to ToC or Class ToC

IRI: TaskIRI

A sentiment analysis task on text or other media

characteristics (is disjoint, in range of, has subclasses, etc): class or property name ^c

SubTask

^c back to ToC or Class ToC

IRI: TaskIRI

A SubTask that is part of CrowdSourcing Task, for example, an image labelling task might consists of one subTask about identifying a certain entity in the image, and another about cropping the image to center the

characteristics (is disjoint, in range of, has subclasses, etc): class or property name ^c

Question

^c back to ToC or Class ToC

IRI: QuestionIRI

A question that is shown to the Crowdworker

characteristics (is disjoint, in range of, hassubclasses, etc): class or property name ^c

Quality control mechanism

^c back to ToC or Class ToC

IRI: QualityControlIRI

Quality control mechanism applied to a task: verifiable/control question, pre-screening/test question, honeypot(sample-based filtering), agreement-based/consensus, none

characteristics (is disjoint, in range of, hassubclasses, etc): class or property name ^c

Properties

Reused properties

dcterms:hasPart
prov:used
prov:generated
prov:wasAssociatedWith
vocab-org:hasMember