Abstract

Wiktionary is a rich source of linguistic knowledge and an example of a successful application of the crowdsourcing model. Knowledge in Wiktionary is only weakly structured, so in order to make it usable, it must be represented in a structured form that can be searched and processed automatically. Semantic Web formats are especially suitable for this task because of the established standards for interlinking different Semantic Web knowledge bases. Basic Wiktionary extraction has already been carried out as part of the DBpedia project. We present the extraction of detailed grammatical data obtained by merging unstructured content spread across different pages of the MediaWiki XML dump file. As an example, we process French verb conjugations, currently one of the few cases of sufficient complexity found on Wiktionary. The main problem we solve is the analysis and parsing of a subset of the MediaWiki template system and its control structures. Based on that, we generate RDF triples that completely cover all domain data currently included in Wiktionary.
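To make the intended output concrete, the following is a minimal sketch, not the paper's actual extraction code, of how a single conjugated form parsed from a template expansion could be emitted as RDF triples. It uses Python with rdflib; the namespace and the property names (hasForm, mood, tense, person, writtenRep) are hypothetical placeholders rather than the vocabulary defined in the paper.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Hypothetical namespace for illustration only.
WIKT = Namespace("http://example.org/wiktionary/")

g = Graph()
g.bind("wikt", WIKT)

verb = WIKT["parler"]
form = WIKT["parler_ind_present_1sg"]

# One conjugated form extracted from a conjugation template expansion.
g.add((verb, RDF.type, WIKT.Verb))
g.add((verb, WIKT.hasForm, form))
g.add((form, WIKT.mood, Literal("indicative")))
g.add((form, WIKT.tense, Literal("present")))
g.add((form, WIKT.person, Literal("1sg")))
g.add((form, WIKT.writtenRep, Literal("parle", lang="fr")))

print(g.serialize(format="turtle"))
```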

Keywords: Crowdsourcing, Semantic Web, Wiktionary
Published on website: 4.2.2014