Table of Contents

  1. Wikidata introduction

    1.1 Why was Wikidata created?

    1.2 What is Wikidata?

    1.3 Use of Wikidata by Google

    1.4 The Wikidata Community

    1.5 The Wikidata project Covid-19

    1.6 The Wikidata architecture, an introduction to the “semantic web”

    1.7 The Wikidata data model

    1.8 SPARQL

  2. An overview of R libraries to query Wikidata

Wikidata introduction

Why was Wikidata created?

  • Connect all Wikipedia pages about one concept, written in any number of languages, under one unique identifier (example: Pneumonia)
  • Make multilingual updates of structured information on Wikipedia pages much easier

What is Wikidata?

  • It is a giant graph of knowledge
  • It is completely free, even for commercial usage (CC0)
  • Anybody can contribute
  • It can be read and edited by both humans and machines
  • Covers all domains of knowledge
  • Extensive item history, talk pages, projects, users
  • Integration with the semantic web
  • High performance query engine (SPARQL)
  • Stable: long-term support is not dictated by funding cycles
  • Actively developed
  • Already has a large number of active users, editors, and contributors
  • Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
  • Wikidata in brief

Use of Wikidata by Google

Google to Wikipedia

Wikidata to Google

The Wikidata community

The Wikidata architecture, an introduction to the “semantic web”

  • Wikidata has a very advanced architecture. It implements the so-called “linked open data (LOD)” architecture.
  • LOD is the implementation of a formal knowledge graph, “formal” meaning here that it can be processed by a computer.
  • Only once you understand the underlying architecture and concepts can you use the full potential of Wikidata.

The Wikidata data model

The triples

  • The two basic component pieces of Wikidata are items and properties.
  • An item is a thing - a concept, object or topic that exists in the real world, such as “Rush”.
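  • These items each have statements associated with them - for example, “Rush is an instance of: Rock Band”. In that statement, “instance of” (P31) is the property, the relation being asserted, and “Rock Band” is the value, which is itself an item. Together, item, property and value form one triple.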
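
To make the item/property/value structure concrete, here is a minimal sketch in R that stores the example statement as one row of a data frame. The helper `make_triple` and the column names are hypothetical, chosen for illustration only; they are not part of Wikidata or of any R package.

```r
# Hypothetical helper: one statement = one row (subject, property, value).
make_triple <- function(subject, property, value) {
  data.frame(subject = subject, property = property, value = value,
             stringsAsFactors = FALSE)
}

# The statement from the text: "Rush is an instance of: Rock Band".
rush <- make_triple("Rush", "instance of (P31)", "rock band")

# Read the row back as a sentence-like triple.
triple_text <- paste(rush$subject, rush$property, rush$value, sep = " | ")
print(triple_text)
```

Several statements about the same item would simply be several rows sharing the same `subject`, which is how a set of triples forms a graph.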

An overview of R libraries to query Wikidata

(taken from a blog post by Envel Le Hir)

This code tutorial is taken from the code of OpenVirus. See the discussion here. Thomas Shafee is the author of the code used.

Let us use some naming conventions (suffixes) for Wikidata-specific objects:

  • .qr = Query result(s)
  • .qid = Wikidata QID number(s)
  • .qs = Wikidata item(s) summary
  • .q = Wikidata item(s) in full
  • .p = Properties of a Wikidata item(s)
  • .wh = Wiki page in html
  • .wx = Wiki page in xml

Let us define some helper functions to test which kind of Wikidata value a character string represents:

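# TRUE if x is an item identifier such as "Q42" (Q or q followed by digits)
is.qid  <- function(x){grepl("^[Qq][0-9]+$",x)}
# TRUE if x is a property identifier such as "P31" (P or p followed by digits)
is.pid  <- function(x){grepl("^[Pp][0-9]+$",x)}
# TRUE if x contains a timestamp of the form YYYY-MM-DDThh:mm:ss
is.date <- function(x){grepl("[0-9]{1,4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}",x)}
# TRUE if x is a string wrapped in double quotes
is.quot <- function(x){grepl("^\".+\"$",x)}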
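
As a quick check of how these tests behave, the sketch below repeats the four helpers so that it runs on its own and applies them to a few sample strings. The wrapper `classify_value` and the sample values are assumptions made for illustration; they are not part of the original code.

```r
# The four helpers from above, repeated so this block is self-contained.
is.qid  <- function(x){grepl("^[Qq][0-9]+$",x)}
is.pid  <- function(x){grepl("^[Pp][0-9]+$",x)}
is.date <- function(x){grepl("[0-9]{1,4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}",x)}
is.quot <- function(x){grepl("^\".+\"$",x)}

# Hypothetical wrapper: name the first test that a single string passes.
classify_value <- function(x) {
  if (is.qid(x)) "qid"
  else if (is.pid(x)) "pid"
  else if (is.date(x)) "date"
  else if (is.quot(x)) "quoted string"
  else "other"
}

examples <- c("Q84263196", "P509", "+2020-03-11T00:00:00Z",
              "\"Pneumonia\"", "Pneumonia")
kinds <- vapply(examples, classify_value, character(1), USE.NAMES = FALSE)
print(data.frame(value = examples, kind = kinds))
```

Note that `is.date` is not anchored, so it also accepts a timestamp with a leading sign or a trailing `Z`, as in the third sample.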

WikidataR

  • The package WikidataR is an API Client Library for Wikidata
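  • Sources can be found on GitHub
  • Authors: Oliver Keyes, Serena Signorelli & Christian Graul
  • Last commit: December 2017 :-(

The two helpers below turn a search term into an identifier by keeping the first search hit; input that already consists only of identifiers is returned unchanged:

# Return x if it is already a QID, otherwise the id of the first item found for x
as_qid <- function(x){if(!all(is.qid(x))){WikidataR::find_item(x)[[1]]$id}else{x}}
# Return x if it is already a PID, otherwise the id of the first property found for x
as_pid <- function(x){if(!all(is.pid(x))){WikidataR::find_property(x)[[1]]$id}else{x}}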

Writing to Wikidata with R

Data frame/tibble to QuickStatements

See this GitHub issue.

Ideas of queries

List of all persons in Wikidata who died of Covid-19

P509 (cause_of_death) Q84263196 (Covid-19)
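
One way this idea could be written as a SPARQL query is sketched below. The function `build_cause_of_death_query` is hypothetical, and restricting the results to humans (P31 = Q5) is an added assumption to match “persons”. The block only builds the query text; the commented-out line shows one possible way to run it, assuming the WikidataQueryServiceR package is installed and the network is reachable.

```r
# Hypothetical helper: SPARQL text for "humans whose cause of death is <cause_qid>".
build_cause_of_death_query <- function(cause_qid, limit = 100) {
  stopifnot(grepl("^Q[0-9]+$", cause_qid))
  sprintf(
    paste(
      "SELECT ?person ?personLabel WHERE {",
      "  ?person wdt:P31 wd:Q5 ;",          # instance of: human (assumption)
      "          wdt:P509 wd:%s .",         # cause of death: the given item
      "  SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\". }",
      "}",
      "LIMIT %d",
      sep = "\n"
    ),
    cause_qid, as.integer(limit)
  )
}

covid_deaths_query <- build_cause_of_death_query("Q84263196")
cat(covid_deaths_query, "\n")

# Possible execution step (not run here; needs network access):
# covid_deaths.qr <- WikidataQueryServiceR::query_wikidata(covid_deaths_query)
```

The `LIMIT` keeps a first test run small; raise it or drop it once the query returns what you expect.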

WikipediR: A MediaWiki API client library

Many websites run on versions of MediaWiki, most prominently Wikipedia and its sister sites. WikipediR is an API client library that allows you to conveniently make requests for content and associated metadata against MediaWiki instances.

Retrieving content

“content” can mean a lot of different things - but mostly, we mean the text of an article, either its current version or any previous versions. Current versions can be retrieved using page_content, which provides both HTML and wikitext as possible output formats. Older, individual revisions can be retrieved with revision_content. These functions also return a range of possible metadata about the revisions or articles in question.

Diffs between revisions can be generated using revision_diff, while individual elements of a page’s content - particularly links - can be extracted using page_links, page_backlinks, and page_external_links. And if the interest is in changes to content, rather than content itself, recent_changes can be used to grab a slice of a project’s Special:RecentChanges feed.
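
A minimal sketch of how two of these functions might be called is shown below. The wrappers `fetch_page` and `fetch_links` are hypothetical, and the argument names (`language`, `project`, `page_name`, `as_wikitext`, `page`) follow the WikipediR documentation as I understand it; check them against the installed version. Nothing is requested until the wrappers are called.

```r
# Hypothetical wrapper: current version of a page, as HTML or as wikitext.
fetch_page <- function(title, language = "en", project = "wikipedia",
                       wikitext = FALSE) {
  WikipediR::page_content(language = language, project = project,
                          page_name = title, as_wikitext = wikitext)
}

# Hypothetical wrapper: links found on a page.
fetch_links <- function(title, language = "en", project = "wikipedia") {
  WikipediR::page_links(language = language, project = project, page = title)
}

# Usage (needs the WikipediR package and network access):
# pneumonia_page  <- fetch_page("Pneumonia")
# pneumonia_links <- fetch_links("Pneumonia")
```

Both calls return the parsed API response as an R list, so inspecting it with `str()` is a reasonable first step before extracting fields.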

Retrieving metadata

Page-related information can be accessed using page_info, while categories that a page possesses can be retrieved with categories_in_page - the inverse of this operation (what pages are in a particular category?) uses pages_in_category.
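
The page and category lookups described above could be wrapped as follows. The wrapper names are hypothetical, and the argument names `page`, `pages` and `categories` are taken from the WikipediR documentation as I understand it, so treat them as assumptions to verify. The wrappers are only defined here, not called.

```r
# Hypothetical wrapper: basic page information plus the categories of that page.
fetch_page_metadata <- function(title, language = "en", project = "wikipedia") {
  list(
    info = WikipediR::page_info(language = language, project = project,
                                page = title),
    categories = WikipediR::categories_in_page(language = language,
                                               project = project,
                                               pages = title)
  )
}

# Hypothetical wrapper for the inverse lookup: which pages are in a category?
fetch_category_members <- function(category, language = "en",
                                   project = "wikipedia") {
  WikipediR::pages_in_category(language = language, project = project,
                               categories = category)
}

# Usage (needs the WikipediR package and network access):
# pneumonia_meta <- fetch_page_metadata("Pneumonia")
# members <- fetch_category_members("Pneumonia")
```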

User-related info can be accessed with user_information, while user_contributions allows access to recent contributions by a particular user: this can be conveniently linked up with