WikiParq

What is it?

WikiParq is a parsed version of Wikipedia, repackaged as a new resource in the Parquet format. It is available for the following languages: English, French, German, Russian, Spanish, and Swedish.

There is also a corresponding Wikidata resource that has been filtered for these languages. All download links are available under the resources tab.

Source

The input consists of Wikipedia dumps taken between February 3rd and 4th, 2016; the Wikidata dump is from February 22nd, 2016.
All dumps were downloaded from: https://dumps.wikimedia.org/backup-index.html

License

As defined by Wikipedia: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)

Levels

There are different levels of information in these resources, numbered from 1 to 4.
The files are large; please download only what you need, one file at a time.

Parquet Level 2

Id Language Format SHA1 Front page
en-s2-20160310 English tar (15 GiB) 8cc0222bffae0a7495edd904873171083cd66f46 https://en.wikipedia.org/wiki/Main_Page
fr-s2-20160310 French tar (4.7 GiB) f69e0953ebb580794c12e5ac19695f1a235a4ec9 https://fr.wikipedia.org/wiki/Wikipédia:Accueil_principal
de-s2-20160310 German tar (5.7 GiB) 924378225aa5b4074e4a2e11eb3efc8a99538c8c https://de.wikipedia.org/wiki/Wikipedia:Hauptseite
ru-s2-20160310 Russian tar (3.7 GiB) 9eaa4abd2e0f44940f8b8c40876dc90eea8b6124 https://ru.wikipedia.org/wiki/Заглавная_страница
es-s2-20160310 Spanish tar (3.0 GiB) 1436409966c93346c9ed128e2d10037a5809cb94 https://es.wikipedia.org/wiki/Wikipedia:Portada
sv-s2-20160310 Swedish tar (1.8 GiB) 82f7615e104f9628796613aee8562ab3246a6381 https://sv.wikipedia.org/wiki/Portal:Huvudsida

Parquet Level 3

Language Format SHA1
Swedish tar (22 GiB) 4e14d59e672a02775cd3aeb5b847c8d7ac9f679b

Other resources

Resource Format SHA1
Wikidata Parquet tar (1.2 GiB) 6c6dd4b0149fa8e1864311dc8e1c1937da2542eb
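
Since every download is listed with a SHA1 checksum, a downloaded archive can be verified before it is unpacked. The following is a minimal sketch in plain Scala using the JDK's MessageDigest; the object name Sha1Check and its method names are illustrative and not part of WikiParq.

```scala
import java.io.{FileInputStream, InputStream}
import java.security.MessageDigest

object Sha1Check {
  /** Streams `in` through SHA-1 and returns the lowercase hex digest. */
  def sha1Hex(in: InputStream): String = {
    val digest = MessageDigest.getInstance("SHA-1")
    val buffer = new Array[Byte](1 << 16)
    var read = in.read(buffer)
    while (read != -1) {
      digest.update(buffer, 0, read)
      read = in.read(buffer)
    }
    // Render each byte as two hex digits.
    digest.digest().map(b => f"${b & 0xff}%02x").mkString
  }

  /** Compares the digest of the file at `path` with the expected checksum. */
  def matches(path: String, expected: String): Boolean = {
    val in = new FileInputStream(path)
    try sha1Hex(in) == expected.trim.toLowerCase
    finally in.close()
  }
}
```

For example, Sha1Check.matches(path, "82f7615e104f9628796613aee8562ab3246a6381") should return true for an intact copy of the Swedish level 2 archive, where path is wherever the tar file was saved locally.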

Sample Code (Scala)

We used Toree, which provides a Scala REPL in the Jupyter Notebook environment as well as a running Spark system.

Notebook (ipynb) Executed Notebook (html)

Schema

Every parquet resource has the same structure:

Column Format Description
uri string Unique identifier for an entity (Wikidata) or article (Wikipedia), e.g. urn:wikidata:Q34
lang string Language identifier, ISO 639-1 codes + mul for multilingual entries
doc string (might be null) Type of document: article, category, disambiguation
source string (might be null) Layer name e.g. node/token, edge/dependency-relation
sourceId int (might be null) Sequential id, unique per document. It can be used to find a specific node without including source, because it is unique across all layers.
predicate string Relation between data
value[1-2] string (might be null) Depends on relation.
valuei[1-2] int (might be null) Depends on relation.
type string Defines the type of the value fields, e.g. reference, wikilink, doclink, weblink and range
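
As a rough illustration of this schema outside Spark, one row can be modelled as a Scala case class in which each nullable column becomes an Option. This is a sketch under assumptions: the names WikiParqSchema, Row, valueType and nodeRows are invented here, value[1-2] and valuei[1-2] are expanded to value1, value2, valuei1 and valuei2, and a document is assumed to be identified by its uri together with its lang.

```scala
object WikiParqSchema {
  /** One row of a WikiParq parquet file; nullable columns are modelled as Option. */
  final case class Row(
    uri: String,              // entity or article identifier, e.g. urn:wikidata:Q34
    lang: String,             // ISO 639-1 code, or "mul"
    doc: Option[String],      // article, category, disambiguation
    source: Option[String],   // layer name, e.g. node/token
    sourceId: Option[Int],    // unique per document across all layers
    predicate: String,        // relation, e.g. document:title
    value1: Option[String],
    value2: Option[String],
    valuei1: Option[Int],
    valuei2: Option[Int],
    valueType: String         // the column itself is named "type"
  )

  /** All rows describing one node of one document, found without consulting `source`.
    * Assumes (uri, lang) identifies the document. */
  def nodeRows(rows: Seq[Row], uri: String, lang: String, sourceId: Int): Seq[Row] =
    rows.filter(r => r.uri == uri && r.lang == lang && r.sourceId.contains(sourceId))
}
```

Because sourceId is unique over all layers of a document, nodeRows does not need the source column to pick out a single node.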

Types

Type Description
string value1 field contains a string
range valuei1 and valuei2 contain the start and end
wikilink value1 contains a target uri, value2 contains the text of the link
weblink value1 contains a target url, value2 contains the text of the link
doclink value1 contains which layer, valuei1 contains the sourceId value
reference value1 contains the target uri
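
The table above can be read as a decoding rule: the type column says which value fields carry meaning. A minimal sketch of that rule in plain Scala follows; the names WikiParqValues, Value, Text, Span and decode are invented for illustration, and returning None for an unknown type or a missing required field is an assumption made here rather than documented behavior.

```scala
object WikiParqValues {
  sealed trait Value
  final case class Text(text: String) extends Value
  final case class Span(start: Int, end: Int) extends Value
  final case class WikiLink(targetUri: String, linkText: Option[String]) extends Value
  final case class WebLink(targetUrl: String, linkText: Option[String]) extends Value
  final case class DocLink(layer: String, sourceId: Int) extends Value
  final case class Reference(targetUri: String) extends Value

  /** Interprets the value columns of one row according to its type column. */
  def decode(
      tpe: String,
      value1: Option[String],
      value2: Option[String],
      valuei1: Option[Int],
      valuei2: Option[Int]
  ): Option[Value] =
    tpe match {
      case "string"    => value1.map(Text(_))
      case "range"     => for (s <- valuei1; e <- valuei2) yield Span(s, e)
      case "wikilink"  => value1.map(WikiLink(_, value2))
      case "weblink"   => value1.map(WebLink(_, value2))
      case "doclink"   => for (l <- value1; id <- valuei1) yield DocLink(l, id)
      case "reference" => value1.map(Reference(_))
      case _           => None // unknown type: left undecoded in this sketch
    }
}
```

Modelling the types as a sealed trait lets the compiler check that later pattern matches cover every case.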

Relations

The full relation name is the parent concatenated with the child, e.g. "document:title" is one relation.
Relation Description
document
:title Primary title of the article/disambiguation page
:alt_title Alternative title of the article/disambiguation page that has been extracted via redirects
:wiki_page_id Wikipedia page id as stored in the original Wikipedia dump
:text Unformatted fulltext of the article
:category Which categories this article belongs to
link
:resolved_target Resolved targets, either to a Wikidata entity or a normalized Wikipedia page.
:unresolved_target Web links, Wikipedia pages that do not yet exist, or pages that failed to resolve.
category
:member-of Which categories this category is a member of
:title Title of category
token
:idx Index in sentence
:cpostag Coarse part-of-speech (POS) tag, using the Google Universal POS tag set (hosted on GitHub)
:deprel Dependency relation
:feats Morphological features
:head Head in the dependency tree
:lemma Base form of the word
:norm Normalized version of word
:pos Language dependent POS tag
:text The token as a string
range
:clean
:heading
:italic
:link
:list_item
:list_section
:named_entity
:paragraph
:section
:sentence
:strong
:token
All the annotation ranges in the layers; the number of ranges varies with the level.
edge
:deprel:label The dependency relation label
:head Head of the edge
:tail Tail of the edge
ne:label Named entity label
paragraph:source The source of the paragraph, if converted from list or table, useful for filtering.
section:title The page sections.
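
The naming rule for relations (parent concatenated with child) and a simple selection by predicate can be sketched in plain Scala as follows. The names WikiParqRelations, relation, split and valuesOf are illustrative only, and rows are reduced to (predicate, value1) pairs to keep the example short.

```scala
object WikiParqRelations {
  /** Joins a parent and a child entry from the relation table, e.g. ("document", ":title"). */
  def relation(parent: String, child: String): String =
    parent + (if (child.startsWith(":")) child else ":" + child)

  /** Splits a full relation at its first colon into (parent, child). */
  def split(relation: String): (String, String) = {
    val i = relation.indexOf(':')
    if (i < 0) (relation, "")
    else (relation.substring(0, i), relation.substring(i + 1))
  }

  /** Collects the non-null value1 of every (predicate, value1) pair with the given predicate. */
  def valuesOf(rows: Seq[(String, Option[String])], predicate: String): Seq[String] =
    rows.collect { case (p, Some(v)) if p == predicate => v }
}
```

Splitting at the first colon keeps multi-part children intact, so "edge:deprel:label" yields the parent "edge" and the child "deprel:label".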

Marcus Klang 2016 | Last Edit: 2016-03-10