What is it?

WikiParq is a parsed version of Wikipedia, repackaged as a new resource in the Parquet format. It is available for the following languages: English, French, German, Russian, Spanish, and Swedish.

There is also a corresponding Wikidata resource that has been filtered for these languages. All download links are available under the resources tab.


The input consists of Wikipedia dumps from February 3rd and 4th, 2016; the Wikidata dump is from February 22nd, 2016.
All dumps were downloaded from:


The license is as defined by Wikipedia: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)


There are different levels of information in these resources, numbered from 1 to 4.
The files here are large; please download only what you need, one file at a time.

Parquet Level 2

Id              Language  Format         SHA1                                      Front page
en-s2-20160310  English   tar (15 GiB)   8cc0222bffae0a7495edd904873171083cd66f46
fr-s2-20160310  French    tar (4.7 GiB)  f69e0953ebb580794c12e5ac19695f1a235a4ec9  édia:Accueil_principal
de-s2-20160310  German    tar (5.7 GiB)  924378225aa5b4074e4a2e11eb3efc8a99538c8c
ru-s2-20160310  Russian   tar (3.7 GiB)  9eaa4abd2e0f44940f8b8c40876dc90eea8b6124  Заглавная_страница
es-s2-20160310  Spanish   tar (3.0 GiB)  1436409966c93346c9ed128e2d10037a5809cb94
sv-s2-20160310  Swedish   tar (1.8 GiB)  82f7615e104f9628796613aee8562ab3246a6381

Parquet Level 3

Language  Format        SHA1
Swedish   tar (22 GiB)  4e14d59e672a02775cd3aeb5b847c8d7ac9f679b

Other resources

Resource          Format         SHA1
Wikidata Parquet  tar (1.2 GiB)  6c6dd4b0149fa8e1864311dc8e1c1937da2542eb
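
Since every download above is listed with a SHA1 value, it can be worth checking a finished download against it before unpacking. The following is a minimal sketch in Python using only the standard library; the function names and the file name in the usage note are assumptions for illustration and are not part of WikiParq.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of the file at `path`.

    The file is read in chunks so that multi-GiB archives
    do not have to fit in memory.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_sha1(path, expected_hex):
    """Compare a file's digest with a listed SHA1 value.

    Surrounding whitespace and letter case in `expected_hex` are ignored.
    """
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the Swedish level 2 archive was saved as sv-s2-20160310.tar (a hypothetical local file name), matches_listed_sha1("sv-s2-20160310.tar", "82f7615e104f9628796613aee8562ab3246a6381") should return True for an intact download.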

Sample Code (Scala)

We used Toree, which provides a Scala REPL in the Jupyter Notebook environment as well as a running Spark system.

Notebook (ipynb) Executed Notebook (html)


Every parquet resource has the same structure:

Column       Format                  Description
uri          string                  Unique identifier for an entity (Wikidata) or article (Wikipedia), e.g. urn:wikidata:Q34
lang         string                  Language identifier: ISO 639-1 codes, plus mul for multilingual entries
doc          string (might be null)  Type of document: article, category, disambiguation
source       string (might be null)  Layer name, e.g. node/token, edge/dependency-relation
sourceId     int (might be null)     Sequential id, unique per document. It can be used to find a specific node without including source, because it is unique across all layers.
predicate    string                  Relation between data
value[1-2]   string (might be null)  Depends on the relation
valuei[1-2]  int (might be null)     Depends on the relation
type         string                  Defines the type of the value fields, e.g. reference, wikilink, doclink, weblink and range
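
To make the column layout concrete, here is a minimal sketch of one row as a typed record in plain Python. It mirrors the column names above (value[1-2] and valuei[1-2] expand to value1, value2, valuei1, valuei2). The Row class, the find_node helper, the sample values, and the "token:text" predicate are assumptions made for illustration; they are not taken from the actual files.

```python
from typing import Iterable, List, NamedTuple, Optional


class Row(NamedTuple):
    """One record, with nullable columns typed as Optional."""
    uri: str
    lang: str
    doc: Optional[str]
    source: Optional[str]
    sourceId: Optional[int]
    predicate: str
    value1: Optional[str]
    value2: Optional[str]
    valuei1: Optional[int]
    valuei2: Optional[int]
    type: str


def find_node(rows: Iterable[Row], uri: str, source_id: int) -> List[Row]:
    """Return all rows for one node of one document.

    sourceId is unique across all layers of a document, so the
    layer name in `source` is not needed for the lookup.
    """
    return [r for r in rows if r.uri == uri and r.sourceId == source_id]


# Hypothetical sample rows; the values are invented for illustration.
title_row = Row(
    uri="urn:wikidata:Q34", lang="sv", doc="article",
    source=None, sourceId=None, predicate="document:title",
    value1="Sverige", value2=None, valuei1=None, valuei2=None,
    type="string",
)
token_row = Row(
    uri="urn:wikidata:Q34", lang="sv", doc="article",
    source="node/token", sourceId=7,
    predicate="token:text",  # assumed relation name
    value1="Sverige", value2=None, valuei1=None, valuei2=None,
    type="string",
)
```

With these two rows, find_node([title_row, token_row], "urn:wikidata:Q34", 7) returns a list containing only token_row.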


Type       Description
string     value1 contains a string
range      valuei1 and valuei2 contain the start and end
wikilink   value1 contains a target uri, value2 contains the text of the link
weblink    value1 contains a target url, value2 contains the text of the link
doclink    value1 contains the layer, valuei1 contains the sourceId value
reference  value1 contains the target uri
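
The type table can be read as a dispatch rule: the type column says which of the value fields carry data. The sketch below shows one way to express that in Python, assuming a row is available as a dict keyed by column name. The decode_value function and its return shapes are illustrative choices, not part of the resource, and whether the end of a range is inclusive is not stated here, so the sketch returns both numbers unchanged.

```python
def decode_value(row):
    """Pick the meaningful value fields of a row based on its type."""
    kind = row["type"]
    if kind == "string":
        return row["value1"]
    if kind == "range":
        # (start, end) exactly as stored in valuei1 and valuei2
        return (row["valuei1"], row["valuei2"])
    if kind in ("wikilink", "weblink"):
        # value1 is the target uri/url, value2 is the link text
        return {"target": row["value1"], "text": row["value2"]}
    if kind == "doclink":
        # points at another node: layer name plus its sourceId
        return {"layer": row["value1"], "sourceId": row["valuei1"]}
    if kind == "reference":
        return row["value1"]
    raise ValueError("unknown type: %r" % (kind,))
```

For instance, a row with type "range", valuei1 = 0 and valuei2 = 7 decodes to the tuple (0, 7).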


The full relation is the parent concatenated with the child; e.g. "document:title" is one relation.
Relation Description
:title Primary title of the article/disambiguation page
:alt_title Alternative title of the article/disambiguation page that has been extracted via redirects
:wiki_page_id Wikipedia page id as stored in the original wikipedia dump
:text Unformatted fulltext of the article
:category Which categories this article belongs to
:resolved_target Resolved targets, either to a Wikidata entity or a normalized Wikipedia page.
:unresolved_target Weblinks, Wikipedia pages that do not yet exist, or pages that failed to resolve.
:member-of Which categories this category is a member of
:title Title of category
:idx Index in sentence
:cpostag Coarse Part-of-speech (POS) tag using Google Universal POS tag set
:deprel Dependency relation
:feats Morphological features
:head Head in the dependency tree
:lemma Base form of word
:norm Normalized version of word
:pos Language dependent POS tag
:text The token as a string
All the ranges of annotations in layers, the number of ranges varies with the level.
:deprel:label The dependency relation label
:head Head of head
:tail Tail of head
ne:label Named entity label
paragraph:source The source of the paragraph, if converted from list or table, useful for filtering.
section:title The page sections.
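
As a small illustration of how relation names are formed and used, the sketch below joins a parent and a child into a full relation and then collects the value1 strings stored under that relation. Only "document:title" is confirmed by the text above; the helper names and the sample rows are assumptions made for illustration, and rows are again assumed to be dicts keyed by column name.

```python
def relation(parent, child):
    """Join a parent and a child into a full relation name.

    Children are listed with a leading colon (":title"), so the
    colon is only added when it is missing.
    """
    if not child.startswith(":"):
        child = ":" + child
    return parent + child


def values_for(rows, predicate):
    """Map each uri to the value1 strings stored under `predicate`."""
    result = {}
    for row in rows:
        if row["predicate"] == predicate:
            result.setdefault(row["uri"], []).append(row["value1"])
    return result


# Hypothetical rows, reduced to the columns used here.
sample_rows = [
    {"uri": "urn:wikidata:Q34", "predicate": "document:title", "value1": "Sverige"},
    {"uri": "urn:wikidata:Q34", "predicate": "document:alt_title", "value1": "Konungariket Sverige"},
]
titles = values_for(sample_rows, relation("document", ":title"))
```

With the sample rows above, titles maps "urn:wikidata:Q34" to the single-element list ["Sverige"]; the alternative title is skipped because its predicate differs.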

