Dataset download and OAI-PMH service

Besides our main Europeana APIs for searching and retrieving metadata about objects, we also offer other methods for downloading and harvesting metadata that are better suited if you want to extract large amounts of data. On this page, you can explore the two available solutions. 

If you want a full discrete dataset from a single data provider, or if you want just a snapshot of our data, then we suggest downloading object metadata from our FTP server as pre-generated compressed ZIP files.

If you want to stay up to date as metadata changes, or if you already use harvesting software, then we recommend harvesting through our OAI-PMH service. OAI-PMH serves records in XML format, which is ideal for data processing activities, especially for digital cultural heritage research. For researchers who are used to working with semantic frameworks and tools such as Jena and SPARQL, we also offer compressed ZIP files for download formatted in Turtle.

Before starting to use either of these options, please read our Introduction page on how data is structured into Records and Datasets, the API Terms of Use and the Usage Guidelines for metadata.

 

Europeana’s FTP server

Our FTP server serves ZIP files containing the metadata of all objects in Europeana's repository, organised by dataset and readily available for bulk download. These files are regenerated every Sunday evening, so a download reflects the repository as of the most recent weekly run.

FTP listing and file structure

All the files are available on our FTP server at ftp://download.europeana.eu/dataset/. You can connect with an FTP client such as FileZilla, mount the server as a Shared Network Location, or use the Command Prompt. On Linux, for example, the following command mirrors the whole XML directory:

    wget -m ftp://download.europeana.eu/dataset/XML

Europeana’s FTP server login credentials:

  • Host: ftp://download.europeana.eu/dataset/

  • User: anonymous

  • Password: [leave blank]

  • Port: 21

Our FTP server is structured as follows:

  • Two top-level directories, ‘XML’ and ‘TTL’, hold the data in RDF/XML format and in Turtle format respectively.

  • Within those directories, each ZIP file contains all of the metadata for one Dataset in Europeana, and the name of the file is the dataset identifier (e.g. 2021672.zip). Every ZIP file has a corresponding MD5 checksum file with the extension .md5sum (e.g. 2021672.zip.md5sum), which can be used to validate the file after download.

  • Each ZIP file contains one file per Europeana metadata record, and the name of that file is the local identifier of the Record in Europeana.
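The layout above can be turned into a small download-and-validate routine. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and it assumes that the .md5sum file follows the usual md5sum layout (hex digest first, then the file name), which this page does not state.

```python
import hashlib
from ftplib import FTP

FTP_HOST = "download.europeana.eu"


def remote_path(dataset_id, fmt="XML"):
    """Server path of a dataset archive, e.g. /dataset/XML/2021672.zip."""
    if fmt not in ("XML", "TTL"):
        raise ValueError("fmt must be 'XML' or 'TTL'")
    return f"/dataset/{fmt}/{dataset_id}.zip"


def fetch(ftp, path, dest):
    """Stream one remote file into a local file."""
    with open(dest, "wb") as out:
        ftp.retrbinary(f"RETR {path}", out.write)


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are never held in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(checksum_text):
    """Extract the digest, ASSUMING the layout '<hex digest>  <file name>'."""
    return checksum_text.split()[0].lower()


def download_and_verify(dataset_id, fmt="XML"):
    """Download an archive and its checksum file; return True if they match."""
    archive = f"{dataset_id}.zip"
    path = remote_path(dataset_id, fmt)
    with FTP(FTP_HOST) as ftp:  # connects on the default port 21
        ftp.login()             # ftplib logs in as 'anonymous' by default
        fetch(ftp, path, archive)
        fetch(ftp, path + ".md5sum", archive + ".md5sum")
    with open(archive + ".md5sum", encoding="ascii") as handle:
        return md5_of_file(archive) == expected_md5(handle.read())
```

Calling download_and_verify("2021672") would fetch 2021672.zip and 2021672.zip.md5sum into the current directory and report whether the checksum matches.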

Example

The data for the Girl with the Pearl Earring from the Mauritshuis, encoded in RDF/XML format, is available at the URL ftp://download.europeana.eu/dataset/XML/2021672.zip. To find out which dataset a record belongs to, check the URL of the record (for the Girl with the Pearl Earring, the Europeana item URL is https://www.europeana.eu/item/2021672/resource_document_mauritshuis_670 ), or look at the dataset name next to the field 'Collection Name' in the 'More Metadata' tab on the item page.
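As a rough illustration of that lookup, the sketch below splits an item URL into its dataset number and local identifier and builds the matching archive URL. It assumes the two path segments after "item" are always the dataset number and the local identifier, as in the example above. The function names are illustrative.

```python
from urllib.parse import urlparse


def split_item_url(item_url):
    """Return (dataset_id, local_id) from a Europeana item URL."""
    parts = [p for p in urlparse(item_url).path.split("/") if p]
    try:
        # Assumption: the segments after 'item' are dataset number and local id.
        start = parts.index("item")
        return parts[start + 1], parts[start + 2]
    except (ValueError, IndexError):
        raise ValueError(f"not an item URL: {item_url}") from None


def archive_url(dataset_id, fmt="XML"):
    """FTP URL of the archive that holds every record of a dataset."""
    return f"ftp://download.europeana.eu/dataset/{fmt}/{dataset_id}.zip"
```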

Requesting the URL ftp://download.europeana.eu/dataset/XML/2021672.zip returns a ZIP file with the metadata for all the objects in dataset '2021672'. Unzipping it gives you one XML file per digital cultural heritage object, each named after the object's local identifier. The metadata for the “Girl with the Pearl Earring”, whose identifier is 'resource_document_mauritshuis_670', is therefore in the file "resource_document_mauritshuis_670.xml".
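A single record can also be read straight from the archive without unpacking everything. The following is a minimal sketch with Python's zipfile module. It matches on the base file name because this page does not say whether the files sit at the top level of the archive or inside a folder. The function name is illustrative.

```python
import zipfile


def read_record(archive, local_id):
    """Return the raw XML bytes of one record from a dataset archive.

    `archive` may be a path or a binary file-like object.
    """
    wanted = f"{local_id}.xml"
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Compare base names so nested folders (if any) do not matter.
            if name.rsplit("/", 1)[-1] == wanted:
                return zf.read(name)
    raise KeyError(f"{wanted} not found in archive")
```

For the example above, read_record("2021672.zip", "resource_document_mauritshuis_670") would return the contents of that record's XML file.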

Accessing images in high resolution: downloading data

To foster the reuse of the data that is published in Europeana as part of the Newspapers Thematic Collections, we make both the metadata and the full-text available for bulk download as compressed ZIP files. The metadata is available under CC0, like all the metadata exposed via the API (see Terms of Use), while the full-text is available under the Public Domain Mark.

List of datasets

The table below lists all the datasets that are published and available for download. If you are looking for the complete text of a Newspaper, we suggest using option (4) rather than option (3), where the transcription is partitioned per page.

Because the files are very large and can take many hours to download, you can, as an alternative to downloading directly via the browser, log in to the FTP server at "download.europeana.eu" with the username "anonymous". This allows you to resume a download that gets interrupted.
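Resuming can also be scripted. The sketch below uses the rest argument of ftplib's retrbinary, which sends an FTP REST command before the transfer so that it starts at a given byte offset. It assumes the server honours REST. The remote path of each Newspapers file is not given on this page, so it is left as a parameter. The function names are illustrative.

```python
import os
from ftplib import FTP


def resume_offset(local_path):
    """Number of bytes already on disk, or 0 if the file does not exist yet."""
    return os.path.getsize(local_path) if os.path.exists(local_path) else 0


def resume_download(remote_path, local_path, host="download.europeana.eu"):
    """Continue a partial download from where the local file ends."""
    offset = resume_offset(local_path)
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        # Append mode keeps the bytes that were already downloaded.
        with open(local_path, "ab") as out:
            # rest=<offset> asks the server to skip the bytes we already have.
            ftp.retrbinary(f"RETR {remote_path}", out.write, rest=offset or None)
    return resume_offset(local_path)
```

Running resume_download again after an interruption picks up at the current size of the local file. Validating the finished file against its MD5 checksum is still advisable.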

Dataset number | Metadata (1) | Full-text (ALTO) (2) | Page level full-text (EDM) (3) | Issue level full-text (EDM) (4)
9200300 | download (229M) (MD5) | download (63G) (MD5) | download (116G) (MD5) | download (113G) (MD5)
9200301 | download (37M) (MD5) | download (13G) (MD5) | download (20G) (MD5) | download (20G) (MD5)
9200338 | download (213M) (MD5) | download (158G) (MD5) | download (278G) (MD5) | download (277G) (MD5)
9200339 | download (39M) (MD5) | download (11G) (MD5) | download (21G) (MD5) | download (17G) (MD5)
9200355 | download (212M) (MD5) | download (97G) (MD5) | download (159G) (MD5) | download (157G) (MD5)
9200356 | download (137M) (MD5) | download (40G) (MD5) | download (17G) (MD5) | download (17G) (MD5)
9200357 | download (23M) (MD5) | download (5G) (MD5) | download (9G) (MD5) | download (9G) (MD5)
9200396 | download (4M) (MD5) | download (849M) (MD5) | download (2G) (MD5) | download (1G) (MD5)

Legend:

  1. The original metadata in EDM XML format, as it was before being ingested into Europeana. There are slight differences between this data and the published data. For more information see the EDM documentation page.

  2. The full-text encoded using ALTO (Analyzed Layout and Text Object) as it was delivered to Europeana. ALTO is an open XML Schema meant to describe text coming from OCR and layout information of pages for digitized material. For more information see the official documentation page at the Library of Congress.

  3. The full-text encoded using the EDM profile for IIIF full-text after being preprocessed for publication in Europeana. Note that, as opposed to the format used by the API (i.e. JSON-LD), the data is in RDF/XML, as that is the format used for ingestion into Europeana.

  4. Very similar to (3) but with the full-text represented at the Issue level. This means that the edm:FullTextResource will convey the complete transcription of the Newspaper.

Dataset structure

Each compressed ZIP file typically contains one file per item (i.e. metadata or issue level full-text) or per page (i.e. ALTO and page level full-text), with the following structure:

  • Item: DATASET_ID/LOCAL_ID.xml

  • Page: DATASET_ID/LOCAL_ID/PAGE_ID.xml

That structure can be translated into links to the Europeana Collection portal where the item can be displayed or into the several APIs described on this page.
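One way to do that translation is sketched below. An archive entry is split into its dataset, local and page identifiers, and the first two are combined into a portal link. The link pattern https://www.europeana.eu/item/DATASET_ID/LOCAL_ID is taken from the item URL example earlier on this page. The class and function names are illustrative.

```python
from typing import NamedTuple, Optional


class ArchiveEntry(NamedTuple):
    dataset_id: str
    local_id: str
    page_id: Optional[str]  # None for item-level files


def parse_entry(path):
    """Split 'DATASET_ID/LOCAL_ID.xml' or 'DATASET_ID/LOCAL_ID/PAGE_ID.xml'."""
    if not path.endswith(".xml"):
        raise ValueError(f"expected an .xml entry: {path}")
    parts = path[: -len(".xml")].split("/")
    if len(parts) == 2:
        return ArchiveEntry(parts[0], parts[1], None)
    if len(parts) == 3:
        return ArchiveEntry(parts[0], parts[1], parts[2])
    raise ValueError(f"unexpected entry layout: {path}")


def item_url(entry):
    """Portal link for the item an entry belongs to (pages share the item link)."""
    return f"https://www.europeana.eu/item/{entry.dataset_id}/{entry.local_id}"
```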

OAI-PMH

The Europeana OAI-PMH Service offers a way to collect large amounts of Europeana data from our repository through a protocol named OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting, currently at version 2.0). This service allows you to harvest our entire database or a selection of it. You can select which parts to download by specifying the datasets you want, or by filtering on the date of creation or date of modification of the data.

You can learn more about the harvesting protocol on the Open Archives Initiative (OAI) website and also by reading the OAI for beginners tutorial from the Open Archives Forum.

Available requests

Below you can find the available requests. The base URL for all requests is https://api.europeana.eu/oai/record/. These links and requests return XML, for which you need to use an XML-aware browser or viewing application.
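Because every response is XML, a harvester usually parses it rather than viewing it. The sketch below reads a ListIdentifiers response with Python's standard XML parser. The element names and the namespace come from the OAI-PMH v2.0 specification and do not depend on Europeana. The function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH v2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_identifiers(xml_data):
    """Return (identifiers, resumption_token) from a ListIdentifiers response.

    The token is None when the response is the last page of the list.
    """
    root = ET.fromstring(xml_data)
    error = root.find("oai:error", OAI_NS)
    if error is not None:
        raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")
    identifiers = [
        (el.text or "").strip()
        for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return identifiers, token
```

A harvesting loop would repeat the request with the returned resumptionToken until the token comes back as None.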

List of available requests defined by the OAI-PMH protocol:

Structure and Format of the Data

The records in the OAI-PMH service are grouped into Datasets and are available as EDM RDF/XML. An example of a dataset ID that is accepted by the OAI-PMH service is 2022608_Ag_NO_ELocal_DiMu. The records are identified by their URIs. An example of such an identifier is http://data.europeana.eu/item/2022608/AAK_AAKS_2007_02_0206. To learn more about http://data.europeana.eu and its resources please see the EDM definitions.
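To make these identifiers concrete, here is a minimal sketch that composes request URLs for this dataset and record. The verbs (GetRecord, ListIdentifiers) and parameter names (metadataPrefix, set, identifier, from, until) are those of the OAI-PMH v2.0 specification. The metadataPrefix value "edm", the timestamp format, and dropping the trailing slash from the base URL are assumptions that this page does not confirm.

```python
from urllib.parse import urlencode

# Base URL from this page, with the trailing slash dropped (assumption).
BASE_URL = "https://api.europeana.eu/oai/record"


def oai_url(verb, **params):
    """Compose an OAI-PMH request URL; parameters set to None are skipped.

    'from' is a Python keyword, so callers pass it as from_.
    """
    query = {"verb": verb}
    for key, value in params.items():
        if value is not None:
            query[key.rstrip("_")] = value
    return f"{BASE_URL}?{urlencode(query)}"


# One record, addressed by its URI ("edm" is an assumed metadataPrefix).
record_url = oai_url(
    "GetRecord",
    metadataPrefix="edm",
    identifier="http://data.europeana.eu/item/2022608/AAK_AAKS_2007_02_0206",
)

# Identifiers of one dataset changed since a given date (assumed date format).
changes_url = oai_url(
    "ListIdentifiers",
    metadataPrefix="edm",
    set="2022608_Ag_NO_ELocal_DiMu",
    from_="2020-01-01T00:00:00Z",
)
```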

Known limitations

Europeana currently doesn't maintain a deleted record registry. Therefore we recommend you re-harvest or download the entire collection at least every six months to ensure your copy of the Europeana repository is up-to-date.
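Since deletions are not announced, one way to spot them is to compare the identifiers of two complete harvests. This is only a sketch of that idea, with an illustrative function name. It is not a feature of the service.

```python
def removed_identifiers(previous, current):
    """Identifiers present in the previous full harvest but missing from the new one."""
    return sorted(set(previous) - set(current))
```

Records returned by this comparison are candidates for removal from your local copy. The comparison is only meaningful when both lists come from complete harvests of the same datasets.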

Roadmap and Changelog

We deploy new versions of the service primarily to fix any outstanding issues or introduce new features. The current version of the OAI-PMH Service is 0.8 Beta (2020-10). To see the changes made for this version and also all previous releases, see the API changelog in the project GitHub.