The semantic and multilingual enrichment of metadata in Europeana is a core concern as it improves access to the material, defines relations among objects and enables cross-lingual retrieval of documents. It aims at augmenting the metadata with new information about the agent who created the cultural object, the timespan in which it was created, the place where it was created and the concepts related to the cultural object.
There are three main elements that impact the process of enrichment: the source fields used (e.g. dc:creator), the target resources (e.g. DBpedia), and the rules to link the contents of the source fields with the target resources and augment the record with that new information. Information about the source and target resources, and the rules adopted, can be found in the document Semantic Enrichment Framework (previous discussions can be found in this internal document).
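As a rough illustration of how the three elements fit together, the sketch below links the values of a source field to a tiny target vocabulary using an exact, case-insensitive label match. It is not the rule set described in the Semantic Enrichment Framework: the `Entity` class, the `enrich` function, the matching rule, and the example.org URI are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """A target-resource entry (hypothetical, simplified shape)."""
    uri: str
    labels: dict[str, str]  # language code -> label


def normalise(value: str) -> str:
    """Case-fold and collapse whitespace so trivially different strings match."""
    return " ".join(value.casefold().split())


def enrich(record: dict[str, list[str]],
           entities: list[Entity],
           source_fields: list[str]) -> dict[str, list[str]]:
    """Return, per source field, the URIs of entities whose label matches a value.

    Assumed rule: exact match on the normalised label, in any language.
    """
    index: dict[str, Entity] = {}
    for entity in entities:
        for label in entity.labels.values():
            index.setdefault(normalise(label), entity)

    links: dict[str, list[str]] = {}
    for field_name in source_fields:
        for value in record.get(field_name, []):
            match = index.get(normalise(value))
            if match is not None:
                links.setdefault(field_name, []).append(match.uri)
    return links


# Hypothetical data: one agent with labels in two languages.
leonardo = Entity(
    uri="http://example.org/agent/1",
    labels={"en": "Leonardo da Vinci", "fr": "Léonard de Vinci"},
)
record = {"dc:creator": ["léonard de  vinci"], "dc:title": ["La Joconde"]}

links = enrich(record, [leonardo], source_fields=["dc:creator"])
print(links)
```

Because the index holds labels in every available language, a French creator string resolves to the same entity as an English one, which is the kind of cross-lingual link the enrichment aims for. Only the fields listed in `source_fields` are considered, so `dc:title` is ignored here.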
During enrichment at Europeana, selected subsets of the target resources of interest are downloaded in advance and transformed into RDF that follows our Europeana Data Model. That information makes up the Europeana Entity Collection. The collection is intended to be used not only for enrichment, but also as a source of suggestions during retrieval and as a source of information for the Entity pages, which link to the cultural objects in our collection related to those entities.
The Entity API can be accessed publicly and offers the following services:
- Retrieval of the metadata associated with an entity
- Suggestion of entities based on a string
- Resolution of external URIs
The code of the Entity API can be found on GitHub, and the configuration of the Solr index used in production is here.
Entity Suggestion
Entity suggestions are automatically added to the suggestions offered to users as they type in the query box to search Europeana Collections. The ranking of the entities offered is based on the following formula:
ln((1 + Wikidata PageRank) * Europeana document count) * 1000
The calculation of PageRank is done according to the methodology developed by Thalhammer and Rettinger, and described in their paper PageRank on Wikipedia: Towards General Importance for Entities.
Further internal discussions of this formula, and some of its challenges in relation to Europeana content, can be found in the Basecamp thread on the subject.
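The ranking formula above can be sketched in a few lines of Python. The function name, the sample entities, and their PageRank and document-count values are hypothetical. The handling of entities with no documents is also an assumption, since the formula as written is undefined when the document count is zero.

```python
import math


def suggestion_score(pagerank: float, document_count: int) -> float:
    """Compute ln((1 + pagerank) * document_count) * 1000.

    Assumption: when the product is not positive (e.g. no documents),
    the logarithm is undefined, so the entity is ranked last.
    """
    product = (1.0 + pagerank) * document_count
    if product <= 0:
        return float("-inf")
    return math.log(product) * 1000


# Hypothetical entities: (label, Wikidata PageRank, Europeana document count).
candidates = [
    ("Entity A", 12.5, 40),
    ("Entity B", 0.3, 5000),
    ("Entity C", 80.0, 0),
]

ranked = sorted(
    candidates,
    key=lambda c: suggestion_score(c[1], c[2]),
    reverse=True,
)
for label, pagerank, count in ranked:
    print(label, suggestion_score(pagerank, count))
```

Because the two factors are multiplied inside the logarithm, an entity with a modest PageRank but many documents (Entity B) can outrank a better-known entity with few documents (Entity A), and an entity with no documents at all (Entity C) ends up last under the assumption made here.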
Multilinguality significantly complicates the logic of the auto-suggest functionality, raising questions such as: should matches be attempted across linguistic boundaries? If so, under what circumstances? If not, what should be done when entities are missing language labels? The currently implemented logic is outlined in option four of this internal document: Entity API: suggest language logic. Internal discussions about this topic can be found in the document Entity API: suggest language logic (v3).
Evaluation
In our current search evaluation reports (C2 reports), we include the evaluation of the entity collection. Those metrics are based on Section 4 of the D6.3 Search improvement report, as well as on the internal document Evaluation of the Europeana Entity Collection knowledge graph.
Additional work done on the evaluation of enrichments:
- A bar chart illustrating the multilingual coverage and contribution of the Entity Collection can be found on Slide Ten of the Entitifying Europeana presentation.
- MTSR2012 - Poisonous India or The Importance of a Semantic and Multilingual Enrichment Strategy: http://link.springer.com/chapter/10.1007%2F978-3-642-35233-1_25
- MTSR2014 - A Framework for the Evaluation of Automatic Metadata Enrichments: http://link.springer.com/chapter/10.1007/978-3-319-13674-5_23
- MTSR2014 - Automatic Enrichments with Controlled Vocabularies in Europeana: Challenges and Consequences: https://link.springer.com/chapter/10.1007/978-3-319-13695-0_23
- EuropeanaTech 2013-14: Task Force on Multilingual and Semantic Enrichment Strategy: http://pro.europeana.eu/taskforce/multilingual-and-semantic-enrichment-strategy
- EuropeanaTech 2015: Task Force on Evaluation and Enrichment: http://pro.europeana.eu/taskforce/evaluation-and-enrichments
- Data at https://europeana.atlassian.net/wiki/spaces/RD/pages/36077569/Comparative+evaluation+of+enrichments
- PATHS Project. Evaluations of enrichments (including query enrichment, background links, item similarity, vocabulary extension):
  - 2011: http://pro.europeana.eu/files/Europeana_Professional/Projects/Project_list/PATHS/Deliverables/D2.1%20Content%20Processing-1st%20Prototype.pdf
  - 2013: http://pro.europeana.eu/files/Europeana_Professional/Projects/Project_list/PATHS/Deliverables/D2.2.Content%20Processing-2nd%20Prototype-revised.v2.pdf