Created by

Result ranking is performed in Europeana through use of a Solr plugin which ranks in line with the BM25F algorithm. The BM25F plugin is supplied with a machine-learning framework for learning the weights to be applied to the various fields.

Code for the BM25F plugin is here: https://github.com/europeana/contrib/tree/master/bm25f-ranking.

Code for running the machine-learning framework can be found here: https://github.com/europeana/contrib/tree/master/query-logs-analysis.

The code and instructions in the Github repos above are reasonably self-contained. Background information, however, is provided byDeliverable 2.2.3 - Metadata-Based Indexing and Ranking.

Thematic Collections Aliasing

The purpose of the Thematic Collections Aliasing (Thecla) extension is to allow the use of short aliases to refer to the lengthy queries used to create thematic collections.

Note that Thecla is not strictly-speaking a plugin - it involves changes to the Solr core package outside the usual plugin classes.

Entity Collection

Last updated Dec 15, 2017

The Entity Collection is intended to support a number of functionalities, including but not limited to:

Autocomplete
High retrieval precision
Entity pages
Structuring of Europeana content as a knowledge graph

Substantial work has gone into the creation and curation of the Entity Collection. The pages here are intended to describe only those points at which the Entity Collection intersects specifically with Search functionality - chiefly via the Entity API.

The code for Thecla, and complete documentation on its deployment and use, can be found on GitHub: https://github.com/europeana/search/tree/master/collections_aliasing

Testing

Two kinds of testing are necessary to ensure a Solr deployment is ready for use by API clients.

Operationaltesting ensures that Solr, as a service, is behaving as expected: that it can be queried, and is reasonably responsive to such queries.

Contenttesting ensures that the data is as expected and is being served as expected - for example, that fields defined as tokenised are in fact being tokenised.

At the moment, operational testing is an automated affair, while content testing is largely manual. Although work is underway on improving the automation of content-testing, at time fo writing (2017.12.14) content testing will be a partially- or largely-manual affair for the foreseeable future.

Operational

Operational testing of our Solr deployments is carried out using SolrMeter.

An overview of how to use SolrMeter can be found in ourGitHub testing repository. Settings and sample queries are in addition provided for theSearchandEntityAPIs.

Search API

Tests for the Solr deployment powering the Search API are currently listed in theTests Reindexdocument.

ASpike documentinto the feasibility of creating an automated harness and reporting mechanism for such tests has been completed, but has not yet been resolved into actionable tickets.

Entity API

At the moment the small size and fluctuating content of the Entity Collection has meant that no large-scale approach to testing has been developed.

Autocomplete

Autocomplete is intended to be the main path through which users enter the Entity Collection, and is one of the chief functionalities supported by the EC.

Code and documentation for the Solr layer of autocomplete functionality, including ETL scripts to transform EC contents from Mongo to Solr and Solr configuration files, can be found in the Europeana SearchAutocompleteGitHub directory.

Ranking

The current (i.e., as of 2017.12.18) implementation of the Entity Collection at the Solr level uses the following formula for ranking:

ln((1 + Wikidata PageRank) * Europeana document count) * 1000

The addition of 1 to the PageRank score is to prevent values < 1 penalising the relevance score; the natural log and multiplication by 1000 is simply to get the numbers into a range Solr deals with comfortably.

The calculation of PageRank is done according to the methodology developed by Thalhammer and Rettinger, and described in their paper PageRank on Wikipedia: Towards General Importance for Entities.

Further discussion of this formula, and some of its challenges in relation to Europeana content, can be found in the Basecamp thread on the subject.

Language Logic

Multilinguality significantly complicates the logic of autosuggest functionality, raising questions such as: should matches be attempted across linguistic boundaries? If so, under what circumstances? If not, what should be done in cases where entities are missing language labels? And so forth.

This complexity means that the question of language-handling logic is continually under review. The currently-implemented logic is outlined as Option Four in the Entity Collection: Cross-Language Matching and Relevance Logic document.

More recent discussions of alternative approaches can be found in the Entity API: suggest language logic (v3) document.

Query Strategy

This page is intended as a placeholder until such a time as the strategy for querying Europeana Collections using the Entity Collection has been decided - that is to say: what query is launched once an entity from the dropdown autocomplete has been picked?

Logging

Logging for search purposes records, broadly speaking, three pieces of information:

search and filter terms used
number of results returned
the rank of the items clicked

Harvesting of this information occurs in three steps:

First, all API interactions, including search interactions, are sent to the ElasticSearch - Kibana - Logstash (ELK) stack for initial, virtually raw, logging.
Scripts are then used to query the ELK application and retrieve all user IDs.
These user IDs are then used to query the ELK stack and retrieve all interactions on a user-by-user basis. The result is query files that are organised chronologically, by user.

In addition, at times other scripts may be written ad hocfor post-procoessing and analysis of particular questions as they arise

The ELK-stack

Query logging in the first instance is performed using an ELK (ElasticSearch-Logstash-Kibana) stack. The ELK-stack is essentially an aggregator of other logs, collecting them, rewriting them in a standard format, and storing them in ElasticSearch for querying and retrieval.

Logz.io has provided a complete guide to ELK logging. For the purposes of search analysis, however, most users will find it sufficient to know how to query an ELK instance. Note that the query syntax used is that of ElasticSearch (which is in turn very similar to that of Solr).

The Europeana ELK instance can be found at http://elasticsearch2.eanadev.org:5601/app/kibana. The easiest way to filter this logging-firehose down to search activity only is to issue the query

+message:"Search Interaction"

This query will reduce the flow of information to two kinds of messages:

1. Initial searches resulting in a search-result list. Entries regarding such searches will have in their 'message' field contents of the following form:

Search interaction: * Search parameters: {"q"=>"Ryckaert"} * Total hits: 121

2. Which item in the search-result list was clicked. These entries are of the following form (again in the 'message'field):

Search interaction: * Record: /92062/BibliographicResource_1000126163678 * 
Search parameters: {"q"=>"Ryckaert"} * Total hits: 121 * Result rank: 3

If you are trying to narrow down these results still further (for example, to track the progress of a particular query), add one of the sought-for search terms to your ELK query. For example, to track down all searches related to 'Boethius', perform a search of the form:

+message:"Search interaction" +message:Boethius

Querying for users

While the ELK-stack logs (amongst other things) all API interactions, this information needs to be filtered and processed in order to be useful.

The first step in this process is specifying a period of time for which we wish to retrieve search interactions, and harvesting all User IDs employed during this time. This is performed using the session_extractor.py script; full instructions for using this script can be found in the relevant directory on GitHub.

Privacy Concerns

Note that the User IDs employed are simply cookies set on a browser whenever the Collections homepage is accessed: they store no information whatsoever about the user, and no information about the user is retrievable using this cookie-based ID beyond the searches executed and items retrieved by that user. Furthermore, there is no way to connect these cookie-based IDs with IDs issued to the same individual if the relevant cookies are deleted.

A note on terminology

Because of the limitations noted under 'Privacy Concerns', the cookie-based IDs issued by the Europeana site might arguably be referred to as 'session' rather than 'user' identifiers. In fact, the scripts used to retrieve these IDs and their inline documentation typically refer to them as 'session IDs'. However, 'sessions' are more normally considered to refer to particular time-delimited periods of user activity, while these identifiers potentially span several such sessions. The term 'user' rather than 'session' identifiers is accordingly used here.

Retrieving Interactions

Once all the User IDs for a given span of time have been retrieved and written to the `intermediate_output/sessions/` directory, the actual interactions made by each user need to be retrieved from the logs. This work is done by the entry_extractor.py script. Full instructions for using this script can be found in the relevant directory on GitHub.

The results of this script (log entries organised chronologically by user) are written to the `intermediate_output/entries_by_session` directory.

Search and Multilinguality

Work on search and multiinguality has typically yielded two forms of output:

Evaluations and tools for evaluation, including multilingual query testbeds.
White papers and other reports regarding best practice in this area.

Evaluation

Query translation

Evaluation of a 250 query corpus in English, French and German performed within the Galateas project:
- Documentation of Creation of Gold Standard from Europeana Query Corpus
- Query corpus in English: file:English_corpus_Europeana.xml
- Query corpus in French: file:French_corpus_Europeana.xml
- Query corpus in German: file:German_corpus_Europeana.xml
- Evaluation of Query Translation in Europeana: file:Auswertung_evaluation.pdf
- D7.4 - Final Evaluation of Query Translation: file:GALATEAS_D7_4.pdf
Evaluation using the Portal (done using the same corpus as in the Galateas Project):

Multilingual Saturation

Extensive work on measuring the multilinguality of Europeana metadata has been undertaken through 2016 and 2017 by Peter Kiraly, Juliane Stiller, and Vivien Petras.
- A description and overview of the work can be found in the Multilingual Saturation of Metadata document, along with relevant links and instructions for the Metadata Quality Assurance Framework application.

Crowdsourced Multilingual Queries

See the attached tarball for a list of non-English queries submitted by users, along with their ratings of the resulting SERP and other comments.

The Entity Collection

A bar chart illustrating the multilingual coverage and contribution of the Entity Collection can be found on Slide Ten of the Entitifying Europeana presentation.

Reports

Péter Király (2015) "Query Translation in European", Code4Lib Journal, Issue 27, http://journal.code4lib.org/articles/10285
Juliane Stiller, Vivien Petras, Maria Gäde, Antoine Isaac (2014) "Automatic Enrichments with Controlled Vocabularies in Europeana: Challenges and Consequences", in: Digital Heritage. Progress in Cultural Heritage: Documentation, Preservation, and Protection. EuroMed 2014, pp. 238-247
J. Stiller, M. Gäde, and V. Petras (2010), Ambiguity of Queries and the Challenges for Query Language Detection, in CLEF 2010 Labs and Workshops Notebook Papers, ed. by M. Braschler, D. Harman and E. Pianta

Juliane Stiller and Vivien Petras (2016), White Paper on Best Practice for Multilingual Access

IIiX2014 - Multilingual Interface Preferences
- http://dl.acm.org/citation.cfm?id=2637002.2637030
- Private copy at Dropbox

Maria Gäde's PhD Thesis - Country and language level differences in multilingual digital libraries
- http://edoc.hu-berlin.de/docviews/abstract.php?id=40595
- http://edoc.hu-berlin.de/dissertationen/gaede-maria-2014-02-05/PDF/gaede.pdf

Which Log for Which Information? Gathering Multilingual Data from Different Log File Types
- http://link.springer.com/chapter/10.1007%2F978-3-642-15998-5_9
- http://www.europeanaconnect.eu/documents/Gaede_Petras_Stiller2010.pdf

Research and Development

Backup pages

Created by

Search and Related Evaluations at Europeana

Evaluation of Enrichments

Usability Evaluation

Evaluation of image similarity

Evaluation of search

Europeana Logfiles data and work that uses them

Recommendation

Missing

Solr configuration

BM25f Handler