Infrastructure

Solr Configuration

We currently have four Solr engines deployed, each serving a different collection: Metadata, Newspapers, Entity Collection, and Annotations. The configuration files for these collections in production can be found on GitHub.

The Metadata and Newspapers collections require the Thematic Collections Aliasing plug-in, which allows short aliases (e.g. collection:art) to stand in for the lengthy queries used to create the thematic collections. It was originally developed as an extension of Solr and is currently used as a plug-in. The code for both approaches can be found on GitHub, and the procedure for updating the aliases is available as an internal document.
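
As an illustration only (the actual expansions are defined in the alias configuration, so the query below is entirely hypothetical), a request filtered with an alias such as:

    fq=collection:art

would be rewritten by the plug-in into the full thematic-collection query it stands for, along the lines of:

    fq=(what:"painting" OR what:"sculpture" OR PROVIDER:"Rijksmuseum")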

In previous deployments, the Metadata collection used BM25F, a ranking plug-in developed by Diego Ceccarelli. Currently the default Solr 6 ranking algorithm, BM25, is used.
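
For reference, BM25 scores a document D against a query Q with terms q_1..q_n as follows (standard formulation; Lucene/Solr's defaults are k_1 = 1.2 and b = 0.75):

    \mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i,D) is the frequency of q_i in D, |D| is the document length, and avgdl is the average document length in the collection. BM25F differs in that it computes a weighted, per-field term frequency before applying the saturation function, which is what enables per-field boosts.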

More information about the BM25F plug-in can be found at the following links:

Tests

Two kinds of testing are necessary to ensure a Solr deployment is ready for use by API clients.

  • Operational testing ensures that Solr, as a service, is behaving as expected: that it can be queried, and is reasonably responsive to such queries.
  • Content testing ensures that the data is as expected and is being served as expected - for example, that fields defined as tokenised are in fact being tokenised.

Operational testing has been carried out in the past using SolrMeter, while content testing has been done manually, by running a list of queries. A spike document on the feasibility of creating an automated harness and reporting mechanism for such tests has been completed, but has not yet been turned into actionable tickets.
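
A minimal sketch of what such an automated harness could look like, assuming a Solr core reachable at http://localhost:8983/solr/metadata (the URL, field name, and queries are illustrative, not our actual configuration):

    import requests

    SOLR = "http://localhost:8983/solr/metadata"  # hypothetical core URL

    def test_content_tokenisation():
        """Content test: a field defined as tokenised should match a single token."""
        params = {"q": "title:amsterdam", "rows": 0, "wt": "json"}
        resp = requests.get(f"{SOLR}/select", params=params, timeout=10)
        resp.raise_for_status()
        assert resp.json()["response"]["numFound"] > 0

    def test_operational_responsiveness():
        """Operational test: the core answers a trivial query within two seconds."""
        params = {"q": "*:*", "rows": 0, "wt": "json"}
        resp = requests.get(f"{SOLR}/select", params=params, timeout=2)
        resp.raise_for_status()

Run under pytest or a similar test runner, the same list of queries used for manual content testing could be replayed this way on every deployment.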

Similar items search

Solr includes a dedicated handler, “MoreLikeThis”, for launching similar-items searches (it is also available as a search component). It takes a document (the id of an indexed document, or a content stream in the case of an external document) and launches a new query built from the terms with the highest matching scores, based on a tf-idf similarity calculation. As a result, documents similar to a given document, for example one the user clicks on, can be suggested. Additional parameters can be used to restrict the fields that should be used and the weight each should carry.
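
For illustration, a request to the MoreLikeThis handler might look as follows (the core name, document id, and field list are assumptions, not our production values):

    import requests

    params = {
        # the seed document whose "interesting terms" drive the new query
        "q": 'europeana_id:"/92062/BibliographicResource_1000126163678"',
        "mlt.fl": "title,what",  # fields to mine for terms
        "mlt.mintf": 1,          # minimum term frequency in the seed document
        "mlt.mindf": 2,          # minimum document frequency in the index
        "mlt.boost": "true",     # boost query terms by their tf-idf score
        "rows": 10,
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/metadata/mlt", params=params)
    similar_docs = resp.json()["response"]["docs"]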

This functionality was implemented ad hoc in the portal, rather than via the Solr handler, because of efficiency issues. Internal tickets related to this issue can be found here:

Currently, a new query is launched using the contents of the following fields, with the weights shown:

  • Title: 0.3
  • Who: 0.2
  • What: 0.8
  • Data Provider: 0.2

The Europeana id of the seed item is used to filter the item itself out of the results.
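
A minimal sketch of how the portal-side query could be assembled, assuming the seed item's metadata is already available as a dict (the field and parameter names are hypothetical):

    def build_similar_items_query(doc):
        """Build a boosted OR-query from the seed document's fields."""
        weights = {"title": 0.3, "who": 0.2, "what": 0.8, "provider": 0.2}
        clauses = [
            f'{field}:("{doc[field]}")^{boost}'
            for field, boost in weights.items()
            if doc.get(field)  # skip fields missing from the seed document
        ]
        seed_id = doc["europeana_id"]
        return {
            "q": " OR ".join(clauses),
            # exclude the seed item itself from the results
            "fq": f'-europeana_id:"{seed_id}"',
        }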

Additional resources about recommending similar items to users:

Logging

Logging information is collected from two sources: our own logging system and Google Analytics.

Our query logging is performed using an ELK (ElasticSearch-Logstash-Kibana) stack. The ELK stack is essentially a log aggregator: it collects logs from other systems, rewrites them in a standard format, and stores them in ElasticSearch for querying and retrieval. We log the following search-related information:

  • search and filter terms used
  • number of results returned
  • the rank of the items clicked

This information is stored in the field ‘message’ in ElasticSearch as follows:

    Search interaction: * Search parameters: {"q"=>"Ryckaert"} * Total hits: 121

Or as follows if it results from clicking an item on the search results page:

    Search interaction: * Record: /92062/BibliographicResource_1000126163678 * Search parameters: {"q"=>"Ryckaert"} * Total hits: 121 * Result rank: 3
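
Since all of this is packed into a single string, any analysis first has to parse it back out. A sketch of such a parser, assuming the two message layouts shown above are the only ones in use:

    import re

    PATTERN = re.compile(
        r"Search interaction:"
        r"(?: \* Record: (?P<record>\S+))?"
        r" \* Search parameters: (?P<params>\{.*?\})"
        r" \* Total hits: (?P<hits>\d+)"
        r"(?: \* Result rank: (?P<rank>\d+))?"
    )

    def parse_message(message):
        """Extract record, search parameters, hit count, and rank from a log line."""
        match = PATTERN.search(message.replace("\n", " "))
        if match is None:
            return None
        return {
            "record": match.group("record"),
            "params": match.group("params"),
            "hits": int(match.group("hits")),
            "rank": int(match.group("rank")) if match.group("rank") else None,
        }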

Currently, search performance evaluation is done using our query logging system, harvesting the logs for a specific period of time by means of this code on GitHub.
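
For illustration, harvesting these log lines for a given period directly from ElasticSearch might look like this with the elasticsearch Python client (the host, index pattern, and date range are assumptions):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # hypothetical host

    query = {
        "bool": {
            "must": [{"match_phrase": {"message": "Search interaction"}}],
            "filter": [
                {"range": {"@timestamp": {"gte": "2017-01-01", "lt": "2017-02-01"}}}
            ],
        }
    }
    result = es.search(index="logstash-*", body={"query": query, "size": 1000})
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["message"])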

In Google Analytics, we have set up Site Search in order to gain more insight into user behaviour during search (most common queries, pages viewed, exits without interaction, etc.), and several dashboards and reports (www.europeana.eu account :: Filtered view (to use) :: Customisation) have been defined to make this information easy to check (in general and by thematic collection). Here you can find the instructions to access that information.

Here are additional internal resources related to the creation of our logging system (made by 904Labs):

Privacy Concerns

Note that the user IDs employed are simply cookies set on a browser whenever the Collections homepage is accessed: they store no information whatsoever about the user, and no information about the user is retrievable via this cookie-based ID beyond the searches executed and the items retrieved by that user. Furthermore, there is no way to connect these cookie-based IDs with new IDs issued to the same individual after the relevant cookies are deleted.

A note on terminology

The cookie-based IDs issued by the Europeana site might arguably be referred to as 'session' rather than 'user' identifiers. In fact, the scripts used to retrieve these IDs and their inline documentation typically refer to them as 'session IDs'. However, 'sessions' are more normally considered to refer to particular time-delimited periods of user activity, while these identifiers potentially span several such sessions. The term 'user' rather than 'session' identifiers is accordingly used here.