Solr Configuration

We currently have four Solr engines deployed, each serving a different collection: Metadata, Newspapers, Entity Collection, and Annotations. The production configuration files for these collections can be found on GitHub.

...

Code for the BM25F plug-in: https://github.com/europeana/contrib/tree/master/bm25f-ranking.
Machine-learning code for learning the weights: https://github.com/europeana/contrib/tree/master/query-logs-analysis.
Background information: Deliverable 2.2.3 - Metadata-Based Indexing and Ranking.

Tests

Two kinds of testing are necessary to ensure a Solr deployment is ready for use by API clients.

...

In the past, operational testing has been carried out using SolrMeter, while content testing has been done manually by running a list of queries. A spike document on the feasibility of an automated harness and reporting mechanism for such tests has been completed, but it has not yet been broken down into actionable tickets.
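As a rough illustration of what an automated content test could look like, the sketch below checks that each query returns a set of expected record ids within its top results. It is not taken from the spike document: the `ContentTest` structure, the `run_content_tests` function, the "expected ids in the top k" criterion, and the injected `search` callable are all assumptions made for the example.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class ContentTest(NamedTuple):
    """One hypothetical content test: a query and ids expected in the top results."""
    query: str
    expected_ids: frozenset
    top_k: int = 10


def run_content_tests(
    search: Callable[[str, int], Iterable[str]],
    tests: Iterable[ContentTest],
) -> List[Tuple[str, List[str]]]:
    """Run every test and return (query, missing ids) for each failing query.

    `search` is any callable that takes (query, rows) and returns record ids
    in ranked order; in practice it would wrap a call to the Solr instance
    under test (assumption: that wrapper is supplied by the caller).
    """
    failures = []
    for test in tests:
        returned = list(search(test.query, test.top_k))[: test.top_k]
        missing = sorted(test.expected_ids - set(returned))
        if missing:
            failures.append((test.query, missing))
    return failures


if __name__ == "__main__":
    # Stand-in for a real Solr call, used only to demonstrate the harness.
    fake_index = {"paris": ["/1/a", "/1/b"], "mozart": ["/2/x"]}

    def fake_search(query, rows):
        return fake_index.get(query, [])[:rows]

    tests = [
        ContentTest("paris", frozenset({"/1/a"})),
        ContentTest("mozart", frozenset({"/2/x", "/2/y"})),
    ]
    for query, missing in run_content_tests(fake_search, tests):
        print(f"FAIL {query!r}: missing {missing}")
```

Injecting the search function keeps the harness independent of any particular Solr client, so the same list of queries can be run against different deployments.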

Similar items search

Solr includes a dedicated handler for similar-items search called “MoreLikeThis” (it is also available as a search component). It takes a document (the id of an indexed document, or a content stream in the case of an external document) and launches a new query built from the document's highest-scoring terms, selected by a tf-idf similarity calculation. As a result, documents similar to a given document, for example one the user has clicked, can be suggested. Additional parameters restrict which fields are used and set the weight each should carry.
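The request described above can be sketched as a URL builder. This is a minimal example under stated assumptions, not our production configuration: it assumes the MoreLikeThis handler is registered at `/mlt` in `solrconfig.xml`, and the base URL, the `id` field name, and the field names in the usage example are hypothetical. The `mlt.fl`, `mlt.qf`, `mlt.boost`, `mlt.mintf`, and `mlt.mindf` parameters are standard Solr MoreLikeThis parameters.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, fields, field_boosts=None, rows=10, id_field="id"):
    """Build a request URL for a Solr MoreLikeThis handler.

    base_url     -- collection URL, e.g. http://host:8983/solr/<collection> (hypothetical)
    doc_id       -- id of the indexed document to find similar items for
    fields       -- fields from which "interesting" terms are drawn (mlt.fl)
    field_boosts -- optional {field: weight} mapping, sent as mlt.qf
    """
    escaped_id = str(doc_id).replace('"', '\\"')
    params = {
        "q": f'{id_field}:"{escaped_id}"',   # selects the source document
        "mlt.fl": ",".join(fields),          # restrict the fields used for similarity
        "mlt.mintf": 1,                      # minimum term frequency in the source document
        "mlt.mindf": 1,                      # minimum document frequency across the index
        "rows": rows,                        # number of similar documents to return
        "wt": "json",
    }
    if field_boosts:
        # mlt.qf uses DisMax-style "field^weight" pairs; the fields must also appear in mlt.fl.
        params["mlt.boost"] = "true"
        params["mlt.qf"] = " ".join(f"{f}^{w}" for f, w in field_boosts.items())
    return f"{base_url.rstrip('/')}/mlt?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical collection URL, record id, and field names.
    print(build_mlt_url(
        "http://localhost:8983/solr/example",
        "/123/abc",
        ["title", "description"],
        field_boosts={"title": 2.0},
    ))
```

The function only constructs the URL; sending it and parsing the response is left to whichever HTTP client is already in use.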

...

Implementing Recommendations in the PATHS System
The recommendation of cultural heritage objects in Europeana Collections (work by Karl Pineau)
Internal document: Multilinguality problems in our “More like this” search

Logging

The logging information is collected from two sources: our own logging system, and Google Analytics.

...

List of 'canned searches' (as of March 2017) for log filtration if required:

https://basecamp.com/1768384/projects/5774755/messages/67105215?enlarge=275066617#attachment_275066617

2017 logging framework

Initial experiments and format: https://basecamp.com/1768384/projects/5774755/messages/67745647
Log-processing scripts: https://github.com/europeana/search/tree/master/log_munge/log_extractor
Logs 03/2017 - 09/2017: https://basecamp.com/1768384/projects/5774755/messages/72561454?enlarge=304796468#attachment_304796468

904Labs (1st half of 2015)

Reports

Online API spec: http://doc.904labs.com/display/API/Logging+API
Assessment of the client library

Bots distribution
Tickets: #162, #414

List of crawlers: file:crawlers.txt
Information on Europeana logging

Log actions: file:Europeana_Logs_Actions2010_2011.pdf
Session clickstreams: file:Europeana_Session_ClickStream.pdf

Privacy Concerns

Note that the user IDs employed are simply cookies set on a browser whenever the Collections homepage is accessed. They store no information whatsoever about the user, and nothing about the user can be retrieved from this cookie-based ID beyond the searches executed and items retrieved by that user. Furthermore, if the relevant cookies are deleted, there is no way to connect the old cookie-based ID with any new ID issued to the same individual.

A note on terminology

The cookie-based IDs issued by the Europeana site might arguably be referred to as 'session' rather than 'user' identifiers. In fact, the scripts used to retrieve these IDs, and their inline documentation, typically refer to them as 'session IDs'. However, a 'session' is usually understood as a particular time-delimited period of user activity, while these identifiers can span several such periods. The term 'user' identifier, rather than 'session' identifier, is accordingly used here.
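To make the distinction concrete, the sketch below splits the activity recorded under a single cookie-based user ID into time-delimited sessions. It is an illustration only: the 30-minute inactivity threshold is a common convention assumed here, not a value taken from our logging scripts, and `split_into_sessions` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Assumed inactivity threshold; not taken from the Europeana log-processing scripts.
SESSION_GAP = timedelta(minutes=30)


def split_into_sessions(timestamps, gap=SESSION_GAP):
    """Group one user's event timestamps into sessions.

    A new session starts whenever the time since the previous event
    exceeds `gap`. Returns a list of sessions, each a sorted list of
    timestamps.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: start a new session
    return sessions


if __name__ == "__main__":
    start = datetime(2017, 3, 1, 10, 0)
    # Four events logged under the same user ID on one morning.
    events = [
        start,
        start + timedelta(minutes=10),
        start + timedelta(minutes=90),
        start + timedelta(minutes=95),
    ]
    for number, session in enumerate(split_into_sessions(events), start=1):
        print(number, [ts.strftime("%H:%M") for ts in session])
```

Under this assumption, a single user ID with two bursts of activity separated by more than half an hour yields two sessions, which is why the identifier itself is better described as a user ID.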
