Solr Configuration

We currently have four Solr engines deployed, each serving a different collection: Metadata, Newspapers, Entity Collection, and Annotations. The production configuration files for these collections can be found on GitHub.

...

Tests

Two kinds of testing are necessary to ensure a Solr deployment is ready for use by API clients.

...

Operational testing has previously been carried out using SolrMeter, while content testing has been done manually by running a list of queries. A spike investigating the feasibility of an automated harness and reporting mechanism for such tests has been completed, but its findings have not yet been turned into actionable tickets.

Similar items search

Solr includes a dedicated handler for similar items search called "MoreLikeThis" (it is also available as a search component). It takes a document (the id of an internal document, or a content stream in the case of an external document) and launches a new query built from that document's highest-scoring terms, as ranked by a tf-idf similarity calculation. As a result, documents similar to a given document, for example one the user has clicked on, can be suggested. Additional parameters can be used to restrict which fields are considered and to set the weight each field should carry.
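As an illustration of the parameters mentioned above, the following is a minimal sketch of how a MoreLikeThis request URL might be assembled. The `mlt.fl`, `mlt.qf`, `mlt.boost`, `mlt.mintf` and `mlt.mindf` parameters are standard Solr MoreLikeThis parameters; everything else is an assumption made for the example: the `/mlt` handler path (it has to be registered in solrconfig.xml), the base URL, the unique key field name `id`, and the field names `title` and `description`. It does not reflect the actual production configuration.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, fields, field_boosts=None, rows=5, id_field="id"):
    """Build a request URL for a Solr MoreLikeThis handler.

    base_url     -- collection URL, e.g. http://localhost:8983/solr/metadata (hypothetical)
    doc_id       -- id of the indexed document to find similar items for
    fields       -- fields whose terms are used for the similarity calculation
    field_boosts -- optional {field: weight} mapping, sent as mlt.qf
    id_field     -- name of the unique key field (assumed to be "id" here)
    """
    params = {
        # Select the source document; the id is quoted as a phrase so that
        # characters such as "/" are not interpreted as query syntax.
        "q": f'{id_field}:"{doc_id}"',
        # Restrict the fields from which "interesting" terms are taken.
        "mlt.fl": ",".join(fields),
        # Minimum term frequency / document frequency for a term to count.
        "mlt.mintf": 1,
        "mlt.mindf": 1,
        "rows": rows,
        "wt": "json",
    }
    if field_boosts:
        # Weight the selected fields and boost by term relevance.
        params["mlt.boost"] = "true"
        params["mlt.qf"] = " ".join(f"{f}^{w}" for f, w in field_boosts.items())
    # Assumes a MoreLikeThis handler is registered at /mlt.
    return f"{base_url}/mlt?{urlencode(params)}"


url = build_mlt_url(
    "http://localhost:8983/solr/metadata",
    "/123/abc",
    ["title", "description"],
    field_boosts={"title": 2.0},
)
print(url)
```

The resulting URL can be fetched with any HTTP client; the similar documents are returned in the response in place of the usual search results.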

...

Logging

The logging information is collected from two sources: our own logging system, and Google Analytics. 

...

Privacy Concerns

Note that the User IDs employed are simply cookies set on a browser whenever the Collections homepage is accessed: they store no information whatsoever about the user, and no information about the user is retrievable using this cookie-based ID beyond the searches executed and items retrieved by that user. Furthermore, once the relevant cookies are deleted, there is no way to connect a newly issued ID with any ID previously issued to the same individual.

A note on terminology

The cookie-based IDs issued by the Europeana site might arguably be referred to as 'session' rather than 'user' identifiers. In fact, the scripts used to retrieve these IDs and their inline documentation typically refer to them as 'session IDs'. However, 'sessions' are more normally considered to refer to particular time-delimited periods of user activity, while these identifiers potentially span several such sessions. The term 'user' rather than 'session' identifiers is accordingly used here.
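To make the distinction concrete, the sketch below splits the activity recorded under a single cookie-based ID into time-delimited sessions. The 30-minute inactivity threshold is a widely used convention and is an assumption here, not something defined by the logging scripts; the function name and the sample timestamps are likewise hypothetical.

```python
from datetime import datetime, timedelta


def split_into_sessions(timestamps, gap=timedelta(minutes=30)):
    """Group one user's event timestamps into sessions.

    A new session starts whenever the time since the previous event
    exceeds `gap` (assumed inactivity threshold).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            # Still within the inactivity window: same session.
            sessions[-1].append(ts)
        else:
            # First event, or too long since the last one: new session.
            sessions.append([ts])
    return sessions


# Hypothetical events logged under one cookie-based user ID.
events = [
    datetime(2020, 1, 1, 9, 0),
    datetime(2020, 1, 1, 9, 10),
    datetime(2020, 1, 2, 14, 0),
]
sessions = split_into_sessions(events)
print(len(sessions))
```

Here a single user identifier covers two sessions on different days, which is why 'user' is the more accurate term for the identifier itself.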