Solr Configuration
We currently have four Solr engines deployed, each serving a different collection: Metadata, Newspapers, Entity Collection, and Annotations. The production configuration files for these collections can be found on GitHub.
...
- Code for the BM25F plug-in: https://github.com/europeana/contrib/tree/master/bm25f-ranking.
- Machine-learning code for learning the weights: https://github.com/europeana/contrib/tree/master/query-logs-analysis.
- Background information: Deliverable 2.2.3 - Metadata-Based Indexing and Ranking.
Tests
Two kinds of testing are necessary to ensure a Solr deployment is ready for use by API clients.
...
In the past, operational testing has been carried out using SolrMeter, while content testing has been done manually by running a list of queries. A spike document on the feasibility of creating an automated harness and reporting mechanism for such tests has been completed, but has not yet been turned into actionable tickets.
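To make the idea of automated content testing concrete, the following is a minimal sketch only, not the harness proposed in the spike. It assumes a hypothetical `search` callable that takes a query string and returns a ranked list of document IDs, and a hypothetical list of test cases pairing each query with the IDs expected in the top results. The cut-off `k=10` is also an assumption.

```python
def run_content_tests(search, cases, k=10):
    """Return a list of (query, missing_ids) for every failing test case.

    search: hypothetical callable, query string -> ranked list of document IDs.
    cases:  list of (query, expected_ids) pairs.
    k:      how many top results to inspect (assumed value).
    """
    failures = []
    for query, expected_ids in cases:
        top = search(query)[:k]
        # A case fails if any expected ID is absent from the top-k results.
        missing = [doc_id for doc_id in expected_ids if doc_id not in top]
        if missing:
            failures.append((query, missing))
    return failures


def fake_search(query):
    """Stand-in for a real Solr call so the sketch runs without a server."""
    index = {
        "mona lisa": ["/1/a", "/1/b", "/1/c"],
        "vermeer": ["/2/x"],
    }
    return index.get(query, [])


example_cases = [
    ("mona lisa", ["/1/a"]),   # passes: expected ID is in the results
    ("vermeer", ["/2/y"]),     # fails: expected ID is not returned
]
report = run_content_tests(fake_search, example_cases)
```

In practice `fake_search` would be replaced by a function that queries the Solr instance under test, and the returned list of failures could feed whatever reporting mechanism is eventually chosen.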
Similar items search
Solr includes a dedicated handler for similar-items search, called “MoreLikeThis” (it is also available as a search component). This component takes a document (the ID of an internal document, or a stream in the case of an external document) and launches a new query built from that document's highest-scoring terms, selected by a tf-idf similarity calculation. As a result, documents similar to a given document, for example one the user has clicked on, can be suggested. Additional parameters can be used to restrict which fields are used and to set the weight each field should take.
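As an illustration, a MoreLikeThis request can be assembled as below. This is a sketch under several assumptions: that the handler is registered at the path `/mlt` in `solrconfig.xml`, that the unique key field is named `id`, and that the base URL, field names, and boost values shown in the usage note are placeholders. The parameters `mlt.fl`, `mlt.mintf`, `mlt.mindf`, `mlt.qf`, and `mlt.boost` are standard Solr MoreLikeThis parameters.

```python
from urllib.parse import urlencode


def build_mlt_params(doc_id, fields, boosts=None, rows=5):
    """Build query parameters for a MoreLikeThis request on an indexed document."""
    # Escape backslashes and double quotes so the ID can sit inside a quoted term.
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'id:"{escaped}"',       # assumption: the unique key field is 'id'
        "mlt.fl": ",".join(fields),   # fields used to compute similarity
        "mlt.mintf": 1,               # minimum term frequency in the source document
        "mlt.mindf": 1,               # minimum number of documents containing the term
        "rows": rows,                 # number of similar documents to return
        "wt": "json",
    }
    if boosts:
        # Per-field weights, in the same "field^boost" format as DisMax qf.
        params["mlt.qf"] = " ".join(f"{f}^{w}" for f, w in boosts.items())
        params["mlt.boost"] = "true"
    return params


def build_mlt_url(base_url, doc_id, fields, boosts=None, rows=5):
    """Return the full request URL; assumes the handler lives at <base_url>/mlt."""
    return f"{base_url}/mlt?{urlencode(build_mlt_params(doc_id, fields, boosts, rows))}"
```

For example, `build_mlt_url("http://localhost:8983/solr/core", "/123/abc", ["title", "description"], {"title": 2.0})` would request documents similar to `/123/abc`, comparing on the two named fields and weighting `title` twice as heavily. Fields listed in `mlt.qf` must also appear in `mlt.fl`.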
...
- Implementing Recommendations in the PATHS System
- The recommendation of cultural heritage objects in Europeana Collections (work by Karl Pineau)
- Internal document: Multilinguality problems in our “More like this” search
Logging
The logging information is collected from two sources: our own logging system, and Google Analytics.
...
- List of 'canned searches' (as of March 2017) for log filtration if required:
- 2017 logging framework
- initial experiments and format: https://basecamp.com/1768384/projects/5774755/messages/67745647
- Log-processing scripts: https://github.com/europeana/search/tree/master/log_munge/log_extractor
- Logs 03/2017 - 09/2017: https://basecamp.com/1768384/projects/5774755/messages/72561454?enlarge=304796468#attachment_304796468
- 904Labs (1st half of 2015)
- Reports
- Bots distribution
- Tickets: #162, #414
- List of crawlers: file:crawlers.txt
- Information on Europeana logging
- Log actions: file:Europeana_Logs_Actions2010_2011.pdf
- Session clickstreams: file:Europeana_Session_ClickStream.pdf
Privacy Concerns
Note that the user IDs employed are simply cookies set on a browser whenever the Collections homepage is accessed: they store no information whatsoever about the user, and no information about the user is retrievable using this cookie-based ID beyond the searches executed and items retrieved by that user. Furthermore, if the relevant cookies are deleted, there is no way to connect the old cookie-based ID with any new ID issued to the same individual.
A note on terminology
The cookie-based IDs issued by the Europeana site might arguably be referred to as 'session' rather than 'user' identifiers. In fact, the scripts used to retrieve these IDs, and their inline documentation, typically refer to them as 'session IDs'. However, 'sessions' are more commonly understood to be particular time-delimited periods of user activity, while these identifiers potentially span several such sessions. The term 'user' rather than 'session' identifiers is accordingly used here.
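The distinction can be illustrated with a short sketch that splits the activity recorded under one cookie-based ID into time-delimited sessions. The 30-minute inactivity threshold is an assumption chosen for illustration, not a value taken from the Europeana log-processing scripts, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def split_into_sessions(timestamps, gap=timedelta(minutes=30)):
    """Group one user's event timestamps into sessions.

    A new session starts whenever the time since the previous event
    exceeds `gap` (assumed inactivity threshold).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: new session
    return sessions


# Events logged under a single cookie-based ID on two different days.
events = [
    datetime(2017, 3, 1, 10, 0),
    datetime(2017, 3, 1, 10, 10),
    datetime(2017, 3, 2, 15, 0),
]
sessions = split_into_sessions(events)
```

Here the single identifier yields two sessions: the two events on 1 March fall within one session, and the event on 2 March starts another.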