Full-text search

By the end of 2018 we incorporated a new collection, the Newspapers Collection, which can be searched not only by metadata but also by content. Europeana digitized almost a million full-text newspaper issues in several languages, dating from the 17th to the 20th century, which are currently available from our portal. This collection can now be searched either by metadata or by full text; the user selects the option before running the search.

Currently this new collection is indexed in a dedicated Solr cluster. The ad-hoc code for indexing the metadata (taken directly from the metadata index) and the full text (taken from XML files generated by the APIs team) is available on GitHub, and the instructions to run this code are available here. Previous work along these lines by Neil and Jie (Sheffield University) is also available on GitHub.
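As a rough illustration of the indexing step described above, the following Python sketch merges a metadata record with the text extracted from an XML file into a single document that could be sent to Solr. It is not the code published on GitHub: the XML layout, the helper names (extract_fulltext, build_solr_doc) and the per-language field name "fulltext.<language>" are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def extract_fulltext(xml_string: str) -> str:
    """Concatenate every text node of the XML and collapse whitespace.

    The real XML format produced by the APIs team is not shown here;
    this simply takes all text content, whatever the element names are.
    """
    root = ET.fromstring(xml_string)
    return " ".join(" ".join(root.itertext()).split())


def build_solr_doc(metadata: dict, xml_string: str, language: str) -> dict:
    """Return a copy of the metadata record with a full-text field added.

    The field name "fulltext.<language>" is hypothetical; it stands in for
    whatever language-specific field the Solr schema actually defines.
    """
    doc = dict(metadata)
    doc[f"fulltext.{language}"] = extract_fulltext(xml_string)
    return doc


if __name__ == "__main__":
    sample_xml = (
        "<issue><page><line>Hello world</line>"
        "<line>second line</line></page></issue>"
    )
    record = {"id": "issue-1", "title": "Gazette"}
    print(build_solr_doc(record, sample_xml, "en"))
```

Keeping one full-text field per language is one way to let Solr apply a different analysis chain to each language, which matches the language-specific normalization mentioned for this collection.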

The Solr configuration and schema (available here) differ from those of the metadata collection in order to support the new full-text fields and the language-specific normalization of those fields. Other changes improve search over full-text contents and support combined search over full text and metadata at the same time, although the combined search is not used in production. This collection also offers highlighting, but the portal does not use this feature directly; instead, highlighting is currently done by string matching between the query and the raw text and image contents in the database. This approach has the downside of not normalizing the query and the contents, so its results may differ from those returned by Solr. For example, if we query for the verb ‘go’ and normalization is applied, Solr may retrieve documents containing that verb in different forms, which cannot be spotted in the contents just by looking for the string ‘go’.
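The mismatch between raw string matching and normalized search can be shown with a small Python sketch. The lemma table below is a toy stand-in for the language-specific normalization done by Solr, and the function names are hypothetical; it only illustrates why a document can be retrieved by the index and still contain nothing to highlight by plain string matching.

```python
import re

# Toy lemma table standing in for Solr's language-specific normalization.
# This is an assumption for the example, not the real analysis chain.
LEMMAS = {"goes": "go", "going": "go", "went": "go", "gone": "go"}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(token: str) -> str:
    """Map a token to its base form when the toy table knows it."""
    return LEMMAS.get(token, token)


def raw_match(query: str, text: str) -> bool:
    """Plain substring matching on the raw text, with no normalization."""
    return query.lower() in text.lower()


def normalized_match(query: str, text: str) -> bool:
    """Match after normalizing both the query and the text tokens."""
    text_terms = {normalize(t) for t in tokenize(text)}
    return all(normalize(q) in text_terms for q in tokenize(query))


if __name__ == "__main__":
    content = "They went to the market."
    print(raw_match("go", content))         # no literal 'go' in the text
    print(normalized_match("go", content))  # 'went' normalizes to 'go'
```

In this sketch the document would be returned for the query ‘go’ by the normalized search, but a string-matching highlighter would find no occurrence to mark in the raw text.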

Here are the internal documents related to this new project: