Living Labs for IR Evaluation (LL4IR) runs as a Lab at CLEF 2015.
Give us your ranking, we’ll have it clicked!
Most slides are available.
Make sure you read our updated guide for CLEF participants.
Topic and goals
Evaluation is a central aspect of information retrieval (IR) research. In the past few years, a new evaluation paradigm known as living labs has been proposed, where the idea is to perform experiments in situ, with real users doing real tasks using real-world applications. This type of evaluation, however, is currently available only to (large) industrial research labs. Our main goal is to provide a benchmarking platform for researchers to evaluate their ranking systems in a live setting with real users in their natural task environments. The lab acts as a proxy between commercial organizations (live environments) and lab participants (experimental systems), facilitates data exchange, and enables comparison between the participating systems.
The LL4IR lab contributes to the understanding of online evaluation, as well as of how well retrieval techniques generalize across different use-cases. Most importantly, it promotes IR evaluation that is more realistic, by giving researchers access to historical search and usage data and by enabling them to validate their ideas in live settings with real users. This initiative is a first of its kind for IR.
What is in it for you as a lab participant?
- Access to privileged commercial data (click-through data, etc.).
- Opportunity to test your IR systems on live production systems with real users.
Research Questions the lab is trying to answer:
- Do system rankings based on historical clicks differ from those based on online experiments?
- Do system rankings based on manual relevance assessments (“expert judgments”) differ from those based on online experiments?
These answers will give the research community concrete insight into the need, or lack thereof, for living labs as an additional tool for IR evaluation.
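The first research question can be made concrete as a rank-correlation problem: order the participating systems once by a click-based metric and once by an online metric, then measure how much the two orderings agree. A minimal sketch using Kendall's tau (the system names are hypothetical):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same set of systems.

    Each ranking is a list of system identifiers, best first.
    Returns a value in [-1, 1]; 1 means identical orderings,
    -1 means completely reversed orderings.
    """
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order it the same way.
        agree = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical system orderings produced by two evaluation methods.
by_historical_clicks = ["sysA", "sysB", "sysC", "sysD"]
by_online_experiment = ["sysA", "sysC", "sysB", "sysD"]
print(kendall_tau(by_historical_clicks, by_online_experiment))  # prints 0.6666666666666666
```

A tau close to 1 would suggest the cheaper offline evaluation is a good proxy for online experiments; a low tau would argue for the living-labs approach.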
The first edition of the lab focuses on three use-cases and one specific notion of what a living lab is, with a view to expanding to other use-cases and other interpretations of living labs in subsequent years. Use-cases for the first lab are:
- product search (on the REGIO JÁTÉK e-commerce site)
- local domain search (on the University of Amsterdam’s website)
- web search (through Seznam, a major commercial web search engine)
All three are ad-hoc search tasks and are closely related in terms of their general setup. Using a shared API but considering three very different use-cases allows us to study how well techniques generalize across domains.
Further details about each of the use-cases will follow soon.
Evaluation is split into training and test phases.
The training phase offers two ways for participants to train their IR models: (1) static TREC-style collections and (2) living labs evaluation environments for each of the use-cases. Details follow.
(1) Static TREC-style collections: These collections consist of a set of 50 frequent queries, a document collection, and relevance judgments. Specifically, the document collection contains the top X documents returned by the commercial providers for each of the queries. Two sets of relevance judgments are made available: (i) judgments generated by a manual relevance assessment process (“expert judgments”) and (ii) judgments derived from historical click information. For the product search and local domain search use-cases, queries and documents are provided in a raw format. For the web search use-case, pre-computed query and document features are provided instead; that is, for each query-document pair, a sparse feature vector is given.
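The exact file format of the web search feature vectors is not specified above; a common representation for sparse learning-to-rank features is the SVMlight-style `label qid:Q index:value ...` line. A small parser for that assumed format:

```python
def parse_sparse_vector(line):
    """Parse one SVMlight-style line into (label, query id, features).

    Assumed format (not confirmed by the lab's data description):
        'label qid:Q idx:val idx:val ...'
    Features absent from the line are implicitly zero.
    """
    parts = line.split()
    label = int(parts[0])                      # relevance label
    qid = parts[1].split(":", 1)[1]            # query identifier
    features = {}
    for token in parts[2:]:
        idx, val = token.split(":", 1)
        features[int(idx)] = float(val)        # sparse index -> value
    return label, qid, features

label, qid, feats = parse_sparse_vector("1 qid:42 3:0.5 17:1.0 256:0.01")
print(label, qid, feats[17])  # prints: 1 42 1.0
```

Storing only the non-zero entries in a dict mirrors the sparsity of the released vectors; dense arrays would mostly hold zeros.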
(2) Living labs evaluation environments: For each of the use-cases, challenge participants can also take part in a live living labs evaluation process. For this they use the same set of 50 frequent queries (as training queries), along with the candidate results for these queries, historical information associated with the queries, and some general collection statistics. Participants produce a ranking for each query and upload it to the commercial provider through an API. The commercial provider then interleaves the participant’s ranked list with its own ranking and presents the user with the interleaved result list. Participants take turns (orchestrated by the API, so that each gets about the same number of impressions), and a single experimental system is interleaved with the production system at a time. The actions performed by the user are then made available, through the API, to the challenge participant whose ranking was shown: clicks and interleaving outcomes. If they wish, participants are free to update their rankings using this feedback.
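The interleaving step above can be sketched with team-draft interleaving, a common choice for such comparisons (the lab does not state here which interleaving algorithm the providers actually use, so this is an illustration rather than the lab's method):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random.random):
    """Team-draft interleaving of two ranked lists.

    Whichever team has contributed fewer documents so far picks next
    (ties broken by a coin flip); a team always adds its highest-ranked
    document not yet shown. Returns the interleaved list plus the two
    team sets, which are used for click credit assignment afterwards.
    """
    interleaved = []
    team_a, team_b = set(), set()

    def remaining(ranking):
        return [d for d in ranking if d not in interleaved]

    while remaining(ranking_a) or remaining(ranking_b):
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng() < 0.5)
        if a_turn and remaining(ranking_a):
            doc = remaining(ranking_a)[0]
            team_a.add(doc)
        elif remaining(ranking_b):
            doc = remaining(ranking_b)[0]
            team_b.add(doc)
        else:  # only list A has documents left
            doc = remaining(ranking_a)[0]
            team_a.add(doc)
        interleaved.append(doc)
    return interleaved, team_a, team_b

# Deterministic coin (always "A first") for a reproducible illustration;
# document ids are hypothetical.
experimental = ["d1", "d2", "d3"]
production = ["d2", "d4", "d1"]
mixed, team_a, team_b = team_draft_interleave(
    experimental, production, rng=lambda: 0.0)
print(mixed)  # prints ['d1', 'd2', 'd3', 'd4']
```

After an impression, clicks on `team_a` documents are credited to the experimental system and clicks on `team_b` documents to the production system; whichever side collects more clicks wins that impression.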
In the test phase, challenge participants receive another set of 50 frequent queries (as test queries) and associated historical click information as well as candidate results for these queries. After downloading the test queries, participants have 24 hours to produce and upload their rankings. These rankings are then interleaved with the commercial providers’ rankings for 6 weeks. Again, each challenge participant is given an equal number of impressions.
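Uploading a run through the API might look like the following sketch. The base URL, endpoint layout, participant key, and JSON field names are assumptions for illustration; the lab's API documentation is authoritative:

```python
import json
import urllib.request

API_BASE = "http://example.org/api"  # illustrative; the real base URL
KEY = "PARTICIPANT-KEY"              # and key come from the lab organizers

def build_run_payload(runid, doc_ids):
    """JSON body for uploading one ranking: a run identifier and an
    ordered document list (field names are assumed, not confirmed)."""
    return json.dumps({"runid": runid,
                       "doclist": [{"docid": d} for d in doc_ids]})

def upload_run(qid, runid, doc_ids):
    """PUT the ranking for one test query (hypothetical endpoint)."""
    req = urllib.request.Request(
        "%s/run/%s/%s" % (API_BASE, KEY, qid),
        data=build_run_payload(runid, doc_ids).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT")
    return urllib.request.urlopen(req)
```

Because of the 24-hour window, uploads for all 50 test queries would typically be scripted as a single loop over this function rather than done by hand.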
Overall evaluation of challenge participants will be based on final system performance, and additionally on how the systems performed at each query issue. Results based on manual (“expert”) judgments will be made available for comparison. The metrics used are conventional absolute click metrics (e.g., click@1 and position of last click) and interleaving metrics (e.g., number of wins); where available, we will also compute conventional offline IR metrics (e.g., NDCG, MAP, and ERR). During training, participants will only be able to see metrics for their own systems, and only click-based metrics. During and after the test phase, all metrics for all systems will be available.
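The click metrics above are straightforward to compute from per-impression feedback. A minimal sketch, assuming a hypothetical log format of (shown ranking, clicked document ids) pairs rather than the API's actual payload:

```python
def click_at_1(sessions):
    """Fraction of impressions whose top-ranked result was clicked.

    `sessions` is a list of (shown_ranking, clicked_doc_ids) pairs --
    a hypothetical log format used only for this illustration.
    """
    hits = sum(1 for ranking, clicks in sessions
               if ranking and ranking[0] in clicks)
    return hits / len(sessions)

def interleaving_wins(outcomes):
    """Tally wins/ties/losses from per-impression interleaving outcomes,
    where +1 means the experimental system won, 0 a tie, -1 a loss."""
    wins = sum(1 for o in outcomes if o > 0)
    ties = sum(1 for o in outcomes if o == 0)
    losses = sum(1 for o in outcomes if o < 0)
    return wins, ties, losses

log = [(["d1", "d2"], {"d1"}),   # top result clicked
       (["d3", "d4"], {"d4"}),   # lower result clicked
       (["d5"], set())]          # no click
print(click_at_1(log))                   # prints 0.3333333333333333
print(interleaving_wins([1, 1, 0, -1]))  # prints (2, 1, 1)
```

Interleaving counts such as number of wins are relative measures (experimental versus production system), whereas click@1 is an absolute measure of a single shown list; the lab reports both kinds.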
Our evaluation methodology, including reasons for focusing on frequent queries, is detailed in a CIKM’14 short paper: Head First: Living Labs for Ad-hoc Search Evaluation.
| Date | Event |
|------|-------|
| 1 Nov, 2014 | Training period begins |
| 1-20 Apr, 2015 | Uploading test runs |
| 20 Apr, 2015 | Testing period begins |
| 15 May, 2015 | Testing period ends |
| 17 May, 2015 | Results released |
| 7 Jun, 2015 | Participants' working notes paper submission deadline (CEUR-WS) |
| 30 Jun, 2015 | Notification of acceptance for participants' working notes papers (CEUR-WS) |
| 15 Jul, 2015 | Camera-ready working notes paper submission deadline (CEUR-WS) |
| 8-11 Sep, 2015 | Full-day lab session at CLEF 2015, in Toulouse, France |
Anne Schuth, University of Amsterdam, The Netherlands (anne.schuth (at) uva.nl)
Krisztian Balog, University of Stavanger, Norway (krisztian.balog (at) uis.no)
Liadh Kelly, Trinity College Dublin, Ireland (liadh.kelly (at) tcd.ie)
- Leif Azzopardi, University of Glasgow, Scotland
- Torben Brodt, Plista
- Henry Feild, Endicott College, USA
- Nicola Ferro, University of Padova, Italy
- Katja Hofmann, Microsoft Research, England
- Frank Hopfgartner, Technische Universität Berlin, Germany
- Gareth Jones, Dublin City University, Ireland
- Henning Müller, HES-SO, Switzerland
- Maarten de Rijke, University of Amsterdam, The Netherlands
- Ian Soboroff, NIST, USA
- Paul Thomas, CSIRO, Australia