CLEF LL4IR 2015

Living Labs for IR Evaluation (LL4IR) runs as a Lab at CLEF 2015.

Give us your ranking, we’ll have it clicked!

Slides

Most slides are available.

Guide

Make sure you read our updated guide for CLEF participants.

Topic and goals

Evaluation is a central aspect of information retrieval (IR) research. In the past few years, a new evaluation paradigm known as living labs has been proposed, where the idea is to perform experiments in situ, with real users doing real tasks using real-world applications. This type of evaluation, however, is currently available only to (large) industrial research labs. Our main goal is to provide a benchmarking platform for researchers to evaluate their ranking systems in a live setting with real users in their natural task environments. The lab acts as a proxy between commercial organizations (live environments) and lab participants (experimental systems), facilitates data exchange, and enables comparison between the participating systems.

The LL4IR lab contributes to our understanding of online evaluation and of how well retrieval techniques generalize across different use-cases. Most importantly, it promotes IR evaluation that is more realistic, by giving researchers access to historical search and usage data and by enabling them to validate their ideas in live settings with real users. This initiative is a first of its kind for IR.

What is in it for you as a lab participant?

  • Access to privileged commercial data (click-through data, etc.).
  • Opportunity to test your IR systems on real live systems with real live users.

Research Questions the lab is trying to answer:

  • Do system rankings based on historical clicks differ from those based on online experiments?
  • Do system rankings based on manual relevance assessments (“expert judgments”) differ from those based on online experiments?

The answers will give the research community concrete insight into the need, or lack thereof, for living labs as an additional tool for IR evaluation.
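
Both questions come down to comparing how the same set of experimental systems is ordered under two evaluation methods. Below is a minimal sketch of such an agreement analysis in Python, using Kendall's tau as the rank correlation measure; the system names and scores are placeholders, not lab results.

    # Minimal sketch: rank correlation between system orderings produced by two
    # evaluation methods (historical clicks vs. online interleaving).
    # System names and scores are placeholders, not actual LL4IR results.
    from scipy.stats import kendalltau

    systems = ["sysA", "sysB", "sysC", "sysD"]

    historical_click_score = {"sysA": 0.41, "sysB": 0.37, "sysC": 0.52, "sysD": 0.29}
    online_outcome = {"sysA": 0.48, "sysB": 0.40, "sysC": 0.45, "sysD": 0.31}

    tau, p_value = kendalltau(
        [historical_click_score[s] for s in systems],
        [online_outcome[s] for s in systems],
    )
    print(f"Kendall's tau between the two system rankings: {tau:.2f} (p={p_value:.2f})")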

Usage scenarios

The first edition of the lab focuses on three use-cases and one specific notion of what a living lab is, with a view to expanding to other use-cases and other interpretations of living labs in subsequent years. The use-cases for the first lab are:

  • Product search
  • Web search
  • Local domain search

All three are ad-hoc search tasks and are closely related in terms of their general setup. Using a shared API while considering three very different use-cases allows us to study how well techniques generalize across domains.

Further details about each of the use-cases will follow soon.

Challenge operation

Evaluation is split into training and test phases.

Training phase

The training phase offers two ways for participants to train their IR models: (1) static TREC-style collections and (2) living labs evaluation environments for each of the use-cases. Details follow.

(1) Static TREC-style collections: These collections consist of a set of 50 frequent queries, a document collection, and relevance judgments. Specifically, the document collection contains the top X documents returned by the commercial providers for each of the queries. Two sets of relevance judgments are made available: (i) judgments generated by a manual relevance assessment process (“expert judgments”) and (ii) judgments derived from historical click information. For the product search and local domain search use-cases, queries and documents are provided in raw format. For the web search use-case, pre-computed query and document features are provided instead; that is, for each query-document pair, a sparse feature vector is given.
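
For concreteness, the sketch below shows one way such training data might be read in Python. The file layouts are assumptions for the purpose of the example (a standard TREC qrels file and a LETOR-style sparse feature line); the data release documents the authoritative formats.

    # Hedged sketch: loaders for the two kinds of training data described above.
    # The exact file layouts are assumptions (standard TREC qrels and LETOR-style
    # sparse feature lines); check the data release for the authoritative formats.
    from collections import defaultdict

    def load_qrels(path):
        """Return {query_id: {doc_id: relevance}} from a TREC-style qrels file."""
        qrels = defaultdict(dict)
        with open(path) as f:
            for line in f:
                qid, _, docid, rel = line.split()
                qrels[qid][docid] = int(rel)
        return qrels

    def parse_feature_line(line):
        """Parse one LETOR-style line: 'rel qid:Q 1:v 5:v ... # docid'."""
        body, _, comment = line.partition("#")
        parts = body.split()
        rel = int(parts[0])
        qid = parts[1].split(":", 1)[1]
        features = {int(f): float(v) for f, v in (tok.split(":") for tok in parts[2:])}
        return rel, qid, features, comment.strip()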

(2) Living labs evaluation environments: For each of the use-cases, challenge participants can also take part in a live living labs evaluation process. For this they use the same set of 50 frequent queries (the training queries), along with the candidate results for these queries, historical information associated with the queries, and some general collection statistics. Once participants have produced their rankings for each query, they upload these to the commercial provider through an API. The commercial provider then interleaves a given participant’s ranked list with its own ranking and presents the user with the interleaved result list. That is, participants take turns (orchestrated by the API, so that each gets about the same number of impressions) and a single experimental system is interleaved with the production system at a time. The actions performed by the user, i.e., the clicks and interleaving outcomes, are then made available through the API to the challenge participant whose ranking was shown. If they wish, participants are free to update their rankings using this feedback.

The Living Labs architecture. Frequent queries (Q), with candidate documents for each query (D|Q), are sent from a site through the API to the experimental systems of participants. These systems upload their rankings (r') for each query to the API. When a user of the site issues one of these frequent queries (q), the site requests a ranking (r') from the API and presents it to the user. Any interactions (c) of the user with this ranking are sent back to the API. Experimental systems can then obtain these interactions (c) from the API and update their ranking (r') if they wish.
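
To make this interaction pattern concrete, here is a minimal sketch of the train-time loop against the lab's HTTP API, written with the Python requests library. The base URL, endpoint paths, and JSON field names are illustrative assumptions; the participant guide documents the actual routes and payloads.

    # Hedged sketch of the train-time loop against the lab's HTTP API.
    # NOTE: the base URL, endpoint paths, and JSON field names below are
    # illustrative assumptions; see the participant guide for the real routes.
    import requests

    API = "http://example-living-labs-host/api/participant"  # assumed base URL
    KEY = "YOUR-PARTICIPANT-KEY"                              # issued on registration

    # 1. Fetch the training queries.
    queries = requests.get(f"{API}/query/{KEY}").json()["queries"]

    for q in queries:
        qid = q["qid"]

        # 2. Fetch the candidate documents for this query.
        doclist = requests.get(f"{API}/doclist/{KEY}/{qid}").json()["doclist"]

        # 3. Produce a ranking with your own model (placeholder: keep the given order).
        run = {"qid": qid, "runid": "baseline", "doclist": doclist}

        # 4. Upload the ranking; the site interleaves it with its own production
        #    ranking whenever a real user issues this query.
        requests.put(f"{API}/run/{KEY}/{qid}", json=run).raise_for_status()

        # 5. Later, poll the feedback (clicks, interleaving outcomes) for this query
        #    and, if desired, update and re-upload the ranking.
        feedback = requests.get(f"{API}/feedback/{KEY}/{qid}").json()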

Test phase

In the test phase, challenge participants receive another set of 50 frequent queries (as test queries) and associated historical click information as well as candidate results for these queries. After downloading the test queries, participants have 24 hours to produce and upload their rankings. These rankings are then interleaved with the commercial providers’ rankings for 6 weeks. Again, each challenge participant is given an equal number of impressions.
Overall evaluation of challenge participants will be based on final system performance and, additionally, on how the systems performed at each query issue. Results based on manual (“expert”) judgments will be made available for comparison. The metrics used are conventional absolute click metrics (e.g., click@1, position of last click), interleaving metrics (e.g., number of wins) and, where available, conventional offline IR metrics (e.g., NDCG, MAP, ERR). During training, participants will only be able to see metrics for their own systems, and only click-based metrics. During and after the testing phase, all metrics for all systems will be available.
Our evaluation methodology, including reasons for focusing on frequent queries, is detailed in a CIKM’14 short paper: Head First: Living Labs for Ad-hoc Search Evaluation.
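
As an illustration of how such click and interleaving metrics can be derived from the feedback, the sketch below aggregates per-impression records into click@1 and win/loss/tie counts. The record structure is an assumption for the example: a result list with a 'clicked' flag per entry and a 'team' field indicating whether the entry was contributed by the participant or the site.

    # Hedged sketch: aggregate per-impression feedback into click@1 and
    # interleaving wins/losses/ties. The shape of an impression record is an
    # assumption: a result list with a 'clicked' flag per entry and a 'team'
    # field saying whether the entry came from the 'participant' or the 'site'.
    def aggregate_feedback(impressions):
        wins = losses = ties = clicks_at_1 = 0
        for imp in impressions:
            results = imp["doclist"]  # the interleaved list that was shown
            if results and results[0].get("clicked"):
                clicks_at_1 += 1
            p = sum(1 for r in results if r.get("clicked") and r["team"] == "participant")
            s = sum(1 for r in results if r.get("clicked") and r["team"] == "site")
            if p > s:
                wins += 1
            elif s > p:
                losses += 1
            else:
                ties += 1
        n = len(impressions)
        return {
            "impressions": n,
            "click@1": clicks_at_1 / n if n else 0.0,
            "wins": wins, "losses": losses, "ties": ties,
            # Interleaving outcome: fraction of decided impressions that were won.
            "outcome": wins / (wins + losses) if (wins + losses) else None,
        }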

Schedule

1 Nov, 2014: Training period begins
1-20 Apr, 2015: Uploading of test runs
20 Apr, 2015: Testing period begins
15 May, 2015: Testing period ends
17 May, 2015: Results released
7 Jun, 2015: Participants' working notes paper submission deadline (CEUR-WS)
30 Jun, 2015: Notification of acceptance for participants' working notes papers (CEUR-WS)
15 Jul, 2015: Camera-ready working notes paper submission deadline (CEUR-WS)
8-11 Sep, 2015: Full-day lab session at CLEF 2015, in Toulouse, France

Organizers

Anne Schuth, University of Amsterdam, The Netherlands (anne.schuth (at) uva.nl)
Krisztian Balog, University of Stavanger, Norway (krisztian.balog (at) uis.no)
Liadh Kelly, Trinity College Dublin, Ireland (liadh.kelly (at) tcd.ie)

Steering Committee

 

CLEF 2015 · The CLEF Initiative · REGIO JATEK · Seznam · UvA