One of the earliest questions we faced was: how do you know when a travel recommendation is good? Price optimality is easy to measure. Fit is not.
Standard retrieval benchmarks measure precision and recall against a labelled ground truth. For travel, there is no canonical right answer. Two travellers with identical budgets and the same destination might have completely different definitions of a good hotel. Any benchmark that treats this as a retrieval problem is measuring the wrong thing.
We built our own benchmark suite around user journey replay. We collected a set of anonymised historical booking sessions (searches, filters applied, options viewed, final selection) and used them as a proxy for ground truth on whether a recommendation was appropriate. If a user saw a result and booked it, we treated that as a strong positive signal. If they filtered it out immediately, we treated it as a negative.
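To make the replay data concrete, here is a minimal sketch of how one session and its signals might be represented. The `Session` class, its field names, and the numeric labels are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Session:
    """One anonymised booking session (hypothetical schema)."""
    session_id: str
    candidate_ids: List[str]                 # every option shown to the user
    filtered_out_ids: Set[str] = field(default_factory=set)
    booked_id: Optional[str] = None          # final selection, if any


def signal(session: Session, option_id: str) -> int:
    """Map an option to an assumed label: +1 booked, -1 filtered out, 0 otherwise."""
    if option_id == session.booked_id:
        return 1
    if option_id in session.filtered_out_ids:
        return -1
    return 0


example = Session(
    session_id="s1",
    candidate_ids=["h1", "h2", "h3"],
    filtered_out_ids={"h3"},
    booked_id="h2",
)
labels = {oid: signal(example, oid) for oid in example.candidate_ids}
```

Options that were shown but neither booked nor filtered out get a neutral label in this sketch; how to treat them is a design choice the benchmark has to make explicitly.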
We then asked our model to rank the same candidate sets, blind to the final booking outcome, and compared its ranking to the preference revealed by the session. The metric we track is preference alignment at positions one, three, and five: whether the user's eventual choice appeared in the top one, three, or five positions of Roavo's ranked output.
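The metric described here amounts to a top-k hit rate over sessions. The sketch below shows one way to compute it; the function name `alignment_at_k` and the dictionary-based inputs are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def alignment_at_k(
    rankings: Dict[str, List[str]],   # session id -> option ids, best first
    booked: Dict[str, str],           # session id -> option the user booked
    ks: Iterable[int] = (1, 3, 5),
) -> Dict[int, float]:
    """Fraction of sessions whose booked option is in the top k of the ranking."""
    ks = tuple(ks)
    if not booked:
        raise ValueError("no sessions to score")
    hits = {k: 0 for k in ks}
    for session_id, chosen in booked.items():
        ranked = rankings.get(session_id, [])
        for k in ks:
            if chosen in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(booked) for k in ks}


# Toy data: the booked option is ranked second in s1 and first in s2.
rankings = {"s1": ["h2", "h1", "h3"], "s2": ["h5", "h4", "h6"]}
booked = {"s1": "h1", "s2": "h5"}
scores = alignment_at_k(rankings, booked)
# scores == {1: 0.5, 3: 1.0, 5: 1.0}
```

Because the ranker never sees `booked`, the comparison stays blind to the outcome, as in the setup described above.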
First-round results showed 71% alignment at position one and 89% at position three. For context, a naive price sort achieves about 38% at position one on this dataset. We attribute the gap to the model reasoning about fit rather than sorting on a single attribute.
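For reference, a price-sort baseline of the kind mentioned above can be sketched in a few lines. The input shape (a price map per session plus the booked option id) and the tie-breaking rule are assumptions for this example, and the toy numbers are not drawn from our dataset.

```python
from typing import Dict, List, Tuple


def price_sort(prices: Dict[str, float]) -> List[str]:
    """Rank options cheapest first; ties broken by id so runs are reproducible."""
    return sorted(prices, key=lambda option_id: (prices[option_id], option_id))


def baseline_alignment_at_1(sessions: List[Tuple[Dict[str, float], str]]) -> float:
    """Share of sessions where the cheapest option is the one the user booked."""
    if not sessions:
        raise ValueError("no sessions to score")
    hits = sum(1 for prices, chosen in sessions if price_sort(prices)[0] == chosen)
    return hits / len(sessions)


# Toy sessions: (option id -> price, booked option id).
toy_sessions = [
    ({"a": 120.0, "b": 90.0}, "b"),               # cheapest was booked
    ({"c": 200.0, "d": 150.0, "e": 300.0}, "c"),  # user paid more than the minimum
]
baseline = baseline_alignment_at_1(toy_sessions)
# baseline == 0.5
```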
We also examined cases where our model's top recommendation differed from what the user actually booked. In roughly a third of those mismatches, annotators agreed that the model's recommendation was the better match for the stated criteria, which suggests that the historical sessions contain noise from user fatigue and imperfect search behaviour.
The benchmark is not perfect, and we are iterating on it constantly. But it gives us a reproducible signal that improves alongside the model, which is what we needed.