#20: Practical Bandits and Travel Recommendations with Bram van den Akker

Note: This transcript has been generated automatically using OpenAI's whisper and may contain inaccuracies or errors. We recommend listening to the audio for a better understanding of the content. Please feel free to reach out if you spot any corrections that need to be made. Thank you for your understanding.

It's just so easy to think you know how something needs to be done and then you do it and it doesn't work.
But it's good to first try it in a simulation: very, very simple, very basic. People overestimate how difficult that is.
Just make a very simple example of what you think you're doing and validate that under the perfect circumstances it works.
Because if you cannot make it work under the perfect circumstances, it likely will not work under suboptimal circumstances either.
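To make that advice concrete, here is a minimal sketch of such a sanity-check simulation (all names and numbers are illustrative): an epsilon-greedy learner on a synthetic Bernoulli bandit whose true arm means we pick ourselves, so that under these perfect, fully observed conditions we can verify the estimates recover the truth.

```python
import random

def simulate(true_means, steps=20000, epsilon=0.1, seed=0):
    """Sanity-check an epsilon-greedy learner on a synthetic Bernoulli bandit
    whose true arm means are known, so the estimates can be verified."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    values = [0.0] * n
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore uniformly
        else:
            arm = max(range(n), key=lambda a: values[a])  # exploit current estimate
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values

# If this cannot identify the best arm under perfect circumstances,
# the implementation is broken before it ever sees real traffic.
estimates = simulate([0.2, 0.5, 0.8])
best = max(range(3), key=lambda a: estimates[a])
```

If the learner fails this check, the bug is in the implementation, not in the data.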
We have almost one shot.
When someone comes to our platform, they want to get a hotel.
Let's assume hotels, but we also have cars and we have attractions.
We need to give that to them as soon as possible and then they go after a travel and then we might hear something back from them months later.
Every time can be somewhat of a cold start in that sense.
My main role within booking is to try and understand where and how bandit feedback is applied and I would say bandit feedback is applied almost everywhere.
The only way to really do any of this is to make sure that we do some form of stochastic decision making so that not every time a user comes in, we recommend them Paris.
Sometimes we try to recommend them something else.
Ideally, if your A/B test is done correctly, that's the best way to get a good estimate, and off-policy or counterfactual estimation in an offline fashion is the best way to do pre-selection.
Hello and welcome to this new episode of RECSPERTS, Recommender Systems Experts.
In today's episode we will be talking about practical bandits: what they mean, which challenges you can expect when bringing bandit-based solutions for recommender systems into practice, and we will hear more about recommendations in the travel industry, more specifically recommendations done at booking.com.
And if you combine both of these things, so practical bandits and booking.com, you might already have a clue about who is my guest for today's episode.
And you might be right, because my guest is Bram van den Akker and I invited him to talk about all of that stuff and I'm very glad that you followed my invitation and joined for today's episode of RECSPERTS.
Hello Bram.
Hello, I'm also very happy that I got an invite to be here.
It was an easy choice to accept it.
Thanks, thanks.
Yeah, maybe just as a short introduction: Bram is a senior machine learning scientist at booking.com, where he works on bandit algorithms and counterfactual learning.
Previously he was also working on fraud prevention.
He has obtained his master's and bachelor's degree from the University of Amsterdam in computer science and artificial intelligence.
He is a contributor to the Open Bandit Pipeline.
And he was one of the people who created the Practical Bandits: An Industry Perspective tutorial for this year's Web Conference.
Bram has also published papers at the Web Conference and at RecSys.
Main topic for today will be practical bandits, but I guess Bram, you are the best to introduce yourself.
So tell our listeners a bit more about yourself and how you got into recommender systems.
So I think there's not much left to talk about after this introduction.
So I basically wrote down everything that you just said, which is basically the short history of Bram.
But as you mentioned, I started my journey in ML at the University of Amsterdam, where there's a large focus on recommender systems and learning to rank.
Maarten de Rijke is there, for example.
So I've spent quite a bit of my time there also working on these types of topics.
And then after I graduated, I diverted a little bit to fraud.
I've also been part of Shopify in their logistics department on inventory forecasting, those kinds of topics.
And then a few years ago, there was an opportunity to go back to bandit feedback, that specific area.
And I was like, that sounds like the place to be.
So I moved.
And since then, I've been super happy to be working on this particular topic.
So was there a specific reason why you changed your focus a bit from logistics, fraud prevention, to recommendation and personalization?
Maybe some specific moment or some specific topic?
Or how have you decided to go deeper into RecSys?
It's funny because the reason I landed on the recommender system area was actually a friend of mine, I think six, seven years ago.
He was complaining, and this has nothing to do with why I'm at booking today.
But he was complaining about how hard it was to find travel destinations.
If you go to any place, usually you want to book a hotel or you want to do a thing, but you already know where you're going.
So he asked me, can we build maybe a recommender system for travel?
So going in blank, saying: I don't know where to go.
I want to go to the beach.
I want to be warm.
I don't want to travel too far.
What kind of places are available?
And at that time, I didn't have any experience with recommender systems.
And we went on a two year endeavor where we tried to build the startup.
It was called Stairway to Travel.
Beautiful name.
And that actually started, I think that was even before I was doing my Masters in artificial intelligence.
And that really got me kickstarted on ML, but also on the recommender platform, the recommender system journey in general.
And somehow that really connects with me.
Like, I've also done courses on user experience design.
One of my papers is on combining visual appearances and search.
It just always appeals to me to be in contact with the end user.
That is what recommender systems are much better at than, for example, fraud prevention or logistics.
Yeah, you just mentioned that you also founded a startup on your own.
And I found this when I have been doing my preliminary research.
And I also saw that you earned the prize for the best pitch or a startup grant during your studies. Was it basically the same company or?
Yeah, that was the same project.
I wouldn't say we really made it to a startup phase.
Because I think the company didn't acquire a lot of customers.
I feel like a startup is only a startup when you start earning money from customers.
We never really did that.
We did earn money through these type of challenges and pitch battles.
But no, we didn't end up making any money, but we did learn a ton.
And I think, for example, one part of being in these startup challenges is that we got coaching and people that helped us develop a business plan, which in the end also made us realize that this was probably not very doable in the phase that we were in.
But yeah, that was the same company.
Yeah, but I guess also great benefit for what you are doing nowadays, because I mean, this was a long time before you joined booking.com.
But nevertheless, you had to somehow think about the same problems, or similar problems, that you nowadays have to solve, just on a much broader scale and with much higher complexity and many more dimensions, as we will see. Because, of course, you're not only recommending destinations when I'm still researching where to go, but also, for example, a car or maybe some events that I want to attend as part of my travel.
Looking forward to what you can say about what recommendations and personalization means for booking.com and for your daily work.
But before we are going down that road, I would be interested into the genesis of that tutorial, because it was greatly covered on LinkedIn and I saw it a couple of times and then I also went through it myself.
And I guess it really helped me in understanding the broad, let's say, landscape of related topics when it comes to bandits and not only about bandits, but also about all the related things, especially when it comes to evaluation and learning where we have seen great tutorials of counterfactual learning to rank in the recent years.
But can you share a bit how you came up with that idea and what your motivation was for it?
First of all, this was obviously not just my work.
It was a really cool collaboration, which I learned a ton from by itself.
I think probably the biggest goal of anyone that gives tutorials is likely to learn about the material that you want to teach.
But as I said, I did do a lot of ranking and recommendations in the last years, but bandits was a relatively new topic for me.
And I think two and a half years ago, I started really diving into it.
And I noticed that if you want to learn about anything related to bandits and hopefully we can cover more because I think bandits is a very bad term because there are so many different things people talk about when they use the term bandits.
I like to talk more about bandit feedback, so to speak.
The first thing you probably do is read a book, very theoretical, or you read a blog post, very practical.
But well, I wouldn't even say practical: very hands-on, not necessarily practical.
But really understanding how you can do this at an industry scale is much more complicated.
And in this journey, I thought: who better to learn this from than fellow practitioners at different companies that have many different experiences?
And it's difficult to get that information out if you just talk to someone at a conference; you might have a short chat. Doing follow-ups afterwards might be complicated, but organizing a tutorial together puts you through a formal process where you have to deliver.
And that was, I think, the biggest motivation.
Let's do a tutorial. Let's find like minded people that want to learn and want to share.
And we got a great group together.
We met, I think, every three weeks or so, for an hour, hour and a half.
We would present what everybody had put together.
We would discuss things we learn.
I mean, there are many more things that you learn in organizing a tutorial than what you actually present, because you have to cut things out due to time limits.
Then in the end, and I don't know if we're supposed to share that yet, we have also submitted a second round, where we want to go into even more detail on the practical side. So this tutorial was mostly about what is practical, but without sharing many implementations, so where this is applied.
And in the second iteration we want to add a bit more of where this is actually used within the companies that use it.
Cool. Great. That will definitely be a good complement to that first round.
And yeah, maybe now, since you mentioned it, I might have to ask you more about it and to provide a sneak preview, but maybe we save that for later.
What I can also tell you is that we have only submitted a proposal and nothing has been created yet.
Yeah, I mean, that's how it goes, right?
Yeah. And given the group, I'm pretty sure that we'll get some really cool contributions once that process gets started.
Yeah. Yeah.
Yeah. I mean, submitting a proposal and then getting accepted then really gets you into the right mood because then you have also some external push now to deliver.
Yeah, exactly.
Yeah. OK. And so, as you mentioned, there is that range between very theoretical books or papers on the one hand and, on the other hand, blog posts that are sometimes, let's put that in quotes, too shallow to really get the gist of it. And the practical bandits tutorial that you created somehow positions itself in the middle between these two extremes, or how am I to understand this?
Yes, that's absolutely the goal.
So first of all, I think it's a nice exercise.
If you would search for bandits, let's say you want to know something about the algorithmic part, you won't just search for bandits.
You probably search for bandit algorithms.
You do it on Google Scholar.
You likely get very theoretical principles about on-policy bandits.
If you do it on Google, you get Medium blogs on on-policy bandits.
But what we've largely learned throughout the creation of the tutorial is that most of the industry is doing off-policy bandits.
And that's very difficult to find if you just search for bandits, because that usually goes under the term bandit feedback.
And that is a term that, when I started out, was not very commonly used.
Well, it's commonly used, absolutely, but not by people that don't work with it.
Like it's not a term that you run into very commonly in your standard ML books.
And then if we talk about on-policy learning, so when we actually synchronously learn to balance exploration and exploitation, there are also a lot of people that cannot figure out the difference between active learning and these on-policy bandit algorithms.
These are very basic elements that someone who has been working on bandit feedback, in either on-policy, off-policy, or whatever, is very familiar with.
But if you're starting out, those terms are not obvious.
And this tutorial set out to not just give you an introduction, but really walk you through the field if you want to apply it practically.
Yeah, that's really the goal.
Yeah, I really like that way of putting it since what you mentioned is it's easy to get lost if you don't know anything.
Or have no clue what to search for.
And so if you can take this as a guidance or as also some kind of a roadmap, because I mean, among these almost 300 slides that you put together there.
There's a lot of animation.
That's right.
But I mean, you are also providing many references.
If people want to go deeper into certain aspects, I mean, you are mentioning all of these papers that have appeared in the past, let's say 10 to 15 years around that topic.
I mean, there are these very famous papers by Lihong Li, who for me somehow kicked off that whole topic.
And then it goes even further.
And later on you talk about one, I guess, of Olivier's favorite topics, so the pessimistic reward modeling with which you wrap up the tutorial.
And it's really a broad spectrum across more theoretic, but also sometimes practical papers that have been applied to some industrial scale problems, for example.
And it's also a big benefit that most of us in the group actively publish within the field, especially, well, Olivier is a big front runner.
Ben London is also doing a lot of these publications.
But the tutorial itself is not intended as self promotion.
So there are some papers from the people within the group, but those are really like the final points that sometimes need to be added to complete the story.
But it's really like we try to tell the story that we want to tell about bandit feedback and throughout trying to make sure we make the right references at the right time so that people can follow all those different paths.
And on the topic you just mentioned, we would probably not be the first one to do a project, either a paper or some internal project.
And then after months of hard work, you run into one term, one word.
And if you would have known that word at the beginning, you would have found all you need, but that word comes at the end.
That's right.
Let's talk about recommendations in the travel industry or recommendations at booking.com.
And then let's relate to how you are solving these problems, these challenges with bandit based algorithms with bandit feedback.
So Bram, would you be so nice and let our listeners know what role do recommendations, does personalization play at booking.com?
Well, that's a much broader question than I probably can answer easily.
One of my colleagues has this slide.
He has the search result page of booking, an older version.
He highlights all of the recommendation blocks that are on there.
On that slide, they're not labeled recommendation blocks, but ML blocks.
And I think almost everything on the page is going through some form of a recommender system.
So think about the search engine in general is obviously already a recommendation system, but there might be certain banners that you see or the filters or obviously with something that is so personal, which is travel, there needs to be a lot of personalization in many of the areas.
If we would just recommend you the most popular places in the world, everybody would be going to, I don't know, Paris or London.
That's probably not what we want to do and also not what our customers want.
However, maybe that's sometimes a good starting point if you don't know anything yet about your customers. So if somebody is really new to booking.com and searches for something, then, of course, what I guess you might have access to is where this person is accessing booking.com from.
So maybe whether it's a person looking at the German website of booking.com or maybe at the US site. This is, I guess, already something that you can take into account to provide some more popularity-based recommendations. What is the case there, or how do you deal with cold start, without going too deep?
How does booking deal with the cold start problem?
When someone arrives on our website, for example, we don't know anything.
Actually, I think for the majority of our users we have to assume that they are not logged in, because if we assume that users are logged in, then the users that are not logged in have a terrible experience. But at the same time, when users come to our website, we generally assume that they already have some idea of where they want to go.
So really quickly you start gathering the information.
I think the cold start problem is not necessarily one of the places we're mostly working on.
We usually assume that someone has some idea of where they want to go, and we try to best match our offering to what they want to do. And in the areas where we do have a cold start problem, we tend to have ways to deal with it based on popularity, or we just show multiple things at the same time.
Because just as you said, the most common use case is really that people come with a very concrete, let's say intent to the platform.
So for me, it was the last time I used booking.com was when making a booking for a hotel in Helsinki.
So basically, I got to the platform, was searching for: I want to go to Helsinki for that timeframe, I'm searching maybe in that range, and then go.
This is where you already start collecting information about myself.
Regardless of whether I have done bookings before, which I did, and which you can take into account to, for example, understand what kind of hotels I'm looking for, what my sensitivity is towards certain attributes, and all of that stuff. But even if I weren't bringing that information, if I were totally new to booking.com as a user, then still the very first thing you get from me in most cases is basically my search query for a destination.
So that's also directly, I think an interesting area you touch upon is you say, I come to the website and I have my intent to go to Helsinki and then maybe I come back, well, I assume you always come back, but you come back later with your next travel that might be going to, I don't know, let's say Barcelona.
But that also touches upon a very different aspect of what booking-type recommendations are, compared to companies like Netflix and Spotify: they have more or less continuous interactions with their users.
Spotify, I think I'm on Spotify almost all day, every day, right?
Music is something you listen to continuously; Netflix or any of the other streaming platforms you tend to watch maybe once a week, a couple of times a week.
If you're having time off and you didn't have money to travel, you might do it all day for a whole week or if you're sick.
So then there's much more interaction and that really changes the dynamics of how you can do recommendations.
We have almost one shot.
Someone comes to our platform.
They want to get a hotel.
Let's assume hotels, but we also have cars and we have attractions.
We need to give that to them as soon as possible.
And then they go after a travel and then we might hear something back from them months later.
And then they come for their second travel, with a completely different context.
They might want to go to something completely different or suddenly they're a business traveler.
They're not traveling for themselves.
So this constantly changes the dynamic.
So every time can be somewhat of a cold start in that sense.
That's an interesting point that you are bringing up there.
So actually how much feedback you get per time from your users.
The very first domain where I've been doing RecSys work was actually for a car marketplace, somehow similar because people don't buy cars as often as they consume music, some videos on video streaming platforms or read news.
Maybe they do it once every couple of years.
And then what you, what you said is you only have that chance to do it right.
Where I would partially agree and partially disagree, because on the one hand, of course, they won't come back for quite a long time.
And maybe in your case, it's for weeks and month and for a car platform.
It might be for years.
However, if I'm entertaining going on a travel, and maybe really thinking about a vacation where I might also want to visit different cities, different countries, whatever, then I would also assume, at least this is how I work, which is very anecdotal evidence, but let's just entertain it, that I almost never book my hotel when I go for a search the very first time.
So it's really like, Hmm, I would have to book a hotel since I intend to go somewhere.
Let's give it a try.
And then I look around and then I might not be fully convinced.
And then I return a couple of days later and then urgency kicks in and I think, okay, now you really need to get that hotel because maybe prices might be increasing or availability might be decreasing, whatever reason.
And then I feel just the urgency.
Now you need to get it done.
But nevertheless, throughout that narrowing down of myself and making my mind up, you collected evidence from me.
So you might not need to get it the first time, right?
Even though it would be the best case or what is your perception or your opinion on this.
I think that's very much right.
And there's no one type of user, right?
There's always... I think also on Spotify there are probably people that put on a playlist and listen to that for the rest of the day, or whatever, and there are people that select their songs one by one.
And that also relates to cars.
What also makes it different is yes, maybe you have multiple sessions, multiple touch points within one purchase.
But what is nice is that we do have the golden signal, so to speak.
I wouldn't say it's a golden signal, but that is if someone actually ends up booking a hotel. And if someone has a bad experience, they'll let us know, whereas if someone doesn't like a song, it's unlikely that they will downvote it.
If someone doesn't like a movie, they might sit through it through the end and then tell their friends how bad it was.
But in a booking session, you click on a bunch of hotels.
Those are very weak signals.
But then at the end you book one and then you might leave a review.
If something is very good, you leave a good review.
If something is very bad, you leave a bad review.
That gives us a lot more, in that sense, a lot more signal.
But still for that user session, we don't always learn a ton from the user.
Sometimes we do, sometimes we don't.
But there's definitely information we can gather throughout a session to make it more personalized. Absolutely.
And also, there are a lot of travelers who don't book their travel based on an Instagram picture.
So there are also a lot of travelers that are very similar to each other and want very similar things.
So that is definitely a way to address personalization.
Maybe you could also say personalization is not personalized to you.
It's personalized to: who do we know that does something similar to you, that you might also like.
For today, we promised our listeners to uncover the space of counterfactual learning and evaluation, bandit feedback.
So when talking about the recommendations that are being placed on booking.com and how you do personalization with recommendations: what role does what you cover in the tutorial play there?
So basically, to which degree do these two things match: the topics that you cover in the tutorial and what you do as, let's say, daily work?
Well, I would say almost 100%.
So the big difference is that not everything that is said in the tutorial comes from booking.com.
A large part of it is coming from Netflix, ShareChat, Amazon, Spotify.
So that means that the difference I think is that some of these things that we present, I would love to have and we don't have them yet.
And I think the same goes vice versa.
But my main role within booking is to try and understand where and how bandit feedback is applied.
And I would say bandit feedback is applied almost everywhere.
I was actually about to ask you what you explicitly mean by bandit feedback, because, for people who might recall episode two or three, that was with Olivier Jeunen, we basically made that difference between organic and bandit feedback. So to say: where does my feedback come from?
Was it from organic user behavior, or was it from recommendations or something that was already somehow biasing my decision, pushing me maybe into a direction, so to say? Am I recalling it right, or how would you put it?
When I talk about bandit feedback, and I think this is generally what people refer to: you have a decision system, it doesn't have to be a recommender system, but you have a decision system where you are actively making a decision, and then you only get feedback for the decision that you took.
So in bandits, we tend to talk about arms or actions.
We pull an arm.
So, for example, we have five slot machines.
We make an active decision, which slot machine to pull.
But we only see the outcome of that slot machine.
There's many other variations of this.
For example, you have semi-bandit feedback, where you have multiple arms that you pull, but there are still some whose outcome you don't know.
And sometimes it can be that you still get feedback from arms you didn't pull. But those are all complications on top of it.
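The structure just described, a stochastic decision with a logged propensity and a reward observed only for the chosen arm, can be sketched roughly like this (the function name, the `scores` input, and the epsilon-greedy policy are illustrative choices, not booking.com's actual system):

```python
import random

def log_bandit_round(scores, epsilon=0.1, rng=random):
    """Choose one arm stochastically and log what off-policy learning needs.

    `scores` are hypothetical model scores for each arm; rewards for the
    arms we did NOT pull are never observed, which is what makes this
    bandit feedback rather than supervised learning.
    """
    n = len(scores)
    greedy = max(range(n), key=lambda a: scores[a])
    # Epsilon-greedy: every arm keeps a known non-zero probability,
    # so counterfactual estimators can reweight the log later.
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    arm = rng.choices(range(n), weights=probs, k=1)[0]
    # The reward field is filled in once user feedback arrives.
    return {"action": arm, "propensity": probs[arm]}
```

An off-policy estimator such as inverse propensity scoring can then divide each observed reward by the logged propensity to evaluate a new policy offline, which is why the propensity is stored alongside the action.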
In general, in the recommender space, we have either a regular bandit feedback setting, for example when we show a banner or give a single recommendation, that's typical bandit feedback, or ranking systems, which are also a form of bandit feedback, semi-bandit feedback in that way.
But you decide what order to display items in.
But at the same time, you only know the outcome for the things you display.
So as soon as something is not seen, either because you have a cutoff and don't show anything else, or people don't go to the second page, or people don't watch the whole page, then you get into this more complicated bandit feedback setting.
But this is actually applying to bandits with, I guess the term is, single play.
So where I'm offered with an assortment of K arms and I only pick one from them to experience kind of the reward.
But there are also cases in which I can pull several.
So not only just one: we talk about multiple play, where I'm offered a couple of arms from which I can select not only one, but several.
How does that fit into your scheme?
It depends on what you're asking exactly.
So either you're you're asking about semi bandit feedback.
So, for example, a ranking system. Or you're talking about sequential decision making. In bandit feedback, it's generally assumed you have only one play.
Doesn't mean you can have only one arm that you pull, but you make a decision and then you get the outcome of that decision.
You don't then act upon this outcome.
When you go into that direction, you come much more into the direction of sequential learning and more complex reinforcement learning.
That is something that I do encounter, but it's explicitly not what I'm trying to reason about, because as soon as you go into that step, a lot of assumptions break and you get into a harder problem.
Also, most of the time, and that's also something we mention in the first few slides of the tutorial: if you have a contextual bandit, you do sometimes implicitly make it a sequential decision making process.
If somehow your context contains anything that can be caused by a previous instance of your own policy.
So in case my action, upon the outcome that I observed, influences the state again that is taken as context for, let's say, the next round, then you break the independence between these two steps.
And that is not always bad, but it's not part of the assumptions that we tend to make when we talk about bandit feedback.
Okay, so which kind of the recommendation problems you are solving would be one where we could basically apply this?
So I guess you already mentioned the decision was I'm going to show a banner, but if we, for example, think about a ranked list of items.
So for example, destinations.
So I put in a search.
I search for a certain timeframe and city and I want to be shown hotels, and then you basically show me a ranked list of hotels. How could I apply the bandit setting to this?
This is actually a very interesting area that you touch upon now, because one of the things that we encountered when we were thinking about resubmitting our practical bandit tutorial was can we extend this to ranking?
Because right now we have some small elements of ranking, but most of it is on the traditional bandit where you pull one arm at a time.
And then we discussed about another tutorial that is being given by some people, largely from the University of Amsterdam on unbiased learning to rank.
And actually we were discussing what's the difference between unbiased learning to rank and the whole field of bandit feedback.
That is a very thin line.
Right. So as soon as you start extending your bandit problem to a ranking problem, you very quickly come into the space that is unbiased learning to rank.
Although unbiased learning to rank tends to, well, that's not completely true, but it's not uncommon that it's about off policy learning.
So we have logs and we want to do something with them, but that still is some form of bandit feedback. And if you read Tor Lattimore's book on bandit algorithms, he also mentions ranking as one of the semi-bandit problems.
So where you can have multiple.
So many of the things you would see in an unbiased learning to rank setting are easily considered as a bandit setting as well.
Now, the different thing is, though, if we start talking about exploration and about treating the information that comes in, then you get some more complicated elements than in regular bandits.
For example, in a one-armed bandit setting, you show an arm, and you know how likely you were to show that particular arm.
If someone interacted with it, you know that all of that feedback is equally weighted.
But in a ranking bandit setting, we show a couple of arms, and now suddenly position bias starts playing a major role.
Among the other biases is trust bias.
And there's a ton of biases that come into play that make it so that the feedback that you're getting is not equal among all of the arms that you are displaying.
So I hesitate a bit: am I understanding "equal" correctly?
So do you mean that because the feedback is a consequence of the appeal and the bias?
So this is then not equal anymore because some are positioned higher.
So the position bias is higher and it's basically some kind of an unfair advantage.
And this makes them not comparable anymore as long as I don't debias properly.
Is this what you mean by not equal?
Yeah, so in a regular bandit setting, we're only talking about relevance.
I mean, we rarely use the term relevance because relevance is really a term that is more common in ranking.
But we just want to know if we display something here, depending on what we display, which gets more clicks, more conversion, more whatever we are interested in.
Rewards practically.
Yeah, reward.
Yeah, reward.
But in a ranking scenario, we're still interested in this.
But now suddenly the position where we place it plays a major role.
So we tend to assume that from top to bottom, things get more attention.
Let's say there's only position bias and relevance.
There are more, but let's say those are the only ones.
And the first item is 100% relevant for the user and it's 100% seen.
Then, yes, you're going to see a click.
But for the second element, if that's 100% relevant but only 50% of the users actually reach that point, we get half the amount of feedback.
But if we now start to learn from this feedback, then we'd say we displayed it 100% of the time.
So whatever bias correction we do based on that display propensity changes nothing.
But we also need to consider how valuable was it for this particular item to be in the second position.
And that is not trivial because that's something we cannot just get from the design of the system.
That's something we need to get from some form of additional exploration, which we don't tend to have, which we now need to add on top of what we have to have in a regular system.
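The 100%/50% arithmetic above can be sketched in a few lines; the examination probabilities and user count here are just the illustrative numbers from the example, not anything from a real system:

```python
# Position-bias sketch: two 100%-relevant items, but position 2 is only
# examined (reached) by 50% of users. Naive click counts make the second
# item look half as relevant; inverse-examination weighting recovers
# equal relevance for both.
examination = [1.0, 0.5]   # assumed probability that each position is seen
n_users = 1000

clicks = [0.0, 0.0]    # naive click-through rate per position
weighted = [0.0, 0.0]  # examination-debiased relevance estimate

for pos in range(2):
    # both items are 100% relevant, so every user who examines the
    # position clicks it
    observed_clicks = n_users * examination[pos]
    clicks[pos] = observed_clicks / n_users
    weighted[pos] = observed_clicks / examination[pos] / n_users

print(clicks)    # [1.0, 0.5] -- position 2 looks half as relevant
print(weighted)  # [1.0, 1.0] -- after debiasing, both are equally relevant
```

The debiasing step only works because the examination probabilities are known; as noted above, in practice they have to be estimated via some form of additional exploration, such as randomizing positions.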
And there are very interesting ways to do this, though.
First of all, as long as we have a limited number of arms, this can even make things simpler, because let's say we only have 10 arms to display. A common use case for this is block ranking on a website: we have our blocks on the website and want to order them.
Now we can just place them anywhere.
And then we can figure out more or less what the position bias is by randomizing these, but they're all still displayed.
So it's less expensive than in a regular bandit system: if we only showed one at a time, we might miss the opportunity to show the right one to the right user.
Well, if we show all of them, the user might scroll all the way to the end and still find the block they're interested in.
But as soon as you have a lot of different arms, then, you know, some are unreachable.
So now you still get the same problem.
I guess you have already mentioned the word randomization, and I guess randomization plays a major role in how to get this problem right.
Because what you are doing quite often when dealing with bandit feedback is that you basically raise the counterfactual question where you ask yourself, what would have been my reward if I had chosen differently, for example, and then using randomization as an effort to answer or try to answer it.
Can you explain how randomization and where it plays a role there and how it helps you to address this problem?
Well, randomization is one of the ways of describing it.
But I would say stochasticity is a more appropriate way to coin this term.
Randomization has a bit of a negative connotation, I'd say.
And all your product managers will leave the room starting to cry or yell at you if you propose to randomize stuff, which is what we do.
But yeah, if you want to propose to your product manager to do randomization, call it stochasticity and say that it's something fancy and everybody will stay.
Yes, exactly.
But this is actually a very good point.
I'm going to divert a little bit from this to give you a little bit of a thing that might be relevant to this.
So a lot of the ways to deal with bandit feedback is to correct for some sort of selection bias.
So if we make a decision and a certain decision is more likely to occur, that decision also has more opportunity to prove itself.
And depending, it's not a given that this happens, but depending on how you optimize your system, you might need to correct for how often something was displayed.
And that usually boils down to some form of propensity correction and propensity for those who are not familiar.
Propensity is the likelihood that a certain action is being taken.
Now, the funny thing is that every time I talk to a team that is not familiar with this, and we explain to them what a propensity is,
They say, oh, yeah, we have that.
And then what they show me is the model predictions and they say, well, the model predicts a score and that's the propensity.
But that's not the propensity because most of the systems that are designed by someone who is not familiar with bandit feedback will be deterministic.
So it takes the highest score — or greedy is a better way of putting it.
It will always pick the arm with the highest score.
So that means that the likelihood that for this user, given this context, this action would have happened is 100 percent.
Now, that brings us back to your question is why do we need randomization?
Well, because now the only thing we can say about this user is that when they get this, they will convert yes or no, or they will click yes or no.
We need to know some counterfactual evidence.
So instead of factual evidence, which is what we had, we need to know the what-if — that's what you said.
So now what we need to make sure is that these users — or not necessarily this user, but users with a similar profile — that ties back to what I mentioned earlier: everybody has someone else in the world that is likely similar to them, not the same.
We're all unique, but there are people similar to us.
So we try to match them to someone that is similar.
So we understand what would happen for this.
We need stochasticity.
So some users need to see A and some users need to see B.
Not to be confused with A/B testing — within a context, we need to show some users A and some users B. Right.
And we do this so that we do not get a propensity of one or zero for certain actions.
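The difference between model scores and propensities can be made concrete. Under a greedy policy the chosen action's propensity is 1.0 and every other action's is 0; a stochastic policy — here a softmax over the same hypothetical scores, just one possible choice — gives every action a nonzero propensity:

```python
import math

scores = [2.0, 1.0, 0.5]  # model predictions -- these are NOT propensities

# A greedy (deterministic) policy always picks the argmax, so the logged
# propensity of the chosen action is 1.0 and all others are 0.0.
greedy_propensities = [1.0 if s == max(scores) else 0.0 for s in scores]

# A stochastic policy (softmax over the scores) turns the same scores into
# nonzero propensities for every action, which is what counterfactual
# learning and evaluation need.
z = sum(math.exp(s) for s in scores)
softmax_propensities = [math.exp(s) / z for s in scores]

print(greedy_propensities)   # [1.0, 0.0, 0.0]
print(softmax_propensities)  # all strictly between 0 and 1, summing to 1
```

With the greedy policy, no amount of logging will tell you what would have happened for the other arms — which is exactly the problem described above.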
The funny thing is that there's many ways of doing this, but sometimes we are already doing it.
And that is when we have, for example, multiple models running at the same time.
Plenty of teams have multiple models that they just run randomly on certain parts of traffic.
And that can already bring some stochasticity. And some teams have holdout traffic, where certain users are not treated, or traffic where all users are treated, just to have some sanity check.
And if you have these, then all of these models plus the holdout and the flat —
flat being: we always give our treatment, whatever that may be — together form a limited stochastic policy.
So some limited stochasticity, even though it might not be completely random.
So to translate that into practice, let's say there was a 10-hotel world —
I wish.
It would make my life much easier.
And they all would have been existing since the beginning of time and are comparable, at least in the type of features they have.
And there is a certain set of user features that we have, so that we can take, for example, the user features as a context to impose a stochastic policy — which basically is a probability distribution across these 10 hotels — from which I then sample the hotel that I'm going to recommend to a person.
So, and then what you would basically do is say the first hotel has a probability of 70% to be recommended to me.
The next one has 20% and then it decreases even further.
And now you take this to basically sample from that according to that probability distribution.
And it turns out the most likely thing happens.
You propose to me the hotel with a probability of 70%, which then would mean that 70% becomes the propensity that is logged together with my potential feedback toward that hotel.
Say I'm going to make a booking, or just click it, or save it to my favorites, whatever.
But it could also be that for another user that is more or less similar to me, you would, for example, display the less likely hotel to them, the one that only had 20%.
And then you again log that 20% along with the feedback the user provided.
So basically what we would or could refer to as a reward.
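The sampling-and-logging flow in this example might look something like the following sketch; the distribution, hotel names, and log fields are all illustrative assumptions, not anyone's actual system:

```python
import random

random.seed(0)

# Hypothetical stochastic policy over a 10-hotel world: a probability
# distribution per user context (the numbers are made up and sum to 1).
propensities = [0.70, 0.20, 0.04, 0.02, 0.01, 0.01,
                0.005, 0.005, 0.005, 0.005]
hotels = [f"hotel_{i}" for i in range(10)]

def recommend_and_log(log):
    # Sample an action according to the policy's distribution ...
    idx = random.choices(range(len(hotels)), weights=propensities, k=1)[0]
    # ... and log the chosen action together with its propensity, so the
    # reward observed later (click, booking, favorite) can be weighted
    # correctly in offline learning and evaluation.
    log.append({"action": hotels[idx],
                "propensity": propensities[idx],
                "reward": None})  # filled in once the feedback arrives
    return hotels[idx]

log = []
recommend_and_log(log)
print(log[0]["action"], log[0]["propensity"])
```

The key point is that the propensity logged is the probability the policy assigned to the action that was actually taken — 70% if the most likely hotel was sampled, 20% for the second one, and so on.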
So is this, in that very limited example, something that you do: that you actually want to know, at the time of a decision — to use the broader term instead of recommendation —
the decision made, for example, to display a certain item to a user as a recommendation, and then to log that information to reason about it afterwards?
Is this going in the right direction, or what would you say?
Yes, yes, it is.
And it's important to note that in this very simplified world, like it really depends on how we start using this information.
So sometimes when you have bandit feedback, you don't necessarily need to do all these bias corrections because let's say we just do a simple linear regression, like a point wise linear regression.
And we have shown all of these different hotels equally.
And for each hotel, we do a separate regression.
We just try to learn the average click rate from users.
We don't necessarily need a weight for that.
That would be fine.
But as soon as you start going to more complex modeling where just like you would do simple weights anywhere else in machine learning, you might need to understand how likely it is that some of these decisions are taken in the real world so that you can correct for them.
So that you make everything balanced.
So for that reason, you need to know what the likelihood was that a certain decision was made at a certain point in time — especially since we have systems where multiple teams, sometimes in different areas of the business, are involved; they do communicate, but not all the time.
So constantly things are changing.
So it means that it could be that when one user comes in, we are still using one system, but then the next user comes in and we suddenly switch to a new system.
So we constantly need to understand for this particular user, what was the likelihood that this would have happened for them?
In general, the foundation of this is that the only way to really do any of this is to make sure that we do some form of stochastic decision making, so that not every time a user comes in do we recommend them Paris.
But sometimes we try to recommend them something else.
So that your denominator doesn't become zero and you need to exclude that observation from your IPS score or something like that.
Yes, but also that, for example, if we talk about counterfactual learning, we also have counterfactual evaluation.
But if we start doing counterfactual evaluation on the decision system, usually you can do two things.
You can either evaluate the click through rate.
So we have an action in a context and we want to know the click through rate and we predict how accurately we can predict the click through rate.
That's one thing you can evaluate.
But ideally you evaluate for a particular user, would they have converted yes or no if we would have presented this?
Because we don't have that many samples for all of the users.
So we don't really have one click rate for one type of user.
And if you do this, then you can only really evaluate this for the users that you've seen, for the actions that you've taken.
But then if certain actions are taken much more often, a system that starts taking those decisions is more likely to appear to get reward, because that's the only data that we have from this user.
So for this reason, we need to make sure that we, one, have examples for different actions.
Ideally, completely random.
Don't say that to your product manager, but ideally we have some completely random traffic.
That's the easiest way to work with this.
But probably we want to do something a little bit more optimized and then we need to make sure we balance things out in the end.
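The "balancing things out" mentioned here is typically done with inverse propensity scoring (IPS), which was also alluded to earlier with the zero-denominator remark. A minimal sketch, with made-up logged records, could look like this:

```python
# Inverse propensity scoring (IPS) sketch: estimate the reward a target
# policy would get, using only data logged under a different, stochastic
# logging policy. The logged records below are purely illustrative.
logged = [
    {"action": "A", "reward": 1.0, "logging_propensity": 0.8},
    {"action": "B", "reward": 0.0, "logging_propensity": 0.2},
    {"action": "A", "reward": 1.0, "logging_propensity": 0.8},
    {"action": "B", "reward": 1.0, "logging_propensity": 0.2},
]

def ips_estimate(records, target_propensity):
    """target_propensity: dict action -> probability under the new policy.
    Each logged reward is reweighted by target / logging propensity."""
    total = 0.0
    for r in records:
        weight = target_propensity[r["action"]] / r["logging_propensity"]
        total += weight * r["reward"]
    return total / len(records)

# A target policy that shows B more often than the logging policy did:
print(ips_estimate(logged, {"A": 0.5, "B": 0.5}))
```

Note that this only works because every logging propensity is strictly positive — a deterministic logging policy would put a zero in the denominator for any action it never took, which is exactly why the stochasticity discussed above is needed.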
Which brings me to two thoughts that I'm having.
The first one is that I actually made a mistake.
The second one is that randomization might also sometimes be misunderstood as uniform random exploration, which is also not something that any of you would advocate for, as I learned from going through the practical bandits tutorial.
Actually, I would say that the majority of us mentioned that the easiest and most commonly used approach is uniform randomization for part of your traffic.
But that only works in specific scenarios; not all scenarios can just be randomized uniformly.
So it doesn't mean it wouldn't work at all, but for some scenarios it works and for some it works less.
Maybe we will get to that point.
But the first thing that I mentioned — that I actually made a mistake — relates to some feedback that I received quite a while ago, and I want to pay a bit more attention to it, because we haven't really covered the question of the why.
So why are we doing this?
Why is it actually relevant and good to pay Bram for investigating and developing systems that deal with bandit feedback and that enable counterfactual learning and evaluation?
So what are the benefits of doing this actually?
That opens a lot of different questions.
So you're asking why would booking in this case pay me to invest in this?
Well, I would say that bandit feedback is a given in many of the places that we work for.
I always have a hard time giving an example of a recommender system.
I don't know if I would be able to give one easily where the recommender system is not a form of bandit feedback.
And then as soon as we are doing a form of bandit feedback, we need to start thinking about how to do it properly, because otherwise we get things wrong — like if you are coming from a Coursera course on machine learning, which is not uncommon for machine learning practitioners, right?
A lot of machine learning practitioners have very STEM backgrounds and then were transitioned to learn how to do machine learning.
But there's a lot of context in industry.
There's every product you work on has their own intricacies.
So many of the things that were built at first are built just by doing something very basic and then later you realize you might have introduced bias or you might have had a naive model that ignored some of the important elements of bandit feedback.
And we need to better understand, for cases where we work with bandit feedback, what the appropriate way is to not waste money on bad decisions — either because you're having a biased model or because you're doing too much randomization, because both could be the case.
And then given that you have collected this data, so either through a little bit of randomization or a lot of randomization, how do you then best deal with this to come up with a new model policy or something?
Or some way of learning that is better — and with better I mean, specifically in the e-commerce world, to get more people to buy the products that they want to buy, and not recommend them things that they're not interested in.
And in the case of many of the streaming platforms, it would be how to keep people engaged.
If you don't listen to Spotify, you're not going to pay for next month's subscription.
Makes sense. So it's about improving existing systems, but also — from what I get from what you are saying — about establishing trust in what you measure. Or how would you rephrase that second part?
I mean, for me it's sometimes like one of the holy grails in RecSys is really to get that offline A/B testing solved.
So to somehow use properly logged data, which, for example, contains those propensities and not the model scores — which, as we learned, are not the same thing, but they might be related.
So to have proper logging data, not as an only, but as one of the crucial components.
In order to be able to, for example, evaluate new policies.
So evaluating my target policy based on logging data that came from a different policy —
basically the production policy — and then doing all kinds of techniques to estimate what my metrics (and I'm not talking only about, let's say, typical information retrieval metrics, but also something like conversion rate or click-through rate) would be if I were to put that target policy in production to replace the policy that created the data in the first place, and thereby to somehow become independent from A/B testing.
However, I have seen papers.
There was a Wayfair paper, for example, at last RecSys, where they really have gone through the three steps of typical offline evaluation.
So mean average precision and nDCG, whatever —
I'm not sure whether those were actually the metrics that were covered there, but standard offline evaluation in RecSys. Then they did off-policy evaluation, basically trying to answer that counterfactual question:
What if I would put that target policy in place?
And then in the end, they also actually ran an A/B test to confirm what they had found in the preliminary steps.
So in that sense, is it about establishing confidence in what you measure by evaluating in that counterfactual way instead of the, let's say, classic way — and thereby not only establishing confidence in your results when deciding which model you want to put into production,
but also increasing the experimentation speed, because you don't have to rely anymore on setting up an A/B test, running the A/B test, waiting until you have enough data to have a statistically significant result, and so on and so forth —
along with all the risk that A/B testing might have in terms of decreased user experience or something like that?
So what is the gist?
So is it confidence in your results?
Does it increase the experimentation speed?
What is it?
So I don't think you can see these two separately.
So there's again, multiple layers in your question.
So you're asking specifically about counterfactual evaluation.
There's two sides of this.
You have counterfactual learning or optimization.
You have counterfactual evaluation.
They're both relying on some form of counterfactual estimate.
But when we do evaluation, we need to be very precise.
And there's a lot of elements that come into play that make it difficult to be precise.
Actually, I think on RecSys, Ben London and Olif Iona had a paper on this about confounders in, I don't know if it was recommender systems or ranking, but at least there's a lot of things we don't know for certain.
So I don't think — at least at Booking, but I would also recommend people not to go in that direction —
that counterfactual evaluation is ever a replacement for A/B testing.
But if you are doing candidate selection, so we have many different models that we could try out, probably some form of counterfactual evaluation is the best way to narrow down your search.
So let's say we have a model that wants to optimize how many people convert.
Then we can first train a model that given the different actions that we have learns the conversion rate.
And if we evaluate this, how precise we are at estimating this conversion rate, we might get some information.
But the real thing we're interested in is what you said is the what if question.
So we want to know: what if we show it — does our conversion rate really go up?
Because maybe you just didn't have the right evidence, or there's some bias.
Then if we have created different models or different policies and we apply them on our log data, we can get the best candidate and then we test the best candidate or the best couple candidates.
So we want to be as precise as possible in making this candidate selection.
Because if we make a mistake in the candidate selection, we also miss out on any gains we have online.
But we never — at least I never — aim to replace A/B testing, because that's the only real way we get rid of all the confounders: we just randomly show two different strategies.
Even if there are other elements at play — other A/B tests running, other things interfering — they are, assuming everything is done well, exposed to both A and B equally.
So we don't have any biases anymore from that.
Obviously, there is a lot of work on leakage between A and B, especially in marketplaces.
If, hypothetically, we lower the price of a hotel and it gets booked, then that hotel is not available anymore.
Yeah, good.
Or if we get more hotels in a certain area to sign up, the price will probably lower because now there's more competition.
Yeah, so if your A/B test is done correctly, that's the best way to get a good estimate, and off-policy or counterfactual estimation in an offline fashion is the best way to do pre-selection for the process.
Oh, okay.
That makes total sense to me: you don't want to replace A/B testing, but you want to improve the way you make the decision for the candidate that you want to put in an A/B test.
As kind of the final validation of what you think might work best or improve your current system.
And there is another point and maybe this is what you are also referring to.
In our tutorial, we talk about the benefits of off policy learning versus on policy learning.
And one of the things is speed and iteration of your development.
But in learning, we get into a slightly different area than evaluation because now we're not just talking about evaluating things offline.
We're also talking about having to even optimize with the feedback of the user.
So if we have an on policy system, so we show the user something, we get feedback and now we update the beliefs of our system and we show the next user something new and we update.
So with the exploration-exploitation trade-off, we first need to expose it to a lot of users to get to a certain state where we're confident that the model has converged, and then we need to compare it to something else.
So if we want to train four models, we need four completely distinct sets of users to train on and four completely distinct sets of users to evaluate on.
But if we do off-policy learning, we now have four models learning from the same set of users, and only if we would not do off-policy evaluation would we have to use four sets of users to evaluate.
And then with off policy evaluation, we could say, well, the first two are definitely bad.
So we only keep the latter two and those we now need two sets of users for.
So every time you can do something offline reliably, it will save you exposure to users.
And to add one more to that: if you do on-policy learning, you also need to do exploration on users.
Usually when you do A/B testing, you have a converged model.
So something that you are quite sure of that is better than random.
But if you would do it on-policy and you would do the learning online, you first need to expose the users to bad recommendations
to even learn what they want.
So that also is a big cost that you don't necessarily have in A/B testing.
Most A/B tests are not that expensive, because both variants are probably reasonable.
And if one is really bad, you can shut it off early.
Do damage control.
Makes sense.
And yes, as you have said, stochasticity of action selection is very important, and you also need to log the process.
But when it comes to logging data, I guess you had good coverage of the necessary steps and of how to get it right in the tutorial.
Let's make it broader.
Let's say we now get our product managers convinced with just what we said here: we have more reliable estimates.
What you are bringing to A/B tests is, for example, less risky or already more bulletproof than if I were just to evaluate or select my candidate based on some MRR or nDCG.
And they say, yeah, go for it.
What do you need?
What do you need to do?
So I know that's a broad question, but also maybe one that makes it more practical for those listeners sitting in companies where they are just at the very start, and maybe not as advanced as Netflix, Spotify or you folks are.
So how would you sketch a plan of how to get started with that to open up all these possibilities of doing counterfactual learning, counterfactual evaluation and so on and so forth?
That's a very good question and a very difficult question to answer one way.
I think it highly depends on what stage your system is in.
So let's start from something that doesn't exist yet, because it's much easier to reason that way.
Let's say you're building something new and you're considering taking this into account —
you want to design it with counterfactual data collection in mind.
Now, it highly depends on how big the action space is, how you can do the exploration.
Let's assume you have the 10 hotels, right?
So we have only 10 hotels and we just want to show one — a very simple example.
Then probably the best thing you can do is to show a random choice to a certain percentage of users — probably two to five percent, depending on how big your user traffic is.
Very hard to say.
Like this is highly dependent on how much data you can collect, how expensive it is to expose people to things that are maybe not optimal, how suboptimal you think some actions are.
Let's say they're all reasonable.
Then probably we can at least start with like five percent of the users, give them just a uniform random choice.
And the rest you give a deterministic choice from a model that you have trained — maybe just popularity-based, maybe even something super new.
It doesn't matter too much.
And then you basically have epsilon-greedy.
That's usually the best way to start.
People underestimate how expensive Epsilon greedy is.
Sorry, overestimate how expensive.
Yeah, it was actually about Alaska.
Yeah, it can be very expensive, but in most cases.
Depends on the Epsilon.
Depends on the Epsilon.
Depends on how large your context is.
Depends on how far apart the recommendations are.
And then the most important part here becomes to log everything correctly.
So what you want to know for sure, for any user that you log — first of all, and that's something people tend to forget — is what hotels were available.
This doesn't work if, like us, you have tens of millions of hotels — so for hotels it doesn't work like that for us.
But if you have something like banners, you just log which banners were available, because not all users can see all banners: logged-in users might be able to see different banners, and you wouldn't show a "please log in to see more" banner to a logged-in user.
So that depends on whether it's a logged-in user.
So first of all, you want to make sure that you know exactly what actions are available to this user and then you want to know which action was picked.
Of course, you want to know what the feedback was to the action, but you also want to know what the propensity was that this user would see this action.
So in this simple scenario, it's very simple.
If the user was in the greedy policy — so the model made a greedy decision — then you can log it as the greedy one.
You probably want to log what the percentage of that particular decision is —
so the 95% greedy versus the 5% epsilon — and you probably also want to know what the other options were.
So in this case, we also have epsilon, and within epsilon the chance that this arm would have been shown was one over 10.
So now we have — I know it's simple math — 95%, and then from the epsilon part we have 5% times one tenth, which is 0.5%.
95.5% is our propensity.
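The propensity arithmetic for this epsilon-greedy setup (95% greedy traffic, 5% uniform over the available arms) can be written down directly; the hotel names below are placeholders:

```python
def propensity(action, greedy_action, epsilon, n_available):
    """Propensity of `action` under an epsilon-greedy policy:
    with probability 1 - epsilon the greedy arm is shown, and with
    probability epsilon a uniform random arm is drawn from the arms
    that are currently available."""
    p = epsilon / n_available          # uniform exploration share
    if action == greedy_action:
        p += 1.0 - epsilon             # plus the greedy share, if greedy arm
    return p

# The example from the conversation: 95% greedy, 5% uniform over 10 hotels.
# The greedy arm's propensity is 95% + 0.5% = 95.5%.
print(propensity("hotel_3", "hotel_3", 0.05, 10))  # greedy arm
print(propensity("hotel_7", "hotel_3", 0.05, 10))  # non-greedy arm

# If only 5 of the 10 arms are available, the epsilon share becomes
# one over five instead of one over ten, so the numbers change.
print(propensity("hotel_7", "hotel_3", 0.05, 5))
```

This also illustrates the later point about availability: the propensity depends on how many arms could actually have been shown at decision time, which is why the available actions need to be logged too.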
So now we know that and then already we have a very good start of the system, right?
We now know exactly what action was taken, how likely that action was going to be taken.
We know what actions were the other actions that were available.
By the way, it's good to know — and that's also why I say keep in mind the actions that you have in terms of choice — because let's say half the actions are not available.
Now that epsilon part is one over five, not one over 10.
So it really quickly changes.
So you want to make sure that you always know, at that point in time, what all the available choices were and which decisions were made.
Some cases you have multiple models running next to each other.
So then you want to know what the decision was from all of these models and how big the contribution of traffic was for all of these models.
So you can get a good propensity for that.
And then you already have a very good basis.
And now — just because you said it's part of the slides, and I know it's part of the slides but don't know exactly where it is —
I'm opening them right now, because there is one obvious thing that I'm missing, and that is: never forget to log the features.
So by features, you mean the context?
Yeah, the context.
As a follow up question to that very good and comprehensive description of what we should do in the first place.
This is something that you also mentioned: business rules that sometimes kick in after you have come up with a list of recommendations, or whatever it is that you want the user to be able to interact with.
So that there are some kind of post filtering or re-ranking.
So let's assume in that 10 hotel world, there are five hotels, which are some kind of premium partners and they are maybe paying more or whatever.
I do have some business goal to promote them.
And so I might re-rank a list in a way so that I put them higher in the list or something like that.
And then my propensity of being shown doesn't reflect this properly anymore, because it was the probability of that action being chosen by my model, but not what was actually displayed to the user in the end.
So how do we deal with these business-rule-related problems downstream, where what our model proposed is different from what we finally show on the website?
Yeah, we also have a slide on this.
It's a good slide, but it's very limited in how much it explains.
And that might also be because it's extremely difficult to deal with this.
It's not a trivial thing to do.
So I want to steer away from the ranking case a little bit because if you add the ranking case, I'm sorry, not in general, but just for this particular one, it makes things so much more complicated than if you just talk about a single arm pull.
So in general, the best advice is to get the business rules first.
That's also what I said in the logging part.
Make sure you know which arms are available to serve before you start your algorithm.
And if that's not possible, then ideally those business rules are part of your policy.
So either you have them at the beginning or your policy also knows the business rules and can apply them.
But that requires those rules to adhere to the same context as the policy has.
So one reason you might want to do it after your model is because the business rules can be very expensive to execute.
So they might be heavier computationally than your model itself.
So you want to only apply them to a limited set, something that I can quickly come up with.
Let's say we talk about hotels.
Let's say it's not the case with us, but let's say it's very expensive to understand if a hotel is available.
Then you might not want to check the availability of all hotels before you start making a choice.
You'd rather just get a hotel to recommend, check if it's available, and if it's not available, you take a second one.
Something like that.
So if that's the case, then you sometimes can also include it in your policy as a post recommendation.
And then you just treat it as a post-processing step of your model.
If that's not the case, which is generally the case if we talk about these types of business rules, it's something that has some confounder that is not part of your model, something else that is influencing this decision.
Then we quickly run into problems.
I've seen things, for example, where decisions are made on a user level.
We decide for a user what to show, but then later we have some filter that also looks at the hotels or something.
There are multiple stages, but the second stage was not considered.
We don't even know what that is in the model itself.
Then you get into a mess and the best thing to do is either disentangle the mess and try to get those things out.
And if that's not possible, my advice would usually be to distinguish between treatment and intended treatment.
That is, instead of looking at what happened, you look at what you wanted to happen.
Because that's the only thing that you influence.
And then you assume that what you intended is causing the effect.
And that is probably a better estimate than what you saw happening.
So in many cases, we cannot reasonably know what the actual outcome would be.
And it's better to just act like we took our decision.
Because if it's a multi-armed bandit where you just have very clear actions, then probably it's best to actually look at what the decision was that was taken.
But then that also needs to become the propensity.
There's two ways of dealing with this if you cannot include it either at the beginning or at the end.
And that is we still can look at the intention to treat and treatment type of example.
In some cases, you might want to look at the treatment itself.
So let's say we have logs and we see what happened that might be different from what we have actually tried to do.
And if that relationship is something you can easily...
Well, we might not be able to predict it, but at least if it's a very simple relationship. For example, let's say we do something at one granularity.
So on a user level, we try to make a prediction.
But later, we start adjusting it on a hotel level.
We start making adjustments there, and then everything breaks apart.
But if we clearly take a certain action that is in our action space and let's say we predict Paris and now we do London, then probably we can still take London as a sample to train on.
But when we start doing evaluation, things get very weird.
So as long as somehow you can predict the outcome given your context that you have available, then probably it doesn't matter too much.
You can just still use your intention to treat things will work out.
But as soon as something else comes into play, yeah, you probably need to address that part to get really good results.
But things are going to get messy regardless.
You will. That's something you have to figure out for your own particular case.
Whether it's better to take whatever you were intending to treat or to look at what actually happened...
I don't have a clear answer on what would be the best way to deal with it.
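The intention-to-treat point can be made concrete with a small simulation. This is a minimal sketch, not Booking's actual system: a hypothetical uniform logging policy over two actions, a made-up business rule that overrides some decisions after logging, and a plain inverse-propensity-scored (IPS) estimator evaluated once with the intended action and once with the realized one.

```python
import random

def ips_value(logs, target_policy, use_intended=True):
    """Inverse-propensity-scored value estimate of `target_policy`.

    Each log entry holds the context, the action the logger *intended*
    (with its logged propensity), and the action actually *realized*
    after a business rule possibly overrode it.
    """
    total = 0.0
    for entry in logs:
        action = entry["intended"] if use_intended else entry["realized"]
        # Importance weight: target propensity / logging propensity.
        # The logged propensity only describes the *intended* action;
        # reusing it for the realized action is exactly the mismatch
        # discussed above.
        weight = target_policy(entry["context"], action) / entry["propensity"]
        total += weight * entry["reward"]
    return total / len(logs)

# Toy logs: a uniform logger over two actions, where a business rule
# sometimes swaps action 1 for action 0 after the decision is made.
random.seed(0)
logs = []
for _ in range(1000):
    ctx = random.random()
    intended = random.choice([0, 1])          # logged with propensity 0.5
    realized = 0 if (intended == 1 and ctx > 0.8) else intended
    reward = 1.0 if realized == 1 else 0.2
    logs.append({"context": ctx, "intended": intended,
                 "realized": realized, "propensity": 0.5, "reward": reward})

def always_one(ctx, a):
    return 1.0 if a == 1 else 0.0

print(ips_value(logs, always_one, use_intended=True))
print(ips_value(logs, always_one, use_intended=False))
```

The two estimates disagree precisely because of the entries the business rule overrode; neither is automatically right, which mirrors the "no clear answer" above.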
So definitely something to keep in mind, and it somehow fits for me into that scheme that we also talked about last time with Himan Abdollahpouri when going through the different steps of where to take care of popularity bias.
Where we basically said, okay, you might do some pre-processing of your data.
You might somehow integrate your debiasing criterion as a regularizer into your optimization criterion.
So to do some kind of in-processing or as the most effective and easiest way to do is post-processing.
So to re-rank. And here we might actually have an example which would rather advocate against doing re-ranking, as long as you can't capture what effect it has on the propensity of an item being shown.
And then rather to do things, incorporate your business rules into the pre-processing or at least into the algorithm to capture the propensity properly and then to kind of keep your logging data clean.
I think what's also a clear separator here is that if we're talking about learning, it doesn't matter all that much, right?
We have an outcome and we have a context and whether that had business rules on top of them, it doesn't really matter.
But when we start talking about evaluation, that's when the real problem comes into play because now certain samples in your data set might not be accessible for your model, but you don't know.
And there the real problem comes into play.
Let's say it's a business rule that changes the stochasticity a little bit, but it's still everything is accessible and that stochasticity is based on your context and maybe you'll be okay.
But as soon as certain actions become unavailable, or take a simple rule: let's say we have a ranker and two things cannot appear in the same ranking.
Like let's say we have themes or something and we have two themes that are almost identical.
Yeah, one is slightly better than the other.
But if they both appear, we have to cut one of them.
That's something that your system now cannot really evaluate anymore because your policy might recommend this particular setup.
But we have no examples of where this happened and we cannot get examples of where this happened.
Then everything starts breaking apart.
A lot of things in this space are different when you're talking about learning and when you're talking about evaluation, they touch upon the same sources of issues.
But sometimes during learning, we can make shortcuts a little bit.
And I'm not advocating for shortcuts, but sometimes doing something that does not fall completely within your assumptions still works.
Yeah, when you're evaluating, you don't know what the effects are going to be, because if you're learning and then evaluating in a proper way, you at least know what the effects were of what you did wrong.
But if you're also having this part in your evaluation, then you don't know anymore.
So let's say we have decided that we're going to use propensities in our learning system.
For example, all these unbiased learning-to-rank papers. Let's say we do propensity corrections in an unbiased learning-to-rank setting.
We might learn with propensity weights, but we also have our evaluation metric that is biased.
So NDCG, for example, tends to be suffering from position bias.
But I can update my model to work with these propensities and I can assume that works, for example.
But then if I evaluate it with my biased metric, it will look worse, because the biased metric favors the biased behavior.
But if I start evaluating it with my debiased metric, it will likely look better, because it's biased towards my new method, since it has the same bias correction incorporated.
And if I had the position bias estimated wrongly, I will now, you know, I will probably still get a better result.
So both are biased towards whatever method they're using so that I made some mistake maybe during learning.
I would usually figure out during evaluation, but if my evaluation is also biased, I will have no way of validating it.
So then the only way to do it is to go through an A-B test, of course.
But this makes this whole process tricky: you can make mistakes in learning, because you will figure them out in evaluation; but if you make mistakes in evaluation, you will not figure them out.
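A toy illustration of that last point, assuming a simple, entirely hypothetical examination model for position bias: clicks are generated as examination times relevance, and the same logs are scored naively versus with a 1/propensity correction.

```python
import random

# Hypothetical examination probabilities per rank position (position bias):
# users rarely look at lower positions.
EXAM = [1.0, 0.5, 0.25]
N = 10_000

def simulate_clicks(relevance, seed=1):
    """A click happens when the position is examined AND the item is relevant."""
    rng = random.Random(seed)
    clicks = [0] * len(relevance)
    for _ in range(N):
        for pos, rel in enumerate(relevance):
            if rng.random() < EXAM[pos] and rng.random() < rel:
                clicks[pos] += 1
    return clicks

relevance = [0.2, 0.2, 0.2]   # every item is equally relevant
clicks = simulate_clicks(relevance)

# Naive (biased) metric: raw click-through rate per position.
naive = [c / N for c in clicks]
# Debiased metric: weight each click by 1 / examination propensity.
debiased = [c / (N * EXAM[pos]) for pos, c in enumerate(clicks)]

print(naive)     # drops with position even though relevance is flat
print(debiased)  # roughly flat around the true relevance of 0.2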
Okay, okay.
I mean, lots of complicated considerations, which also, I guess, are still part of the active research in that field.
So having talked about business rules and how you incorporate them, how you properly log, what are other practical considerations that come to your mind that are important for bandit feedback?
Well, one thing that I found very challenging when I started out with bandit feedback, especially within the recommendation space, is rewards.
So if you read a paper, the reward is always given.
Yeah, there's always a reward.
And also that reward is likely available at any given time.
Now, that assumption is something that a lot of people are aware of.
It's not true, because rewards are delayed.
Even if it's a click, there's a delay: between the time that you serve a recommendation and the click, some time might pass. And there are more types of rewards.
In our case, a conversion, but it could also be like a real conversion because someone can book a hotel, but they can also still cancel it.
So by the time they've gotten a stay, that's another delay.
And for example, for streaming services, an unsubscribe has a very big delay and is very difficult to attribute.
So there's all these different types of signals that have different types of delays.
They have different types of attribution.
And that makes it very difficult to work with many of the principles, especially on policy bandits are very complicated here because yes, there are many papers that talk about delayed feedback and those kinds of things.
We also have done a recent project on auxiliary signals, which is different from delayed feedback.
Delayed feedback is when you have feedback that is not available when you need it.
But auxiliary signals are signals like, for example, a purchase or a conversion. That is probably our real reward, but it is not necessarily always, especially at Booking, tied to whatever recommender system was running under the hood.
So let's say we recommend someone to go to Paris and then someone booked something in Paris.
That doesn't mean that that banner helped anything.
We don't even know if they saw the banner.
We don't even know if they might already have had the intention to go to Paris.
But if they clicked on the banner, that is probably a better signal that that banner did something.
But we don't know if that's a good thing on itself.
So if we just optimize for clicks on the banner, we can get clickbait.
So we need to now somehow balance these auxiliary signals with the real signal without optimizing too hard on these auxiliary signals.
A beautiful example of making this a little bit more tangible is let's say we have social media and we have something we want to optimize something we display on social media.
Now, the question is, what type of content would you interact with most on social media?
Probably one is good content.
You like it.
You buy it.
You share it.
But also very bad content because very bad content is very funny.
So you might tag someone or share it to say, look at this weird stuff.
Yeah, but that might not result in a purchase.
But if you're looking at engagement, very bad content will probably get better engagement than mediocre content.
So all of these types of signals are very important.
Engagement is very important to measure.
But it should not be your primary objective.
How to incorporate those is something that, obviously, there are papers on, largely from industry, but it's not the first thing you run into.
You have to really search for it if you want to understand how these types of things work because most people that start with bandit feedback, they will likely do CTR.
Click through.
It's not necessarily the best thing to go for.
When you talk about delayed reward, is it also a matter of the different feedback types you can get and how they translate into reward?
So to say your suggestion, for example, was actually good when a user, let's say, bookmarked it.
But it was even better if the user purchased it.
And you might also want to take into account clicks, even though we might be running into a click bait problem there.
But you do have these different, I guess, some paper coined this as feedback channels.
I sometimes say different types of feedback.
But is this also actually a matter that is relevant for the modeling part here?
How you convert different feedback types into reward as somehow the global currency that we want to go for?
But the question is much more complex than you think.
Because if you look at, like, how are you going to give multiple types of rewards to a model?
Because a reward is, as we call it, a form of privileged information.
It's information that you only have after you make a prediction.
But you do have it when you want to learn.
And most machine learning models can only have one outcome.
So if we want to have multiple of these signals embedded into our model, we need to be clever about it.
Well, I had my intern, Danil Provodin, who was here for the last three months.
He just left three, four weeks ago.
He did actually a couple of months on this.
So I cannot share all the details yet because we're still working on finalizing it.
But I can share the premise is that, let's say, I wanted to use clicks and bookings as feedback.
Like, do I weight them differently?
Do I just have one reward where they're weighted?
Or do I train two models and somehow combine?
But then what's the weight between the two models?
Is it equal weighting? Is it multiplication?
Do I average like my previous notion of this?
There's many different ways you can do this, but most of them will likely not result in anything better or they can result in something better.
But likely not.
But there are many ways. For example, let's say we factorize.
So instead of modeling the probability of booking given context, we model the probability of booking given context and given that the person clicks, times the probability that this person will click.
That kind of factorization.
Then that potentially could be better, but it's more likely that all of those models will have an error and that error will multiply and we only get worse results.
So getting these things in is way more complicated than you think.
Most will not really get signal.
I've seen recent papers where neural networks have multiple outputs where these can be combined.
But then it's hard to really understand if the change comes from the change in architecture or the improvement comes from the introduction of the other signals.
So it's a much more complicated problem than it sounds.
Also hard to define.
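The error-multiplication argument is easy to check with numbers. A sketch with made-up probabilities, assuming both component models carry the same 20% relative error:

```python
# True underlying probabilities for one context (hypothetical numbers).
p_click = 0.10                          # P(click | context)
p_book_given_click = 0.30               # P(booking | click, context)
p_book = p_click * p_book_given_click   # factorized truth: 0.03

# Suppose every model we fit is off by 20% relative error.
err = 1.20
direct_estimate = p_book * err                               # one model, one error
factored_estimate = (p_click * err) * (p_book_given_click * err)

print(direct_estimate)    # ~20% above the truth
print(factored_estimate)  # ~44% above the truth: the errors multiply
```

With two factors the relative errors compound (1.2 × 1.2 = 1.44), which is the "that error will multiply" point above.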
But yeah, with delayed feedback it's a little bit easier to reason about, because what tends to happen is that people try to predict it.
It's quite common that proxies are used, or that we somehow know how likely it is that we get the delayed feedback at that moment in time, something like that.
But if everything is available, it still can be a problem to have multiple types of feedback.
Okay, that's definitely quite a lot.
And maybe let me wrap this up with a reference to at least two papers because kind of in every episode, we need to mention some papers.
And I've seen that you were also quite active on that front and published a paper last year together with, I guess, your colleagues from booking.com on extending Open Bandit Pipeline to simulate industry challenges.
I assume it relates a lot to the things that we have already touched on, but can you just give us a brief summary of what the content was and what you proposed to be added and that was finally added to Open Bandit Pipeline and how that reflected those industry challenges?
So when we worked on this paper, I was largely still working on policy bandits.
So they largely represent on policy challenges.
And also at that time, I was more looking into like how, what's the reason to transition from on policy, off policy or why would you use one or the other, which one is more powerful?
And I was looking for good simulations to better understand this.
And I ran into Yuta, largely Yuta Saito did a lot of work on this.
I know he's not the only one, but on an Open Bandit Pipeline, Open Bandit Pipeline just has a ton of really nice, simple to use methods to do off policy bandits.
Although, and this is funny, because when I started trying Open Bandit Pipeline, I assumed it was on-policy, for the same reason: if you search for bandit algorithms, you usually find on-policy algorithms.
It's largely a collection of off policy or counterfactual evaluation techniques and they have some really good data set, open bandit data set that has some bandit feedback collected to play with.
But one, it was lacking a good on policy set of tools.
So it was not as easy to run comparative on policy algorithms to compare it with your off policy algorithms.
And then on top of that, it was largely lacking and lacking is not the right word because it was just not their focus.
But for me, what I wanted was something that could also deal with drift.
So concept drift. A common misperception about on-policy bandit algorithms is that these are adaptive algorithms.
They're online learning algorithms.
They're not adaptive.
If you will, there's a really nice visualization that Devesh made in the Practical Bandits material on how, if all of your rewards go up during your optimization process, your bandit will likely screw up.
Anyway, to get back on my original line of thought, I implemented some tools to do on policy learning in the same setting, in the same API as the off policy learning.
And I added a couple of things among which was drift simulation so that I could at least prove to myself when things do work and do not work with these largely on policy bandit algorithms.
And it was a very good exercise to better understand how to simulate these things because I feel at least in my surroundings, I noticed people tend to not resort enough to simulations before they start doing the real thing.
Because it's just so easy to think you know how something needs to be done, and then you do it and it doesn't work.
But it's good to first try it in a simulation, very, very simple, very basic.
People overestimate how difficult it is.
Just make a very simple example of what you think you're doing and validate that under the perfect circumstances it works.
Because if you cannot make it work under the perfect circumstances, it likely will not work under suboptimal circumstances.
And this was something that we tried to visualize with open bandit pipeline.
At the same time, we really didn't have answers to most of the questions that we're asking.
We were still trying to explore.
I didn't have a full view, and I still don't, of the full range of bandit literature and methods.
So this was a way to get into that world better and also at the conference have an in to talk to people that have more experience than I have.
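The non-adaptivity point can be reproduced in a few lines, without Open Bandit Pipeline: a plain epsilon-greedy agent with running-mean value estimates, and a hypothetical drift that swaps the two arms' reward probabilities halfway through.

```python
import random

def run_epsilon_greedy(drift=False, steps=4000, eps=0.05, seed=3):
    """Epsilon-greedy on two arms. Halfway through, if `drift` is on,
    the arms swap their reward probabilities (concept drift)."""
    rng = random.Random(seed)
    counts = [0, 0]
    values = [0.0, 0.0]        # running mean reward per arm
    correct_late = 0           # pulls of the best arm after the midpoint
    for t in range(steps):
        probs = [0.8, 0.2]
        if drift and t >= steps // 2:
            probs = [0.2, 0.8]             # the world changed
        if rng.random() < eps:
            arm = rng.randrange(2)         # explore
        else:
            arm = max(range(2), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean: the estimate never forgets old data, which
        # is exactly why the policy adapts so slowly to drift.
        values[arm] += (reward - values[arm]) / counts[arm]
        if t >= steps // 2 and arm == (1 if drift else 0):
            correct_late += 1
    return correct_late / (steps - steps // 2)

print(run_epsilon_greedy(drift=False))  # high: keeps pulling the best arm
print(run_epsilon_greedy(drift=True))   # low: stuck on the stale arm
```

A decaying or windowed mean would recover faster, but the stock running-mean agent illustrates why "online learning" is not the same as "adaptive".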
Nice. And this is then how you went even further and then ended up with a tutorial, which will definitely not be the end.
But as you said, just the beginning of further tutorials coming up there, applying it to practical problems.
What else do you think?
So in that whole realm, I mean, you are working in industry, you are solving practical problems on a daily basis.
But on the other side, you also see what's going on in that exciting realm of counterfactual learning or counterfactual evaluation.
What would you say maybe as a summary, but also as an outlook?
What are the challenges that are the ones that are hard to solve or that are still unsolved or which you would say are the ones that are causing you a headache?
I think that to give you one example or one good overview is very difficult.
But in general, every time I read a paper, it doesn't matter if it's from industry or academia, and I get super excited, I really quickly learn that everything is different.
Because as I said at the beginning, at booking, we have very different reward signals than Spotify or Netflix has.
We have very different intensities of how much we know of our users.
Everything is always different.
And also we might have different business rules or even the way we represent training samples can be different because certain assumptions are different.
And then translating a very good idea from paper into something that works always requires a lot of customization.
And there's no one clear lesson, because if I would give a clear lesson on some of these things, you would likely go to your area of expertise or your product or your company.
And then you realize that we have different assumptions.
We have different setups.
But what I did learn, largely, is that it's very important to get an understanding of bandit feedback.
This is something that I've had to explain quite a bit working with bandit algorithms.
Understand that you're working with bandit feedback and that from there you need to do certain things in order to understand counterfactually what's happening.
Like, if you are working on something that is bandit feedback and you're not doing anything with exploration, not anything with making sure that you're not creating a rich-get-richer problem, then likely you are.
Yeah, some good advice definitely.
So become aware that you are dealing with bandit feedback.
In cases when you are.
And to wrap this up, looking at the RecSys field from a more broader perspective, what do you think are the greatest challenges?
Well, obviously there's many things that you could talk about here.
The thing that is most front and center for me right now is this whole area of business rules being applied on top of your decision system.
So a recommender or not.
It's something that I'm actively working with right now and cannot figure out. What I would really love is a good diagnostic that would give me an understanding of what I don't know, because there's a reason I don't know these business rules and that I don't control them.
Given that I don't know what this is, how much is it hurting what I'm doing?
I think that's something that, and I don't know if I can say it that strongly because there are definitely teams where that's not the case, but it holds for almost every ML product that I know of.
It's very common.
There's at least some rule on top of it, at least when it's a bigger system; simple small recommenders will not have that.
As soon as it gets to be a bigger system, as soon as there are more business stakeholders, as soon as it's crossing multiple countries, there will be some business rules involved that either change so much that you cannot really account for them, or they're so complex that you cannot account for them, or you just don't know that they exist. And dealing with this is much harder than I'd hope.
There's also, for example, there's so many things that can happen that are out of your control.
One simple business rule that you might not realize is a business rule, but drift is also a form of business rule.
Things just change.
How do you deal with this?
How do you measure this?
This might be something that is more commonly addressed, but there are so many different levels where something you do not control is influencing the path from X to Y.
Now, this is a pretty interesting one.
So business rules in that space and the effect they have.
And then I mean you could be happy or lucky if you are aware of them because this is basically the first step towards incorporating them into your reasoning and maybe finally also into your modeling, whatever.
But if you don't even know and can't explain what happens afterwards, then the disconnect between what really happens and what you think happens can mess up a lot of things.
So brings me actually to the point of thinking about dedicating a whole episode to the topic of business rules and recommender systems, how they fit together and where and how and when.
So if there is someone in the RecSys space who is hearing this, please reach out to me.
Also reach out to me, please, because I would love to learn.
So yeah, reach out to the both of us.
If you do have some good design process.
I mean, it's actually not a very technical thing, but it also has, I would say, technical sides, because it might also be a whole organizational question: how to enable people on the technical side or on the modeling side to be aware of what is being done and for what reason, and to then also incorporate this into the decisions that you make.
It's funny because it's both technical and non-technical. On the one hand, it's not technical, because you want non-technical users to make decisions either on top of or next to a technical system, an ML system.
But at the same time, as someone who is doing the ML... I think at Booking we are very ML-centric in many places, most places.
So usually the conversation can get started and things can be improved when these types of rules are built on top.
But there's many places where that conversation cannot even be started or even if the conversation can be started, there's just no good alternative.
So now you need to deal with it.
How to deal with it is not clear cut.
And I think that starts with, and I hope someone knows a good resource on this, even identifying, from everything you do know, maybe the treatment and the intention to treat, just those two.
Can you understand how much it hurts?
Like just from what you know, can you understand if it's even a problem that I don't know?
Sometimes it is, sometimes it isn't.
And there's some things, you know, that I do know then when it's okay, but when it's not okay, I don't know.
Yeah, makes sense.
Bram, thanks for sharing all of these thoughts and also practical experiences that you have gone through.
As for every episode, I also want to ask you within the space of personalized products and services, what is it that you like the most if it wasn't booking.com?
Wasn't booking.com.
Well, that requires me a little bit more time.
That's the necessary pause.
I think I mentioned this a little bit early.
The most beautiful thing about the recommender space for me is that it touches both the technical side of things.
So deep optimization problems and user experience design.
Like I'm, I would never say I'm a good user experience designer, but I'm very intrigued by that concept as a whole.
So you cannot build a good recommender system without fully respecting the fact that something is going to be displayed in a certain way and it affects the experience.
We don't want to talk about n=1 examples, or n=5, but it's still good to have a good anecdotal understanding of the end user. Understand that if you're doing something that doesn't make any sense to the end user, you can identify that. Even though you're working with millions of rows of data, if you pick one out, you can say: this doesn't make any sense at all.
You can understand it.
That's something that is not as clear cut in all areas of ML, but in recommender systems, I feel that that's something that is very fun for me to do.
Think about the experience of the user when they get exposed to what you're doing.
I definitely agree with what you said.
However, this is actually not mentioning a certain product or service that I was seeking for, but it's actually also not a problem.
That's fine.
I'm going to take this that way.
Which product or service did you mean?
I'm always asking: what product or service that you use do you think is doing a great job in terms of personalization?
Did I miss that in the original question or was that you hoped that I would say that?
I hope you would say that.
I'm not sure whether you missed it.
Oh man, I can tell you what I don't like.
Oh yeah, that's also a new one.
Come on.
What is it that you don't like?
I use Google Chrome on my phone, and every time I open Google Chrome, there is a list of recommended articles, and it's the most clickbait articles in the world.
And I keep falling for that trap because it's all super relevant for me.
Yeah, it's all terrible articles all the time.
You said on the phone, right?
Yeah, on my phone.
Every time I open Google Chrome on my phone and it shows me articles, I'm super intrigued by the articles, and they're always the worst of the worst.
Yeah, and I absolutely hate that.
This is so I do have that kind of news feed.
Yes, it shows up.
So actually, yeah, I'm totally with you on that side.
On the Pixel, in my case, it's very, very clickbaity.
That's right.
So if you work on that product, don't feel offended.
I understand why it happened.
Please take a look because that doesn't work at all.
I must say that it's very hard to name one where I'm really happy because if it's good, you don't notice.
I think YouTube is pretty reasonable.
Although also YouTube and Spotify, I feel like I'm so much trapped in a bubble and it's so hard to get out.
I wouldn't say there's one where I'm like, oh my god, that's always good, because if it's good, I'm usually also concerned that I'm too much in a bubble.
And if it's bad, I feel like I'm not engaged.
Okay, as a Dutchman, I can never be super positive.
Okay, agreed.
But YouTube, Spotify, you're doing a good job.
I'm happy.
Maybe last question.
Who else would you like me to invite to the show or to listen to this part of RECSPERTS?
Well, I think almost everyone in our tutorial group would be very interesting to talk to.
Okay, great.
So two done, four more remaining, if I count correctly.
I think everyone in that group would have so many interesting things to contribute.
I hope we didn't just talk about the tutorial; we talked about my experience and the tutorial.
The way we put it together, everybody has their own part of this.
Everyone has their own area that they bring to the table.
So I could recommend everyone from the tutorial.
So Olivier Jeunen, Ben London, Zahra Nazari, Ying Li and Devesh Parekh.
I think they're all added so much to my knowledge that I can only recommend them to be part of this too.
I also have a ton of people at booking that I could recommend, but I don't want to do too much self-promotion here.
So I would tell them quietly because otherwise it would be too big of a list.
There's so many good people around me that I'd be happy to have here.
No, I hope it won't be the last episode with booking.com.
So I'm also reaching out to further people because I guess there are still many things to discuss and to talk about in that sense.
But that's it for today.
It was a great coverage, nice experience sharing.
I guess people could learn a lot and they can even learn more by looking and checking out the Practical Bandits tutorial, which we will also put into the show notes as always.
And then we will all be very excited for the follow-up of your nice Practical Bandits tutorial that we are going to expect soon.
And yeah, thanks for everything, Bram.
Thank you so much for having me.
It was tons of fun.
Well, I forgot that we were recording.
Oh, nice.
Great, Bram. Have a nice day.
See you. Bye.
See you all.
Thank you so much for listening to this episode of RECSPERTS, Recommender Systems Experts, the podcast that brings you the experts in recommender systems.
If you enjoy this podcast, please subscribe to it on your favorite podcast player and please share it with anybody you think might benefit from it.
If you have questions, a recommendation for an interesting expert you want to have in my show or any other suggestions, drop me a message on Twitter or send me an email.
Thank you again for listening and sharing and make sure not to miss the next episode because people who listen to this also listen to the next episode.
