I spent this summer hacking away on some cool projects at GoIbibo. Here is a detailed description of my first project there!
Getting into the technical details, the primary aim of the project was to reduce the number of invalidations, which occur when customers find that the price of a flight has changed while navigating from the search results page to the booking page. These invalidations increase drastically during sales, and it was my responsibility to come up with an intelligent algorithm that would handle sales and reduce the number of invalidations occurring every day!
My mentor, Phani, gave me the invalidation logs, consisting of millions of invalidations, and I began my research. I plotted graphs for each route to look for relations between the various factors that might cause a sudden increase in invalidations, and tried to find a way to use this data to resolve the problem. I discussed what I had derived from the graphs with Mr Neeraj Kaul (Director of the Control team), who helped me figure out more factors that can also result in increased invalidations.
From the graphs, I was able to derive the following conclusions.
1. There is a nearly linear relation between the time left until the journey date and the time it takes for the fresh data obtained after an invalidation to get invalidated in turn.
2. Time taken for this data to expire is also inversely proportional to demand.
3. Demand generally increases on weekends, but it sometimes increases on weekdays as well.
4. There is no relation between the price change and time taken for the new data to get invalidated.
I realised that this is purely a machine learning problem, so I started learning machine learning concepts from Andrew Ng’s Machine Learning course and simultaneously implemented the algorithms to predict the time a new search result would take to get invalidated. I began with polynomial regression curves. I used scipy (a Python scientific computing library) to determine the coefficients of various polynomial regression curves for each route-vendor combination and checked the accuracy of the predictions. I also implemented a gradient descent algorithm that corrects the coefficients of the curves based on the mean squared error of the predicted values. With the last search result’s invalidation time, the day and the month as input features, polynomial regression achieved only 80-85% accuracy. It was also unable to detect and handle short-duration sales or sales in festival seasons.
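To make the gradient descent step concrete, here is a minimal sketch of how polynomial coefficients can be corrected using the mean squared error. It is not the code I ran at GoIbibo: it uses only the Python standard library instead of scipy, a single input feature scaled to the range 0 to 1, and made-up sample data. The function names, learning rate and step count are all assumptions for illustration.

```python
def predict(coeffs, x):
    """Evaluate the polynomial c0 + c1*x + c2*x^2 + ... at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def mean_squared_error(coeffs, xs, ys):
    """Average squared difference between predictions and observations."""
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, degree=2, learning_rate=0.1, steps=5000):
    """Fit polynomial coefficients by repeatedly stepping against the MSE gradient.

    learning_rate and steps are illustrative values, not tuned ones.
    """
    coeffs = [0.0] * (degree + 1)
    n = len(xs)
    for _ in range(steps):
        errors = [predict(coeffs, x) - y for x, y in zip(xs, ys)]
        # Partial derivative of the MSE with respect to each coefficient.
        grads = [
            2.0 / n * sum(e * x ** i for e, x in zip(errors, xs))
            for i in range(degree + 1)
        ]
        coeffs = [c - learning_rate * g for c, g in zip(coeffs, grads)]
    return coeffs


# Made-up data: x is the scaled time left before the journey, y the time to invalidation.
xs = [i / 10 for i in range(11)]
ys = [1.0 + 2.0 * x for x in xs]
line = gradient_descent(xs, ys, degree=1)
```

Scaling the input to a small range matters here: with raw values such as hours before departure, the same learning rate would make the updates diverge.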
Still not satisfied with the performance, I decided to dive into neural network concepts and see if I could improve it further. The first few days of the week went into understanding multi-layer neural networks and implementing them. Basically, neural networks are used to find a pattern between the attributes and the results when we are not sure what that pattern or relation is, i.e. the network learns the mapping from examples rather than from a formula we specify. (Since we train on past invalidation times as known outputs, this is supervised learning.) In this case, demand depends on several interdependent factors such as time before the journey date, day, is_weekend, is_festival_season, number of flights recently booked etc., so neural networks seemed to be the best approach to this problem.
Based on the demand for flights on each route, we grouped routes with similar demand into 4 clusters using K-Means clustering. We computed the Pearson correlation coefficient between lagged observations and the current output, and also between the various input features. Based on this, we built a windowed MultiLayer Perceptron (MLP) model with input features such as days before the journey date, number of flights booked for that route since the last price update, day, is_weekend, month, is_festival_season, the average of the last 3 search results’ invalidation times, and the observations for the last P price updates (the value of P varies across the models built for different clusters). The MLP model gave satisfactory prediction results, so we incorporated it into our bot.
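As a rough illustration of the lag analysis and the windowing, the sketch below computes a Pearson coefficient between a series and its lagged copy, and builds input rows that combine per-observation features with the last P observations. The function names, the sample series and the feature layout are hypothetical, and the real feature set was larger than this.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def lagged_correlation(series, lag):
    """Correlation between each observation and the one `lag` steps before it."""
    return pearson(series[:-lag], series[lag:])


def make_windows(series, extra_features, p):
    """Build (inputs, targets) for a windowed model.

    Each input row is extra_features[t] followed by the p observations
    that came before series[t]; the target is series[t] itself.
    """
    rows, targets = [], []
    for t in range(p, len(series)):
        rows.append(list(extra_features[t]) + list(series[t - p:t]))
        targets.append(series[t])
    return rows, targets


# Made-up invalidation times (minutes) and [days_before_journey, is_weekend] features.
times = [10, 12, 14, 16, 18, 20]
features = [[30, 0], [29, 0], [28, 1], [27, 1], [26, 0], [25, 0]]
rows, targets = make_windows(times, features, p=2)
```

A strong correlation at a given lag suggests that observation is worth including in the window, which is one way the value of P could be chosen for each cluster. The resulting rows and targets can then be fed to any MLP regressor.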
Although the windowed MLP could handle the major sales that happen during the festival season every year, it could not handle the short-duration sales that happen quite frequently.
I discussed this with Neeraj Sir, and we concluded that we would use the predictions of the windowed MultiLayer Perceptron model only when no short-duration sale was happening. We also devised an algorithm that detects such sales based on current scenario analysis. We determine true_expiry (the time for which the data is to be stored in the cache) from the model built on past data, and the logical expiry from the time it took for the most recent data to expire. The expiry in the Redis cache is set with respect to true_expiry. Whenever an invalidation occurs, we check whether the time taken for the most recent data to get invalidated is far less than the calculated value. If it is, the logical expiry is set to half the time it took for the recent data to get invalidated; otherwise the logical expiry is the same as true_expiry. A logical expiry that differs from true_expiry generally marks the beginning of a sale or a sudden increase in demand, and this is how the algorithm ensures that sales are handled properly. When cached data logically expires, the bot hits the API to get the new price. If the price is the same as in the logically expired cache entry, the logical expiry is reset to the new true_expiry, calculated from the time currently left until the journey date. So when a sale ends, the bot goes back to using what it had learnt from past data. With this we were able to achieve satisfactory accuracy during the testing period, with no negative values of expiry time. Thus an intelligent cachebot was built which considers past data analysis as well as current scenario analysis and intelligently determines the time for which a new search result is to be cached. 🙂
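The expiry rules above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, all times are in seconds, and the sale_ratio threshold is my stand-in for "far less than the calculated value", since the exact cut-off is not given here. How a changed price is handled on logical expiry is also my reading of the rules, namely that it counts as an invalidation.

```python
def logical_expiry(true_expiry, recent_lifetime, sale_ratio=0.5):
    """Pick the logical expiry after an invalidation.

    true_expiry     -- expiry predicted by the model built on past data
    recent_lifetime -- how long the most recent data survived before invalidation
    sale_ratio      -- assumed threshold for "far less than predicted"
    """
    if recent_lifetime < sale_ratio * true_expiry:
        # Data is dying much faster than predicted: likely a sale or a
        # demand spike, so refresh after half the observed lifetime.
        return recent_lifetime / 2
    return true_expiry


def on_logical_expiry(cached_price, fresh_price, new_true_expiry, recent_lifetime):
    """Decide the next logical expiry once the cached entry has logically expired."""
    if fresh_price == cached_price:
        # Price is unchanged, so fall back to the model's prediction.
        return new_true_expiry
    # Price changed: treat it as an invalidation and re-apply the rule.
    return logical_expiry(new_true_expiry, recent_lifetime)
```

For example, with a predicted true_expiry of 3600 seconds and recent data that lasted only 600 seconds, the logical expiry drops to 300 seconds; once a refresh finds the price unchanged, it returns to the model's value.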
The whole week was spent testing the performance of the algorithm devised above and tuning the model. 🙂