I spent this summer hacking away on some cool projects at GoIbibo. Here is a detailed description of my first project there!
Getting into the technical details, the primary aim of the project was to reduce the number of invalidations, which occur when customers find that the price of a flight has changed while navigating from the search results page to the booking page. These invalidations increase drastically during sales, and it was my responsibility to come up with an intelligent algorithm that would handle sales and reduce the number of invalidations occurring every day!
My mentor, Phani, gave me the invalidation logs, consisting of millions of invalidations, and I began my research. I plotted graphs for each route to look for relations between the various factors that might cause a sudden increase in invalidations, and tried to find a way to use this data to resolve the problem. I discussed what I had derived from the graphs with Mr Neeraj Kaul (Director of the Control team), who helped me figure out more factors that can also result in increased invalidations.
From the graphs, I was able to derive the following conclusions:
1. There is a nearly linear relation between the time left before the journey date and the time it takes for the data fetched after an invalidation to itself get invalidated.
2. The time taken for this data to expire is also inversely proportional to demand.
3. Demand generally increases on weekends, but one can sometimes observe demand increasing on weekdays as well.
4. There is no relation between the price change and the time taken for the new data to get invalidated.
I realised that this is purely a machine learning problem, so I started learning machine learning concepts from Andrew Ng’s Machine Learning course and simultaneously implemented the algorithms to predict the time that new data would take to get invalidated. I began with linear regression. I used SciPy (a Python scientific computing library) to determine the coefficients of various linear regression curves and checked the accuracy of the predictions. I also implemented a gradient descent algorithm that corrects the coefficients of the curves based on the mean squared error of the predicted values. I was able to achieve 80-90% accuracy using these algorithms.
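To make the gradient descent step concrete, here is a minimal sketch in plain Python. It is not the code I wrote at GoIbibo: the function names, the learning rate and the tiny dataset are all made up for illustration, and it fits only a single straight line of the form y = w * x + b.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.02, epochs=5000):
    """Fit y = w * x + b by minimising mean squared error with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Residuals of the current line on every training point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of MSE = (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Move the coefficients a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def mean_squared_error(xs, ys, w, b):
    """Average squared difference between predictions and observed values."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: days left before the journey vs. hours until the
# cached fare was invalidated (constructed to lie exactly on 2 * x + 0.5).
days_left = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
hours_valid = [2.5, 4.5, 6.5, 8.5, 10.5, 12.5]

w, b = fit_line_gradient_descent(days_left, hours_valid)
print(w, b, mean_squared_error(days_left, hours_valid, w, b))
```

The learning rate has to be small enough for the updates to stay stable on the chosen input scale; with un-normalised inputs such as raw seconds it would need to be much smaller, or the inputs would need scaling first.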
Still not satisfied with the performance, I decided to dive into neural networks to see if I could improve it further. The first few days of the week went into understanding multi-layer neural networks and implementing them. Basically, a neural network is used to find a pattern between the attributes and the results when we are not sure in advance what form that relation takes; the network learns it from past examples where the results are known. In this case, we have factors like demand, which itself depends on various other factors such as the time before the journey date and whether it is a weekday or a weekend, so neural networks seemed to be the best approach to this problem. I implemented single-layer and multi-layer neural networks with gradient descent for error correction. I tried varying the number of neurons and other constants such as the learning rate, but in vain: the best I could achieve was 80% accuracy. The performance also kept changing from run to run, since the weights between the neurons were initialized randomly. 😦
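For anyone curious what such a network looks like, here is a toy sketch in plain Python with one input, one tanh hidden layer and one output, trained by stochastic gradient descent. Everything in it (the function name, the layer size, the learning rate, the normalised data) is an assumption chosen for illustration and is far smaller than what I actually experimented with. The seed parameter is there to show why results vary between runs: the starting weights are random.

```python
import math
import random


def train_tiny_network(xs, ys, hidden=4, learning_rate=0.05, epochs=2000, seed=None):
    """Train a 1-input, 1-output network with one tanh hidden layer.

    Returns the mean squared error on the training data after training.
    """
    rng = random.Random(seed)
    # Randomly initialised weights: the source of run-to-run variation.
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
        return h, sum(w2[j] * h[j] for j in range(hidden)) + b2

    for _ in range(epochs):
        for x, y in zip(xs, ys):
            h, pred = forward(x)
            err = pred - y  # derivative of 0.5 * err^2 with respect to pred
            for j in range(hidden):
                # Backpropagate through the tanh unit before touching w2[j].
                grad_hidden = err * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= learning_rate * err * h[j]
                w1[j] -= learning_rate * grad_hidden * x
                b1[j] -= learning_rate * grad_hidden
            b2 -= learning_rate * err

    return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical, already normalised data (inputs and targets roughly in [0, 1]).
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]

untrained = train_tiny_network(xs, ys, epochs=0, seed=1)
trained = train_tiny_network(xs, ys, seed=1)
errors_by_seed = [train_tiny_network(xs, ys, seed=s) for s in range(3)]
print(untrained, trained, errors_by_seed)
```

Fixing the seed makes a run repeatable, while different seeds generally end at slightly different errors, which is the behaviour described above.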
A few reasons why I chose not to go forward with the neural network are as follows:
1. Since the neural network is entirely responsible for generating a pattern from the past data, and the past data included sales data that couldn’t be filtered out, the neural network was generating a wrong pattern most of the time.
2. It is difficult to correct the generated pattern based on the errors, because error correction requires a large amount of training data and takes a lot of time (around 30 minutes).
3. The neural network took too much time to learn, and it was necessary to store 360 coefficients for each route corresponding to each airline (we currently have 40,000 such combinations), which seemed to be an unnecessary waste of memory.
4. If the number of nodes in each layer is not optimal, there can be errors due to over-fitting or under-fitting of the curve. It is not possible to figure out an optimal number of nodes in each layer for 40,000 curves, which may differ from one another for various reasons.
5. The pattern generated from past data may be different from the current pattern due to reasons like a sudden change in demand, sales etc. The neural network generates a curve without considering the fact that the time left before the journey date is directly proportional to the time it takes for new data to expire, so the generated pattern may not follow this logic because of the errors brought in by the sales.
With the help of a fellow employee, Akansha Verma, I tested these results against some machine learning tools that have the machine learning algorithms built in (just in case there was some fault in my implementation of the algorithms) and found that there was not much difference between the results.
I discussed this with Neeraj Sir, and we finally concluded that we would use polynomial regression to learn from past data, dividing the data into 3 or 4 categories to improve accuracy. We also devised an algorithm that handles sales based on current scenario analysis, independent of the machine learning algorithms.
Basically, we determine true_expiry (the time for which the data is to be stored in the cache) from the polynomial regression curves generated from past data, while the logical expiry is determined from the time it took for the most recent data to expire. Expiry in the Redis cache is set with respect to true_expiry. Whenever an invalidation occurs, we check whether the time taken for the most recent data to get invalidated is far less than the calculated value. If so, the logical expiry is half the time it took for the recent data to get invalidated; otherwise the logical expiry is the same as true_expiry. A logical expiry not equal to true_expiry generally marks the beginning of a sale or a sudden increase in demand, so this algorithm ensures that sales are handled properly.
When cached data is logically expired, the cachebot hits the API to get the new price. If the price is the same as that of the logically expired data in the cache, it resets the logical expiry to the new true_expiry, calculated from the time currently left before the journey date. So when a sale ends, it goes back to using what it had learnt from past data.
With this we were able to achieve an accuracy of 95% and above, with no negative values of expiry time. If the above algorithms do by any chance generate a negative expiry time, it is recalculated using default values for the coefficients of the polynomial regression curves. Thus an intelligent cachebot was built, which considers past data analysis as well as current scenario analysis and intelligently determines the time for which new data is to be cached. 🙂
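The expiry rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the function names, the sale_ratio threshold standing in for "far less than", the units (hours in, seconds out) and the example coefficients are all hypothetical, and the polynomial coefficients are assumed to come from a regression fitted elsewhere.

```python
def evaluate_polynomial(coefficients, x):
    """Evaluate a polynomial with Horner's method (highest-degree coefficient first)."""
    result = 0.0
    for c in coefficients:
        result = result * x + c
    return result


def true_expiry_seconds(coefficients, hours_to_departure, default_coefficients):
    """Predicted cache lifetime from the regression curve for one route and airline.

    Falls back to the default curve if the fitted curve gives a negative value.
    """
    expiry = evaluate_polynomial(coefficients, hours_to_departure)
    if expiry < 0:
        expiry = evaluate_polynomial(default_coefficients, hours_to_departure)
    return expiry


def logical_expiry_seconds(true_expiry, last_observed_lifetime, sale_ratio=0.5):
    """Logical expiry chosen when an invalidation occurs.

    Assumption: "far less than" is modelled as falling below
    sale_ratio * true_expiry; the real threshold is not stated here.
    """
    if last_observed_lifetime is not None and last_observed_lifetime < sale_ratio * true_expiry:
        # Likely a sale or demand spike: trust recent behaviour, halved.
        return last_observed_lifetime / 2.0
    return true_expiry


def refreshed_logical_expiry(cached_price, fresh_price, new_true_expiry, last_observed_lifetime):
    """Logical expiry after re-fetching the price for logically expired data."""
    if fresh_price == cached_price:
        # Price unchanged: go back to what was learnt from past data.
        return new_true_expiry
    # Price changed, i.e. an invalidation (assumed to reuse the rule above).
    return logical_expiry_seconds(new_true_expiry, last_observed_lifetime)


# Hypothetical curves: 30 * hours + 600 seconds, and a broken fit that goes negative.
default_curve = [0.0, 30.0, 600.0]
broken_curve = [-1.0, 0.0, 0.0]

normal = true_expiry_seconds(default_curve, 48, default_curve)
recovered = true_expiry_seconds(broken_curve, 48, default_curve)
during_sale = logical_expiry_seconds(normal, last_observed_lifetime=300)
quiet_period = logical_expiry_seconds(normal, last_observed_lifetime=1800)
print(normal, recovered, during_sale, quiet_period)
```

Keeping the two expiries separate is the point of the design: the longer true_expiry bounds how long an entry lives in the cache, while the shorter logical expiry only decides when to re-check the price.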
The whole week was spent testing the performance of the algorithm devised above and fixing bugs in the code. 🙂