This is a report on the Supermarket Basket Analysis by Max Philipp, Ceyda Ugur and Vera Weidmann. I found this interesting project on GitHub and wrote a report below about their analysis.
The repository covers a competition posted on Kaggle. The project kicked off in May 2017 with three months allotted, which puts the deadline at the end of July 2017. The data tables are provided by Instacart.
Instacart, a grocery ordering and delivery app, aims to make it easy for customers to fill their refrigerator and pantry with their personal favorites and staples when they need them. After customers select products through the Instacart app, personal shoppers review the order and do the in-store shopping and delivery.
The purpose of this project is to predict users’ next orders based on their orders over time.
The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 users. For each user, Kaggle provides between 4 and 100 of their orders, with the sequence of products purchased in each order. The day of the week and hour of day at which each order was placed are also provided, along with a relative measure of time between orders.
To make the content of the csv files provided by Kaggle, and the relationships between them, easier to understand, a schema connecting the data tables has been visualized in the style of an SQL architecture. All data types are specified, and the schema shows the origin of each column.
As the schema shows, there are 6 csv files, which together describe a relational set of customers’ orders over time. Each entity (customer, product, order, aisle, etc.) has an associated unique id. The most important tables for the project task are order_products_prior and order_products_train. Both are linked to the orders and products tables; in other words, the products and orders tables feed order_products_prior and order_products_train. These two tables matter most because they contain the “reordered” column, which is the basis for predicting customers’ next orders.
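The relationships in the schema can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the rows are made-up toy data, and every column name other than “reordered” (order_id, product_id, user_id, product_name, aisle_id, order_number) is an assumption about how the csv files are laid out.

```python
# Toy rows standing in for the csv files; all values are made up for illustration.
# Column names other than "reordered" are assumptions.
products = {
    1: {"product_name": "Banana", "aisle_id": 24},
    2: {"product_name": "Organic Whole Milk", "aisle_id": 84},
}
orders = {
    10: {"user_id": 100, "order_number": 1},
    11: {"user_id": 100, "order_number": 2},
}
order_products = [
    {"order_id": 10, "product_id": 1, "reordered": 0},
    {"order_id": 10, "product_id": 2, "reordered": 0},
    {"order_id": 11, "product_id": 1, "reordered": 1},
]


def join_rows(order_products, orders, products):
    """Attach order and product attributes to every order-product row."""
    joined = []
    for row in order_products:
        merged = dict(row)
        merged.update(orders[row["order_id"]])      # follow the order_id link
        merged.update(products[row["product_id"]])  # follow the product_id link
        joined.append(merged)
    return joined


def reorder_rate(rows):
    """Share of rows flagged as reordered (0.0 for an empty input)."""
    if not rows:
        return 0.0
    return sum(r["reordered"] for r in rows) / len(rows)


joined = join_rows(order_products, orders, products)
rate = reorder_rate(joined)
print(joined[2]["product_name"], joined[2]["user_id"], round(rate, 3))
```

Once each row has been joined to its order and product, a quantity such as the reorder rate can be computed per user, per product or per aisle by filtering the joined rows first.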
There are 134 aisles in total. Each aisle groups products of the same type. In our visualization, generated with Tableau, the aisle fresh fruits clearly contains a very large number of products. Our assumption is that customers have many options to choose from in this aisle, which might in turn increase its reorder rates.
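One way to reproduce such a per-aisle count outside Tableau is sketched below. The toy rows and the aisle_id column name are assumptions for illustration, and the sketch counts distinct products per aisle, which may differ from the measure used in the original visualization.

```python
from collections import Counter

# Toy data; ids, product names and the aisle_id column are assumptions.
aisles = {1: "fresh fruits", 2: "fresh vegetables", 3: "milk"}
products = [
    {"product_name": "Banana", "aisle_id": 1},
    {"product_name": "Strawberries", "aisle_id": 1},
    {"product_name": "Large Lemon", "aisle_id": 1},
    {"product_name": "Spinach", "aisle_id": 2},
    {"product_name": "Cucumber", "aisle_id": 2},
    {"product_name": "Whole Milk", "aisle_id": 3},
]


def products_per_aisle(products, aisles):
    """Return (aisle name, product count) pairs, largest aisle first."""
    counts = Counter(aisles[p["aisle_id"]] for p in products)
    return counts.most_common()


ranking = products_per_aisle(products, aisles)
print(ranking)
```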
The bar chart below shows the products that are contained in more than 64,000 baskets. These products are ordered very often and are therefore considered “frequent”. The chart is dominated by fruits and vegetables, which supports our assumption that “fresh fruits” is probably the aisle customers buy from the most. This is also related to the large number of products that the fresh fruits aisle contains. Banana in particular stands out as the product in highest demand.
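The “frequent” criterion can be written down as a short sketch: count the number of baskets each product appears in and keep the products above a threshold. The function name, the toy baskets and the toy threshold of 2 are hypothetical; the report uses a threshold of 64,000 on the full data.

```python
from collections import Counter


def frequent_products(baskets, min_baskets):
    """Return products that appear in more than `min_baskets` baskets."""
    counts = Counter()
    for basket in baskets:
        counts.update(set(basket))  # count each product once per basket
    return {name: n for name, n in counts.items() if n > min_baskets}


# Toy baskets, made up for illustration.
baskets = [
    ["Banana", "Strawberries"],
    ["Banana", "Spinach"],
    ["Banana", "Strawberries", "Banana"],
    ["Spinach"],
]

frequent = frequent_products(baskets, min_baskets=2)
print(frequent)
```

Converting each basket to a set ensures that a product listed twice in the same order is counted only once, so the result is a basket count rather than a count of items sold.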
In our exploratory analysis, we saw that 262,464 users have reordered products whose names contain the word “Organic”. We also found that there are 5,035 organic products in total. This result does not surprise us, as people’s interest in organic food has increased in the last few years.
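A keyword filter of this kind could look like the sketch below. It assumes rows that have already been joined so that each carries a user_id, a product_name and the “reordered” flag; those column names other than “reordered”, the function name and the toy rows are all hypothetical, and the figures above cannot be reproduced from it.

```python
def organic_stats(rows, keyword="Organic"):
    """Count distinct products containing `keyword` and users who reordered one."""
    organic_products = set()
    reordering_users = set()
    for row in rows:
        if keyword in row["product_name"]:
            organic_products.add(row["product_name"])
            if row["reordered"] == 1:
                reordering_users.add(row["user_id"])
    return len(organic_products), len(reordering_users)


# Toy joined rows, made up for illustration.
rows = [
    {"user_id": 1, "product_name": "Organic Banana", "reordered": 1},
    {"user_id": 1, "product_name": "Organic Baby Spinach", "reordered": 0},
    {"user_id": 2, "product_name": "Banana", "reordered": 1},
    {"user_id": 3, "product_name": "Organic Banana", "reordered": 1},
    {"user_id": 3, "product_name": "Organic Banana", "reordered": 1},
]

n_products, n_users = organic_stats(rows)
print(n_products, n_users)
```

The match is case-sensitive, so a product named “organic oats” would be missed; whether that matters depends on how consistently the product names are capitalized.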
From the public data on GitHub, I found that orders peak at around 11 am. This suggests a possible strategy: advertising space on the website could be sold at its highest price around this time, because that is when many people place their orders.
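Finding the peak hour amounts to building a histogram over the hour of day of each order. The sketch below uses a made-up list of hours; on the real data these values would presumably come from an hour-of-day column in the orders table, which is an assumption here.

```python
from collections import Counter


def orders_per_hour(hours):
    """Return a list of 24 counts, one per hour of the day (0 to 23)."""
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]


# Toy hours of day, one entry per order; values are made up for illustration.
hours = [9, 10, 11, 11, 11, 12, 15, 11, 10]

histogram = orders_per_hour(hours)
peak_hour = max(range(24), key=lambda h: histogram[h])
print(peak_hour, histogram[peak_hour])
```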
Steffen Rendle, Christoph Freudenthaler, Lars Schmidt-Thieme (2010): Factorizing Personalized Markov Chains for Next-Basket Recommendation
Shengxian Wan, Yanyan Lan, Pengfei Wang, Jiafeng Guo, Jun Xu, Xueqi Cheng (2015): Next Basket Recommendation with Neural Networks
Jakob Aungiers (2016): LSTM Neural Network for Time Series Prediction. URL: http://www.jakob-aungiers.com/articles/a/LSTM-Neural-Network-for-Time-Series-Prediction