Description

In this competition, Instacart is challenging the Kaggle community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.

Getting the Data

## Read 32434489 rows and 4 (of 4) columns from 0.538 GB file in 00:00:09
## Read 3421083 rows and 7 (of 7) columns from 0.101 GB file in 00:00:03

Names assigned to the data tables:

  • “aisles.dt”
  • “departments.dt”
  • “order_products__prior.dt”
  • “order_products__train.dt”
  • “orders.dt”
##    order_id product_id add_to_cart_order reordered
## 1:        2      33120                 1         1
## 2:        2      28985                 2         1
## 3:        2       9327                 3         0
## 4:        2      45918                 4         1
## 5:        2      30035                 5         0
## 6:        2      17794                 6         1
##    order_id user_id eval_set order_number order_dow order_hour_of_day
## 1:  2539329       1    prior            1         2                 8
## 2:  2398795       1    prior            2         3                 7
## 3:   473747       1    prior            3         3                12
## 4:  2254736       1    prior            4         4                 7
## 5:   431534       1    prior            5         4                15
## 6:  3367565       1    prior            6         2                 7
##    days_since_prior_order
## 1:                     NA
## 2:                     15
## 3:                     21
## 4:                     29
## 5:                     28
## 6:                     19
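A minimal sketch of how a table like orders.dt might be read with data.table's fread(). The notebook does not show its file paths, so a small temporary CSV with the same columns as the output above stands in for orders.csv; the path and the three rows are assumptions for illustration only.

```r
library(data.table)

# Stand-in for orders.csv: a tiny temporary file with the same header as the
# orders output above (the real path is not shown in the notebook).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order",
  "2539329,1,prior,1,2,8,",
  "2398795,1,prior,2,3,7,15",
  "473747,1,prior,3,3,12,21"
), csv_path)

# fread() detects the separator and column types; the empty field in the
# numeric column days_since_prior_order is read as NA.
orders.dt <- fread(csv_path)

head(orders.dt)
```

The other tables would presumably be read the same way, one fread() call per file.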

Mapping orders to products for the prior set

  • Get the mapping between each order_id and its associated products for a particular user
  • Build new data tables that hold:
    • user_id, order_id and the count of products for each order_id
    • the data table corresponding to orders.csv (filtered to the prior set), split into multiple sub data tables by user_id
    • the products for each order in each of those sub tables, collected into a list that is stored alongside each order
    • Helper functions break the user IDs (~206k) into 207 chunks of up to 1000 each so they can be processed faster
    • chks is a list of length 1 that contains another list of length 207
    • Avoid creating a separate variable per chunk: it is wiser to process each chunk and return the output inside the function itself
## Loading required package: plyr
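The chunking and per-order mapping described above can be sketched as follows. This is not the notebook's own code: the helper name chunk_ids, the toy tables, and the column names products and product_count are assumptions chosen for illustration.

```r
library(data.table)

# Toy stand-ins for the prior orders and their products (illustrative values)
orders_prior <- data.table(order_id = c(11L, 12L, 21L, 31L),
                           user_id  = c(1L, 1L, 2L, 3L))
order_products <- data.table(order_id   = c(11L, 11L, 12L, 21L, 31L, 31L, 31L),
                             product_id = c(196L, 10258L, 196L, 32792L, 5L, 6L, 7L))

# Hypothetical helper: split a vector of user ids into chunks of at most `size`
chunk_ids <- function(ids, size = 1000L) {
  split(ids, ceiling(seq_along(ids) / size))
}

# One row per (user, order): products gathered in a list column, plus their count
order_map <- merge(orders_prior, order_products, by = "order_id")[
  , .(products = list(product_id), product_count = .N),
  by = .(user_id, order_id)]

# Process chunk by chunk and bind the results, instead of one variable per chunk
chunks <- chunk_ids(unique(order_map$user_id), size = 2L)
result <- rbindlist(lapply(chunks, function(ids) order_map[user_id %in% ids]))

# With chunks of 1000, any user count between 206001 and 207000 gives 207 chunks;
# 206209 is an assumed count consistent with the "~206k" mentioned above.
length(chunk_ids(seq_len(206209L)))
```

Returning the bound result from a single lapply() call keeps the workspace free of per-chunk variables, which is the point made in the last bullet.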

Making sense of the data with plots

  • Get the order count for each user
  • Check and plot the 10% of user_ids with the highest number of orders
  • Get a quantile distribution of the orders, e.g. how many users fall under the 95th percentile of order counts
  • About 95% of users have roughly 50 orders or fewer (the 95th percentile below is 51 and the maximum is 99), so presumably a user who has already ordered 50 times is unlikely to place many more orders. Plotted the QQ distribution plot

##    user_id order_count
## 1:     210          99
## 2:     310          99
## 3:     313          99
## 4:     690          99
## 5:     786          99
## 6:     964          99
##   0%   5%  10%  15%  20%  25%  30%  35%  40%  45%  50%  55%  60%  65%  70% 
##    3    3    3    4    4    5    6    6    7    8    9   11   12   14   16 
##  75%  80%  85%  90%  95% 100% 
##   19   23   28   37   51   99
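The order counts and the 5%-step quantiles shown above could be computed along these lines. This is a sketch on toy data, not the notebook's code; the table name user_orders and the toy values are assumptions.

```r
library(data.table)

# Toy orders table: user 1 has 3 orders, user 2 has 5, user 3 has 4 (illustrative)
orders <- data.table(order_id = 1:12,
                     user_id  = rep(c(1L, 2L, 3L), times = c(3L, 5L, 4L)))

# Orders per user, largest first
user_orders <- orders[, .(order_count = .N), by = user_id][order(-order_count)]

# Quantiles of the per-user order count in 5% steps
q <- quantile(user_orders$order_count, probs = seq(0, 1, by = 0.05))

# Users at or above the 90th percentile, i.e. roughly the top 10% by order count
top_users <- user_orders[order_count >= q[["90%"]]]

print(q)
print(top_users)
```

On the real data, a call such as qqnorm(user_orders$order_count) is one way to draw a QQ plot of the counts.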

  • Get the product info for each order. A merge gets this done very quickly:
    • Used an inner join: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
    • Merge the products data table into the resulting table as well, to get a better understanding of the products
    • DT[i, j, by]: take DT, subset rows using i, then calculate j, grouped by by.
    • The order_prior.dt table is used here for evaluation before the predictions
    • Plot only the top 20 products, since even the top 1% is too many to plot
##     order_id user_id eval_set order_number order_dow order_hour_of_day
##  1:        2  202279    prior            3         5                 9
##  2:        2  202279    prior            3         5                 9
##  3:        2  202279    prior            3         5                 9
##  4:        2  202279    prior            3         5                 9
##  5:        2  202279    prior            3         5                 9
##  6:        2  202279    prior            3         5                 9
##  7:        2  202279    prior            3         5                 9
##  8:        2  202279    prior            3         5                 9
##  9:        2  202279    prior            3         5                 9
## 10:        3  205970    prior           16         5                17
## 11:        3  205970    prior           16         5                17
## 12:        3  205970    prior           16         5                17
## 13:        3  205970    prior           16         5                17
## 14:        3  205970    prior           16         5                17
## 15:        3  205970    prior           16         5                17
## 16:        3  205970    prior           16         5                17
## 17:        3  205970    prior           16         5                17
## 18:        4  178520    prior           36         1                 9
## 19:        4  178520    prior           36         1                 9
## 20:        4  178520    prior           36         1                 9
##     days_since_prior_order product_id add_to_cart_order reordered
##  1:                      8      33120                 1         1
##  2:                      8      28985                 2         1
##  3:                      8       9327                 3         0
##  4:                      8      45918                 4         1
##  5:                      8      30035                 5         0
##  6:                      8      17794                 6         1
##  7:                      8      40141                 7         1
##  8:                      8       1819                 8         1
##  9:                      8      43668                 9         0
## 10:                     12      33754                 1         1
## 11:                     12      24838                 2         1
## 12:                     12      17704                 3         1
## 13:                     12      21903                 4         1
## 14:                     12      17668                 5         1
## 15:                     12      46667                 6         1
## 16:                     12      17461                 7         1
## 17:                     12      32665                 8         1
## 18:                      7      46842                 1         0
## 19:                      7      26434                 2         1
## 20:                      7      39758                 3         1
##     order_id user_id eval_set order_number order_dow order_hour_of_day
##  1:     1107   38259    prior            2         1                11
##  2:     5319  196224    prior           65         1                14
##  3:     7540  138499    prior            8         0                14
##  4:     9228   79603    prior            2         2                10
##  5:     9273   50005    prior            1         1                15
##  6:     9696  108919    prior           46         5                16
##  7:    11140   63782    prior            4         1                14
##  8:    11485  170451    prior            5         5                18
##  9:    12672  106854    prior           28         5                10
## 10:    13668  181127    prior           10         4                17
## 11:    14668  106519    prior           18         0                22
## 12:    16132  145505    prior           64         2                10
## 13:    18303  139923    prior           16         0                23
## 14:    19479  110984    prior            2         2                12
## 15:    19569  180719    prior           84         4                19
## 16:    19879  142388    prior            7         3                 5
## 17:    19939  200356    prior           11         2                12
## 18:    19989    4122    prior           11         2                 7
## 19:    23202  162394    prior            2         3                12
## 20:    24367  148255    prior           18         6                17
##     days_since_prior_order product_id add_to_cart_order reordered
##  1:                      7          1                 7         0
##  2:                      1          1                 3         1
##  3:                      7          1                 4         1
##  4:                     30          1                 2         0
##  5:                     NA          1                30         0
##  6:                      8          1                 5         1
##  7:                     14          1                 1         1
##  8:                     24          1                 4         0
##  9:                      6          1                 3         1
## 10:                      9          1                 4         1
## 11:                      7          1                13         1
## 12:                      5          1                 1         1
## 13:                     14          1                 1         1
## 14:                     20          1                 5         0
## 15:                      0          1                 9         1
## 16:                     15          1                12         0
## 17:                      0          1                 2         0
## 18:                      8          1                 1         0
## 19:                      7          1                 3         0
## 20:                      6          1                 2         1
##                   product_name aisle_id department_id
##  1: Chocolate Sandwich Cookies       61            19
##  2: Chocolate Sandwich Cookies       61            19
##  3: Chocolate Sandwich Cookies       61            19
##  4: Chocolate Sandwich Cookies       61            19
##  5: Chocolate Sandwich Cookies       61            19
##  6: Chocolate Sandwich Cookies       61            19
##  7: Chocolate Sandwich Cookies       61            19
##  8: Chocolate Sandwich Cookies       61            19
##  9: Chocolate Sandwich Cookies       61            19
## 10: Chocolate Sandwich Cookies       61            19
## 11: Chocolate Sandwich Cookies       61            19
## 12: Chocolate Sandwich Cookies       61            19
## 13: Chocolate Sandwich Cookies       61            19
## 14: Chocolate Sandwich Cookies       61            19
## 15: Chocolate Sandwich Cookies       61            19
## 16: Chocolate Sandwich Cookies       61            19
## 17: Chocolate Sandwich Cookies       61            19
## 18: Chocolate Sandwich Cookies       61            19
## 19: Chocolate Sandwich Cookies       61            19
## 20: Chocolate Sandwich Cookies       61            19
##        0%        1%        2%        3%        4%        5%        6% 
##      1.00      3.00      3.00      4.00      4.00      5.00      5.00 
##        7%        8%        9%       10%       11%       12%       13% 
##      6.00      6.00      7.00      7.00      8.00      8.00      9.00 
##       14%       15%       16%       17%       18%       19%       20% 
##      9.00     10.00     10.00     11.00     12.00     12.00     13.00 
##       21%       22%       23%       24%       25%       26%       27% 
##     14.00     15.00     15.00     16.00     17.00     18.00     19.00 
##       28%       29%       30%       31%       32%       33%       34% 
##     20.00     21.00     22.00     23.00     24.00     26.00     27.00 
##       35%       36%       37%       38%       39%       40%       41% 
##     28.00     30.00     31.00     33.00     34.00     36.00     38.00 
##       42%       43%       44%       45%       46%       47%       48% 
##     40.00     42.00     44.00     47.00     49.00     51.00     54.00 
##       49%       50%       51%       52%       53%       54%       55% 
##     56.00     60.00     63.00     66.00     69.00     73.00     77.00 
##       56%       57%       58%       59%       60%       61%       62% 
##     81.00     86.00     91.00     96.00    102.00    108.00    114.00 
##       63%       64%       65%       66%       67%       68%       69% 
##    120.00    127.00    135.00    143.00    152.00    162.00    173.00 
##       70%       71%       72%       73%       74%       75%       76% 
##    185.00    198.00    211.00    226.00    242.00    260.00    278.00 
##       77%       78%       79%       80%       81%       82%       83% 
##    299.52    320.00    346.00    377.00    409.00    446.00    487.08 
##       84%       85%       86%       87%       88%       89%       90% 
##    533.00    584.00    646.00    719.00    803.88    905.00   1021.00 
##       91%       92%       93%       94%       95%       96%       97% 
##   1170.16   1340.92   1554.00   1864.76   2286.00   2869.80   3811.16 
##       98%       99%      100% 
##   5477.88   9931.16 472565.00

##     order_id user_id product_count
##  1:     1107   38259            17
##  2:     5319  196224             7
##  3:     7540  138499             7
##  4:     9228   79603             4
##  5:     9273   50005            33
##  6:     9696  108919            10
##  7:    11140   63782             3
##  8:    11485  170451             7
##  9:    12672  106854             4
## 10:    13668  181127            10
## 11:    14668  106519            15
## 12:    16132  145505             4
## 13:    18303  139923             1
## 14:    19479  110984             5
## 15:    19569  180719            14
## 16:    19879  142388            13
## 17:    19939  200356             3
## 18:    19989    4122             2
## 19:    23202  162394             4
## 20:    24367  148255             6
##     order_id user_id product_count
##  1:  2550362       1             9
##  2:   431534       1             8
##  3:  2295261       1             6
##  4:  2398795       1             6
##  5:  3108588       1             6
##  6:   473747       1             5
##  7:   550135       1             5
##  8:  2254736       1             5
##  9:  2539329       1             5
## 10:  3367565       1             4
## 11:  1718559       2            26
## 12:  1199898       2            21
## 13:  3186735       2            19
## 14:   788338       2            16
## 15:   839880       2            16
## 16:  1402090       2            15
## 17:  3194192       2            14
## 18:  1673511       2            13
## 19:   738281       2            13
## 20:  2168274       2            13
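The merges and the DT[i, j, by] aggregations behind the tables in this section could look roughly like the sketch below. The toy ids, the placeholder product names, and the variable names are assumptions; only the column names follow the output above.

```r
library(data.table)

# Toy tables (made-up ids; product names are placeholders)
orders <- data.table(order_id = c(1L, 2L, 3L), user_id = c(100L, 200L, 300L))
order_products <- data.table(order_id   = c(1L, 1L, 2L, 3L, 3L, 9L),
                             product_id = c(10L, 20L, 10L, 10L, 30L, 20L))
products <- data.table(product_id   = c(10L, 20L, 30L),
                       product_name = c("product A", "product B", "product C"))

# Inner join (all = FALSE is the default): order_id 9 has no match in orders
# and is dropped from the result
order_prior <- merge(orders, order_products, by = "order_id")
order_prior <- merge(order_prior, products, by = "product_id")

# DT[i, j, by]: how often each product was ordered, most frequent first
product_counts <- order_prior[, .(order_count = .N),
                              by = .(product_id, product_name)][order(-order_count)]
top_products <- head(product_counts, 20)

# Number of products in each order
per_order <- order_prior[, .(product_count = .N), by = .(order_id, user_id)]
```

head(product_counts, 20) is what would feed a top-20 plot; plotting the full top 1% would mean hundreds of bars on the real data.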

Habit analysis of each user based on the sequence of products ordered

  • Broke the data set down to analyze which user orders which product most often
  • Tried tapply and data table methods, ran into memory issues, and so had to break the data table into smaller pieces
##    order_id user_id order_number order_dow order_hour_of_day product_id
## 1:     1107   38259            2         1                11          1
## 2:     5319  196224           65         1                14          1
## 3:     7540  138499            8         0                14          1
## 4:     9228   79603            2         2                10          1
## 5:     9273   50005            1         1                15          1
## 6:     9696  108919           46         5                16          1
##                  product_name
## 1: Chocolate Sandwich Cookies
## 2: Chocolate Sandwich Cookies
## 3: Chocolate Sandwich Cookies
## 4: Chocolate Sandwich Cookies
## 5: Chocolate Sandwich Cookies
## 6: Chocolate Sandwich Cookies
##     order_id user_id order_number order_dow order_hour_of_day product_id
##  1:   431534       1            5         4                15        196
##  2:   431534       1            5         4                15      10258
##  3:   431534       1            5         4                15      10326
##  4:   431534       1            5         4                15      12427
##  5:   431534       1            5         4                15      13176
##  6:   431534       1            5         4                15      17122
##  7:   431534       1            5         4                15      25133
##  8:   431534       1            5         4                15      41787
##  9:   473747       1            3         3                12        196
## 10:   473747       1            3         3                12      10258
## 11:   473747       1            3         3                12      12427
## 12:   473747       1            3         3                12      25133
## 13:   473747       1            3         3                12      30450
## 14:   550135       1            7         1                 9        196
## 15:   550135       1            7         1                 9      10258
## 16:   550135       1            7         1                 9      12427
## 17:   550135       1            7         1                 9      13032
## 18:   550135       1            7         1                 9      25133
## 19:  2254736       1            4         4                 7        196
## 20:  2254736       1            4         4                 7      10258
##               product_name
##  1:                   Soda
##  2:             Pistachios
##  3:    Organic Fuji Apples
##  4:    Original Beef Jerky
##  5: Bag of Organic Bananas
##  6:      Honeycrisp Apples
##  7:  Organic String Cheese
##  8:         Bartlett Pears
##  9:                   Soda
## 10:             Pistachios
## 11:    Original Beef Jerky
## 12:  Organic String Cheese
## 13:   Creamy Almond Butter
## 14:                   Soda
## 15:             Pistachios
## 16:    Original Beef Jerky
## 17:  Cinnamon Toast Crunch
## 18:  Organic String Cheese
## 19:                   Soda
## 20:             Pistachios
##    user_id product_id
## 1:       1        196
## 2:       1      10258
## 3:       1      10326
## 4:       1      12427
## 5:       1      13176
## 6:       1      17122
##     user_id product_id Aggregate.count
##  1:       1        196              10
##  2:       1      12427              10
##  3:       1      10258               9
##  4:       1      25133               8
##  5:       1      13032               3
##  6:       1      46149               3
##  7:       1      13176               2
##  8:       1      26405               2
##  9:       1      49235               2
## 10:       1      26088               2
## 11:       1      10326               1
## 12:       1      17122               1
## 13:       1      41787               1
## 14:       1      30450               1
## 15:       1      14084               1
## 16:       1      35951               1
## 17:       1      38928               1
## 18:       1      39657               1
## 19:       2      32792               9
## 20:       2      47209               8
##     product_id Aggregate.count
##  1:        196              10
##  2:      12427              10
##  3:      10258               9
##  4:      25133               8
##  5:      13032               3
##  6:      46149               3
##  7:      13176               2
##  8:      26405               2
##  9:      49235               2
## 10:      26088               2
## 11:      10326               1
## 12:      17122               1
## 13:      41787               1
## 14:      30450               1
## 15:      14084               1
## 16:      35951               1
## 17:      38928               1
## 18:      39657               1
##      product_id Aggregate.count
##   1:      32792               9
##   2:      47209               8
##   3:      24852               7
##   4:       1559               6
##   5:      18523               6
##  ---                           
##  98:      30908               1
##  99:      39877               1
## 100:      42342               1
## 101:      44303               1
## 102:      45948               1
## Empty data.table (0 rows) of 1 col: user_id
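A minimal sketch of how per-user product counts like those above could be produced. The toy rows are illustrative and the table names are assumptions; the Aggregate.count column name follows the output.

```r
library(data.table)

# Toy rows: one row per (user, order, product)
order_prior <- data.table(
  user_id    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  order_id   = c(11L, 11L, 12L, 13L, 21L, 22L, 22L),
  product_id = c(196L, 10258L, 196L, 196L, 32792L, 32792L, 47209L)
)

# Keep only the columns needed for counting, which reduces memory before grouping
user_products <- order_prior[, .(user_id, product_id)]

# Count rows per (user, product), then sort by user with the most-ordered first
habit <- user_products[, .(Aggregate.count = .N), by = .(user_id, product_id)]
setorder(habit, user_id, -Aggregate.count)

# One user's habits, without the user_id column
user1 <- habit[user_id == 1L, .(product_id, Aggregate.count)]
print(user1)
```

Dropping unused columns before grouping is one way to ease the memory pressure mentioned above; processing users in chunks is another.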

Plotting each user’s top 5% based on product_id count for the first 100k users

  • This serves as a sample to see what the highest product count is for most (average) users
## [1] 38

Plotting the top 1% of products for each user

  • Most of the counts for each user_id lie below 20, and most of the top product counts seem to fall in the 0-10 range
  • Another way is still needed to find only the top 1% for each user and each product_id
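One possible way to keep only each user's top share of products is a grouped quantile filter. This is a sketch of an alternative, not the approach the notebook ended up with; the helper name top_share_per_user is hypothetical, and the rows are a subset of the counts shown earlier for users 1 and 2.

```r
library(data.table)

# Subset of the per-user counts shown earlier
habit <- data.table(
  user_id         = c(1L, 1L, 1L, 1L, 2L, 2L),
  product_id      = c(196L, 12427L, 10258L, 13032L, 32792L, 47209L),
  Aggregate.count = c(10L, 10L, 9L, 3L, 9L, 8L)
)

# Hypothetical helper: keep rows whose count is at or above the p-th quantile
# of Aggregate.count within the same user
top_share_per_user <- function(dt, p = 0.99) {
  idx <- dt[, .I[Aggregate.count >= quantile(Aggregate.count, probs = p)],
            by = user_id]$V1
  dt[idx]
}

top1 <- top_share_per_user(habit, p = 0.99)
print(top1)
```

Because ties at the cutoff are kept, a user can retain more than 1% of their products; whether that is acceptable depends on how the plot is meant to be read.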

  • The part of the notebook below is not used, as it is not the right way to handle the data set
