Data Analysis, SEM2 CA3

Author

Joanna Zaremba

Q1 Market basket Analysis (using Apriori Algorithm)

Setting up

library(arules)
Warning: package 'arules' was built under R version 4.4.3
library(tidyverse)

1

Loading dataset

ret_tran <- read.transactions("retail_transactions_3.csv", sep = ",")

2

Getting overview of data

summary(ret_tran)
transactions as itemMatrix in sparse format with
 10000 rows (elements/itemsets/transactions) and
 5479 columns (items) and a density of 0.002744552 

most frequent items:
WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
                               822                                776 
           JUMBO BAG RED RETROSPOT                      PARTY BUNTING 
                               663                                561 
     ASSORTED COLOUR BIRD ORNAMENT                            (Other) 
                               544                             147008 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
1658  707  498  413  365  341  316  310  309  290  261  227  227  242  260  227 
  17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
 204  199  233  189  184  149  138  125  104  111  112   98  113   99   78   68 
  33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
  55   62   65   44   48   41   52   41   44   26   45   27   27   35   30   24 
  49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
  26   23   22   21   16   19   21   13   11   16   13   14   15   10   13   13 
  65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
   3   13   13    7    6    9    9    6    6    4    3    8    5    6    3    5 
  81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
   3    4    5    8    3    5    8    2    4    4    1    3    2    3    1    2 
  97   98  100  101  102  103  104  105  107  108  109  110  111  112  113  119 
   5    1    2    2    2    1    2    1    1    3    1    2    1    1    2    1 
 120  121  122  123  125  127  134  142  146  147  150  154  157  171  193  204 
   1    1    1    1    1    1    1    2    1    1    1    2    1    2    1    1 
 235  249 
   1    1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.00   10.00   15.04   21.00  249.00 

includes extended item information - examples:
                      labels
1                   1 HANGER
2     10 COLOUR SPACEBOY PEN
3 12 COLOURED PARTY BALLOONS

2(a)

There is 10’000 transactions in the dataset

2(b)

5479 items are available to purchase

2(c)

The sparse matrix contains 54’790’000 cells (10’000 x 5’479)

2(c, i)

With the matrix having a density of 0.002744552 , 1’503’740 cells contain non zero values (54’790’000 x 0.002744552)

2(d)

The largest amount of items purchased in a single transaction is 249.

2(e)

The mean of items purchased in a single transaction is 15.

3

Bar chart showing the top 20 most frequently purchased items.

itemFrequencyPlot(ret_tran, topN = 20, horiz = T)

4

Using apriori () function to discover association rules:

retail_rules <- apriori(ret_tran, parameter = list(support = 0.01,
                                                   confidence = 0.5,
                                                   minlen = 2))
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5    0.01      2
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 100 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[5479 item(s), 10000 transaction(s)] done [0.07s].
sorting and recoding items ... [384 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [86 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
retail_rules
set of 86 rules 

4(a)

86 rules are discovered

4(b)

Support of an item/ itemset measures how frequently it occurs in the data. By setting support threshold to 0.01, it means that item/ itemset must appear in at least 1% of all transactions.

4(c)

Confidence represents the number of transactions where presence of itemset X results in the presence of itemset Y. By setting confidence to 0.5, the probability of the presence of X resulting in the presence of Y cant be lower than 50%.

5

Using summary () function to get an overview of retail_rules

summary (retail_rules)
set of 86 rules

rule length distribution (lhs + rhs):sizes
 2  3 
47 39 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.000   2.000   2.453   3.000   3.000 

summary of quality measures:
    support          confidence        coverage            lift       
 Min.   :0.01000   Min.   :0.5021   Min.   :0.01200   Min.   : 6.585  
 1st Qu.:0.01063   1st Qu.:0.5510   1st Qu.:0.01733   1st Qu.:12.347  
 Median :0.01160   Median :0.6022   Median :0.02000   Median :15.266  
 Mean   :0.01315   Mean   :0.6198   Mean   :0.02152   Mean   :17.144  
 3rd Qu.:0.01515   3rd Qu.:0.6652   3rd Qu.:0.02428   3rd Qu.:21.864  
 Max.   :0.02270   Max.   :0.8814   Max.   :0.03770   Max.   :44.055  
     count      
 Min.   :100.0  
 1st Qu.:106.2  
 Median :116.0  
 Mean   :131.5  
 3rd Qu.:151.5  
 Max.   :227.0  

mining info:
     data ntransactions support confidence
 ret_tran         10000    0.01        0.5
                                                                                     call
 apriori(data = ret_tran, parameter = list(support = 0.01, confidence = 0.5, minlen = 2))

5 (a)

47 rules have 2 items, 39 rules have 3 items

5 (b)

Minimum lift for a rule is 6.585, maximum lift for a rule is 44.055.

6

Using inspect() function to examine all rules, sorted according to lift.

inspect(sort(retail_rules, by = "lift"))

6(a)

(i)

A basket which contains wooden star Christmas Scandinavian is likely to also contain wooden heart Christmas Scandinavian.

(ii)

support value: 0.0113, confidence value: 0.7533333.

(iii)

In comparison to an average basket, a basket with wooden star Christmas Scandinavian is 44 times more likely to also contain wooden heart Christmas Scandinavian.

6(b)

Trivial Rules: painted metal pears assorted -> assorted colour bird ornament.

6(c)

Actionable Rules: jumbo bag pears -> jumbo bag apples, dolly girl lunch box -> spaceboy lunchbox,

7

Finding all rules that involve ‘ROSES REGENCY TEACUP AND SAUCER’

roses_rules <- subset(retail_rules, items %in% "ROSES REGENCY TEACUP AND SAUCER")
inspect(roses_rules)
     lhs                                   rhs                               support confidence coverage      lift count
[1]  {PINK REGENCY TEACUP AND SAUCER}   => {ROSES REGENCY TEACUP AND SAUCER}  0.0177  0.7314050   0.0242 21.964113   177
[2]  {ROSES REGENCY TEACUP AND SAUCER}  => {PINK REGENCY TEACUP AND SAUCER}   0.0177  0.5315315   0.0333 21.964113   177
[3]  {GREEN REGENCY TEACUP AND SAUCER}  => {ROSES REGENCY TEACUP AND SAUCER}  0.0227  0.7394137   0.0307 22.204615   227
[4]  {ROSES REGENCY TEACUP AND SAUCER}  => {GREEN REGENCY TEACUP AND SAUCER}  0.0227  0.6816817   0.0333 22.204615   227
[5]  {GREEN REGENCY TEACUP AND SAUCER,                                                                                  
      PINK REGENCY TEACUP AND SAUCER}   => {ROSES REGENCY TEACUP AND SAUCER}  0.0156  0.8125000   0.0192 24.399399   156
[6]  {PINK REGENCY TEACUP AND SAUCER,                                                                                   
      ROSES REGENCY TEACUP AND SAUCER}  => {GREEN REGENCY TEACUP AND SAUCER}  0.0156  0.8813559   0.0177 28.708662   156
[7]  {GREEN REGENCY TEACUP AND SAUCER,                                                                                  
      ROSES REGENCY TEACUP AND SAUCER}  => {PINK REGENCY TEACUP AND SAUCER}   0.0156  0.6872247   0.0227 28.397714   156
[8]  {GREEN REGENCY TEACUP AND SAUCER,                                                                                  
      ROSES REGENCY TEACUP AND SAUCER}  => {REGENCY CAKESTAND 3 TIER}         0.0116  0.5110132   0.0227  6.585222   116
[9]  {GREEN REGENCY TEACUP AND SAUCER,                                                                                  
      REGENCY CAKESTAND 3 TIER}         => {ROSES REGENCY TEACUP AND SAUCER}  0.0116  0.7891156   0.0147 23.697167   116
[10] {REGENCY CAKESTAND 3 TIER,                                                                                         
      ROSES REGENCY TEACUP AND SAUCER}  => {GREEN REGENCY TEACUP AND SAUCER}  0.0116  0.7483871   0.0155 24.377430   116

PINK REGENCY TEACUP AND SAUCER, GREEN REGENCY TEACUP AND SAUCER and REGENCY CAKESTAND 3 TIER are other items customers are likely to purchase with ROSES REGENCY TEACUP AND SAUCER.

Q2 Collaborative Filtering

1

Setting up

library(recommenderlab)
Warning: package 'recommenderlab' was built under R version 4.4.3
Warning: package 'proxy' was built under R version 4.4.3

2

Importing and preparing dataset

steam_ratings <- read_csv("steam_ratings.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
steam_ratings <- as(steam_ratings, "matrix")
steam_ratings <- as(steam_ratings, "realRatingMatrix")

3

Carrying out initial exploration

vector_ratings <- as.vector(steam_ratings@data)
table(vector_ratings)
vector_ratings
      0       1       2       3       4       5 
3236066    4773   12500   19762   10655    4724 

3(a)

3236066 missing ratings

3 is the highest chosen rating, 5 is the lowest chosen rating

3(b)

Creating a histogram showing the distribution of the average rating given for each game

colMeans(steam_ratings) %>%
  tibble::enframe(name = "game", value = "steam_ratings") %>% 
  ggplot() +
  geom_histogram(mapping = aes(x = steam_ratings), color = "white") +
  scale_x_continuous(limits = c(1, 5), breaks = c(1, 2, 3, 4, 5), 
                     labels = c('1','2', '3', '4', '5'))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3(c)

Creating a histogram showing the distribution of the total number of games rated by each user

counts <- rowCounts(steam_ratings, value = TRUE, na.rm = FALSE) 

  ggplot() +
  geom_histogram(mapping = aes(x = counts), color = "white") 

4

4 (a,b)

Splitting dataset 80/20

set.seed(101)
eval_games <- evaluationScheme(data = steam_ratings, 
                                method = "split",  
                                train = 0.8,      
                                given = 6,        
                                goodRating = 3)  

4 (c)

Saving output into 3 subsets; train, known and unknown

train_games <- getData(eval_games, "train")
known_games <- getData(eval_games, "known")
unknown_games <- getData(eval_games, "unknown")

5

Building and measuring accuracy of UBCF model

5(a)

#centered#


ubcf_model_1a <- Recommender(data = train_games,
                          method = "UBCF", 
                          parameter = list(normalize = "center", method = "Cosine"))

ubcf_model_1b <- Recommender(data = train_games,
                            method = "UBCF", 
                            parameter = list(normalize = "center", method = "Euclidean"))

ubcf_model_1c <- Recommender(data = train_games,
                            method = "UBCF", 
                            parameter = list(normalize = "center", method = "Pearson"))
#z-score#


ubcf_model_2a <- Recommender(data = train_games,
                             method = "UBCF", 
                             parameter = list(normalize = "z-score", method = "Cosine"))

ubcf_model_2b <- Recommender(data = train_games,
                             method = "UBCF", 
                             parameter = list(normalize = "z-score", method = "Euclidean"))

ubcf_model_2c <- Recommender(data = train_games,
                             method = "UBCF", 
                             parameter = list(normalize = "z-score", method = "Pearson"))
#null#


ubcf_model_3a <- Recommender(data = train_games,
                             method = "UBCF", 
                             parameter = list(normalize = NULL, method = "Cosine"))

ubcf_model_3b <- Recommender(data = train_games,
                             method = "UBCF", 
                             parameter = list(normalize = NULL, method = "Euclidean"))

ubcf_model_3c <- Recommender(data = train_games,
                             method = "UBCF", 
                             parameter = list(normalize = NULL, method = "Pearson"))

5(b)

Computing MAE of each UBCF model

  • Computing MAE for ubcf_1a
ubcf_predict_1a <- predict(object = ubcf_model_1a,
                        newdata = known_games, 
                        type = "ratings")

ubcf_1a_eval <- calcPredictionAccuracy(x = ubcf_predict_1a,
                                    data = unknown_games)
ubcf_1a_eval
     RMSE       MSE       MAE 
1.1697655 1.3683514 0.9183398 
 #MAE 0.9183398#
  • Computing MAE for ubcf_1b
ubcf_predict_1b <- predict(object = ubcf_model_1b,
                           newdata = known_games, 
                           type = "ratings")

ubcf_1b_eval <- calcPredictionAccuracy(x = ubcf_predict_1b,
                                       data = unknown_games)
ubcf_1b_eval 
     RMSE       MSE       MAE 
1.1910345 1.4185633 0.9163087 
#MAE 0.9163087#
  • Computing MAE for ubcf_1c

    ubcf_predict_1c <- predict(object = ubcf_model_1c,
                               newdata = known_games, 
                               type = "ratings")
    
    ubcf_1c_eval <- calcPredictionAccuracy(x = ubcf_predict_1c,
                                           data = unknown_games)
    ubcf_1c_eval 
         RMSE       MSE       MAE 
    1.1212624 1.2572293 0.8702777 
    #MAE 0.8702777#
  • Computing MAE for ubcf_2a

    ubcf_predict_2a <- predict(object = ubcf_model_2a,
                               newdata = known_games, 
                               type = "ratings")
    
    ubcf_2a_eval <- calcPredictionAccuracy(x = ubcf_predict_2a,
                                           data = unknown_games)
    ubcf_2a_eval 
        RMSE      MSE      MAE 
    1.184555 1.403170 0.923375 
    #MAE 0.923375#
  • Computing MAE for ubcf_2b

    ubcf_predict_2b <- predict(object = ubcf_model_2b,
                                 newdata = known_games, 
                                 type = "ratings")
    
    ubcf_2b_eval <- calcPredictionAccuracy(x = ubcf_predict_2b,
                                           data = unknown_games)
    ubcf_2b_eval 
         RMSE       MSE       MAE 
    1.2103032 1.4648339 0.9309624 
    #MAE 0.9309624#
  • Computing MAE for ubcf_2c

    ubcf_predict_2c <- predict(object = ubcf_model_2c,
                                 newdata = known_games, 
                                 type = "ratings")
    
    ubcf_2c_eval <- calcPredictionAccuracy(x = ubcf_predict_2c,
                                           data = unknown_games)
    ubcf_2c_eval 
         RMSE       MSE       MAE 
    1.1345807 1.2872733 0.8790968 
    #MAE 0.8790968#
  • Computing MAE for ubcf_3a

    ubcf_predict_3a <- predict(object = ubcf_model_3a,
                                 newdata = known_games, 
                                 type = "ratings")
    
    ubcf_3a_eval <- calcPredictionAccuracy(x = ubcf_predict_3a,
                                           data = unknown_games)
    ubcf_3a_eval 
         RMSE       MSE       MAE 
    1.0793268 1.1649463 0.8189319 
    #MAE 0.8189319#
  • Computing MAE for ubcf_3b

    ubcf_predict_3b <- predict(object = ubcf_model_3b,
                                 newdata = known_games, 
                                 type = "ratings")
    
    ubcf_3b_eval <- calcPredictionAccuracy(x = ubcf_predict_3b,
                                           data = unknown_games)
    ubcf_3b_eval 
         RMSE       MSE       MAE 
    1.0990975 1.2080152 0.8294308 
    #MAE 0.8294308#
  • Computing MAE for ubcf_3c

    ubcf_predict_3c <- predict(object = ubcf_model_3c,
                                 newdata = known_games, 
                                 type = "ratings")
    
    ubcf_3c_eval <- calcPredictionAccuracy(x = ubcf_predict_3c,
                                           data = unknown_games)
    ubcf_3c_eval 
         RMSE       MSE       MAE 
    1.1086429 1.2290892 0.8349371 
    #MAE 0.8349371#

6

Building and measuring accuracy of IBCF model

#centered#


ibcf_model_1a <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = "center", method = "Cosine"))

ibcf_model_1b <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = "center", method = "Euclidean"))

ibcf_model_1c <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = "center", method = "Pearson"))
#z-score#


ibcf_model_2a <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = "z-score", method = "Cosine"))

ibcf_model_2b <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = "z-score", method = "Euclidean"))

ibcf_model_2c <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = "z-score", method = "Pearson"))
#null#


ibcf_model_3a <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = NULL, method = "Cosine"))

ibcf_model_3b <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = NULL, method = "Euclidean"))

ibcf_model_3c <- Recommender(data = train_games,
                             method = "IBCF", 
                             parameter = list(normalize = NULL, method = "Pearson"))

6(b)

Computing MAE of each IBCF model

  • Computing MAE for ibcf_1a
ibcf_predict_1a <- predict(object = ibcf_model_1a,
                        newdata = known_games, 
                        type = "ratings")

ibcf_1a_eval <- calcPredictionAccuracy(x = ibcf_predict_1a,
                                    data = unknown_games)
ibcf_1a_eval
    RMSE      MSE      MAE 
1.500713 2.252139 1.165198 
#MAE 1.165198#
  • Computing MAE for ibcf_1b

    ibcf_predict_1b <- predict(object = ibcf_model_1b,
                            newdata = known_games, 
                            type = "ratings")
    
    ibcf_1b_eval <- calcPredictionAccuracy(x = ibcf_predict_1b,
                                        data = unknown_games)
    ibcf_1b_eval
        RMSE      MSE      MAE 
    1.477274 2.182339 1.142542 
    #MAE 1.142542#
  • Computing MAE for ibcf_1c

    ibcf_predict_1c <- predict(object = ibcf_model_1c,
                            newdata = known_games, 
                            type = "ratings")
    
    ibcf_1c_eval <- calcPredictionAccuracy(x = ibcf_predict_1c,
                                        data = unknown_games)
    ibcf_1c_eval
        RMSE      MSE      MAE 
    1.470169 2.161397 1.158908 
    #MAE 1.158908#
  • Computing MAE for ibcf_2a

    ibcf_predict_2a <- predict(object = ibcf_model_2a,
                            newdata = known_games, 
                            type = "ratings")
    
    ibcf_2a_eval <- calcPredictionAccuracy(x = ibcf_predict_2a,
                                        data = unknown_games)
    ibcf_2a_eval
        RMSE      MSE      MAE 
    1.500976 2.252928 1.163775 
    #MAE 1.163775#
  • Computing MAE for ibcf_2b

    ibcf_predict_2b <- predict(object = ibcf_model_2b,
                            newdata = known_games, 
                            type = "ratings")
    
    ibcf_2b_eval <- calcPredictionAccuracy(x = ibcf_predict_2b,
                                        data = unknown_games)
    ibcf_2b_eval
        RMSE      MSE      MAE 
    1.475157 2.176087 1.141132 
    #MAE 1.141132#
  • Computing MAE for ibcf_2c

    ibcf_predict_2c <- predict(object = ibcf_model_2c,
                            newdata = known_games, 
                            type = "ratings")
    
    ibcf_2c_eval <- calcPredictionAccuracy(x = ibcf_predict_2c,
                                        data = unknown_games)
    ibcf_2c_eval
        RMSE      MSE      MAE 
    1.467355 2.153130 1.158796 
    #MAE 1.158796#
  • Computing MAE for ibcf_3a

    ibcf_predict_3a <- predict(object = ibcf_model_3a,
                            newdata = known_games, 
                            type = "ratings")
    
    ibcf_3a_eval <- calcPredictionAccuracy(x = ibcf_predict_3a,
                                        data = unknown_games)
    ibcf_3a_eval
        RMSE      MSE      MAE 
    1.587257 2.519385 1.239649 
    #MAE 1.239649#
  • Computing MAE for ibcf_3b

    ibcf_predict_3b <- predict(object = ibcf_model_3b,
                            newdata = known_games, 
                            type = "ratings")
    
    ibcf_3b_eval <- calcPredictionAccuracy(x = ibcf_predict_3b,
                                        data = unknown_games)
    ibcf_3b_eval
        RMSE      MSE      MAE 
    1.476175 2.179092 1.140654 
    #MAE 1.140654#
  • Computing MAE for ibcf_3c

    ibcf_predict_3c <- predict(object = ibcf_model_3c,
                            newdata = known_games, 
                            type = "ratings")
    
    ibcf_3c_eval <- calcPredictionAccuracy(x = ibcf_predict_3c,
                                        data = unknown_games)
    ibcf_3c_eval
        RMSE      MSE      MAE 
    1.456788 2.122230 1.152312 
    #MAE 1.152312#

7

UBCF model with the best MAE score

  • ubcf_2b model has the highest MAE score of 0.9309624

IBCF model with the best MAE model

  • ibcf_3a model has the highest MAE score of 1.239649
ubcf_2b_recs <- predict(object = ubcf_model_2b,
                     newdata = known_games,
                     type = "topNList",
                     n = 3) 


as(ubcf_2b_recs, "list")
  • Recommendations for user 0; “Pro Evolution Soccer 2015”, “Deadpool”, “Guns of Icarus Online”

  • Recommendations for user 1; “Valkyria Chronicles”, “Lara Croft and the Guardian of Light”, “Panzar”

  • Recommendations for user 2; “Duke Nukem 3D Megaton Edition”, “The Ultimate DOOM”, “Synergy”

  • Recommendations for user 3; “Sparkle 2 Evo”, “Sang-Froid - Tales of Werewolves”, “The Journey Down Chapter One”

  • Recommendations for user 4; “Assassin’s Creed”, “Sonic Adventure 2”, “Galaxy on Fire 2 Full HD”

ibcf_3a_recs <- predict(object = ibcf_model_3a,
                     newdata = known_games,
                     type = "topNList",
                     n = 3) 


as(ibcf_3a_recs, "list")
  • Recommendations for user 0; “Wind of Luck Arena”, “404Sight”, “8BitMMO”

  • Recommendations for user 1; “3DMark”, “AdVenture Capitalist”, “Age of Wonders III”

  • Recommendations for user 2; “Heroes of Might & Magic III - HD Edition”, “Age of Conan Unchained - EU version”, “Alien Rage - Unlimited”

  • Recommendations for user 3; “60 Seconds!”, “Batla”, “Bus Driver”

  • Recommendations for user 4; “Anno 1404”, “Axis Game Factory’s AGFPRO 3.0”, “Blood Bowl Chaos Edition”

8

Steam can use these game recommendations to recommend these games to specific users based on their existing game purchase history. By also observing the common factors amongst these games, Steam can accurately predict which games and game types users are interested in the most, as well as using these to predict future trends in gaming purchases on their platform. As steam has multiple sales on its platform, these predictions can be included during these sales to increase sale numbers as well as revenue.