library(arules)Warning: package 'arules' was built under R version 4.4.3
library(tidyverse)Setting up
library(arules)Warning: package 'arules' was built under R version 4.4.3
library(tidyverse)Loading dataset
ret_tran <- read.transactions("retail_transactions_3.csv", sep = ",")Getting overview of data
summary(ret_tran)transactions as itemMatrix in sparse format with
10000 rows (elements/itemsets/transactions) and
5479 columns (items) and a density of 0.002744552
most frequent items:
WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
822 776
JUMBO BAG RED RETROSPOT PARTY BUNTING
663 561
ASSORTED COLOUR BIRD ORNAMENT (Other)
544 147008
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1658 707 498 413 365 341 316 310 309 290 261 227 227 242 260 227
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
204 199 233 189 184 149 138 125 104 111 112 98 113 99 78 68
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
55 62 65 44 48 41 52 41 44 26 45 27 27 35 30 24
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
26 23 22 21 16 19 21 13 11 16 13 14 15 10 13 13
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
3 13 13 7 6 9 9 6 6 4 3 8 5 6 3 5
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
3 4 5 8 3 5 8 2 4 4 1 3 2 3 1 2
97 98 100 101 102 103 104 105 107 108 109 110 111 112 113 119
5 1 2 2 2 1 2 1 1 3 1 2 1 1 2 1
120 121 122 123 125 127 134 142 146 147 150 154 157 171 193 204
1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1
235 249
1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.00 10.00 15.04 21.00 249.00
includes extended item information - examples:
labels
1 1 HANGER
2 10 COLOUR SPACEBOY PEN
3 12 COLOURED PARTY BALLOONS
There is 10’000 transactions in the dataset
5479 items are available to purchase
The sparse matrix contains 54’790’000 cells (10’000 x 5’479)
With the matrix having a density of 0.002744552 , 1’503’740 cells contain non zero values (54’790’000 x 0.002744552)
The largest amount of items purchased in a single transaction is 249.
The mean of items purchased in a single transaction is 15.
Bar chart showing the top 20 most frequently purchased items.
itemFrequencyPlot(ret_tran, topN = 20, horiz = T)Using apriori () function to discover association rules:
retail_rules <- apriori(ret_tran, parameter = list(support = 0.01,
confidence = 0.5,
minlen = 2))Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.5 0.1 1 none FALSE TRUE 5 0.01 2
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 100
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[5479 item(s), 10000 transaction(s)] done [0.07s].
sorting and recoding items ... [384 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [86 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
retail_rulesset of 86 rules
86 rules are discovered
Support of an item/ itemset measures how frequently it occurs in the data. By setting support threshold to 0.01, it means that item/ itemset must appear in at least 1% of all transactions.
Confidence represents the number of transactions where presence of itemset X results in the presence of itemset Y. By setting confidence to 0.5, the probability of the presence of X resulting in the presence of Y cant be lower than 50%.
Using summary () function to get an overview of retail_rules
summary (retail_rules)set of 86 rules
rule length distribution (lhs + rhs):sizes
2 3
47 39
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 2.000 2.453 3.000 3.000
summary of quality measures:
support confidence coverage lift
Min. :0.01000 Min. :0.5021 Min. :0.01200 Min. : 6.585
1st Qu.:0.01063 1st Qu.:0.5510 1st Qu.:0.01733 1st Qu.:12.347
Median :0.01160 Median :0.6022 Median :0.02000 Median :15.266
Mean :0.01315 Mean :0.6198 Mean :0.02152 Mean :17.144
3rd Qu.:0.01515 3rd Qu.:0.6652 3rd Qu.:0.02428 3rd Qu.:21.864
Max. :0.02270 Max. :0.8814 Max. :0.03770 Max. :44.055
count
Min. :100.0
1st Qu.:106.2
Median :116.0
Mean :131.5
3rd Qu.:151.5
Max. :227.0
mining info:
data ntransactions support confidence
ret_tran 10000 0.01 0.5
call
apriori(data = ret_tran, parameter = list(support = 0.01, confidence = 0.5, minlen = 2))
47 rules have 2 items, 39 rules have 3 items
Minimum lift for a rule is 6.585, maximum lift for a rule is 44.055.
Using inspect() function to examine all rules, sorted according to lift.
inspect(sort(retail_rules, by = "lift"))A basket which contains wooden star Christmas Scandinavian is likely to also contain wooden heart Christmas Scandinavian.
support value: 0.0113, confidence value: 0.7533333.
In comparison to an average basket, a basket with wooden star Christmas Scandinavian is 44 times more likely to also contain wooden heart Christmas Scandinavian.
Trivial Rules: painted metal pears assorted -> assorted colour bird ornament.
Actionable Rules: jumbo bag pears -> jumbo bag apples, dolly girl lunch box -> spaceboy lunchbox,
Finding all rules that involve ‘ROSES REGENCY TEACUP AND SAUCER’
roses_rules <- subset(retail_rules, items %in% "ROSES REGENCY TEACUP AND SAUCER")
inspect(roses_rules) lhs rhs support confidence coverage lift count
[1] {PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.0177 0.7314050 0.0242 21.964113 177
[2] {ROSES REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER} 0.0177 0.5315315 0.0333 21.964113 177
[3] {GREEN REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.0227 0.7394137 0.0307 22.204615 227
[4] {ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.0227 0.6816817 0.0333 22.204615 227
[5] {GREEN REGENCY TEACUP AND SAUCER,
PINK REGENCY TEACUP AND SAUCER} => {ROSES REGENCY TEACUP AND SAUCER} 0.0156 0.8125000 0.0192 24.399399 156
[6] {PINK REGENCY TEACUP AND SAUCER,
ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.0156 0.8813559 0.0177 28.708662 156
[7] {GREEN REGENCY TEACUP AND SAUCER,
ROSES REGENCY TEACUP AND SAUCER} => {PINK REGENCY TEACUP AND SAUCER} 0.0156 0.6872247 0.0227 28.397714 156
[8] {GREEN REGENCY TEACUP AND SAUCER,
ROSES REGENCY TEACUP AND SAUCER} => {REGENCY CAKESTAND 3 TIER} 0.0116 0.5110132 0.0227 6.585222 116
[9] {GREEN REGENCY TEACUP AND SAUCER,
REGENCY CAKESTAND 3 TIER} => {ROSES REGENCY TEACUP AND SAUCER} 0.0116 0.7891156 0.0147 23.697167 116
[10] {REGENCY CAKESTAND 3 TIER,
ROSES REGENCY TEACUP AND SAUCER} => {GREEN REGENCY TEACUP AND SAUCER} 0.0116 0.7483871 0.0155 24.377430 116
PINK REGENCY TEACUP AND SAUCER, GREEN REGENCY TEACUP AND SAUCER and REGENCY CAKESTAND 3 TIER are other items customers are likely to purchase with ROSES REGENCY TEACUP AND SAUCER.
Setting up
library(recommenderlab)Warning: package 'recommenderlab' was built under R version 4.4.3
Warning: package 'proxy' was built under R version 4.4.3
Importing and preparing dataset
steam_ratings <- read_csv("steam_ratings.csv")Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
steam_ratings <- as(steam_ratings, "matrix")
steam_ratings <- as(steam_ratings, "realRatingMatrix")Carrying out initial exploration
vector_ratings <- as.vector(steam_ratings@data)
table(vector_ratings)vector_ratings
0 1 2 3 4 5
3236066 4773 12500 19762 10655 4724
3236066 missing ratings
3 is the highest chosen rating, 5 is the lowest chosen rating
Creating a histogram showing the distribution of the average rating given for each game
colMeans(steam_ratings) %>%
tibble::enframe(name = "game", value = "steam_ratings") %>%
ggplot() +
geom_histogram(mapping = aes(x = steam_ratings), color = "white") +
scale_x_continuous(limits = c(1, 5), breaks = c(1, 2, 3, 4, 5),
labels = c('1','2', '3', '4', '5'))Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Creating a histogram showing the distribution of the total number of games rated by each user
counts <- rowCounts(steam_ratings, value = TRUE, na.rm = FALSE)
ggplot() +
geom_histogram(mapping = aes(x = counts), color = "white") Splitting dataset 80/20
set.seed(101)
eval_games <- evaluationScheme(data = steam_ratings,
method = "split",
train = 0.8,
given = 6,
goodRating = 3) Saving output into 3 subsets; train, known and unknown
train_games <- getData(eval_games, "train")
known_games <- getData(eval_games, "known")
unknown_games <- getData(eval_games, "unknown")Building and measuring accuracy of UBCF model
#centered#
ubcf_model_1a <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = "center", method = "Cosine"))
ubcf_model_1b <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = "center", method = "Euclidean"))
ubcf_model_1c <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = "center", method = "Pearson"))#z-score#
ubcf_model_2a <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = "z-score", method = "Cosine"))
ubcf_model_2b <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = "z-score", method = "Euclidean"))
ubcf_model_2c <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = "z-score", method = "Pearson"))#null#
ubcf_model_3a <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = NULL, method = "Cosine"))
ubcf_model_3b <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = NULL, method = "Euclidean"))
ubcf_model_3c <- Recommender(data = train_games,
method = "UBCF",
parameter = list(normalize = NULL, method = "Pearson"))Computing MAE of each UBCF model
ubcf_predict_1a <- predict(object = ubcf_model_1a,
newdata = known_games,
type = "ratings")
ubcf_1a_eval <- calcPredictionAccuracy(x = ubcf_predict_1a,
data = unknown_games)
ubcf_1a_eval RMSE MSE MAE
1.1697655 1.3683514 0.9183398
#MAE 0.9183398#ubcf_predict_1b <- predict(object = ubcf_model_1b,
newdata = known_games,
type = "ratings")
ubcf_1b_eval <- calcPredictionAccuracy(x = ubcf_predict_1b,
data = unknown_games)
ubcf_1b_eval RMSE MSE MAE
1.1910345 1.4185633 0.9163087
#MAE 0.9163087#Computing MAE for ubcf_1c
ubcf_predict_1c <- predict(object = ubcf_model_1c,
newdata = known_games,
type = "ratings")
ubcf_1c_eval <- calcPredictionAccuracy(x = ubcf_predict_1c,
data = unknown_games)
ubcf_1c_eval RMSE MSE MAE
1.1212624 1.2572293 0.8702777
#MAE 0.8702777#Computing MAE for ubcf_2a
ubcf_predict_2a <- predict(object = ubcf_model_2a,
newdata = known_games,
type = "ratings")
ubcf_2a_eval <- calcPredictionAccuracy(x = ubcf_predict_2a,
data = unknown_games)
ubcf_2a_eval RMSE MSE MAE
1.184555 1.403170 0.923375
#MAE 0.923375#Computing MAE for ubcf_2b
ubcf_predict_2b <- predict(object = ubcf_model_2b,
newdata = known_games,
type = "ratings")
ubcf_2b_eval <- calcPredictionAccuracy(x = ubcf_predict_2b,
data = unknown_games)
ubcf_2b_eval RMSE MSE MAE
1.2103032 1.4648339 0.9309624
#MAE 0.9309624#Computing MAE for ubcf_2c
ubcf_predict_2c <- predict(object = ubcf_model_2c,
newdata = known_games,
type = "ratings")
ubcf_2c_eval <- calcPredictionAccuracy(x = ubcf_predict_2c,
data = unknown_games)
ubcf_2c_eval RMSE MSE MAE
1.1345807 1.2872733 0.8790968
#MAE 0.8790968#Computing MAE for ubcf_3a
ubcf_predict_3a <- predict(object = ubcf_model_3a,
newdata = known_games,
type = "ratings")
ubcf_3a_eval <- calcPredictionAccuracy(x = ubcf_predict_3a,
data = unknown_games)
ubcf_3a_eval RMSE MSE MAE
1.0793268 1.1649463 0.8189319
#MAE 0.8189319#Computing MAE for ubcf_3b
ubcf_predict_3b <- predict(object = ubcf_model_3b,
newdata = known_games,
type = "ratings")
ubcf_3b_eval <- calcPredictionAccuracy(x = ubcf_predict_3b,
data = unknown_games)
ubcf_3b_eval RMSE MSE MAE
1.0990975 1.2080152 0.8294308
#MAE 0.8294308#Computing MAE for ubcf_3c
ubcf_predict_3c <- predict(object = ubcf_model_3c,
newdata = known_games,
type = "ratings")
ubcf_3c_eval <- calcPredictionAccuracy(x = ubcf_predict_3c,
data = unknown_games)
ubcf_3c_eval RMSE MSE MAE
1.1086429 1.2290892 0.8349371
#MAE 0.8349371#Building and measuring accuracy of IBCF model
#centered#
ibcf_model_1a <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = "center", method = "Cosine"))
ibcf_model_1b <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = "center", method = "Euclidean"))
ibcf_model_1c <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = "center", method = "Pearson"))#z-score#
ibcf_model_2a <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = "z-score", method = "Cosine"))
ibcf_model_2b <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = "z-score", method = "Euclidean"))
ibcf_model_2c <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = "z-score", method = "Pearson"))#null#
ibcf_model_3a <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = NULL, method = "Cosine"))
ibcf_model_3b <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = NULL, method = "Euclidean"))
ibcf_model_3c <- Recommender(data = train_games,
method = "IBCF",
parameter = list(normalize = NULL, method = "Pearson"))Computing MAE of each IBCF model
ibcf_predict_1a <- predict(object = ibcf_model_1a,
newdata = known_games,
type = "ratings")
ibcf_1a_eval <- calcPredictionAccuracy(x = ibcf_predict_1a,
data = unknown_games)
ibcf_1a_eval RMSE MSE MAE
1.500713 2.252139 1.165198
#MAE 1.165198#Computing MAE for ibcf_1b
ibcf_predict_1b <- predict(object = ibcf_model_1b,
newdata = known_games,
type = "ratings")
ibcf_1b_eval <- calcPredictionAccuracy(x = ibcf_predict_1b,
data = unknown_games)
ibcf_1b_eval RMSE MSE MAE
1.477274 2.182339 1.142542
#MAE 1.142542#Computing MAE for ibcf_1c
ibcf_predict_1c <- predict(object = ibcf_model_1c,
newdata = known_games,
type = "ratings")
ibcf_1c_eval <- calcPredictionAccuracy(x = ibcf_predict_1c,
data = unknown_games)
ibcf_1c_eval RMSE MSE MAE
1.470169 2.161397 1.158908
#MAE 1.158908#Computing MAE for ibcf_2a
ibcf_predict_2a <- predict(object = ibcf_model_2a,
newdata = known_games,
type = "ratings")
ibcf_2a_eval <- calcPredictionAccuracy(x = ibcf_predict_2a,
data = unknown_games)
ibcf_2a_eval RMSE MSE MAE
1.500976 2.252928 1.163775
#MAE 1.163775#Computing MAE for ibcf_2b
ibcf_predict_2b <- predict(object = ibcf_model_2b,
newdata = known_games,
type = "ratings")
ibcf_2b_eval <- calcPredictionAccuracy(x = ibcf_predict_2b,
data = unknown_games)
ibcf_2b_eval RMSE MSE MAE
1.475157 2.176087 1.141132
#MAE 1.141132#Computing MAE for ibcf_2c
ibcf_predict_2c <- predict(object = ibcf_model_2c,
newdata = known_games,
type = "ratings")
ibcf_2c_eval <- calcPredictionAccuracy(x = ibcf_predict_2c,
data = unknown_games)
ibcf_2c_eval RMSE MSE MAE
1.467355 2.153130 1.158796
#MAE 1.158796#Computing MAE for ibcf_3a
ibcf_predict_3a <- predict(object = ibcf_model_3a,
newdata = known_games,
type = "ratings")
ibcf_3a_eval <- calcPredictionAccuracy(x = ibcf_predict_3a,
data = unknown_games)
ibcf_3a_eval RMSE MSE MAE
1.587257 2.519385 1.239649
#MAE 1.239649#Computing MAE for ibcf_3b
ibcf_predict_3b <- predict(object = ibcf_model_3b,
newdata = known_games,
type = "ratings")
ibcf_3b_eval <- calcPredictionAccuracy(x = ibcf_predict_3b,
data = unknown_games)
ibcf_3b_eval RMSE MSE MAE
1.476175 2.179092 1.140654
#MAE 1.140654#Computing MAE for ibcf_3c
ibcf_predict_3c <- predict(object = ibcf_model_3c,
newdata = known_games,
type = "ratings")
ibcf_3c_eval <- calcPredictionAccuracy(x = ibcf_predict_3c,
data = unknown_games)
ibcf_3c_eval RMSE MSE MAE
1.456788 2.122230 1.152312
#MAE 1.152312#UBCF model with the best MAE score
IBCF model with the best MAE model
ubcf_2b_recs <- predict(object = ubcf_model_2b,
newdata = known_games,
type = "topNList",
n = 3)
as(ubcf_2b_recs, "list")Recommendations for user 0; “Pro Evolution Soccer 2015”, “Deadpool”, “Guns of Icarus Online”
Recommendations for user 1; “Valkyria Chronicles”, “Lara Croft and the Guardian of Light”, “Panzar”
Recommendations for user 2; “Duke Nukem 3D Megaton Edition”, “The Ultimate DOOM”, “Synergy”
Recommendations for user 3; “Sparkle 2 Evo”, “Sang-Froid - Tales of Werewolves”, “The Journey Down Chapter One”
Recommendations for user 4; “Assassin’s Creed”, “Sonic Adventure 2”, “Galaxy on Fire 2 Full HD”
ibcf_3a_recs <- predict(object = ibcf_model_3a,
newdata = known_games,
type = "topNList",
n = 3)
as(ibcf_3a_recs, "list")Recommendations for user 0; “Wind of Luck Arena”, “404Sight”, “8BitMMO”
Recommendations for user 1; “3DMark”, “AdVenture Capitalist”, “Age of Wonders III”
Recommendations for user 2; “Heroes of Might & Magic III - HD Edition”, “Age of Conan Unchained - EU version”, “Alien Rage - Unlimited”
Recommendations for user 3; “60 Seconds!”, “Batla”, “Bus Driver”
Recommendations for user 4; “Anno 1404”, “Axis Game Factory’s AGFPRO 3.0”, “Blood Bowl Chaos Edition”
Steam can use these game recommendations to recommend these games to specific users based on their existing game purchase history. By also observing the common factors amongst these games, Steam can accurately predict which games and game types users are interested in the most, as well as using these to predict future trends in gaming purchases on their platform. As steam has multiple sales on its platform, these predictions can be included during these sales to increase sale numbers as well as revenue.