Final Project
Background Case
Steam is the world’s most popular PC gaming hub, with over 6,000 games and a community of millions of gamers. With a massive collection that spans everything from AAA blockbusters to small indie titles, good discovery tools are a highly valuable asset for Steam.
As a Steam user, I have noticed that Steam has built up its own recommendation system, which appears on the home page under the “For You” tab.
This dataset is a list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are ‘purchase’ and ‘play’. The value indicates the degree to which the behavior was performed - in the case of ‘purchase’ the value is always 1, and in the case of ‘play’ the value represents the number of hours the user has played the game.
I will try an SVD approach in this case; earlier attempts with standard recommender-system tooling ran into issues with NA values.
Load required Libraries
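The original chunk is not shown here, so the sketch below simply loads the packages whose functions appear later in this report: ggplot from ggplot2, melt from reshape2, normalmixEM from mixtools, sample.split from caTools, and percent_rank from dplyr. The exact set used by the author is an assumption.
# Assumed package list, inferred from the functions called later in this report
library(ggplot2)    # plotting
library(reshape2)   # melt()
library(mixtools)   # normalmixEM()
library(caTools)    # sample.split()
library(dplyr)      # percent_rank()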
Download Data File
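The download command itself is not shown; one way it could have been done is via the Kaggle CLI followed by unzipping in R. This is a sketch only: it assumes the Kaggle CLI is installed and authenticated, and the dataset slug is a guess not confirmed by the output below.
# Hypothetical download step -- assumes a configured Kaggle CLI and a guessed dataset slug
system("kaggle datasets download -d tamber/steam-video-games")
unzip("steam-video-games.zip", exdir = ".")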
## Downloading steam-video-games.zip to /localdisk/Data-612/Final
##
## Archive: steam-video-games.zip
## inflating: steam-200k.csv
Analyse Data
The dataset doesn’t contain explicit ratings for the games, so we define each user’s preference for a game by how many hours they have spent playing it.
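The loading chunk is not shown; a minimal sketch, assuming the file is read with read.csv and the columns are then renamed to match the output below. The fifth, all-zero column is given the stand-in name "unused" here (it prints as NA in the original output).
# steam-200k.csv has no header row: user id, game title, behaviour, value, plus an unused all-zero column
steam <- read.csv("steam-200k.csv", header = FALSE, stringsAsFactors = FALSE)
head(steam)
steam_clean <- steam
colnames(steam_clean) <- c("user", "game", "purchase_play", "hrs", "unused")  # "unused" is a stand-in name
head(steam_clean)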
## V1 V2 V3 V4 V5
## 1 151603712 The Elder Scrolls V Skyrim purchase 1.0 0
## 2 151603712 The Elder Scrolls V Skyrim play 273.0 0
## 3 151603712 Fallout 4 purchase 1.0 0
## 4 151603712 Fallout 4 play 87.0 0
## 5 151603712 Spore purchase 1.0 0
## 6 151603712 Spore play 14.9 0
## user game purchase_play hrs NA
## 1 151603712 The Elder Scrolls V Skyrim purchase 1.0 0
## 2 151603712 The Elder Scrolls V Skyrim play 273.0 0
## 3 151603712 Fallout 4 purchase 1.0 0
## 4 151603712 Fallout 4 play 87.0 0
## 5 151603712 Spore purchase 1.0 0
## 6 151603712 Spore play 14.9 0
Are there any missing values?
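The zero counts below were presumably produced by a per-column check along these lines (a sketch, not the original chunk):
colSums(is.na(steam_clean))  # count missing values in each column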
## user game purchase_play hrs <NA>
## 0 0 0 0 0
I split the purchase/play column into two indicator columns, purchase and play.
steam_clean$purchase <- as.numeric(steam_clean$purchase_play == 'purchase')  # 1 if the row records a purchase
steam_clean$play <- as.numeric(steam_clean$purchase_play == 'play')          # 1 if the row records play time
steam_clean$hrs <- steam_clean$hrs - steam_clean$purchase                    # purchase rows carry a value of 1, not hours; zero them out
steam_clean <- steam_clean[,-3]                                              # drop the purchase_play column
steam_clean <- aggregate(. ~ user + game, data = steam_clean, FUN = 'sum')   # collapse to one row per user-game pair
head(steam_clean)
## user game hrs NA
## 1 46055854 007 Legends 0.7 0
## 2 11940338 0RBITALIS 0.6 0
## 3 86055705 0RBITALIS 0.3 0
## 4 93030550 0RBITALIS 0.3 0
## 5 11794760 1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 0.0 0
## 6 35701646 1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 0.0 0
## purchase play
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 1
## 5 1 0
## 6 1 0
Let’s count how many distinct games were purchased and by how many distinct users.
numgames <- length(unique(steam_clean$game))
numusers <- length(unique(steam_clean$user))
cat("Total:- ", numgames, "games purchased by:- ", numusers, "users")
## Total:- 5155 games purchased by:- 12393 users
Let’s find out the top 20 games with the most users and hours played.
game_total_hrs <- aggregate(hrs ~ game, data = steam_clean, FUN = 'sum')
game_total_hrs <- game_total_hrs[order(game_total_hrs$hrs, decreasing = TRUE),]
most_played_games <- head(game_total_hrs, 20)
most_played_games <- data.frame(game = most_played_games$game, hrs = most_played_games$hrs)
head(most_played_games, 20)
## game hrs
## 1 Dota 2 981684.6
## 2 Counter-Strike Global Offensive 322771.6
## 3 Team Fortress 2 173673.3
## 4 Counter-Strike 134261.1
## 5 Sid Meier's Civilization V 99821.3
## 6 Counter-Strike Source 96075.5
## 7 The Elder Scrolls V Skyrim 70889.3
## 8 Garry's Mod 49725.3
## 9 Call of Duty Modern Warfare 2 - Multiplayer 42009.9
## 10 Left 4 Dead 2 33596.7
## 11 Football Manager 2013 32308.6
## 12 Football Manager 2012 30845.8
## 13 Football Manager 2014 30574.8
## 14 Terraria 29951.8
## 15 Warframe 27074.6
## 16 Football Manager 2015 24283.1
## 17 Arma 3 24055.7
## 18 Grand Theft Auto V 22956.7
## 19 Borderlands 2 22667.9
## 20 Empire Total War 21030.3
Let’s find the games with the highest number of users.
game_freq <- data.frame(sort(table(steam_clean$game), decreasing = TRUE))
colnames(game_freq) <- c("game", "numusers")
top20 <- merge(game_freq, game_total_hrs, by = 'game')
top20 <- head(top20[order(top20$numusers, decreasing = TRUE),], 20)
top20
## game numusers hrs
## 1333 Dota 2 4841 981684.6
## 4256 Team Fortress 2 2323 173673.3
## 4822 Unturned 1563 16096.4
## 971 Counter-Strike Global Offensive 1412 322771.6
## 2077 Half-Life 2 Lost Coast 981 184.4
## 974 Counter-Strike Source 978 96075.5
## 2475 Left 4 Dead 2 951 33596.7
## 968 Counter-Strike 856 134261.1
## 4922 Warframe 847 27074.6
## 2074 Half-Life 2 Deathmatch 823 3712.9
## 1888 Garry's Mod 731 49725.3
## 4368 The Elder Scrolls V Skyrim 717 70889.3
## 3554 Robocraft 689 9096.6
## 969 Counter-Strike Condition Zero 679 7950.0
## 970 Counter-Strike Condition Zero Deleted Scenes 679 418.2
## 2145 Heroes & Generals 658 3299.5
## 2073 Half-Life 2 639 4260.3
## 3833 Sid Meier's Civilization V 596 99821.3
## 4917 War Thunder 590 14381.6
## 3239 Portal 588 2282.8
ggplot(top20, aes(x = game, y = numusers, fill = hrs)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(title = "Top 20 games with the most users", x = "Game Name", y = "Num of users")
“Dota 2” has both the highest number of players and the highest total hours played, making it undeniably the most popular game, whereas the other games trail far behind in hours played.
Let’s find how many users purchased games but never played them.
purchased_not_played <- subset(steam_clean, purchase == 1 & play == 0)
nTransactions <-nrow(purchased_not_played)
numusers <- length(unique(purchased_not_played$user))
nPurchased <- nrow(subset(steam_clean, purchase == 1))
cat("There are", nTransactions, "games purchased out of", nPurchased, "that have not been played by", numusers, "users")
## There are 57904 games purchased out of 128097 that have not been played by 5949 users
That’s a lot of games that have never been played. We can’t base ratings and recommendations on purchase counts alone.
It seems reasonable to turn the distribution of hours played for a game into a rating; at the very least it is a reasonable way to view the game’s reception. For example, a badly received game will have most of its mass around 0-1 hours, while a well-received game will have its distribution shifted towards much larger values.
We will use the EM algorithm to fit 5 groups, which can be treated as 1-5 star ratings.
Of course this mapping is imperfect: a user may only play a game for an average number of hours yet love it and rate it a 5. Still, the EM algorithm should be more appropriate than simply breaking the data into percentiles.
steam_clean$game1 <- gsub("[^a-zA-Z0-9]", "", steam_clean$game) # Strip non-alphanumeric characters so the game names are easier to work with.
head(steam_clean)
## user game hrs NA
## 1 46055854 007 Legends 0.7 0
## 2 11940338 0RBITALIS 0.6 0
## 3 86055705 0RBITALIS 0.3 0
## 4 93030550 0RBITALIS 0.3 0
## 5 11794760 1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 0.0 0
## 6 35701646 1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 0.0 0
## purchase play game1
## 1 1 1 007Legends
## 2 1 1 0RBITALIS
## 3 1 1 0RBITALIS
## 4 1 1 0RBITALIS
## 5 1 0 123KICKITDropThatBeatLikeanUglyBaby
## 6 1 0 123KICKITDropThatBeatLikeanUglyBaby
It seems reasonable to create a rating based on how long a game was played. We will only consider records with more than 2 hours of registered play time.
game_hrs_density <- function(GAME, nclass, print_vals = TRUE){
  # subsetting data
  game_data <- subset(steam_clean, game1 == GAME & hrs > 2)
  game_data$loghrs <- log(game_data$hrs)
  # em algorithm
  mu.init <- seq(min(game_data$loghrs), max(game_data$loghrs), length = nclass)
  EM <- normalmixEM(game_data$loghrs, mu = mu.init, sigma = rep(1, nclass))
  # print results
  if(print_vals){
    cat(" lambda: ", EM$lambda, "\n mean : ", EM$mu, "\n sigma : ", EM$sigma, "\n")
  }
  # building data frame for plotting
  x <- seq(min(game_data$loghrs), max(game_data$loghrs), 0.01)
  dens <- data.frame(x = x)
  for(k in 1:nclass){
    dens[, paste0('y', k)] <- EM$lambda[k]*dnorm(x, EM$mu[k], EM$sigma[k])
  }
  dens <- melt(dens, 'x', variable.name = 'gaussian')
  game_plt <- ggplot(game_data, aes(x = loghrs)) +
    geom_histogram(aes(y = ..density..), bins = 25, colour = "black", alpha = 0.7, size = 0.1) +
    geom_area(data = dens, aes(x = x, y = value, fill = gaussian), alpha = 0.5, position = position_dodge()) +
    geom_density(linetype = 2, size = 0.1) +
    ggtitle(game_data$game[1])
  return(game_plt)
}
Let’s look at the hours-played distributions for a few games selected at random.
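The calls that produced the three fits below are not shown; they were presumably of the following form. Only The Witcher 3 is confirmed by the discussion further down; the other cleaned game names here are illustrative guesses.
# illustrative calls -- the actual random selection is not recorded in the original
game_hrs_density("TheWitcher3WildHunt", nclass = 5)
game_hrs_density("Dota2", nclass = 5)
game_hrs_density("SidMeiersCivilizationV", nclass = 5)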
## number of iterations= 937
## lambda: 0.1817628 0.03697299 0.04630347 0.2034308 0.5315299
## mean : 1.625785 2.57379 2.998076 3.663856 4.553853
## sigma : 0.3996334 0.06765078 0.1151694 0.3322277 0.4540627
## Warning: Width not defined. Set with `position_dodge(width = ?)`
## number of iterations= 800
## lambda: 0.0511508 0.4510223 0.4120477 0.07204664 0.01373261
## mean : 0.8685749 1.868876 3.802058 5.718911 7.16823
## sigma : 0.08624541 0.5096381 0.6166057 0.4771198 0.2627739
## Warning: Width not defined. Set with `position_dodge(width = ?)`
## number of iterations= 206
## lambda: 0.3853429 0.07186484 0.2351439 0.2303342 0.07731413
## mean : 1.39312 2.638814 3.456927 4.208118 5.030528
## sigma : 0.3987871 0.05963827 0.3235782 0.1881075 0.1686162
## Warning: Width not defined. Set with `position_dodge(width = ?)`
As you can see, most of those who played The Witcher 3 stuck with it and played 40+ hours, though there were a few users it didn’t grab, who stopped playing after only a few hours. The EM algorithm does a good job of finding groups of people with similar playing habits, who would likely rate the game in a similar way.
Let’s create the user-item matrix, with users as rows and games as columns. Missing values are set to zero, and the observed values are the log hours for each observed user-game combination.
We restrict the data to games with more than 50 users and to user-game pairs with more than 2 hours of play time. Two hours is the threshold below which Steam will refund a purchase if the user did not like the game.
# create user item matrix
set.seed(100)
game_freq$game1 <- gsub("[^a-zA-Z0-9]", "", game_freq$game) # Cleaning the game names a bit so it's easier to use them.
game_users <- subset(game_freq, game_freq$numusers > 50)
steam_clean_pos <- subset(steam_clean, steam_clean$hrs > 2 & (steam_clean$game1 %in% game_users$game1))
steam_clean_pos$loghrs <- log(steam_clean_pos$hrs)
# make matrix
games <- data.frame(game1 = sort(unique(steam_clean_pos$game1)), game_id = 1:length(unique(steam_clean_pos$game1)))
users <- data.frame(user = sort(unique(steam_clean_pos$user)), user_id = 1:length(unique(steam_clean_pos$user)))
steam_clean_pos <- merge(steam_clean_pos, games, by = 'game1')
steam_clean_pos <- merge(steam_clean_pos, users, by = 'user')
ui_mat <- matrix(0, nrow = nrow(users), ncol = nrow(games),
                 dimnames = list(user = paste0("u", sort(unique(steam_clean_pos$user))),
                                 game = sort(unique(steam_clean_pos$game1))))
for(k in 1:nrow(steam_clean_pos)){
  ui_mat[steam_clean_pos$user_id[k], steam_clean_pos$game_id[k]] <- steam_clean_pos$loghrs[k]
}
To create a test set, 20% of the observed values will be set to 0. The root mean squared error (RMSE) on these held-out values will be used to measure accuracy.
# create training set i.e. suppress 20% of the actual ratings
index <- sample.split(steam_clean_pos$user, SplitRatio = 0.8)
train <- steam_clean_pos[index,]
test <- steam_clean_pos[!index,]
ui_train <- ui_mat
for(k in 1:nrow(test)){
  ui_train[test$user_id[k], test$game_id[k]] <- 0
}
# root mean squared error function
rmse <- function(pred, test, data_frame = FALSE){
  test_pred <- rep(NA, nrow(test))
  for(k in 1:nrow(test)){
    test_pred[k] <- pred[test$user_id[k], test$game_id[k]]
  }
  if(data_frame){
    # return predicted vs actual log hours for inspection
    return(data.frame(test_pred, test$loghrs))
  }
  return(sqrt(1/(nrow(test)-1)*sum((test_pred - test$loghrs)^2)))
}
cat("Dimensions of training user-item matrix:", dim(ui_train))
## Dimensions of training user-item matrix: 8214 498
Basic SVD
The basic SVD approach performs matrix factorisation using the first 50 leading components. Since the missing values are set to 0, the factorisation will try to recreate them, which is not quite what we want. For this example we simply impute the missing observations with the column mean.
Y <- ui_train
# mean impute: replace the zeros (unobserved entries) with the column mean
Y <- apply(Y, 2, function(x) ifelse(x == 0, mean(x), x))
Y_svd <- svd(Y)
U <- Y_svd$u
V <- Y_svd$v
D <- Y_svd$d
ggplot(data.frame(x = 1:length(D), y = D/sum(D)), aes(x = x, y = y)) +
  geom_line() +
  labs(x = "Leading component", y = "")
# First 50 leading components
lc <- 50
pred <- U[,1:lc] %*% diag(D[1:lc]) %*% t(V[,1:lc])
rmse(pred, test)
## [1] 2.977975
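The predicted-versus-actual comparison below appears to come from the data_frame option of the rmse() helper, for example:
head(rmse(pred, test, data_frame = TRUE))  # compare predictions with held-out log hours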
## test_pred test.loghrs
## 1 -0.04245722 4.127134
## 2 0.13295874 5.602119
## 3 0.16577710 3.218876
## 4 1.00221468 2.272126
## 5 0.09393850 1.280934
## 6 0.48019680 5.105945
Not the best prediction, but this is our first stab at it.
SVD via gradient descent
Now we will use a gradient descent approach to find U and V matrices that retain the actual observations while predicting the missing values by drawing on the information shared between similar users and games. We choose a learning rate of 0.001 and run for 500 iterations, tracking the RMSE. The objective function is the squared error between the observed values and the predicted values. The U and V matrices are initialised with random draws from a normal distribution with mean 0 and standard deviation 0.02.
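Written out, the objective and its gradients (corresponding directly to the f, dfu, and dfv functions defined below, where I is the indicator matrix of observed entries and \circ is the element-wise product) are:
f(U, V) = \sum_{i,j} I_{ij}\left((UV^{T})_{ij} - Y_{ij}\right)^2
\nabla_U f = 2\left(I \circ (UV^{T} - Y)\right)V
\nabla_V f = 2\left(I \circ (UV^{T} - Y)\right)^{T}U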
leading_components <- 50
Y <- ui_train
I <- apply(Y, 2, function(x) ifelse(x > 0, 1, 0))  # indicator matrix of observed entries
U <- matrix(rnorm(nrow(Y)*leading_components, 0, 0.02), ncol = leading_components)
V <- matrix(rnorm(ncol(Y)*leading_components, 0, 0.02), ncol = leading_components)
# objective function
f <- function(U, V){
  return(sum(I*(U%*%t(V)-Y)^2))
}
# gradient with respect to U
dfu <- function(U){
  return((2*I*(U%*%t(V)-Y))%*%V)
}
# gradient with respect to V
dfv <- function(V){
  return(t(2*I*(U%*%t(V)-Y))%*%U)
}
Let’s run the gradient descent.
# gradient descent
N <- 500
alpha <- 0.001
pred <- round(U%*%t(V), 2)
fobj <- f(U, V)
rmsej <- rmse(pred, test)
pb <- txtProgressBar(min = 0, max = N, style = 3)
start <- Sys.time()
for(k in 1:N){
  U <- U - alpha*dfu(U)
  V <- V - alpha*dfv(V)
  fobj <- c(fobj, f(U, V))
  pred <- round(U%*%t(V), 2)
  rmsej <- c(rmsej, rmse(pred, test))
  setTxtProgressBar(pb, k)  # update the progress bar (missing from the original chunk)
}
close(pb)
Sys.time() - start  # elapsed time, printed below
## Time difference of 3.9513 mins
path1 <- data.frame(itr = 1:(N+1), fobj, fobjp = fobj/max(fobj), rmse = rmsej, rmsep = rmsej/max(rmsej))
path1gg <- melt(path1[c("itr", "fobjp", "rmsep")], id.vars = "itr")
ggplot(path1gg, aes(itr, value, color = variable)) + geom_line()
dimnames(pred) <- list(user = rownames(ui_train), game = colnames(ui_train))
# Final iteration
tail(path1, 1)
## itr fobj fobjp rmse rmsep
## 501 501 61.19801 0.0001875766 1.533043 0.4666799
An improvement on the basic SVD approach. The output shows the objective function converging towards 0 on the training data, while the RMSE on the test set roughly halved, from about 2.98 to 1.53.
Using the predicted user-item matrix, let’s look at the distribution of hours again for 3 randomly chosen games and apply the EM algorithm to find a reasonable 1-5 star rating.
As before, it is reasonable to base the rating on time played.
game_hrs_density_p <- function(pred, GAME = NULL, nclass, print_vals = TRUE){
  if(is.null(GAME)){
    GAME <- sample(colnames(pred), 1)
  }
  # subsetting data
  game_data <- subset(pred[,GAME], pred[,GAME] > 0)
  # em algorithm
  mu.init <- seq(min(game_data), max(game_data), length = nclass)
  EM <- normalmixEM(game_data, mu = mu.init, sigma = rep(1, nclass), fast = TRUE)
  # print results
  if(print_vals){
    cat(" lambda: ", EM$lambda, "\n mean : ", EM$mu, "\n sigma : ", EM$sigma, "\n")
  }
  # building data frame for plotting
  x <- seq(min(game_data), max(game_data), 0.01)
  dens <- data.frame(x = x)
  for(k in 1:nclass){
    dens[, paste0('y', k)] <- EM$lambda[k]*dnorm(x, EM$mu[k], EM$sigma[k])
  }
  dens <- melt(dens, 'x', variable.name = 'gaussian')
  game_plt <- ggplot(as.data.frame(game_data), aes(x = game_data)) +
    geom_histogram(aes(y = ..density..), bins = 45, colour = "black", alpha = 0.7, size = 0.1) +
    geom_area(data = dens, aes(x = x, y = value, fill = gaussian), alpha = 0.5, position = position_dodge()) +
    geom_density(linetype = 2, size = 0.1) +
    ggtitle(GAME)
  return(game_plt)
}
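The three fits below were presumably produced by repeated calls with a randomly chosen game, along the lines of:
# illustrative calls -- with GAME = NULL the function picks a random game from the predicted matrix
game_hrs_density_p(pred, nclass = 5)
game_hrs_density_p(pred, nclass = 5)
game_hrs_density_p(pred, nclass = 5)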
## number of iterations= 738
## lambda: 0.1876481 0.07253394 0.4557727 0.1520379 0.1320074
## mean : 1.655965 2.842604 3.96708 4.593527 5.090148
## sigma : 0.4276102 0.2566956 0.4744549 0.1506311 0.2736574
## Warning: Width not defined. Set with `position_dodge(width = ?)`
## number of iterations= 802
## lambda: 0.05115075 0.4510233 0.4120464 0.07204703 0.0137326
## mean : 0.8685749 1.868877 3.802059 5.718908 7.16823
## sigma : 0.08624539 0.5096392 0.6166028 0.4771217 0.2627739
## Warning: Width not defined. Set with `position_dodge(width = ?)`
## number of iterations= 327
## lambda: 0.3853599 0.09297315 0.07272505 0.1986953 0.2502466
## mean : 1.393153 4.077467 2.639323 3.40177 4.45051
## sigma : 0.3988079 0.01170797 0.05984549 0.2915284 0.4708832
## Warning: Width not defined. Set with `position_dodge(width = ?)`
Perhaps not quite as appropriate this time, as the new predictions create a much denser distribution. The components for ratings 2-4 still look like they fit fairly well; the 5 component, on the other hand, is rather flat and only picks up the very end of the tail.
Now let’s use a percentile approach to recommend the top 10 games for a user, restricted to games they haven’t purchased.
pred_percentile <- apply(pred, 2, percent_rank)
top <- function(n, user = NULL){
  if(is.null(user)){
    user <- sample(rownames(pred_percentile), 1)
  }
  not_purchased <- (I-1)%%2  # flip the purchase indicator: 1 where the game was not purchased
  top_games <- names(sort((pred_percentile*not_purchased)[user,], decreasing = TRUE))[1:n]
  cat("Top", n, "Recommended games for user", user, ":\n")
  for(k in 1:n){
    cat(k, ")", top_games[k], "\n")
  }
}
top(10)
## Top 10 Recommended games for user u242039522 :
## 1 ) CounterStrikeConditionZero
## 2 ) PathofExile
## 3 ) SniperEliteV2
## 4 ) GuacameleeGoldEdition
## 5 ) HalfLife2EpisodeOne
## 6 ) PokerNightattheInventory
## 7 ) Sacred2Gold
## 8 ) FreeStyle2StreetBasketball
## 9 ) TheLordoftheRingsWarintheNorth
## 10 ) Warframe
Overall this was an interesting problem. SVD would definitely take a long time to train as the dataset grows, and my Docker container’s CPU was always close to 100%. For use cases like this, cloud/distributed solutions such as Spark would be ideal.