Final Project

Background Case

Steam is the world’s most popular PC Gaming hub, with over 6,000 games and a community of millions of gamers. With a massive collection that includes everything from AAA blockbusters to small indie titles, great discovery tools are a highly valuable asset for Steam.

As a steam user, I have noticed that steam has built up their recommendation system. It appears in the home page with the tab called “For You”.

This dataset is a list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are ‘purchase’ and ‘play’. The value indicates the degree to which the behavior was performed - in the case of ‘purchase’ the value is always 1, and in the case of ‘play’ the value represents the number of hours the user has played the game.

I will try to use SVD approach in this case. Had some recommender system related issues with NA values.

Load required Libraries

library(dplyr)
library(ggplot2)
library(mixtools)
library(caTools)
library(recommenderlab)
library(reshape2)

Download Data File

kaggle datasets download -d tamber/steam-video-games --force
unzip -o steam-video-games.zip

## 
  0%|          | 0.00/1.46M [00:00<?, ?B/s]
 68%|██████▊   | 1.00M/1.46M [00:00<00:00, 6.57MB/s]
100%|██████████| 1.46M/1.46M [00:00<00:00, 7.13MB/s]
## Downloading steam-video-games.zip to /localdisk/Data-612/Final
## 
## Archive:  steam-video-games.zip
##   inflating: steam-200k.csv

Analyse Data

The dataset doesn’t contain the ratings of each game, we can define the users’ preferences on the games by how many hours they have spent in the game.

steam <- read.csv('steam-200k.csv', header = FALSE)
head(steam)

##          V1                         V2       V3    V4 V5
## 1 151603712 The Elder Scrolls V Skyrim purchase   1.0  0
## 2 151603712 The Elder Scrolls V Skyrim     play 273.0  0
## 3 151603712                  Fallout 4 purchase   1.0  0
## 4 151603712                  Fallout 4     play  87.0  0
## 5 151603712                      Spore purchase   1.0  0
## 6 151603712                      Spore     play  14.9  0

colnames(steam) <- c('user', 'game', 'purchase_play', 'hrs') # Adding Column names
head(steam)

##        user                       game purchase_play   hrs NA
## 1 151603712 The Elder Scrolls V Skyrim      purchase   1.0  0
## 2 151603712 The Elder Scrolls V Skyrim          play 273.0  0
## 3 151603712                  Fallout 4      purchase   1.0  0
## 4 151603712                  Fallout 4          play  87.0  0
## 5 151603712                      Spore      purchase   1.0  0
## 6 151603712                      Spore          play  14.9  0

steam_clean <- steam

Are we missing values?

apply(steam_clean, 2, function(x) sum(is.na(x)))

##          user          game purchase_play           hrs          <NA> 
##             0             0             0             0             0

I am splitting out the purchase/play column into two columns purchase and play.

steam_clean$purchase <- sapply(steam_clean$purchase_play, function(x) as.numeric(x == 'purchase'))
steam_clean$play <- sapply(steam_clean$purchase_play, function(x) as.numeric(x == 'play'))
steam_clean$hrs <- steam_clean$hrs-steam_clean$purchase
steam_clean <- steam_clean[,-3]
steam_clean <- aggregate(. ~ user + game, data = steam_clean, FUN = 'sum') # Aggregating users on game
head(steam_clean)

##       user                                                       game hrs NA
## 1 46055854                                                007 Legends 0.7  0
## 2 11940338                                                  0RBITALIS 0.6  0
## 3 86055705                                                  0RBITALIS 0.3  0
## 4 93030550                                                  0RBITALIS 0.3  0
## 5 11794760 1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 0.0  0
## 6 35701646 1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 0.0  0
##   purchase play
## 1        1    1
## 2        1    1
## 3        1    1
## 4        1    1
## 5        1    0
## 6        1    0

Lets find how many games were purchased by users.

numgames <- length(unique(steam_clean$game))

numusers <- length(unique(steam_clean$user))

cat("Total:- ", numgames, "games purchased by:- ", numusers, "users")

## Total:-  5155 games purchased by:-  12393 users

Let’s find out the top 20 games with the most users and hours played.

game_total_hrs <- aggregate(hrs ~ game, data = steam_clean, FUN = 'sum')
game_total_hrs <- game_total_hrs[order(game_total_hrs$hrs, decreasing = TRUE),]
most_played_games <- head(game_total_hrs, 20)
most_played_games <- data.frame(game = most_played_games$game, hrs = most_played_games$hrs)
head(most_played_games, 20)

##                                           game      hrs
## 1                                       Dota 2 981684.6
## 2              Counter-Strike Global Offensive 322771.6
## 3                              Team Fortress 2 173673.3
## 4                               Counter-Strike 134261.1
## 5                   Sid Meier's Civilization V  99821.3
## 6                        Counter-Strike Source  96075.5
## 7                   The Elder Scrolls V Skyrim  70889.3
## 8                                  Garry's Mod  49725.3
## 9  Call of Duty Modern Warfare 2 - Multiplayer  42009.9
## 10                               Left 4 Dead 2  33596.7
## 11                       Football Manager 2013  32308.6
## 12                       Football Manager 2012  30845.8
## 13                       Football Manager 2014  30574.8
## 14                                    Terraria  29951.8
## 15                                    Warframe  27074.6
## 16                       Football Manager 2015  24283.1
## 17                                      Arma 3  24055.7
## 18                          Grand Theft Auto V  22956.7
## 19                               Borderlands 2  22667.9
## 20                            Empire Total War  21030.3

Lets find game with the highest number of users

game_freq <- data.frame(sort(table(steam_clean$game), decreasing = TRUE))
colnames(game_freq) <- c("game", "numusers")
top20 <- merge(game_freq, game_total_hrs, by = 'game')
top20 <- head(top20[order(top20$numusers, decreasing = TRUE),], 20)
top20

##                                              game numusers      hrs
## 1333                                       Dota 2     4841 981684.6
## 4256                              Team Fortress 2     2323 173673.3
## 4822                                     Unturned     1563  16096.4
## 971               Counter-Strike Global Offensive     1412 322771.6
## 2077                       Half-Life 2 Lost Coast      981    184.4
## 974                         Counter-Strike Source      978  96075.5
## 2475                                Left 4 Dead 2      951  33596.7
## 968                                Counter-Strike      856 134261.1
## 4922                                     Warframe      847  27074.6
## 2074                       Half-Life 2 Deathmatch      823   3712.9
## 1888                                  Garry's Mod      731  49725.3
## 4368                   The Elder Scrolls V Skyrim      717  70889.3
## 3554                                    Robocraft      689   9096.6
## 969                 Counter-Strike Condition Zero      679   7950.0
## 970  Counter-Strike Condition Zero Deleted Scenes      679    418.2
## 2145                            Heroes & Generals      658   3299.5
## 2073                                  Half-Life 2      639   4260.3
## 3833                   Sid Meier's Civilization V      596  99821.3
## 4917                                  War Thunder      590  14381.6
## 3239                                       Portal      588   2282.8

ggplot(top20, aes(x = game, y = numusers, fill = hrs)) + 
    geom_bar(stat = "identity") + 
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
    labs(title = "Top 20 games with the most users", x = "Game Name", y = "Num of users")

“Dota 2” game has the highest number of players and the highest number of total hours played so undeniably the most popular game. Where as other games are much lower in played hrs numbers.

Lets find how many users purchased games but did not play

purchased_not_played <- subset(steam_clean, purchase == 1 & play == 0)
nTransactions <-nrow(purchased_not_played)
numusers <- length(unique(purchased_not_played$user))
nPurchased <- nrow(subset(steam_clean, purchase == 1))
cat("There are", nTransactions, "games purchased out of", nPurchased, "that have not been played by", numusers, "users")

## There are 57904 games purchased out of 128097 that have not been played by 5949 users

That’s a lot of games that haven’t been played. We cant rate games and recommendation only on bought numbers.

It seems reasonable to turn the distribution of hours played for a game into a rating, at least it’s a reasonable way to view the reception of the game. For example a bad game will have a large distribution around the 0-1 hours and a game which was well received will have a distribution of something bigger.

We will use the EM algorithm to highlight 5 groups which can be considered 1-5 star ratings.

Of course this might not be true, for example a user may only play the game for an average number of hours, but love it and would rate it a 5. The EM algorithm will be more appropriate than simply breaking the data into percentiles.

steam_clean$game1 <- gsub("[^a-zA-Z0-9]", "", steam_clean$game) # Cleaning up the game names. It will be easier to use the names of the users.
head(steam_clean)

##       user                                                       game hrs NA
## 1 46055854                                                007 Legends 0.7  0
## 2 11940338                                                  0RBITALIS 0.6  0
## 3 86055705                                                  0RBITALIS 0.3  0
## 4 93030550                                                  0RBITALIS 0.3  0
## 5 11794760 1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 0.0  0
## 6 35701646 1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby) 0.0  0
##   purchase play                               game1
## 1        1    1                          007Legends
## 2        1    1                           0RBITALIS
## 3        1    1                           0RBITALIS
## 4        1    1                           0RBITALIS
## 5        1    0 123KICKITDropThatBeatLikeanUglyBaby
## 6        1    0 123KICKITDropThatBeatLikeanUglyBaby

It seems reasonable to create rating based on how long the game was played. Will consider the games that have more than 2 hrs of playing hours registered.

game_hrs_density <- function(GAME, nclass, print_vals = TRUE){
  # subsetting data
  game_data <- subset(steam_clean, game1 == GAME & hrs > 2)
  game_data$loghrs <- log(game_data$hrs)
  
  # em algorithm
  mu.init <- seq(min(game_data$loghrs), max(game_data$loghrs), length = nclass)
  EM <- normalmixEM(game_data$loghrs, mu = mu.init, sigma=rep(1, nclass))
  
  # print results
  if(print_vals){
    cat(" lambda: ", EM$lambda, "\n mean  : ", EM$mu, "\n sigma : ", EM$sigma, "\n")
  }
  
  # building data frame for plotting
  x <- seq(min(game_data$loghrs), max(game_data$loghrs), 0.01)
  dens <- data.frame(x = x)
  for(k in 1:nclass){
    dens[,paste0('y', k)] <- EM$lambda[k]*dnorm(x, EM$mu[k], EM$sigma[k])
  }
  
  dens <- melt(dens, 'x', variable.name = 'gaussian')
  game_plt <- ggplot(game_data, aes(x = loghrs)) + 
    geom_histogram(aes(y = ..density..), bins = 25, colour = "black", alpha = 0.7, size = 0.1) +
    geom_area(data = dens, aes(x = x, y = value, fill = gaussian), alpha = 0.5, position = position_dodge()) +
    geom_density(linetype = 2, size = 0.1) + 
    ggtitle(game_data$game[1])

  return(game_plt)
}

Lets find some sample game hours of some of the games selected at random.

game_hrs_density("TheWitcher3WildHunt", 5, print_vals = TRUE)

## number of iterations= 937 
##  lambda:  0.1817628 0.03697299 0.04630347 0.2034308 0.5315299 
##  mean  :  1.625785 2.57379 2.998076 3.663856 4.553853 
##  sigma :  0.3996334 0.06765078 0.1151694 0.3322277 0.4540627

## Warning: Width not defined. Set with `position_dodge(width = ?)`

game_hrs_density("WarThunder", 5, print_vals = TRUE)

## number of iterations= 800 
##  lambda:  0.0511508 0.4510223 0.4120477 0.07204664 0.01373261 
##  mean  :  0.8685749 1.868876 3.802058 5.718911 7.16823 
##  sigma :  0.08624541 0.5096381 0.6166057 0.4771198 0.2627739

## Warning: Width not defined. Set with `position_dodge(width = ?)`

game_hrs_density("FINALFANTASYVII", 5, print_vals = TRUE)

## number of iterations= 206 
##  lambda:  0.3853429 0.07186484 0.2351439 0.2303342 0.07731413 
##  mean  :  1.39312 2.638814 3.456927 4.208118 5.030528 
##  sigma :  0.3987871 0.05963827 0.3235782 0.1881075 0.1686162

## Warning: Width not defined. Set with `position_dodge(width = ?)`

As you can see most of those who played the The Witcher 3 stuck with it and played 40+ hours. However there were a few users where The Witcher didn’t grab them and stopped playing after a few hours. The EM algorithm does a great job finding the groups of people with similar gaming habits and would potentially rate the game in a similar way.

Lets create user-item matrix is created with the users being the rows and games being the columns. The missing values are set to zero. The observed values are the log hours for each observed user-game combination.

Will consider data subset to games which have greater than 50 users and users which played the game for greater than 2 hours. 2 hours is the limit under which Steam will offer a return if user did not like the purchased game.

# create user item matrix
set.seed(100)
game_freq$game1 <- gsub("[^a-zA-Z0-9]", "", game_freq$game) # Cleaning the game names a bit so its easier to use their names.
game_users <- subset(game_freq, game_freq$numusers > 50)
steam_clean_pos <- subset(steam_clean, steam_clean$hrs > 2 & (steam_clean$game1 %in% game_users$game1))
steam_clean_pos$loghrs <- log(steam_clean_pos$hrs)

# make matrix
games <- data.frame(game1 = sort(unique(steam_clean_pos$game1)), game_id = 1:length(unique(steam_clean_pos$game1)))
users <- data.frame(user = sort(unique(steam_clean_pos$user)), user_id = 1:length(unique(steam_clean_pos$user)))
steam_clean_pos <- merge(steam_clean_pos, games, by = 'game1')
steam_clean_pos <- merge(steam_clean_pos, users, by = 'user')
ui_mat <- matrix(0, nrow = nrow(users), ncol = nrow(games), dimnames = list(user = paste0("u", sort(unique(steam_clean_pos$user))), 
                                                                            game = sort(unique(steam_clean_pos$game1))))
for(k in 1:nrow(steam_clean_pos)){
  ui_mat[steam_clean_pos$user_id[k], steam_clean_pos$game_id[k]] <- steam_clean_pos$loghrs[k]
}

To create a test set 20% of the observed values will be set to 0. The root mean squared error will be calculated to determine the accuracy.

# create training set i.e. suppress a 20th of the actual ratings
index <- sample.split(steam_clean_pos$user, SplitRatio = 0.8)
train <- steam_clean_pos[index,]
test <- steam_clean_pos[!index,]
ui_train <- ui_mat
for(k in 1:nrow(test)){
  ui_train[test$user_id[k], test$game_id[k]] <- 0
}


# root mean squared error function
rmse <- function(pred, test, data_frame = FALSE){
  test_pred <- rep(NA, nrow(test))
  for(k in 1:nrow(test)){
    test_pred[k] <- pred[test$user_id[k], test$game_id[k]]
  }
  if(data_frame){
    return(data.frame(test_pred, test$loghrs))
  }
  return(sqrt(1/(nrow(test)-1)*sum((test_pred - test$loghrs)^2)))
}

cat("Dimensions of training user-item matrix:", dim(ui_train))

## Dimensions of training user-item matrix: 8214 498

Basic SVD

The basic SVD approach will perform matrix factorisation using the first 60 leading components. Since the missing values are set to 0 the factorisation will try and recreate them which is not quite what we want. For this example we will simply impute the missing observations with a mean value.

Y <- ui_train

# mean impute
Y <- apply(Y, 2, function(x) ifelse(x == 0, mean(x), x))
Y_svd <- svd(Y)
U <- Y_svd$u
V <- Y_svd$v
D <- Y_svd$d
ggplot(data.frame(x = 1:length(D), y = D/sum(D)), aes(x = x, y = y)) + 
    geom_line() + 
    labs(x = "Leading component", y = "")

# First 60 leading components
lc <- 50
pred <- U[,1:lc] %*% diag(D[1:lc]) %*% t(V[,1:lc])

rmse(pred, test)

## [1] 2.977975

head(rmse(pred, test, TRUE))

##     test_pred test.loghrs
## 1 -0.04245722    4.127134
## 2  0.13295874    5.602119
## 3  0.16577710    3.218876
## 4  1.00221468    2.272126
## 5  0.09393850    1.280934
## 6  0.48019680    5.105945

Not the best prediction. But this is our first stab at it.

SVD via gradient descent

Now we will use the gradient descent approach to find optimal U and V matrices which retain the actual observations with predict the missing values by drawing on the information between similar users and games. We have chosen a learning rate of 0.001 and will run for 500 iterations tracking the RMSE. The objective function is the squared error between the actual observed values and the predicted values. The U and V matrices are initialised with a random draw from a ~N(0, 0.02) distibution.

leading_components <- 50
Y <- ui_train
I <- apply(Y, 2, function(x) ifelse(x>0, 1, 0))
U <- matrix(rnorm(nrow(Y)*leading_components, 0, 0.02), ncol = leading_components)
V <- matrix(rnorm(ncol(Y)*leading_components, 0, 0.02), ncol = leading_components)

# objective function
f <- function(U, V){
  return(sum(I*(U%*%t(V)-Y)^2))
}
dfu <- function(U){
  return((2*I*(U%*%t(V)-Y))%*%V)
}
dfv <- function(V){
  return(t(2*I*(U%*%t(V)-Y))%*%U)
}

Lets draw out a gradient descent.

# gradient descent
N <- 500
alpha <- 0.001
pred <- round(U%*%t(V), 2)
fobj <- f(U, V)
rmsej <- rmse(pred, test)
pb <- txtProgressBar(min = 0, max = N, style = 3)

## 
  |                                                                            
  |                                                                      |   0%

start <- Sys.time()
for(k in 1:N){
  U <- U - alpha*dfu(U)
  V <- V - alpha*dfv(V)
  fobj <- c(fobj, f(U, V))
  pred <- round(U%*%t(V), 2)
  rmsej <- c(rmsej, rmse(pred, test))
  
}
close(pb)

Sys.time()-start

## Time difference of 3.9513 mins

path1 <- data.frame(itr = 1:(N+1), fobj, fobjp = fobj/max(fobj), rmse = rmsej, rmsep = rmsej/max(rmsej))
path1gg <- melt(path1[c("itr", "fobjp", "rmsep")], id.vars = "itr")
ggplot(path1gg, aes(itr, value, color = variable)) + geom_line()

dimnames(pred) <- list(user = rownames(ui_train), game = colnames(ui_train))

# Final iteration
tail(path1, 1)

##     itr     fobj        fobjp     rmse     rmsep
## 501 501 61.19801 0.0001875766 1.533043 0.4666799

An improvement on the basic SVD approach. The output shows the objective function converged to 0 on the training data, while the error in the test set essentially halved.

Using the predicted user-item matrix, let’s look at the distribution of hours again for 3 randomly chosen games and apply an EM algoritm to find a reasonable 1-5 star rating

Its reasonable to create a rating based on time played

game_hrs_density_p <- function(pred, GAME = NULL, nclass, print_vals = TRUE){
  
  if(is.null(GAME)){
    GAME <- sample(colnames(pred), 1)
  }
  
  # subsetting data
  game_data <- subset(pred[,GAME], pred[,GAME] > 0) 
  
  # em algorithm
  mu.init <- seq(min(game_data), max(game_data), length = nclass)
  EM <- normalmixEM(game_data, mu = mu.init, sigma=rep(1, nclass), fast = TRUE)
  
  # print results
  if(print_vals){
    cat(" lambda: ", EM$lambda, "\n mean  : ", EM$mu, "\n sigma : ", EM$sigma, "\n")
  }
  
  # building data frame for plotting
  x <- seq(min(game_data), max(game_data), 0.01)
  dens <- data.frame(x = x)
  for(k in 1:nclass){
    dens[,paste0('y', k)] <- EM$lambda[k]*dnorm(x, EM$mu[k], EM$sigma[k])
  }
  
  dens <- melt(dens, 'x', variable.name = 'gaussian')
  game_plt <- ggplot(as.data.frame(game_data), aes(x = game_data)) + 
    geom_histogram(aes(y = ..density..), bins = 45, colour = "black", alpha = 0.7, size = 0.1) +
    geom_area(data = dens, aes(x = x, y = value, fill = gaussian), alpha = 0.5, position = position_dodge()) +
    geom_density(linetype = 2, size = 0.1) + 
    ggtitle(GAME)

  return(game_plt)
}

game_hrs_density("TheWitcher3WildHunt", 5, print_vals = TRUE)

## number of iterations= 738 
##  lambda:  0.1876481 0.07253394 0.4557727 0.1520379 0.1320074 
##  mean  :  1.655965 2.842604 3.96708 4.593527 5.090148 
##  sigma :  0.4276102 0.2566956 0.4744549 0.1506311 0.2736574

## Warning: Width not defined. Set with `position_dodge(width = ?)`

game_hrs_density("WarThunder", 5, print_vals = TRUE)

## number of iterations= 802 
##  lambda:  0.05115075 0.4510233 0.4120464 0.07204703 0.0137326 
##  mean  :  0.8685749 1.868877 3.802059 5.718908 7.16823 
##  sigma :  0.08624539 0.5096392 0.6166028 0.4771217 0.2627739

## Warning: Width not defined. Set with `position_dodge(width = ?)`

game_hrs_density("FINALFANTASYVII", 5, print_vals = TRUE)

## number of iterations= 327 
##  lambda:  0.3853599 0.09297315 0.07272505 0.1986953 0.2502466 
##  mean  :  1.393153 4.077467 2.639323 3.40177 4.45051 
##  sigma :  0.3988079 0.01170797 0.05984549 0.2915284 0.4708832

## Warning: Width not defined. Set with `position_dodge(width = ?)`

Perhaps not quite as appropriate this time as all the new predictions create a dense distribution. However the 2-4 distributions look like they fit fairly well. The 5 on the otherhand is rather flat and only picks up the very end of the tail.

Now let’s use a percentile approach to recommend the top 10 games for a user who hasn’t purchased

pred_percentile <- apply(pred, 2, percent_rank)
top <- function(n, user = NULL){
  if(is.null(user)){
    user <- sample(rownames(pred_percentile), 1)
  }
  not_purchased <- (I-1)%%2
  top_games <- names(sort((pred_percentile*not_purchased)[user,], decreasing = TRUE))[1:n]
  cat("Top", n, "Recommended games for user", user, ":\n")
  for(k in 1:n){
    cat(k, ")", top_games[k], "\n")
  }
}
top(10)

## Top 10 Recommended games for user u242039522 :
## 1 ) CounterStrikeConditionZero 
## 2 ) PathofExile 
## 3 ) SniperEliteV2 
## 4 ) GuacameleeGoldEdition 
## 5 ) HalfLife2EpisodeOne 
## 6 ) PokerNightattheInventory 
## 7 ) Sacred2Gold 
## 8 ) FreeStyle2StreetBasketball 
## 9 ) TheLordoftheRingsWarintheNorth 
## 10 ) Warframe

Overall this was an interesting problem. SVD Would definitely take a long time to train when the dataset is bit and my docker container’s CPU was always close to 100%. For these usecases cloud/remote solutions like Spark would be ideal.

Deleting the files from local system

rm -f steam-200k.csv steam-video-games.zip

data-612-Final-Project

Ashish Kumar

07/13/2020