1 Background

Steam is one of the largest online stores for PC games; another well-known one is Good Old Games (GOG). Thousands of games of various genres are sold on Steam. Visitors to the Steam store can see how many positive and negative ratings a game has and whether its reviews have been mostly positive or negative. The store interface does not tell a visitor how many people have downloaded and played a certain game, although there are metrics that identify the most popular games on Steam. It would be interesting to see whether we can predict if a game is popular based on factors such as price, playtime, and genre, using a Kaggle dataset.

2 Dataset Description

The dataset came from here. There are several files there, but we will focus only on the “steam.csv” file. The author of the dataset stated that some non-game software may still be included, so we will have to do some cleaning. The dataset covers games from 1997–2019, so it is a bit out of date but still relevant for our purposes.
The descriptions of the columns in the file are as follows:

  • appid : the ID number of the app
  • name : the title of the game
  • release_date : the date of the game’s release
  • english : whether the game is available in English (1 = yes, 0 = no)
  • developer : the name of the game’s developer
  • publisher : the name of the game’s publisher
  • platforms : the operating systems the game runs on (windows, mac, or linux)
  • required_age : the minimum age required before you can play this game
  • categories : whether the game is single-player, multi-player, or online
  • genres : the genre of the game, if it is not a game then what kind of software it is
  • steamspy_tags : the genre of the games as categorized by steamspy
  • achievements : the number of achievements of a game
  • positive_ratings : number of positive ratings of a game
  • negative_ratings : number of negative ratings of a game
  • average_playtime : the average playtime in minutes
  • median_playtime : the median playtime in minutes
  • owners : the number of people who own the game; this is a categorical range rather than a number
  • price : the price of the game in pounds sterling (GBP); this will be converted to USD

3 Importing the data

# libraries used throughout this report
library(tidyverse) # dplyr, tidyr, stringr, ggplot2
library(lubridate) # date handling (ymd, year)
library(rsample)   # initial_split, training, testing
library(caret)     # downSample, train, confusionMatrix, varImp
library(partykit)  # ctree
library(ROCR)      # prediction, performance

steam_raw <- read.csv("steam.csv")
glimpse(steam_raw)
#> Rows: 27,075
#> Columns: 18
#> $ appid            <int> 10, 20, 30, 40, 50, 60, 70, 80, 130, 220, 240, 280, 3…
#> $ name             <chr> "Counter-Strike", "Team Fortress Classic", "Day of De…
#> $ release_date     <chr> "2000-11-01", "1999-04-01", "2003-05-01", "2001-06-01…
#> $ english          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ developer        <chr> "Valve", "Valve", "Valve", "Valve", "Gearbox Software…
#> $ publisher        <chr> "Valve", "Valve", "Valve", "Valve", "Valve", "Valve",…
#> $ platforms        <chr> "windows;mac;linux", "windows;mac;linux", "windows;ma…
#> $ required_age     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ categories       <chr> "Multi-player;Online Multi-Player;Local Multi-Player;…
#> $ genres           <chr> "Action", "Action", "Action", "Action", "Action", "Ac…
#> $ steamspy_tags    <chr> "Action;FPS;Multiplayer", "Action;FPS;Multiplayer", "…
#> $ achievements     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 33, 147, 0, 54, 0, 0, 0, 1…
#> $ positive_ratings <int> 124534, 3318, 3416, 1273, 5250, 2758, 27755, 12120, 3…
#> $ negative_ratings <int> 3339, 633, 398, 267, 288, 684, 1100, 1439, 420, 2419,…
#> $ average_playtime <int> 17612, 277, 187, 258, 624, 175, 1300, 427, 361, 691, …
#> $ median_playtime  <int> 317, 62, 34, 184, 415, 10, 83, 43, 205, 402, 400, 214…
#> $ owners           <chr> "10000000-20000000", "5000000-10000000", "5000000-100…
#> $ price            <dbl> 7.19, 3.99, 3.99, 3.99, 3.99, 3.99, 7.19, 7.19, 3.99,…
head(steam_raw)

There are 27,075 observations and 18 columns. The appid column needs to be dropped because it is useless as a predictor. The owners column needs to be translated into popularity levels for easier filtering. The english column needs to be converted to factor type, and all the character-type columns except name need to be converted to factors as well. The price column is in GBP (pounds sterling); we will convert it to the more recognizable US dollars (USD).

4 EDA

The glimpse into our dataset shows that some columns consist of combinations of three or more unique values separated by semicolons. Let's find out what values are present in the genres column.

unique(unlist(strsplit(as.character(steam_raw$genres), ";")))
#>  [1] "Action"                "Free to Play"          "Strategy"             
#>  [4] "Adventure"             "Indie"                 "RPG"                  
#>  [7] "Animation & Modeling"  "Video Production"      "Casual"               
#> [10] "Simulation"            "Racing"                "Violent"              
#> [13] "Massively Multiplayer" "Nudity"                "Sports"               
#> [16] "Early Access"          "Gore"                  "Utilities"            
#> [19] "Design & Illustration" "Web Publishing"        "Education"            
#> [22] "Software Training"     "Sexual Content"        "Audio Production"     
#> [25] "Game Development"      "Photo Editing"         "Accounting"           
#> [28] "Documentary"           "Tutorial"

We see here that some of these genres do not belong to games; we will remove them later. Let's do the same thing to several other similar columns.

unique(unlist(strsplit(as.character(steam_raw$categories), ";")))
#>  [1] "Multi-player"               "Online Multi-Player"       
#>  [3] "Local Multi-Player"         "Valve Anti-Cheat enabled"  
#>  [5] "Single-player"              "Steam Cloud"               
#>  [7] "Steam Achievements"         "Steam Trading Cards"       
#>  [9] "Captions available"         "Partial Controller Support"
#> [11] "Includes Source SDK"        "Cross-Platform Multiplayer"
#> [13] "Stats"                      "Commentary available"      
#> [15] "Includes level editor"      "Steam Workshop"            
#> [17] "In-App Purchases"           "Co-op"                     
#> [19] "Full controller support"    "Steam Leaderboards"        
#> [21] "SteamVR Collectibles"       "Online Co-op"              
#> [23] "Shared/Split Screen"        "Local Co-op"               
#> [25] "MMO"                        "VR Support"                
#> [27] "Mods"                       "Mods (require HL2)"        
#> [29] "Steam Turn Notifications"
writeLines("\n")
length(unique(unlist(strsplit(as.character(steam_raw$steamspy_tags), ";"))))
#> [1] 339
writeLines("\n")
unique(unlist(strsplit(as.character(steam_raw$platforms), ";")))
#> [1] "windows" "mac"     "linux"
writeLines("\n")
length(unique(unlist(strsplit(as.character(steam_raw$developer), ";"))))
#> [1] 17953
writeLines("\n")
length(unique(unlist(strsplit(as.character(steam_raw$publisher), ";"))))
#> [1] 14352

Some columns have more than 100 unique values, so only the count is displayed. The columns with hundreds or thousands of unique values cannot be used as categories for modeling decision trees or random forests, so they will be excluded as well. We will now check the unique values of the owners column.

unique(steam_raw$owners)
#>  [1] "10000000-20000000"   "5000000-10000000"    "2000000-5000000"    
#>  [4] "20000000-50000000"   "100000000-200000000" "50000000-100000000" 
#>  [7] "20000-50000"         "500000-1000000"      "100000-200000"      
#> [10] "50000-100000"        "1000000-2000000"     "200000-500000"      
#> [13] "0-20000"

The values in the owners column are hard to work with because they are character strings in the form of ranges instead of single numbers. This column will be translated into a popularity level, where higher levels correspond to higher owner ranges. This will make filtering easier.

Finally, for this EDA, we will check for missing values.

anyNA(steam_raw)
#> [1] FALSE

There are no missing values, so we will proceed to data cleaning.

5 Data cleaning

# drop the appid column, convert prices from GBP to USD, calculate the ratings ratio,
# extract the year from the release date, convert the owners range to a popularity level,
# and remove the Free to Play genre tag from games that have a nonzero price
steam <- steam_raw %>%
  select(-appid) %>%
  mutate(price = price * 1.4, # approximate GBP-to-USD exchange rate
         release_date = ymd(release_date),
         year = year(release_date),
         ratings_ratio = ifelse(negative_ratings != 0,
                                positive_ratings / negative_ratings,
                                positive_ratings),
         popularity_level = case_when(owners == "0-20000" ~ 1,
                                      owners == "20000-50000" ~ 2,
                                      owners == "50000-100000" ~ 3,
                                      owners == "100000-200000" ~ 4,
                                      owners == "200000-500000" ~ 5,
                                      owners == "500000-1000000" ~ 6,
                                      owners == "1000000-2000000" ~ 7,
                                      owners == "2000000-5000000" ~ 8,
                                      owners == "5000000-10000000" ~ 9,
                                      owners == "10000000-20000000" ~ 10,
                                      owners == "20000000-50000000" ~ 11,
                                      owners == "50000000-100000000" ~ 12,
                                      owners == "100000000-200000000" ~ 13),
         genres = ifelse(price > 0 & str_detect(genres, "Free to Play"),
                         str_remove(genres, ";Free to Play|Free to Play;"),
                         genres))

# convert all chr columns to factors
# convert english column to factor.
steam <- steam %>% mutate_if(is.character,as.factor) %>% mutate(english = as.factor(english))


# separate the genres for each game
steam_genres <-  steam %>% separate_rows(genres,sep=";") %>% mutate(genres = as.factor(genres))

# remove non game genres
steam_genres <- steam_genres %>% filter(!genres %in% c("Animation & Modeling", "Video Production", "Utilities", "Design & Illustration", "Web Publishing", "Education", "Software Training", "Audio Production", "Game Development", "Photo Editing", "Accounting", "Documentary","Tutorial" )) %>% droplevels()

# combine again the genres for unique games only data frame
steam_games <- steam_genres %>% group_by(name) %>% mutate(genres = as.factor(str_c(genres,collapse = ";"))) %>% ungroup() %>%  distinct(name, .keep_all = TRUE)

Now that we have a games-only dataset, we want to label games with a popularity level above 1 as “yes” in a new popular column and games with popularity level 1 as “no”. After adding the new column, we will drop the owners and popularity_level columns, as the popular column was derived from them.

steam_popular <- steam_games %>% 
  mutate(popular = as.factor(if_else(popularity_level > 1,"yes","no"))) %>% 
  select(-c(owners, popularity_level))

Now we will separate the words in the multiword columns into 3 separate columns to make working with the categories easier and faster. Some columns will have NA values because the observation in a particular row has fewer than 3 words; the NAs will be replaced with the string “None”. We will also drop the factor columns that have more than 30 unique values, drop the release_date column, and convert the year column to numeric type.

steam_popular2 <- steam_popular %>%
  separate(col = steamspy_tags, sep = ";", into = c("steamspy_tags1","steamspy_tags2","steamspy_tags3"), convert = F) %>% 
  separate(col = genres, sep = ";", into = c("genres1","genres2","genres3"), convert = F) %>% 
  separate(col = categories, sep = ";", into = c("categories1","categories2","categories3"), convert = F) %>% 
  separate(col = platforms, sep = ";", into = c("platforms1","platforms2","platforms3"), convert = F)
steam_popular2[is.na(steam_popular2)] <- "None"
steam_popular2 <- steam_popular2 %>% select(-c(release_date,publisher,developer,name,steamspy_tags1,steamspy_tags2,steamspy_tags3)) %>% mutate_if(is.character,as.factor) %>% mutate(year = as.numeric(year))
dim(steam_popular2)
#> [1] 26877    20

The steam_popular2 data frame now has 26,877 observations and 20 columns.

What about the unique values of the factor columns?

steam_popular2 %>% select_if(is.factor) %>% summarise_all(n_distinct)

The factor columns now have fewer than 30 levels each.
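We can also assert this programmatically; a quick sketch counting the levels of every factor column:

# verify that every factor column has fewer than 30 levels
all(sapply(steam_popular2 %>% select_if(is.factor), nlevels) < 30)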

6 Cross-validation and downsampling

Let us check the proportion of the popular classes.

prop.table(table(steam_popular2$popular))*100
#> 
#>       no      yes 
#> 68.65722 31.34278

About 69% of the games are in the no category of the popular column and only 31% in the yes category. The distribution is quite unbalanced, so we will apply downsampling to balance the classes. But before that, we will take 8% of the data as our train-test set for faster training with decision trees and random forests.

RNGkind(sample.kind = "Rounding")
set.seed(100)
# take a stratified 8% sample of the data
intrain <- initial_split(steam_popular2, prop = 0.08, strata = "popular")
steam_trainval <- training(intrain)
nrow(steam_trainval)
#> [1] 2151
prop.table(table(steam_trainval$popular))*100
#> 
#>       no      yes 
#> 68.66574 31.33426
RNGkind(sample.kind = "Rounding")
set.seed(100)
# now separate again into 80% train 20% test
intrain <- initial_split(steam_trainval, prop = 0.8, strata = "popular")
trainset <- training(intrain)
testset <- testing(intrain)
prop.table(table(steam_trainval$popular))*100
#> 
#>       no      yes 
#> 68.66574 31.33426
# downsampling the trainset for balanced classes
trainset_down <- downSample(x = trainset %>% select(-popular),
                               y = trainset$popular,
                               yname = "popular")

prop.table(table(trainset_down$popular))
#> 
#>  no yes 
#> 0.5 0.5

Now the proportions of the labels in the popular column of the training dataset are balanced and we can proceed with model building.

7 Model 1 - Naive Bayes

The first model we will build is a naive Bayes classifier. This model predicts the classification of the popular label of a game using Bayesian probability, under the assumption that the predictors are conditionally independent.
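As a refresher (a standard formulation, not specific to this dataset), naive Bayes scores each class by combining the class prior with the per-predictor conditional probabilities, where $x_1, \dots, x_n$ are the predictor values of a game:

$$P(\text{popular} \mid x_1, \dots, x_n) \propto P(\text{popular}) \prod_{i=1}^{n} P(x_i \mid \text{popular})$$

The laplace = 1 argument below adds one pseudo-count per category, so a predictor level never seen in the training set does not zero out the whole product.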

library(e1071)
model_nb <- naiveBayes(x = trainset_down %>% select(-popular),
                       y = trainset_down$popular,
                       laplace = 1)
pred_nb <- predict(model_nb, newdata = testset, type = "class")
confusionMatrix(pred_nb, reference = testset$popular, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  283  47
#>        yes  12  87
#>                                                
#>                Accuracy : 0.8625               
#>                  95% CI : (0.8262, 0.8936)     
#>     No Information Rate : 0.6876               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6553               
#>                                                
#>  Mcnemar's Test P-Value : 0.000009581          
#>                                                
#>             Sensitivity : 0.6493               
#>             Specificity : 0.9593               
#>          Pos Pred Value : 0.8788               
#>          Neg Pred Value : 0.8576               
#>              Prevalence : 0.3124               
#>          Detection Rate : 0.2028               
#>    Detection Prevalence : 0.2308               
#>       Balanced Accuracy : 0.8043               
#>                                                
#>        'Positive' Class : yes                  
#> 

The performance metric we will use is precision (pos pred value). This keeps false positives low: games that are classified as popular but actually are not. We will still pay attention to accuracy and recall too. With naive Bayes, we see that the metrics are:

  • Accuracy: 86.25%
  • Recall: 64.93%
  • Precision: 87.88%
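As a quick sanity check, precision can be read straight off the confusion matrix above, with TP = 87 (predicted yes, actually yes) and FP = 12 (predicted yes, actually no):

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{87}{87 + 12} \approx 0.8788$$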

For naive Bayes we can also calculate and visualise the Receiver Operating Characteristic (ROC) curve and the area under it (AUC) to check the goodness of the model.

popular_predProb <- predict(model_nb, newdata = testset, type = "raw")


# create the prediction object
popular_roc <- prediction(predictions = popular_predProb[, 2], # probability of the positive class
                          labels = as.numeric(testset$popular == "yes")) # positive class labels

# create a performance object from the prediction object
perf <- performance(prediction.obj = popular_roc,
                    measure = "tpr", # tpr = true positive rate
                    x.measure = "fpr") # fpr = false positive rate

# create the plot
plot(perf)
abline(0, 1, lty = 2) # diagonal reference line = the curve of a model with no skill (for comparison)
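The plot gives a visual impression; to get the AUC as a single number we can reuse the same prediction object (a small addition using ROCR's "auc" measure):

# numeric area under the ROC curve
auc <- performance(prediction.obj = popular_roc, measure = "auc")
auc@y.values[[1]]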

According to the ROC curve and the AUC, the naive Bayes model is pretty good.

8 Model 2 - Decision Tree

For our second model, we will build a decision tree. This model selects the predictors and their split values based on how much they decrease the disorder in the label. This model is very useful when we need a rule-based model for making decisions.
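As background, classical tree algorithms quantify this disorder with entropy: for a node whose classes occur with proportions $p_i$,

$$H = -\sum_{i} p_i \log_2 p_i$$

and a split is good when the child nodes have much lower entropy than the parent. (The ctree function used below replaces the impurity criterion with permutation-test statistics, but the intuition is the same.)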

set.seed(100)
# fit a conditional inference tree with default control parameters
model_dt <- ctree(popular ~ ., data = trainset_down)
pred_dt <- predict(model_dt, newdata = testset, type = "response")
confusionMatrix(pred_dt, reference = testset$popular, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  201   9
#>        yes  94 125
#>                                                
#>                Accuracy : 0.7599               
#>                  95% CI : (0.7166, 0.7996)     
#>     No Information Rate : 0.6876               
#>     P-Value [Acc > NIR] : 0.0005838            
#>                                                
#>                   Kappa : 0.5236               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9328               
#>             Specificity : 0.6814               
#>          Pos Pred Value : 0.5708               
#>          Neg Pred Value : 0.9571               
#>              Prevalence : 0.3124               
#>          Detection Rate : 0.2914               
#>    Detection Prevalence : 0.5105               
#>       Balanced Accuracy : 0.8071               
#>                                                
#>        'Positive' Class : yes                  
#> 

The performance metrics are:

  • Accuracy: 75.99%
  • Recall: 93.28%
  • Precision: 57.08%

9 Model 3 - Random Forest

The final model will be a random forest. This model is an ensemble of decision trees, where each tree is trained on a different bootstrap sample and considers a random subset of predictors at each split.
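Under the hood, caret's method = "rf" delegates to the randomForest package. For intuition, here is a minimal standalone sketch of the same ensemble without caret's resampling wrapper (assuming the randomForest package is installed; mtry is left at its default of roughly the square root of the number of predictors):

library(randomForest)
# 19 trees, each grown on a bootstrap sample of the downsampled training set
rf_sketch <- randomForest(popular ~ ., data = trainset_down, ntree = 19)
rf_sketch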

set.seed(100)

ctrl <- trainControl(method = "repeatedcv",
                     number = 2, # k-fold cross-validation folds
                     repeats = 1) # repetitions

model_rf <- train(popular ~ .,
                  data = trainset_down,
                  method = "rf", # random forest
                  trControl = ctrl,
                  ntree = 19)

saveRDS(model_rf, "model_rf_update.RDS") # save the model
pred_rf <- predict(model_rf, newdata = testset, type = "raw")
confusionMatrix(pred_rf, reference = testset$popular, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  259  10
#>        yes  36 124
#>                                                
#>                Accuracy : 0.8928               
#>                  95% CI : (0.8596, 0.9204)     
#>     No Information Rate : 0.6876               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7629               
#>                                                
#>  Mcnemar's Test P-Value : 0.0002278            
#>                                                
#>             Sensitivity : 0.9254               
#>             Specificity : 0.8780               
#>          Pos Pred Value : 0.7750               
#>          Neg Pred Value : 0.9628               
#>              Prevalence : 0.3124               
#>          Detection Rate : 0.2890               
#>    Detection Prevalence : 0.3730               
#>       Balanced Accuracy : 0.9017               
#>                                                
#>        'Positive' Class : yes                  
#> 

The performance metrics are:

  • Accuracy: 89.28%
  • Recall: 92.54%
  • Precision: 77.50%

So, based on the precision metric, the best model is naive Bayes with a precision of approximately 88%. This model will rarely predict a non-popular title to be popular, but unfortunately it will more often misclassify popular games as unpopular. Based on accuracy, the best model is the random forest, and based on recall the best model is the decision tree.

Let us try to improve the performance of the decision tree and random forest before drawing the final conclusion.

set.seed(100)
# loosen the tree's stopping criteria so it can grow deeper
model_dt_v2 <- ctree(popular ~ ., data = trainset_down,
                     control = ctree_control(mincriterion = 0.005, # much weaker significance requirement for a split
                                             minsplit = 10, # minimum node size to attempt a split
                                             minbucket = 5)) # minimum observations in a terminal node

pred_dt_v2 <- predict(model_dt_v2,newdata = testset, type = "response")
confusionMatrix(pred_dt_v2, reference = testset$popular, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  259  15
#>        yes  36 119
#>                                                
#>                Accuracy : 0.8811               
#>                  95% CI : (0.8467, 0.9102)     
#>     No Information Rate : 0.6876               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7346               
#>                                                
#>  Mcnemar's Test P-Value : 0.005101             
#>                                                
#>             Sensitivity : 0.8881               
#>             Specificity : 0.8780               
#>          Pos Pred Value : 0.7677               
#>          Neg Pred Value : 0.9453               
#>              Prevalence : 0.3124               
#>          Detection Rate : 0.2774               
#>    Detection Prevalence : 0.3613               
#>       Balanced Accuracy : 0.8830               
#>                                                
#>        'Positive' Class : yes                  
#> 

With the new parameters the metrics become:

  • Accuracy: 88.11%
  • Recall: 88.81%
  • Precision: 76.77%

set.seed(100)

ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k-fold cross-validation folds
                     repeats = 3) # repetitions

model_rf_v2 <- train(popular ~ .,
                     data = trainset_down,
                     method = "rf", # random forest
                     trControl = ctrl,
                     ntree = 100)

saveRDS(model_rf_v2, "model_rf_v2_update.RDS") # save the tuned model

pred_rf_v2 <- predict(model_rf_v2, newdata = testset, type = "raw")
confusionMatrix(pred_rf_v2, reference = testset$popular, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  no yes
#>        no  260  10
#>        yes  35 124
#>                                                
#>                Accuracy : 0.8951               
#>                  95% CI : (0.8622, 0.9225)     
#>     No Information Rate : 0.6876               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7676               
#>                                                
#>  Mcnemar's Test P-Value : 0.0003466            
#>                                                
#>             Sensitivity : 0.9254               
#>             Specificity : 0.8814               
#>          Pos Pred Value : 0.7799               
#>          Neg Pred Value : 0.9630               
#>              Prevalence : 0.3124               
#>          Detection Rate : 0.2890               
#>    Detection Prevalence : 0.3706               
#>       Balanced Accuracy : 0.9034               
#>                                                
#>        'Positive' Class : yes                  
#> 

With the new parameters the metrics become:

  • Accuracy: 89.51%
  • Recall: 92.54%
  • Precision: 77.99%

After this attempt at improvement, the naive Bayes classifier still outperforms the decision tree and random forest in terms of precision. However, the precision of the decision tree did improve from 57.08% to 76.77%, while for the random forest the improvement in precision is minuscule, from 77.50% to 77.99%.
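To keep the comparison in one place, here is a small convenience sketch (not part of the analysis above) that pulls the three headline metrics out of caret's confusionMatrix objects:

# extract accuracy, recall, and precision from a confusionMatrix object
metrics <- function(cm) {
  c(accuracy  = unname(cm$overall["Accuracy"]),
    recall    = unname(cm$byClass["Sensitivity"]),
    precision = unname(cm$byClass["Pos Pred Value"]))
}

rbind(naive_bayes = metrics(confusionMatrix(pred_nb, reference = testset$popular, positive = "yes")),
      dtree_v2    = metrics(confusionMatrix(pred_dt_v2, reference = testset$popular, positive = "yes")),
      forest_v2   = metrics(confusionMatrix(pred_rf_v2, reference = testset$popular, positive = "yes")))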

With the decision tree, we can plot the tree in order to see the rules used to make predictions.

plot(model_dt, type = "simple")

plot(model_dt_v2, type = "simple")

Comparing decision tree versions 1 and 2, we can see that the first predictor used for splitting is year: whether the game was released in 2016 or earlier, or after 2016.

What about the random forests?

plot(varImp(model_rf))

plot(varImp(model_rf_v2))

In both versions, the most important predictor is negative_ratings.
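To read the same ranking as numbers rather than a plot, the scores behind varImp can be inspected directly (a quick sketch, assuming the single Overall importance column that caret produces for rf models):

# numeric importance scores behind the plot above, sorted descending
imp <- varImp(model_rf_v2)$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)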

10 Conclusion

We have tried to predict whether a game title is popular based on data from the Steam store, where a game counts as popular if it is owned by more than 20,000 people. Three models were used: naive Bayes, decision tree, and random forest. We decided that it is more important to keep false positives low, so we chose precision as the main performance metric, though we also looked at accuracy and recall as secondary metrics.

Naive Bayes has the highest precision of the three models. Even after we fine-tuned the decision tree and the random forest, the precision of naive Bayes was still higher. As a bonus we also plotted the ROC curve of the naive Bayes model, and the visualisation shows that the model is quite good, as the curve stays close to the top-left corner. We also plotted the tree diagrams of the decision trees and the variable importance diagrams of the random forests, which gave us the insight that for the decision tree the first predictor used is year, and for the random forest the most important variable is negative_ratings.

In theory, decision trees and random forests should perform better than naive Bayes, so there is still room for improvement for the tree-based models. Or maybe, in this case, naive Bayes is indeed the best model.