Background: Our development team is creating a simple video game for its next project. The business and accounting units have determined that, to be profitable, the game needs to clear at least $180,000 in global sales, which is easier said than done in today's saturated gaming market. The business teams will use a decision tree and a Naive Bayes classifier to identify which genres, platforms, and publishers are associated with at least $180,000 in sales.

1. Load & View Sample of Data:

vgames <- read.csv("Video Game Sales.csv", stringsAsFactors = TRUE)  # R >= 4.0 no longer converts strings to factors by default
head(vgames)  
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26

Initial Exploration with Summary, Histogram & Plots

summary(vgames)  
##       Rank                                Name          Platform   
##  Min.   :    1   Need for Speed: Most Wanted:   12   DS     :2163  
##  1st Qu.: 4151   FIFA 14                    :    9   PS2    :2161  
##  Median : 8300   LEGO Marvel Super Heroes   :    9   PS3    :1329  
##  Mean   : 8301   Madden NFL 07              :    9   Wii    :1325  
##  3rd Qu.:12450   Ratatouille                :    9   X360   :1265  
##  Max.   :16600   Angry Birds Star Wars      :    8   PSP    :1213  
##                  (Other)                    :16542   (Other):7142  
##       Year               Genre                             Publisher    
##  2009   :1431   Action      :3316   Electronic Arts             : 1351  
##  2008   :1428   Sports      :2346   Activision                  :  975  
##  2010   :1259   Misc        :1739   Namco Bandai Games          :  932  
##  2007   :1202   Role-Playing:1488   Ubisoft                     :  921  
##  2011   :1139   Shooter     :1310   Konami Digital Entertainment:  832  
##  2006   :1008   Adventure   :1286   THQ                         :  715  
##  (Other):9131   (Other)     :5113   (Other)                     :10872  
##     NA_Sales          EU_Sales          JP_Sales         Other_Sales      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000  
##  Median : 0.0800   Median : 0.0200   Median : 0.00000   Median : 0.01000  
##  Mean   : 0.2647   Mean   : 0.1467   Mean   : 0.07778   Mean   : 0.04806  
##  3rd Qu.: 0.2400   3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000  
##  Max.   :41.4900   Max.   :29.0200   Max.   :10.22000   Max.   :10.57000  
##                                                                           
##   Global_Sales    
##  Min.   : 0.0100  
##  1st Qu.: 0.0600  
##  Median : 0.1700  
##  Mean   : 0.5374  
##  3rd Qu.: 0.4700  
##  Max.   :82.7400  
## 
hist(vgames$Global_Sales)  

plot(vgames$Genre, vgames$Global_Sales, main="Genre vs. Global Sales")   

plot(vgames$Platform, vgames$Global_Sales, main="Video Game Platform vs. Global Sales")    

Decision Tree

1. Add a binary variable, 'salescat', that splits games by Global Sales: less than $180,000 (0) versus at least $180,000 (1). The code uses a cutoff of 0.18 because Global_Sales is recorded in millions. This will be the outcome of interest, i.e. the predicted class when we build our decision tree.

Create the new variable:

vgames$salescat[vgames$Global_Sales< 0.18] <- "0"  
vgames$salescat[vgames$Global_Sales>=0.18] <- "1"  

Make this categorical variable 'salescat' a factor:

vgames$salescat <- factor(vgames$salescat)  

Ensure "factor" designation was successful by checking levels:

levels(vgames$salescat)  
## [1] "0" "1"
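
Before fitting a classifier it can help to check how balanced the two classes are. The sketch below is a minimal illustration on a small made-up vector of sales figures (the values and the name `sales` are hypothetical, not taken from the dataset). It builds the same 0/1 factor in one step with `ifelse()` and tabulates the class counts and shares.

```r
# Hypothetical sales figures in millions (illustration only, not real data)
sales <- c(0.05, 0.17, 0.18, 0.47, 2.30, 0.01)

# Same rule as above: below 0.18 -> "0", otherwise "1"
salescat <- factor(ifelse(sales < 0.18, "0", "1"), levels = c("0", "1"))

# Counts and proportions per class
classCounts <- table(salescat)
classShares <- prop.table(classCounts)
print(classCounts)
print(classShares)
```

On the real data, the same two calls applied to `vgames$salescat` would show whether one class dominates.
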

Ensure the other categorical variables are factors. For example:

class(vgames$Platform)  
## [1] "factor"
levels(vgames$Platform)  
##  [1] "2600" "3DO"  "3DS"  "DC"   "DS"   "GB"   "GBA"  "GC"   "GEN"  "GG"  
## [11] "N64"  "NES"  "NG"   "PC"   "PCFX" "PS"   "PS2"  "PS3"  "PS4"  "PSP" 
## [21] "PSV"  "SAT"  "SCD"  "SNES" "TG16" "Wii"  "WiiU" "WS"   "X360" "XB"  
## [31] "XOne"

Training Data:

Our dataset has 16,598 rows in total. We chose 90% of the dataset for training (14,938 rows).
Select 14,938 (90%) rows at random for training:

trainRows <- sample(nrow(vgames), 14938)  
train <- vgames[trainRows, ]  

Selection of testing data:

Use the remaining 1,660 (10%) rows for testing, so that no test row was also used for training:

test <- vgames[-trainRows, ]  
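
Because `sample()` draws different rows on every run, fixing the random seed makes a split reproducible. The following is a minimal sketch on a stand-in data frame (`toy`, the seed value 42, and the 90/10 sizes are assumptions for illustration). It also checks that the training and test rows do not overlap.

```r
set.seed(42)                                   # arbitrary seed, assumed for the example
toy <- data.frame(id = 1:100, x = rnorm(100))  # stand-in data, not the sales file

nTrain    <- round(0.9 * nrow(toy))            # 90% of the rows
trainRows <- sample(nrow(toy), nTrain)         # row indices drawn without replacement
trainToy  <- toy[trainRows, ]
testToy   <- toy[-trainRows, ]                 # every row not drawn for training

# The two sets should be disjoint and together cover all rows
overlap <- length(intersect(trainToy$id, testToy$id))
print(overlap)
print(nrow(trainToy) + nrow(testToy) == nrow(toy))
```

Drawing the test set with a second, independent `sample()` call would let some rows land in both sets, which tends to make test results look better than they are.
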

Classification with a Decision Tree:

library("rpart")  
library("rpart.plot")  
dtGames <- rpart(salescat ~ Platform + Genre,
            method="class",
            data=train, parms=list(split='information'), 
            minsplit=5, cp=0.01)  
rpart.plot(dtGames, type=4, extra=1)  

Predict test data:

predProbs <- predict(dtGames, test)  
predProbs[1:5,]  

This code predicts the probability of each class of the binary outcome variable (0 = less than $180,000 | 1 = greater than or equal to $180,000) for every row of the test data set (test), which is a random sample of 10% of the original data set (vgames).

The full list of predictions is quite long, so we looked at only the first 5 predictions. It's best to spot-check individual rows of the test data to see whether the prediction is correct. Looking at the first two rows of the test data set and comparing them with the original data set, we see that both are classified correctly by the decision tree.

predict(dtGames, test[1,])  
##              0        1
## 14088 0.651268 0.348732
predict(dtGames, test[2,])  
##               0         1
## 14979 0.3811659 0.6188341
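
The probabilities above can be turned into class labels by picking the more likely class for each row, and the labels can then be compared with the actual outcome to estimate accuracy. The sketch below assumes a small hand-made probability matrix and a vector of true classes (`probs` and `actual` are hypothetical values, not output from `dtGames`).

```r
# Hypothetical class probabilities in the same layout as the predict() output
probs <- matrix(c(0.65, 0.35,
                  0.38, 0.62,
                  0.20, 0.80,
                  0.55, 0.45),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("0", "1")))
actual <- factor(c("0", "1", "0", "0"), levels = c("0", "1"))  # made-up true classes

# Pick the column with the highest probability in each row
predicted <- factor(colnames(probs)[apply(probs, 1, which.max)],
                    levels = c("0", "1"))

# Confusion matrix and overall accuracy
confusion <- table(actual = actual, predicted = predicted)
accuracy  <- mean(predicted == actual)
print(confusion)
print(accuracy)
```

With the fitted tree, `predict(dtGames, test, type = "class")` returns the labels directly, and they can be compared with `test$salescat` in the same way.
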

Naive Bayes

In addition to a decision tree, our business analytics team wanted to use a Naive Bayes classifier.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
library(e1071)

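1. Reload the Data, Rebuild the Outcome Variable & Fit the Model:

vgames <- read.csv("Video Game Sales.csv", stringsAsFactors = TRUE)  # R >= 4.0 no longer converts strings to factors by default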
class(vgames$Platform)  
## [1] "factor"
levels(vgames$Platform)  
##  [1] "2600" "3DO"  "3DS"  "DC"   "DS"   "GB"   "GBA"  "GC"   "GEN"  "GG"  
## [11] "N64"  "NES"  "NG"   "PC"   "PCFX" "PS"   "PS2"  "PS3"  "PS4"  "PSP" 
## [21] "PSV"  "SAT"  "SCD"  "SNES" "TG16" "Wii"  "WiiU" "WS"   "X360" "XB"  
## [31] "XOne"
vgames$salescat[vgames$Global_Sales< 0.18] <- "0"  
vgames$salescat[vgames$Global_Sales>=0.18] <- "1" 
vgames$salescat <- factor(vgames$salescat,levels=c(0,1), labels=c("No","Yes"))
trainRows <- sample(nrow(vgames), 14938)  # 90% of the rows for training
train <- vgames[trainRows, ]  
test <- vgames[-trainRows, ]  # remaining 1,660 rows, none shared with train
nbCat <- naiveBayes(salescat ~ Platform + Genre, train)
nbCat
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##        No       Yes 
## 0.5073638 0.4926362 
## 
## Conditional probabilities:
##      Platform
## Y             2600          3DO          3DS           DC           DS
##   No  0.0013194353 0.0003958306 0.0362844702 0.0036944188 0.1647974667
##   Yes 0.0152194592 0.0000000000 0.0248675092 0.0027177606 0.0902296508
##      Platform
## Y               GB          GBA           GC          GEN           GG
##   No  0.0011874918 0.0510621454 0.0366803008 0.0015833223 0.0000000000
##   Yes 0.0101916021 0.0487838021 0.0301671423 0.0016306563 0.0000000000
##      Platform
## Y              N64          NES           NG           PC         PCFX
##   No  0.0127985222 0.0002638871 0.0010555482 0.0868188415 0.0001319435
##   Yes 0.0263622775 0.0115504824 0.0005435521 0.0298953662 0.0000000000
##      Platform
## Y               PS          PS2          PS3          PS4          PSP
##   No  0.0577912653 0.1075339755 0.0604301359 0.0183401504 0.0997493073
##   Yes 0.0894143226 0.1523304797 0.1035466775 0.0225574127 0.0456583775
##      Platform
## Y              PSV          SAT          SCD         SNES         TG16
##   No  0.0370761314 0.0131943528 0.0005277741 0.0088402164 0.0002638871
##   Yes 0.0116863704 0.0073379535 0.0001358880 0.0203832042 0.0000000000
##      Platform
## Y              Wii         WiiU           WS         X360           XB
##   No  0.0719092229 0.0072568940 0.0002638871 0.0534371289 0.0550204512
##   Yes 0.0873760022 0.0099198261 0.0004076641 0.0981111564 0.0438918331
##      Platform
## Y             XOne
##   No  0.0102915952
##   Yes 0.0150835711
## 
##      Genre
## Y         Action  Adventure   Fighting       Misc   Platform     Puzzle
##   No  0.18485288 0.11795751 0.04670801 0.10964507 0.04354136 0.04512469
##   Yes 0.21524664 0.03601033 0.05503465 0.09906237 0.06509037 0.02486751
##      Genre
## Y         Racing Role-Playing    Shooter Simulation     Sports   Strategy
##   No  0.07098562   0.08615912 0.06834675 0.05330519 0.12033250 0.05304130
##   Yes 0.07963038   0.09253975 0.08955021 0.05082212 0.16401685 0.02812882
predict(nbCat, test[1,])
## [1] Yes
## Levels: No Yes
predict(nbCat, test[2,])
## [1] Yes
## Levels: No Yes
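
The printed model can be checked by hand: Naive Bayes multiplies the a-priori probability of each class by the conditional probability of every observed predictor value and then normalises. The sketch below does this for a game on the Wii in the Sports genre, copying the numbers from the tables above. The choice of this combination is only an example, and the figures depend on the random training sample.

```r
# Values copied from the printed nbCat output above
prior   <- c(No = 0.5073638,    Yes = 0.4926362)     # a-priori probabilities
pWii    <- c(No = 0.0719092229, Yes = 0.0873760022)  # P(Platform = Wii | class)
pSports <- c(No = 0.12033250,   Yes = 0.16401685)    # P(Genre = Sports | class)

# Unnormalised score per class: prior times the product of the conditionals
score <- prior * pWii * pSports

# Normalise so the two posteriors sum to 1
posterior <- score / sum(score)
print(round(posterior, 3))

# The predicted class is the one with the larger posterior
predictedClass <- names(which.max(posterior))
print(predictedClass)
```

For this combination the "Yes" class comes out ahead, at roughly 62%. With the fitted model, `predict(nbCat, newdata, type = "raw")` reports the posterior probabilities instead of only the winning label.
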