vgames <- read.csv("Video Game Sales.csv")
head(vgames)
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26
summary(vgames)
## Rank Name Platform
## Min. : 1 Need for Speed: Most Wanted: 12 DS :2163
## 1st Qu.: 4151 FIFA 14 : 9 PS2 :2161
## Median : 8300 LEGO Marvel Super Heroes : 9 PS3 :1329
## Mean : 8301 Madden NFL 07 : 9 Wii :1325
## 3rd Qu.:12450 Ratatouille : 9 X360 :1265
## Max. :16600 Angry Birds Star Wars : 8 PSP :1213
## (Other) :16542 (Other):7142
## Year Genre Publisher
## 2009 :1431 Action :3316 Electronic Arts : 1351
## 2008 :1428 Sports :2346 Activision : 975
## 2010 :1259 Misc :1739 Namco Bandai Games : 932
## 2007 :1202 Role-Playing:1488 Ubisoft : 921
## 2011 :1139 Shooter :1310 Konami Digital Entertainment: 832
## 2006 :1008 Adventure :1286 THQ : 715
## (Other):9131 (Other) :5113 (Other) :10872
## NA_Sales EU_Sales JP_Sales Other_Sales
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000
## Median : 0.0800 Median : 0.0200 Median : 0.00000 Median : 0.01000
## Mean : 0.2647 Mean : 0.1467 Mean : 0.07778 Mean : 0.04806
## 3rd Qu.: 0.2400 3rd Qu.: 0.1100 3rd Qu.: 0.04000 3rd Qu.: 0.04000
## Max. :41.4900 Max. :29.0200 Max. :10.22000 Max. :10.57000
##
## Global_Sales
## Min. : 0.0100
## 1st Qu.: 0.0600
## Median : 0.1700
## Mean : 0.5374
## 3rd Qu.: 0.4700
## Max. :82.7400
##
hist(vgames$Global_Sales)
plot(vgames$Genre, vgames$Global_Sales, main="Genre vs. Global Sales")
plot(vgames$Platform, vgames$Global_Sales, main="Video Game Platform vs. Global Sales")
Create a binary sales category (salescat) from Global_Sales, then view samples of the data to confirm that the new variable exists.
vgames$salescat[vgames$Global_Sales< 0.18] <- "0"
vgames$salescat[vgames$Global_Sales>=0.18] <- "1"
vgames$salescat <- factor(vgames$salescat)
Ensure "factor" designation was successful by checking levels:
levels(vgames$salescat)
## [1] "0" "1"
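One way to view samples of the data and confirm the new variable is sketched below on a small made-up data frame. The name `toy` and its values are illustrative assumptions, not rows from the dataset; the 0.18 cutoff is the one used above.

```r
# Hypothetical miniature stand-in for vgames; the values are made up.
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Global_Sales = c(0.05, 0.18, 0.17, 2.40)
)

# Same 0.18 cutoff as above, written as a single vectorised step.
toy$salescat <- factor(ifelse(toy$Global_Sales < 0.18, "0", "1"))

# Inspect a few rows and count how many games fall in each category.
head(toy)
table(toy$salescat)
```

On the real data, the same `head()` and `table()` calls applied to vgames would show the new column and the size of each category.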
For example, Platform is also stored as a factor:
class(vgames$Platform)
## [1] "factor"
levels(vgames$Platform)
## [1] "2600" "3DO" "3DS" "DC" "DS" "GB" "GBA" "GC" "GEN" "GG"
## [11] "N64" "NES" "NG" "PC" "PCFX" "PS" "PS2" "PS3" "PS4" "PSP"
## [21] "PSV" "SAT" "SCD" "SNES" "TG16" "Wii" "WiiU" "WS" "X360" "XB"
## [31] "XOne"
Our dataset has 16,598 rows in total. We chose 90% of the dataset for training (14,938 rows).
Select 14,938 (90%) rows at random for training:
train_idx <- sample(nrow(vgames), 14938)
train <- vgames[train_idx, ]
Use the remaining 1,660 (10%) rows for testing, so that no row appears in both sets:
test <- vgames[-train_idx, ]
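Sampling the row indices once and using their complement for testing keeps the two sets disjoint. A minimal sketch on a made-up 100-row data frame follows; `toy`, `toy_train`, `toy_test`, and the seed are assumptions added for illustration and reproducibility.

```r
set.seed(42)  # arbitrary seed, an assumption added for reproducibility

# Hypothetical stand-in with 100 rows; the real data has 16,598.
toy <- data.frame(id = 1:100, Global_Sales = runif(100))

n_train   <- floor(0.9 * nrow(toy))        # 90 rows for training
train_idx <- sample(nrow(toy), n_train)    # row indices, drawn without replacement

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]             # the remaining 10 rows

# Number of rows that appear in both sets; a disjoint split gives 0.
length(intersect(toy_train$id, toy_test$id))
```

Two independent calls to `sample()` would instead let some rows land in both sets, so the test rows would not all be unseen by the model.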
library("rpart")
library("rpart.plot")
dtGames <- rpart(salescat ~ Platform + Genre,
                 method = "class",
                 data = train, parms = list(split = "information"),
                 minsplit = 5, cp = 0.01)
rpart.plot(dtGames, type=4, extra=1)
# Store the class probabilities under a name that does not mask predict()
pred <- predict(dtGames, test)
pred[1:5, ]
This code predicts the probability of each class of the binary outcome variable (0 = less than $180,000; 1 = greater than or equal to $180,000) for the test data set (test), which is a random sample of 10% of the original data set (vgames).
The full list of predictions is long, so we looked at only the first 5 predictions. It is best to check individual rows of the test data to see whether each prediction is correct. Comparing the predictions for the first two rows of the test data set with the corresponding rows of the original data set, we see that they are classified correctly according to the decision tree.
predict(dtGames, test[1,])
## 0 1
## 14088 0.651268 0.348732
predict(dtGames, test[2,])
## 0 1
## 14979 0.3811659 0.6188341
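Checking rows one at a time can be complemented by a summary over the whole test set. The sketch below turns a two-column probability matrix into class labels and computes a confusion matrix and accuracy. The objects `probs` and `actual` are made-up stand-ins in the same layout as the tree's output, not results from this analysis.

```r
# Hypothetical predicted probabilities in the same two-column layout
# that predict() returns for the tree above; the values are made up.
probs <- matrix(
  c(0.65, 0.35,
    0.38, 0.62,
    0.20, 0.80,
    0.70, 0.30),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("0", "1"))
)
# Hypothetical true categories for the same four rows.
actual <- factor(c("0", "1", "0", "0"), levels = c("0", "1"))

# Pick the column with the larger probability for each row.
best_col   <- max.col(probs, ties.method = "first")
pred_class <- factor(colnames(probs)[best_col], levels = c("0", "1"))

# Confusion matrix and overall accuracy.
conf_mat <- table(Predicted = pred_class, Actual = actual)
accuracy <- mean(pred_class == actual)
conf_mat
accuracy
```

With the real objects, `predict(dtGames, test, type = "class")` returns the class labels directly, which can then be compared with `test$salescat` in the same way.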
In addition to a decision tree, our business analytics team wanted to use a Naive Bayes classifier.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
library(e1071)
vgames <- read.csv("Video Game Sales.csv")
class(vgames$Platform)
## [1] "factor"
levels(vgames$Platform)
## [1] "2600" "3DO" "3DS" "DC" "DS" "GB" "GBA" "GC" "GEN" "GG"
## [11] "N64" "NES" "NG" "PC" "PCFX" "PS" "PS2" "PS3" "PS4" "PSP"
## [21] "PSV" "SAT" "SCD" "SNES" "TG16" "Wii" "WiiU" "WS" "X360" "XB"
## [31] "XOne"
vgames$salescat[vgames$Global_Sales< 0.18] <- "0"
vgames$salescat[vgames$Global_Sales>=0.18] <- "1"
vgames$salescat <- factor(vgames$salescat,levels=c(0,1), labels=c("No","Yes"))
train_idx <- sample(nrow(vgames), 14938)
train <- vgames[train_idx, ]
test <- vgames[-train_idx, ]
nbCat <- naiveBayes(salescat ~ Platform + Genre, train)
nbCat
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.5073638 0.4926362
##
## Conditional probabilities:
## Platform
## Y 2600 3DO 3DS DC DS
## No 0.0013194353 0.0003958306 0.0362844702 0.0036944188 0.1647974667
## Yes 0.0152194592 0.0000000000 0.0248675092 0.0027177606 0.0902296508
## Platform
## Y GB GBA GC GEN GG
## No 0.0011874918 0.0510621454 0.0366803008 0.0015833223 0.0000000000
## Yes 0.0101916021 0.0487838021 0.0301671423 0.0016306563 0.0000000000
## Platform
## Y N64 NES NG PC PCFX
## No 0.0127985222 0.0002638871 0.0010555482 0.0868188415 0.0001319435
## Yes 0.0263622775 0.0115504824 0.0005435521 0.0298953662 0.0000000000
## Platform
## Y PS PS2 PS3 PS4 PSP
## No 0.0577912653 0.1075339755 0.0604301359 0.0183401504 0.0997493073
## Yes 0.0894143226 0.1523304797 0.1035466775 0.0225574127 0.0456583775
## Platform
## Y PSV SAT SCD SNES TG16
## No 0.0370761314 0.0131943528 0.0005277741 0.0088402164 0.0002638871
## Yes 0.0116863704 0.0073379535 0.0001358880 0.0203832042 0.0000000000
## Platform
## Y Wii WiiU WS X360 XB
## No 0.0719092229 0.0072568940 0.0002638871 0.0534371289 0.0550204512
## Yes 0.0873760022 0.0099198261 0.0004076641 0.0981111564 0.0438918331
## Platform
## Y XOne
## No 0.0102915952
## Yes 0.0150835711
##
## Genre
## Y Action Adventure Fighting Misc Platform Puzzle
## No 0.18485288 0.11795751 0.04670801 0.10964507 0.04354136 0.04512469
## Yes 0.21524664 0.03601033 0.05503465 0.09906237 0.06509037 0.02486751
## Genre
## Y Racing Role-Playing Shooter Simulation Sports Strategy
## No 0.07098562 0.08615912 0.06834675 0.05330519 0.12033250 0.05304130
## Yes 0.07963038 0.09253975 0.08955021 0.05082212 0.16401685 0.02812882
predict(nbCat, test[1,])
## [1] Yes
## Levels: No Yes
predict(nbCat, test[2,])
## [1] Yes
## Levels: No Yes
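To see how the classifier arrives at a label, the calculation can be reproduced by hand from the printed tables. The sketch below scores a hypothetical Sports game on the PS2 (a combination chosen for illustration, not one of the test rows above) using the a-priori and conditional probabilities from the nbCat printout.

```r
# Values copied from the nbCat printout above.
prior    <- c(No = 0.5073638,    Yes = 0.4926362)
p_ps2    <- c(No = 0.1075339755, Yes = 0.1523304797)  # P(Platform = PS2 | class)
p_sports <- c(No = 0.12033250,   Yes = 0.16401685)    # P(Genre = Sports | class)

# Naive Bayes score: the prior times each conditional probability,
# treating Platform and Genre as independent given the class.
score <- prior * p_ps2 * p_sports

# Normalise so that the two scores sum to 1.
posterior <- score / sum(score)
posterior

# The predicted class is the one with the larger posterior.
names(which.max(posterior))
```

For this combination the "Yes" class wins with a posterior of roughly 0.65. On the fitted model, `predict(nbCat, test[1,], type = "raw")` returns the posterior probabilities instead of the class label.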