6101 final project

Introduction

Our project goal is to investigate video game sales regarding game platform, genre, developer, and review scores. In 2016, the computer and video game industry of the U.S. generated $30.4 billion in revenue, according to new data released by the Entertainment Software Association (Entertainment Software Association) . Thanks for the recent past with the revolution of computer science, video game has become a pervasive entertainment. Our research is interested in the potential factors that influence the sales volume of a game.

We started our research by working upon the data and conducting an exploratory data analysis. After summarizing the main characteristics and building visualizations, we developed our SMART question and designed a model to solve this question. Then, we chose a reasonable regression method with appropriate responses. Finally, a supervised learning method was applied to test our model.

Dataset Introduction

Our raw dataset contains sales information and review scores of 16719 games released from 1980 to 2016. Variables and definition is listed below:

Name: Name of the game. Platform: The hardware platform that a game was released on. One game could be published on multiple platforms. Popular platforms include Wii, X-Box, PS4. Year: of Release The year that a game come to the world. Genre: Categories of games. Examples are action, role playing, racing etc. Publisher: The company that did marketing and manufacturing of a game. Developer: The company that designed and developed the game. NA_Sales: Sales volume in North America. EU_Sales: Sales volume in Europe. JP_Sales: Sales volume in Japan. Other_Sales: Sales volume in other region. Global_Sales: Sum of sales volume in North America, Europe, Japan, and other region. Critic_Score: The score given by Metacritic employees Critic_Count: The number of contribution to the score User_Score: The score given by game players User_Count: The number of contribution to the score Rating: ESRB rating based on age(C-early childhood, E-everyone, T-teen)

Data Cleaning

We cleaned data without valid year of release and genres, which totally 275 observations. After that, there are still 16444 observations to analyze. We believe this cleaning is necessary and would not undermine our analysis.

## inport dataset

video_full<-read.csv("video.csv")
head(video_full)

##                       Name Platform Year_of_Release        Genre Publisher
## 1               Wii Sports      Wii            2006       Sports  Nintendo
## 2        Super Mario Bros.      NES            1985     Platform  Nintendo
## 3           Mario Kart Wii      Wii            2008       Racing  Nintendo
## 4        Wii Sports Resort      Wii            2009       Sports  Nintendo
## 5 Pokemon Red/Pokemon Blue       GB            1996 Role-Playing  Nintendo
## 6                   Tetris       GB            1989       Puzzle  Nintendo
##   NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## 1    41.36    28.96     3.77        8.45        82.53           76
## 2    29.08     3.58     6.81        0.77        40.24           NA
## 3    15.68    12.76     3.79        3.29        35.52           82
## 4    15.61    10.93     3.28        2.95        32.77           80
## 5    11.27     8.89    10.22        1.00        31.37           NA
## 6    23.20     2.26     4.22        0.58        30.26           NA
##   Critic_Count User_Score User_Count Developer Rating
## 1           51        8.0        322  Nintendo      E
## 2           NA         NA         NA                 
## 3           73        8.3        709  Nintendo      E
## 4           73        8.0        192  Nintendo      E
## 5           NA         NA         NA                 
## 6           NA         NA         NA

## data discription
summary(video_full)

##                           Name          Platform    Year_of_Release
##  Need for Speed: Most Wanted:   12   PS2    :2161   2008   :1427   
##  FIFA 14                    :    9   DS     :2152   2009   :1426   
##  LEGO Marvel Super Heroes   :    9   PS3    :1331   2010   :1255   
##  Madden NFL 07              :    9   Wii    :1320   2007   :1197   
##  Ratatouille                :    9   X360   :1262   2011   :1136   
##  Angry Birds Star Wars      :    8   PSP    :1209   2006   :1006   
##  (Other)                    :16663   (Other):7284   (Other):9272   
##           Genre                             Publisher    
##  Action      :3370   Electronic Arts             : 1356  
##  Sports      :2348   Activision                  :  985  
##  Misc        :1750   Namco Bandai Games          :  939  
##  Role-Playing:1500   Ubisoft                     :  933  
##  Shooter     :1323   Konami Digital Entertainment:  834  
##  Adventure   :1303   THQ                         :  715  
##  (Other)     :5125   (Other)                     :10957  
##     NA_Sales          EU_Sales         JP_Sales        Other_Sales      
##  Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.00000  
##  1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.00000  
##  Median : 0.0800   Median : 0.020   Median : 0.0000   Median : 0.01000  
##  Mean   : 0.2633   Mean   : 0.145   Mean   : 0.0776   Mean   : 0.04733  
##  3rd Qu.: 0.2400   3rd Qu.: 0.110   3rd Qu.: 0.0400   3rd Qu.: 0.03000  
##  Max.   :41.3600   Max.   :28.960   Max.   :10.2200   Max.   :10.57000  
##                                                                         
##   Global_Sales      Critic_Score    Critic_Count      User_Score   
##  Min.   : 0.0100   Min.   :13.00   Min.   :  3.00   Min.   :0.000  
##  1st Qu.: 0.0600   1st Qu.:60.00   1st Qu.: 12.00   1st Qu.:6.400  
##  Median : 0.1700   Median :71.00   Median : 21.00   Median :7.500  
##  Mean   : 0.5335   Mean   :68.97   Mean   : 26.36   Mean   :7.125  
##  3rd Qu.: 0.4700   3rd Qu.:79.00   3rd Qu.: 36.00   3rd Qu.:8.200  
##  Max.   :82.5300   Max.   :98.00   Max.   :113.00   Max.   :9.700  
##                    NA's   :8582    NA's   :8582     NA's   :9129   
##    User_Count          Developer        Rating    
##  Min.   :    4.0            :6623          :6769  
##  1st Qu.:   10.0   Ubisoft  : 204   E      :3991  
##  Median :   24.0   EA Sports: 172   T      :2961  
##  Mean   :  162.2   EA Canada: 167   M      :1563  
##  3rd Qu.:   81.0   Konami   : 162   E10+   :1420  
##  Max.   :10665.0   Capcom   : 139   EC     :   8  
##  NA's   :9129      (Other)  :9252   (Other):   7

dim(video_full)

## [1] 16719    16

str(video_full)

## 'data.frame':    16719 obs. of  16 variables:
##  $ Name           : Factor w/ 11563 levels "","'98 Koshien",..: 11059 9406 5573 11061 7417 9771 6693 11057 6696 2620 ...
##  $ Platform       : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
##  $ Year_of_Release: Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
##  $ Genre          : Factor w/ 13 levels "","Action","Adventure",..: 12 6 8 12 9 7 6 5 6 10 ...
##  $ Publisher      : Factor w/ 582 levels "10TACLE Studios",..: 371 371 371 371 371 371 371 371 371 371 ...
##  $ NA_Sales       : num  41.4 29.1 15.7 15.6 11.3 ...
##  $ EU_Sales       : num  28.96 3.58 12.76 10.93 8.89 ...
##  $ JP_Sales       : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales    : num  8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
##  $ Global_Sales   : num  82.5 40.2 35.5 32.8 31.4 ...
##  $ Critic_Score   : int  76 NA 82 80 NA NA 89 58 87 NA ...
##  $ Critic_Count   : int  51 NA 73 73 NA NA 65 41 80 NA ...
##  $ User_Score     : num  8 NA 8.3 8 NA NA 8.5 6.6 8.4 NA ...
##  $ User_Count     : int  322 NA 709 192 NA NA 431 129 594 NA ...
##  $ Developer      : Factor w/ 1697 levels "","10tacle Studios",..: 1035 1 1035 1035 1 1 1035 1035 1035 1 ...
##  $ Rating         : Factor w/ 9 levels "","AO","E","E10+",..: 3 1 3 3 1 1 3 3 3 1 ...

tail(video_full)

##                                                   Name Platform
## 16714 SCORE International Baja 1000: The Official Game      PS2
## 16715                    Samurai Warriors: Sanada Maru      PS3
## 16716                                 LMA Manager 2007     X360
## 16717                          Haitaka no Psychedelica      PSV
## 16718                                 Spirits & Spells      GBA
## 16719                              Winning Post 8 2016      PSV
##       Year_of_Release      Genre    Publisher NA_Sales EU_Sales JP_Sales
## 16714            2008     Racing   Activision     0.00     0.00     0.00
## 16715            2016     Action   Tecmo Koei     0.00     0.00     0.01
## 16716            2006     Sports  Codemasters     0.00     0.01     0.00
## 16717            2016  Adventure Idea Factory     0.00     0.00     0.01
## 16718            2003   Platform      Wanadoo     0.01     0.00     0.00
## 16719            2016 Simulation   Tecmo Koei     0.00     0.00     0.01
##       Other_Sales Global_Sales Critic_Score Critic_Count User_Score
## 16714           0         0.01           NA           NA         NA
## 16715           0         0.01           NA           NA         NA
## 16716           0         0.01           NA           NA         NA
## 16717           0         0.01           NA           NA         NA
## 16718           0         0.01           NA           NA         NA
## 16719           0         0.01           NA           NA         NA
##       User_Count Developer Rating
## 16714         NA                 
## 16715         NA                 
## 16716         NA                 
## 16717         NA                 
## 16718         NA                 
## 16719         NA

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

glimpse(video_full)

## Observations: 16,719
## Variables: 16
## $ Name            <fctr> Wii Sports, Super Mario Bros., Mario Kart Wii...
## $ Platform        <fctr> Wii, NES, Wii, Wii, GB, GB, DS, Wii, Wii, NES...
## $ Year_of_Release <fctr> 2006, 1985, 2008, 2009, 1996, 1989, 2006, 200...
## $ Genre           <fctr> Sports, Platform, Racing, Sports, Role-Playin...
## $ Publisher       <fctr> Nintendo, Nintendo, Nintendo, Nintendo, Ninte...
## $ NA_Sales        <dbl> 41.36, 29.08, 15.68, 15.61, 11.27, 23.20, 11.2...
## $ EU_Sales        <dbl> 28.96, 3.58, 12.76, 10.93, 8.89, 2.26, 9.14, 9...
## $ JP_Sales        <dbl> 3.77, 6.81, 3.79, 3.28, 10.22, 4.22, 6.50, 2.9...
## $ Other_Sales     <dbl> 8.45, 0.77, 3.29, 2.95, 1.00, 0.58, 2.88, 2.84...
## $ Global_Sales    <dbl> 82.53, 40.24, 35.52, 32.77, 31.37, 30.26, 29.8...
## $ Critic_Score    <int> 76, NA, 82, 80, NA, NA, 89, 58, 87, NA, NA, 91...
## $ Critic_Count    <int> 51, NA, 73, 73, NA, NA, 65, 41, 80, NA, NA, 64...
## $ User_Score      <dbl> 8.0, NA, 8.3, 8.0, NA, NA, 8.5, 6.6, 8.4, NA, ...
## $ User_Count      <int> 322, NA, 709, 192, NA, NA, 431, 129, 594, NA, ...
## $ Developer       <fctr> Nintendo, , Nintendo, Nintendo, , , Nintendo,...
## $ Rating          <fctr> E, , E, E, , , E, E, E, , , E, , E, E, E, M, ...

##data cleaning
video_no_year<-video_full[video_full$Year_of_Release !="N/A",]
video2<-video_no_year[video_no_year$Genre!="",]
video3<-video2[video2$Year_of_Release!=2020,]
video<-video3[video3$Year_of_Release!=2017,]
dim(video)

## [1] 16444    16

Limitation of Dataset

Our raw dataset is a combination of video games sales data and a score dataset of another web scrape from Metacritic (http://www.metacritic.com/). Missing observations exist as Metacritic only covers a subset of the platforms. There are 6900 observations contain all critic score and user score information.

Accumulative time length influence is also hard to observe from our dataset. Though people’s taste of games shifts frequently, there are some games who last a long life like Super Mario. Thus a high sales volume is possibly due to a long-lasting life, however, this theory cannot be concluded from our regression. We chose to omit time factor due to the limitation of information.

Since we find that some of the data points are “NA” and 1 “2020” in the Year_of _Release, and two of the Genre are blank, we decide to delete them since they are only small portion of the dataset. The original dataset has 16719 observations, and we delete 275 data points in the dataset. At last, there are 16444 observations in the dataset. We can think that they will not undermine our analysis.

Descriptive Analysis & Exploratory Data Analysis We conducted a thorough exploratory data analysis by using descriptions functions in packages like Hmisc, dplyr, pastecs, and some built-in functions.

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

describe(video)

## video 
## 
##  16  Variables      16444  Observations
## ---------------------------------------------------------------------------
## Name 
##        n  missing distinct 
##    16444        0    11426 
## 
## lowest : '98 Koshien                                    Beyblade Burst                                Fire Emblem Fates                             Frozen: Olaf's Quest                          Haikyu!! Cross Team Match!                  
## highest: Zumba Fitness Core                            Zumba Fitness Rush                            Zumba Fitness: World Party                    Zwei!!                                        Zyuden Sentai Kyoryuger: Game de Gaburincho!!
## ---------------------------------------------------------------------------
## Platform 
##        n  missing distinct 
##    16444        0       31 
## 
## lowest : 2600 3DO  3DS  DC   DS  , highest: WiiU WS   X360 XB   XOne
## ---------------------------------------------------------------------------
## Year_of_Release 
##        n  missing distinct 
##    16444        0       37 
## 
## lowest : 1980 1981 1982 1983 1984, highest: 2012 2013 2014 2015 2016
## ---------------------------------------------------------------------------
## Genre 
##        n  missing distinct 
##    16444        0       12 
## 
## Action (3307, 0.201), Adventure (1293, 0.079), Fighting (837, 0.051), Misc
## (1721, 0.105), Platform (878, 0.053), Puzzle (569, 0.035), Racing (1226,
## 0.075), Role-Playing (1481, 0.090), Shooter (1296, 0.079), Simulation
## (857, 0.052), Sports (2306, 0.140), Strategy (673, 0.041)
## ---------------------------------------------------------------------------
## Publisher 
##        n  missing distinct 
##    16444        0      580 
## 
## lowest : 10TACLE Studios              1C Company                   20th Century Fox Video Games 2D Boy                       3DO                         
## highest: Yumedia                      Zenrin                       Zoo Digital Publishing       Zoo Games                    Zushi Games                 
## ---------------------------------------------------------------------------
## NA_Sales 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    16444        0      401     0.98    0.264    0.401     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.00     0.08     0.24     0.61     1.06 
## 
## lowest :  0.00  0.01  0.02  0.03  0.04, highest: 15.68 23.20 26.93 29.08 41.36
## ---------------------------------------------------------------------------
## EU_Sales 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    16444        0      307    0.955   0.1459   0.2384     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.00     0.02     0.11     0.35     0.63 
## 
## lowest :  0.00  0.01  0.02  0.03  0.04, highest:  9.20 10.93 10.95 12.76 28.96
## ---------------------------------------------------------------------------
## JP_Sales 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    16444        0      244    0.754  0.07849   0.1392     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.00     0.00     0.04     0.18     0.36 
## 
## lowest :  0.00  0.01  0.02  0.03  0.04, highest:  6.04  6.50  6.81  7.20 10.22
## ---------------------------------------------------------------------------
## Other_Sales 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    16444        0      155    0.928  0.04759  0.07719     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.00     0.01     0.03     0.10     0.20 
## 
## lowest :  0.00  0.01  0.02  0.03  0.04, highest:  3.29  3.96  7.53  8.45 10.57
## ---------------------------------------------------------------------------
## Global_Sales 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    16444        0      628    0.999   0.5363   0.7682     0.02     0.02 
##      .25      .50      .75      .90      .95 
##     0.06     0.17     0.47     1.21     2.04 
## 
## lowest :  0.01  0.02  0.03  0.04  0.05, highest: 31.37 32.77 35.52 40.24 82.53
## ---------------------------------------------------------------------------
## Critic_Score 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7983     8461       81    0.999    68.99    15.59       43       50 
##      .25      .50      .75      .90      .95 
##       60       71       79       85       89 
## 
## lowest : 13 17 19 20 21, highest: 94 95 96 97 98
## ---------------------------------------------------------------------------
## Critic_Count 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7983     8461      106        1    26.44    20.49        5        7 
##      .25      .50      .75      .90      .95 
##       12       22       36       54       65 
## 
## lowest :   3   4   5   6   7, highest: 104 105 106 107 113
## ---------------------------------------------------------------------------
## User_Score 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7463     8981       95    0.999    7.126    1.602      4.1      5.1 
##      .25      .50      .75      .90      .95 
##      6.4      7.5      8.2      8.7      8.9 
## 
## lowest : 0.0 0.2 0.3 0.5 0.6, highest: 9.3 9.4 9.5 9.6 9.7
## ---------------------------------------------------------------------------
## User_Count 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     7463     8981      883        1      163    266.8      5.0      6.0 
##      .25      .50      .75      .90      .95 
##     10.0     24.0     81.0    317.8    728.6 
## 
## lowest :     4     5     6     7     8, highest:  9073  9629  9851 10179 10665
## ---------------------------------------------------------------------------
## Developer 
##        n  missing distinct 
##    16444        0     1681 
## 
## lowest :                                       10tacle Studios                       10tacle Studios, Fusionsphere Systems 1C, 1C Company                        1C, Ino-Co, 1C Company               
## highest: Zoonami Ltd.                          ZootFly                               zSlide                                Zushi Games Ltd                       Zuxxez                               
## ---------------------------------------------------------------------------
## Rating 
##        n  missing distinct 
##    16444        0        9 
##                                                                 
## Value               AO     E  E10+    EC   K-A     M    RP     T
## Frequency   6676     1  3921  1393     8     3  1536     1  2905
## Proportion 0.406 0.000 0.238 0.085 0.000 0.000 0.093 0.000 0.177
## ---------------------------------------------------------------------------

library(pastecs)

## Loading required package: boot

## 
## Attaching package: 'boot'

## The following object is masked from 'package:survival':
## 
##     aml

## The following object is masked from 'package:lattice':
## 
##     melanoma

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

options(scipen = 100)
options(digits = 2)
stat.desc(video[,c("NA_Sales", "EU_Sales","JP_Sales","Other_Sales","Global_Sales","Critic_Score","Critic_Count","User_Score","User_Count")])

##                NA_Sales   EU_Sales   JP_Sales Other_Sales Global_Sales
## nbr.val      16444.0000 16444.0000 16444.0000  16444.0000    16444.000
## nbr.null      4449.0000  5766.0000 10300.0000   6492.0000        0.000
## nbr.na           0.0000     0.0000     0.0000      0.0000        0.000
## min              0.0000     0.0000     0.0000      0.0000        0.010
## max             41.3600    28.9600    10.2200     10.5700       82.530
## range           41.3600    28.9600    10.2200     10.5700       82.520
## sum           4341.4200  2399.6800  1290.6400    782.6300     8818.570
## median           0.0800     0.0200     0.0000      0.0100        0.170
## mean             0.2640     0.1459     0.0785      0.0476        0.536
## SE.mean          0.0064     0.0040     0.0024      0.0015        0.012
## CI.mean.0.95     0.0125     0.0077     0.0048      0.0029        0.024
## var              0.6697     0.2568     0.0968      0.0353        2.429
## std.dev          0.8184     0.5067     0.3111      0.1880        1.559
## coef.var         3.0998     3.4723     3.9637      3.9502        2.906
##              Critic_Score Critic_Count User_Score User_Count
## nbr.val           7983.00      7983.00   7463.000     7463.0
## nbr.null             0.00         0.00      1.000        0.0
## nbr.na            8461.00      8461.00   8981.000     8981.0
## min                 13.00         3.00      0.000        4.0
## max                 98.00       113.00      9.700    10665.0
## range               85.00       110.00      9.700    10661.0
## sum             550782.00    211081.00  53183.800  1216582.0
## median              71.00        22.00      7.500       24.0
## mean                68.99        26.44      7.126      163.0
## SE.mean              0.16         0.21      0.017        6.5
## CI.mean.0.95         0.31         0.42      0.034       12.8
## var                193.77       361.31      2.248   317941.9
## std.dev             13.92        19.01      1.499      563.9
## coef.var             0.20         0.72      0.210        3.5

Before developing our model, let’s take a look at the correlation coefficients of numeric variables:

#install.packages("corrplot")
library(corrplot)

## Warning: package 'corrplot' was built under R version 3.3.3

num.cols <- sapply(na.omit(video), is.numeric)
cor.data <- cor(na.omit(video)[,num.cols])
corrPLOT<-corrplot(cor.data,method='square')

Everything is positive correlated. And the global sales mostly rely on NA sales and EU sales. NA sales, EU sales, and other sales move together at some level, while Japan market is an independent world. The user score is also slightly correlated with the critic score.

Development of EDA

## EDA
## the trend of the Genre
library(dplyr)
q1<-count(video, Year_of_Release,Genre)
q1

## Source: local data frame [387 x 3]
## Groups: Year_of_Release [?]
## 
##    Year_of_Release    Genre     n
##             <fctr>   <fctr> <int>
## 1             1980   Action     1
## 2             1980 Fighting     1
## 3             1980     Misc     4
## 4             1980  Shooter     2
## 5             1980   Sports     1
## 6             1981   Action    25
## 7             1981 Platform     3
## 8             1981   Puzzle     2
## 9             1981   Racing     1
## 10            1981  Shooter    10
## # ... with 377 more rows

library("ggplot2")
#plot the form
ggplot(q1, aes(x = factor(Year_of_Release), y = n, colour=Genre, group=Genre)) + geom_line()+ggtitle("Trend of Genre by Year")+theme(axis.text.x = element_text(angle = 90,hjust = 1))

From the plot, we can see that there is a shift of people’s taste. Basicly, before 2002, the most popular published game genre is sports, after 2003, it turned to be action.The biggest influenced genres are action, misc, and sports. We also can tell that there is a sharp drop of game market from 2010-2013. We think that it’s because since 2010, the smart phone(like iphone) is more and more popular, it became a new platform of the game(such as pokeman-go, super mario,fruit ninga). Since this dataset dosen’t include this new platform, it’s probably the reason why there is a sharp drop.

# EDA  the trend of the platform 
head(video)

##                       Name Platform Year_of_Release        Genre Publisher
## 1               Wii Sports      Wii            2006       Sports  Nintendo
## 2        Super Mario Bros.      NES            1985     Platform  Nintendo
## 3           Mario Kart Wii      Wii            2008       Racing  Nintendo
## 4        Wii Sports Resort      Wii            2009       Sports  Nintendo
## 5 Pokemon Red/Pokemon Blue       GB            1996 Role-Playing  Nintendo
## 6                   Tetris       GB            1989       Puzzle  Nintendo
##   NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## 1       41     29.0      3.8        8.45           83           76
## 2       29      3.6      6.8        0.77           40           NA
## 3       16     12.8      3.8        3.29           36           82
## 4       16     10.9      3.3        2.95           33           80
## 5       11      8.9     10.2        1.00           31           NA
## 6       23      2.3      4.2        0.58           30           NA
##   Critic_Count User_Score User_Count Developer Rating
## 1           51        8.0        322  Nintendo      E
## 2           NA         NA         NA                 
## 3           73        8.3        709  Nintendo      E
## 4           73        8.0        192  Nintendo      E
## 5           NA         NA         NA                 
## 6           NA         NA         NA

q2<-count(video, Year_of_Release,Platform)
q2

## Source: local data frame [238 x 3]
## Groups: Year_of_Release [?]
## 
##    Year_of_Release Platform     n
##             <fctr>   <fctr> <int>
## 1             1980     2600     9
## 2             1981     2600    46
## 3             1982     2600    36
## 4             1983     2600    11
## 5             1983      NES     6
## 6             1984     2600     1
## 7             1984      NES    13
## 8             1985     2600     1
## 9             1985       DS     1
## 10            1985      NES    11
## # ... with 228 more rows

str(q2)

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  238 obs. of  3 variables:
##  $ Year_of_Release: Factor w/ 40 levels "1980","1981",..: 1 2 3 4 4 5 5 6 6 6 ...
##  $ Platform       : Factor w/ 31 levels "2600","3DO","3DS",..: 1 1 1 1 12 1 12 1 5 12 ...
##  $ n              : int  9 46 36 11 6 1 13 1 1 11 ...
##  - attr(*, "vars")=List of 1
##   ..$ : symbol Year_of_Release
##  - attr(*, "drop")= logi TRUE

# trend of all the platform by year 
ggplot(q2, aes(x = factor(Year_of_Release), y = n, colour=Platform, group=Platform)) + geom_line()+ggtitle("Trend of Platform by Year")+theme(axis.text.x = element_text(angle = 90,hjust = 1))

# trend of the PS series platform by year
q3_ps<-q2[q2$Platform=="PS" | q2$Platform=="PS2" | q2$Platform=="PS3"| q2$Platform=="PS4",]
ggplot(q3_ps, aes(x = factor(Year_of_Release), y = n, colour=Platform, group=Platform)) + geom_line()+ggtitle("Trend of 'PS' Platform by Year")+theme(axis.text.x = element_text(angle = 90,hjust = 1))

From the plot we can see that the platform cycle is around 10 years, which the popularity increased to the peak and then decreased. As the new generation of the same platform hit the market, the older platform becomes less popular. Nowadays, PS4 is the most popular platform among the ‘PS’ series. We think it might reach the peak in a few years.

## xb,x360,xone
q4_xb<-q2[q2$Platform=="XB" | q2$Platform=="X360" | q2$Platform=="XOne",]
ggplot(q4_xb, aes(x = factor(Year_of_Release), y = n, colour=Platform, group=Platform)) + geom_line()+ggtitle("Trend of 'Xbox' Platform by Year")+theme(axis.text.x = element_text(angle = 90,hjust = 1))

q5_mix<-q2[q2$Platform=="XB" | q2$Platform=="X360"| q2$Platform=="XOne" | q2$Platform=="PS"|q2$Platform=="PS2" | q2$Platform=="PS3" | q2$Platform=="PS4" ,]
ggplot(q5_mix, aes(x = factor(Year_of_Release), y = n, colour=Platform, group=Platform)) + geom_line()+ggtitle(" The competition between SONY and Microsoft")+theme(axis.text.x = element_text(angle = 90,hjust = 1))

The ‘PS’ series are created by SONY. The ‘XB’,‘XB360’and ’XOne’ are created by Microsoft. From the plot we can see that two platform designer are created new platform synchronously. The two companies are always take actions at the same time. SONY always performs better than Microsoft.

#  EDA 2
# the pie chart 
library(doBy)
attach(video)
sum_sale<-summaryBy(NA_Sales+EU_Sales+JP_Sales+Other_Sales~Genre,data=video,FUN=sum)
sum_sale

##           Genre NA_Sales.sum EU_Sales.sum JP_Sales.sum Other_Sales.sum
## 1        Action          863          511          160             182
## 2     Adventure          102           63           52              16
## 3      Fighting          221           99           87              36
## 4          Misc          400          211          107              73
## 5      Platform          444          199          131              51
## 6        Puzzle          121           50           57              12
## 7        Racing          357          234           57              76
## 8  Role-Playing          330          188          353              59
## 9       Shooter          585          315           39             103
## 10   Simulation          180          113           64              31
## 11       Sports          671          371          135             132
## 12     Strategy           68           45           49              11

# the pie chart of the NA_Sales by genre
label_NA<-paste('(', round(sum_sale$NA_Sales.sum/sum(sum_sale$NA_Sales.sum) * 100, 1), '%)', sep = '')
label_NA

##  [1] "(19.9%)" "(2.3%)"  "(5.1%)"  "(9.2%)"  "(10.2%)" "(2.8%)"  "(8.2%)" 
##  [8] "(7.6%)"  "(13.5%)" "(4.2%)"  "(15.5%)" "(1.6%)"

labelNA <- paste(sum_sale$Genre, label_NA, sep = '')
labelNA

##  [1] "Action(19.9%)"      "Adventure(2.3%)"    "Fighting(5.1%)"    
##  [4] "Misc(9.2%)"         "Platform(10.2%)"    "Puzzle(2.8%)"      
##  [7] "Racing(8.2%)"       "Role-Playing(7.6%)" "Shooter(13.5%)"    
## [10] "Simulation(4.2%)"   "Sports(15.5%)"      "Strategy(1.6%)"

ggplot(data = sum_sale, mapping = aes(x = 'Content', y = NA_Sales.sum, fill = Genre )) + geom_bar(stat = 'identity', position = 'stack', width = 1)+ coord_polar(theta ="y") + ggtitle("Pie chart for NA_Sales")+ theme(axis.text = element_blank())+ scale_fill_discrete(labels =labelNA)

#geom_text(aes(y = sum_sale$NA_Sales.sum/8 + c(0, cumsum(sum_sale$NA_Sales.sum)[-length(sum_sale$NA_Sales.sum)]), x = sum(sum_sale$NA_Sales.sum)/3000, label = labelNA))

North America : Action 19.9% sports 15.5% shooter 13.5%

# the pie chart of the EU_Sales by genre
label_EU<-paste('(', round(sum_sale$EU_Sales.sum/sum(sum_sale$EU_Sales.sum) * 100, 1), '%)', sep = '')
label_EU

##  [1] "(21.3%)" "(2.6%)"  "(4.1%)"  "(8.8%)"  "(8.3%)"  "(2.1%)"  "(9.8%)" 
##  [8] "(7.8%)"  "(13.1%)" "(4.7%)"  "(15.5%)" "(1.9%)"

labelEU <- paste(sum_sale$Genre, label_EU, sep = '')
labelEU

##  [1] "Action(21.3%)"      "Adventure(2.6%)"    "Fighting(4.1%)"    
##  [4] "Misc(8.8%)"         "Platform(8.3%)"     "Puzzle(2.1%)"      
##  [7] "Racing(9.8%)"       "Role-Playing(7.8%)" "Shooter(13.1%)"    
## [10] "Simulation(4.7%)"   "Sports(15.5%)"      "Strategy(1.9%)"

ggplot(data = sum_sale, mapping = aes(x = 'Content', y = EU_Sales.sum, fill = Genre )) + geom_bar(stat = 'identity', position = 'stack', width = 1)+ coord_polar(theta ="y") + ggtitle("Pie chart for EU_Sales")+ theme(axis.text = element_blank())+ scale_fill_discrete(labels =labelEU)

Europe: Action 21.3% Sports 15.5% shooter 13.1%

# the pie chart of the JP_Sales by genre
label_JP<-paste('(', round(sum_sale$JP_Sales.sum/sum(sum_sale$JP_Sales.sum) * 100, 1), '%)', sep = '')
label_JP

##  [1] "(12.4%)" "(4%)"    "(6.8%)"  "(8.3%)"  "(10.1%)" "(4.4%)"  "(4.4%)" 
##  [8] "(27.4%)" "(3%)"    "(4.9%)"  "(10.5%)" "(3.8%)"

labelJP <- paste(sum_sale$Genre, label_JP, sep = '')
labelJP

##  [1] "Action(12.4%)"       "Adventure(4%)"       "Fighting(6.8%)"     
##  [4] "Misc(8.3%)"          "Platform(10.1%)"     "Puzzle(4.4%)"       
##  [7] "Racing(4.4%)"        "Role-Playing(27.4%)" "Shooter(3%)"        
## [10] "Simulation(4.9%)"    "Sports(10.5%)"       "Strategy(3.8%)"

ggplot(data = sum_sale, mapping = aes(x = 'Content', y = JP_Sales.sum, fill = Genre )) + geom_bar(stat = 'identity', position = 'stack', width = 1)+ coord_polar(theta ="y") + ggtitle("Pie chart for JP_Sales")+ theme(axis.text = element_blank())+ scale_fill_discrete(labels =labelJP)

Japan: Role-playing 27.4% Action 12.4% Sports 10.5%

# the pie chart of the OTHER_Sales by genre
label_OTHER<-paste('(', round(sum_sale$Other_Sales.sum/sum(sum_sale$Other_Sales.sum) * 100, 1), '%)', sep = '')
label_OTHER

##  [1] "(23.3%)" "(2.1%)"  "(4.6%)"  "(9.3%)"  "(6.5%)"  "(1.6%)"  "(9.6%)" 
##  [8] "(7.6%)"  "(13.2%)" "(3.9%)"  "(16.9%)" "(1.4%)"

labelOTHER <- paste(sum_sale$Genre, label_OTHER, sep = '')
labelOTHER

##  [1] "Action(23.3%)"      "Adventure(2.1%)"    "Fighting(4.6%)"    
##  [4] "Misc(9.3%)"         "Platform(6.5%)"     "Puzzle(1.6%)"      
##  [7] "Racing(9.6%)"       "Role-Playing(7.6%)" "Shooter(13.2%)"    
## [10] "Simulation(3.9%)"   "Sports(16.9%)"      "Strategy(1.4%)"

ggplot(data = sum_sale, mapping = aes(x = 'Content', y = Other_Sales.sum, fill = Genre )) + geom_bar(stat = 'identity', position = 'stack', width = 1)+ coord_polar(theta ="y") + ggtitle("Pie chart for Other_Sales")+ theme(axis.text = element_blank())+ scale_fill_discrete(labels =labelOTHER)

other: action 23.3% sports 16.9% shooter 13.2%

All in all, action and sports are popular among the world. But, in Japan role-playing games are much more popular than action and sports games.

##EDA 3
# count score for each year 
scorecount <- video[, c(3,12,14)]
#View(scorecount)
scorecount$exist <- c(rep(1,nrow(scorecount)))
for(i in 1:16444)
  {
  if(as.numeric(is.na(scorecount$Critic_Count[i]))==0 && as.numeric(is.na(scorecount$User_Count[i]))==0){scorecount$count[i]=1}
  else{scorecount$count[i]=0}
}

head(scorecount)

##   Year_of_Release Critic_Count User_Count exist count
## 1            2006           51        322     1     1
## 2            1985           NA         NA     1     0
## 3            2008           73        709     1     1
## 4            2009           73        192     1     1
## 5            1996           NA         NA     1     0
## 6            1989           NA         NA     1     0

#sum count
library(doBy)
attach(scorecount)

## The following objects are masked from video:
## 
##     Critic_Count, User_Count, Year_of_Release

sum_count<-summaryBy(exist+count~Year_of_Release,data=scorecount,FUN=sum)
sum_count

##    Year_of_Release exist.sum count.sum
## 1             1980         9         0
## 2             1981        46         0
## 3             1982        36         0
## 4             1983        17         0
## 5             1984        14         0
## 6             1985        14         1
## 7             1986        21         0
## 8             1987        16         0
## 9             1988        15         1
## 10            1989        17         0
## 11            1990        16         0
## 12            1991        41         0
## 13            1992        43         1
## 14            1993        60         0
## 15            1994       121         1
## 16            1995       219         0
## 17            1996       263         8
## 18            1997       289        14
## 19            1998       379        26
## 20            1999       338        30
## 21            2000       350       102
## 22            2001       482       256
## 23            2002       829       455
## 24            2003       775       499
## 25            2004       762       477
## 26            2005       939       562
## 27            2006      1006       528
## 28            2007      1197       590
## 29            2008      1427       595
## 30            2009      1426       554
## 31            2010      1255       431
## 32            2011      1136       466
## 33            2012       653       321
## 34            2013       544       272
## 35            2014       581       256
## 36            2015       606       221
## 37            2016       502       227

sum_count$useless<-sum_count$exist.sum-sum_count$count.sum
sum_count2<-sum_count[,c(1,3,4)]
rownames(sum_count2)<-seq(1980,2016,1)

barplot(t(sum_count2[,c(2,3)]), main="number of games by year and score",
  xlab="Year-of-release", col=c("darkblue","red"))

Each bar is the number of games released by year. The blue part is the number of games which contains both user_score and critic_score. Otherwise, it’s repesented in the red part. For example, the game has no score or only has one kind of the score.

The total bar height presents the total records in one year, and the blue one is data that contains score information. More complete information is centered between 2000 and 2016, as the score provider, Metacritic, is a company with brief history.

# the subset used to do the regression, which has both score of user and critic
video$count<-scorecount$count
video_reg<-subset(video,scorecount$count==1)
video_nintendo<-subset(video_reg,Publisher=="Nintendo")

SMART Question Development

Our smart question is: Whichfactors produce a change in video game sales?

Specific: we examined each possible influential variables Measurable: tests could be used to measure the fitness Achievable: build a connection between our target and data Reasonable: make a reasonable analysis and conclude a corresponding result Timeliness: occur at a favorable time

Regression

Conclusion The mystery of best-selling games. . The majority of video games were not selling well, and their data fits the linear regression model very well. . The data of best-selling games(Super Mario, Call of Duty, etc.) doesn’t fit the model well. When sales is big enough, sales variable itself is also an influential factor to regression model.
Finally, we used square rooted sales as a transform of response Y thus eliminate the error due to butterfly effect. The higher R-square presents a better fitting model.

#regression
dim(video)

## [1] 16444    17

video_nintendo<-subset(video_reg,Publisher=="Nintendo")

# use video_nintendo to do the regression, this dataset only includes nintendo publisher. we only want to do the regression of the nintendo.

# make the four region to be in one row 
# north america

video_nintendo_NA<-video_nintendo[,-c(7,8,9,10)]

colnames(video_nintendo_NA)[6]<-"sales"

video_nintendo_NA$region<-rep("NORTH AMERICA",nrow(video_nintendo_NA))
set.seed(1000)
video_nintendo_NA_sample <- sample(2, nrow(video_nintendo_NA), replace=TRUE, prob=c(0.67, 0.33))
video_nintendo_NA_training <- video_nintendo_NA[video_nintendo_NA_sample==1,]
video_nintendo_NA_test <- video_nintendo_NA[video_nintendo_NA_sample==2,]

# EUROPEAN

video_nintendo_EU<-video_nintendo[,-c(6,8,9,10)]

colnames(video_nintendo_EU)[6]<-"sales"

video_nintendo_EU$region<-rep("EUROPEAN",nrow(video_nintendo_EU))

video_nintendo_EU_sample <- sample(2, nrow(video_nintendo_EU), replace=TRUE, prob=c(0.67, 0.33))
video_nintendo_EU_training <- video_nintendo_EU[video_nintendo_EU_sample==1,]
video_nintendo_EU_test <- video_nintendo_EU[video_nintendo_EU_sample==2,]



# Japan

video_nintendo_JP<-video_nintendo[,-c(6,7,9,10)]

colnames(video_nintendo_JP)[6]<-"sales"

video_nintendo_JP$region<-rep("JAPAN",nrow(video_nintendo_JP))

video_nintendo_JP_sample <- sample(2, nrow(video_nintendo_JP), replace=TRUE, prob=c(0.67, 0.33))
video_nintendo_JP_training <- video_nintendo_JP[video_nintendo_JP_sample==1,]
video_nintendo_JP_test <- video_nintendo_JP[video_nintendo_JP_sample==2,]



# OTHER

video_nintendo_OTHER<-video_nintendo[,-c(6,8,7,10)]

colnames(video_nintendo_OTHER)[6]<-"sales"

video_nintendo_OTHER$region<-rep("OTHER",nrow(video_nintendo_OTHER))

video_nintendo_OTHER_sample <- sample(2, nrow(video_nintendo_OTHER), replace=TRUE, prob=c(0.67, 0.33))
video_nintendo_OTHER_training <- video_nintendo_OTHER[video_nintendo_OTHER_sample==1,]
video_nintendo_OTHER_test <- video_nintendo_OTHER[video_nintendo_OTHER_sample==2,]


# rbind the dataset
new_video_nintendo_training<-rbind(video_nintendo_NA_training,video_nintendo_EU_training,video_nintendo_JP_training,video_nintendo_OTHER_training)
new_video_nintendo_test<-rbind(video_nintendo_NA_test,video_nintendo_EU_test,video_nintendo_JP_test,video_nintendo_OTHER_test)

#Linear Rregression
lm1 <- lm(sales~Platform+Genre+Critic_Score+User_Score+region-1,data=new_video_nintendo_training)
summary(lm1)

## 
## Call:
## lm(formula = sales ~ Platform + Genre + Critic_Score + User_Score + 
##     region - 1, data = new_video_nintendo_training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.40  -0.72  -0.10   0.37  37.22 
## 
## Coefficients:
##                     Estimate Std. Error t value      Pr(>|t|)    
## Platform3DS         -1.92831    0.71877   -2.68       0.00746 ** 
## PlatformDS          -1.96817    0.75518   -2.61       0.00933 ** 
## PlatformGBA         -2.53429    0.76808   -3.30       0.00101 ** 
## PlatformGC          -2.36726    0.78061   -3.03       0.00251 ** 
## PlatformWii         -0.97327    0.73757   -1.32       0.18738    
## PlatformWiiU        -2.40883    0.74482   -3.23       0.00127 ** 
## GenreAdventure       0.25957    0.41860    0.62       0.53539    
## GenreFighting       -0.04027    0.56135   -0.07       0.94284    
## GenreMisc            0.67150    0.30835    2.18       0.02974 *  
## GenrePlatform        0.72098    0.29836    2.42       0.01591 *  
## GenrePuzzle          0.13531    0.35786    0.38       0.70544    
## GenreRacing          0.79846    0.39680    2.01       0.04455 *  
## GenreRole-Playing    0.13185    0.29139    0.45       0.65104    
## GenreShooter        -0.04048    0.46055   -0.09       0.92998    
## GenreSimulation     -0.15292    0.43766   -0.35       0.72688    
## GenreSports          2.24638    0.34733    6.47 0.00000000018 ***
## GenreStrategy       -0.18933    0.43624   -0.43       0.66441    
## Critic_Score         0.03526    0.00978    3.60       0.00033 ***
## User_Score          -0.03913    0.10588   -0.37       0.71181    
## regionJAPAN         -0.23054    0.21806   -1.06       0.29075    
## regionNORTH AMERICA  0.49694    0.22042    2.25       0.02445 *  
## regionOTHER         -0.63407    0.21829   -2.90       0.00378 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.1 on 761 degrees of freedom
## Multiple R-squared:  0.252,  Adjusted R-squared:  0.23 
## F-statistic: 11.6 on 22 and 761 DF,  p-value: <0.0000000000000002

plot(lm1)

new_video_nintendo_training$sales_sqrt <- sqrt(new_video_nintendo_training$sales)
new_video_nintendo_test$sales_sqrt <- sqrt(new_video_nintendo_test$sales)
lm2 <- lm(sales_sqrt~Platform+Genre+Critic_Score+User_Score+region-1,data=new_video_nintendo_training)
summary(lm2)

## 
## Call:
## lm(formula = sales_sqrt ~ Platform + Genre + Critic_Score + User_Score + 
##     region - 1, data = new_video_nintendo_training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.172 -0.307 -0.045  0.199  4.881 
## 
## Coefficients:
##                     Estimate Std. Error t value        Pr(>|t|)    
## Platform3DS         -0.66758    0.17715   -3.77         0.00018 ***
## PlatformDS          -0.71678    0.18612   -3.85         0.00013 ***
## PlatformGBA         -0.96675    0.18930   -5.11 0.0000004141988 ***
## PlatformGC          -0.85206    0.19239   -4.43 0.0000108613669 ***
## PlatformWii         -0.41417    0.18178   -2.28         0.02298 *  
## PlatformWiiU        -0.86803    0.18357   -4.73 0.0000026934553 ***
## GenreAdventure       0.04503    0.10317    0.44         0.66260    
## GenreFighting       -0.00231    0.13835   -0.02         0.98669    
## GenreMisc            0.33464    0.07600    4.40 0.0000121848230 ***
## GenrePlatform        0.31292    0.07353    4.26 0.0000234690963 ***
## GenrePuzzle          0.01851    0.08820    0.21         0.83385    
## GenreRacing          0.31542    0.09780    3.23         0.00131 ** 
## GenreRole-Playing    0.04920    0.07182    0.69         0.49352    
## GenreShooter        -0.01396    0.11351   -0.12         0.90212    
## GenreSimulation     -0.07560    0.10787   -0.70         0.48358    
## GenreSports          0.50546    0.08560    5.90 0.0000000053227 ***
## GenreStrategy       -0.15276    0.10752   -1.42         0.15579    
## Critic_Score         0.01707    0.00241    7.08 0.0000000000033 ***
## User_Score          -0.01245    0.02609   -0.48         0.63346    
## regionJAPAN         -0.01975    0.05374   -0.37         0.71335    
## regionNORTH AMERICA  0.26121    0.05432    4.81 0.0000018344902 ***
## regionOTHER         -0.29909    0.05380   -5.56 0.0000000374581 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.53 on 761 degrees of freedom
## Multiple R-squared:  0.646,  Adjusted R-squared:  0.636 
## F-statistic: 63.1 on 22 and 761 DF,  p-value: <0.0000000000000002

plot(lm2)

We devide the dataset into two group, training and test. In the training dataset, Then we fit the linear regression (lm1) between the dependent variable “sales” and 5 independent variables. Adjusted R-squared of this model is 0.23 which is pretty low. The Q-Q plot of this linear regression has a heavier tail so that we could reject the normality assumotion.

Then we do the transformation for the dependent variable “sales” and fit another linear regression (lm2). Adjusted R-squared of this model is 0.636. The Q-Q plot of this model still has a heavier tail,but it looks much better than the previous linear model. In addition, the p-value of this model is pretty low, so we can use this model.

pred<- predict(lm2,new_video_nintendo_test)
Eval <- data.frame(Game= new_video_nintendo_test$Name, Actual = new_video_nintendo_test$sales_sqrt)
pred <- round(pred,2)
Eval <- Eval[1:length(pred),]
Eval$Predicted <- abs(pred)
Eval$diff <- abs(Eval$Predicted - Eval$Actual)
head(Eval)

##                          Game Actual Predicted diff
## 1              Mario Kart Wii    4.0      1.46 2.50
## 2       New Super Mario Bros.    3.4      1.27 2.09
## 3               Mario Kart DS    3.1      1.31 1.81
## 4     Super Smash Bros. Brawl    2.6      1.32 1.25
## 5 Animal Crossing: Wild World    1.6      0.83 0.75
## 6          Super Mario Galaxy    2.5      1.70 0.76

RMSE <- sqrt(mean(Eval$diff^2))
RMSE

## [1] 0.49

After that, we use the second model (lm2) to do the prediction for the test dataset. Then we create a table which concludes the actual sales and predicted value of sales. Finally, we calculate the RMSE of this model. The value of RMSE is 0.49.

Conclusion

The mystery of best-selling games.

-The majority of video games were not selling well, and their data fits the linear regression model very well.

-The data of best-selling games(Super Mario, Call of Duty, etc.) doesn’t fit the model well.

When sales is big enough, sales variable itself is also an influential factor to regression model.

Finally, we used square rooted sales as a transform of response Y thus eliminate the error due to butterfly effect. The higher R-square presents a better fitting model.