Our project goal is to investigate video game sales regarding game platform, genre, developer, and review scores. In 2016, the computer and video game industry of the U.S. generated $30.4 billion in revenue, according to new data released by the Entertainment Software Association (Entertainment Software Association) . Thanks for the recent past with the revolution of computer science, video game has become a pervasive entertainment. Our research is interested in the potential factors that influence the sales volume of a game.
We started our research by working upon the data and conducting an exploratory data analysis. After summarizing the main characteristics and building visualizations, we developed our SMART question and designed a model to solve this question. Then, we chose a reasonable regression method with appropriate responses. Finally, a supervised learning method was applied to test our model.
Our raw dataset contains sales information and review scores of 16719 games released from 1980 to 2016. Variables and definition is listed below:
Name: Name of the game. Platform: The hardware platform that a game was released on. One game could be published on multiple platforms. Popular platforms include Wii, X-Box, PS4. Year: of Release The year that a game come to the world. Genre: Categories of games. Examples are action, role playing, racing etc. Publisher: The company that did marketing and manufacturing of a game. Developer: The company that designed and developed the game. NA_Sales: Sales volume in North America. EU_Sales: Sales volume in Europe. JP_Sales: Sales volume in Japan. Other_Sales: Sales volume in other region. Global_Sales: Sum of sales volume in North America, Europe, Japan, and other region. Critic_Score: The score given by Metacritic employees Critic_Count: The number of contribution to the score User_Score: The score given by game players User_Count: The number of contribution to the score Rating: ESRB rating based on age(C-early childhood, E-everyone, T-teen)
We cleaned data without valid year of release and genres, which totally 275 observations. After that, there are still 16444 observations to analyze. We believe this cleaning is necessary and would not undermine our analysis.
## inport dataset
video_full<-read.csv("video.csv")
head(video_full)
## Name Platform Year_of_Release Genre Publisher
## 1 Wii Sports Wii 2006 Sports Nintendo
## 2 Super Mario Bros. NES 1985 Platform Nintendo
## 3 Mario Kart Wii Wii 2008 Racing Nintendo
## 4 Wii Sports Resort Wii 2009 Sports Nintendo
## 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo
## 6 Tetris GB 1989 Puzzle Nintendo
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## 1 41.36 28.96 3.77 8.45 82.53 76
## 2 29.08 3.58 6.81 0.77 40.24 NA
## 3 15.68 12.76 3.79 3.29 35.52 82
## 4 15.61 10.93 3.28 2.95 32.77 80
## 5 11.27 8.89 10.22 1.00 31.37 NA
## 6 23.20 2.26 4.22 0.58 30.26 NA
## Critic_Count User_Score User_Count Developer Rating
## 1 51 8.0 322 Nintendo E
## 2 NA NA NA
## 3 73 8.3 709 Nintendo E
## 4 73 8.0 192 Nintendo E
## 5 NA NA NA
## 6 NA NA NA
## data discription
summary(video_full)
## Name Platform Year_of_Release
## Need for Speed: Most Wanted: 12 PS2 :2161 2008 :1427
## FIFA 14 : 9 DS :2152 2009 :1426
## LEGO Marvel Super Heroes : 9 PS3 :1331 2010 :1255
## Madden NFL 07 : 9 Wii :1320 2007 :1197
## Ratatouille : 9 X360 :1262 2011 :1136
## Angry Birds Star Wars : 8 PSP :1209 2006 :1006
## (Other) :16663 (Other):7284 (Other):9272
## Genre Publisher
## Action :3370 Electronic Arts : 1356
## Sports :2348 Activision : 985
## Misc :1750 Namco Bandai Games : 939
## Role-Playing:1500 Ubisoft : 933
## Shooter :1323 Konami Digital Entertainment: 834
## Adventure :1303 THQ : 715
## (Other) :5125 (Other) :10957
## NA_Sales EU_Sales JP_Sales Other_Sales
## Min. : 0.0000 Min. : 0.000 Min. : 0.0000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.00000
## Median : 0.0800 Median : 0.020 Median : 0.0000 Median : 0.01000
## Mean : 0.2633 Mean : 0.145 Mean : 0.0776 Mean : 0.04733
## 3rd Qu.: 0.2400 3rd Qu.: 0.110 3rd Qu.: 0.0400 3rd Qu.: 0.03000
## Max. :41.3600 Max. :28.960 Max. :10.2200 Max. :10.57000
##
## Global_Sales Critic_Score Critic_Count User_Score
## Min. : 0.0100 Min. :13.00 Min. : 3.00 Min. :0.000
## 1st Qu.: 0.0600 1st Qu.:60.00 1st Qu.: 12.00 1st Qu.:6.400
## Median : 0.1700 Median :71.00 Median : 21.00 Median :7.500
## Mean : 0.5335 Mean :68.97 Mean : 26.36 Mean :7.125
## 3rd Qu.: 0.4700 3rd Qu.:79.00 3rd Qu.: 36.00 3rd Qu.:8.200
## Max. :82.5300 Max. :98.00 Max. :113.00 Max. :9.700
## NA's :8582 NA's :8582 NA's :9129
## User_Count Developer Rating
## Min. : 4.0 :6623 :6769
## 1st Qu.: 10.0 Ubisoft : 204 E :3991
## Median : 24.0 EA Sports: 172 T :2961
## Mean : 162.2 EA Canada: 167 M :1563
## 3rd Qu.: 81.0 Konami : 162 E10+ :1420
## Max. :10665.0 Capcom : 139 EC : 8
## NA's :9129 (Other) :9252 (Other): 7
dim(video_full)
## [1] 16719 16
str(video_full)
## 'data.frame': 16719 obs. of 16 variables:
## $ Name : Factor w/ 11563 levels "","'98 Koshien",..: 11059 9406 5573 11061 7417 9771 6693 11057 6696 2620 ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
## $ Year_of_Release: Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
## $ Genre : Factor w/ 13 levels "","Action","Adventure",..: 12 6 8 12 9 7 6 5 6 10 ...
## $ Publisher : Factor w/ 582 levels "10TACLE Studios",..: 371 371 371 371 371 371 371 371 371 371 ...
## $ NA_Sales : num 41.4 29.1 15.7 15.6 11.3 ...
## $ EU_Sales : num 28.96 3.58 12.76 10.93 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
## $ Global_Sales : num 82.5 40.2 35.5 32.8 31.4 ...
## $ Critic_Score : int 76 NA 82 80 NA NA 89 58 87 NA ...
## $ Critic_Count : int 51 NA 73 73 NA NA 65 41 80 NA ...
## $ User_Score : num 8 NA 8.3 8 NA NA 8.5 6.6 8.4 NA ...
## $ User_Count : int 322 NA 709 192 NA NA 431 129 594 NA ...
## $ Developer : Factor w/ 1697 levels "","10tacle Studios",..: 1035 1 1035 1035 1 1 1035 1035 1035 1 ...
## $ Rating : Factor w/ 9 levels "","AO","E","E10+",..: 3 1 3 3 1 1 3 3 3 1 ...
tail(video_full)
## Name Platform
## 16714 SCORE International Baja 1000: The Official Game PS2
## 16715 Samurai Warriors: Sanada Maru PS3
## 16716 LMA Manager 2007 X360
## 16717 Haitaka no Psychedelica PSV
## 16718 Spirits & Spells GBA
## 16719 Winning Post 8 2016 PSV
## Year_of_Release Genre Publisher NA_Sales EU_Sales JP_Sales
## 16714 2008 Racing Activision 0.00 0.00 0.00
## 16715 2016 Action Tecmo Koei 0.00 0.00 0.01
## 16716 2006 Sports Codemasters 0.00 0.01 0.00
## 16717 2016 Adventure Idea Factory 0.00 0.00 0.01
## 16718 2003 Platform Wanadoo 0.01 0.00 0.00
## 16719 2016 Simulation Tecmo Koei 0.00 0.00 0.01
## Other_Sales Global_Sales Critic_Score Critic_Count User_Score
## 16714 0 0.01 NA NA NA
## 16715 0 0.01 NA NA NA
## 16716 0 0.01 NA NA NA
## 16717 0 0.01 NA NA NA
## 16718 0 0.01 NA NA NA
## 16719 0 0.01 NA NA NA
## User_Count Developer Rating
## 16714 NA
## 16715 NA
## 16716 NA
## 16717 NA
## 16718 NA
## 16719 NA
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(video_full)
## Observations: 16,719
## Variables: 16
## $ Name <fctr> Wii Sports, Super Mario Bros., Mario Kart Wii...
## $ Platform <fctr> Wii, NES, Wii, Wii, GB, GB, DS, Wii, Wii, NES...
## $ Year_of_Release <fctr> 2006, 1985, 2008, 2009, 1996, 1989, 2006, 200...
## $ Genre <fctr> Sports, Platform, Racing, Sports, Role-Playin...
## $ Publisher <fctr> Nintendo, Nintendo, Nintendo, Nintendo, Ninte...
## $ NA_Sales <dbl> 41.36, 29.08, 15.68, 15.61, 11.27, 23.20, 11.2...
## $ EU_Sales <dbl> 28.96, 3.58, 12.76, 10.93, 8.89, 2.26, 9.14, 9...
## $ JP_Sales <dbl> 3.77, 6.81, 3.79, 3.28, 10.22, 4.22, 6.50, 2.9...
## $ Other_Sales <dbl> 8.45, 0.77, 3.29, 2.95, 1.00, 0.58, 2.88, 2.84...
## $ Global_Sales <dbl> 82.53, 40.24, 35.52, 32.77, 31.37, 30.26, 29.8...
## $ Critic_Score <int> 76, NA, 82, 80, NA, NA, 89, 58, 87, NA, NA, 91...
## $ Critic_Count <int> 51, NA, 73, 73, NA, NA, 65, 41, 80, NA, NA, 64...
## $ User_Score <dbl> 8.0, NA, 8.3, 8.0, NA, NA, 8.5, 6.6, 8.4, NA, ...
## $ User_Count <int> 322, NA, 709, 192, NA, NA, 431, 129, 594, NA, ...
## $ Developer <fctr> Nintendo, , Nintendo, Nintendo, , , Nintendo,...
## $ Rating <fctr> E, , E, E, , , E, E, E, , , E, , E, E, E, M, ...
##data cleaning
video_no_year<-video_full[video_full$Year_of_Release !="N/A",]
video2<-video_no_year[video_no_year$Genre!="",]
video3<-video2[video2$Year_of_Release!=2020,]
video<-video3[video3$Year_of_Release!=2017,]
dim(video)
## [1] 16444 16
Our raw dataset is a combination of video games sales data and a score dataset of another web scrape from Metacritic (http://www.metacritic.com/). Missing observations exist as Metacritic only covers a subset of the platforms. There are 6900 observations contain all critic score and user score information.
Accumulative time length influence is also hard to observe from our dataset. Though people’s taste of games shifts frequently, there are some games who last a long life like Super Mario. Thus a high sales volume is possibly due to a long-lasting life, however, this theory cannot be concluded from our regression. We chose to omit time factor due to the limitation of information.
Since we find that some of the data points are “NA” and 1 “2020” in the Year_of _Release, and two of the Genre are blank, we decide to delete them since they are only small portion of the dataset. The original dataset has 16719 observations, and we delete 275 data points in the dataset. At last, there are 16444 observations in the dataset. We can think that they will not undermine our analysis.
Descriptive Analysis & Exploratory Data Analysis We conducted a thorough exploratory data analysis by using descriptions functions in packages like Hmisc, dplyr, pastecs, and some built-in functions.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
describe(video)
## video
##
## 16 Variables 16444 Observations
## ---------------------------------------------------------------------------
## Name
## n missing distinct
## 16444 0 11426
##
## lowest : '98 Koshien Beyblade Burst Fire Emblem Fates Frozen: Olaf's Quest Haikyu!! Cross Team Match!
## highest: Zumba Fitness Core Zumba Fitness Rush Zumba Fitness: World Party Zwei!! Zyuden Sentai Kyoryuger: Game de Gaburincho!!
## ---------------------------------------------------------------------------
## Platform
## n missing distinct
## 16444 0 31
##
## lowest : 2600 3DO 3DS DC DS , highest: WiiU WS X360 XB XOne
## ---------------------------------------------------------------------------
## Year_of_Release
## n missing distinct
## 16444 0 37
##
## lowest : 1980 1981 1982 1983 1984, highest: 2012 2013 2014 2015 2016
## ---------------------------------------------------------------------------
## Genre
## n missing distinct
## 16444 0 12
##
## Action (3307, 0.201), Adventure (1293, 0.079), Fighting (837, 0.051), Misc
## (1721, 0.105), Platform (878, 0.053), Puzzle (569, 0.035), Racing (1226,
## 0.075), Role-Playing (1481, 0.090), Shooter (1296, 0.079), Simulation
## (857, 0.052), Sports (2306, 0.140), Strategy (673, 0.041)
## ---------------------------------------------------------------------------
## Publisher
## n missing distinct
## 16444 0 580
##
## lowest : 10TACLE Studios 1C Company 20th Century Fox Video Games 2D Boy 3DO
## highest: Yumedia Zenrin Zoo Digital Publishing Zoo Games Zushi Games
## ---------------------------------------------------------------------------
## NA_Sales
## n missing distinct Info Mean Gmd .05 .10
## 16444 0 401 0.98 0.264 0.401 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.08 0.24 0.61 1.06
##
## lowest : 0.00 0.01 0.02 0.03 0.04, highest: 15.68 23.20 26.93 29.08 41.36
## ---------------------------------------------------------------------------
## EU_Sales
## n missing distinct Info Mean Gmd .05 .10
## 16444 0 307 0.955 0.1459 0.2384 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.02 0.11 0.35 0.63
##
## lowest : 0.00 0.01 0.02 0.03 0.04, highest: 9.20 10.93 10.95 12.76 28.96
## ---------------------------------------------------------------------------
## JP_Sales
## n missing distinct Info Mean Gmd .05 .10
## 16444 0 244 0.754 0.07849 0.1392 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.00 0.04 0.18 0.36
##
## lowest : 0.00 0.01 0.02 0.03 0.04, highest: 6.04 6.50 6.81 7.20 10.22
## ---------------------------------------------------------------------------
## Other_Sales
## n missing distinct Info Mean Gmd .05 .10
## 16444 0 155 0.928 0.04759 0.07719 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.01 0.03 0.10 0.20
##
## lowest : 0.00 0.01 0.02 0.03 0.04, highest: 3.29 3.96 7.53 8.45 10.57
## ---------------------------------------------------------------------------
## Global_Sales
## n missing distinct Info Mean Gmd .05 .10
## 16444 0 628 0.999 0.5363 0.7682 0.02 0.02
## .25 .50 .75 .90 .95
## 0.06 0.17 0.47 1.21 2.04
##
## lowest : 0.01 0.02 0.03 0.04 0.05, highest: 31.37 32.77 35.52 40.24 82.53
## ---------------------------------------------------------------------------
## Critic_Score
## n missing distinct Info Mean Gmd .05 .10
## 7983 8461 81 0.999 68.99 15.59 43 50
## .25 .50 .75 .90 .95
## 60 71 79 85 89
##
## lowest : 13 17 19 20 21, highest: 94 95 96 97 98
## ---------------------------------------------------------------------------
## Critic_Count
## n missing distinct Info Mean Gmd .05 .10
## 7983 8461 106 1 26.44 20.49 5 7
## .25 .50 .75 .90 .95
## 12 22 36 54 65
##
## lowest : 3 4 5 6 7, highest: 104 105 106 107 113
## ---------------------------------------------------------------------------
## User_Score
## n missing distinct Info Mean Gmd .05 .10
## 7463 8981 95 0.999 7.126 1.602 4.1 5.1
## .25 .50 .75 .90 .95
## 6.4 7.5 8.2 8.7 8.9
##
## lowest : 0.0 0.2 0.3 0.5 0.6, highest: 9.3 9.4 9.5 9.6 9.7
## ---------------------------------------------------------------------------
## User_Count
## n missing distinct Info Mean Gmd .05 .10
## 7463 8981 883 1 163 266.8 5.0 6.0
## .25 .50 .75 .90 .95
## 10.0 24.0 81.0 317.8 728.6
##
## lowest : 4 5 6 7 8, highest: 9073 9629 9851 10179 10665
## ---------------------------------------------------------------------------
## Developer
## n missing distinct
## 16444 0 1681
##
## lowest : 10tacle Studios 10tacle Studios, Fusionsphere Systems 1C, 1C Company 1C, Ino-Co, 1C Company
## highest: Zoonami Ltd. ZootFly zSlide Zushi Games Ltd Zuxxez
## ---------------------------------------------------------------------------
## Rating
## n missing distinct
## 16444 0 9
##
## Value AO E E10+ EC K-A M RP T
## Frequency 6676 1 3921 1393 8 3 1536 1 2905
## Proportion 0.406 0.000 0.238 0.085 0.000 0.000 0.093 0.000 0.177
## ---------------------------------------------------------------------------
library(pastecs)
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:survival':
##
## aml
## The following object is masked from 'package:lattice':
##
## melanoma
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
options(scipen = 100)
options(digits = 2)
stat.desc(video[,c("NA_Sales", "EU_Sales","JP_Sales","Other_Sales","Global_Sales","Critic_Score","Critic_Count","User_Score","User_Count")])
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## nbr.val 16444.0000 16444.0000 16444.0000 16444.0000 16444.000
## nbr.null 4449.0000 5766.0000 10300.0000 6492.0000 0.000
## nbr.na 0.0000 0.0000 0.0000 0.0000 0.000
## min 0.0000 0.0000 0.0000 0.0000 0.010
## max 41.3600 28.9600 10.2200 10.5700 82.530
## range 41.3600 28.9600 10.2200 10.5700 82.520
## sum 4341.4200 2399.6800 1290.6400 782.6300 8818.570
## median 0.0800 0.0200 0.0000 0.0100 0.170
## mean 0.2640 0.1459 0.0785 0.0476 0.536
## SE.mean 0.0064 0.0040 0.0024 0.0015 0.012
## CI.mean.0.95 0.0125 0.0077 0.0048 0.0029 0.024
## var 0.6697 0.2568 0.0968 0.0353 2.429
## std.dev 0.8184 0.5067 0.3111 0.1880 1.559
## coef.var 3.0998 3.4723 3.9637 3.9502 2.906
## Critic_Score Critic_Count User_Score User_Count
## nbr.val 7983.00 7983.00 7463.000 7463.0
## nbr.null 0.00 0.00 1.000 0.0
## nbr.na 8461.00 8461.00 8981.000 8981.0
## min 13.00 3.00 0.000 4.0
## max 98.00 113.00 9.700 10665.0
## range 85.00 110.00 9.700 10661.0
## sum 550782.00 211081.00 53183.800 1216582.0
## median 71.00 22.00 7.500 24.0
## mean 68.99 26.44 7.126 163.0
## SE.mean 0.16 0.21 0.017 6.5
## CI.mean.0.95 0.31 0.42 0.034 12.8
## var 193.77 361.31 2.248 317941.9
## std.dev 13.92 19.01 1.499 563.9
## coef.var 0.20 0.72 0.210 3.5
Before developing our model, let’s take a look at the correlation coefficients of numeric variables:
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.3
num.cols <- sapply(na.omit(video), is.numeric)
cor.data <- cor(na.omit(video)[,num.cols])
corrPLOT<-corrplot(cor.data,method='square')
Everything is positive correlated. And the global sales mostly rely on NA sales and EU sales. NA sales, EU sales, and other sales move together at some level, while Japan market is an independent world. The user score is also slightly correlated with the critic score.
## EDA
## the trend of the Genre
library(dplyr)
q1<-count(video, Year_of_Release,Genre)
q1
## Source: local data frame [387 x 3]
## Groups: Year_of_Release [?]
##
## Year_of_Release Genre n
## <fctr> <fctr> <int>
## 1 1980 Action 1
## 2 1980 Fighting 1
## 3 1980 Misc 4
## 4 1980 Shooter 2
## 5 1980 Sports 1
## 6 1981 Action 25
## 7 1981 Platform 3
## 8 1981 Puzzle 2
## 9 1981 Racing 1
## 10 1981 Shooter 10
## # ... with 377 more rows
library("ggplot2")
#plot the form
ggplot(q1, aes(x = factor(Year_of_Release), y = n, colour=Genre, group=Genre)) + geom_line()+ggtitle("Trend of Genre by Year")+theme(axis.text.x = element_text(angle = 90,hjust = 1))
From the plot, we can see that there is a shift of people’s taste. Basicly, before 2002, the most popular published game genre is sports, after 2003, it turned to be action.The biggest influenced genres are action, misc, and sports. We also can tell that there is a sharp drop of game market from 2010-2013. We think that it’s because since 2010, the smart phone(like iphone) is more and more popular, it became a new platform of the game(such as pokeman-go, super mario,fruit ninga). Since this dataset dosen’t include this new platform, it’s probably the reason why there is a sharp drop.
# EDA the trend of the platform
head(video)
## Name Platform Year_of_Release Genre Publisher
## 1 Wii Sports Wii 2006 Sports Nintendo
## 2 Super Mario Bros. NES 1985 Platform Nintendo
## 3 Mario Kart Wii Wii 2008 Racing Nintendo
## 4 Wii Sports Resort Wii 2009 Sports Nintendo
## 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo
## 6 Tetris GB 1989 Puzzle Nintendo
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## 1 41 29.0 3.8 8.45 83 76
## 2 29 3.6 6.8 0.77 40 NA
## 3 16 12.8 3.8 3.29 36 82
## 4 16 10.9 3.3 2.95 33 80
## 5 11 8.9 10.2 1.00 31 NA
## 6 23 2.3 4.2 0.58 30 NA
## Critic_Count User_Score User_Count Developer Rating
## 1 51 8.0 322 Nintendo E
## 2 NA NA NA
## 3 73 8.3 709 Nintendo E
## 4 73 8.0 192 Nintendo E
## 5 NA NA NA
## 6 NA NA NA
q2<-count(video, Year_of_Release,Platform)
q2
## Source: local data frame [238 x 3]
## Groups: Year_of_Release [?]
##
## Year_of_Release Platform n
## <fctr> <fctr> <int>
## 1 1980 2600 9
## 2 1981 2600 46
## 3 1982 2600 36
## 4 1983 2600 11
## 5 1983 NES 6
## 6 1984 2600 1
## 7 1984 NES 13
## 8 1985 2600 1
## 9 1985 DS 1
## 10 1985 NES 11
## # ... with 228 more rows
str(q2)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 238 obs. of 3 variables:
## $ Year_of_Release: Factor w/ 40 levels "1980","1981",..: 1 2 3 4 4 5 5 6 6 6 ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 1 1 1 1 12 1 12 1 5 12 ...
## $ n : int 9 46 36 11 6 1 13 1 1 11 ...
## - attr(*, "vars")=List of 1
## ..$ : symbol Year_of_Release
## - attr(*, "drop")= logi TRUE
# trend of all the platform by year
ggplot(q2, aes(x = factor(Year_of_Release), y = n, colour=Platform, group=Platform)) + geom_line()+ggtitle("Trend of Platform by Year")+theme(axis.text.x = element_text(angle = 90,hjust = 1))
# trend of the PS series platform by year
q3_ps<-q2[q2$Platform=="PS" | q2$Platform=="PS2" | q2$Platform=="PS3"| q2$Platform=="PS4",]
ggplot(q3_ps, aes(x = factor(Year_of_Release), y = n, colour=Platform, group=Platform)) + geom_line()+ggtitle("Trend of 'PS' Platform by Year")+theme(axis.text.x = element_text(angle = 90,hjust = 1))
From the plot we can see that the platform cycle is around 10 years, which the popularity increased to the peak and then decreased. As the new generation of the same platform hit the market, the older platform becomes less popular. Nowadays, PS4 is the most popular platform among the ‘PS’ series. We think it might reach the peak in a few years.
## xb,x360,xone
q4_xb<-q2[q2$Platform=="XB" | q2$Platform=="X360" | q2$Platform=="XOne",]
ggplot(q4_xb, aes(x = factor(Year_of_Release), y = n, colour=Platform, group=Platform)) + geom_line()+ggtitle("Trend of 'Xbox' Platform by Year")+theme(axis.text.x = element_text(angle = 90,hjust = 1))
q5_mix<-q2[q2$Platform=="XB" | q2$Platform=="X360"| q2$Platform=="XOne" | q2$Platform=="PS"|q2$Platform=="PS2" | q2$Platform=="PS3" | q2$Platform=="PS4" ,]
ggplot(q5_mix, aes(x = factor(Year_of_Release), y = n, colour=Platform, group=Platform)) + geom_line()+ggtitle(" The competition between SONY and Microsoft")+theme(axis.text.x = element_text(angle = 90,hjust = 1))
The ‘PS’ series are created by SONY. The ‘XB’,‘XB360’and ’XOne’ are created by Microsoft. From the plot we can see that two platform designer are created new platform synchronously. The two companies are always take actions at the same time. SONY always performs better than Microsoft.
# EDA 2
# the pie chart
library(doBy)
attach(video)
sum_sale<-summaryBy(NA_Sales+EU_Sales+JP_Sales+Other_Sales~Genre,data=video,FUN=sum)
sum_sale
## Genre NA_Sales.sum EU_Sales.sum JP_Sales.sum Other_Sales.sum
## 1 Action 863 511 160 182
## 2 Adventure 102 63 52 16
## 3 Fighting 221 99 87 36
## 4 Misc 400 211 107 73
## 5 Platform 444 199 131 51
## 6 Puzzle 121 50 57 12
## 7 Racing 357 234 57 76
## 8 Role-Playing 330 188 353 59
## 9 Shooter 585 315 39 103
## 10 Simulation 180 113 64 31
## 11 Sports 671 371 135 132
## 12 Strategy 68 45 49 11
# the pie chart of the NA_Sales by genre
label_NA<-paste('(', round(sum_sale$NA_Sales.sum/sum(sum_sale$NA_Sales.sum) * 100, 1), '%)', sep = '')
label_NA
## [1] "(19.9%)" "(2.3%)" "(5.1%)" "(9.2%)" "(10.2%)" "(2.8%)" "(8.2%)"
## [8] "(7.6%)" "(13.5%)" "(4.2%)" "(15.5%)" "(1.6%)"
labelNA <- paste(sum_sale$Genre, label_NA, sep = '')
labelNA
## [1] "Action(19.9%)" "Adventure(2.3%)" "Fighting(5.1%)"
## [4] "Misc(9.2%)" "Platform(10.2%)" "Puzzle(2.8%)"
## [7] "Racing(8.2%)" "Role-Playing(7.6%)" "Shooter(13.5%)"
## [10] "Simulation(4.2%)" "Sports(15.5%)" "Strategy(1.6%)"
ggplot(data = sum_sale, mapping = aes(x = 'Content', y = NA_Sales.sum, fill = Genre )) + geom_bar(stat = 'identity', position = 'stack', width = 1)+ coord_polar(theta ="y") + ggtitle("Pie chart for NA_Sales")+ theme(axis.text = element_blank())+ scale_fill_discrete(labels =labelNA)
#geom_text(aes(y = sum_sale$NA_Sales.sum/8 + c(0, cumsum(sum_sale$NA_Sales.sum)[-length(sum_sale$NA_Sales.sum)]), x = sum(sum_sale$NA_Sales.sum)/3000, label = labelNA))
North America : Action 19.9% sports 15.5% shooter 13.5%
# the pie chart of the EU_Sales by genre
label_EU<-paste('(', round(sum_sale$EU_Sales.sum/sum(sum_sale$EU_Sales.sum) * 100, 1), '%)', sep = '')
label_EU
## [1] "(21.3%)" "(2.6%)" "(4.1%)" "(8.8%)" "(8.3%)" "(2.1%)" "(9.8%)"
## [8] "(7.8%)" "(13.1%)" "(4.7%)" "(15.5%)" "(1.9%)"
labelEU <- paste(sum_sale$Genre, label_EU, sep = '')
labelEU
## [1] "Action(21.3%)" "Adventure(2.6%)" "Fighting(4.1%)"
## [4] "Misc(8.8%)" "Platform(8.3%)" "Puzzle(2.1%)"
## [7] "Racing(9.8%)" "Role-Playing(7.8%)" "Shooter(13.1%)"
## [10] "Simulation(4.7%)" "Sports(15.5%)" "Strategy(1.9%)"
ggplot(data = sum_sale, mapping = aes(x = 'Content', y = EU_Sales.sum, fill = Genre )) + geom_bar(stat = 'identity', position = 'stack', width = 1)+ coord_polar(theta ="y") + ggtitle("Pie chart for EU_Sales")+ theme(axis.text = element_blank())+ scale_fill_discrete(labels =labelEU)
Europe: Action 21.3% Sports 15.5% shooter 13.1%
# the pie chart of the JP_Sales by genre
label_JP<-paste('(', round(sum_sale$JP_Sales.sum/sum(sum_sale$JP_Sales.sum) * 100, 1), '%)', sep = '')
label_JP
## [1] "(12.4%)" "(4%)" "(6.8%)" "(8.3%)" "(10.1%)" "(4.4%)" "(4.4%)"
## [8] "(27.4%)" "(3%)" "(4.9%)" "(10.5%)" "(3.8%)"
labelJP <- paste(sum_sale$Genre, label_JP, sep = '')
labelJP
## [1] "Action(12.4%)" "Adventure(4%)" "Fighting(6.8%)"
## [4] "Misc(8.3%)" "Platform(10.1%)" "Puzzle(4.4%)"
## [7] "Racing(4.4%)" "Role-Playing(27.4%)" "Shooter(3%)"
## [10] "Simulation(4.9%)" "Sports(10.5%)" "Strategy(3.8%)"
ggplot(data = sum_sale, mapping = aes(x = 'Content', y = JP_Sales.sum, fill = Genre )) + geom_bar(stat = 'identity', position = 'stack', width = 1)+ coord_polar(theta ="y") + ggtitle("Pie chart for JP_Sales")+ theme(axis.text = element_blank())+ scale_fill_discrete(labels =labelJP)
Japan: Role-playing 27.4% Action 12.4% Sports 10.5%
# the pie chart of the OTHER_Sales by genre
label_OTHER<-paste('(', round(sum_sale$Other_Sales.sum/sum(sum_sale$Other_Sales.sum) * 100, 1), '%)', sep = '')
label_OTHER
## [1] "(23.3%)" "(2.1%)" "(4.6%)" "(9.3%)" "(6.5%)" "(1.6%)" "(9.6%)"
## [8] "(7.6%)" "(13.2%)" "(3.9%)" "(16.9%)" "(1.4%)"
labelOTHER <- paste(sum_sale$Genre, label_OTHER, sep = '')
labelOTHER
## [1] "Action(23.3%)" "Adventure(2.1%)" "Fighting(4.6%)"
## [4] "Misc(9.3%)" "Platform(6.5%)" "Puzzle(1.6%)"
## [7] "Racing(9.6%)" "Role-Playing(7.6%)" "Shooter(13.2%)"
## [10] "Simulation(3.9%)" "Sports(16.9%)" "Strategy(1.4%)"
ggplot(data = sum_sale, mapping = aes(x = 'Content', y = Other_Sales.sum, fill = Genre )) + geom_bar(stat = 'identity', position = 'stack', width = 1)+ coord_polar(theta ="y") + ggtitle("Pie chart for Other_Sales")+ theme(axis.text = element_blank())+ scale_fill_discrete(labels =labelOTHER)
other: action 23.3% sports 16.9% shooter 13.2%
All in all, action and sports are popular among the world. But, in Japan role-playing games are much more popular than action and sports games.
##EDA 3
# count score for each year
scorecount <- video[, c(3,12,14)]
#View(scorecount)
scorecount$exist <- c(rep(1,nrow(scorecount)))
for(i in 1:16444)
{
if(as.numeric(is.na(scorecount$Critic_Count[i]))==0 && as.numeric(is.na(scorecount$User_Count[i]))==0){scorecount$count[i]=1}
else{scorecount$count[i]=0}
}
head(scorecount)
## Year_of_Release Critic_Count User_Count exist count
## 1 2006 51 322 1 1
## 2 1985 NA NA 1 0
## 3 2008 73 709 1 1
## 4 2009 73 192 1 1
## 5 1996 NA NA 1 0
## 6 1989 NA NA 1 0
#sum count
library(doBy)
attach(scorecount)
## The following objects are masked from video:
##
## Critic_Count, User_Count, Year_of_Release
sum_count<-summaryBy(exist+count~Year_of_Release,data=scorecount,FUN=sum)
sum_count
## Year_of_Release exist.sum count.sum
## 1 1980 9 0
## 2 1981 46 0
## 3 1982 36 0
## 4 1983 17 0
## 5 1984 14 0
## 6 1985 14 1
## 7 1986 21 0
## 8 1987 16 0
## 9 1988 15 1
## 10 1989 17 0
## 11 1990 16 0
## 12 1991 41 0
## 13 1992 43 1
## 14 1993 60 0
## 15 1994 121 1
## 16 1995 219 0
## 17 1996 263 8
## 18 1997 289 14
## 19 1998 379 26
## 20 1999 338 30
## 21 2000 350 102
## 22 2001 482 256
## 23 2002 829 455
## 24 2003 775 499
## 25 2004 762 477
## 26 2005 939 562
## 27 2006 1006 528
## 28 2007 1197 590
## 29 2008 1427 595
## 30 2009 1426 554
## 31 2010 1255 431
## 32 2011 1136 466
## 33 2012 653 321
## 34 2013 544 272
## 35 2014 581 256
## 36 2015 606 221
## 37 2016 502 227
sum_count$useless<-sum_count$exist.sum-sum_count$count.sum
sum_count2<-sum_count[,c(1,3,4)]
rownames(sum_count2)<-seq(1980,2016,1)
barplot(t(sum_count2[,c(2,3)]), main="number of games by year and score",
xlab="Year-of-release", col=c("darkblue","red"))
Each bar is the number of games released by year. The blue part is the number of games which contains both user_score and critic_score. Otherwise, it’s repesented in the red part. For example, the game has no score or only has one kind of the score.
The total bar height presents the total records in one year, and the blue one is data that contains score information. More complete information is centered between 2000 and 2016, as the score provider, Metacritic, is a company with brief history.
# the subset used to do the regression, which has both score of user and critic
video$count<-scorecount$count
video_reg<-subset(video,scorecount$count==1)
video_nintendo<-subset(video_reg,Publisher=="Nintendo")
Our smart question is: Whichfactors produce a change in video game sales?
Specific: we examined each possible influential variables Measurable: tests could be used to measure the fitness Achievable: build a connection between our target and data Reasonable: make a reasonable analysis and conclude a corresponding result Timeliness: occur at a favorable time
Conclusion The mystery of best-selling games. . The majority of video games were not selling well, and their data fits the linear regression model very well. . The data of best-selling games(Super Mario, Call of Duty, etc.) doesn’t fit the model well. When sales is big enough, sales variable itself is also an influential factor to regression model.
Finally, we used square rooted sales as a transform of response Y thus eliminate the error due to butterfly effect. The higher R-square presents a better fitting model.
#regression
dim(video)
## [1] 16444 17
video_nintendo<-subset(video_reg,Publisher=="Nintendo")
# use video_nintendo to do the regression, this dataset only includes nintendo publisher. we only want to do the regression of the nintendo.
# make the four region to be in one row
# north america
video_nintendo_NA<-video_nintendo[,-c(7,8,9,10)]
colnames(video_nintendo_NA)[6]<-"sales"
video_nintendo_NA$region<-rep("NORTH AMERICA",nrow(video_nintendo_NA))
set.seed(1000)
video_nintendo_NA_sample <- sample(2, nrow(video_nintendo_NA), replace=TRUE, prob=c(0.67, 0.33))
video_nintendo_NA_training <- video_nintendo_NA[video_nintendo_NA_sample==1,]
video_nintendo_NA_test <- video_nintendo_NA[video_nintendo_NA_sample==2,]
# EUROPEAN
video_nintendo_EU<-video_nintendo[,-c(6,8,9,10)]
colnames(video_nintendo_EU)[6]<-"sales"
video_nintendo_EU$region<-rep("EUROPEAN",nrow(video_nintendo_EU))
video_nintendo_EU_sample <- sample(2, nrow(video_nintendo_EU), replace=TRUE, prob=c(0.67, 0.33))
video_nintendo_EU_training <- video_nintendo_EU[video_nintendo_EU_sample==1,]
video_nintendo_EU_test <- video_nintendo_EU[video_nintendo_EU_sample==2,]
# Japan
video_nintendo_JP<-video_nintendo[,-c(6,7,9,10)]
colnames(video_nintendo_JP)[6]<-"sales"
video_nintendo_JP$region<-rep("JAPAN",nrow(video_nintendo_JP))
video_nintendo_JP_sample <- sample(2, nrow(video_nintendo_JP), replace=TRUE, prob=c(0.67, 0.33))
video_nintendo_JP_training <- video_nintendo_JP[video_nintendo_JP_sample==1,]
video_nintendo_JP_test <- video_nintendo_JP[video_nintendo_JP_sample==2,]
# OTHER
video_nintendo_OTHER<-video_nintendo[,-c(6,8,7,10)]
colnames(video_nintendo_OTHER)[6]<-"sales"
video_nintendo_OTHER$region<-rep("OTHER",nrow(video_nintendo_OTHER))
video_nintendo_OTHER_sample <- sample(2, nrow(video_nintendo_OTHER), replace=TRUE, prob=c(0.67, 0.33))
video_nintendo_OTHER_training <- video_nintendo_OTHER[video_nintendo_OTHER_sample==1,]
video_nintendo_OTHER_test <- video_nintendo_OTHER[video_nintendo_OTHER_sample==2,]
# rbind the dataset
new_video_nintendo_training<-rbind(video_nintendo_NA_training,video_nintendo_EU_training,video_nintendo_JP_training,video_nintendo_OTHER_training)
new_video_nintendo_test<-rbind(video_nintendo_NA_test,video_nintendo_EU_test,video_nintendo_JP_test,video_nintendo_OTHER_test)
#Linear Rregression
lm1 <- lm(sales~Platform+Genre+Critic_Score+User_Score+region-1,data=new_video_nintendo_training)
summary(lm1)
##
## Call:
## lm(formula = sales ~ Platform + Genre + Critic_Score + User_Score +
## region - 1, data = new_video_nintendo_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.40 -0.72 -0.10 0.37 37.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Platform3DS -1.92831 0.71877 -2.68 0.00746 **
## PlatformDS -1.96817 0.75518 -2.61 0.00933 **
## PlatformGBA -2.53429 0.76808 -3.30 0.00101 **
## PlatformGC -2.36726 0.78061 -3.03 0.00251 **
## PlatformWii -0.97327 0.73757 -1.32 0.18738
## PlatformWiiU -2.40883 0.74482 -3.23 0.00127 **
## GenreAdventure 0.25957 0.41860 0.62 0.53539
## GenreFighting -0.04027 0.56135 -0.07 0.94284
## GenreMisc 0.67150 0.30835 2.18 0.02974 *
## GenrePlatform 0.72098 0.29836 2.42 0.01591 *
## GenrePuzzle 0.13531 0.35786 0.38 0.70544
## GenreRacing 0.79846 0.39680 2.01 0.04455 *
## GenreRole-Playing 0.13185 0.29139 0.45 0.65104
## GenreShooter -0.04048 0.46055 -0.09 0.92998
## GenreSimulation -0.15292 0.43766 -0.35 0.72688
## GenreSports 2.24638 0.34733 6.47 0.00000000018 ***
## GenreStrategy -0.18933 0.43624 -0.43 0.66441
## Critic_Score 0.03526 0.00978 3.60 0.00033 ***
## User_Score -0.03913 0.10588 -0.37 0.71181
## regionJAPAN -0.23054 0.21806 -1.06 0.29075
## regionNORTH AMERICA 0.49694 0.22042 2.25 0.02445 *
## regionOTHER -0.63407 0.21829 -2.90 0.00378 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.1 on 761 degrees of freedom
## Multiple R-squared: 0.252, Adjusted R-squared: 0.23
## F-statistic: 11.6 on 22 and 761 DF, p-value: <0.0000000000000002
plot(lm1)
new_video_nintendo_training$sales_sqrt <- sqrt(new_video_nintendo_training$sales)
new_video_nintendo_test$sales_sqrt <- sqrt(new_video_nintendo_test$sales)
lm2 <- lm(sales_sqrt~Platform+Genre+Critic_Score+User_Score+region-1,data=new_video_nintendo_training)
summary(lm2)
##
## Call:
## lm(formula = sales_sqrt ~ Platform + Genre + Critic_Score + User_Score +
## region - 1, data = new_video_nintendo_training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.172 -0.307 -0.045 0.199 4.881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Platform3DS -0.66758 0.17715 -3.77 0.00018 ***
## PlatformDS -0.71678 0.18612 -3.85 0.00013 ***
## PlatformGBA -0.96675 0.18930 -5.11 0.0000004141988 ***
## PlatformGC -0.85206 0.19239 -4.43 0.0000108613669 ***
## PlatformWii -0.41417 0.18178 -2.28 0.02298 *
## PlatformWiiU -0.86803 0.18357 -4.73 0.0000026934553 ***
## GenreAdventure 0.04503 0.10317 0.44 0.66260
## GenreFighting -0.00231 0.13835 -0.02 0.98669
## GenreMisc 0.33464 0.07600 4.40 0.0000121848230 ***
## GenrePlatform 0.31292 0.07353 4.26 0.0000234690963 ***
## GenrePuzzle 0.01851 0.08820 0.21 0.83385
## GenreRacing 0.31542 0.09780 3.23 0.00131 **
## GenreRole-Playing 0.04920 0.07182 0.69 0.49352
## GenreShooter -0.01396 0.11351 -0.12 0.90212
## GenreSimulation -0.07560 0.10787 -0.70 0.48358
## GenreSports 0.50546 0.08560 5.90 0.0000000053227 ***
## GenreStrategy -0.15276 0.10752 -1.42 0.15579
## Critic_Score 0.01707 0.00241 7.08 0.0000000000033 ***
## User_Score -0.01245 0.02609 -0.48 0.63346
## regionJAPAN -0.01975 0.05374 -0.37 0.71335
## regionNORTH AMERICA 0.26121 0.05432 4.81 0.0000018344902 ***
## regionOTHER -0.29909 0.05380 -5.56 0.0000000374581 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.53 on 761 degrees of freedom
## Multiple R-squared: 0.646, Adjusted R-squared: 0.636
## F-statistic: 63.1 on 22 and 761 DF, p-value: <0.0000000000000002
plot(lm2)
We devide the dataset into two group, training and test. In the training dataset, Then we fit the linear regression (lm1) between the dependent variable “sales” and 5 independent variables. Adjusted R-squared of this model is 0.23 which is pretty low. The Q-Q plot of this linear regression has a heavier tail so that we could reject the normality assumotion.
Then we do the transformation for the dependent variable “sales” and fit another linear regression (lm2). Adjusted R-squared of this model is 0.636. The Q-Q plot of this model still has a heavier tail,but it looks much better than the previous linear model. In addition, the p-value of this model is pretty low, so we can use this model.
pred<- predict(lm2,new_video_nintendo_test)
Eval <- data.frame(Game= new_video_nintendo_test$Name, Actual = new_video_nintendo_test$sales_sqrt)
pred <- round(pred,2)
Eval <- Eval[1:length(pred),]
Eval$Predicted <- abs(pred)
Eval$diff <- abs(Eval$Predicted - Eval$Actual)
head(Eval)
## Game Actual Predicted diff
## 1 Mario Kart Wii 4.0 1.46 2.50
## 2 New Super Mario Bros. 3.4 1.27 2.09
## 3 Mario Kart DS 3.1 1.31 1.81
## 4 Super Smash Bros. Brawl 2.6 1.32 1.25
## 5 Animal Crossing: Wild World 1.6 0.83 0.75
## 6 Super Mario Galaxy 2.5 1.70 0.76
RMSE <- sqrt(mean(Eval$diff^2))
RMSE
## [1] 0.49
After that, we use the second model (lm2) to do the prediction for the test dataset. Then we create a table which concludes the actual sales and predicted value of sales. Finally, we calculate the RMSE of this model. The value of RMSE is 0.49.
The mystery of best-selling games.
-The majority of video games were not selling well, and their data fits the linear regression model very well.
-The data of best-selling games(Super Mario, Call of Duty, etc.) doesn’t fit the model well.
When sales is big enough, sales variable itself is also an influential factor to regression model.
Finally, we used square rooted sales as a transform of response Y thus eliminate the error due to butterfly effect. The higher R-square presents a better fitting model.