
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)

Motivation

This year we had a very interesting case with the game No Man's Sky. The game was released on PS4 and PC in August and is famous for the very large hype it built up over the preceding year (see Kotaku: http://kotaku.com/the-no-mans-sky-hype-dilemma-1785416931 and Eurogamer: http://www.eurogamer.net/articles/2016-12-20-no-mans-sky-changed-the-video-game-hype-train-forever).

The controversy came from a game that did not deliver all the features it had promised (although I found the game quite good and exactly what I was expecting). It suffered from very bad user reviews on Steam, whereas its professional reviews were fairly good.

So it would be interesting to look at the Metacritic score (professional reviews) and the User score (public reviews), also with regard to the sales the game generated, because most people had pre-ordered it.

Data

df<-read.csv('Video_Games_Sales_as_at_30_Nov_2016.csv',sep=',')
df$year = as.numeric(as.character(df$Year_of_Release))
df$User_Score_num = as.numeric(as.character(df$User_Score)) *10

# create a new column that groups each Platform by manufacturer
sony<-c('PS','PS2','PS3','PS4' ,'PSP','PSV')
microsoft<-c('PC','X360','XB','XOne')
nintendo<-c('3DS','DS','GBA','GC','N64','Wii','WiiU')
sega<-c('DC')

# map a single platform code to its manufacturer group
newPlatform<-function(x){
  if (x %in% sony) {return('SONY')}
  else if (x %in% microsoft) {return('MICROSOFT')}
  else if (x %in% nintendo) {return('NINTENDO')}
  else if (x %in% sega) {return('SEGA')}
  else {return('OTHER')}
}
df$newPlatform<-sapply(df$Platform, newPlatform)

df2 <- na.omit(df)
# a few rows still have an empty string as Rating; drop them
df2<-filter(df2,Rating!='')
filter(df2,Name=="No Man's Sky")
##           Name Platform Year_of_Release  Genre   Publisher NA_Sales
## 1 No Man's Sky      PS4            2016 Action Hello Games     0.62
##   EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count
## 1     0.75     0.03        0.27         1.67           71           93
##   User_Score User_Count   Developer Rating year User_Score_num newPlatform
## 1        4.5       5046 Hello Games      T 2016             45        SONY

We can select comparable games (same Platform and Genre) to check the correlation between the Metacritic and User scores:

df2<-filter(df2,Genre=='Action' & Platform=='PS4')
summary(df2)
##                                  Name       Platform  Year_of_Release
##  7 Days to Die                     : 1   PS4    :57   2015   :22     
##  Aegis of Earth: Protonovus Assault: 1   3DS    : 0   2014   :17     
##  Anima - Gate of Memories          : 1   DC     : 0   2016   :16     
##  Assassin's Creed Chronicles: China: 1   DS     : 0   2013   : 2     
##  Assassin's Creed IV: Black Flag   : 1   GBA    : 0   1985   : 0     
##  Assassin's Creed: Unity           : 1   GC     : 0   1988   : 0     
##  (Other)                           :51   (Other): 0   (Other): 0     
##        Genre                                     Publisher 
##  Action   :57   Warner Bros. Interactive Entertainment:10  
##  Adventure: 0   Ubisoft                               : 7  
##  Fighting : 0   Activision                            : 5  
##  Misc     : 0   Sony Computer Entertainment           : 5  
##  Platform : 0   Namco Bandai Games                    : 3  
##  Puzzle   : 0   Square Enix                           : 3  
##  (Other)  : 0   (Other)                               :24  
##     NA_Sales         EU_Sales         JP_Sales        Other_Sales    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0200   1st Qu.:0.0300   1st Qu.:0.00000   1st Qu.:0.0100  
##  Median :0.1300   Median :0.1800   Median :0.01000   Median :0.0500  
##  Mean   :0.4054   Mean   :0.5284   Mean   :0.05105   Mean   :0.1816  
##  3rd Qu.:0.5500   3rd Qu.:0.7600   3rd Qu.:0.07000   3rd Qu.:0.2700  
##  Max.   :3.8800   Max.   :6.0400   Max.   :0.48000   Max.   :1.9100  
##                                                                      
##   Global_Sales     Critic_Score    Critic_Count      User_Score
##  Min.   : 0.010   Min.   :44.00   Min.   :  4.00   7.8    : 5  
##  1st Qu.: 0.080   1st Qu.:69.00   1st Qu.: 29.00   7.9    : 4  
##  Median : 0.330   Median :73.00   Median : 42.00   5.7    : 3  
##  Mean   : 1.167   Mean   :72.98   Mean   : 44.96   6.6    : 3  
##  3rd Qu.: 1.670   3rd Qu.:80.00   3rd Qu.: 60.00   7.6    : 3  
##  Max.   :12.190   Max.   :97.00   Max.   :100.00   8.1    : 3  
##                                                    (Other):36  
##    User_Count                Developer      Rating        year     
##  Min.   :   6.0   TT Games        : 5   M      :25   Min.   :2013  
##  1st Qu.:  50.0   Omega Force     : 3   T      :19   1st Qu.:2014  
##  Median :  98.0   PlatinumGames   : 2   E10+   :11   Median :2015  
##  Mean   : 756.1   Sucker Punch    : 2   E      : 2   Mean   :2015  
##  3rd Qu.:1104.0   Techland        : 2          : 0   3rd Qu.:2016  
##  Max.   :6304.0   Ubisoft Montreal: 2   AO     : 0   Max.   :2016  
##                   (Other)         :41   (Other): 0                 
##  User_Score_num  newPlatform       
##  Min.   :34.00   Length:57         
##  1st Qu.:63.00   Class :character  
##  Median :72.00   Mode  :character  
##  Mean   :69.44                     
##  3rd Qu.:78.00                     
##  Max.   :86.00                     
## 
ggplot(data=df2,aes(x= User_Score_num,y= Critic_Score,label=Name)) + geom_point(aes(color=Rating,size=Global_Sales)) + xlim(0,100) + ylim(0,100) + geom_abline(intercept = 0, slope = 1, color="red")
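
The scatter plot shows the relationship visually; a correlation coefficient summarises it in a single number. The sketch below is a minimal illustration on a small made-up data frame (the `toy` values are hypothetical, not taken from the dataset); the same calls could be applied to `df2$Critic_Score` and `df2$User_Score_num`.

```r
# Toy data (hypothetical values), on the same 0-100 scale as the scores above
toy <- data.frame(
  Critic_Score   = c(71, 80, 65, 90, 75, 58),
  User_Score_num = c(45, 78, 60, 85, 70, 62)
)

# Pearson measures linear association; Spearman is rank-based and less
# sensitive to a single extreme game
r_pearson  <- cor(toy$Critic_Score, toy$User_Score_num, method = "pearson")
r_spearman <- cor(toy$Critic_Score, toy$User_Score_num, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

A value close to 1 would mean critics and users largely agree on the ranking of games; a value near 0 would mean the two scores carry little information about each other.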

Outliers definition

We can choose to plot the outlier games, meaning the games for which the difference between the two scores is above a given threshold.

df2$DiffScore<-df2$Critic_Score - df2$User_Score_num
#define the mean and standard deviation for the Scores difference
meanDF<-mean(df2$DiffScore)
sdDF<-sd(df2$DiffScore)
sprintf("mean: %f sd :%f", meanDF,sdDF)
## [1] "mean: 3.543860 sd :12.087582"
ggplot(df2, aes(x = Critic_Score - User_Score_num)) + geom_histogram(aes(y = ..density..),bins=50) + stat_function(fun = dnorm,args = with(df2, c(mean = meanDF, sd = sdDF)),color='red')

#definition of the threshold
Threshold<- meanDF + sdDF

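So it turns out that No Man's Sky, with a score difference of 26 (71 - 45), lies above the threshold \(\mu\) + 1*\(\sigma\) (about 15.6), although still below \(\mu\) + 2*\(\sigma\) (about 27.7).

We can then label on the plot the games for which the absolute difference between the two scores is above this threshold: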

ggplot(data=filter(df2,Genre=='Action' & Platform=='PS4'),aes(x= User_Score_num,y= Critic_Score,label=Name)) + geom_point(aes(color=Rating,size=Global_Sales)) + xlim(0,100) + ylim(0,100) + geom_abline(intercept = 0, slope = 1, color="red") + geom_text(aes(label=ifelse(abs(Critic_Score-User_Score_num)>Threshold,as.character(Name),''),hjust=0, vjust=0))

The interesting thing to note is that No Man's Sky is in third position (by Global Sales) among the games above the threshold (the first two being Ubisoft games … ¯\_(ツ)_/¯):

df2 %>% filter(abs(DiffScore)>Threshold) %>% select(DiffScore, Global_Sales, Name, year) %>% arrange(desc(Global_Sales))
##    DiffScore Global_Sales                               Name year
## 1         17         4.01                         Watch Dogs 2014
## 2         21         3.93            Assassin's Creed: Unity 2014
## 3         26         1.67                       No Man's Sky 2016
## 4         17         1.56                          Mafia III 2016
## 5         24         0.52              Rory McIlroy PGA Tour 2015
## 6         20         0.43              Skylanders: Trap Team 2014
## 7         47         0.30          Skylanders: SuperChargers 2015
## 8        -21         0.02 Aegis of Earth: Protonovus Assault 2016
## 9        -27         0.02           Anima - Gate of Memories 2016
## 10        16         0.02                      Dead Rising 2 2016
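
To see how far No Man's Sky sits from the bulk of the distribution, its score difference can be expressed in units of standard deviation. A minimal sketch, using only the numbers printed above (the variable names `nms_diff` and `z_nms` are illustrative):

```r
# Mean and standard deviation of DiffScore, as printed above
meanDF <- 3.543860
sdDF   <- 12.087582

# No Man's Sky: Critic_Score 71, User_Score_num 45
nms_diff <- 71 - 45

# Distance from the mean in units of standard deviation
z_nms <- (nms_diff - meanDF) / sdDF
print(round(z_nms, 2))

# Above the 1-sigma threshold used here, but below a 2-sigma one
print(nms_diff > meanDF + sdDF)
print(nms_diff > meanDF + 2 * sdDF)
```

The game is therefore flagged by the 1\(\sigma\) cut, but it would not be flagged by a stricter 2\(\sigma\) cut.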

Outlier in number of reviews

The game may not be an extreme outlier in the score difference; however, it is a clear outlier in the difference between the number of professional and public reviews.

There goes the hype train …

df2$DiffRev<-df2$Critic_Count - df2$User_Count
#define the mean and standard deviation for the review-count difference
meanRev<-mean(df2$DiffRev)
sdRev<-sd(df2$DiffRev)
sprintf("mean: %f sd :%f", meanRev,sdRev)
## [1] "mean: -711.122807 sd :1313.378890"
ggplot(df2, aes(x = DiffRev)) + geom_histogram(aes(y = ..density..),bins=50) + stat_function(fun = dnorm,args = with(df2, c(mean = meanRev, sd = sdRev)),color='red')

#definition of the threshold
ThresholdRev<- meanRev - 3*sdRev
ggplot(data=df2,aes(x= User_Score_num- Critic_Score,y = Critic_Count - User_Count,label=Name)) + geom_point(aes(color=Rating,size=Global_Sales))  + geom_text(aes(label=ifelse(abs(DiffRev)>abs(ThresholdRev),as.character(Name),''),hjust=0, vjust=0))

Using a 3\(\sigma\) cut clearly marks No Man's Sky as an outlier here: it has the 2nd largest difference in the number of reviews and, at the same time, a large difference between the two scores.
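
The cut used above compares `abs(DiffRev)` with `abs(ThresholdRev)`; an alternative convention measures the distance from the mean instead. The sketch below reuses only the mean, standard deviation and review counts printed earlier (variable names such as `flag_doc` and `flag_sym` are illustrative) and checks No Man's Sky against both conventions:

```r
# Mean and standard deviation of DiffRev, as printed above
meanRev <- -711.122807
sdRev   <- 1313.378890

# No Man's Sky: Critic_Count 93, User_Count 5046
nms_diff_rev <- 93 - 5046

# Convention used in this analysis: absolute value against |mean - 3*sd|
ThresholdRev <- meanRev - 3 * sdRev
flag_doc <- abs(nms_diff_rev) > abs(ThresholdRev)

# Alternative convention (an assumption, not the cut used above):
# distance from the mean larger than 3 standard deviations
flag_sym <- abs(nms_diff_rev - meanRev) > 3 * sdRev

# Distance from the mean in units of standard deviation
z_rev <- (nms_diff_rev - meanRev) / sdRev

print(c(flag_doc = flag_doc, flag_sym = flag_sym))
print(round(z_rev, 2))
```

Both checks flag the game, which sits a little more than three standard deviations below the mean of the review-count difference.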