Exploring the Video Games dataset

An introduction to exploring data and how toget a first look at what you have. We will take a look at the video games dataset from Kaggle.

Amount of Data

It is good to know the quantum of data we are working with here. We can see that the Video Games dataset contains 16719 rows of data in 16 columns.

dim(games)
## [1] 16719    16

Looking at the Data

A quick look at the first few rows goes a long way in seeing what we are dealing with here

head(games)
##                       Name Platform Year_of_Release        Genre Publisher
## 1               Wii Sports      Wii            2006       Sports  Nintendo
## 2        Super Mario Bros.      NES            1985     Platform  Nintendo
## 3           Mario Kart Wii      Wii            2008       Racing  Nintendo
## 4        Wii Sports Resort      Wii            2009       Sports  Nintendo
## 5 Pokemon Red/Pokemon Blue       GB            1996 Role-Playing  Nintendo
## 6                   Tetris       GB            1989       Puzzle  Nintendo
##   NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## 1       41     29.0      3.8        8.45           83           76
## 2       29      3.6      6.8        0.77           40           NA
## 3       16     12.8      3.8        3.29           36           82
## 4       16     10.9      3.3        2.95           33           80
## 5       11      8.9     10.2        1.00           31           NA
## 6       23      2.3      4.2        0.58           30           NA
##   Critic_Count User_Score User_Count Developer Rating
## 1           51          8        322  Nintendo      E
## 2           NA       <NA>         NA      <NA>   <NA>
## 3           73        8.3        709  Nintendo      E
## 4           73          8        192  Nintendo      E
## 5           NA       <NA>         NA      <NA>   <NA>
## 6           NA       <NA>         NA      <NA>   <NA>

Summary

Summary gives a very quick view into the dataset. The video games dataset here shows us each games, it developer and publisher, its total and regional sales and ratings scores.

summary(games)
##                           Name          Platform    Year_of_Release
##  Need for Speed: Most Wanted:   12   PS2    :2161   2008   :1427   
##  FIFA 14                    :    9   DS     :2152   2009   :1426   
##  LEGO Marvel Super Heroes   :    9   PS3    :1331   2010   :1255   
##  Madden NFL 07              :    9   Wii    :1320   2007   :1197   
##  Ratatouille                :    9   X360   :1262   2011   :1136   
##  (Other)                    :16669   PSP    :1209   2006   :1006   
##  NA's                       :    2   (Other):7284   (Other):9272   
##           Genre                             Publisher        NA_Sales 
##  Action      :3370   Electronic Arts             : 1356   Min.   : 0  
##  Sports      :2348   Activision                  :  985   1st Qu.: 0  
##  Misc        :1750   Namco Bandai Games          :  939   Median : 0  
##  Role-Playing:1500   Ubisoft                     :  933   Mean   : 0  
##  Shooter     :1323   Konami Digital Entertainment:  834   3rd Qu.: 0  
##  (Other)     :6426   THQ                         :  715   Max.   :41  
##  NA's        :   2   (Other)                     :10957               
##     EU_Sales       JP_Sales     Other_Sales    Global_Sales  Critic_Score 
##  Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0    Min.   :13    
##  1st Qu.: 0.0   1st Qu.: 0.0   1st Qu.: 0.0   1st Qu.: 0    1st Qu.:60    
##  Median : 0.0   Median : 0.0   Median : 0.0   Median : 0    Median :71    
##  Mean   : 0.1   Mean   : 0.1   Mean   : 0.0   Mean   : 1    Mean   :69    
##  3rd Qu.: 0.1   3rd Qu.: 0.0   3rd Qu.: 0.0   3rd Qu.: 0    3rd Qu.:79    
##  Max.   :29.0   Max.   :10.2   Max.   :10.6   Max.   :83    Max.   :98    
##                                                             NA's   :8582  
##   Critic_Count    User_Score     User_Count        Developer   
##  Min.   :  3    tbd    :2425   Min.   :    4   Ubisoft  : 204  
##  1st Qu.: 12    7.8    : 324   1st Qu.:   10   EA Sports: 172  
##  Median : 21    8      : 290   Median :   24   EA Canada: 167  
##  Mean   : 26    8.2    : 282   Mean   :  162   Konami   : 162  
##  3rd Qu.: 36    8.3    : 254   3rd Qu.:   81   Capcom   : 139  
##  Max.   :113    (Other):6440   Max.   :10665   (Other)  :9252  
##  NA's   :8582   NA's   :6704   NA's   :9129    NA's     :6623  
##      Rating    
##  E      :3991  
##  T      :2961  
##  M      :1563  
##  E10+   :1420  
##  EC     :   8  
##  (Other):   7  
##  NA's   :6769

We can also have a look of the structure of the object we have read in to R, so we can get an idea of how we might need to manipulate it.

str(games)
## 'data.frame':    16719 obs. of  16 variables:
##  $ Name           : Factor w/ 11562 levels "'98 Koshien",..: 11058 9405 5572 11060 7416 9770 6692 11056 6695 2619 ...
##  $ Platform       : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
##  $ Year_of_Release: Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
##  $ Genre          : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
##  $ Publisher      : Factor w/ 582 levels "10TACLE Studios",..: 371 371 371 371 371 371 371 371 371 371 ...
##  $ NA_Sales       : num  41.4 29.1 15.7 15.6 11.3 ...
##  $ EU_Sales       : num  28.96 3.58 12.76 10.93 8.89 ...
##  $ JP_Sales       : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales    : num  8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
##  $ Global_Sales   : num  82.5 40.2 35.5 32.8 31.4 ...
##  $ Critic_Score   : int  76 NA 82 80 NA NA 89 58 87 NA ...
##  $ Critic_Count   : int  51 NA 73 73 NA NA 65 41 80 NA ...
##  $ User_Score     : Factor w/ 96 levels "0","0.2","0.3",..: 78 NA 81 78 NA NA 83 64 82 NA ...
##  $ User_Count     : int  322 NA 709 192 NA NA 431 129 594 NA ...
##  $ Developer      : Factor w/ 1696 levels "10tacle Studios",..: 1034 NA 1034 1034 NA NA 1034 1034 1034 NA ...
##  $ Rating         : Factor w/ 8 levels "AO","E","E10+",..: 2 NA 2 2 NA NA 2 2 2 NA ...

Missing Data

There will always be missing data, so we want to get an understanding of how much is missing. It is best to do this on a column-by-column basis. Interestingly, for this dataset only a few columns have missing data.

sapply(games, function(x) sum(is.na(x)))
##            Name        Platform Year_of_Release           Genre 
##               2               0               0               2 
##       Publisher        NA_Sales        EU_Sales        JP_Sales 
##               0               0               0               0 
##     Other_Sales    Global_Sales    Critic_Score    Critic_Count 
##               0               0            8582            8582 
##      User_Score      User_Count       Developer          Rating 
##            6704            9129            6623            6769

We can also graph this as a way of seeing the proportion of data per column that is missing. This gives us an idea of how accurate any analysis we do on this column might be.

missing = data.frame(colnames(games)) #create new dataframe with the colnames as rows
missing$totalRows = dim(games)[1] #add column totals
missing$isNaRows = sapply(games, function(x) sum(is.na(x))) #add number of missing rows as new column

g = ggplot(missing, aes(x = colnames.games., y = isNaRows))
g = g + geom_point()
g = g + ggtitle("Summary of missing rows")
g = g + xlab("Row Name")
g = g + ylab("Sum of missing rows")
g = g + geom_hline(yintercept = max(missing$totalRows)) #add max rows line
g = g + annotate("text", x = dim(missing)[1]/2, y = max(missing$totalRows)*1.1, label = "TotalRows")
g = g + scale_y_continuous(limits = c(0, 20000)) #scale y for better viewing
g + theme(axis.text.x = element_text(angle = 90)) #rotate x labels so we can read them

Picking Feature

We might want to pick out some features of the dataset to zoom in on.

Amount of Titles per Platform

Video games are typically tied to a platform, which is just another way of talking about the type of hardware needed to play the games. Let’s look at the platforms with the most games available:

platforms = data.frame(table(games$Platform))
platforms[order(-platforms$Freq),]
##    Var1 Freq
## 17  PS2 2161
## 5    DS 2152
## 18  PS3 1331
## 26  Wii 1320
## 29 X360 1262
## 20  PSP 1209
## 16   PS 1197
## 14   PC  974
## 30   XB  824
## 7   GBA  822
## 8    GC  556
## 3   3DS  520
## 21  PSV  432
## 19  PS4  393
## 11  N64  319
## 31 XOne  247
## 24 SNES  239
## 22  SAT  173
## 27 WiiU  147
## 1  2600  133
## 6    GB   98
## 12  NES   98
## 4    DC   52
## 9   GEN   29
## 13   NG   12
## 23  SCD    6
## 28   WS    6
## 2   3DO    3
## 25 TG16    2
## 10   GG    1
## 15 PCFX    1

Top Developers

Similarly, we can take a look at the top developers of games, as per the amount of games they release. I don’t expect this to be very valuable, but let’s take a look anyway. To cut it down, let’s only show the top 10 developers.

developers = data.frame(table(games$Developer))
developers = developers[order(-developers$Freq),]
head(developers, 10) #only show top 10 rows
##                  Var1 Freq
## 1552          Ubisoft  204
## 451         EA Sports  172
## 433         EA Canada  167
## 826            Konami  162
## 265            Capcom  139
## 455        EA Tiburon  108
## 479   Electronic Arts  104
## 1560 Ubisoft Montreal  101
## 1622  Visual Concepts   99
## 1070      Omega Force   80

As we expected, this isn’t very useful. Only 9.53% of games are developed by the top 10 developers.