An introduction to exploring data and how toget a first look at what you have. We will take a look at the video games dataset from Kaggle.
It is good to know the quantum of data we are working with here. We can see that the Video Games dataset contains 16719 rows of data in 16 columns.
dim(games)
## [1] 16719 16
A quick look at the first few rows goes a long way in seeing what we are dealing with here
head(games)
## Name Platform Year_of_Release Genre Publisher
## 1 Wii Sports Wii 2006 Sports Nintendo
## 2 Super Mario Bros. NES 1985 Platform Nintendo
## 3 Mario Kart Wii Wii 2008 Racing Nintendo
## 4 Wii Sports Resort Wii 2009 Sports Nintendo
## 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo
## 6 Tetris GB 1989 Puzzle Nintendo
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## 1 41 29.0 3.8 8.45 83 76
## 2 29 3.6 6.8 0.77 40 NA
## 3 16 12.8 3.8 3.29 36 82
## 4 16 10.9 3.3 2.95 33 80
## 5 11 8.9 10.2 1.00 31 NA
## 6 23 2.3 4.2 0.58 30 NA
## Critic_Count User_Score User_Count Developer Rating
## 1 51 8 322 Nintendo E
## 2 NA <NA> NA <NA> <NA>
## 3 73 8.3 709 Nintendo E
## 4 73 8 192 Nintendo E
## 5 NA <NA> NA <NA> <NA>
## 6 NA <NA> NA <NA> <NA>
Summary gives a very quick view into the dataset. The video games dataset here shows us each games, it developer and publisher, its total and regional sales and ratings scores.
summary(games)
## Name Platform Year_of_Release
## Need for Speed: Most Wanted: 12 PS2 :2161 2008 :1427
## FIFA 14 : 9 DS :2152 2009 :1426
## LEGO Marvel Super Heroes : 9 PS3 :1331 2010 :1255
## Madden NFL 07 : 9 Wii :1320 2007 :1197
## Ratatouille : 9 X360 :1262 2011 :1136
## (Other) :16669 PSP :1209 2006 :1006
## NA's : 2 (Other):7284 (Other):9272
## Genre Publisher NA_Sales
## Action :3370 Electronic Arts : 1356 Min. : 0
## Sports :2348 Activision : 985 1st Qu.: 0
## Misc :1750 Namco Bandai Games : 939 Median : 0
## Role-Playing:1500 Ubisoft : 933 Mean : 0
## Shooter :1323 Konami Digital Entertainment: 834 3rd Qu.: 0
## (Other) :6426 THQ : 715 Max. :41
## NA's : 2 (Other) :10957
## EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0 Min. :13
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.:60
## Median : 0.0 Median : 0.0 Median : 0.0 Median : 0 Median :71
## Mean : 0.1 Mean : 0.1 Mean : 0.0 Mean : 1 Mean :69
## 3rd Qu.: 0.1 3rd Qu.: 0.0 3rd Qu.: 0.0 3rd Qu.: 0 3rd Qu.:79
## Max. :29.0 Max. :10.2 Max. :10.6 Max. :83 Max. :98
## NA's :8582
## Critic_Count User_Score User_Count Developer
## Min. : 3 tbd :2425 Min. : 4 Ubisoft : 204
## 1st Qu.: 12 7.8 : 324 1st Qu.: 10 EA Sports: 172
## Median : 21 8 : 290 Median : 24 EA Canada: 167
## Mean : 26 8.2 : 282 Mean : 162 Konami : 162
## 3rd Qu.: 36 8.3 : 254 3rd Qu.: 81 Capcom : 139
## Max. :113 (Other):6440 Max. :10665 (Other) :9252
## NA's :8582 NA's :6704 NA's :9129 NA's :6623
## Rating
## E :3991
## T :2961
## M :1563
## E10+ :1420
## EC : 8
## (Other): 7
## NA's :6769
We can also have a look of the structure of the object we have read in to R, so we can get an idea of how we might need to manipulate it.
str(games)
## 'data.frame': 16719 obs. of 16 variables:
## $ Name : Factor w/ 11562 levels "'98 Koshien",..: 11058 9405 5572 11060 7416 9770 6692 11056 6695 2619 ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
## $ Year_of_Release: Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
## $ Genre : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
## $ Publisher : Factor w/ 582 levels "10TACLE Studios",..: 371 371 371 371 371 371 371 371 371 371 ...
## $ NA_Sales : num 41.4 29.1 15.7 15.6 11.3 ...
## $ EU_Sales : num 28.96 3.58 12.76 10.93 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.45 0.77 3.29 2.95 1 0.58 2.88 2.84 2.24 0.47 ...
## $ Global_Sales : num 82.5 40.2 35.5 32.8 31.4 ...
## $ Critic_Score : int 76 NA 82 80 NA NA 89 58 87 NA ...
## $ Critic_Count : int 51 NA 73 73 NA NA 65 41 80 NA ...
## $ User_Score : Factor w/ 96 levels "0","0.2","0.3",..: 78 NA 81 78 NA NA 83 64 82 NA ...
## $ User_Count : int 322 NA 709 192 NA NA 431 129 594 NA ...
## $ Developer : Factor w/ 1696 levels "10tacle Studios",..: 1034 NA 1034 1034 NA NA 1034 1034 1034 NA ...
## $ Rating : Factor w/ 8 levels "AO","E","E10+",..: 2 NA 2 2 NA NA 2 2 2 NA ...
There will always be missing data, so we want to get an understanding of how much is missing. It is best to do this on a column-by-column basis. Interestingly, for this dataset only a few columns have missing data.
sapply(games, function(x) sum(is.na(x)))
## Name Platform Year_of_Release Genre
## 2 0 0 2
## Publisher NA_Sales EU_Sales JP_Sales
## 0 0 0 0
## Other_Sales Global_Sales Critic_Score Critic_Count
## 0 0 8582 8582
## User_Score User_Count Developer Rating
## 6704 9129 6623 6769
We can also graph this as a way of seeing the proportion of data per column that is missing. This gives us an idea of how accurate any analysis we do on this column might be.
missing = data.frame(colnames(games)) #create new dataframe with the colnames as rows
missing$totalRows = dim(games)[1] #add column totals
missing$isNaRows = sapply(games, function(x) sum(is.na(x))) #add number of missing rows as new column
g = ggplot(missing, aes(x = colnames.games., y = isNaRows))
g = g + geom_point()
g = g + ggtitle("Summary of missing rows")
g = g + xlab("Row Name")
g = g + ylab("Sum of missing rows")
g = g + geom_hline(yintercept = max(missing$totalRows)) #add max rows line
g = g + annotate("text", x = dim(missing)[1]/2, y = max(missing$totalRows)*1.1, label = "TotalRows")
g = g + scale_y_continuous(limits = c(0, 20000)) #scale y for better viewing
g + theme(axis.text.x = element_text(angle = 90)) #rotate x labels so we can read them
We might want to pick out some features of the dataset to zoom in on.
Video games are typically tied to a platform, which is just another way of talking about the type of hardware needed to play the games. Let’s look at the platforms with the most games available:
platforms = data.frame(table(games$Platform))
platforms[order(-platforms$Freq),]
## Var1 Freq
## 17 PS2 2161
## 5 DS 2152
## 18 PS3 1331
## 26 Wii 1320
## 29 X360 1262
## 20 PSP 1209
## 16 PS 1197
## 14 PC 974
## 30 XB 824
## 7 GBA 822
## 8 GC 556
## 3 3DS 520
## 21 PSV 432
## 19 PS4 393
## 11 N64 319
## 31 XOne 247
## 24 SNES 239
## 22 SAT 173
## 27 WiiU 147
## 1 2600 133
## 6 GB 98
## 12 NES 98
## 4 DC 52
## 9 GEN 29
## 13 NG 12
## 23 SCD 6
## 28 WS 6
## 2 3DO 3
## 25 TG16 2
## 10 GG 1
## 15 PCFX 1
Similarly, we can take a look at the top developers of games, as per the amount of games they release. I don’t expect this to be very valuable, but let’s take a look anyway. To cut it down, let’s only show the top 10 developers.
developers = data.frame(table(games$Developer))
developers = developers[order(-developers$Freq),]
head(developers, 10) #only show top 10 rows
## Var1 Freq
## 1552 Ubisoft 204
## 451 EA Sports 172
## 433 EA Canada 167
## 826 Konami 162
## 265 Capcom 139
## 455 EA Tiburon 108
## 479 Electronic Arts 104
## 1560 Ubisoft Montreal 101
## 1622 Visual Concepts 99
## 1070 Omega Force 80
As we expected, this isn’t very useful. Only 9.53% of games are developed by the top 10 developers.