The goal here is to establish a basic understanding of the dataset and its structure, explore the correlations, both pairwise and with the outcome (Global_Sales), and demonstrate a range of visualisations.
vgDf <- read.csv("C:/Users/PC/Documents/R_4DS/VideoGames/vgsales.csv/vgsales.csv", stringsAsFactors = TRUE)
str(vgDf)
## 'data.frame': 16598 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 11493 levels "'98 Koshien",..: 10991 9343 5532 10993 7370 9707 6648 10989 6651 2594 ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
## $ Year : Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
## $ Genre : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
## $ Publisher : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
summary(vgDf)
## Rank Name Platform
## Min. : 1 Need for Speed: Most Wanted: 12 DS :2163
## 1st Qu.: 4151 FIFA 14 : 9 PS2 :2161
## Median : 8300 LEGO Marvel Super Heroes : 9 PS3 :1329
## Mean : 8301 Madden NFL 07 : 9 Wii :1325
## 3rd Qu.:12450 Ratatouille : 9 X360 :1265
## Max. :16600 Angry Birds Star Wars : 8 PSP :1213
## (Other) :16542 (Other):7142
## Year Genre Publisher
## 2009 :1431 Action :3316 Electronic Arts : 1351
## 2008 :1428 Sports :2346 Activision : 975
## 2010 :1259 Misc :1739 Namco Bandai Games : 932
## 2007 :1202 Role-Playing:1488 Ubisoft : 921
## 2011 :1139 Shooter :1310 Konami Digital Entertainment: 832
## 2006 :1008 Adventure :1286 THQ : 715
## (Other):9131 (Other) :5113 (Other) :10872
## NA_Sales EU_Sales JP_Sales Other_Sales
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000
## Median : 0.0800 Median : 0.0200 Median : 0.00000 Median : 0.01000
## Mean : 0.2647 Mean : 0.1467 Mean : 0.07778 Mean : 0.04806
## 3rd Qu.: 0.2400 3rd Qu.: 0.1100 3rd Qu.: 0.04000 3rd Qu.: 0.04000
## Max. :41.4900 Max. :29.0200 Max. :10.22000 Max. :10.57000
##
## Global_Sales
## Min. : 0.0100
## 1st Qu.: 0.0600
## Median : 0.1700
## Mean : 0.5374
## 3rd Qu.: 0.4700
## Max. :82.7400
##
colSums(is.na(vgDf)) # => all zeros, although an "N/A" level turns up later in the "Year" variable
## Rank Name Platform Year Genre Publisher
## 0 0 0 0 0 0
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 0 0 0 0 0
cat("\n")
unique(vgDf$Year)
## [1] 2006 1985 2008 2009 1996 1989 1984 2005 1999 2007 2010 2013 2004 1990 1988
## [16] 2002 2001 2011 1998 2015 2012 2014 1992 1997 1993 1994 1982 2003 1986 2000
## [31] N/A 1995 2016 1991 1981 1987 1980 1983 2020 2017
## 40 Levels: 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 ... N/A
## Drop the "N/A" rows from "Year" and re-level the factor
vgDf <- vgDf[!(vgDf$Year == "N/A"),]
vgDf$Year <- factor(vgDf$Year)
## Filtered Data
dropvars <- c("NA_Sales","EU_Sales","JP_Sales","Other_Sales")
globalDf <- vgDf[, !(names(vgDf) %in% dropvars)]
cat("\n",
"The dataset has ", dim(vgDf)[1],"rows and ", dim(vgDf)[2],"columns")
##
## The dataset has 16327 rows and 11 columns
ggplot(data = globalDf, aes(x = Global_Sales)) +
geom_histogram(binwidth = 10)
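With a binwidth of 10 almost every title lands in the first bar, because the distribution is extremely right-skewed (median 0.17 million against a maximum of 82.74 million). A log-scaled sketch of the same histogram makes the shape easier to read (assuming only ggplot2's scale_x_log10(); all Global_Sales values are at least 0.01, so the log is defined):
## Sketch: same distribution on a log10 x-axis to spread out the long right tail
ggplot(data = globalDf, aes(x = Global_Sales)) +
geom_histogram(bins = 30) +
scale_x_log10() +
labs(x = "Global Sales (millions, log10 scale)", y = "Count")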
## Top distributions
## Top games by Name; we will work out how to create a reusable function for this later on.
top_games_names <- globalDf %>%
group_by(Name) %>%
summarise(Sales = sum(Global_Sales), .groups = 'drop') %>%
arrange(desc(Sales))
top_games_names_10 <- head(top_games_names, 10)
ggplot(top_games_names_10, aes(x = reorder(Name, Sales), y = Sales)) +
geom_col(fill = "blue") + # data is already summarised, so no stat/summary function is needed
ggtitle("Top-10 Games by Sales") +
coord_flip() +
xlab("Games") +
ylab("in millions") +
scale_y_continuous(breaks = seq(0, 90, by = 10)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "none")
top_publishers_15 <- globalDf %>%
group_by(Publisher) %>%
count() %>%
arrange(desc(n)) %>%
ungroup() %>%
slice(1:15)
ggplot(top_publishers_15, aes(reorder(Publisher, n), n, fill = Publisher)) +
geom_bar(stat = 'identity') +
coord_flip() +
geom_label(aes(label = n)) +
theme(legend.position = 'none') +
labs(title = 'Top 15 Publishers based on games released', x = '', y = '')
ggplot(data = globalDf[(globalDf$Global_Sales >10) & (globalDf$Global_Sales <80),],
aes(x = factor(Genre), y = Global_Sales)) +
geom_boxplot(col = "blue") + labs(x = "Genres")
# geom_text_repel(aes(label = ifelse(globalDf$Genre[globalDf$Global_Sales]>60, rownames(globalDf), '')))
Here we filtered the data to games with between 10 and 80 million in Global Sales. Within this slice, the largest variance is in the Platform genre, while the most consistently profitable genre is Sports. Next we will try to measure this spread by comparing each genre's maximum value with its average.
#mean(globalDf[globalDf$Genre == "Strategy",]$Global_Sales)
for (genre in levels(globalDf$Genre)) {
print(genre)
cat("Max:")
print((max(globalDf[globalDf$Genre == genre,]$Global_Sales))*1000000)
cat("Mean:")
print((mean(globalDf[globalDf$Genre == genre,]$Global_Sales))*1000000)
cat("\n")
}
## [1] "Action"
## Max:[1] 21400000
## Mean:[1] 529628
##
## [1] "Adventure"
## Max:[1] 11180000
## Mean:[1] 184012.5
##
## [1] "Fighting"
## Max:[1] 13040000
## Mean:[1] 531160.3
##
## [1] "Misc"
## Max:[1] 29020000
## Mean:[1] 466444.4
##
## [1] "Platform"
## Max:[1] 40240000
## Mean:[1] 946518.3
##
## [1] "Puzzle"
## Max:[1] 30260000
## Mean:[1] 424203.2
##
## [1] "Racing"
## Max:[1] 35820000
## Mean:[1] 592797.7
##
## [1] "Role-Playing"
## Max:[1] 31370000
## Mean:[1] 628035.4
##
## [1] "Shooter"
## Max:[1] 28310000
## Mean:[1] 800468
##
## [1] "Simulation"
## Max:[1] 24760000
## Mean:[1] 458472.4
##
## [1] "Sports"
## Max:[1] 82740000
## Mean:[1] 568246.5
##
## [1] "Strategy"
## Max:[1] 5450000
## Mean:[1] 258465
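The same per-genre figures can be computed more compactly with a dplyr summary; a sketch of the equivalent table (the loop above remains the reference output):
## Sketch: per-genre maximum and mean Global_Sales, converted from millions to units,
## equivalent to the loop above but returned as a single table.
genre_stats <- globalDf %>%
group_by(Genre) %>%
summarise(Max = max(Global_Sales) * 1e6,
Mean = mean(Global_Sales) * 1e6,
.groups = 'drop') %>%
arrange(desc(Mean))
genre_stats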
We can begin to understand the gaming industry now: the sheer volume of releases drives sales, and publishers' profits follow the genres they publish most heavily.
Interestingly, the mean sales of the Strategy genre (roughly 0.26 million per title) are not the lowest, even though the genre gathers the least sales overall. We will explore this relationship further.
At this stage, we are going to view the data grouped by several criteria.
## Function to quickly arrange variables based on Volumes
tidy_eval_arrange <- function(.data, var) {
.data %>%
group_by({{var}}) %>%
summarise(Sales = sum(Global_Sales), Volume = n(), Rank = mean(Rank), .groups = 'drop') %>%
arrange(desc(Sales))
}
## Check
head((tidy_eval_arrange(globalDf, Year)), 10)
## # A tibble: 10 x 4
## Year Sales Volume Rank
## <fct> <dbl> <int> <dbl>
## 1 2008 679. 1428 8465.
## 2 2009 667. 1431 8679.
## 3 2007 611. 1202 8606.
## 4 2010 600. 1259 8784.
## 5 2006 521. 1008 9488.
## 6 2011 516. 1139 8795.
## 7 2005 460. 941 8213.
## 8 2004 419. 763 7615.
## 9 2002 396. 829 8239.
## 10 2013 368. 546 8056.
Let's see what the relationship between Sales and Volume looks like on a graph.
topGenreSales <- globalDf %>%
tidy_eval_arrange(
Genre
)
## Visualisation
colors <- c("Volume" = "darkblue",
"Sales" = "orange")
ggplot(data = topGenreSales, aes(x = reorder(Genre, Rank), y = Volume)) +
geom_line(aes(y = Volume, group = 1, color = "Volume"), linetype = "dashed") +
geom_line(aes(y = Sales, group=1, color = "Sales"), linetype = "dashed") +
geom_point(aes(y = Sales)) +
geom_point(aes(y = Volume)) +
scale_color_manual(name = "Top Genre", values = colors) +
ggtitle("Global Sales and Volume by Genre, ordered by Rank") +
xlab("Years") +
ylab("in millions") +
theme(axis.text.x = element_text(angle = 15, vjust = 0.5, hjust = 0.5), legend.position = "top")
The first thing one might notice here is that the ordering by average Rank only loosely tracks the outcome; rather, the sheer volume of games pushed into a genre largely determines its value, i.e. its sales.
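To put a number on that impression, a quick sketch checks the linear correlation between a genre's release Volume and its total Sales, and, for contrast, between its mean Rank and Sales, using the topGenreSales summary built above:
## Sketch: does release volume track total sales more closely than mean Rank does?
cor(topGenreSales$Volume, topGenreSales$Sales)
cor(topGenreSales$Rank, topGenreSales$Sales)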
Here we will do a simple exploration of how the game industry's market dynamics changed between two time periods, which we label the Nineties (before 2000) and the Twenties (2000 onwards).
YearlySales <- globalDf %>%
tidy_eval_arrange(
Year
)
## Visualisation
colors <- c("Volume" = "darkblue",
"Sales" = "orange")
ggplot(data = YearlySales, aes(x = Year, y = Volume)) +
geom_line(aes(y = Volume, group = 1, color = "Volume"), linetype = "dashed") +
geom_line(aes(y = Sales, group=1, color = "Sales"), linetype = "dashed") +
geom_point(aes(y = Sales)) +
geom_point(aes(y = Volume)) +
scale_color_manual(name = "Yearly totals", values = colors) +
ggtitle("Global Sales and Volume by Year") +
xlab("Years") +
ylab("Sales (millions) / Volume (no. of titles)") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5), legend.position = "top")
## Save
#ggsave(plot = p, width = 3, height = 3, dpi = 300, filename = "not squished axis.pdf")
As you can see, the industry changed a great deal after 2000: both the number of releases and total sales grew dramatically.
## Split the data into two time Segments
vgDf_2 <- vgDf %>%
mutate(Segment = ifelse(Year %in% (2000:2020), 'Twenties', 'Nineties'))
tidy_eval_arrange(vgDf_2, Segment)
## # A tibble: 2 x 4
## Segment Sales Volume Rank
## <chr> <dbl> <int> <dbl>
## 1 Twenties 7165. 14353 8537.
## 2 Nineties 1655. 1974 6518.
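A side note on the segmentation above: Year is a factor, so Year %in% 2000:2020 only works because %in% coerces both sides to character. An explicit numeric conversion is more robust; a sketch (vgDf_2b is just an illustrative name, and it assumes no non-numeric Year levels remain after the "N/A" rows were dropped):
## Sketch: segment on a numeric Year instead of relying on character coercion.
vgDf_2b <- vgDf %>%
mutate(Year_num = as.integer(as.character(Year)),
Segment = ifelse(Year_num >= 2000, "Twenties", "Nineties"))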
time_segments <- vgDf_2 %>%
group_by(Segment)
ggplot(time_segments, aes(Segment)) +
geom_bar(aes(fill = Segment)) +
theme(legend.position = 'none') +
labs(x = '', y = '', title = 'No. of games released') +
geom_label(stat = 'count', aes(label = ..count..))
Okay, that very much validates our prior impression. Now let's go back to the original dataset and see how each region contributed during these periods.
segment_region <- vgDf_2 %>%
gather(key = "Region", value = "Sales", NA_Sales:Other_Sales) %>%
group_by(Segment, Region) %>%
summarise(Sales = sum(Sales), .groups = 'drop')
ggplot(segment_region, aes(Region, Sales)) +
geom_col(aes(fill = Segment), position = "dodge") +
labs(x = "", y = "in millions", title = "Periodic Sales by Region") +
geom_text(aes(label = Sales, group = Segment), position = position_dodge(width = 1), vjust = -0.7)
North America clearly accounts for the bulk of the growth between the two periods, while Japan's increase is far gentler.
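To quantify that impression, a short sketch computes each region's growth factor between the two segments from the segment_region summary above (assuming tidyr >= 1.0 for pivot_wider; spread() would work on older versions):
## Sketch: Twenties-to-Nineties sales ratio per region.
segment_region %>%
pivot_wider(names_from = Segment, values_from = Sales) %>%
mutate(Growth = Twenties / Nineties) %>%
arrange(desc(Growth))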
Given the number of numeric variables in the dataset, we will eventually narrow the analysis to Genre as the predictor and Global_Sales as the outcome (x = Genre, y = Global_Sales). First, to give the numeric variables the benefit of the doubt, let's look for correlations among them.
numericVars <- which(sapply(vgDf, is.numeric)) # numeric Variables
numericVarNames <- names(numericVars)
cat("There are", length(numericVars), "numeric variables")
## There are 6 numeric variables
## Pair-wise Correlation
vg_numvars <- vgDf[,numericVars] # Slice based on our numeric Variables
cor_numvar <- cor(vg_numvars, use = "pairwise.complete.obs")
cor_numvar
## Rank NA_Sales EU_Sales JP_Sales Other_Sales
## Rank 1.0000000 -0.4003283 -0.3790237 -0.2691378 -0.3325212
## NA_Sales -0.4003283 1.0000000 0.7689363 0.4512854 0.6345083
## EU_Sales -0.3790237 0.7689363 1.0000000 0.4364139 0.7262657
## JP_Sales -0.2691378 0.4512854 0.4364139 1.0000000 0.2906527
## Other_Sales -0.3325212 0.6345083 0.7262657 0.2906527 1.0000000
## Global_Sales -0.4268798 0.9412677 0.9032710 0.6127938 0.7479742
## Global_Sales
## Rank -0.4268798
## NA_Sales 0.9412677
## EU_Sales 0.9032710
## JP_Sales 0.6127938
## Other_Sales 0.7479742
## Global_Sales 1.0000000
## OUTCOME ~ Correlation
cor_sorted <- as.matrix(sort(cor_numvar[,'Global_Sales'], decreasing = TRUE))
### select only high Correlations
corHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.5 )))
cor_numvar <- cor_numvar[corHigh, corHigh]
corrplot.mixed(cor_numvar, tl.col = "black", tl.pos = "d", tl.cex = 0.7, cl.cex = 0.7, number.cex = 0.8)
Now we know that the strongest numerical indicator of global game sales is the North American market (correlation 0.9412677). Let us now find out the relationship with the factor variables.
A good task would be to merge all American platforms and see how they perform against others.
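One way to attempt that is sketched below, with a hand-made and admittedly hypothetical mapping of platform codes to American manufacturers (2600 for Atari; XB, X360 and XOne for Microsoft); the codes should be checked against levels(vgDf$Platform) before relying on them.
## Sketch: compare platforms from American manufacturers (hypothetical mapping)
## against all other platforms on total sales and number of titles.
american_platforms <- c("2600", "XB", "X360", "XOne")
vgDf %>%
mutate(PlatformOrigin = ifelse(Platform %in% american_platforms, "American", "Other")) %>%
group_by(PlatformOrigin) %>%
summarise(Sales = sum(Global_Sales), Volume = n(), .groups = 'drop')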
— AFL