The goal here is to establish a basic understanding of the dataset and its structure, explore the correlations, both pairwise and with the outcome (Global_Sales), and demonstrate a range of visualisations.
vgDf <- read.csv("C:/Users/PC/Documents/R_4DS/VideoGames/vgsales.csv/vgsales.csv", stringsAsFactors = TRUE)
str(vgDf)
## 'data.frame': 16598 obs. of 11 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 11493 levels "'98 Koshien",..: 10991 9343 5532 10993 7370 9707 6648 10989 6651 2594 ...
## $ Platform : Factor w/ 31 levels "2600","3DO","3DS",..: 26 12 26 26 6 6 5 26 26 12 ...
## $ Year : Factor w/ 40 levels "1980","1981",..: 27 6 29 30 17 10 27 27 30 5 ...
## $ Genre : Factor w/ 12 levels "Action","Adventure",..: 11 5 7 11 8 6 5 4 5 9 ...
## $ Publisher : Factor w/ 579 levels "10TACLE Studios",..: 369 369 369 369 369 369 369 369 369 369 ...
## $ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
summary(vgDf)
## Rank Name Platform
## Min. : 1 Need for Speed: Most Wanted: 12 DS :2163
## 1st Qu.: 4151 FIFA 14 : 9 PS2 :2161
## Median : 8300 LEGO Marvel Super Heroes : 9 PS3 :1329
## Mean : 8301 Madden NFL 07 : 9 Wii :1325
## 3rd Qu.:12450 Ratatouille : 9 X360 :1265
## Max. :16600 Angry Birds Star Wars : 8 PSP :1213
## (Other) :16542 (Other):7142
## Year Genre Publisher
## 2009 :1431 Action :3316 Electronic Arts : 1351
## 2008 :1428 Sports :2346 Activision : 975
## 2010 :1259 Misc :1739 Namco Bandai Games : 932
## 2007 :1202 Role-Playing:1488 Ubisoft : 921
## 2011 :1139 Shooter :1310 Konami Digital Entertainment: 832
## 2006 :1008 Adventure :1286 THQ : 715
## (Other):9131 (Other) :5113 (Other) :10872
## NA_Sales EU_Sales JP_Sales Other_Sales
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000
## Median : 0.0800 Median : 0.0200 Median : 0.00000 Median : 0.01000
## Mean : 0.2647 Mean : 0.1467 Mean : 0.07778 Mean : 0.04806
## 3rd Qu.: 0.2400 3rd Qu.: 0.1100 3rd Qu.: 0.04000 3rd Qu.: 0.04000
## Max. :41.4900 Max. :29.0200 Max. :10.22000 Max. :10.57000
##
## Global_Sales
## Min. : 0.0100
## 1st Qu.: 0.0600
## Median : 0.1700
## Mean : 0.5374
## 3rd Qu.: 0.4700
## Max. :82.7400
##
colSums(is.na(vgDf)) # => all zeros, although an "N/A" level turns up later in the "Year" variable
## Rank Name Platform Year Genre Publisher
## 0 0 0 0 0 0
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 0 0 0 0 0
cat("\n")
unique(vgDf$Year)
## [1] 2006 1985 2008 2009 1996 1989 1984 2005 1999 2007 2010 2013 2004 1990 1988
## [16] 2002 2001 2011 1998 2015 2012 2014 1992 1997 1993 1994 1982 2003 1986 2000
## [31] N/A 1995 2016 1991 1981 1987 1980 1983 2020 2017
## 40 Levels: 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 ... N/A
## Drop the "N/A" rows from "Year" and re-level the factor
vgDf <- vgDf[!(vgDf$Year == "N/A"),]
vgDf$Year <- factor(vgDf$Year)
## Filtered Data
dropvars <- c("NA_Sales","EU_Sales","JP_Sales","Other_Sales")
globalDf <- vgDf[, !(names(vgDf) %in% dropvars)]
cat("\n",
"The dataset has ", dim(vgDf)[1],"rows and ", dim(vgDf)[2],"columns")
##
## The dataset has 16327 rows and 11 columns
ggplot(data = globalDf, aes(x = Global_Sales)) +
geom_histogram(binwidth = 10)
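With a binwidth of 10 almost every title lands in the first bar, because the distribution is extremely right-skewed (median 0.17 million against a maximum of 82.74 million). A log-scaled sketch of the same histogram makes the shape easier to read (assuming only ggplot2's scale_x_log10(); all Global_Sales values are at least 0.01, so the log is defined):
## Sketch: same distribution on a log10 x-axis to spread out the long right tail
ggplot(data = globalDf, aes(x = Global_Sales)) +
geom_histogram(bins = 30) +
scale_x_log10() +
labs(x = "Global Sales (millions, log10 scale)", y = "Count")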
## Top distributions
## Top games by Name; we will work out how to create a reusable function for this later on.
top_games_names <- globalDf %>%
group_by(Name) %>%
summarise(Sales = sum(Global_Sales), .groups = 'drop') %>%
arrange(desc(Sales))
top_games_names_10 <- head(top_games_names, 10)
ggplot(top_games_names_10, aes(x = reorder(Name, Sales), y = Sales)) +
geom_col(fill = "blue") + # data is already summarised, so no stat/summary function is needed
ggtitle("Top-10 Games by Sales") +
coord_flip() +
xlab("Games") +
ylab("in millions") +
scale_y_continuous(breaks = seq(0, 90, by = 10)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "none")
top_publishers_15 <- globalDf %>%
group_by(Publisher) %>%
count() %>%
arrange(desc(n)) %>%
ungroup() %>%
slice(1:15)
ggplot(top_publishers_15, aes(reorder(Publisher, n), n, fill = Publisher)) +
geom_bar(stat = 'identity') +
coord_flip() +
geom_label(aes(label = n)) +
theme(legend.position = 'none') +
labs(title = 'Top 15 Publishers based on games released', x = '', y = '')
ggplot(data = globalDf[(globalDf$Global_Sales >10) & (globalDf$Global_Sales <80),],
aes(x = factor(Genre), y = Global_Sales)) +
geom_boxplot(col = "blue") + labs(x = "Genres")
# geom_text_repel(aes(label = ifelse(globalDf$Genre[globalDf$Global_Sales]>60, rownames(globalDf), '')))
Here we filtered the data to games with between 10 and 80 million in Global Sales. Within this slice, the largest variance is in the Platform genre, while the most consistently profitable genre is Sports. Next we will try to measure this spread by comparing each genre's maximum value with its average.
#mean(globalDf[globalDf$Genre == "Strategy",]$Global_Sales)
for (genre in levels(globalDf$Genre)) {
print(genre)
cat("Max:")
print((max(globalDf[globalDf$Genre == genre,]$Global_Sales))*1000000)
cat("Mean:")
print((mean(globalDf[globalDf$Genre == genre,]$Global_Sales))*1000000)
cat("\n")
}
## [1] "Action"
## Max:[1] 21400000
## Mean:[1] 529628
##
## [1] "Adventure"
## Max:[1] 11180000
## Mean:[1] 184012.5
##
## [1] "Fighting"
## Max:[1] 13040000
## Mean:[1] 531160.3
##
## [1] "Misc"
## Max:[1] 29020000
## Mean:[1] 466444.4
##
## [1] "Platform"
## Max:[1] 40240000
## Mean:[1] 946518.3
##
## [1] "Puzzle"
## Max:[1] 30260000
## Mean:[1] 424203.2
##
## [1] "Racing"
## Max:[1] 35820000
## Mean:[1] 592797.7
##
## [1] "Role-Playing"
## Max:[1] 31370000
## Mean:[1] 628035.4
##
## [1] "Shooter"
## Max:[1] 28310000
## Mean:[1] 800468
##
## [1] "Simulation"
## Max:[1] 24760000
## Mean:[1] 458472.4
##
## [1] "Sports"
## Max:[1] 82740000
## Mean:[1] 568246.5
##
## [1] "Strategy"
## Max:[1] 5450000
## Mean:[1] 258465
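The same per-genre figures can be computed more compactly with a dplyr summary; a sketch of the equivalent table (the loop above remains the reference output):
## Sketch: per-genre maximum and mean Global_Sales, converted from millions to units,
## equivalent to the loop above but returned as a single table.
genre_stats <- globalDf %>%
group_by(Genre) %>%
summarise(Max = max(Global_Sales) * 1e6,
Mean = mean(Global_Sales) * 1e6,
.groups = 'drop') %>%
arrange(desc(Mean))
genre_stats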
We can begin to understand the gaming industry now: the sheer volume of releases drives sales, and publishers' profits follow the genres they publish most heavily.
Interestingly, the mean sales of the Strategy genre (roughly 0.26 million per title) are not the lowest, even though the genre gathers the least sales overall. We will explore this relationship further.
At this stage, we are going to view the data grouped by several criteria.
## Function to quickly arrange variables based on Volumes
tidy_eval_arrange <- function(.data, var) {
.data %>%
group_by({{var}}) %>%
summarise(Sales = sum(Global_Sales), Volume = n(), Rank = mean(Rank), .groups = 'drop') %>%
arrange(desc(Sales))
}
## Check
head((tidy_eval_arrange(globalDf, Year)), 10)
## # A tibble: 10 x 4
## Year Sales Volume Rank
## <fct> <dbl> <int> <dbl>
## 1 2008 679. 1428 8465.
## 2 2009 667. 1431 8679.
## 3 2007 611. 1202 8606.
## 4 2010 600. 1259 8784.
## 5 2006 521. 1008 9488.
## 6 2011 516. 1139 8795.
## 7 2005 460. 941 8213.
## 8 2004 419. 763 7615.
## 9 2002 396. 829 8239.
## 10 2013 368. 546 8056.
Let's see what the relationship between Sales and Volume looks like on a graph.
topGenreSales <- globalDf %>%
tidy_eval_arrange(
Genre
)
## Visualisation
colors <- c("Volume" = "darkblue",
"Sales" = "orange")
ggplot(data = topGenreSales, aes(x = reorder(Genre, Rank), y = Volume)) +
geom_line(aes(y = Volume, group = 1, color = "Volume"), linetype = "dashed") +
geom_line(aes(y = Sales, group=1, color = "Sales"), linetype = "dashed") +
geom_point(aes(y = Sales)) +
geom_point(aes(y = Volume)) +
scale_color_manual(name = "Top Genre", values = colors) +
ggtitle("Global Sales and Volume by Genre, ordered by Rank") +
xlab("Years") +
ylab("in millions") +
theme(axis.text.x = element_text(angle = 15, vjust = 0.5, hjust = 0.5), legend.position = "top")
The first thing one might notice here is that the ordering by average Rank only loosely tracks the outcome; rather, the sheer volume of games pushed into a genre largely determines its value, i.e. its sales.
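To put a number on that impression, a quick sketch checks the linear correlation between a genre's release Volume and its total Sales, and, for contrast, between its mean Rank and Sales, using the topGenreSales summary built above:
## Sketch: does release volume track total sales more closely than mean Rank does?
cor(topGenreSales$Volume, topGenreSales$Sales)
cor(topGenreSales$Rank, topGenreSales$Sales)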
Here we will do a simple exploration of how the game industry's market dynamics changed between two time periods, which we label the Nineties (before 2000) and the Twenties (2000 onwards).
YearlySales <- globalDf %>%
tidy_eval_arrange(
Year
)
## Visualisation
colors <- c("Volume" = "darkblue",
"Sales" = "orange")
ggplot(data = YearlySales, aes(x = Year, y = Volume)) +
geom_line(aes(y = Volume, group = 1, color = "Volume"), linetype = "dashed") +
geom_line(aes(y = Sales, group=1, color = "Sales"), linetype = "dashed") +
geom_point(aes(y = Sales)) +
geom_point(aes(y = Volume)) +
scale_color_manual(name = "Yearly totals", values = colors) +
ggtitle("Global Sales and Volume by Year") +
xlab("Years") +
ylab("Sales (millions) / Volume (no. of titles)") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 0.5), legend.position = "top")
## Save
#ggsave(plot = p, width = 3, height = 3, dpi = 300, filename = "not squished axis.pdf")
As you can see, the industry changed a great deal after 2000: both the number of releases and total sales grew dramatically.
## Split the data into two time Segments
vgDf_2 <- vgDf %>%
mutate(Segment = ifelse(Year %in% (2000:2020), 'Twenties', 'Nineties'))
tidy_eval_arrange(vgDf_2, Segment)
## # A tibble: 2 x 4
## Segment Sales Volume Rank
## <chr> <dbl> <int> <dbl>
## 1 Twenties 7165. 14353 8537.
## 2 Nineties 1655. 1974 6518.
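A side note on the segmentation above: Year is a factor, so Year %in% 2000:2020 only works because %in% coerces both sides to character. An explicit numeric conversion is more robust; a sketch (vgDf_2b is just an illustrative name, and it assumes no non-numeric Year levels remain after the "N/A" rows were dropped):
## Sketch: segment on a numeric Year instead of relying on character coercion.
vgDf_2b <- vgDf %>%
mutate(Year_num = as.integer(as.character(Year)),
Segment = ifelse(Year_num >= 2000, "Twenties", "Nineties"))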
time_segments <- vgDf_2 %>%
group_by(Segment)
ggplot(time_segments, aes(Segment)) +
geom_bar(aes(fill = Segment)) +
theme(legend.position = 'none') +
labs(x = '', y = '', title = 'No. of games released') +
geom_label(stat = 'count', aes(label = ..count..))
Okay, that very much validates our prior impression. Now let's go back to the original dataset and see how each region contributed during these periods.
segment_region <- vgDf_2 %>%
gather(key = "Region", value = "Sales", NA_Sales:Other_Sales) %>%
group_by(Segment, Region) %>%
summarise(Sales = sum(Sales), .groups = 'drop')
ggplot(segment_region, aes(Region, Sales)) +
geom_col(aes(fill = Segment), position = "dodge") +
labs(x = "", y = "in millions", title = "Periodic Sales by Region") +
geom_text(aes(label = Sales, group = Segment), position = position_dodge(width = 1), vjust = -0.7)
North America clearly accounts for the bulk of the growth between the two periods, while Japan's increase is far gentler.
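To quantify that impression, a short sketch computes each region's growth factor between the two segments from the segment_region summary above (assuming tidyr >= 1.0 for pivot_wider; spread() would work on older versions):
## Sketch: Twenties-to-Nineties sales ratio per region.
segment_region %>%
pivot_wider(names_from = Segment, values_from = Sales) %>%
mutate(Growth = Twenties / Nineties) %>%
arrange(desc(Growth))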
Given the number of numeric variables in the dataset, we will eventually narrow the analysis to Genre as the predictor and Global_Sales as the outcome (x = Genre, y = Global_Sales). First, to give the numeric variables the benefit of the doubt, let's look for correlations among them.
numericVars <- which(sapply(vgDf, is.numeric)) # numeric Variables
numericVarNames <- names(numericVars)
cat("There are", length(numericVars), "numeric variables")
## There are 6 numeric variables
## Pair-wise Correlation
vg_numvars <- vgDf[,numericVars] # Slice based on our numeric Variables
cor_numvar <- cor(vg_numvars, use = "pairwise.complete.obs")
cor_numvar
## Rank NA_Sales EU_Sales JP_Sales Other_Sales
## Rank 1.0000000 -0.4003283 -0.3790237 -0.2691378 -0.3325212
## NA_Sales -0.4003283 1.0000000 0.7689363 0.4512854 0.6345083
## EU_Sales -0.3790237 0.7689363 1.0000000 0.4364139 0.7262657
## JP_Sales -0.2691378 0.4512854 0.4364139 1.0000000 0.2906527
## Other_Sales -0.3325212 0.6345083 0.7262657 0.2906527 1.0000000
## Global_Sales -0.4268798 0.9412677 0.9032710 0.6127938 0.7479742
## Global_Sales
## Rank -0.4268798
## NA_Sales 0.9412677
## EU_Sales 0.9032710
## JP_Sales 0.6127938
## Other_Sales 0.7479742
## Global_Sales 1.0000000
## OUTCOME ~ Correlation
cor_sorted <- as.matrix(sort(cor_numvar[,'Global_Sales'], decreasing = TRUE))
### select only high Correlations
corHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.5 )))
cor_numvar <- cor_numvar[corHigh, corHigh]
corrplot.mixed(cor_numvar, tl.col = "black", tl.pos = "d", tl.cex = 0.7, cl.cex = 0.7, number.cex = 0.8)
Now we know that the strongest numerical indicator of global game sales is the North American market (correlation 0.9412677). Let us now find out the relationship with the factor variables.
A good task would be to merge all American platforms and see how they perform against others.
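One way to attempt that is sketched below, with a hand-made and admittedly hypothetical mapping of platform codes to American manufacturers (2600 for Atari; XB, X360 and XOne for Microsoft); the codes should be checked against levels(vgDf$Platform) before relying on them.
## Sketch: compare platforms from American manufacturers (hypothetical mapping)
## against all other platforms on total sales and number of titles.
american_platforms <- c("2600", "XB", "X360", "XOne")
vgDf %>%
mutate(PlatformOrigin = ifelse(Platform %in% american_platforms, "American", "Other")) %>%
group_by(PlatformOrigin) %>%
summarise(Sales = sum(Global_Sales), Volume = n(), .groups = 'drop')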
— AFL