Tin Le
May 3, 2017
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
library(readr)
vgsales <- read_csv("D:/Spring 2017/CSC360 Data Analysis/VideoGameAnalysis/VideoGameAnalysis/vgsales.csv")
## Parsed with column specification:
## cols(
## Rank = col_integer(),
## Name = col_character(),
## Platform = col_character(),
## Year = col_character(),
## Genre = col_character(),
## Publisher = col_character(),
## NA_Sales = col_double(),
## EU_Sales = col_double(),
## JP_Sales = col_double(),
## Other_Sales = col_double(),
## Global_Sales = col_double()
## )
This research on video game sales compares the platforms that consumers use to play video games, with an emphasis on the PC platform in particular. I was introduced to the PC platform back in grade school and never grew out of it, so this data set caught my interest. I also wanted to look at other regions and which consoles in particular they prefer.
The vgsales data set was extracted from vgchartz.com with a Python script and posted on kaggle.com by GregorySmith. Link: https://www.kaggle.com/gregorut/videogamesales
summary(vgsales)
## Rank Name Platform Year
## Min. : 1 Length:16598 Length:16598 Length:16598
## 1st Qu.: 4151 Class :character Class :character Class :character
## Median : 8300 Mode :character Mode :character Mode :character
## Mean : 8301
## 3rd Qu.:12450
## Max. :16600
## Genre Publisher NA_Sales EU_Sales
## Length:16598 Length:16598 Min. : 0.0000 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000
## Mode :character Mode :character Median : 0.0800 Median : 0.0200
## Mean : 0.2647 Mean : 0.1467
## 3rd Qu.: 0.2400 3rd Qu.: 0.1100
## Max. :41.4900 Max. :29.0200
## JP_Sales Other_Sales Global_Sales
## Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
## 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
## Median : 0.00000 Median : 0.01000 Median : 0.1700
## Mean : 0.07778 Mean : 0.04806 Mean : 0.5374
## 3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4700
## Max. :10.22000 Max. :10.57000 Max. :82.7400
head(vgsales)
## # A tibble: 6 × 11
## Rank Name Platform Year Genre Publisher
## <int> <chr> <chr> <chr> <chr> <chr>
## 1 1 Wii Sports Wii 2006 Sports Nintendo
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo
## 6 6 Tetris GB 1989 Puzzle Nintendo
## # ... with 5 more variables: NA_Sales <dbl>, EU_Sales <dbl>,
## # JP_Sales <dbl>, Other_Sales <dbl>, Global_Sales <dbl>
tail(vgsales)
## # A tibble: 6 × 11
## Rank Name Platform Year
## <int> <chr> <chr> <chr>
## 1 16595 Plushees DS 2008
## 2 16596 Woody Woodpecker in Crazy Castle 5 GBA 2002
## 3 16597 Men in Black II: Alien Escape GC 2003
## 4 16598 SCORE International Baja 1000: The Official Game PS2 2008
## 5 16599 Know How 2 DS 2010
## 6 16600 Spirits & Spells GBA 2003
## # ... with 7 more variables: Genre <chr>, Publisher <chr>, NA_Sales <dbl>,
## # EU_Sales <dbl>, JP_Sales <dbl>, Other_Sales <dbl>, Global_Sales <dbl>
glimpse(vgsales)
## Observations: 16,598
## Variables: 11
## $ Rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ Name <chr> "Wii Sports", "Super Mario Bros.", "Mario Kart Wi...
## $ Platform <chr> "Wii", "NES", "Wii", "Wii", "GB", "GB", "DS", "Wi...
## $ Year <chr> "2006", "1985", "2008", "2009", "1996", "1989", "...
## $ Genre <chr> "Sports", "Platform", "Racing", "Sports", "Role-P...
## $ Publisher <chr> "Nintendo", "Nintendo", "Nintendo", "Nintendo", "...
## $ NA_Sales <dbl> 41.49, 29.08, 15.85, 15.75, 11.27, 23.20, 11.38, ...
## $ EU_Sales <dbl> 29.02, 3.58, 12.88, 11.01, 8.89, 2.26, 9.23, 9.20...
## $ JP_Sales <dbl> 3.77, 6.81, 3.79, 3.28, 10.22, 4.22, 6.50, 2.93, ...
## $ Other_Sales <dbl> 8.46, 0.77, 3.31, 2.96, 1.00, 0.58, 2.90, 2.85, 2...
## $ Global_Sales <dbl> 82.74, 40.24, 35.82, 33.00, 31.37, 30.26, 30.01, ...
The data needed cleaning before use:
I noticed that the first column, “Rank”, was not necessary for the data set, so I removed it with the select function from the dplyr package. There was also some anomalous data to remove: some years were N/A, 2017 had missing data, and some data was even dated 2020.
The next step before exploring the vg data set was to convert some of the columns that were originally plain strings into factors, and Year into numeric; I used the mutate function from the dplyr package to do so.
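One detail worth checking when the year column is converted: as.numeric() turns the text "N/A" into a real missing value (NA), so a later comparison against the string "N/A" will not find those rows. The sketch below uses a small made-up vector, not the real data, to show the difference and one possible way to drop missing years with is.na().

```r
# Toy stand-in for the Year column; these values are made up for illustration.
year_chr <- c("2006", "N/A", "2017", "1985", "2020")

# "N/A" cannot be parsed as a number, so it becomes NA (R normally warns here).
year_num <- suppressWarnings(as.numeric(year_chr))

# Matching on the string "N/A" after conversion: NA is not equal to "N/A",
# so the row with the missing year is kept.
keep_by_string <- !(year_num %in% c("N/A", "2017", "2020"))

# Testing for NA explicitly drops the missing year as well.
keep_by_na <- !is.na(year_num) & !(year_num %in% c(2017, 2020))

year_num[keep_by_string]  # still contains an NA
year_num[keep_by_na]      # only valid, in-range years
```

If the missing years should really be excluded, the second form (or an equivalent dplyr filter with !is.na(Year)) is the one to use; row counts later in the analysis would then change accordingly.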
vg = vgsales %>%
  select(Name,
         Platform,
         Year,
         Genre,
         Publisher,
         NA_Sales,
         EU_Sales,
         JP_Sales,
         Other_Sales,
         Global_Sales)
vg = vg %>%
  mutate(Name = as.character(Name),
         Platform = as.factor(Platform),
         Genre = as.factor(Genre),
         Publisher = as.factor(Publisher),
         Year = as.numeric(Year))
## Warning in eval(substitute(expr), envir, enclos): NAs introduced by
## coercion
# Note: Year is numeric at this point, so "N/A" entries have already become NA
# and are not matched by the string "N/A"; this line drops only 2017 and 2020.
vg = vg[!(vg$Year %in% c("N/A", "2017", "2020")),]
#cleaning up data
I wanted to keep the graphs looking consistent so that the data is easier to understand. To do so, I created my own functions called “theme_0” and “theme_1”, which are shown below.
- Creating my own functions for the graphic layout
theme_0 = function() {
  return(theme(axis.text.x = element_text(face = "bold", angle = 90)))
}
theme_1 = function() {
  return(theme(axis.text.x = element_text(angle = 75, size = 12, vjust = 0.4),
               plot.title = element_text(size = 16, vjust = 2),
               axis.title.x = element_text(size = 12, vjust = -0.35)))
}
After completing the needed cleanup of the data set, we are now able to explore it. To start, I created another data frame as a subset of vg. This subset was created to answer the first question I had.
Below is the code for subsetting the vg data frame into one with only PC as its platform. There were only 960 observations (recorded rows) out of the roughly 16,600 rows in the full data set. Already at this point, I can tell from the data that there may not be a large market share for this platform compared to consoles like the Wii, PS3, or PS4.
- Want data catered toward the research questions
- First look at the PC platform's progress
- Subsetting the data frame
vg_pc = vg %>%
  filter(Platform == "PC")
glimpse(vg_pc)
## Observations: 960
## Variables: 10
## $ Name <chr> "The Sims 3", "World of Warcraft", "Diablo III", ...
## $ Platform <fctr> PC, PC, PC, PC, PC, PC, PC, PC, PC, PC, PC, PC, ...
## $ Year <dbl> 2009, 2004, 2012, 1996, 2010, 1995, 1997, 2007, 2...
## $ Genre <fctr> Simulation, Role-Playing, Role-Playing, Simulati...
## $ Publisher <fctr> Electronic Arts, Activision, Activision, Microso...
## $ NA_Sales <dbl> 0.98, 0.07, 2.43, 3.22, 2.56, 1.70, 4.03, 2.57, 1...
## $ EU_Sales <dbl> 6.42, 6.21, 2.15, 1.69, 1.68, 2.27, 0.00, 1.52, 2...
## $ JP_Sales <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.09, 0.00, 0...
## $ Other_Sales <dbl> 0.71, 0.00, 0.62, 0.20, 0.59, 0.23, 0.00, 0.00, 0...
## $ Global_Sales <dbl> 8.11, 6.28, 5.20, 5.12, 4.83, 4.21, 4.12, 4.09, 3...
Next, I created another data frame using the group_by function to total the revenue by year, and plotted it to get a visual view of the result.
#Combine revenue by year
revenue_by_year_pc = vg_pc %>%
  group_by(Year) %>%
  summarize(Revenue = sum(Global_Sales))
#graphic view
ggplot(revenue_by_year_pc, aes(Year, Revenue)) +
  geom_bar(fill = "blue", stat = "identity") +
  theme_1() +
  ggtitle("PC videogame revenue by year")
ggplot(revenue_by_year_pc, aes(Year, Revenue)) +
  geom_point() +
  geom_line() +
  theme_1() +
  ggtitle("PC videogame revenue by year")
From the graph, PC gaming grew slightly, with spikes in 1994, 1996, and 1997 and then a large spike in revenue in 2011; after 2011, revenue fell back toward its usual level.
After looking at the revenue, I wanted to see which genres PC users were most interested in overall.
#What genres were popular on PC
#global consensus
#Percentage is each genre's share of global sales across all platforms (vg), not just PC
Top_games_pc_Global = vg_pc %>%
  group_by(Genre) %>%
  summarize(Revenue = sum(Global_Sales), Percentage = Revenue/sum(vg$Global_Sales) * 100) %>%
  arrange(desc(Revenue))
#graphics
ggplot(Top_games_pc_Global, aes(Genre, Revenue, fill = Genre)) +
  geom_bar(stat = "identity") +
  ggtitle("Most Played Genre in PC, Global") +
  theme_0() +
  ylab("Revenue in Million")
From the graph we can see that simulation, shooter, role-playing, and strategy games are the top contenders, while other genres had a hard time breaking 10 million in revenue.
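Note what the Percentage column in the code above measures: each PC genre's revenue is divided by the total over all platforms (sum(vg$Global_Sales)), so the values are shares of the whole market, not shares within PC. The following sketch, with made-up numbers and hypothetical object names (toy_all, toy_pc, genre_share), computes both variants side by side.

```r
library(dplyr)

# Made-up sales figures (in millions); not taken from the real data set.
toy_all <- tibble(
  Platform     = c("PC", "PC", "PC", "Wii", "Wii"),
  Genre        = c("Simulation", "Shooter", "Shooter", "Sports", "Racing"),
  Global_Sales = c(6, 2, 2, 8, 2)
)

toy_pc <- toy_all %>% filter(Platform == "PC")

genre_share <- toy_pc %>%
  group_by(Genre) %>%
  summarize(Revenue = sum(Global_Sales)) %>%
  mutate(
    # Share of sales across every platform (same denominator as the code above).
    Pct_of_all = Revenue / sum(toy_all$Global_Sales) * 100,
    # Share of PC sales only; these add up to 100 across genres.
    Pct_of_pc  = Revenue / sum(Revenue) * 100
  ) %>%
  arrange(desc(Revenue))

genre_share
```

Which denominator is appropriate depends on the question: the all-platform share shows how small PC is overall, while the within-PC share shows which genres PC players favor.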
I then looked at individual regions: North America, Europe, and Japan.
#NA_Sales
Top_games_pc_NA = vg_pc %>%
  group_by(Genre) %>%
  summarize(Revenue = sum(NA_Sales), Percentage = Revenue/sum(vg$NA_Sales) * 100) %>%
  arrange(desc(Revenue))
#graphics
ggplot(Top_games_pc_NA, aes(Genre, Revenue, fill = Genre)) +
  geom_bar(stat = "identity") +
  ggtitle("Most Played Genre in PC, North America") +
  theme_0() +
  ylab("Revenue in Million")
#EU_Sales
Top_games_pc_EU = vg_pc %>%
  group_by(Genre) %>%
  summarize(Revenue = sum(EU_Sales), Percentage = Revenue/sum(vg$EU_Sales) * 100) %>%
  arrange(desc(Revenue))
#graphic
ggplot(Top_games_pc_EU, aes(Genre, Revenue, fill = Genre)) +
  geom_bar(stat = "identity") +
  ggtitle("Most Played Genre in PC, Europe") +
  theme_0() +
  ylab("Revenue in Million")
#JP_Sales
Top_games_pc_JP = vg_pc %>%
  group_by(Genre) %>%
  summarize(Revenue = sum(JP_Sales), Percentage = Revenue/sum(vg$JP_Sales) * 100) %>%
  arrange(desc(Revenue))
#graphics
ggplot(Top_games_pc_JP, aes(Genre, Revenue, fill = Genre)) +
  geom_bar(stat = "identity") +
  ggtitle("Most Played Genre in PC, Japan") +
  theme_0() +
  ylab("Revenue in Million")
North America and Europe seem to be consistent with the global chart, but PC gaming is almost non-existent in Japan as a whole. The highest-grossing genre there was shooter, and it only made about $150,000.
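The three regional blocks differ only in the sales column they sum, so they could be folded into one helper. The sketch below is one way to do that; genre_revenue() is a hypothetical name, not part of the original analysis, and the toy data frame stands in for vg_pc.

```r
library(dplyr)

# Hypothetical helper: total one sales column per genre, largest first.
# `sales_col` is the column name as a string, e.g. "NA_Sales".
genre_revenue <- function(df, sales_col) {
  df %>%
    group_by(Genre) %>%
    summarize(Revenue = sum(.data[[sales_col]])) %>%
    arrange(desc(Revenue))
}

# Made-up stand-in for vg_pc (values in millions).
toy_pc <- tibble(
  Genre    = c("Shooter", "Strategy", "Shooter"),
  NA_Sales = c(1.0, 0.5, 2.0),
  JP_Sales = c(0.0, 0.1, 0.0)
)

genre_revenue(toy_pc, "NA_Sales")
genre_revenue(toy_pc, "JP_Sales")
```

With a helper like this, each region becomes a one-line call, and a change to the summary logic only has to be made in one place.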
This next chart was created to check my earlier prediction that PC gaming, as a whole, did not have a large market share. As expected, its share is minuscule: yearly revenue across all platforms peaks at over 600 million, while PC peaks at only about 20 million.
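To put a number on the comparison, the yearly PC total can be divided by the all-platform total for the same year. This is only a sketch with made-up figures; toy_rev_all, toy_rev_pc, and pc_share are hypothetical names standing in for the per-year summaries of all platforms and of PC.

```r
library(dplyr)

# Made-up yearly totals in millions; stand-ins for the all-platform
# and PC-only per-year summaries.
toy_rev_all <- tibble(Year = c(2009, 2010, 2011), Revenue = c(600, 500, 400))
toy_rev_pc  <- tibble(Year = c(2009, 2010, 2011), Revenue = c(15, 20, 30))

pc_share <- toy_rev_all %>%
  # Keep only years present in both tables; suffixes tell the two Revenue columns apart.
  inner_join(toy_rev_pc, by = "Year", suffix = c("_all", "_pc")) %>%
  mutate(PC_share_pct = Revenue_pc / Revenue_all * 100)

pc_share
```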
#Combine revenue by year for all platforms
revenue_by_year = vg %>%
  group_by(Year) %>%
  summarize(Revenue = sum(Global_Sales))
#graphic view
ggplot(revenue_by_year, aes(Year, Revenue)) +
  geom_bar(fill = "salmon", stat = "identity") +
  theme_1() +
  ggtitle("Video game Revenue by year") +
  ylab("Revenue in Million")
#Overlaying two graphs
#size is a fixed setting rather than a data mapping, so it goes outside aes()
ggplot(NULL, aes(Year, Revenue)) +
  geom_point(data = revenue_by_year, aes(col = "Overall revenue"), size = 2) +
  geom_point(data = revenue_by_year_pc, aes(col = "PC revenue"), size = 2) +
  theme_1() +
  ggtitle("Overlap of PC platform and all platforms") +
  ylab("Revenue in Million")
After the less than satisfactory results from my research into the PC platform, I looked further into which regions prefer which consoles. I first looked at the global preference to use as a reference for each region.
#Global
High_Platform_Year = vg %>%
  group_by(Year, Platform) %>%
  summarize(Revenue = sum(Global_Sales)) %>%
  arrange(desc(Revenue)) %>%
  top_n(1)
## Selecting by Revenue
#graphics
ggplot(High_Platform_Year, aes(Year, Revenue, fill = Platform)) +
  geom_bar(stat = "identity", colour = "white") +
  ggtitle("Top Platform by Revenue Each year, Global") +
  ylab("Revenue in Million")
In the global chart, I noticed that each platform has a significant stretch of the timeline where it was number one in revenue. This could be due to its release date and the hype around the console. To my surprise, the PS4 has the highest popularity by revenue, while there is no sign of the Xbox One.
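The top_n(1) call above works per year because summarize() removes only the last grouping variable (Platform), leaving the result grouped by Year. The sketch below shows the same idea on made-up data and, as an alternative that assumes dplyr 1.0.0 or later, uses slice_max(), which the dplyr documentation lists as the successor to top_n().

```r
library(dplyr)

# Made-up revenue per year and platform (in millions).
toy <- tibble(
  Year     = c(2006, 2006, 2007, 2007),
  Platform = c("Wii", "PS2", "Wii", "X360"),
  Revenue  = c(5, 8, 12, 9)
)

top_platform <- toy %>%
  group_by(Year, Platform) %>%
  # "drop_last" keeps the Year grouping, so the next step runs once per year.
  summarize(Revenue = sum(Revenue), .groups = "drop_last") %>%
  # Keep the single highest-revenue platform within each year.
  slice_max(Revenue, n = 1) %>%
  ungroup()

top_platform
```

Spelling out .groups makes the per-year behavior explicit instead of relying on the default, and it avoids the informational message newer dplyr versions print about regrouping.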
#NA_Sales
High_Platform_Year_NA = vg %>%
  group_by(Year, Platform) %>%
  summarize(Revenue = sum(NA_Sales)) %>%
  arrange(desc(Revenue)) %>%
  top_n(1)
## Selecting by Revenue
#graphics
ggplot(High_Platform_Year_NA, aes(Year, Revenue, fill = Platform)) +
  geom_bar(stat = "identity", colour = "white") +
  ggtitle("Top Platform by Revenue Each year, North America") +
  ylab("Revenue in Million")
In North America, we can see that the PS2 had the most dominant and consistent share of revenue over time compared with the other consoles. The PlayStation 2 had a huge following when it was released, due to the significant graphics improvements over the first PlayStation. The Wii had the highest popularity, probably due to the new motion control technology, which caused a craze from 2006 to 2009 and helped the Wii sell roughly 100 million consoles over its lifetime.
#EU_Sales
High_Platform_Year_EU = vg %>%
  group_by(Year, Platform) %>%
  summarize(Revenue = sum(EU_Sales)) %>%
  arrange(desc(Revenue)) %>%
  top_n(1)
## Selecting by Revenue
#graphics
ggplot(High_Platform_Year_EU, aes(Year, Revenue, fill = Platform)) +
  geom_bar(stat = "identity", colour = "white") +
  ggtitle("Top Platform by Revenue Each year, Europe") +
  ylab("Revenue in Million")
For Europe, the PS3 and PS4 have recently dominated in popularity, which is interesting because in the North American graph the Xbox 360 had a large market share. Maybe Europeans prefer Sony products to Microsoft's. Then again, the Wii still has a dominant share in popularity from 2006 to 2009.
#JP_Sales
High_Platform_Year_JP = vg %>%
  group_by(Year, Platform) %>%
  summarize(Revenue = sum(JP_Sales)) %>%
  arrange(desc(Revenue)) %>%
  top_n(1)
## Selecting by Revenue
#graphics
ggplot(High_Platform_Year_JP, aes(Year, Revenue, fill = Platform)) +
  geom_bar(stat = "identity", colour = "white") +
  ggtitle("Top Platform by Revenue Each year, Japan") +
  ylab("Revenue in Million")
Japan was very interesting. Notice that every platform on the chart was created by either Nintendo or Sony, both Japanese companies. This could be due to strict laws on imported goods, or possibly more prevalent advertising from Sony and Nintendo than from Microsoft in Japan. Or maybe Japan would rather stick to its guns.
Conclusion:
Although I thought that PC gaming was going to be more interesting, or maybe even have a larger market share, it was a good effort. However, I do not know whether this data set takes into account free-to-play games such as League of Legends, Dota, CS:GO, and many others. This could be due to it only tracking revenue rather than player base. If I had to research this question again, I would no doubt look into a variable other than revenue.
It was compelling to see what other regions prefer in terms of the consoles they play. What was more interesting was the difference between Europe’s choice of the PS3 and PS4 and America’s Xbox 360. Also, Japan really seems to enjoy the consoles it makes. This may be because serious gaming is not as prevalent there as in other countries, with more casual, family-friendly games instead.
In all, this data set showed some aspects of video games and platforms that were both fascinating and informative.