Oddly enough, I have never been much interested in video games. One reason I chose this dataset is that I am curious about the popular games my friends play and talk about. The dataset is rich, with 16,598 observations and 11 variables. It was scraped from vgchartz.com and lists video games with sales greater than 100,000 copies. The variables are the game’s name, its sales rank, the platform it was released on (such as PC or PS4), the year of release, the genre, and the publisher, followed by sales in North America, Europe, Japan (of course, the dreamland of games), and the rest of the world, plus total worldwide sales.
I will use ggplot2 and dplyr to create a line chart and a scatter chart showing worldwide video game sales from 1984 to 2020. Plotly and Highcharter are used for interactivity.
library(readr)
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ purrr 1.0.1 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(RColorBrewer)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(ggthemes)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
setwd("/Users/Linh/Desktop/DATASETS ")
vgsales <- read_csv("vgsales.csv")
## Rows: 16598 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Platform, Year, Genre, Publisher
## dbl (6): Rank, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I checked the structure of the 11 variables to see whether each one is numeric or character. In the original dataset the Year variable was read as character, so I needed to convert it to numeric. The coercion warning below means that some Year entries are not numbers and have become NA.
top_10 <- head(vgsales,10)
vgsales$Year <- as.numeric(vgsales$Year)
## Warning: NAs introduced by coercion
str(vgsales)
## spc_tbl_ [16,598 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Rank : num [1:16598] 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr [1:16598] "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : chr [1:16598] "Wii" "NES" "Wii" "Wii" ...
## $ Year : num [1:16598] 2006 1985 2008 2009 1996 ...
## $ Genre : chr [1:16598] "Sports" "Platform" "Racing" "Sports" ...
## $ Publisher : chr [1:16598] "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
## $ NA_Sales : num [1:16598] 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num [1:16598] 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num [1:16598] 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num [1:16598] 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num [1:16598] 82.7 40.2 35.8 33 31.4 ...
## - attr(*, "spec")=
## .. cols(
## .. Rank = col_double(),
## .. Name = col_character(),
## .. Platform = col_character(),
## .. Year = col_character(),
## .. Genre = col_character(),
## .. Publisher = col_character(),
## .. NA_Sales = col_double(),
## .. EU_Sales = col_double(),
## .. JP_Sales = col_double(),
## .. Other_Sales = col_double(),
## .. Global_Sales = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
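Before overwriting the column, it can help to see which values trigger the coercion warning. The sketch below uses a small made-up vector in place of the real Year column; the "N/A" placeholder is an assumption about how missing years are written in the CSV, not something confirmed from the file.

```r
# Toy stand-in for the Year column; the "N/A" marker is an assumption,
# not a value confirmed from vgsales.csv.
year_raw <- c("2006", "1985", "N/A", "2009", "N/A")

# as.numeric() returns NA (with a warning) for anything that is not a number.
year_num <- suppressWarnings(as.numeric(year_raw))

# Entries that were present as text but could not be converted.
failed <- is.na(year_num) & !is.na(year_raw)

sum(failed)               # how many values were lost in the conversion
unique(year_raw[failed])  # which strings caused the warning
```

On the real data the same check would run on the Year column right after read_csv() and before the as.numeric() call. If a single placeholder string turns out to be responsible, passing it through the na argument of read_csv() should let readr parse Year as a number directly.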
Because the dataset has over 16,000 different game names, using the game’s name for the legend would not be effective, so I created a line chart colored by genre instead. The chart shows Global Sales (in millions) over time from 1984 to 2020. Thanks to plotly, each mouse-over tooltip shows detailed information about a game: its name, ranking, year, genre, publisher, and sales in different regions.
videogames <- ggplot(vgsales,
                     aes(x = Year, y = Global_Sales, color = Genre, group = 1)) +
  theme_economist() +
  geom_line() +
  # `text` is not a ggplot2 aesthetic (hence the warning below);
  # ggplotly() picks it up to build the tooltip.
  geom_point(
    aes(text =
          paste(paste("Name:", Name, "<br>"),
                paste("Rank:", Rank, "<br>"),
                paste("Year:", Year, "<br>"),
                paste("Genre:", Genre, "<br>"),
                paste("Publisher:", Publisher, "<br>"),
                paste("EU Sales:", EU_Sales, "<br>"),
                paste("NA Sales:", NA_Sales, "<br>"),
                paste("Japan Sales:", JP_Sales, "<br>"),
                paste("Global Sales:", Global_Sales, "<br>"))),
    size = 1,
    data = vgsales) +
  scale_x_continuous(breaks = seq(1984, 2020, 6)) +
  scale_color_pander() +
  labs(x = "Year", y = "Global Sales in Millions") +
  ggtitle("Global Sales of Video Games over time")
## Warning in geom_point(aes(text = paste(paste("Name:", Name, "<br>"),
## paste("Rank:", : Ignoring unknown aesthetics: text
videogames2 <- ggplotly(videogames, tooltip = "text")
videogames2
highchart() %>%
  hc_add_series(data = vgsales,
                type = "scatter",
                hcaes(x = Year,
                      y = Global_Sales,
                      group = Genre)) %>%
  hc_legend(align = "center", verticalAlign = "top", layout = "horizontal") %>%
  # One centered title (replaces the leftover placeholder title)
  hc_title(text = "Global Sales of Video Games over time", align = "center") %>%
  hc_xAxis(title = list(text = "Year")) %>%
  hc_yAxis(title = list(text = "Global Sales in Millions")) %>%
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
  hc_add_theme(hc_theme_ffx()) %>%
  hc_tooltip(shared = TRUE,
             pointFormat = "Year: {point.Year}<br>Name: {point.Name}<br>Rank: {point.Rank}<br>Genre: {point.Genre}<br>Global Sales: {point.Global_Sales}")
What I found interesting in these two visualizations is that although people were already playing video games in the early years, the period from 2002 to 2016 is really when video games gained traction.
There is one outlier: Wii Sports, released in 2006, has the highest individual global sales of any game in this dataset.
Nintendo seems to be the publisher with the most record-breaking popular games.
The most popular genre is Sports, with its best-selling titles produced by the large publisher Nintendo. Popular genres such as Sports, Racing, and Role-Playing are produced by the larger publishers for the North American, European, and Japanese markets.
North America has been the dominant market contributing to global sales.
I also noticed that data for 2017 to 2020 may be missing or erroneous, and that 2016 likely does not contain a full year of sales.
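One way to check the suspicion about thin or missing recent years is to count the games and total the sales recorded for each year. This is a minimal sketch on a hypothetical miniature table (`toy`, with made-up values) that only borrows the Year and Global_Sales column names from vgsales; it is not a result from the real data.

```r
library(dplyr)

# Hypothetical miniature of vgsales (made-up values), used only to show
# the shape of the check.
toy <- data.frame(
  Year         = c(2015, 2015, 2015, 2016, 2016, 2017, 2020, NA),
  Global_Sales = c(1.2, 0.8, 0.5, 0.4, 0.3, 0.05, 0.29, 0.1)
)

# Number of games and total sales per release year; rows with a missing
# Year form their own group, which arrange() places last.
per_year <- toy %>%
  group_by(Year) %>%
  summarise(
    n_games     = n(),
    total_sales = sum(Global_Sales),
    .groups     = "drop"
  ) %>%
  arrange(Year)

per_year
```

Run on vgsales instead of `toy`, a sharp drop in `n_games` for the last few years would support the idea that those years are incomplete rather than genuinely weak.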
What I wish to improve: I wish I could do a better job of telling the story of how sales relate across regions, for example North America compared with Europe, or Japan, a single country, compared with an entire region.
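A possible starting point for that regional comparison is to reshape the four regional sales columns into one long column, then total them by year and region. The sketch below assumes tidyr's pivot_longer() and uses a hypothetical table (`toy`, with made-up values) that only mirrors the vgsales column names.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows with the same column names as vgsales (values are made up).
toy <- data.frame(
  Year        = c(2006, 2006, 2008),
  NA_Sales    = c(4.0, 1.0, 2.0),
  EU_Sales    = c(3.0, 0.5, 1.5),
  JP_Sales    = c(0.5, 0.5, 1.0),
  Other_Sales = c(0.5, 0.0, 0.5)
)

regional <- toy %>%
  # One row per game and region instead of one column per region
  pivot_longer(
    cols      = c(NA_Sales, EU_Sales, JP_Sales, Other_Sales),
    names_to  = "Region",
    values_to = "Sales"
  ) %>%
  # Total sales per year and region
  group_by(Year, Region) %>%
  summarise(Sales = sum(Sales), .groups = "drop") %>%
  # Each region's share of that year's total
  group_by(Year) %>%
  mutate(Share = Sales / sum(Sales)) %>%
  ungroup()

regional
```

The long table could then go to ggplot() with aes(x = Year, y = Sales, color = Region) and geom_line(), giving one line per region, so that Japan can be read directly against North America and Europe.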