Data110FinalProject

Author

Nathaniel Nguyen

Source: https://www.esquire.com/lifestyle/g26572573/best-video-games-ranked/

Introduction

In this project I will be exploring a video game dataset. The dataset contains each game’s name, publisher, platform, genre, and release year, along with its sales in North America, Europe, Japan, and other regions. I will specifically be using the names, publishers, platforms, and the sales figures from each region. By using dplyr functions such as filter and mutate, as well as ggplot and plotly, I will be able to make interesting and informative data visualizations. Through this project I want to see which games sold the most copies, and in which regions. I also want to see which publishers are the most popular in each region. I chose this dataset because I’ve been interested in video games ever since I was young. I plan to answer questions about the background of this subject with a book I find in the MC library catalog. I plan to discover how this data was collected by going through the data set and finding the source of each piece of data. My source: the data is scraped from vgchartz.com.

#Finds working directory
getwd()
[1] "C:/Users/Nathaniel/DATA110"

Calling all packages, setting work directory, and reading csv file

#Loads libraries, sets working directory, and reads the csv file
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(highcharter)
Warning: package 'highcharter' was built under R version 4.3.3
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
setwd("C:/Users/Nathaniel/DATA110")
data <- read_csv("vgsales.csv")
Rows: 16598 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Platform, Year, Genre, Publisher
dbl (6): Rank, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Converts year from a categorical variable to a numerical variable
data$Year <- as.numeric(data$Year)
Warning: NAs introduced by coercion
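A minimal sketch of why this warning appears, assuming the missing years are stored as the text "N/A" in the csv. That is an assumption, and the toy vector year_raw below is made up for illustration rather than read from vgsales.csv. Any string that cannot be parsed as a number becomes NA, so it is worth counting how many rows are affected.

```r
# Hypothetical sample of the Year column as read from the csv (character type)
year_raw <- c("2006", "1985", "N/A", "2008")

# as.numeric() turns unparseable strings into NA and warns about it;
# suppressWarnings() is used here only to keep the output quiet
year_num <- suppressWarnings(as.numeric(year_raw))

# Counts how many values could not be converted
n_missing <- sum(is.na(year_num))
n_missing
```

Running the same count on data$Year after the conversion would show how many games have no usable year.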
#Finds the top 25 games in the data set
clean1 <- data |> arrange(Rank) |>
  head(n = 25)
head(clean1)
# A tibble: 6 × 11
   Rank Name           Platform  Year Genre Publisher NA_Sales EU_Sales JP_Sales
  <dbl> <chr>          <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
1     1 Wii Sports     Wii       2006 Spor… Nintendo      41.5    29.0      3.77
2     2 Super Mario B… NES       1985 Plat… Nintendo      29.1     3.58     6.81
3     3 Mario Kart Wii Wii       2008 Raci… Nintendo      15.8    12.9      3.79
4     4 Wii Sports Re… Wii       2009 Spor… Nintendo      15.8    11.0      3.28
5     5 Pokemon Red/P… GB        1996 Role… Nintendo      11.3     8.89    10.2 
6     6 Tetris         GB        1989 Puzz… Nintendo      23.2     2.26     4.22
# ℹ 2 more variables: Other_Sales <dbl>, Global_Sales <dbl>
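The same kind of top-N selection can also be sketched with dplyr's slice_min(), which keeps the rows with the smallest Rank directly. The toy tibble below is an assumption for illustration, and it keeps three rows instead of 25.

```r
library(dplyr)

# Toy data standing in for vgsales.csv (hypothetical values)
toy <- tibble(
  Rank = c(3, 1, 5, 2, 4),
  Name = c("C", "A", "E", "B", "D")
)

# slice_min() keeps the n rows with the lowest Rank
top3 <- toy |> slice_min(Rank, n = 3)
top3$Name
```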

Graph 1

#Creates a graph using the data from clean1
graph1 <- clean1 |>
#Uses mutate on the global sales variable (multiplying by 1 leaves the values, in millions, unchanged)
  mutate(Global_Sales = Global_Sales * 1) |>
#Creates a scatterplot with an x axis, y axis, legend, and declares the variables for the tooltips that will be used later
  ggplot(aes(x = Year, y = Global_Sales, color = Platform, text = paste("Game: ", Name,
                                                                        "\nNumber of Sales (in Millions): ", Global_Sales, 
                                                                        "\nPlatform: ", Platform,
                                                                        "\nYear: ", Year))) +
#Changes the color palette to Dark2
  scale_color_brewer(palette = "Dark2") +
#Changes the name of the axis, legend, and creates a caption for the source
  labs(x = "Year",
       y = "Total Number of Global Sales (in Millions)",
       color = "Game Platform",
       caption = "Source: Data is scraped from vgchartz.com") +
#Changes the theme to minimal
  theme_minimal(base_size = 12) +
  geom_point() +
#Adds a title for the graph
  ggtitle("Number of Global Sales of Games per Year")
#Calls graph to be displayed
graph1

In this first graph, I tried to display the games with the most global sales over all the years. I also added the platform as the legend to see which platforms had the highest selling games.

Adds Interactivity to Graph 1

library(plotly)
Warning: package 'plotly' was built under R version 4.3.3

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
ggplotly(graph1, tooltip = "text")

This is the same graph as graph 1. However, it is now interactive, so we can hover over the points to see which games are the highest or lowest selling.

Correlation Between Year and Global Sales

cor(clean1$Year, clean1$Global_Sales)
[1] 0.003914364
#Gets the data for the linear regression equation
fit1 <- lm(Global_Sales ~ Year, data = clean1)
summary(fit1)

Call:
lm(formula = Global_Sales ~ Year, data = clean1)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.914  -6.385  -3.947   2.923  55.653 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.557e+01  6.123e+02   0.025    0.980
Year        5.742e-03  3.059e-01   0.019    0.985

Residual standard error: 13.48 on 23 degrees of freedom
Multiple R-squared:  1.532e-05, Adjusted R-squared:  -0.04346 
F-statistic: 0.0003524 on 1 and 23 DF,  p-value: 0.9852

Linear Regression Equation

Global_Sales = 0.00574(Year) + 15.57

Diagnosis of the Correlation

Based on the p-value, plots, and other regression output, the number of global sales is not meaningfully correlated with the release year of the games. As seen above, the correlation is about 0.004 and the p-value is 0.9852, which is far too big to be significant. The adjusted R-squared value is also about -0.04, so year explains essentially none of the variation in global sales. In conclusion, for the top 25 games, the relationship between year and global sales is not strong.
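As a check on the numbers above, the short sketch below plugs the coefficients from the summary(fit1) output into the fitted line and squares the reported correlation. The coefficient and correlation values are copied from the output; the year 2006 is only an example input.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 1.557e+01
slope     <- 5.742e-03

# Predicted global sales (in millions) for a game released in 2006
predicted_2006 <- intercept + slope * 2006
predicted_2006

# For a regression with one predictor, R-squared equals the squared correlation
r <- 0.003914364
r_squared <- r^2
r_squared
```

The prediction barely changes from one year to the next because the slope is so close to zero, which fits with the very small R-squared.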

#Filters for the top 50 games that are from the years 2000-2016 and from the publishers Nintendo, Microsoft Game Studios, Activision, or Ubisoft
clean2 <- data |> filter(Year >= 2000 & Year <= 2016) |>
  filter(Publisher == "Nintendo" | Publisher == "Microsoft Game Studios" | Publisher == "Activision" | Publisher == "Ubisoft") |>
  arrange(Rank) |>
  head(n = 50)
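A long chain of == comparisons joined with | can also be written with %in%. The sketch below shows the idea on a small made-up tibble (the rows and the name keep are assumptions for illustration, not part of the project data).

```r
library(dplyr)

# Toy data standing in for vgsales.csv (hypothetical values)
toy <- tibble(
  Name      = c("A", "B", "C", "D"),
  Publisher = c("Nintendo", "Sega", "Ubisoft", "Activision"),
  Year      = c(2006, 2003, 1998, 2010)
)

# Publishers to keep, stored once so the list is easy to change
keep <- c("Nintendo", "Microsoft Game Studios", "Activision", "Ubisoft")

# %in% replaces the chain of == comparisons joined with |
filtered <- toy |>
  filter(Year >= 2000, Year <= 2016, Publisher %in% keep)
filtered$Name
```

Here game B is dropped because of its publisher and game C is dropped because of its year.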

Graph 2 (Highcharter)

#Sets the color to the color palette "Dark2"
cols <- brewer.pal(7, "Dark2")
#Uses highcharter to create a graph
chart <- highchart() |>
#Defines the x axis, y axis, and legend
  hc_add_series(data = clean2,
                   type = "line",
                   hcaes(x = Global_Sales,
                   y = Year, 
                   group = Genre)) |>
#Determines the color of the graph
  hc_colors(cols) |>
#Renames the x axis
  hc_xAxis(title = list(text="Total Number of Global Sales (in Millions)")) |>
#Renames the y axis
  hc_yAxis(title = list(text="Year")) |>
#Makes the points circles while hovering over the graph
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) |>
#Dictates where the legend is located and how it is displayed
  hc_legend(align = "right", 
            verticalAlign = "middle",
            layout = "vertical") |>
#Enables the mouse over functions
  hc_tooltip(shared = TRUE,
             borderColor = "green",
             pointFormat = "{point.Name}: {point.Global_Sales:.2f}<br>") |>
#Adds a title to the graph
  hc_title(
  text = "Global Sales of Each Genre by Year",
  margin = 20,
  align = "left"
  )
#Calls the graph to be displayed
chart

In this highcharter graph I tried to show the global sales of each genre over the years. The graph is sideways, with sales on the x axis and year on the y axis, because it looks more aesthetically pleasing this way, but I do know that this makes the data hard to read.
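One possible way to make the chart easier to read is to add up the sales per genre and year first and then put Year on the x axis. The sketch below is only an illustration on made-up data: the tibble toy, the column Total_Sales, and the object chart_alt are hypothetical names, and it uses highcharter's hchart() shortcut instead of the hc_add_series() call above.

```r
library(dplyr)
library(highcharter)

# Toy data standing in for clean2 (hypothetical values)
toy <- tibble(
  Year         = c(2006, 2006, 2007, 2007, 2007),
  Genre        = c("Sports", "Racing", "Sports", "Sports", "Racing"),
  Global_Sales = c(10, 5, 3, 2, 4)
)

# Adds up the sales of all games in each genre for each year
totals <- toy |>
  group_by(Genre, Year) |>
  summarise(Total_Sales = sum(Global_Sales), .groups = "drop")

# Line chart with Year on the x axis and one line per genre
chart_alt <- hchart(totals, "line",
                    hcaes(x = Year, y = Total_Sales, group = Genre))
chart_alt
```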

#Filters for games from the years 2000-2015 that are on the platforms Wii, X360, or PS2 and are from the publishers Nintendo, Ubisoft, Sega, 505 Games, or Activision
clean3 <- data |> filter(Year > 1999 & Year < 2016) |>
  filter(Platform == "Wii" | Platform == "X360" | Platform == "PS2") |>
  filter(Publisher == "Nintendo" | Publisher == "Ubisoft" | Publisher == "Sega" | Publisher == "505 Games" | Publisher == "Activision")

Graph 3

#Creates a graph using the data from clean3
graph3 <- clean3 |>
  ggplot() +
#Creates a bar chart with a x axis, y axis, and legend
  geom_bar(aes(x = Publisher, y = NA_Sales, fill = Platform),
           position = "dodge", stat = "identity") +
#Changes the name for the axis, legend, title, and caption
  labs(x = "Game Publisher",
       y = "Total North America Sales (in Millions)",
       fill = "Game Platform",
       title = "North America Sales per Publisher v1",
       caption = "Source: Data is scraped from vgchartz.com") +
#Changes the theme to minimal
  theme_minimal(base_size = 8) +
#Changes the color palette to Set2
  scale_fill_brewer(palette = "Set2")
#Calls graph to be displayed
graph3

In this graph, I wanted to show the publishers with the most game sales in North America. I added the platform as the legend to compare each publisher’s sales on one platform with its sales on the other platforms.
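Since the y axis is labeled as total sales, one thing to watch is that clean3 has one row per game. As far as I understand, with stat = "identity" and position = "dodge", rows that share the same publisher and platform are drawn on top of each other, so a bar may show the largest single game instead of the total. A minimal sketch of adding up the sales first, using made-up data (the tibble toy and its values are assumptions for illustration):

```r
library(dplyr)

# Toy data standing in for clean3, one row per game (hypothetical values)
toy <- tibble(
  Publisher = c("Nintendo", "Nintendo", "Ubisoft", "Ubisoft", "Ubisoft"),
  Platform  = c("Wii", "Wii", "Wii", "X360", "X360"),
  NA_Sales  = c(41.5, 15.8, 1.2, 0.8, 0.5)
)

# Adds up North America sales for each publisher and platform pair
totals <- toy |>
  group_by(Publisher, Platform) |>
  summarise(NA_Sales = sum(NA_Sales), .groups = "drop")
totals
```

The summarised table has one row per bar, so it could be passed to the same geom_bar() call in place of the per-game data.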

Adds Interactivity to Graph 3

ggplotly(graph3)
#Same filtering as above but without Nintendo
clean4 <- data |> filter(Year > 1999 & Year < 2016) |>
  filter(Platform == "Wii" | Platform == "X360" | Platform == "PS2") |>
  filter(Publisher == "Ubisoft" | Publisher == "Sega" | Publisher == "505 Games" | Publisher == "Activision")
#Creates a graph using the data from clean4
graph4 <- clean4 |>
  ggplot() +
#Creates a bar chart with a x axis, y axis, and legend
  geom_bar(aes(x = Publisher, y = NA_Sales, fill = Platform),
           position = "dodge", stat = "identity") +
#Changes the name for the axis, legend, title, and caption
  labs(x = "Game Publisher",
       y = "Total North America Sales (in Millions)",
       fill = "Game Platform",
       title = "North America Sales per Publisher v2",
       caption = "Source: Data is scraped from vgchartz.com") +
#Changes the theme to minimal
  theme_minimal(base_size = 8) +
#Changes the color palette to Set2
  scale_fill_brewer(palette = "Set2")
#Calls graph to be displayed
graph4

This is the same graph as graph 3, except that I have now removed the publisher Nintendo. I did this because Nintendo was such a powerhouse and accounted for most of the highest-selling games. I removed it to get a better visual of how the smaller game publishers compare to one another.

Adds Interactivity to Graph 4

ggplotly(graph4)

End of Project Essay

After working on this project, I was very intrigued by the data set. I dug into it further to find out more about how accurate and how recent the data is. I found that the data was web scraped from another website, vgchartz.com. I chose this topic because I’ve always loved video games, and I thought it would be interesting to research considering how much data it has and how much potential the subject has. When I came across this data set, I was instantly interested in it because it had some of my favorite games from when I was growing up. Because of this interest, I researched articles on video games through the MC library catalog. One article I found, “Role of video games in improving health-related outcomes: a systematic review” by Brian Primack, stated that video games have been around for a long time and even provide some health benefits, including therapeutic uses. I found the work on this project very interesting. One of the biggest and most obvious trends I found in this data set is that Nintendo not only makes a lot of games, but is also one of the top-selling game publishers, if not the top-selling publisher.