Over time, video games have become an essential part of modern entertainment. As the medium has evolved, graphics, gameplay, storytelling, and much more have changed, affecting games' ratings and global sales. The data set, Video Game Sales 2024 from Kaggle (sourced from VGchartz), lists numerous video games along with each game's title, console, genre, publisher, developer, critic score, total sales, North America sales, Japan sales, Europe/Africa sales, other sales, release date, and last update as of 2024. I plan to investigate whether critic scores and total sales are correlated, and whether there are any noticeable trends in sales by genre or console. For this I will need the columns title, console, genre, critic_score, and total_sales.
library(tidyverse) #setting libraries
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
setwd("C:/Users/asman/Documents/data101")
videogames <- read_csv("vgchartz2024.csv") #Dataset
## Rows: 64016 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): img, title, console, genre, publisher, developer
## dbl (6): critic_score, total_sales, na_sales, jp_sales, pal_sales, other_sales
## date (2): release_date, last_update
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
videogames1 <- videogames |>
  select(title, console, genre, critic_score, total_sales) |> # keep only the columns needed
  filter(!is.na(critic_score)) |> # drop games with no critic score
  filter(!is.na(total_sales)) |> # drop games with no sales figure
  filter(total_sales > 3) # keep games with total sales above 3 million so the plots stay readable
Here I fit a linear regression to examine the relationship between critic scores and total sales. This addresses my first question and informs the visualizations for the second question.
linearmodel <- lm(critic_score ~ total_sales, data = videogames1) #equation
summary(linearmodel)
##
## Call:
## lm(formula = critic_score ~ total_sales, data = videogames1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0864 -0.3965 0.0968 0.6695 1.3429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.98147 0.12449 64.113 < 2e-16 ***
## total_sales 0.07511 0.01959 3.834 0.000168 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8824 on 201 degrees of freedom
## Multiple R-squared: 0.06816, Adjusted R-squared: 0.06353
## F-statistic: 14.7 on 1 and 201 DF, p-value: 0.0001683
The model has the equation: critic_score = 0.075(total_sales) + 7.98
The p-value is 0.0001683, which is less than alpha (0.05), so total sales is a statistically significant predictor of critic score in this model. The three asterisks next to the total_sales coefficient indicate the same thing: its p-value falls below 0.001. However, the adjusted R-squared is 0.064, meaning the model explains only about 6% of the variation in critic score across the 203 games that remain after filtering. In other words, about 94% of the variation is not explained by total sales, so the relationship is statistically significant but weak.
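To make the fitted equation concrete, the short sketch below plugs a sales value into it and recovers the correlation coefficient from the reported R-squared. The coefficients and R-squared are copied from the summary output above; the 10-million sales figure and the variable names are illustrative assumptions, not values taken from the dataset.

```r
# Coefficients copied from the summary(linearmodel) output
intercept <- 7.98147
slope     <- 0.07511

# Hypothetical game with total_sales = 10 (millions); illustrative value only
example_sales   <- 10
predicted_score <- intercept + slope * example_sales
predicted_score

# In simple linear regression, R-squared equals the squared Pearson correlation,
# and the sign of r matches the sign of the slope
r_squared <- 0.06816
r <- sign(slope) * sqrt(r_squared)
round(r, 3)
```

The predicted score comes out near 8.73, and the implied correlation is roughly 0.26, which is consistent with a weak positive relationship.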
linearplot <- ggplot(videogames1, aes(x = total_sales, y = critic_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "#344E41") + # fitted regression line
  labs(x = "Total Sales (in millions)",
       y = "Critic Score",
       title = "Linear Regression: Total Sales vs Critic Score") + # axis labels and title
  theme_classic() +
  theme(panel.background = element_rect(fill = "#A3B18A")) # background color
linearplot
## `geom_smooth()` using formula = 'y ~ x'
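The model and plot above treat critic score as the response and total sales as the predictor. If the question is read as whether critic scores predict sales, the roles can be swapped. The sketch below uses a small toy data frame with made-up values (not the VGchartz data) to show that, in simple linear regression, both directions give the same R-squared even though the slopes differ.

```r
# Toy data with made-up values; a stand-in for videogames1, not real results
toy <- data.frame(
  critic_score = c(7.5, 8.0, 8.2, 9.0, 9.3, 8.8),
  total_sales  = c(3.2, 4.1, 5.0, 12.4, 20.3, 6.7)
)

# Direction used in this report: critic score explained by sales
score_on_sales <- lm(critic_score ~ total_sales, data = toy)

# Swapped direction: sales explained by critic score
sales_on_score <- lm(total_sales ~ critic_score, data = toy)

# Both R-squared values equal the squared Pearson correlation
r2_a <- summary(score_on_sales)$r.squared
r2_b <- summary(sales_on_score)$r.squared
r2_c <- cor(toy$critic_score, toy$total_sales)^2
c(r2_a, r2_b, r2_c)
```

The strength of the linear relationship therefore does not depend on which variable is placed on the left of the formula, but the interpretation of the slope does.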
This simple plot shows which genres typically have the highest global sales.
simpleplot1 <- ggplot(videogames1, aes(x = genre, y = total_sales)) +
  geom_boxplot(fill = "#194569") + # boxplot to show distribution
  coord_flip() + # flip the axes
  labs(x = "Genre", y = "Total Sales (in millions)", title = "Global Sales by Genre") +
  theme_grey() + # grey theme
  theme(panel.background = element_rect(fill = "#B7D0E1")) # background color
simpleplot1
This shows that, among games with more than 3 million in sales, shooter and action games have the highest global sales, making them the most popular genres.
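A boxplot shows how sales are distributed within each genre rather than a single total, so a small summary table can help back up the comparison. The sketch below is one way to do that with dplyr; the toy data frame and its values are made up for illustration, and `videogames1` could be substituted for `toy_games`.

```r
library(dplyr)

# Toy data with made-up values, standing in for videogames1
toy_games <- data.frame(
  genre       = c("Shooter", "Shooter", "Action", "Action", "Sports"),
  total_sales = c(15.0, 5.0, 20.0, 4.0, 6.0)
)

# Count, median, and summed sales per genre, highest median first
genre_summary <- toy_games |>
  group_by(genre) |>
  summarise(
    n_games      = n(),
    median_sales = median(total_sales),
    sum_sales    = sum(total_sales)
  ) |>
  arrange(desc(median_sales))

genre_summary
```

Reporting the number of games next to the median and the sum makes it easier to tell whether a genre leads because of a few very large titles or because its typical game sells well.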
This simple plot shows which consoles typically have the highest global sales.
simpleplot2 <- ggplot(videogames1, aes(x = console, y = total_sales)) +
  geom_boxplot(color = "white", fill = "#B7D0E1") + # boxplot to show distribution
  coord_flip() + # flip the axes
  labs(x = "Console", y = "Total Sales (in millions)", title = "Global Sales by Console") +
  theme_dark() + # dark theme
  theme(panel.background = element_rect(fill = "#194569")) # background color
simpleplot2
This shows that, among games with more than 3 million in sales, the PS4, PS3, and Xbox360 have the highest total sales, making them the most popular consoles.
colors1 <- c("#d0cdc9", "#918981", "#9fa4a3", "#717b75", "#667579", "#4c5b6c", "#3a3a3a", "#080808", "#527896", "#2c434b", "#a7bcc3","#255045")
bygenre_plot <- highchart() |>
  hc_add_series(data = videogames1,
                type = "scatter",
                hcaes(x = total_sales,
                      y = critic_score,
                      group = genre,
                      size = total_sales)) |>
  hc_xAxis(title = list(text = "Total Sales (in millions)")) |>
  hc_yAxis(title = list(text = "Critic Scores")) |>
  hc_title(text = "Video Games: Total Sales vs Critic Scores by Genre") |>
  hc_caption(text = "Source: VGchartz") |> # source
  hc_chart(backgroundColor = "#A3B18A") |>
  hc_colors(colors1)
bygenre_plot
colors2 <- c("#31572c", "#4f772d", "#90a955","#a9c191", "#ecf39e", "#93a880", "#132a13","#505c45", "#d4f3b7", "#cdd6d1", "#cdcc9e","#e7ecd8")
byconsole_plot <- highchart() |>
  hc_add_series(data = videogames1,
                type = "scatter",
                hcaes(x = total_sales,
                      y = critic_score,
                      group = console,
                      size = total_sales)) |>
  hc_xAxis(title = list(text = "Total Sales (in millions)")) |>
  hc_yAxis(title = list(text = "Critic Scores")) |>
  hc_title(text = "Video Games: Total Sales vs Critic Scores by Console Type") |>
  hc_caption(text = "Source: VGchartz") |> # source
  hc_chart(backgroundColor = "#a7bcc3") |>
  hc_colors(colors2)
byconsole_plot
From my analysis, I found a statistically significant but weak positive relationship between critic scores and total sales (R-squared of about 0.07). In the visualizations, the best-selling games tend to have high critic scores. Grouping the data by genre and console suggests that, among games with more than 3 million in sales, the most popular genres are action and shooter and the most popular consoles are the PS3, PS4, and Xbox360. These findings could inform future decisions about which genres and consoles to target when developing and marketing new games. A natural next step would be to identify which specific games had the top sales.
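The follow-up idea of finding the top-selling games could be sketched as below. This is a minimal example under stated assumptions: the titles and sales values are hypothetical placeholders, and `videogames1` could be used in place of `toy_games`.

```r
library(dplyr)

# Toy data with hypothetical titles and made-up values, standing in for videogames1
toy_games <- data.frame(
  title       = c("Game A", "Game B", "Game C", "Game D", "Game E"),
  console     = c("PS4", "PS3", "X360", "PS4", "PS3"),
  total_sales = c(19.4, 7.2, 15.9, 3.5, 10.1)
)

# Keep the three rows with the highest total_sales, sorted from highest to lowest
top_sellers <- toy_games |>
  slice_max(total_sales, n = 3) |>
  select(title, console, total_sales)

top_sellers
```

Because the dataset lists each game once per console, grouping by title and summing sales first may be worth considering before ranking.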
Games Icon Image: Pinterest - https://www.pinterest.com/pin/games-icon--389068855312942314/
Dataset: Kaggle - https://www.kaggle.com/datasets/asaniczka/video-game-sales-2024?select=vgchartz-2024.csv
Source for Dataset: VGchartz - https://www.vgchartz.com/