VideoGamesDataSetProject1

Author

Ryan Nicholas

Introduction To Dataset

For my project I used a data set from Kaggle containing data on video games collected from Metacritic. The games were released between 1995 and January 2024. The data was collected by the Kaggle user “BERIDZEG45”. The data set has 9 variables: the title of the game, the release date, the developer, the publisher, the genre, the product rating (from E for Everyone to M), the average user score (from 1-10, gathered from Metacritic), the number of user reviews, and the platform the game was released on. In this project I want to explore two possible relationships. First, I would like to see how the review scores for the top video game genres have changed over the years. Second, on a separate graph and using linear regression, I would like to see how the number of reviews on Metacritic relates to the score of the game.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(lubridate)
library(htmltools)
Games <- read.csv("all_video_games.csv")

Data Filtering

Below I begin to clean the data. I want to add a new column for the release year, taken from the release date variable. I will also remove all rows that have an NA in user score or user ratings count.

I used the lubridate package in R to achieve this. With it I was able to make a column for just the year.

NewGames <- Games |>
  mutate(Release.Date = mdy(Release.Date),
         Release.Year = year(Release.Date))
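As a small illustration of what `mdy()` and `year()` do, here is a minimal sketch on a few made-up dates. The sample strings are hypothetical and are not taken from the data set; the exact date format in the CSV file is an assumption.

```r
library(lubridate)

# Hypothetical sample dates in month-day-year order (not from the data set)
sample_dates <- c("11/23/1998", "9/8/1999", "4/29/2008")

parsed <- mdy(sample_dates)   # character strings -> Date values
years  <- year(parsed)        # Date values -> numeric year

print(parsed)
print(years)

# Strings that mdy() cannot parse become NA, so it is worth counting them
sum(is.na(parsed))
```

Running the same `sum(is.na(...))` check on the real Release.Date column would show whether any dates failed to parse.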

As described earlier, below I filter out the NA values.

Games_NoNa <- NewGames |>
  filter(!is.na(User.Score) & !is.na(User.Ratings.Count))

Next I want to see which game genre has the most releases so I can narrow the data down further. I also want to see how much data there is for each genre in general.

Genre_Popularity <- table(Games_NoNa$Genres)
Sorted_Genre_Popularity <- sort(Genre_Popularity, decreasing = TRUE)
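The same kind of genre tally can also be done with dplyr's `count()`, which returns a data frame instead of a table. This is only a sketch on a made-up data frame; `toy_games` is a hypothetical stand-in for Games_NoNa.

```r
library(dplyr)

# Hypothetical stand-in for Games_NoNa with only a Genres column
toy_games <- data.frame(
  Genres = c("FPS", "JRPG", "FPS", "Action RPG", "FPS", "JRPG")
)

# One row per genre, with the most common genre first
genre_counts <- toy_games |>
  count(Genres, sort = TRUE)

genre_counts
```

A data frame result can be easier to filter or join later than the output of `table()`.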

For my linear regression I want to use a smaller subset of the data so that the plot is more readable, so I went with visual novel games.

Linear Regression

For this linear regression I wanted to see how the number of reviews (a measure of popularity) for visual novel games relates to the overall score of the game.

First I need to make a data set of just visual novel games.

VNGames<- Games_NoNa |>
  filter(Genres == "Visual Novel")

From here I will make a scatter plot for the linear regression.

p1 <- ggplot(VNGames, aes(x = User.Score, y = User.Ratings.Count)) +
 labs(title = "User Ratings Count vs User Score for\nVisual Novels",
 caption = "Source: Metacritic\nCollected By: BERIDZEG45 ") +
 xlab("User Review (1-10)") +
 ylab ("User Review Count") +
 theme_minimal(base_size = 12)
p1 + geom_point()

There are two outliers here, though: The House in Fata Morgana - Dreams of the Revenants Edition - and Doki Doki Literature Club!, so I will remove both of those.

VNGames2 <- VNGames[VNGames$Title != "The House in Fata Morgana - Dreams of the Revenants Edition -",]
VNGames2 <- VNGames2[VNGames2$Title != "Doki Doki Literature Club!",]
p2 <-ggplot(VNGames2, aes(x = User.Score, y = User.Ratings.Count)) +
 labs(title = "User Ratings Count vs User Score\nfor Visual Novels",
 caption = "Source: Metacritic\nCollected By: BERIDZEG45" ) +
 xlab("User Review (1-10)") +
 ylab ("User Review Count") +
 theme_minimal(base_size = 12) + xlim(0,10)+ ylim(0,400) +
  geom_point() + geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, linetype = "dotdash", linewidth = 0.3)

p2

Warning: Removed 28 rows containing missing values (`geom_smooth()`).

We can see there is a weak-to-moderate positive correlation.

cor(VNGames2$User.Score, VNGames2$User.Ratings.Count)

[1] 0.4232717

fit1 <- lm(User.Ratings.Count ~ User.Score, data=VNGames2)
summary(fit1)


Call:
lm(formula = User.Ratings.Count ~ User.Score, data = VNGames2)

Residuals:
    Min      1Q  Median      3Q     Max 
-119.45  -51.40  -25.77   23.17  276.82 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -177.761     54.305  -3.273   0.0015 ** 
User.Score    32.522      7.258   4.481 2.13e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 80.98 on 92 degrees of freedom
Multiple R-squared:  0.1792,    Adjusted R-squared:  0.1702 
F-statistic: 20.08 on 1 and 92 DF,  p-value: 2.133e-05

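I have a correlation of 0.423, which indicates a weak-to-moderate positive relationship: visual novels with higher user scores tend to have more user ratings. The slope is statistically significant (p = 2.13e-05), but the model only explains about 18% of the variation in ratings count (R-squared = 0.1792), so user score on its own is not a strong predictor.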

The equation for the linear regression is:

User Review Count = 32.522(Review Score) - 177.761
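To make the equation concrete, here is a short worked sketch using the coefficients copied from the regression output above. The user score of 8 is a hypothetical example value, not a game from the data set.

```r
# Coefficients copied from the summary(fit1) output above
intercept <- -177.761
slope     <- 32.522

# Predicted number of user ratings for a hypothetical game with a user score of 8
predicted_count <- intercept + slope * 8
predicted_count

# User score at which the fitted line crosses zero ratings
zero_crossing <- -intercept / slope
zero_crossing

# For a regression with one predictor, R-squared equals the squared correlation
r <- 0.4232717
r^2
```

The prediction for a score of 8 is about 82 ratings. Below a score of roughly 5.47 the line predicts a negative number of ratings, which cannot happen in practice, so the model should not be trusted for low scores. This is also a likely reason for the `geom_smooth()` warning above, since `ylim(0, 400)` drops the part of the fitted line that falls below zero.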

Visualization Of Average Genre Scores Throughout The Years

For my main visualization I want to compare the average score for the top 4 genres over the years. In order to do this I first need to filter the data down to the top 4 genres.

head(Sorted_Genre_Popularity)


  Action Adventure         Action RPG      2D Platformer                FPS 
               647                610                591                537 
Real-Time Strategy               JRPG 
               345                343

Top4Genres<- Games_NoNa |>
  filter(Genres %in% c("Action Adventure", "Action RPG", "2D Platformer", "FPS"))

Now I want to get the average review score for each year for each genre.

Top4GenresAverage <- Top4Genres |>
  group_by(Genres,Release.Year) |>
  summarise(Average_Review_Score = mean(User.Score))

`summarise()` has grouped output by 'Genres'. You can override using the
`.groups` argument.
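The grouping step can be checked by hand on a tiny example. This is a sketch on made-up data; `toy` and the `Games_In_Group` column are hypothetical additions for illustration. Setting `.groups = "drop"` returns an ungrouped result and avoids the message shown above.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the real data set
toy <- data.frame(
  Genres       = c("FPS", "FPS", "FPS", "JRPG", "JRPG"),
  Release.Year = c(2000, 2000, 2001, 2000, 2000),
  User.Score   = c(8, 6, 9, 7, 8)
)

toy_avg <- toy |>
  group_by(Genres, Release.Year) |>
  summarise(
    Average_Review_Score = mean(User.Score),  # mean score per genre and year
    Games_In_Group = n(),                     # how many games the mean is based on
    .groups = "drop"                          # return an ungrouped data frame
  )

toy_avg
```

Keeping a count next to each average makes it easier to spot years where the average is based on only one or two games.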

Now that that is done, I decided to make a scatter plot with connecting lines to display this information.

p3 <- ggplot(Top4GenresAverage, aes(x = Release.Year, y = Average_Review_Score, color = Genres)) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  xlim(1995, 2024) +
  ylim(5.5, 9) +
  theme_classic() +
  labs(title = "Genres of Video Games and Their \nUser Review Score Throughout the Years",
       x = "Years (1995-2024)", y = "Review Score (5.5-9)", caption = "Source: Metacritic\nCollected By: BERIDZEG45")
p3

Conclusion

To conclude, I will discuss how I cleaned up the data that was provided. I first used lubridate to convert the release date variable into a date and extracted just the year from it so I could use it later. I then filtered out any rows that had an NA in user score or user ratings count, which left me with data I could use. Next I counted the number of games in each genre to gauge each genre's popularity, which helped with both visualizations that I made. For my final visualization I computed the average review score for each year for the top 4 genres and graphed those.

My visualization above shows how the average review score for the top 4 genres of video games changed from 1995 to 2024. I find it interesting that none of the averages fell below 5 and none of them reached 10. I also find the change in the FPS genre's average review score interesting, as it starts out high but later takes a large dip.

I wish I could have explored this data set more, perhaps looking at other genres or at the average content rating per genre, as I feel both would have been worthwhile investigations.