For this project I used a data set from Kaggle containing video game data collected from Metacritic. The games were released between 1995 and January 2024, and the data was collected by Kaggle user “BERIDZEG45”. The data set has 9 variables: the title of the game, the release date, the developer, the publisher, the genre, the product rating (from E for Everyone to M), the average user score (from 1 to 10, gathered from Metacritic), the number of user reviews, and the platform the game was released on. In this project I want to explore two possible relationships. First, I would like to see how the review scores for the top video game genres have changed over the years. Second, on a separate graph, I would like to use linear regression to see how the number of reviews a game has on Metacritic relates to its score.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Below I begin to filter the data. I want to add a new column for release year that takes just the year from the release date variable. I will also remove every row that has an N/A in user score or user ratings count.
I found that the lubridate package in R can do this, and using it I was able to make a column for just the year.
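A minimal sketch of this cleaning step on toy data. The column name `Release.Date`, the month/day/year date format, and the object names `games` and `games_clean` are assumptions for illustration; the real data set may differ.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the Kaggle data (values are made up;
# Release.Date and its m/d/y format are assumptions).
games <- data.frame(
  Title = c("Game A", "Game B", "Game C"),
  Release.Date = c("11/23/1998", "3/3/2017", "1/18/2024"),
  User.Score = c(9.1, NA, 7.4),
  User.Ratings.Count = c(1200, 35, NA)
)

games_clean <- games %>%
  # Keep only rows where both the score and the ratings count are present.
  filter(!is.na(User.Score), !is.na(User.Ratings.Count)) %>%
  # Parse the date string with lubridate, then keep just the year.
  mutate(Release.Year = year(mdy(Release.Date)))

print(games_clean)
```

Only "Game A" survives here, because the other two rows are each missing one of the two values.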
Next I want to see which game genre has the most releases so I can narrow the data down further. I also just want to see how much data there is for each genre in general.
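One way to get these per-genre counts is dplyr's `count()`. This is a sketch on made-up rows; only the `Genres` column name comes from the report, and `games` and `genre_counts` are hypothetical names.

```r
library(dplyr)

# Toy data: six made-up games across three genres.
games <- data.frame(
  Title = paste("Game", 1:6),
  Genres = c("Action", "RPG", "Action", "Visual Novel", "Action", "RPG")
)

# Count games per genre, largest genre first.
genre_counts <- games %>%
  count(Genres, sort = TRUE, name = "Games")

print(genre_counts)
```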
For my linear regression I want to choose data with a smaller range, because I'd like my plot to be more readable, so I went with visual novel games.
Linear Regression
For this linear regression I wanted to see how the number of reviews (popularity) of visual novel games relates to the overall score of the game.
First I need to make a data set of just visual novel games.
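A sketch of that subsetting step, assuming the genre is stored in `Genres` with the label "Visual Novel" (the exact label in the data is an assumption) and that the cleaned data lives in a hypothetical `games_clean` data frame:

```r
library(dplyr)

# Toy cleaned data (made-up values).
games_clean <- data.frame(
  Title = c("VN One", "Shooter One", "VN Two"),
  Genres = c("Visual Novel", "FPS", "Visual Novel"),
  User.Score = c(8.2, 7.0, 6.5),
  User.Ratings.Count = c(40, 900, 12)
)

# Keep only the visual novel rows; "Visual Novel" is an assumed label.
VNGames <- games_clean %>%
  filter(Genres == "Visual Novel")

print(VNGames)
```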
There are two outliers here, though: The House in Fata Morgana - Dreams of the Revenants Edition - and Doki Doki Literature Club!, so I will remove both of those.
VNGames2 <- VNGames[VNGames$Title != "The House in Fata Morgana - Dreams of the Revenants Edition -", ]
VNGames2 <- VNGames2[VNGames2$Title != "Doki Doki Literature Club!", ]
p2 <- ggplot(VNGames2, aes(x = User.Score, y = User.Ratings.Count)) +
  labs(
    title = "User Ratings Count vs User Score\nfor Visual Novels",
    caption = "Source: Metacritic\nCollected By: BERIDZEG45"
  ) +
  xlab("User Review (1-10)") +
  ylab("User Review Count") +
  theme_minimal(base_size = 12) +
  xlim(0, 10) +
  ylim(0, 400) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dotdash", size = 0.3)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Call:
lm(formula = User.Ratings.Count ~ User.Score, data = VNGames2)
Residuals:
Min 1Q Median 3Q Max
-119.45 -51.40 -25.77 23.17 276.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -177.761 54.305 -3.273 0.0015 **
User.Score 32.522 7.258 4.481 2.13e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 80.98 on 92 degrees of freedom
Multiple R-squared: 0.1792, Adjusted R-squared: 0.1702
F-statistic: 20.08 on 1 and 92 DF, p-value: 2.133e-05
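The correlation is r = 0.423 (the square root of the R-squared value, 0.1792), which indicates a weak to moderate positive correlation: games with higher user scores tend to have somewhat more reviews, but user score explains only about 18% of the variation in review count.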
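For a regression with a single predictor, the correlation coefficient can be recovered from the summary output as the square root of R-squared, with the sign of the slope. A short check using the two numbers printed in the summary:

```r
# Values copied from the lm() summary output.
r_squared <- 0.1792   # Multiple R-squared
slope     <- 32.522   # coefficient on User.Score

# With one predictor, r has the slope's sign and magnitude sqrt(R^2).
r <- sign(slope) * sqrt(r_squared)

print(round(r, 3))
```

This rounds to 0.423, matching the correlation quoted in the text.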
The equation for the linear regression would be
User Review Count = 32.522(Review Score) - 177.761
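As a worked example of reading this equation, the sketch below plugs in a user score of 8. The function name `predict_count` is hypothetical; the coefficients are the ones from the summary output.

```r
# Fitted line from the regression summary.
predict_count <- function(score) {
  32.522 * score - 177.761
}

# Predicted review count for a game with a user score of 8.
pred_8 <- predict_count(8)
print(pred_8)

# Score at which the fitted line crosses zero reviews.
zero_score <- 177.761 / 32.522
print(round(zero_score, 2))
```

A score of 8 gives a predicted count of about 82 reviews. The line crosses zero at a score of about 5.47, so for lower scores the equation predicts a negative count, which is a reminder that the line is only a rough summary of the data.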
Visualization Of Average Genre Scores Throughout The Years
For my actual visualization I want to compare the average score for the top 4 genres throughout the years. In order to do this I first need to filter the data down to the top 4 genres.
`summarise()` has grouped output by 'Genres'. You can override using the
`.groups` argument.
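A sketch of how this filter-and-average step could look, on toy data. The names `Top4GenresAverage`, `Genres`, `Release.Year`, and `Average_Review_Score` come from the plotting code in this report; `games_clean`, `n_top`, and `top_genres` are hypothetical, and the toy example keeps the top 2 genres instead of 4 so it stays small.

```r
library(dplyr)

# Toy cleaned data (made-up values).
games_clean <- data.frame(
  Genres = c("Action", "Action", "Action", "RPG", "RPG", "Puzzle"),
  Release.Year = c(2000, 2000, 2001, 2000, 2001, 2000),
  User.Score = c(8, 6, 7, 9, 5, 4)
)

n_top <- 2  # would be 4 for the real data

# Names of the genres with the most games.
top_genres <- games_clean %>%
  count(Genres, sort = TRUE) %>%
  slice_head(n = n_top) %>%
  pull(Genres)

# Average user score per genre per year, for the top genres only.
Top4GenresAverage <- games_clean %>%
  filter(Genres %in% top_genres) %>%
  group_by(Genres, Release.Year) %>%
  summarise(Average_Review_Score = mean(User.Score), .groups = "drop")

print(Top4GenresAverage)
```

Setting `.groups = "drop"` returns an ungrouped result and suppresses the `summarise()` grouping message.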
Now that we are done with that, I decided to make a scatter plot with connecting lines to display this information.
p3 <- ggplot(Top4GenresAverage, aes(x = Release.Year, y = Average_Review_Score, color = Genres)) +
  geom_point() +
  geom_line() +
  scale_color_brewer(palette = "Dark2") +
  xlim(1995, 2024) +
  ylim(5.5, 9) +
  theme_classic() +
  labs(
    title = "Genres of Video Games and Their \nUser Review Score Throughout the Years.",
    x = "Years (1995-2024)",
    y = "Review Score (5.5-9)",
    caption = "Source: Metacritic\nCollected By: BERIDZEG45"
  )
p3
Conclusion
To conclude, I will discuss how I cleaned up the data that was provided. I first filtered out any row that had an N/A in user score or in user ratings count, which left me with data I could use. I then used lubridate to convert the release date variable into a date and extracted just the year from it so I could use it later. Next I grouped the data by genre and gauged the popularity of each genre by the number of games in it, which helped me with both visualizations that I made. For my final visualization I computed the average review score for each year for the top 4 genres and graphed those.
My visualization above shows how the average review score for the top 4 genres of video games changed from 1995 to 2024. I find it interesting that none of the averages fell below 5 and none of them reached 10. I also find the change in the FPS genre's average review score interesting, as it starts out high but later takes a massive dip.
I wish I could have explored this data set more, perhaps by looking at other genres or at the average content rating for each genre, as I feel both would have been stellar investigations.