Final Project

Author

Asma Abbas

(source: Media Play News)

Introduction:

The topic of my final project is video games, and my dataset goes by the same name. I found it on the CORGIS website, where it was compiled by Austin Cory Bart. All of the data comes from crowdsourced entries on a website called “How Long to Beat,” which holds a large amount of information on games and how long they take to complete. Here are the variables I’ll be looking at in this exploration:

Features.Multiplatform?: Whether or not a game is accessible across different platforms.

Features.Online?: Whether or not the game has online play.

Metadata.Genres: The genre of the video game.

Metrics.Review Score: The general review score for the game.

Release.Year: The year of the game’s release.

Length.Main Story.Average: The average playtime of the game’s main story.

I chose this dataset because video games are super nostalgic to me. The dataset consists of games from the early 2000s and their sales, so I thought it would be fun to look at. My older brother was in his teens during this time, so I thought it would be great to examine the statistics of games that resonated with him (and now me!) so much. On top of that, I want to see which genres took over the gaming industry and which were barely on the radar at the time (cozy games, dating sims). Overall, it’s just a fun topic.

Load necessary libraries

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Warning: package 'plotly' was built under R version 4.4.3

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(dplyr)
library(ggplot2)
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.4.2
setwd("C:/Users/Saima Abbas/Downloads")
games <- read_csv("video_games.csv")
Rows: 1212 Columns: 36
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Title, Metadata.Genres, Metadata.Publishers, Release.Console, Rele...
dbl (25): Features.Max Players, Metrics.Review Score, Metrics.Sales, Metrics...
lgl  (6): Features.Handheld?, Features.Multiplatform?, Features.Online?, Met...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the dataset

I want to look at sales over the years by genre, so I’ll begin cleaning the dataset.

games_clean <- games |> 
  select(
    Title,
    `Features.Multiplatform?`,
    `Features.Online?`,
    `Metadata.Genres`,
    `Metrics.Review Score`,
    `Metrics.Sales`,
    `Release.Year`,
    `Length.Main Story.Average`)|>
  filter(
    !is.na(`Metrics.Review Score`) & 
    !is.na(`Metrics.Sales`) & 
    !is.na(`Length.Main Story.Average`)
  )

In this chunk, I created a new dataframe that cleans up the data. I used the select function to keep only the variables I was interested in, like what the video games feature, their genre, sales, and more. Then I filtered out rows with missing values in the numerical variables so that they don’t hinder the visualizations. is.na() flags missing (NA) values, not 0s, so !is.na() keeps only the rows where a value is actually present. (Pretty sure I’ve used this before, but I did look online to confirm how is.na works, and it will be cited below!)
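Here is a small sketch of what is.na() does, using a made-up vector rather than the real dataset (the name `sales` and its values are only for illustration). It flags missing entries, and a genuine 0 is not treated as missing:

```r
# Toy vector: one genuine zero and one missing value
sales <- c(1.2, 0, NA, 3.4)

# TRUE only where the value is missing
is.na(sales)

# Keep the non-missing entries; the 0 stays, the NA is dropped
kept <- sales[!is.na(sales)]
kept
```

The same logic is what filter(!is.na(...)) applies row by row inside a dataframe.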

games_clean <- games_clean |> 
  mutate(Decade = floor(Release.Year / 10) * 10)

Then, to group release years together, I used mutate to create a new variable called “Decade.” Dividing the year by 10, applying the floor function, and multiplying by 10 rounds each year down to the start of its decade. (Found online, will also be cited down below!)
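To see how the decade calculation behaves, here is a quick sketch on a few made-up years (they are not pulled from the dataset):

```r
# Hypothetical release years, chosen to cross decade boundaries
years <- c(1999, 2004, 2008, 2010)

# Divide by 10, drop the decimal part, multiply back by 10
decade <- floor(years / 10) * 10
decade
```

Every year gets rounded down, so 1999 lands in 1990 while 2010 starts a new decade.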

Now that the data is clean, let’s move on to statistical analysis.

Statistical Analysis

We’re gonna fit a linear regression model. Essentially, it asks whether we can predict a game’s sales based on how long its main story is and how good its review score is.

fit <- lm(`Metrics.Sales` ~ `Metrics.Review Score` + `Length.Main Story.Average`, data = games)
summary(fit)

Call:
lm(formula = Metrics.Sales ~ `Metrics.Review Score` + `Length.Main Story.Average`, 
    data = games)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0213 -0.4316 -0.1953  0.1022 14.4353 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -1.156769   0.159611  -7.247 7.55e-13 ***
`Metrics.Review Score`       0.023418   0.002356   9.938  < 2e-16 ***
`Length.Main Story.Average`  0.005683   0.003150   1.804   0.0715 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.021 on 1209 degrees of freedom
Multiple R-squared:  0.09117,   Adjusted R-squared:  0.08967 
F-statistic: 60.64 on 2 and 1209 DF,  p-value: < 2.2e-16

Model equation: Predicted Sales = −1.1568 + 0.0234 ⋅ (Review Score) + 0.0057 ⋅ (Main Story Length)

Through the p-values, we can see that the review score is highly significant (p < 2e-16), whereas main story length is only marginally significant (p ≈ 0.07, above the usual 0.05 cutoff). However, the low R-squared (about 0.09) shows that most of the variation in game sales comes from factors not included in this model, things like marketing, franchise popularity, and more. So, based on the results, we can say that a game’s review score has a clear positive association with its sales (I guess that’s pretty obvious), while the length of the game’s story has at most a small effect that this model can’t firmly establish. Other factors play a much bigger role in how well a game sells.
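As a worked example of the model equation, here is a sketch that plugs in a hypothetical game. The coefficients are copied from the summary(fit) output; the review score of 90 and main story length of 10 are made up, and I am assuming sales are in millions and story length is in hours:

```r
# Coefficients copied from the regression summary
b0       <- -1.156769   # intercept
b_score  <-  0.023418   # per review-score point
b_length <-  0.005683   # per unit of main story length

# Hypothetical game: review score 90, main story length 10
predicted_sales <- b0 + b_score * 90 + b_length * 10
round(predicted_sales, 2)
```

With the fitted model object available, predict(fit, newdata = ...) would do the same arithmetic without copying coefficients by hand.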

Statistical plots:

autoplot(fit)

Based on the plots, the curve in the residuals vs fitted plot points to a non-linear relationship. The Q-Q plot curves upward at the top end, indicating that the residuals are right-skewed, with some large outliers. The scale-location plot shows the spread of the residuals changing across the fitted values, which suggests non-constant variance. The last plot shows some influential points: the games in rows 24 and 159 have more influence on the fit than the others.
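One common way to rank influential points numerically is Cook's distance. The sketch below uses R's built-in mtcars data as a stand-in, since it only illustrates the idea; `toy_fit` is a hypothetical model, not the games model:

```r
# Stand-in model on built-in data (not the video game data)
toy_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Cook's distance: one value per observation, larger = more influence
cooks <- cooks.distance(toy_fit)

# The three most influential observations, largest first
top_influential <- head(sort(cooks, decreasing = TRUE), 3)
top_influential
```

Running the same two steps on the games model would list which rows pull hardest on the regression line.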

Visualization 1

Interactive!

I wanted to see game sales compared to their scores, across different video game genres. For this, I thought a scatterplot would be optimal.

p1 <- ggplot(games_clean, aes(x = `Metrics.Review Score`, y = `Metrics.Sales`, color = `Metadata.Genres`)) +
  geom_point(size = 3, alpha = 0.9) +
  labs(
    title = "Game Review Score vs Sales by Genre",
    subtitle = "Exploring how critical reception relates to commercial success",
    x = "Review Score",
    y = "Sales (millions)",
    caption = "Source: Video Game Dataset\nby Austin Cory Bart"
  ) +
  scale_color_brewer(palette = "Set3") +
  theme_minimal(base_size = 15) +
 theme(
    legend.key.size = unit(0.2, "cm"),      
    legend.text = element_text(size = 5),   
    legend.title = element_text(size = 5))
# Label the top seller; the stray unnamed argument was being read as xmin
p1 <- p1 + annotate("text", x = 85, y = 10.7, label = "Top seller", color = "black")
ggplotly(p1)
Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set3 is 12
Returning the palette you asked for with that many colors

In this chunk, I followed standard procedure to create a scatterplot. I selected the variables for the x and y axes, and then used geom_point to adjust the size and transparency of the points. Then I labeled each axis and added the source as well. From there I picked out a color palette (the one with the most distinct colors) and adjusted the legend for better readability. Next, I added an annotation to label the best seller. Finally, I used plotly to make it interactive, with a tooltip showing information on each point, like exact score and sales amount.
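The annotation position can also be looked up from the data instead of being typed in by hand. This is only a sketch on a made-up dataframe; `toy` and its column names are hypothetical stand-ins for the real columns:

```r
# Made-up rows standing in for the games data
toy <- data.frame(
  title = c("Game A", "Game B", "Game C"),
  score = c(72, 89, 95),
  sales = c(1.4, 10.3, 4.2)
)

# Row with the highest sales
top <- toy[which.max(toy$sales), ]
top

# Its coordinates could then feed the label, for example:
# annotate("text", x = top$score, y = top$sales + 0.4, label = "Top seller")
```

This keeps the label attached to the right point even if the data or filters change.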

This graph displays the relationship between a video game’s review score and its sales across different genres. The two have a positive relationship, which supports the idea that a game’s review score is linked to its sales. The most common genre by far is action, followed by blooms of pink, which are the racing and driving games. Other genres that can be seen are action adventure and a mix of the action and racing genres. I suppose this shows that action was the most popular, but I also believe that many games can fall under ‘action’ because it is quite the umbrella term. The top seller is a game that received 10.3 million in sales and a score of 89.

Visualization 2: Bar Graph

racing_games <- games_clean |>
  filter(`Metrics.Review Score` >= 85) |>
  filter(str_detect(`Metadata.Genres`, regex("racing|driving", ignore_case = TRUE))) |>
  filter(!is.na(`Metrics.Sales`) & !is.na(`Metrics.Review Score`))

Here I made a new dataframe to go with my second visualization. I filtered for games with a score greater than or equal to 85, and then kept the ones whose genre mentions racing or driving. I then filtered out the missing values.
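To show how the genre pattern matches, here is a sketch using base R's grepl() as a stand-in for str_detect(); for this pattern the two should behave the same way. The genre strings are made up for illustration and may not match the dataset's exact labels:

```r
# Hypothetical genre strings
genres <- c("Racing / Driving", "Action", "Sports,Racing / Driving", "Role-Playing (RPG)")

# "racing|driving" matches either word; ignore.case makes it case-insensitive
is_racing <- grepl("racing|driving", genres, ignore.case = TRUE)

# Only the genres that mention racing or driving
genres[is_racing]
```

Because the pattern matches anywhere in the string, games with mixed genres are kept too.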

top_score_game <- racing_games |> 
  filter(`Metrics.Review Score` == max(`Metrics.Review Score`, na.rm = TRUE))

Then, I made a dataframe to set apart the game with the highest score. (This was made so that the annotation would work.)

# Plot
ggplot(racing_games, aes(x = reorder(Title, `Metrics.Sales`), y = `Metrics.Sales`, fill = `Metrics.Review Score`)) +
  geom_bar(stat = "identity")+
  coord_flip() +
  labs(
    title = "Sales of Racing/Driving Video Games",
    subtitle = "Colored by Review Score",
    x = "Game Title",
    y = "Sales (millions)",
    fill = "Review Score",
    caption = "Source: Video Game Dataset By Austin Cory Bart"
  ) +
  scale_fill_gradient(low = "#00a1ff", high = "#00ff8f") +
  theme_light(base_size = 13) +
  annotate("text",
    x = top_score_game$Title,
    y = top_score_game$`Metrics.Sales` + 0.5,
    label = "Top Score",
    size = 3,
    color = "white"
  )

In this chunk, I followed the standard procedure of creating a barplot. First I created a data frame that isolated the necessary components, and then I moved on to the actual graph. I selected the values for each axis and the fill, and then used coord_flip to make the titles easy to read. Then I labeled each axis and picked out the colors for the graph. I selected a theme, and then used annotate to add an annotation.

In this graph, you can see the sales of different racing/driving video games, as well as the scores they received. Mario Kart DS outsold GTA; however, GTA received a higher score. The game with the lowest sales is Ridge Racer. I think it’s interesting to observe the sales of these games, considering that racing and driving games were the second most popular genre, behind action games. I wonder how they would compare to the new generation of these games (GTA 6, Mario Kart World, Forza Horizon 5).

Another one (different genre)

actionRPG_games <- games_clean |>
  filter(`Metrics.Review Score` >= 93) |>
  filter(str_detect(`Metadata.Genres`, regex("Action", ignore_case = TRUE))) |>
  filter(!is.na(`Metrics.Sales`) & !is.na(`Metrics.Review Score`))

Filtering for action games. I messed around with the review score cutoff until the graph became reasonable, since too many games caused too much clutter. Then I selected the genre and filtered out any missing values.

ggplot(actionRPG_games, aes(x = reorder(Title, `Metrics.Sales`), y = `Metrics.Sales`, fill = `Metrics.Review Score`)) +
  geom_bar(stat = "identity")+
  coord_flip() +
  labs(
    title = "Sales of Action RPG Video Games",
    subtitle = "Colored by Review Score",
    x = "Game Title",
    y = "Sales (millions)",
    fill = "Review Score",
    caption = "Source: Video Game Dataset"
  ) +
  scale_fill_gradient(low = "#00ddff", high = "#ff00d4") +
  theme_light(base_size = 13)

In this chunk, I followed the exact same steps as I did with the previous graph. First I created a data frame to isolate the important data for this graph. After that, I used it to fill in the values for each axis on the bar graph. Then I labeled each axis, added a caption, and picked out the appropriate colors for the graph.

In this graph, you can see the varying sales and scores of video games whose genre includes “Action” (the filter matches any genre containing that word, not only action RPGs). The best seller, which also has the highest score, is GTA IV. A close second in score is Super Mario Galaxy, but the runner-up in sales is Call of Duty 4: Modern Warfare. I’m shocked by the lower sales of Fallout 3 and Oblivion, but I’m assuming that perhaps those games aged better than how they were received at release. The lowest-selling game, with a surprisingly higher score than most, is The Orange Box.

Final notes, and outside research:

In each visualization, there seem to be a couple of patterns. Of course, most of the games with the best scores ended up being some of the top sellers. However, there are other patterns to observe as well. In the racing category, it seems that the most family-oriented game, and the one paired with a new console (the DS), did the best in sales. However, the games that hold the highest scores tend to be those geared towards adults. This is present in all of the graphs, with action being the dominant genre and GTA being both the highest-scored driving game and the highest-rated action/RPG game. I was sad to see that Super Smash Bros. Brawl was rated so low; it’s one of my favorite games of all time, as is Fallout 3. If there was anything else I’d like to observe, or do differently, I’d want to see the timeline for each of these games. I also wanted to explore even more genres and do an in-depth look at each action-related sub-genre. Since most of the action games seemed to do well, though, I wanted to see why it was such a popular genre.

In each of the visualizations, it appears that the most action-oriented games tend to do the best in terms of scores. The louder, more visually captivating, and more stimulating a game is, the better it tends to do. Anything that stimulates the player more engages them more deeply in the game. On top of that, action games like GTA have more immersive features. You get to explore real locations and interact with NPCs and characters that could be real people. That’s where the next factor comes in: the freedom element. Games with more freedom, more “escape,” tend to do better. The setting, action, appearance, and more are all left up to the player. This leads to a game having hours and hours of engaging content, with the free will allowed in the game offering a better “escape” for players. Action games tend to do better because not only are they more stimulating, but they’re also engaging and interactive to play with others. On top of that, they often replicate insane things that can’t usually be done in real life. The more wild, the more fun.

Sources that were used to help:

What is is.na() function in R?. Educative. (n.d.). https://www.educative.io/answers/what-is-isna-function-in-r

Laksh_28, Laksh_ MindsMapped. https://www.mindsmapped.com/numeric-functions-in-r-programming/

Chambers, K. (2018, February 6). ggplot basics: Labels and annotations. 36 Chambers – The Legendary Journeys: Execution to the max. https://36chambers.wordpress.com/2018/02/06/ggplot-basics-labels-and-annotations/

Sources for research:

Granic, I., Lobel, A., & Engels, R. C. M. E. (2014). The benefits of playing video games. In B. J. Bushman (Ed.), Oxford handbook of media psychology (pp. 155–170). Oxford University Press. https://books.google.com/books?hl=en&lr=&id=foE8AnfQmZoC&oi=fnd&pg=PA155

Green, C. S., & Bavelier, D. (2012). Learning, attentional control, and action video game play. Current Biology, 22(6), R197–R206. https://www.cell.com/current-biology/fulltext/S0960-9822(12)00130-3