About Dataset

This dataset contains playtime statistics for video games, ranging from main story only, to main story plus side missions, to completionist (100%) runs, along with columns such as release_year, publisher, and release_month. Here is the link: https://www.kaggle.com/datasets/b4n4n4p0wer/how-long-to-beat-video-game-playtime-dataset?select=hltb_dataset.csv In this assignment, my hypothesis is that video game playtimes have increased over the years, either because players have become more immersed and play for longer or because the games themselves take longer to complete. My scope runs from the first video games of the 1970s to the latest releases of 2025.

library("ggplot2")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library("tidyr")

Question 1

Load your CSV dataset into R as instructed in the previous assignment. Select columns to plot together to visualize your data. Plot a graph out of any of the ones we reviewed in class. Make sure the axes and lines are annotated and that the graph has a title. Briefly explain what your graph shows. Show the R code that resulted in the graph.

data <- read.csv("hltb_dataset_filtered.csv") 

# Bars are stacked per year, so each bar's height is the total of main_story for that year
ggplot(data, aes(x = release_year, y = main_story)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Playtime Statistics of Videogames by Main Story Per Release Year",
       x = "Year",
       y = "Total Main Story Completion Time (hours)") +
  theme_minimal()

## Warning: Removed 10988 rows containing missing values or values outside the scale range
## (`geom_bar()`).

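My graph shows main-story playtime by release year across the whole history of video games. Because geom_bar(stat = "identity") stacks every game released in the same year, each bar is the total main-story hours for that year. The bars grow over time, which is consistent with video games becoming longer, although the totals also grow with the number of games released each year.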
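To make clear what the bars represent, here is a minimal sketch with a made-up data frame (`toy` and its values are hypothetical, not taken from the dataset). With `stat = "identity"`, ggplot2 stacks every row that shares an x value, so the bar height matches the per-year sum, which can be reproduced with `aggregate()`:

```r
# Hypothetical toy data: three short games in 2000, one longer game in 2001
toy <- data.frame(
  release_year = c(2000, 2000, 2000, 2001),
  main_story   = c(5, 5, 5, 12)
)

# Height of each stacked bar = sum of main_story within the year
yearly_total <- aggregate(main_story ~ release_year, data = toy, FUN = sum)

# Typical length of a game in each year = mean of main_story within the year
yearly_mean <- aggregate(main_story ~ release_year, data = toy, FUN = mean)

print(yearly_total)
print(yearly_mean)
```

In this toy example the total is larger for 2000 even though the 2001 game is longer, which is why a per-year mean is a more direct check of the hypothesis than the stacked totals.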

Question 2

Do a simple statistical calculation (e.g. mean, standard deviation, mode, median, etc.) with R that aligns with your hypothesis and plot/report results. Explain what the result means in terms of your question.

data_summary <- data %>%
  group_by(release_year) %>%
  summarize(mean_playtime = mean(main_story, na.rm = TRUE))

# Plot mean per year
ggplot(data_summary, aes(x = release_year, y = mean_playtime)) +
  geom_line(color = "blue", linewidth = 1) +  # linewidth replaces the deprecated size for lines
  geom_point(color = "red", size = 2) +
  labs(title = "Mean Playtime per Year",
       x = "Release Year",
       y = "Mean Playtime (hours)") +
  theme_minimal()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

The result shows that the mean main-story completion time generally rises with release year. This could mean either that players are spending more time in each game or that video games have become longer to complete.
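A yearly mean can be driven by a handful of very long games, and early years may contain only a few titles. As a hedged sketch of how that could be checked, the summary can also report the number of games and the median per year. The data frame `toy` and the column names `n_games` and `median_playtime` are hypothetical and only illustrate the idea:

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  release_year = c(1980, 1980, 2020, 2020, 2020, 2020),
  main_story   = c(1, NA, 8, 10, 12, 50)
)

toy_summary <- toy %>%
  group_by(release_year) %>%
  summarize(
    n_games         = sum(!is.na(main_story)),           # games with a recorded time
    mean_playtime   = mean(main_story, na.rm = TRUE),    # sensitive to very long games
    median_playtime = median(main_story, na.rm = TRUE)   # robust to very long games
  )

print(toy_summary)
```

In the toy data the single 50-hour game pulls the 2020 mean well above the median, and the 1980 mean rests on one game, so both columns help judge how much weight each yearly point deserves.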

Question 3

We will apply statistical tests to your dataset to gain insight in answering your questions. Start by first applying a correlation or regression analysis to detect a relationship. Display the results by graphing the same variables using a scatter plot. Explain how the relationship aligns with your questions.

# Scatterplot and Regression Analysis

# Scatterplot (Part 1)
ggplot(data_summary, aes(x = release_year, y = mean_playtime)) +
  geom_point(color = "blue", size = 2) +    # scatter points
  geom_smooth(method = "lm", se = TRUE, color = "red") +  # linear regression line
  labs(title = "Regression Analysis: Release Year vs Mean Playtime",
       x = "Release Year",
       y = "Mean Playtime (hours)") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Regression Analysis Model (Part 2)

model <- lm(mean_playtime ~ release_year, data = data_summary)
summary(model)

## 
## Call:
## lm(formula = mean_playtime ~ release_year, data = data_summary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0103 -1.4502 -0.6392  2.0276  6.5537 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -427.33813   36.44759  -11.72   <2e-16 ***
## release_year    0.21698    0.01824   11.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.283 on 54 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.7238, Adjusted R-squared:  0.7187 
## F-statistic: 141.5 on 1 and 54 DF,  p-value: < 2.2e-16

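The regression slope is positive: the mean main-story playtime rises by about 0.217 hours per release year (p < 2e-16). With an adjusted R-squared of 0.7187, release year explains roughly 72% of the variance in the yearly mean playtime, which supports the hypothesis that video games have become longer over time.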
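R-squared alone does not carry a sign, so the direction of the relationship has to be read from the slope or from the correlation coefficient. The sketch below uses hypothetical yearly means (`toy_summary` is made up, not the real `data_summary`) to show how the slope, R-squared, and Pearson correlation can be pulled from a fitted model and how they relate:

```r
# Hypothetical yearly means; the real analysis would use data_summary
toy_summary <- data.frame(
  release_year  = c(1990, 1995, 2000, 2005, 2010, 2015),
  mean_playtime = c(4.0, 5.5, 6.0, 7.5, 9.0, 9.5)
)

toy_model <- lm(mean_playtime ~ release_year, data = toy_summary)

slope     <- unname(coef(toy_model)["release_year"])  # hours gained per year
r_squared <- summary(toy_model)$r.squared
r         <- cor(toy_summary$release_year, toy_summary$mean_playtime)

# In simple linear regression, R-squared equals the squared Pearson correlation
print(c(slope = slope, r_squared = r_squared, r = r))
```

Applied to the model reported above, the multiple R-squared of 0.7238 together with the positive slope corresponds to a correlation of about 0.85 between release year and yearly mean playtime.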

Question 4

For the second statistical test, select a numerical column that you want to check. First, plot a histogram of it and discuss its distribution. As you did with the above graph, make sure the histogram is properly annotated.

# Load dataset
data <- read.csv("hltb_dataset_filtered.csv")

# Convert main_story to numeric (in case it’s stored as text)
data$main_story <- as.numeric(data$main_story)

# Remove missing values
dataset <- subset(data, !is.na(main_story))

# Plot histogram with annotations
hist(dataset$main_story,
     main = "Distribution of Main Story Playtimes",
     xlab = "Main Story Playtime (hours)",
     ylab = "Number of Games",
     col = "lightblue",
     border = "black",
     breaks = 30)   # adjust bin count if needed

# Add mean line
mu <- mean(dataset$main_story, na.rm = TRUE)
abline(v = mu, col = "red", lwd = 2, lty = 2)

# Add legend
legend("topright", 
       legend = paste0("Mean = ", round(mu, 1), " hours"),
       col = "red", lwd = 2, lty = 2, bty = "n")
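The histogram can be backed up with a few numbers when describing its shape. The following is a minimal sketch on made-up values (`toy_hours` is hypothetical and not drawn from the dataset). It assumes all playtimes are positive, since the log transform is undefined at zero:

```r
# Hypothetical right-skewed playtimes (hours); not values from the dataset
toy_hours <- c(1, 2, 2, 3, 3, 4, 5, 6, 8, 10, 15, 40, 120)

toy_mean   <- mean(toy_hours)
toy_median <- median(toy_hours)

# A mean well above the median points to a long right tail
print(c(mean = toy_mean, median = toy_median))
print(quantile(toy_hours, probs = c(0.25, 0.5, 0.75, 0.95)))

# A log10 scale spreads out the short games that a linear histogram squeezes together
hist(log10(toy_hours),
     main = "Distribution of log10 Main Story Playtimes (toy data)",
     xlab = "log10(Main Story Playtime in hours)",
     ylab = "Number of Games",
     col = "lightblue",
     border = "black")
```

If the real `main_story` column behaves like the toy values, with the mean far above the median, the distribution would be described as right-skewed, and that would matter when choosing the test in the next question.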

Question 5

Then, divide your dataset into two groups of rows based on another column that matters for your question and apply a test that we discussed in class (t-test, ANOVA, …) to test for significant differences between the two groups. Make sure the test you selected is consistent with the distribution that you observed earlier. Show the code and briefly explain the results.

# Load dataset
data <- read.csv("hltb_dataset_filtered.csv")

# Make sure numeric
data$main_story <- as.numeric(data$main_story)

# Remove missing values
dataset <- subset(data, !is.na(main_story) & !is.na(release_year))

# Split into two samples
sample1 <- subset(dataset, release_year >= 1985 & release_year < 2005)
sample2 <- subset(dataset, release_year >= 2005 & release_year <= 2025)

# Run two-sample t-test (unequal variance)
t_test_result <- t.test(sample1$main_story, sample2$main_story, var.equal = FALSE)

# Show result
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  sample1$main_story and sample2$main_story
## t = -5.7634, df = 22481, p-value = 8.351e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.520128 -1.241007
## sample estimates:
## mean of x mean of y 
##  7.275649  9.156216

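With a p-value of 8.351e-09, the difference between the two periods is statistically significant. Games released from 2005 to 2025 average about 9.16 hours for the main story, compared with about 7.28 hours for games released from 1985 to 2004, and the 95% confidence interval puts the difference between roughly 1.24 and 2.52 hours. This supports the hypothesis that video games have gotten longer to complete as the years progress, although a significant test result is strong evidence rather than certainty.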
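The question asks for a test that is consistent with the observed distribution. If the histogram turns out to be strongly right-skewed (an assumption here, since the shape is not described in the text), a rank-based test is a common complement to the Welch t-test. Here is a minimal sketch with hypothetical vectors `older` and `newer`, which stand in for `sample1$main_story` and `sample2$main_story`:

```r
# Hypothetical playtimes (hours) for two release periods; not values from the dataset
older <- c(2, 3, 4, 5, 6, 7, 8, 30)
newer <- c(5, 6, 8, 9, 10, 12, 14, 60)

# Wilcoxon rank-sum (Mann-Whitney) test: compares ranks, so it does not assume normality
# exact = FALSE requests the normal approximation, which is used when ties are present
wilcox_result <- wilcox.test(older, newer, exact = FALSE)

print(wilcox_result)
```

If both tests point the same way on the real data, the conclusion does not hinge on the t-test's reliance on large-sample behavior of the mean.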

Assignment 3: R Analysis of Videogame Completion Time

Austin Pham

2025-09-01