This dataset contains How Long to Beat playtime statistics at three completion levels: main story, main story plus side missions, and completionist (100%). It also includes columns such as release_year, publisher, and release_month. Here is the link: https://www.kaggle.com/datasets/b4n4n4p0wer/how-long-to-beat-video-game-playtime-dataset?select=hltb_dataset.csv In this assignment, my hypothesis is that video game playtimes have grown longer over the years, either because players have become more immersed and play each game for longer or because the games themselves have become longer to complete. I examine games from the first titles made in the 1970s through the latest releases of 2025.
library("ggplot2")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("tidyr")
data <- read.csv("hltb_dataset_filtered.csv")
ggplot(data,aes(x=release_year, y = main_story)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Playtime Statistics of Videogames by Main Story Per Release Year", x = "Year", y = "Main Story Completion by Hours") +
theme_minimal()
## Warning: Removed 10988 rows containing missing values or values outside the scale range
## (`geom_bar()`).
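My graph shows main story playtime by release year. Because geom_bar(stat = "identity") stacks the value of every game, each bar is the total main story hours of all games released in that year, so its height reflects both how many games were released and how long they are. The bars grow over time, which is consistent with video games becoming longer, but the totals alone cannot separate longer games from a larger number of releases.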
data_summary <- data %>%
group_by(release_year) %>%
summarize(mean_playtime = mean(main_story, na.rm = TRUE))
# Plot mean per year
ggplot(data_summary, aes(x = release_year, y = mean_playtime)) +
geom_line(color = "blue", linewidth = 1) +
geom_point(color = "red", size = 2) +
labs(title = "Mean Playtime per Year",
x = "Release Year",
y = "Mean Main Story Playtime (hours)") +
theme_minimal()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
The yearly mean of main story playtime rises over the period. This could mean that players are spending more time in each game, or that the games themselves are taking longer to complete.
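A yearly mean can be pulled upward by a few very long games, and it is less reliable in years with only a handful of entries. As a minimal sketch, the summary below adds the median and the number of games with a recorded time per year. It assumes a small made-up data frame, `toy`, standing in for the real CSV; only the column names `release_year` and `main_story` come from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for the real dataset; values are made up for illustration.
toy <- data.frame(
  release_year = c(1990, 1990, 1990, 2020, 2020, 2020, 2020),
  main_story   = c(2, 4, NA, 5, 7, 9, 60)
)

toy_summary <- toy %>%
  group_by(release_year) %>%
  summarize(
    mean_playtime   = mean(main_story, na.rm = TRUE),    # sensitive to the 60-hour game
    median_playtime = median(main_story, na.rm = TRUE),  # robust to the 60-hour game
    n_games         = sum(!is.na(main_story))            # games with a recorded time
  )

print(toy_summary)
```

If the median per year rises along with the mean, the upward trend is less likely to be driven by a few extreme titles.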
3. We will apply statistical tests to your dataset to gain insight into answering your questions. Start by applying a correlation or regression analysis to detect a relationship. Display the results by graphing the same variables using a scatter plot. Explain how the relationship aligns with your questions.
# Scatterplot and Regression Analysis
# Scatterplot (Part 1)
ggplot(data_summary, aes(x = release_year, y = mean_playtime)) +
geom_point(color = "blue", size = 2) + # scatter points
geom_smooth(method = "lm", se = TRUE, color = "red") + # linear regression line
labs(title = "Regression Analysis: Release Year vs Mean Playtime",
x = "Release Year",
y = "Mean Main Story Playtime (hours)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Regression Analysis Model (Part 2)
model <- lm(mean_playtime ~ release_year, data = data_summary)
summary(model)
##
## Call:
## lm(formula = mean_playtime ~ release_year, data = data_summary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0103 -1.4502 -0.6392 2.0276 6.5537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -427.33813 36.44759 -11.72 <2e-16 ***
## release_year 0.21698 0.01824 11.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.283 on 54 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.7238, Adjusted R-squared: 0.7187
## F-statistic: 141.5 on 1 and 54 DF, p-value: < 2.2e-16
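The slope estimate for release_year is 0.217 (p < 2e-16), so mean main story playtime rises by about 0.22 hours (roughly 13 minutes) per release year. This positive slope is what shows the direction of the relationship. The adjusted R-squared of 0.7187 means that release year explains about 72% of the variation in yearly mean playtime. Note that the model is fitted to 56 yearly means rather than to individual games, so it describes the trend in the averages and not the spread between individual games. The result aligns with my hypothesis that video games have become longer over time.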
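Since the assignment asks for a correlation or regression, it may help to see how the two connect. The sketch below is a minimal illustration on made-up yearly means (the data frame `toy_summary` and its values are assumptions, not the real results). It pulls the slope and its confidence interval out of a fitted model and checks that, with a single predictor, the squared Pearson correlation equals the multiple R-squared. Applied to the model above, the same identity gives a correlation of about 0.85, the square root of 0.7238.

```r
# Hypothetical yearly means; made up for illustration only.
toy_summary <- data.frame(
  release_year  = 2000:2009,
  mean_playtime = c(5.1, 5.6, 5.2, 6.0, 6.4, 6.1, 6.9, 7.2, 7.0, 7.8)
)

fit <- lm(mean_playtime ~ release_year, data = toy_summary)

slope    <- coef(fit)[["release_year"]]        # hours gained per release year
slope_ci <- confint(fit)["release_year", ]     # 95% confidence interval for the slope
r        <- cor(toy_summary$release_year,
                toy_summary$mean_playtime)     # Pearson correlation

# In a one-predictor regression, r squared equals the multiple R-squared.
r_squared_matches <- isTRUE(all.equal(r^2, summary(fit)$r.squared))

print(slope)
print(slope_ci)
print(r)
print(r_squared_matches)
```

The sign of the slope (or of r) gives the direction of the trend, while R-squared only says how much of the variation the line accounts for.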
# Load dataset
data <- read.csv("hltb_dataset_filtered.csv")
# Convert main_story to numeric (in case it’s stored as text)
data$main_story <- as.numeric(data$main_story)
# Remove missing values
dataset <- subset(data, !is.na(main_story))
# Plot histogram with annotations
hist(dataset$main_story,
main = "Distribution of Main Story Playtimes",
xlab = "Main Story Playtime (hours)",
ylab = "Number of Games",
col = "lightblue",
border = "black",
breaks = 30) # adjust bin count if needed
# Add mean line
mu <- mean(dataset$main_story, na.rm = TRUE)
abline(v = mu, col = "red", lwd = 2, lty = 2)
# Add legend
legend("topright",
legend = paste0("Mean = ", round(mu, 1), " hours"),
col = "red", lwd = 2, lty = 2, bty = "n")
# Load dataset
data <- read.csv("hltb_dataset_filtered.csv")
# Make sure numeric
data$main_story <- as.numeric(data$main_story)
# Remove missing values
dataset <- subset(data, !is.na(main_story) & !is.na(release_year))
# Split into two samples
sample1 <- subset(dataset, release_year >= 1985 & release_year < 2005)
sample2 <- subset(dataset, release_year >= 2005 & release_year <= 2025)
# Run two-sample t-test (unequal variance)
t_test_result <- t.test(sample1$main_story, sample2$main_story, var.equal = FALSE)
# Show result
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: sample1$main_story and sample2$main_story
## t = -5.7634, df = 22481, p-value = 8.351e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.520128 -1.241007
## sample estimates:
## mean of x mean of y
## 7.275649 9.156216
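With a p-value of 8.351e-09, the Welch t-test rejects the hypothesis that the two periods have the same mean main story playtime. Games released from 2005 to 2025 average 9.16 hours, compared with 7.28 hours for games released from 1985 to 2004, a difference of about 1.9 hours (95% confidence interval: 1.24 to 2.52 hours). This supports my hypothesis that video games have gotten longer to complete as the years progress. With more than 20,000 games in the test, however, even a modest difference produces a very small p-value, so the size of the difference matters more than the p-value itself.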
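Two follow-up checks could put the t-test in context: an effect size, which expresses the difference in standard deviation units, and a rank-based test, which does not depend on the means of skewed data. The sketch below is only an illustration on simulated right-skewed values. The vectors `early` and `late` are hypothetical stand-ins for `sample1$main_story` and `sample2$main_story`, and the pooled-SD form of Cohen's d is one common choice, not the only one.

```r
set.seed(1)  # reproducible simulated data

# Simulated, right-skewed playtimes standing in for the two periods (not real data).
early <- rlnorm(200, meanlog = 1.5, sdlog = 0.8)
late  <- rlnorm(200, meanlog = 1.8, sdlog = 0.8)

# Cohen's d with a pooled standard deviation: mean difference in SD units.
pooled_sd <- sqrt(((length(early) - 1) * var(early) +
                   (length(late)  - 1) * var(late)) /
                  (length(early) + length(late) - 2))
cohens_d <- (mean(late) - mean(early)) / pooled_sd

# Wilcoxon rank-sum test: compares the two groups using ranks, not means.
wilcox_result <- wilcox.test(early, late)

print(cohens_d)
print(wilcox_result$p.value)
```

On the real data, a small Cohen's d alongside a tiny p-value would indicate a difference that is statistically clear but modest in practical terms.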