Data Analysis
To address the research question, exploratory data analysis (EDA) was conducted to understand the distributions of, and relationships among, key Twitch streamer metrics. Summary statistics and visualizations were used to examine viewer engagement and streamer popularity. The dataset was cleaned by selecting the variables relevant to modeling and keeping only rows with positive values, which also drops rows with missing values in those variables. The dplyr functions select, filter, and mutate were used to prepare the data for multiple linear regression, including log transformations of watch time and followers. Scatterplots and correlation analysis were used to assess potential linear relationships between the predictors and the outcome variable before fitting the regression model.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
twitch <- read.csv("twitchdata-update.csv")
dim(twitch)
## [1] 1000 11
summary(twitch)
## Channel Watch.time.Minutes. Stream.time.minutes. Peak.viewers
## Length:1000 Min. :1.222e+08 Min. : 3465 Min. : 496
## Class :character 1st Qu.:1.632e+08 1st Qu.: 73759 1st Qu.: 9114
## Mode :character Median :2.350e+08 Median :108240 Median : 16676
## Mean :4.184e+08 Mean :120515 Mean : 37065
## 3rd Qu.:4.337e+08 3rd Qu.:141844 3rd Qu.: 37570
## Max. :6.196e+09 Max. :521445 Max. :639375
## Average.viewers Followers Followers.gained Views.gained
## Min. : 235 Min. : 3660 Min. : -15772 Min. : 175788
## 1st Qu.: 1458 1st Qu.: 170546 1st Qu.: 43758 1st Qu.: 3880602
## Median : 2425 Median : 318063 Median : 98352 Median : 6456324
## Mean : 4781 Mean : 570054 Mean : 205518 Mean : 11668166
## 3rd Qu.: 4786 3rd Qu.: 624332 3rd Qu.: 236131 3rd Qu.: 12196762
## Max. :147643 Max. :8938903 Max. :3966525 Max. :670137548
## Partnered Mature Language
## Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
# Keep the modeling variables, restrict to positive values so the logs are
# defined, then add the log-transformed outcome and follower count
twitch_clean <- twitch %>%
  select(Channel, `Watch.time.Minutes.`, Average.viewers, Followers, `Stream.time.minutes.`) %>%
  filter(`Watch.time.Minutes.` > 0, Average.viewers > 0, Followers > 0, `Stream.time.minutes.` > 0) %>%
  mutate(
    log_watch_time = log(`Watch.time.Minutes.`),
    log_followers = log(Followers)
  )
ggplot(twitch_clean, aes(x = Average.viewers, y = log_watch_time)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(
    title = "Average Viewers vs Log Watch Time",
    x = "Average Viewers",
    y = "Log Watch Time (minutes)"
  )
## `geom_smooth()` using formula = 'y ~ x'
model1 <- lm(
  log_watch_time ~ Average.viewers + log_followers + `Stream.time.minutes.`,
  data = twitch_clean
)
summary(model1)
##
## Call:
## lm(formula = log_watch_time ~ Average.viewers + log_followers +
## Stream.time.minutes., data = twitch_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3190 -0.3405 -0.0332 0.3286 1.7282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.421e+01 2.291e-01 62.03 <2e-16 ***
## Average.viewers 3.267e-05 2.129e-06 15.35 <2e-16 ***
## log_followers 3.732e-01 1.790e-02 20.86 <2e-16 ***
## Stream.time.minutes. 3.053e-06 1.978e-07 15.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5133 on 996 degrees of freedom
## Multiple R-squared: 0.5228, Adjusted R-squared: 0.5213
## F-statistic: 363.7 on 3 and 996 DF, p-value: < 2.2e-16
Regression Analysis
A multiple linear regression model was fitted to examine how streamer popularity and activity influence total watch time on Twitch. The dependent variable is the natural logarithm of total watch time (measured in minutes). The independent variables include average viewers, total followers (log-transformed), and total stream time (in minutes). The final model is specified as:
log(Watch Time) = β₀ + β₁(Average Viewers) + β₂(log(Followers)) + β₃(Stream Time) + ε
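To make the specification concrete, the fitted equation can be evaluated by hand and back-transformed to minutes. The following is a minimal sketch, assuming the coefficients as rounded in the summary(model1) output and a hypothetical channel whose predictor values equal the sample medians reported by summary(twitch). The variable names are illustrative and are not part of the original analysis.

```r
# Coefficients copied from the summary(model1) output (rounded as printed)
b0          <- 1.421e+01
b_viewers   <- 3.267e-05
b_followers <- 3.732e-01
b_stream    <- 3.053e-06

# Hypothetical channel (assumption): each predictor set to its sample median
avg_viewers    <- 2425
followers      <- 318063
stream_minutes <- 108240

# Predicted log watch time from the fitted equation
log_watch_hat <- b0 +
  b_viewers * avg_viewers +
  b_followers * log(followers) +
  b_stream * stream_minutes

# Back-transform to minutes; exp() of a predicted log is a median-type
# prediction on the original scale, not a predicted mean
watch_hat <- exp(log_watch_hat)

print(log_watch_hat)
print(watch_hat)
```

Under these assumptions the prediction is on the order of 2.5e+08 minutes, which is close to the sample median watch time of 2.350e+08.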
Interpretation of Coefficients
All predictors in the model are statistically significant at the 0.001 level (each p < 2e-16). The coefficient on average viewers is positive: each additional average viewer is associated with an increase of approximately 0.000033 in log watch time (about 0.0033% in watch time), holding the other variables constant.
The coefficient for log-transformed followers (β = 0.373) suggests that a 1% increase in followers is associated with an approximate 0.37% increase in total watch time, controlling for other factors. This highlights the importance of long-term audience growth for streamer success.
Stream time also has a positive and statistically significant association with watch time (β = 3.053e-06 per minute streamed), meaning that streamers who broadcast for longer tend to accumulate more total watch time, holding the other variables constant. Overall, these results indicate that both audience size and streaming effort are meaningfully associated with viewer engagement on Twitch.
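Because the outcome is log-transformed, the coefficients are easier to read as percentage changes in watch time. The sketch below converts each estimate using the coefficients as printed in the regression output. The increments (1,000 average viewers, a 10% rise in followers, 10,000 stream minutes) are illustrative choices and not values from the original analysis.

```r
# Coefficients copied from the summary(model1) output (rounded as printed)
b_viewers   <- 3.267e-05
b_followers <- 3.732e-01
b_stream    <- 3.053e-06

# Log-level terms: a change of delta in x multiplies watch time by exp(b * delta)
pct_1000_viewers <- (exp(b_viewers * 1000) - 1) * 100
pct_10000_stream <- (exp(b_stream * 10000) - 1) * 100

# Log-log term: multiplying followers by 1.10 multiplies watch time by 1.10^b
pct_followers_10 <- (1.10^b_followers - 1) * 100

round(c(
  per_1000_avg_viewers     = pct_1000_viewers,
  per_10pct_more_followers = pct_followers_10,
  per_10000_stream_minutes = pct_10000_stream
), 2)
```

With these inputs, the associated increases in watch time are roughly 3.3%, 3.6%, and 3.1% respectively, each holding the other predictors constant.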
The model explains approximately 52.3% of the variation in log watch time (R² = 0.523), indicating a moderately strong fit. The adjusted R² of 0.521 is nearly identical, so the penalty for including three predictors is negligible relative to the sample size. The overall F-test is statistically significant (F = 363.7 on 3 and 996 degrees of freedom, p < 2.2e-16), confirming that the model fits better than a null model with no predictors.
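The fit statistics are linked by simple formulas, so they can be cross-checked from the printed output. The sketch below recomputes the adjusted R² and the F-statistic from R² = 0.5228, n = 1000 observations, and k = 3 predictors. Small differences from the reported values are expected because R² is rounded to four decimals.

```r
# Values taken from the summary(model1) output
r2 <- 0.5228   # Multiple R-squared
n  <- 1000     # observations (996 residual df + 4 estimated coefficients)
k  <- 3        # predictors, excluding the intercept

df_resid <- n - k - 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed in terms of R-squared
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail p-value of the F-test
p_value <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

print(c(adj_r2 = adj_r2, f_stat = f_stat))
```

The recomputed values agree with the reported adjusted R² of 0.5213 and F-statistic of 363.7 up to rounding.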
Model Assumptions and Diagnostics
Linearity
plot(model1, which = 1)
The linearity assumption was assessed using a residuals versus fitted
values plot. The plot shows no strong curvature or systematic pattern,
suggesting that the relationship between the predictors and the
log-transformed watch time is approximately linear.
Independence of Observations
Independence of observations is assumed because each observation represents a different Twitch channel, and there is no reason to believe that the streaming behavior of one channel directly influences another within the dataset.
Homoscedasticity
plot(model1, which = 3)
Homoscedasticity was evaluated using the Scale-Location plot. The
residuals display a relatively constant spread across fitted values,
indicating that the variance of the residuals is approximately
constant.
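The visual check can be complemented by a formal test. The following is a minimal sketch of Koenker's studentized form of the Breusch-Pagan test, written in base R. The helper bp_test and the simulated data are hypothetical and were not part of the original analysis. The simulated example only shows that the function runs, and says nothing about the Twitch model.

```r
# Hypothetical helper: studentized Breusch-Pagan test for an lm fit
# that includes an intercept
bp_test <- function(model) {
  e2 <- residuals(model)^2                       # squared residuals
  X  <- model.matrix(model)[, -1, drop = FALSE]  # predictors without intercept
  aux <- lm(e2 ~ X)                              # auxiliary regression
  k <- ncol(X)
  stat <- length(e2) * summary(aux)$r.squared    # LM statistic: n * R^2
  c(statistic = stat,
    df = k,
    p.value = pchisq(stat, df = k, lower.tail = FALSE))
}

# Simulated data with constant error variance (illustration only)
set.seed(1)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(200)
sim_fit <- lm(y ~ x1 + x2, data = sim)

bp_sim <- bp_test(sim_fit)
print(bp_sim)
```

Applied to the report's model, the call would be bp_test(model1). A small p-value would indicate evidence against constant variance. The bptest function in the lmtest package offers a packaged alternative.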
Normality of Residuals
plot(model1, which = 2)
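Normality of residuals was assessed using a Normal Q–Q plot. The
residuals largely follow the reference line, with minor deviations at
the tails. Given the large sample size (n = 1000), the sampling
distributions of the coefficient estimates are approximately normal
even under mild non-normality, so these deviations are not considered
problematic for inference.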
Multicollinearity
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(model1)
## Average.viewers log_followers Stream.time.minutes.
## 1.228384 1.202524 1.081678
Multicollinearity was assessed using Variance Inflation Factors (VIF). All VIF values were below commonly accepted thresholds (VIF < 5), indicating that multicollinearity is not a concern in this model.
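Each VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors, so the reported values can be translated into more intuitive quantities. The sketch below uses the VIF values printed above. The derived names are illustrative.

```r
# VIF values copied from the vif(model1) output
vif_values <- c(
  Average.viewers        = 1.228384,
  log_followers          = 1.202524,
  `Stream.time.minutes.` = 1.081678
)

# Share of each predictor's variance explained by the other predictors
r2_aux <- 1 - 1 / vif_values

# Factor by which each coefficient's standard error is inflated
# relative to the case of uncorrelated predictors
se_inflation <- sqrt(vif_values)

print(round(rbind(r2_aux, se_inflation), 3))
```

On these figures, no more than about 19% of any predictor's variance is explained by the others, and standard errors are inflated by at most about 11%.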
Conclusion
This study examined how streamer popularity and streaming behavior influence total watch time on Twitch using multiple linear regression. The results indicate that average viewers, total followers, and stream time are all statistically significant predictors of watch time. In particular, follower count and average viewers show strong positive associations with watch time, highlighting the importance of building and maintaining an engaged audience. Stream time also contributes meaningfully, suggesting that consistent and sustained broadcasting increases overall viewer engagement.
The model explains approximately 52% of the variation in log-transformed watch time, indicating a moderately strong fit. While the model performs well, it has limitations. The analysis does not account for content type, streamer personality, or platform algorithms, which may also influence viewer behavior. Additionally, the use of aggregated yearly data prevents examination of short-term trends or causal relationships.
Future research could improve this model by incorporating additional predictors such as language, partnered status, or content categories. Interaction terms or non-linear models could also be explored to capture more complex relationships between streamer characteristics and audience engagement. Despite these limitations, this analysis provides valuable insight into the key factors that drive watch time on Twitch and offers a foundation for further statistical investigation.
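As an illustration of the extension described above, the sketch below adds a categorical predictor and an interaction term to a baseline model and compares the two with a nested F-test. The data are simulated purely for demonstration. The toy data frame, its coefficients, and the assumption that Partnered is stored as "True"/"False" strings are hypothetical, and the output says nothing about the actual Twitch data.

```r
# Simulated stand-in data (illustration only, not the Twitch dataset)
set.seed(42)
n <- 300
toy <- data.frame(
  Average.viewers = rlnorm(n, meanlog = 8, sdlog = 1),
  log_followers   = rnorm(n, mean = 12.5, sd = 1),
  Partnered       = sample(c("True", "False"), n, replace = TRUE)
)
toy$log_watch_time <- 14 + 3e-05 * toy$Average.viewers +
  0.4 * toy$log_followers + rnorm(n, sd = 0.5)

# Baseline model, then an extended model with a categorical predictor
# and its interaction with log followers
base_fit <- lm(log_watch_time ~ Average.viewers + log_followers, data = toy)
ext_fit  <- update(base_fit, . ~ . + Partnered + log_followers:Partnered)

# Nested F-test: do the added terms improve the fit?
cmp <- anova(base_fit, ext_fit)
print(cmp)
```

With the real data, the same pattern would start from model1 and a version of twitch_clean that retains the Partnered, Mature, or Language columns.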