Data Analysis

To address the research question, exploratory data analysis (EDA) was conducted to understand the distributions of, and relationships among, key Twitch streamer metrics. Summary statistics and visualizations were used to examine viewer engagement and streamer popularity. The dataset was reduced to the variables needed for modeling and filtered to strictly positive values so that the log transformations are defined. The dplyr functions select(), filter(), and mutate() were applied to prepare the data for multiple linear regression. Scatterplots and correlation analysis were used to assess potential linear relationships between the predictors and the outcome variable before fitting the regression model.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
twitch <- read.csv("twitchdata-update.csv")
dim(twitch)
## [1] 1000   11
summary(twitch)
##    Channel          Watch.time.Minutes. Stream.time.minutes.  Peak.viewers   
##  Length:1000        Min.   :1.222e+08   Min.   :  3465       Min.   :   496  
##  Class :character   1st Qu.:1.632e+08   1st Qu.: 73759       1st Qu.:  9114  
##  Mode  :character   Median :2.350e+08   Median :108240       Median : 16676  
##                     Mean   :4.184e+08   Mean   :120515       Mean   : 37065  
##                     3rd Qu.:4.337e+08   3rd Qu.:141844       3rd Qu.: 37570  
##                     Max.   :6.196e+09   Max.   :521445       Max.   :639375  
##  Average.viewers    Followers       Followers.gained   Views.gained      
##  Min.   :   235   Min.   :   3660   Min.   : -15772   Min.   :   175788  
##  1st Qu.:  1458   1st Qu.: 170546   1st Qu.:  43758   1st Qu.:  3880602  
##  Median :  2425   Median : 318063   Median :  98352   Median :  6456324  
##  Mean   :  4781   Mean   : 570054   Mean   : 205518   Mean   : 11668166  
##  3rd Qu.:  4786   3rd Qu.: 624332   3rd Qu.: 236131   3rd Qu.: 12196762  
##  Max.   :147643   Max.   :8938903   Max.   :3966525   Max.   :670137548  
##   Partnered            Mature            Language        
##  Length:1000        Length:1000        Length:1000       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
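twitch_clean <- twitch %>%
  # Keep the channel identifier and the four variables used in the model
  select(Channel, `Watch.time.Minutes.`, Average.viewers, Followers, `Stream.time.minutes.`) %>%
  # Keep strictly positive values so that log() is defined
  filter(`Watch.time.Minutes.` > 0, Average.viewers > 0, Followers > 0, `Stream.time.minutes.` > 0) %>%
  # Log-transform the outcome and the follower count
  mutate(
    log_watch_time = log(`Watch.time.Minutes.`),
    log_followers = log(Followers)
  )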
ggplot(twitch_clean, aes(x = Average.viewers, y = log_watch_time)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(
    title = "Average Viewers vs Log Watch Time",
    x = "Average Viewers",
    y = "Log Watch Time (minutes)"
  )
## `geom_smooth()` using formula = 'y ~ x'
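
The correlation analysis referred to in the text is not shown in the output. A minimal sketch of how it could be computed follows. It uses a small synthetic data frame (an assumption made so the example runs on its own; the values are not from the Twitch dataset) with the same column names as twitch_clean. With the real data, cor() would be applied to twitch_clean instead.

```r
# Synthetic stand-in data: NOT the Twitch dataset, only the same column names.
set.seed(1)
n <- 200
demo <- data.frame(
  Average.viewers      = rlnorm(n, meanlog = 8, sdlog = 1),
  Followers            = rlnorm(n, meanlog = 12.5, sdlog = 1),
  Stream.time.minutes. = runif(n, min = 5000, max = 300000)
)
demo$log_followers  <- log(demo$Followers)
demo$log_watch_time <- 14 + 3e-05 * demo$Average.viewers +
  0.4 * demo$log_followers + 3e-06 * demo$Stream.time.minutes. +
  rnorm(n, sd = 0.5)

# Pairwise Pearson correlations between the outcome and the three predictors.
vars <- c("log_watch_time", "Average.viewers", "log_followers", "Stream.time.minutes.")
cor_mat <- cor(demo[, vars], use = "complete.obs")
round(cor_mat, 2)
```

The first row of the matrix gives the correlation of each predictor with the outcome; the remaining off-diagonal entries show how strongly the predictors are related to each other.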

model1 <- lm(
  log_watch_time ~ Average.viewers + log_followers + `Stream.time.minutes.`,
  data = twitch_clean
)

summary(model1)
## 
## Call:
## lm(formula = log_watch_time ~ Average.viewers + log_followers + 
##     Stream.time.minutes., data = twitch_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3190 -0.3405 -0.0332  0.3286  1.7282 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.421e+01  2.291e-01   62.03   <2e-16 ***
## Average.viewers      3.267e-05  2.129e-06   15.35   <2e-16 ***
## log_followers        3.732e-01  1.790e-02   20.86   <2e-16 ***
## Stream.time.minutes. 3.053e-06  1.978e-07   15.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5133 on 996 degrees of freedom
## Multiple R-squared:  0.5228, Adjusted R-squared:  0.5213 
## F-statistic: 363.7 on 3 and 996 DF,  p-value: < 2.2e-16

Regression Analysis

A multiple linear regression model was fitted to examine how streamer popularity and activity influence total watch time on Twitch. The dependent variable is the natural logarithm of total watch time (measured in minutes). The independent variables include average viewers, total followers (log-transformed), and total stream time (in minutes). The final model is specified as:

log(Watch Time) = β₀ + β₁(Average Viewers) + β₂(log(Followers)) + β₃(Stream Time) + ε
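
Substituting the estimates printed in the summary(model1) output into this specification gives the fitted equation below (coefficients rounded as shown in that output):

```latex
\widehat{\log(\text{Watch Time})}
  = 14.21
  + 3.267 \times 10^{-5}\,(\text{Average Viewers})
  + 0.3732\,\log(\text{Followers})
  + 3.053 \times 10^{-6}\,(\text{Stream Time})
```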

Interpretation of Coefficients

All three predictors are statistically significant at the 0.001 level (each p < 2e-16). Average viewers are positively associated with watch time: each additional average viewer is associated with an increase of approximately 0.000033 in log watch time, holding the other variables constant. Because the outcome is on the log scale, this corresponds to roughly a 3.3% increase in total watch time per 1,000 additional average viewers.

The coefficient for log-transformed followers (β = 0.373) suggests that a 1% increase in followers is associated with an approximate 0.37% increase in total watch time, controlling for other factors. This highlights the importance of long-term audience growth for streamer success.
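
The percentage reading of the follower coefficient follows from the log-log form of this term. As a short worked derivation, assume followers are multiplied by a factor c while the other predictors are held fixed:

```latex
\Delta \log(\text{Watch Time}) = \beta_2 \,\Delta \log(\text{Followers}) = \beta_2 \log c
\quad\Longrightarrow\quad
\frac{\text{Watch Time}_{\text{new}}}{\text{Watch Time}_{\text{old}}} = c^{\beta_2}

c = 1.01:\quad 1.01^{0.3732} \approx 1.0037 \quad (\text{about } +0.37\%)
\qquad
c = 2:\quad 2^{0.3732} \approx 1.295 \quad (\text{about } +29.5\%)
```

The "1% gives 0.37%" rule is therefore an approximation that is accurate for small changes; for large changes such as a doubling, the exact factor c^β₂ should be used.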

Stream time is also positively and significantly associated with watch time (β = 3.053e-06 per minute), meaning that streamers who broadcast for longer tend to accumulate more total watch time: roughly 3.1% more per additional 10,000 minutes streamed, holding the other variables constant. Overall, these results indicate that both audience size and streaming effort are meaningfully related to viewer engagement on Twitch.
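
Because average viewers and stream time enter the model untransformed while the outcome is logged, their coefficients are easier to read after converting them to percent changes. The sketch below does this with a small helper, pct_change(), which is a name introduced here for illustration. The coefficients are copied from the regression output, and the increments of 1,000 viewers and 10,000 minutes are arbitrary choices.

```r
# Percent change in the (unlogged) outcome implied by a log-linear coefficient
# when the predictor rises by `delta` units: 100 * (exp(beta * delta) - 1).
pct_change <- function(beta, delta) {
  100 * (exp(beta * delta) - 1)
}

# Coefficients copied from the summary(model1) output.
b_viewers <- 3.267e-05
b_stream  <- 3.053e-06

pct_viewers <- pct_change(b_viewers, 1000)    # +1,000 average viewers
pct_stream  <- pct_change(b_stream, 10000)    # +10,000 minutes streamed
round(c(viewers = pct_viewers, stream = pct_stream), 2)
```

Under these assumptions the helper gives about 3.3% for average viewers and about 3.1% for stream time.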

The model explains approximately 52.3% of the variation in log watch time (R² = 0.523), indicating a moderately strong fit. The adjusted R² of 0.521 is nearly identical, which shows that the fit is not being inflated by the number of predictors. The overall F-test is statistically significant (F = 363.7 on 3 and 996 degrees of freedom, p < 2.2e-16), confirming that the model fits better than a null model with no predictors.
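
The adjusted R² and the F-statistic can both be recovered from R², the sample size, and the number of predictors. The following sketch is a consistency check on the reported figures, using the rounded R² from the output, so small rounding differences are expected:

```r
# Values taken from the regression output above.
r2 <- 0.5228   # Multiple R-squared (rounded as printed)
n  <- 1000     # observations
k  <- 3        # predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic for H0: all slope coefficients are zero.
f_stat  <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
```

Both results agree with the printed summary to within rounding (adjusted R² of about 0.521 and F of about 363.7).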

MODEL ASSUMPTIONS & DIAGNOSTICS

Linearity

plot(model1, which = 1)

The linearity assumption was assessed using a residuals versus fitted values plot. The plot shows no strong curvature or systematic pattern, suggesting that the relationship between the predictors and the log-transformed watch time is approximately linear.

Independence of Observations

Independence of observations is assumed because each observation represents a different Twitch channel, and there is no reason to believe that the streaming behavior of one channel directly influences another within the dataset.

Homoscedasticity

plot(model1, which = 3)

Homoscedasticity was evaluated using the Scale-Location plot. The residuals display a relatively constant spread across fitted values, indicating that the variance of the residuals is approximately constant.

Normality of Residuals

plot(model1, which = 2)

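Normality of residuals was assessed using a Normal Q–Q plot. The residuals largely follow the reference line, with minor deviations at the tails. Given the large sample size (n = 1000), these deviations are not considered problematic, because inference on the coefficients relies on the approximate normality of the estimates, which holds in large samples even when the residuals are not exactly normal.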

Multicollinearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(model1)
##      Average.viewers        log_followers Stream.time.minutes. 
##             1.228384             1.202524             1.081678

Multicollinearity was assessed using Variance Inflation Factors (VIF). All three values (1.08 to 1.23) are close to 1 and well below the commonly used threshold of 5, indicating that multicollinearity is not a concern in this model.
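
For a model with only numeric predictors, as here, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones. A VIF of 1.23 therefore means that only about 19% of the variance in average viewers is explained by the other two predictors. The sketch below computes VIFs this way in base R. The data are synthetic and manual_vif() is a hypothetical helper written for illustration; neither is part of the car package or the Twitch dataset.

```r
# Synthetic predictors (NOT the Twitch data); x2 is mildly correlated with x1.
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.4 * x1 + rnorm(n)
x3 <- rnorm(n)
demo <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- manual_vif(demo, c("x1", "x2", "x3"))
round(vifs, 3)
```

A VIF can never fall below 1, and values near 1 indicate that a predictor shares little variance with the others.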

CONCLUSION

This study examined how streamer popularity and streaming behavior relate to total watch time on Twitch using multiple linear regression. The results indicate that average viewers, total followers (log-transformed), and stream time are all statistically significant predictors of watch time. In particular, follower count and average viewers show strong positive associations with watch time, highlighting the importance of building and maintaining an engaged audience. Stream time also contributes meaningfully, suggesting that consistent and sustained broadcasting is associated with higher overall viewer engagement.

The model explains approximately 52% of the variation in log-transformed watch time, indicating a moderately strong fit. While the model performs well, it has limitations. The analysis does not account for content type, streamer personality, or platform algorithms, which may also influence viewer behavior. Additionally, the use of aggregated yearly data prevents examination of short-term trends or causal relationships.

Future research could improve this model by incorporating additional predictors such as language, partnered status, or content categories. Interaction terms or non-linear models could also be explored to capture more complex relationships between streamer characteristics and audience engagement. Despite these limitations, this analysis provides valuable insight into the key factors that drive watch time on Twitch and offers a foundation for further statistical investigation.
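
As a rough illustration of the extensions suggested above, the sketch below adds a categorical predictor and an interaction term, then compares the nested models with an F-test. It runs on synthetic data with assumed column names and assumed "True"/"False" coding for Partnered, so the output says nothing about the actual Twitch results.

```r
# Synthetic stand-in data: NOT the Twitch dataset. The Partnered coding is assumed.
set.seed(7)
n <- 400
demo <- data.frame(
  Average.viewers = rlnorm(n, meanlog = 8, sdlog = 1),
  log_followers   = rnorm(n, mean = 12.5, sd = 1),
  Partnered       = factor(sample(c("False", "True"), n, replace = TRUE))
)
demo$log_watch_time <- 14 + 3e-05 * demo$Average.viewers +
  0.4 * demo$log_followers + rnorm(n, sd = 0.5)

# Baseline model with the numeric predictors only.
base_fit <- lm(log_watch_time ~ Average.viewers + log_followers, data = demo)

# Extended model: Partnered main effect plus its interaction with log_followers.
ext_fit <- lm(log_watch_time ~ Average.viewers + log_followers * Partnered,
              data = demo)

# Partial F-test: do the two added terms improve the fit?
cmp <- anova(base_fit, ext_fit)
cmp
```

In the real analysis, the same comparison would indicate whether the relationship between followers and watch time differs by partnered status.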