Research Question

Can stream performance metrics predict the number of followers a Twitch streamer has?

Introduction

Streaming is now one of the most popular forms of media. People broadcast themselves live while playing video games, talking, baking, or doing almost anything else, interacting with their live chat and getting paid for it. Twitch is an online platform that is home to over 8 million streamers. It started as Justin.tv in 2007, a 24/7 live broadcast of Justin Kan (one of Twitch’s founders). After it gained popularity, the founders opened the site so that other people could live stream, and live gaming soon became the main genre of content on Justin.tv. Because of this, the founders, Justin Kan, Emmett Shear, Michael Seibel, and Kyle Vogt, created a similar website called Twitch.

Twitch soon became a source of income for successful streamers, who could become site partners and earn revenue through advertising. In 2014, Kan shut down Justin.tv to focus on Twitch, and that same year Amazon acquired Twitch for $970 million. Over the years, Twitch has added features to support its streamers, such as “Bits,” which viewers can purchase and give to streamers, and chat moderators, who can filter out unwanted comments.

The dataset I am using contains a list of the top 1000 streamers from 2019. It has 11 variables and 1000 observations, but the main variables I am focusing on are “Followers” (the number of followers a streamer had at that time), “Peak.viewers” (the highest number of viewers the streamer had on one stream), “Stream.time.minutes.” (how many minutes the streamer spent streaming), “Watch.time.Minutes.” (how many minutes viewers spent watching the streamer), and “Average.viewers” (the average number of viewers on a stream). I chose these variables because together they represent different aspects of a streamer’s performance. Peak viewers shows their reach, watch time reflects how engaged the audience is, stream time represents how much content they produced, and average viewers shows the size of their typical audience. All of these seem like they could give us an idea of a streamer’s follower count. I found this dataset in a GitHub repository called Awesome Public Datasets, and the original source is sullygnome.com.

Dataset Link

Data Analysis

My first step in the data analysis was to check out my dataset. I used colnames() to look at the names of my columns and noticed that the column names were capitalized, so I lowercased them. Then I created a new dataset called twitch1 and selected only the variables that I would need. After that, I made another dataset called twitch2 so I could rename the columns, replacing the “.” with “_” and shortening the names to make them easier to read and understand. For example, I changed “stream.time.minutes.” to “stream_time_min”.
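As a minimal sketch of the same cleaning idea done in one step (this is not what I ran in my analysis), the column names could be cleaned with a small helper function. The helper name clean_names and the raw_names vector are hypothetical and only used for illustration; the resulting names keep the full word “minutes”, so they differ slightly from the shorter names I chose by hand.

```r
# Hypothetical stand-in for some of the raw column names in this report
raw_names <- c("Channel", "Watch.time.Minutes.", "Stream.time.minutes.",
               "Peak.viewers", "Average.viewers", "Followers")

# Hypothetical helper: lowercase, drop trailing dots, turn other dots into "_"
clean_names <- function(x) {
  x <- tolower(x)             # lowercase everything
  x <- gsub("\\.+$", "", x)   # remove dots at the end of a name
  x <- gsub("\\.+", "_", x)   # replace the remaining dots with underscores
  x
}

cleaned <- clean_names(raw_names)
print(cleaned)
```

With a real data frame, the same helper could be applied as names(twitch) <- clean_names(names(twitch)).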

Next, I looked at the structure of my twitch2 dataset to see what class each variable belonged to, and then I checked if there were any NAs in the dataset using colSums(is.na()). Fortunately, there were zero NAs. After that, I used summary() to get the five-number summary and the mean for each of my variables. I also created a dataset called twitch_summary and used summarise() to get the means of the variables. I used head() so I could view the dataset and gain a better understanding of the range my numbers fall into. Finally, I converted the data to long format and created a facet plot with trend lines to explore the relationship between each performance metric and follower count.

I am going to use a multiple linear regression model in this project because it is designed for predicting a continuous outcome variable from several predictors, and in this case I am predicting the number of followers. This is different from a logistic regression, which is used for binary categorical outcome variables. For the multiple linear regression model, I will need to check linearity, independence of observations, homoscedasticity, normality of residuals, and multicollinearity to make sure the model is valid. To do this, I will look at the Component + Residual plots to see if the points form a straight-line pattern for linearity, use the Residuals vs. Order plot to see if the residuals bounce randomly around zero without patterns to check independence, and examine the Residuals vs. Fitted plot to verify that the residuals are scattered randomly and consistently around zero. I will also review the Scale-Location plot to make sure the red line stays horizontal and that the points are spread evenly, which supports homoscedasticity. I will use a Q–Q plot to see whether the residuals follow the straight diagonal line, and I will check for high correlations between the predictors to detect any multicollinearity. These diagnostics will help confirm whether the model meets the main regression assumptions.

library(car)
## Loading required package: carData
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some()   masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
twitch <- read.csv('twitchdata-update.csv')
colnames(twitch)
##  [1] "Channel"              "Watch.time.Minutes."  "Stream.time.minutes."
##  [4] "Peak.viewers"         "Average.viewers"      "Followers"           
##  [7] "Followers.gained"     "Views.gained"         "Partnered"           
## [10] "Mature"               "Language"
names(twitch) <- tolower(names(twitch))
colnames(twitch)
##  [1] "channel"              "watch.time.minutes."  "stream.time.minutes."
##  [4] "peak.viewers"         "average.viewers"      "followers"           
##  [7] "followers.gained"     "views.gained"         "partnered"           
## [10] "mature"               "language"
twitch1 <- twitch |>
  select(followers, peak.viewers, stream.time.minutes., watch.time.minutes., average.viewers)
head(twitch1)
##   followers peak.viewers stream.time.minutes. watch.time.minutes.
## 1   3246298       222720               215250          6196161750
## 2   5310163       310998               211845          6091677300
## 3   1767635       387315               515280          5644590915
## 4   3944850       300575               517740          3970318140
## 5   8938903       285644               123660          3671000070
## 6   1563438       263720                82260          3668799075
##   average.viewers
## 1           27716
## 2           25610
## 3           10976
## 4            7714
## 5           29602
## 6           42414
twitch2 <- twitch1 |>
  rename(
    peak_viewers = peak.viewers,
    stream_time_min = stream.time.minutes.,
    watch_time_min = watch.time.minutes.,
    avg_viewers = average.viewers
  )
head(twitch2)
##   followers peak_viewers stream_time_min watch_time_min avg_viewers
## 1   3246298       222720          215250     6196161750       27716
## 2   5310163       310998          211845     6091677300       25610
## 3   1767635       387315          515280     5644590915       10976
## 4   3944850       300575          517740     3970318140        7714
## 5   8938903       285644          123660     3671000070       29602
## 6   1563438       263720           82260     3668799075       42414
str(twitch2)
## 'data.frame':    1000 obs. of  5 variables:
##  $ followers      : int  3246298 5310163 1767635 3944850 8938903 1563438 4074287 508816 3530767 2607076 ...
##  $ peak_viewers   : int  222720 310998 387315 300575 285644 263720 115633 68795 89387 125408 ...
##  $ stream_time_min: int  215250 211845 515280 517740 123660 82260 136275 147885 122490 92880 ...
##  $ watch_time_min : num  6.20e+09 6.09e+09 5.64e+09 3.97e+09 3.67e+09 ...
##  $ avg_viewers    : int  27716 25610 10976 7714 29602 42414 24181 18985 22381 12377 ...
colSums(is.na(twitch2)) 
##       followers    peak_viewers stream_time_min  watch_time_min     avg_viewers 
##               0               0               0               0               0
summary(twitch2)
##    followers        peak_viewers    stream_time_min  watch_time_min     
##  Min.   :   3660   Min.   :   496   Min.   :  3465   Min.   :1.222e+08  
##  1st Qu.: 170546   1st Qu.:  9114   1st Qu.: 73759   1st Qu.:1.632e+08  
##  Median : 318063   Median : 16676   Median :108240   Median :2.350e+08  
##  Mean   : 570054   Mean   : 37065   Mean   :120515   Mean   :4.184e+08  
##  3rd Qu.: 624332   3rd Qu.: 37570   3rd Qu.:141844   3rd Qu.:4.337e+08  
##  Max.   :8938903   Max.   :639375   Max.   :521445   Max.   :6.196e+09  
##   avg_viewers    
##  Min.   :   235  
##  1st Qu.:  1458  
##  Median :  2425  
##  Mean   :  4781  
##  3rd Qu.:  4786  
##  Max.   :147643
twitch_summary <- twitch2 |>
  summarise(
    avg_peak = mean(peak_viewers),
    avg_watch = mean(watch_time_min),
    avg_followers = mean(followers),
    avg_stream = mean(stream_time_min)
  )
head(twitch_summary)
##   avg_peak avg_watch avg_followers avg_stream
## 1 37065.05 418427930      570054.1   120515.2
twitch_long <- twitch2 |>
  pivot_longer(
    cols = c(peak_viewers, watch_time_min, avg_viewers, stream_time_min),
    names_to = "metric",
    values_to = "value"
  )
head(twitch_long)
## # A tibble: 6 × 3
##   followers metric               value
##       <int> <chr>                <dbl>
## 1   3246298 peak_viewers        222720
## 2   3246298 watch_time_min  6196161750
## 3   3246298 avg_viewers          27716
## 4   3246298 stream_time_min     215250
## 5   5310163 peak_viewers        310998
## 6   5310163 watch_time_min  6091677300
# Named metrics_plot so the object does not share a name with base R's plot()
metrics_plot <- twitch_long |>
  ggplot(aes(x = value, y = followers)) +
  geom_point(alpha = 0.6, color = "#86a7bf") +
  geom_smooth(color = "#b01a1a") +
  # One panel per metric; each metric keeps its own x-axis scale
  facet_wrap(~ metric, scales = "free_x") +
  labs(title = "Stream Metrics vs Total Followers",
       subtitle = "Top 1000 Twitch streamers (2019)",
       x = "Value",
       y = "Followers",
       caption = "Source: sullygnome.com") +
  theme_minimal()
metrics_plot
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Regression Analysis

multiple_model <- lm(followers ~ peak_viewers + stream_time_min + watch_time_min + avg_viewers,
                     data = twitch2)
multiple_model
## 
## Call:
## lm(formula = followers ~ peak_viewers + stream_time_min + watch_time_min + 
##     avg_viewers, data = twitch2)
## 
## Coefficients:
##     (Intercept)     peak_viewers  stream_time_min   watch_time_min  
##       3.105e+05        2.851e+00       -1.352e+00        7.568e-04  
##     avg_viewers  
##       3.893e-02
summary(multiple_model)
## 
## Call:
## lm(formula = followers ~ peak_viewers + stream_time_min + watch_time_min + 
##     avg_viewers, data = twitch2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3222326  -224595  -105510   112381  5457436 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.105e+05  3.756e+04   8.267 4.38e-16 ***
## peak_viewers     2.851e+00  4.703e-01   6.061 1.92e-09 ***
## stream_time_min -1.352e+00  2.421e-01  -5.586 3.00e-08 ***
## watch_time_min   7.568e-04  4.509e-05  16.783  < 2e-16 ***
## avg_viewers      3.893e-02  3.224e+00   0.012     0.99    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 599000 on 995 degrees of freedom
## Multiple R-squared:  0.4478, Adjusted R-squared:  0.4456 
## F-statistic: 201.7 on 4 and 995 DF,  p-value: < 2.2e-16

This multiple linear regression model has an adjusted R² of 0.446, which means the model explains about 44.6% of the variation in follower count. The intercept of 310,500 is the predicted number of followers when all predictors are zero, which is an extrapolation with no practical meaning, since no channel in the data has zero viewers or zero stream time. Three of the four predictors are statistically significant at the 0.05 level. Peak viewers has a positive coefficient of 2.851, meaning that each additional peak viewer is associated with about 2.85 more followers, holding the other predictors constant. Total watch time (in minutes) has a positive coefficient of 0.0007568, so more total minutes watched is linked to more followers. Stream time (in minutes) has a negative coefficient of -1.352, suggesting that, for a fixed level of viewership and watch time, more minutes streamed is associated with fewer followers. Average viewers has a small positive coefficient (0.039), but it is not statistically significant (p = 0.99). Overall, the model shows that these viewership and engagement metrics together explain a little under half of the variation in a channel’s follower count.
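Because stream time and watch time are measured in minutes, their coefficients are hard to read on their own. As a small worked example, the sketch below rescales the reported coefficients into larger units. The choice of “per hour streamed” and “per million minutes watched” is my own assumption for illustration; the coefficient values are copied from the summary() output above.

```r
# Coefficients copied from the regression summary (followers per one unit)
coefs <- c(peak_viewers = 2.851,
           stream_time_min = -1.352,
           watch_time_min = 7.568e-04)

# Rescale to easier-to-read units (illustrative choices, not part of the model)
per_hour_streamed   <- coefs[["stream_time_min"]] * 60   # per extra hour streamed
per_million_watched <- coefs[["watch_time_min"]] * 1e6   # per extra million minutes watched

per_hour_streamed
per_million_watched
```

Read this way, one more hour of streaming is associated with about 81 fewer followers, and one more million minutes of watch time is associated with about 757 more followers, holding the other predictors constant.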


Model Assumptions and Diagnostics

Linearity check

crPlots(multiple_model)

The Component + Residual plots show that the linearity assumption is met for almost all predictors. Peak_viewers shows a strong positive linear trend, with the loess line sitting almost exactly on top of the fitted line across the whole range. Stream_time_min is close to flat, and the loess line follows the fitted line with no noticeable deviation, showing almost perfect linearity. Watch_time_min shows a clear positive trend where the loess line sticks closely to the straight line, with only slight bending at the ends, so linearity looks very good. Avg_viewers is the only predictor with a noticeable deviation. It shows a negative trend overall, but the loess line flattens and curves downward at the high average viewership values, showing some mild non-linearity in the upper range. Overall, linearity holds well for peak_viewers, stream_time_min, and watch_time_min, and is still acceptable for avg_viewers even with the small deviation.


Independence

plot(resid(multiple_model), type = "b",
     main = "Residuals vs Order", ylab = "Residuals")
abline(h = 0, lty = 2)

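The Residuals vs. Order plot shows a few bursts of large residuals in the first 100 or so indices, but after index 100 the residuals level out, stay randomly scattered, and cluster around zero. The rows appear to be sorted by watch time (the first rows have the largest values), so these early bursts most likely come from the largest channels rather than from one observation depending on the next. After the early part of the data, there is no visible trend or correlation. Overall, the independence assumption is satisfied for this model.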
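A numeric check that could go with this plot is the Durbin-Watson statistic, which I did not compute in my analysis. The sketch below calculates it by hand on simulated data (not the Twitch data), so the variable names and numbers are only illustrative. Values near 2 suggest little first-order correlation between neighboring residuals, and the statistic always falls between 0 and 4.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)   # errors are independent by construction
fit <- lm(y ~ x)

e <- resid(fit)
# Durbin-Watson statistic: squared differences of neighboring residuals
# divided by the sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

The car package, which is already loaded in this report, also provides durbinWatsonTest() for a fitted lm object. Either way, the result only makes sense if the row order is meaningful, which is an assumption for this dataset.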

Core diagnostics

par(mfrow = c(2, 2))    # show the four diagnostic plots in a 2 x 2 grid
plot(multiple_model)
par(mfrow = c(1, 1))    # reset the plotting layout

The Residuals vs. Fitted plot shows random scatter around zero with no funnel shape, curves, or patterns, which supports the homoscedasticity assumption.

The Q–Q plot shows that the standardized residuals follow the diagonal line very closely in the middle, with only a slight curve at the right tail. This means the residuals are approximately normally distributed, with a little heavy-tailed behavior on the positive side, so the normality assumption is reasonably met.

The Scale-Location plot shows a roughly horizontal red line with points evenly scattered around it, which supports constant variance of the residuals and agrees with the Residuals vs. Fitted plot.

The Residuals vs. Leverage plot shows all points inside the Cook’s distance contours, and the highest-leverage points (around 0.3–0.4) have small residuals. This means there are no highly influential outliers, so the model is stable.
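The Cook’s distance values behind that plot can also be inspected directly. The sketch below uses simulated data (not the Twitch data) and the common 4/n rule of thumb. That cutoff is my own assumption for illustration and is not a strict test.

```r
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

cd <- cooks.distance(fit)     # one Cook's distance per observation
threshold <- 4 / n            # rule-of-thumb cutoff, not a formal test
flagged <- which(cd > threshold)

head(sort(cd, decreasing = TRUE), 3)   # the three largest values
flagged                                # row indices above the cutoff
```

For my model, the same idea would be cooks.distance(multiple_model), with any flagged rows checked by hand before deciding whether they matter.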


Multicollinearity

cor(twitch2[, c("peak_viewers", "stream_time_min", "watch_time_min", "avg_viewers")], use = "complete.obs")
##                 peak_viewers stream_time_min watch_time_min avg_viewers
## peak_viewers       1.0000000      -0.1195403      0.5827966   0.6826373
## stream_time_min   -0.1195403       1.0000000      0.1505879  -0.2492478
## watch_time_min     0.5827966       0.1505879      1.0000000   0.4761650
## avg_viewers        0.6826373      -0.2492478      0.4761650   1.0000000

There is moderate multicollinearity, mainly among peak_viewers, avg_viewers, and watch_time_min, with the largest correlation (0.68) between peak_viewers and avg_viewers. It’s not extreme, so the model is still stable and usable. However, it does make the coefficients for those three variables harder to interpret, because it is difficult to tell which one is actually associated with the change in follower count. Overall, the level of multicollinearity is acceptable and the model is still usable for predicting followers.
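The correlation matrix above only compares predictors two at a time. A variance inflation factor (VIF) measures how well each predictor is explained by all of the others together. I did not compute VIFs in my analysis, so the sketch below is only an illustration on simulated data with hypothetical predictors x1, x2, and x3. It uses the definition VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the rest.

```r
set.seed(3)
n <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)   # correlated with x1 by construction
x3 <- rnorm(n)                         # unrelated to the other two
X <- data.frame(x1, x2, x3)

# Regress each predictor on the others and convert R-squared into a VIF
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  f <- as.formula(paste(v, "~", paste(others, collapse = " + ")))
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
vif_manual
```

A VIF close to 1 means a predictor shares almost no information with the others, and larger values mean more overlap. For a fitted model, the car package offers vif(), so vif(multiple_model) should give the equivalent numbers for the Twitch predictors.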

Diagnose Model Fit with Metrics

residuals_multiple <- resid(multiple_model)

rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 597454.8

The RMSE is 597,454.8, which means a typical prediction misses the actual follower count by roughly 597,000 followers. That is larger than the mean follower count in the data (570,054), so the error is large relative to a typical channel. The residuals range from about -3.2 million to 5.5 million, and these very large misses for some streamers lower our confidence in how accurate individual predictions really are.
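The RMSE is closely related to the residual standard error reported by summary(). Both are built from the same squared residuals, but RMSE divides by n while the residual standard error divides by the residual degrees of freedom. The sketch below shows the relationship on simulated data (not the Twitch data) and also computes the mean absolute error for comparison, since RMSE is not the same thing as the average absolute miss.

```r
set.seed(4)
n <- 1000
x <- rnorm(n)
y <- 5 + 2 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

e <- resid(fit)
rmse <- sqrt(mean(e^2))   # divides the squared residuals by n
rse  <- sigma(fit)        # divides by the residual degrees of freedom
mae  <- mean(abs(e))      # mean absolute error, for comparison

c(rmse = rmse, rse = rse, mae = mae)

# RMSE equals the residual standard error scaled by sqrt(df / n)
all.equal(rmse, rse * sqrt(df.residual(fit) / n))
```

Applied to my model, 599,000 × sqrt(995 / 1000) is roughly 597,500, which matches the reported RMSE up to the rounding of the residual standard error.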

Conclusion and Future Directions

Overall, this regression analysis shows that key engagement metrics, especially peak viewers, total watch time, and stream time, play major roles in predicting a Twitch channel’s follower count. Peak viewers and watch time are strongly and positively associated with follower numbers, while more minutes streaming are linked to fewer followers, suggesting that streaming longer isn’t always more effective. The model explains about 44.6% of the variation in follower count, which is a moderate fit, but a large portion of the variation in followers remains unexplained. The model is stable and meets most assumptions, but the moderate multicollinearity and the high RMSE show that predictions can vary a lot, especially for very large streamers.

For future research, this model could be improved by adding predictors such as game category or how long the streamer has been on Twitch. Overall, this analysis provides a strong starting point, but expanding the model with richer data and more advanced techniques could give a clearer picture of what drives follower growth on Twitch.

References

Martin, Roland. “Twitch | Overview, History, & Facts | Britannica.” Britannica, 20 Nov. 2023, www.britannica.com/topic/Twitch-service.

Mishra, Aayush. “Top Streamers on Twitch.” Kaggle, 2020, www.kaggle.com/datasets/aayushmishra1512/twitchdata.

Shear, Emmett. “16 Years of Twitch.” Twitch Blog, 16 Mar. 2023, blog.twitch.tv/en/2023/03/16/16-years-of-twitch/.