Can stream performance metrics predict the number of followers a Twitch streamer has?
Live streaming is now one of the most popular forms of media. Streamers broadcast themselves playing video games, talking, baking, or doing almost anything else while interacting with a live chat, and many get paid for it. Twitch is an online platform that is home to over 8 million streamers. It began in 2007 as Justin.tv, a 24/7 live broadcast of Justin Kan (one of Twitch’s founders). After gaining popularity, the site expanded to let other people live stream, and live gaming soon became the main genre of content on Justin.tv. In response, the founders, Justin Kan, Emmett Shear, Michael Seibel, and Kyle Vogt, created a similar website called Twitch.
Twitch soon became a source of income for successful streamers, who could become site partners and earn revenue through advertising. In 2014, Kan shut down Justin.tv to focus on Twitch, and Amazon acquired Twitch that same year for $970 million. Over the years, Twitch has added features to support its streamers, such as “Bits,” which viewers can purchase and give to streamers, and chat moderators, who can filter out unwanted comments.
The dataset I am using contains a list of the top 1000 streamers from 2019. It has 11 variables and 1000 observations, but the main variables I am focusing on are “Followers” (the number of followers a streamer had at that time), “Peak.viewers” (the highest number of viewers the streamer had on one stream), “Stream.time.minutes.” (how many minutes the streamer spent streaming), “Watch.time.Minutes.” (how many minutes viewers spent watching the streamer), and “Average.viewers” (the average number of viewers on a stream). I chose these variables because together they represent different aspects of a streamer’s performance. Peak viewers shows their reach, watch time reflects how engaged the audience is, stream time represents how much content they produced, and average viewers shows the size of their typical audience. Each of these could plausibly tell us something about follower count. I found this dataset in a GitHub repository called Awesome Public Datasets, and the original source is sullygnome.com.
My first step in the data analysis was to inspect my dataset. I used colnames() to look at the column names and noticed that they were capitalized, so I converted them to lowercase. Then I created a new dataset called twitch1 that keeps only the variables I need. After that, I made another dataset called twitch2 in which I renamed the columns, replacing the “.” with “_” and shortening the names to make them easier to read. For example, I changed “stream.time.minutes.” to “stream_time_min”.
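The manual renaming described above could also be done programmatically. The base R sketch below is only an illustration, not the code used in this report: it starts from the column names printed by colnames(), and the object names `raw_names` and `clean_names` are hypothetical. Note that it produces "watch_time_minutes" rather than the shorter, hand-picked "watch_time_min".

```r
# Column names as printed by colnames(twitch) in this report
raw_names <- c("Channel", "Watch.time.Minutes.", "Stream.time.minutes.",
               "Peak.viewers", "Average.viewers", "Followers",
               "Followers.gained", "Views.gained", "Partnered",
               "Mature", "Language")

clean_names <- tolower(raw_names)                          # lowercase everything
clean_names <- gsub(".", "_", clean_names, fixed = TRUE)   # literal dots -> underscores
clean_names <- sub("_$", "", clean_names)                  # drop a trailing underscore
clean_names
```

Assigning the result back with names(twitch) <- clean_names would apply it to the data frame.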
Next, I looked at the structure of my twitch2 dataset to see what class each variable belonged to, and then I checked whether there were any NAs in the dataset using colSums(is.na()). Fortunately, there were zero NAs. After that, I used summary() to get the five-number summary and the mean for each variable. I also created a dataset called twitch_summary and used summarise() to get the means of the variables. I used head() to view the first rows of each dataset and get a better sense of the range my numbers fall into. Finally, I converted the data to long format and created a facet plot with trend lines to explore the relationship between each performance metric and follower count.
I am going to use a multiple linear regression model in this project because it is well suited to predicting a continuous outcome variable, and in this case I am predicting the number of followers. This is different from logistic regression, which is used for binary categorical outcome variables. For the multiple linear regression model to be valid, I need to check linearity, independence of observations, homoscedasticity, normality of residuals, and multicollinearity. To do this, I will look at the Component + Residual plots to see whether the points form a straight-line pattern (linearity), use the Residuals vs. Order plot to see whether the residuals bounce randomly around zero without patterns (independence), and examine the Residuals vs. Fitted plot to verify that the residuals are scattered randomly and consistently around zero. I will also review the Scale-Location plot to make sure the red line stays horizontal and that the points are spread evenly, which supports homoscedasticity. I will use a Q–Q plot to see whether the residuals follow the straight diagonal line (normality), and I will check for high correlations between the predictors to detect multicollinearity. These diagnostics will help confirm whether the model meets the main regression assumptions.
library(car)
## Loading required package: carData
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 4.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
twitch <- read.csv('twitchdata-update.csv')
colnames(twitch)
## [1] "Channel" "Watch.time.Minutes." "Stream.time.minutes."
## [4] "Peak.viewers" "Average.viewers" "Followers"
## [7] "Followers.gained" "Views.gained" "Partnered"
## [10] "Mature" "Language"
names(twitch) <- tolower(names(twitch))
colnames(twitch)
## [1] "channel" "watch.time.minutes." "stream.time.minutes."
## [4] "peak.viewers" "average.viewers" "followers"
## [7] "followers.gained" "views.gained" "partnered"
## [10] "mature" "language"
# Keep only the outcome and the four performance metrics
twitch1 <- twitch |>
  select(followers, peak.viewers, stream.time.minutes., watch.time.minutes., average.viewers)
head(twitch1)
## followers peak.viewers stream.time.minutes. watch.time.minutes.
## 1 3246298 222720 215250 6196161750
## 2 5310163 310998 211845 6091677300
## 3 1767635 387315 515280 5644590915
## 4 3944850 300575 517740 3970318140
## 5 8938903 285644 123660 3671000070
## 6 1563438 263720 82260 3668799075
## average.viewers
## 1 27716
## 2 25610
## 3 10976
## 4 7714
## 5 29602
## 6 42414
# Rename columns: new_name = old_name
twitch2 <- twitch1 |>
  rename(
    peak_viewers = peak.viewers,
    stream_time_min = stream.time.minutes.,
    watch_time_min = watch.time.minutes.,
    avg_viewers = average.viewers
  )
head(twitch2)
## followers peak_viewers stream_time_min watch_time_min avg_viewers
## 1 3246298 222720 215250 6196161750 27716
## 2 5310163 310998 211845 6091677300 25610
## 3 1767635 387315 515280 5644590915 10976
## 4 3944850 300575 517740 3970318140 7714
## 5 8938903 285644 123660 3671000070 29602
## 6 1563438 263720 82260 3668799075 42414
str(twitch2)
## 'data.frame': 1000 obs. of 5 variables:
## $ followers : int 3246298 5310163 1767635 3944850 8938903 1563438 4074287 508816 3530767 2607076 ...
## $ peak_viewers : int 222720 310998 387315 300575 285644 263720 115633 68795 89387 125408 ...
## $ stream_time_min: int 215250 211845 515280 517740 123660 82260 136275 147885 122490 92880 ...
## $ watch_time_min : num 6.20e+09 6.09e+09 5.64e+09 3.97e+09 3.67e+09 ...
## $ avg_viewers : int 27716 25610 10976 7714 29602 42414 24181 18985 22381 12377 ...
colSums(is.na(twitch2))
## followers peak_viewers stream_time_min watch_time_min avg_viewers
## 0 0 0 0 0
summary(twitch2)
## followers peak_viewers stream_time_min watch_time_min
## Min. : 3660 Min. : 496 Min. : 3465 Min. :1.222e+08
## 1st Qu.: 170546 1st Qu.: 9114 1st Qu.: 73759 1st Qu.:1.632e+08
## Median : 318063 Median : 16676 Median :108240 Median :2.350e+08
## Mean : 570054 Mean : 37065 Mean :120515 Mean :4.184e+08
## 3rd Qu.: 624332 3rd Qu.: 37570 3rd Qu.:141844 3rd Qu.:4.337e+08
## Max. :8938903 Max. :639375 Max. :521445 Max. :6.196e+09
## avg_viewers
## Min. : 235
## 1st Qu.: 1458
## Median : 2425
## Mean : 4781
## 3rd Qu.: 4786
## Max. :147643
# Means of the main variables
twitch_summary <- twitch2 |>
  summarise(
    avg_peak = mean(peak_viewers),
    avg_watch = mean(watch_time_min),
    avg_followers = mean(followers),
    avg_stream = mean(stream_time_min)
  )
head(twitch_summary)
## avg_peak avg_watch avg_followers avg_stream
## 1 37065.05 418427930 570054.1 120515.2
# Long format: one row per streamer per metric, for faceting
twitch_long <- twitch2 |>
  pivot_longer(
    cols = c(peak_viewers, watch_time_min, avg_viewers, stream_time_min),
    names_to = "metric",
    values_to = "value"
  )
head(twitch_long)
## # A tibble: 6 × 3
## followers metric value
## <int> <chr> <dbl>
## 1 3246298 peak_viewers 222720
## 2 3246298 watch_time_min 6196161750
## 3 3246298 avg_viewers 27716
## 4 3246298 stream_time_min 215250
## 5 5310163 peak_viewers 310998
## 6 5310163 watch_time_min 6091677300
# One scatterplot panel per metric, each with a smoothed trend line
plot <- twitch_long |>
  ggplot(aes(x = value, y = followers)) +
  geom_point(alpha = 0.6, color = "#86a7bf") +
  geom_smooth(color = "#b01a1a") +
  facet_wrap(~ metric, scales = "free_x") +
  labs(title = "Stream Metrics vs Total Followers",
       subtitle = "Top 1000 Twitch streamers (2019)",
       x = "Value",
       y = "Followers",
       caption = "Source: sullygnome.com") +
  theme_minimal()
plot
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
multiple_model <- lm(followers ~ peak_viewers + stream_time_min + watch_time_min + avg_viewers , data = twitch2)
multiple_model
##
## Call:
## lm(formula = followers ~ peak_viewers + stream_time_min + watch_time_min +
## avg_viewers, data = twitch2)
##
## Coefficients:
## (Intercept) peak_viewers stream_time_min watch_time_min
## 3.105e+05 2.851e+00 -1.352e+00 7.568e-04
## avg_viewers
## 3.893e-02
summary(multiple_model)
##
## Call:
## lm(formula = followers ~ peak_viewers + stream_time_min + watch_time_min +
## avg_viewers, data = twitch2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3222326 -224595 -105510 112381 5457436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.105e+05 3.756e+04 8.267 4.38e-16 ***
## peak_viewers 2.851e+00 4.703e-01 6.061 1.92e-09 ***
## stream_time_min -1.352e+00 2.421e-01 -5.586 3.00e-08 ***
## watch_time_min 7.568e-04 4.509e-05 16.783 < 2e-16 ***
## avg_viewers 3.893e-02 3.224e+00 0.012 0.99
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 599000 on 995 degrees of freedom
## Multiple R-squared: 0.4478, Adjusted R-squared: 0.4456
## F-statistic: 201.7 on 4 and 995 DF, p-value: < 2.2e-16
This multiple linear regression model has an adjusted R² of 0.446, which means the model explains about 44.6% of the variation in follower count after adjusting for the number of predictors. The intercept of 310,500 represents the predicted number of followers when all predictors are zero, but that is an extrapolation rather than a realistic scenario. Three of the four predictors are statistically significant at the 0.05 level. Peak viewers has a positive coefficient of 2.851, meaning that each additional peak viewer is associated with about 2.85 more followers, holding everything else constant. Total watch time (in minutes) has a positive coefficient of 0.0007568, so each additional minute watched is associated with about 0.00076 more followers, or roughly 757 more followers per million minutes watched. Stream time (in minutes) has a negative coefficient of –1.352, suggesting that each additional minute streamed is associated with about 1.35 fewer followers. For average viewers, the coefficient is small and positive (0.039), but it is not statistically significant (p = 0.99). Overall, the model shows that these viewership and engagement metrics together explain a little under half of the variation in a channel’s follower count.
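To make the coefficient interpretation concrete, the sketch below plugs one streamer into the fitted equation by hand. It uses only numbers printed in this report: the rounded coefficients from summary(multiple_model) and the first row shown by head(twitch2). The object names (`coefs`, `streamer`, `predicted`) are hypothetical, and because the coefficients are rounded, the result will differ slightly from what predict() would return.

```r
# Coefficients copied (rounded) from summary(multiple_model)
coefs <- c(intercept = 3.105e+05, peak_viewers = 2.851,
           stream_time_min = -1.352, watch_time_min = 7.568e-04,
           avg_viewers = 3.893e-02)

# First row of the dataset, as printed by head(twitch2)
streamer <- c(peak_viewers = 222720, stream_time_min = 215250,
              watch_time_min = 6196161750, avg_viewers = 27716)
actual_followers <- 3246298

# Prediction = intercept + sum of (coefficient * predictor value)
predicted <- unname(coefs["intercept"] + sum(coefs[names(streamer)] * streamer))
residual <- actual_followers - predicted

round(predicted)  # roughly 5.3 million with these rounded coefficients
round(residual)   # negative: the model over-predicts for this channel
```

Almost all of this prediction comes from the watch time term, which shows how strongly that predictor dominates for the largest channels.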
crPlots(multiple_model)
The Component + Residual plots show that the linearity assumption is met for almost all predictors. Peak_viewers shows a strong positive linear trend, with the loess line sitting almost directly on top of the fitted line across the whole range. Stream_time_min is nearly flat, and the loess line follows the fitted line with no noticeable deviation, showing almost perfect linearity. Watch_time_min shows a clear positive trend where the loess line stays close to the straight line, with only slight bending at the ends, so linearity looks very good. Avg_viewers is the only predictor with a noticeable deviation: it shows a negative trend overall, but the loess line flattens and curves downward at high average viewership values, showing some mild non-linearity in the upper range. Overall, linearity holds well for peak_viewers, stream_time_min, and watch_time_min, and is still acceptable for avg_viewers even with the small deviation.
# Residuals in row order, with a dashed reference line at zero
plot(resid(multiple_model), type = "b",
     main = "Residuals vs Order", ylab = "Residuals")
abline(h = 0, lty = 2)
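The Residuals vs. Order plot shows a few bursts of large residuals within roughly the first 100 indices, but after index 100 the residuals level out, stay randomly scattered, and cluster around zero. Because the rows appear to be sorted by watch time (the first rows have the highest values), these early indices correspond to the largest channels. Beyond that early part of the data, there is no visible trend or correlation between neighboring residuals. Overall, the independence assumption looks reasonable for this model.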
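A numerical check can complement the visual reading of the Residuals vs. Order plot. The sketch below computes the Durbin-Watson statistic by hand in base R; this test is not part of the original analysis, and the data here are simulated stand-ins (`toy` and `toy_model` are hypothetical names). In the report, the same two lines could be applied to resid(multiple_model). Values near 2 suggest little first-order autocorrelation, while values toward 0 or 4 suggest positive or negative autocorrelation.

```r
# Toy stand-in data (hypothetical); in the report, use resid(multiple_model)
set.seed(42)
toy <- data.frame(x = rnorm(200))
toy$y <- 1 + 2 * toy$x + rnorm(200)
toy_model <- lm(y ~ x, data = toy)

e <- resid(toy_model)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the sum of squared residuals
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

The car package, which this report already loads, also provides durbinWatsonTest() for a version with a p-value.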
par(mfrow=c(2,2)); plot(multiple_model); par(mfrow=c(1,1))
The Residuals vs. Fitted plot shows random scatter around zero with
no funnel shape, curves, or patterns, confirming that the
homoscedasticity assumption is met.
The Q–Q plot shows that the standardized residuals follow the
diagonal line very closely in the middle, with only a slight curve at
the right tail. This means the residuals are approximately normally
distributed, with some heavy-tailed behavior on the positive
side, so the normality assumption is reasonable.
The Scale-Location plot shows points evenly scattered around the
red line, which supports constant variance of the residuals and
therefore homoscedasticity.
The Residuals vs. Leverage plot shows all points inside the Cook’s distance contours, with the highest-leverage points (around 0.3–0.4) having small residuals. This suggests there are no highly influential outliers, so the model is stable.
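The quantities behind the Residuals vs. Leverage plot can also be inspected directly. The sketch below is a minimal base R illustration on simulated data (`toy` and `toy_model` are hypothetical stand-ins for twitch2 and multiple_model). The 4 / n cutoff is a common rule of thumb, included here as an assumption and not as a formal test.

```r
# Toy stand-in data (hypothetical); in the report, pass multiple_model instead
set.seed(7)
toy <- data.frame(x = rnorm(100))
toy$y <- 3 + 1.5 * toy$x + rnorm(100)
toy_model <- lm(y ~ x, data = toy)

cooks <- cooks.distance(toy_model)   # influence of each observation on the fit
lev <- hatvalues(toy_model)          # leverage of each observation

# Rule-of-thumb cutoff (an assumption, not a formal test): 4 / n
cutoff <- 4 / nrow(toy)
flagged <- which(cooks > cutoff)

head(sort(cooks, decreasing = TRUE), 5)  # the five most influential points
length(flagged)                          # how many points exceed the cutoff
```

Points that are flagged are worth a closer look, but exceeding the cutoff does not by itself mean an observation should be removed.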
cor(twitch2[, c("peak_viewers", "stream_time_min", "watch_time_min", "avg_viewers")], use = "complete.obs")
## peak_viewers stream_time_min watch_time_min avg_viewers
## peak_viewers 1.0000000 -0.1195403 0.5827966 0.6826373
## stream_time_min -0.1195403 1.0000000 0.1505879 -0.2492478
## watch_time_min 0.5827966 0.1505879 1.0000000 0.4761650
## avg_viewers 0.6826373 -0.2492478 0.4761650 1.0000000
There is moderate multicollinearity, mainly among peak_viewers, avg_viewers, and watch_time_min, whose pairwise correlations range from about 0.48 to 0.68. It is not extreme, so the model is still stable and usable. However, it does make the coefficients for those three variables harder to interpret, because it is difficult to tell which one is actually associated with the change in follower count. Overall, the level of multicollinearity is acceptable and the model is still usable for predicting followers.
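Pairwise correlations can be summarized per predictor with variance inflation factors (VIFs). The sketch below is an addition to the original analysis: it rebuilds the correlation matrix printed above and uses the identity that, for numeric predictors, each VIF equals the corresponding diagonal entry of the inverse correlation matrix. The object names `R` and `vif_values` are hypothetical. A common rule of thumb, stated here as an assumption, treats VIFs above 5 or 10 as a sign of problematic multicollinearity.

```r
predictors <- c("peak_viewers", "stream_time_min", "watch_time_min", "avg_viewers")

# Correlation matrix copied from the cor() output in this report
R <- matrix(c(
   1.0000000, -0.1195403, 0.5827966,  0.6826373,
  -0.1195403,  1.0000000, 0.1505879, -0.2492478,
   0.5827966,  0.1505879, 1.0000000,  0.4761650,
   0.6826373, -0.2492478, 0.4761650,  1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(predictors, predictors))

# VIF_j = j-th diagonal element of the inverse correlation matrix
vif_values <- diag(solve(R))
round(vif_values, 2)
```

With the fitted model available, car::vif(multiple_model) should give the same values up to rounding of the printed correlations.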
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 597454.8
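The RMSE is 597,454.8, which means a typical prediction is off by roughly 597,455 followers compared to the actual follower counts of these top Twitch streamers. That is larger than the mean follower count of about 570,054, so the errors are large relative to the size of a typical channel. Because RMSE squares the residuals, a few streamers with very large residuals can inflate it, which lowers our confidence in how accurate the model’s predictions really are.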
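The RMSE above is closely related to the residual standard error reported by summary(). The sketch below works through that relationship using only numbers printed in this report; the object names are hypothetical, and the residual standard error is the rounded value from the output, so the match is approximate.

```r
n <- 1000                  # number of observations
df_resid <- 995            # residual degrees of freedom (n minus 5 coefficients)
rse <- 599000              # residual standard error as printed (rounded)
rmse_reported <- 597454.8  # RMSE computed in this report
mean_followers <- 570054.1 # mean follower count from twitch_summary

# RMSE divides the residual sum of squares by n, the residual standard
# error divides it by the residual df, so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df_resid / n)
rmse_from_rse

# RMSE expressed as a multiple of the mean follower count
rmse_reported / mean_followers
```

The two error measures differ only by the degrees-of-freedom correction, which is small here because n is much larger than the number of coefficients.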
Overall, this regression analysis shows that key engagement metrics, especially peak viewers, total watch time, and stream time, play major roles in predicting a Twitch channel’s follower count. Peak viewers and watch time are associated with higher follower numbers, while more minutes streamed are linked to fewer followers, suggesting that streaming longer isn’t always more effective. Even though the model explains about 44.6% of the variation in follower count, meaning the fit is reasonably good, a large portion of the variation in follower count remains unexplained. The model is stable and meets most assumptions, but the moderate multicollinearity and the high RMSE show that predictions can vary a lot, especially for very large streamers.
For future research, this model could be improved by adding predictors such as game category or how long the streamer has been on Twitch. Overall, this analysis provides a strong starting point, but expanding the model with richer data and more advanced techniques could give a clearer picture of what drives follower growth on Twitch.
Martin, Roland. “Twitch | Overview, History, & Facts | Britannica.” Www.britannica.com, 20 Nov. 2023, www.britannica.com/topic/Twitch-service.
Mishra, Aayush. “Top Streamers on Twitch.” Www.kaggle.com, 2020, www.kaggle.com/datasets/aayushmishra1512/twitchdata.
Shear, Emmett. “16 Years of Twitch.” Twitch Blog, 16 Mar. 2023, blog.twitch.tv/en/2023/03/16/16-years-of-twitch/.