Analyzing Statistical Factors for Twitch Channel Growth
in a Competitive Streaming Environment
Introduction
Streaming has grown steadily since the start of the pandemic. Twitch, the dominant platform, hosts a large number of content creators, and with so many creators comes competition among them. This project analyzes the factors that separate a successful streamer from the rest of the competition.
Problem Statement
Our problem is to determine why some streamers excel compared to the majority. Which factors could have driven their channels' growth to such an extent?
Project Goal
The goal of this project is to compare data from the top 1,000 streamers of the past year (2020) and use the available variables to test whether being a top-performing streamer correlates with those factors.
Step 1: Data Visualization
Import Data / Data Summary
#Load CSV file from https://www.kaggle.com/datasets/aayushmishra1512/twitchdata/data
twitch_data <- read.csv(file.choose())
#Data Summary of the Twitch Data
head(twitch_data)
## Channel Watch.time.Minutes. Stream.time.minutes. Peak.viewers
## 1 xQcOW 6196161750 215250 222720
## 2 summit1g 6091677300 211845 310998
## 3 Gaules 5644590915 515280 387315
## 4 ESL_CSGO 3970318140 517740 300575
## 5 Tfue 3671000070 123660 285644
## 6 Asmongold 3668799075 82260 263720
## Average.viewers Followers Followers.gained Views.gained Partnered Mature
## 1 27716 3246298 1734810 93036735 True False
## 2 25610 5310163 1370184 89705964 True False
## 3 10976 1767635 1023779 102611607 True True
## 4 7714 3944850 703986 106546942 True False
## 5 29602 8938903 2068424 78998587 True False
## 6 42414 1563438 554201 61715781 True False
## Language
## 1 English
## 2 English
## 3 Portuguese
## 4 English
## 5 English
## 6 English
summary(twitch_data)
## Channel Watch.time.Minutes. Stream.time.minutes. Peak.viewers
## Length:1000 Min. :1.222e+08 Min. : 3465 Min. : 496
## Class :character 1st Qu.:1.632e+08 1st Qu.: 73759 1st Qu.: 9114
## Mode :character Median :2.350e+08 Median :108240 Median : 16676
## Mean :4.184e+08 Mean :120515 Mean : 37065
## 3rd Qu.:4.337e+08 3rd Qu.:141844 3rd Qu.: 37570
## Max. :6.196e+09 Max. :521445 Max. :639375
## Average.viewers Followers Followers.gained Views.gained
## Min. : 235 Min. : 3660 Min. : -15772 Min. : 175788
## 1st Qu.: 1458 1st Qu.: 170546 1st Qu.: 43758 1st Qu.: 3880602
## Median : 2425 Median : 318063 Median : 98352 Median : 6456324
## Mean : 4781 Mean : 570054 Mean : 205519 Mean : 11668166
## 3rd Qu.: 4786 3rd Qu.: 624332 3rd Qu.: 236131 3rd Qu.: 12196762
## Max. :147643 Max. :8938903 Max. :3966525 Max. :670137548
## Partnered Mature Language
## Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
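The summary above reports Partnered and Mature as character columns holding the strings "True" and "False". One possible cleanup, shown here only as a sketch, is to convert them to logical values before modeling. The data frame `twitch_head` is a hypothetical stand-in built from the six rows printed by head() so that the snippet runs without the CSV file.

```r
# Hypothetical stand-in: three columns from the six rows shown by head() above
twitch_head <- data.frame(
  Channel = c("xQcOW", "summit1g", "Gaules", "ESL_CSGO", "Tfue", "Asmongold"),
  Partnered = c("True", "True", "True", "True", "True", "True"),
  Mature = c("False", "False", "True", "False", "False", "False"),
  stringsAsFactors = FALSE
)

# Compare against the literal string so each column becomes TRUE/FALSE
twitch_head$Partnered <- twitch_head$Partnered == "True"
twitch_head$Mature <- twitch_head$Mature == "True"

str(twitch_head)
table(twitch_head$Mature)
```

If the same conversion were applied to the full data set, lm() would label the dummy coefficient MatureTRUE rather than MatureTrue; the estimate itself would not change.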
Pairs Plot of Twitch Data (in Millions)
#Check the structure of the original twitch_data
str(twitch_data)
## 'data.frame': 1000 obs. of 11 variables:
## $ Channel : chr "xQcOW" "summit1g" "Gaules" "ESL_CSGO" ...
## $ Watch.time.Minutes. : num 6.20e+09 6.09e+09 5.64e+09 3.97e+09 3.67e+09 ...
## $ Stream.time.minutes.: int 215250 211845 515280 517740 123660 82260 136275 147885 122490 92880 ...
## $ Peak.viewers : int 222720 310998 387315 300575 285644 263720 115633 68795 89387 125408 ...
## $ Average.viewers : int 27716 25610 10976 7714 29602 42414 24181 18985 22381 12377 ...
## $ Followers : int 3246298 5310163 1767635 3944850 8938903 1563438 4074287 508816 3530767 2607076 ...
## $ Followers.gained : int 1734810 1370184 1023779 703986 2068424 554201 1089824 425468 951730 1532689 ...
## $ Views.gained : int 93036735 89705964 102611607 106546942 78998587 61715781 46084211 670137548 51349926 36350662 ...
## $ Partnered : chr "True" "True" "True" "True" ...
## $ Mature : chr "False" "False" "True" "False" ...
## $ Language : chr "English" "English" "Portuguese" "English" ...
#Create a new data frame with all necessary columns
twitch_summary <- data.frame(
  average_viewers = twitch_data$Average.viewers / 1e6, # Convert to millions
  followers = twitch_data$Followers / 1e6, # Convert to millions
  followers_gained = twitch_data$Followers.gained / 1e6, # Convert to millions
  peak_viewers = twitch_data$Peak.viewers / 1e6, # Convert to millions
  views_gained = twitch_data$Views.gained / 1e6, # Convert to millions
  stream_time = twitch_data$Stream.time.minutes. / 1e6, # Convert to millions
  watch_time = twitch_data$Watch.time.Minutes. / 1e6 # Convert to millions
)
#Use the rescaled data frame to draw the scatterplot matrix of the Twitch data
pairs(twitch_summary)

Correlation Chart of Twitch Data
#Load Libraries
library(ggplot2) # For visualization
library(reshape2) # For reshaping data
#Calculate the correlation matrix using only numeric columns
correlation_matrix <- cor(twitch_data[, c('Watch.time.Minutes.',
                                          'Stream.time.minutes.',
                                          'Followers',
                                          'Peak.viewers',
                                          'Average.viewers',
                                          'Followers.gained',
                                          'Views.gained')],
                          use = "complete.obs")
#Melt the correlation matrix for ggplot
correlation_melted <- melt(correlation_matrix)
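#Define display names for the plot, keyed by the exact column names (including trailing dots)
variable_names <- c(
  'Watch.time.Minutes.' = 'Watch Time (Minutes)',
  'Stream.time.minutes.' = 'Stream Time (Minutes)',
  'Followers' = 'Followers',
  'Peak.viewers' = 'Peak Viewers',
  'Average.viewers' = 'Average Viewers',
  'Followers.gained' = 'Followers Gained',
  'Views.gained' = 'Views Gained')
#Replace the variable names in the melted data
#as.character() makes the lookup go by name; a factor would be indexed by level position
correlation_melted$Var1 <- factor(unname(variable_names[as.character(correlation_melted$Var1)]),
                                  levels = unname(variable_names))
correlation_melted$Var2 <- factor(unname(variable_names[as.character(correlation_melted$Var2)]),
                                  levels = unname(variable_names))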
#Create the heatmap using ggplot2
ggplot(data = correlation_melted, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black") +
  scale_fill_gradient2(low = "purple", mid = "white", high = "red", midpoint = 0) +
  theme_minimal() +
  labs(title = "Correlation Chart for Twitch Data Set",
       x = " ",
       y = " ") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10),
        axis.text.y = element_text(size = 10),
        plot.title = element_text(size = 20, hjust = 0.5))

Interpretation: The closer a tile is to red, the stronger the positive correlation. The closer a tile is to purple, the stronger the negative correlation. White tiles indicate a correlation near zero, because the color scale is centered at 0.
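Reading a heatmap by eye can be backed up by listing the variable pairs in order of correlation strength. The sketch below uses a small synthetic data frame (`demo`, with made-up column names), not the Twitch data, so the printed numbers say nothing about the real channels. It only illustrates how the upper triangle of a correlation matrix can be turned into a ranked table.

```r
set.seed(42)  # synthetic data only; not the Twitch data set
n <- 200
viewers <- rnorm(n)
demo <- data.frame(
  avg_viewers = viewers,
  watch_time = viewers + rnorm(n, sd = 0.3),  # built to track avg_viewers closely
  stream_time = -0.5 * viewers + rnorm(n),    # built to move in the opposite direction
  noise = rnorm(n)                            # unrelated to the others
)
cor_mat <- cor(demo)

# Keep each pair once (upper triangle), then sort by absolute correlation
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r = cor_mat[idx],
  stringsAsFactors = FALSE
)
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
print(pairs_tbl, row.names = FALSE)
```

Sorting by absolute value keeps strong negative relationships (purple tiles) near the top of the table alongside strong positive ones (red tiles).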
Step 2: Regression Analysis / Result
Interpretation
Dependent Variable: Followers Gained
Independent Variables: Stream Time in Minutes and Maturity
Equation for Followers Gained: ŷ = B0 + B1x1 + B2x2
ŷ = Predicted Followers Gained
x1 = Stream Time in Minutes
x2 = Maturity (1 if the channel is marked Mature, 0 otherwise)
model <- lm(Followers.gained ~ Stream.time.minutes. + Mature, data = twitch_data)
summary(model)
##
## Call:
## lm(formula = Followers.gained ~ Stream.time.minutes. + Mature,
## data = twitch_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -273217 -156609 -92118 23846 3695919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.955e+05 1.902e+04 15.540 < 2e-16 ***
## Stream.time.minutes. -6.143e-01 1.242e-01 -4.948 8.81e-07 ***
## MatureTrue -6.948e+04 2.518e+04 -2.760 0.00589 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 334700 on 997 degrees of freedom
## Multiple R-squared: 0.03241, Adjusted R-squared: 0.03047
## F-statistic: 16.7 on 2 and 997 DF, p-value: 7.373e-08
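As a sanity check, the adjusted R-squared and F-statistic in this output can be reproduced from the multiple R-squared, the number of observations, and the number of predictors using the standard formulas. The variable names below are illustrative; the three input values are copied from the output above.

```r
r2 <- 0.03241  # Multiple R-squared from the summary above
n <- 1000      # observations in twitch_data
p <- 2         # predictors: Stream.time.minutes. and Mature

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(adj_r2, 5)
round(f_stat, 1)
```

Both values agree with the reported 0.03047 and 16.7 after rounding. With only two predictors and 1,000 observations, the adjustment barely changes R-squared.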
Assumptions:
Level of Significance (α) = 0.05
The relationship between the independent and dependent variables is linear
Residuals are normally distributed
No Multicollinearity
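These assumptions can be checked rather than taken for granted. The residual quantiles in the summary (median of -92,118 against a maximum of 3,695,919) hint at right skew, so the normality assumption in particular deserves a test. The sketch below runs on synthetic data (`demo`, deliberately given skewed noise), not on the Twitch data, and shows one possible way to test residual normality with shapiro.test() and to compute a variance inflation factor by hand.

```r
set.seed(1)  # synthetic stand-in, not the Twitch data
n <- 300
demo <- data.frame(
  stream_time = runif(n, 3000, 500000),
  mature = sample(c("False", "True"), n, replace = TRUE)
)
# Right-skewed (exponential) noise, chosen to mimic a long upper tail
demo$followers_gained <- 250000 - 0.5 * demo$stream_time -
  60000 * (demo$mature == "True") + rexp(n, rate = 1 / 150000)

fit <- lm(followers_gained ~ stream_time + mature, data = demo)

# Normality of residuals: a small p-value suggests the residuals are not normal
shapiro_p <- shapiro.test(residuals(fit))$p.value

# Multicollinearity: VIF = 1 / (1 - R^2) from regressing one predictor on the other
aux_r2 <- summary(lm(stream_time ~ mature, data = demo))$r.squared
vif_stream_time <- 1 / (1 - aux_r2)

shapiro_p
vif_stream_time
```

A VIF close to 1 indicates the predictors carry little overlapping information. Calling plot(fit) would additionally draw the usual residual diagnostics, including a Q-Q plot.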
Result Interpretation: ŷ = 295500 - 0.6143(Stream Time) - 69480(Mature)
Adjusted R-squared: 0.03047 → about 3% of the variation in the dependent variable (Followers Gained) is explained by the independent variables included in our regression model.
The model has limited predictive power; additional or alternative factors may better explain the variation in the dependent variable.
Predicted Followers Gained is 295,500 for a non-mature channel with zero stream time.
Taken literally, the model predicts the highest follower gains before a channel begins creating content, but this is an extrapolation: the minimum stream time in the data is 3,465 minutes.
Stream time coefficient: -0.6143
Each additional minute of stream time is associated with about 0.61 fewer followers gained, holding maturity status constant.
Maturity status coefficient: -69,480
Channels marked Mature are predicted to gain about 69,480 fewer followers than non-mature channels with the same stream time.
More inclusive content → higher chance a channel will sustain its follower growth.
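As a worked example of the fitted equation, the coefficients printed by summary(model) can be plugged in by hand. The helper `predict_followers` is a hypothetical name introduced for illustration, and the stream time used is the median reported by summary(twitch_data).

```r
# Coefficients copied from the summary(model) output
b0 <- 2.955e+05         # intercept
b_stream <- -6.143e-01  # change per additional minute of stream time
b_mature <- -6.948e+04  # shift when Mature is "True"

# Hypothetical helper: evaluates the fitted equation for one channel
predict_followers <- function(stream_minutes, mature) {
  b0 + b_stream * stream_minutes + b_mature * as.numeric(mature)
}

median_stream <- 108240  # median Stream.time.minutes. from summary(twitch_data)
predict_followers(median_stream, mature = FALSE)
predict_followers(median_stream, mature = TRUE)
```

At the median stream time the equation gives roughly 229,000 followers gained for a non-mature channel and roughly 159,500 for a mature one. Given the adjusted R-squared of about 3%, these figures are best read as rough averages rather than forecasts for any single channel.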
Step 3: Conclusion and Recommendation
Followers Gained:
Our conclusion is that followers gained is significantly influenced by Stream Time and Maturity Rating. Though some languages also slightly influence followers gained, they were not significant enough to be included in our model.
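One way to judge whether a categorical variable such as Language earns a place in the model is a partial F-test between the model without it and the model with it. The sketch below is not the analysis behind the conclusion above. It uses synthetic data (`demo`) in which language is deliberately given no effect, and shows only the mechanics of the comparison with anova().

```r
set.seed(7)  # synthetic stand-in, not the Twitch data
n <- 400
demo <- data.frame(
  stream_time = runif(n, 3000, 500000),
  mature = sample(c("False", "True"), n, replace = TRUE),
  language = sample(c("English", "Portuguese", "Other"), n, replace = TRUE)
)
# Language deliberately has no effect in this simulated response
demo$followers_gained <- 250000 - 0.5 * demo$stream_time -
  60000 * (demo$mature == "True") + rnorm(n, sd = 100000)

base_fit <- lm(followers_gained ~ stream_time + mature, data = demo)
lang_fit <- lm(followers_gained ~ stream_time + mature + language, data = demo)

# Partial F-test: does adding language reduce residual error more than chance would?
comparison <- anova(base_fit, lang_fit)
comparison
```

A large p-value in the second row of the table would support leaving the extra variable out; a small one would suggest it explains variation that the simpler model misses.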
Business Recommendations:
- Stream content that is suitable for all audiences
- Advertise and market your channel before you start streaming
- Focus on gaining followers at the start of the channel and on maintaining your audience