Live streaming has become a major part of social media, especially after the spike in popularity it saw during the COVID-19 quarantine. Many streamers grew along with it, building large fan bases and turning content creation into a career. I want to know which variables factor into being a successful content creator, and which would matter most to someone hoping to become one. My data set was pulled from Twitch analytics via sullygnome.com, a publicly accessible Twitch statistics tracker that has been running for multiple years. The data is from 2020, at the height of the COVID-19 pandemic. This is relevant because the pandemic brought Twitch the highest viewership it had seen up to that point, since people had much more time on their hands and could sit through a full multi-hour stream. The data was simplified and reworked on Kaggle by user Aayush Mishra (https://www.kaggle.com/datasets/aayushmishra1512/twitchdata). It includes watch time, stream time, peak viewership, follower count, language spoken, and stream tags (being partnered or a “mature” streamer). The variables I focus on for most of this study are average viewers, followers, stream time, and watch time, though a few others are included for fun:
Categorical Variables
language: The main language the streamer speaks; English, Portuguese, Spanish, German, Korean, French, Russian, Japanese, Chinese, Czech, Turkish, Italian, Polish, Thai, Arabic, Slovak, Hungarian, Greek, Finnish, Swedish, or Other
mature: TRUE or FALSE; whether the streamer chose to mark their stream as mature (18+)
partnered: TRUE or FALSE; whether the streamer is part of Twitch’s Partner program, an initiative that allows creators to earn more revenue from subscriptions.
Numeric Variables
followers: The active follower count of the streamer (at the time this dataset was made)
stream_time_minutes_: the number of minutes the streamer has been live
watch_time_minutes_: the total number of minutes viewers have spent watching the streamer, summed across all viewers
average_viewers: the mean number of viewers the streamer gets per stream
# Calling the data and libraries
twitch <- read.csv("twitchdata-update.csv")
head(twitch)
## Channel Watch.time.Minutes. Stream.time.minutes. Peak.viewers
## 1 xQcOW 6196161750 215250 222720
## 2 summit1g 6091677300 211845 310998
## 3 Gaules 5644590915 515280 387315
## 4 ESL_CSGO 3970318140 517740 300575
## 5 Tfue 3671000070 123660 285644
## 6 Asmongold 3668799075 82260 263720
## Average.viewers Followers Followers.gained Views.gained Partnered Mature
## 1 27716 3246298 1734810 93036735 True False
## 2 25610 5310163 1370184 89705964 True False
## 3 10976 1767635 1023779 102611607 True True
## 4 7714 3944850 703986 106546942 True False
## 5 29602 8938903 2068424 78998587 True False
## 6 42414 1563438 554201 61715781 True False
## Language
## 1 English
## 2 English
## 3 Portuguese
## 4 English
## 5 English
## 6 English
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales) # For graphing
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
Most of my data preparation revolves around conversion and open-ended exploration. Many of the variable names needed to be changed for consistency because they used periods and uppercase letters, but thankfully there were no NAs. I also converted some of the variables from minutes to hours so they would be easier to digest, especially when graphed. A handful of the other things I did were just to satisfy my curiosity. In particular, sorting by follower count was incredibly interesting, as there were many streamers in languages other than English that I was not familiar with, which sparked my curiosity about language as a variable.
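As a minimal sketch of the clean-up described above, the snippet below applies the same two renaming steps to a toy data frame and then converts minutes to hours. The data frame `toy` is made up for illustration (its two rows reuse the xQcOW and summit1g values from the data preview), and only base R is used.

```r
# Toy data frame with the original column naming style (illustrative only)
toy <- data.frame(
  Channel = c("xQcOW", "summit1g"),
  Watch.time.Minutes. = c(6196161750, 6091677300),
  Stream.time.minutes. = c(215250, 211845)
)

# Step 1: lowercase every column name
names(toy) <- tolower(names(toy))
# Step 2: replace each literal period with an underscore
# (the trailing period in the original names becomes a trailing underscore)
names(toy) <- gsub("\\.", "_", names(toy))

# Convert minutes to hours for readability
toy$hours_streamed <- toy$stream_time_minutes_ / 60
toy$watch_time_hours <- toy$watch_time_minutes_ / 60

print(names(toy))
print(toy$hours_streamed)
```

The trailing underscore in names such as `stream_time_minutes_` comes from the period that ended the original column name.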
# Reformatting data
names(twitch) <- tolower(names(twitch))
names(twitch) <- gsub("\\.", "_", names(twitch))
# Taking a sneak peek at the data
head(twitch)
## channel watch_time_minutes_ stream_time_minutes_ peak_viewers
## 1 xQcOW 6196161750 215250 222720
## 2 summit1g 6091677300 211845 310998
## 3 Gaules 5644590915 515280 387315
## 4 ESL_CSGO 3970318140 517740 300575
## 5 Tfue 3671000070 123660 285644
## 6 Asmongold 3668799075 82260 263720
## average_viewers followers followers_gained views_gained partnered mature
## 1 27716 3246298 1734810 93036735 True False
## 2 25610 5310163 1370184 89705964 True False
## 3 10976 1767635 1023779 102611607 True True
## 4 7714 3944850 703986 106546942 True False
## 5 29602 8938903 2068424 78998587 True False
## 6 42414 1563438 554201 61715781 True False
## language
## 1 English
## 2 English
## 3 Portuguese
## 4 English
## 5 English
## 6 English
# Check for NAs ! (NONE !! YAY)
colSums(is.na(twitch))
## channel watch_time_minutes_ stream_time_minutes_
## 0 0 0
## peak_viewers average_viewers followers
## 0 0 0
## followers_gained views_gained partnered
## 0 0 0
## mature language
## 0 0
twitchmod <- twitch |>
  mutate(
    hours_streamed = stream_time_minutes_ / 60,   # Convert minutes to hours
    watch_time_hours = watch_time_minutes_ / 60   # Convert minutes to hours
  ) |>
  arrange(desc(followers)) # View data in order from most -> least follower count (popularity factor)
head(twitchmod)
## channel watch_time_minutes_ stream_time_minutes_ peak_viewers
## 1 Tfue 3671000070 123660 285644
## 2 shroud 888505170 30240 471281
## 3 Myth 1479214575 134760 122552
## 4 Rubius 2588632635 58275 240096
## 5 pokimane 964334055 56505 112160
## 6 summit1g 6091677300 211845 310998
## average_viewers followers followers_gained views_gained partnered mature
## 1 29602 8938903 2068424 78998587 True False
## 2 29612 7744066 833587 30621257 True False
## 3 9396 6726893 1421811 37384058 True False
## 4 42948 5751354 3820532 58599449 True False
## 5 16026 5367605 2085831 45579002 True False
## 6 25610 5310163 1370184 89705964 True False
## language hours_streamed watch_time_hours
## 1 English 2061.00 61183335
## 2 English 504.00 14808420
## 3 English 2246.00 24653576
## 4 Spanish 971.25 43143877
## 5 English 941.75 16072234
## 6 English 3530.75 101527955
# Collecting my numeric variables (for correlation)
numtwitch <- twitchmod |>
select(average_viewers, followers, watch_time_minutes_, stream_time_minutes_)
# Just looking at language statistics for fun !
langtwitch <- twitchmod |>
arrange(desc(language))
# Scatter plot for hours watched vs. hours streamed
ggplot(twitchmod, aes(x = watch_time_hours, y = hours_streamed)) +
geom_point(alpha = 0.5, color = "#6441a5") + # Fun to note- this color is directly picked off of the Twitch logo
labs(title = "Hours Watched vs. Hours Streamed", x = "Hours Watched", y = "Hours Streamed") +
theme_minimal() +
scale_x_continuous(labels = label_number())
There is a slight suggestion that more hours watched goes with more hours streamed, but the relationship is weak. Most points sit in one dense cluster between 0 and 4,250 on the y-axis and 0 and 17,500,000 on the x-axis. There are also many outliers: streamers who have a lot of one without the other. These could be people who do short streams with high viewership or long streams with lower viewership, as well as people at different stages of their careers. If you have been streaming for a long time but only recently started gaining viewers, you may have a large number of streamed hours while your hours watched are only now picking up.
# Scatter plot for viewership vs. followers
ggplot(twitchmod, aes(x = average_viewers, y = followers)) +
geom_point(alpha = 0.5, color = "#6441a5") +
labs(title = "Average Viewership vs. Followers", x = "Avg Viewers", y = "Followers") +
theme_minimal()
There does appear to be a positive relationship here: even though the points are clustered, they suggest an upward slope. Most points remain clustered in the bottom left corner, implying that the majority of streamers fall within a common range, but a few have completely outlandish numbers. With Twitch channels, though, you must take event channels into account. Channels for the Olympics, presidential inaugurations, and similar events would likely have lower follower counts but high viewership, because viewers do not return to them regularly, yet many people flock to them for the short time they are active.
# Stacked bar graph analyzing maturity against partnered status
ggplot(twitchmod, aes(x = mature, fill = partnered)) +
geom_bar(position = "stack", color = "black") +
labs(title = "Streamer Partner Status Distribution by Maturity Status",
x = "Mature Status",
y = "Count of Streamers",
fill = "Partnered status") +
scale_fill_manual(
values = c("#9bbe5a", "#6441a5")) + # Green is the direct inverse color of the twitch logo purple!
theme_grey()
Partnered status is something I find interesting because it is often credited with boosting a channel’s analytics and its standing in Twitch’s algorithm. I wanted to see if partnered status was in some way related to the maturity status set by streamers, but with the uneven split between mature and family-friendly channels, it is hard to tell from raw counts.
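One way to compare groups of unequal size is to look at row proportions instead of raw counts. The sketch below uses a small made-up data frame (the counts are hypothetical, not taken from the data set) with base R's `table()` and `prop.table()`. On the real data, the same two calls would presumably take `twitchmod$mature` and `twitchmod$partnered`.

```r
# Hypothetical streamers: 8 non-mature and 4 mature (made-up counts)
demo <- data.frame(
  mature    = c(rep("False", 8), rep("True", 4)),
  partnered = c(rep("True", 6), rep("False", 2),   # non-mature group
                rep("True", 3), rep("False", 1))   # mature group
)

# Raw counts: rows are mature status, columns are partnered status
counts <- table(mature = demo$mature, partnered = demo$partnered)

# Row proportions: share of partnered vs. non-partnered within each mature group
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

If the rows of `shares` are close to each other, partnered status looks similar across maturity settings no matter how many channels fall in each group.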
cor(numtwitch)
## average_viewers followers watch_time_minutes_
## average_viewers 1.0000000 0.42830322 0.4761650
## followers 0.4283032 1.00000000 0.6202339
## watch_time_minutes_ 0.4761650 0.62023388 1.0000000
## stream_time_minutes_ -0.2492478 -0.09129851 0.1505879
## stream_time_minutes_
## average_viewers -0.24924779
## followers -0.09129851
## watch_time_minutes_ 0.15058790
## stream_time_minutes_ 1.00000000
Strongest correlation: followers and watch_time_minutes_ at 0.620. This makes sense: as your follower count grows, more people are likely to watch you, and you accumulate more minutes watched in return.
Weakest correlation: stream_time_minutes_ and followers at -0.091. This also makes sense: you can be an incredibly well-known and adored streamer while streaming for very little time.
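The strongest and weakest pairs can also be read off programmatically. The sketch below re-enters the correlation matrix printed above by hand (rounded to four decimals) and ranks the off-diagonal entries by absolute value. The object names `r` and `pair_tbl` are only for illustration.

```r
vars <- c("average_viewers", "followers", "watch_time_minutes_", "stream_time_minutes_")

# Correlation matrix copied from the cor(numtwitch) output, rounded to 4 decimals
r <- matrix(
  c( 1.0000,  0.4283, 0.4762, -0.2492,
     0.4283,  1.0000, 0.6202, -0.0913,
     0.4762,  0.6202, 1.0000,  0.1506,
    -0.2492, -0.0913, 0.1506,  1.0000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by taking the upper triangle (excluding the diagonal)
idx <- which(upper.tri(r), arr.ind = TRUE)
pair_tbl <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = r[idx]
)

# Sort from strongest to weakest by absolute correlation
pair_tbl <- pair_tbl[order(-abs(pair_tbl$r)), ]
print(pair_tbl)
```

Ranking by absolute value matters here because a correlation of -0.25 is stronger than one of 0.15, even though it is the smaller number.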
#linear regression model
lreg <- lm(followers ~ average_viewers + watch_time_minutes_ + stream_time_minutes_, data = twitchmod)
# Display summary stats
summary(lreg)
##
## Call:
## lm(formula = followers ~ average_viewers + watch_time_minutes_ +
## stream_time_minutes_, data = twitchmod)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2821931 -239130 -117028 122856 6380544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.326e+05 3.805e+04 8.741 < 2e-16 ***
## average_viewers 1.030e+01 2.792e+00 3.691 0.000236 ***
## watch_time_minutes_ 8.661e-04 4.206e-05 20.590 < 2e-16 ***
## stream_time_minutes_ -1.446e+00 2.459e-01 -5.879 5.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 609600 on 996 degrees of freedom
## Multiple R-squared: 0.4274, Adjusted R-squared: 0.4257
## F-statistic: 247.8 on 3 and 996 DF, p-value: < 2.2e-16
followers = 3.326e+05 + 10.30 * average_viewers + 8.661e-04 * watch_time_minutes_ - 1.446 * stream_time_minutes_
P-value = < 2.2e-16 (Almost zero)
Adjusted R-squared = 0.4257
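To make the equation concrete, the sketch below plugs one channel's values into it. The coefficients are the rounded estimates from the summary output and the inputs are xQcOW's row from the data preview, so the result will differ slightly from what `predict(lreg)` returns with full-precision coefficients. The helper name `predict_followers` is hypothetical.

```r
# Rounded coefficients from summary(lreg)
b0        <- 3.326e+05  # intercept
b_viewers <- 1.030e+01  # average_viewers
b_watch   <- 8.661e-04  # watch_time_minutes_
b_stream  <- -1.446     # stream_time_minutes_

# Hypothetical helper that evaluates the fitted equation by hand
predict_followers <- function(average_viewers, watch_minutes, stream_minutes) {
  b0 + b_viewers * average_viewers + b_watch * watch_minutes + b_stream * stream_minutes
}

# xQcOW's values as printed in the data preview
predicted <- predict_followers(27716, 6196161750, 215250)
actual <- 3246298

print(round(predicted))    # fitted follower count from the rounded equation
print(actual - predicted)  # residual: negative means the model over-predicts
```

For this channel the equation predicts roughly 5.7 million followers against an actual 3,246,298, which shows how far individual channels can sit from the fitted line.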
These variables are clearly significant as a group, as shown by the very small p-value of the overall F-test. The adjusted R-squared stands at 42.57%, which, all things considered, is not bad. It means about 42.57% of the variation in follower count is explained by my variables in this data set, which is a moderate amount. To get into the finer details, the p-value for average viewers is larger than those of the other two variables. This does not make the variable insignificant, as 0.000236 is still far below 0.05; it only means the evidence for its effect is somewhat less overwhelming. A p-value measures how confident we can be that a coefficient differs from zero, not how much weight the variable carries. Still, a weaker link is plausible, since having high viewership does not mean all of those viewers will end up following you. You may only have high viewership because of large-scale events. Note also that the coefficient on stream time is negative: holding average viewers and watch time fixed, each additional minute streamed is associated with about 1.4 fewer followers.
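The adjusted R-squared and F-statistic follow directly from the multiple R-squared and the degrees of freedom. The sketch below redoes that arithmetic from the printed summary, taking n = 1000 observations (996 residual degrees of freedom plus 4 estimated coefficients) and p = 3 predictors. Small differences from the printed values would come from R-squared being rounded to four decimals.

```r
r2       <- 0.4274  # multiple R-squared from summary(lreg)
p        <- 3       # number of predictors
df_resid <- 996     # residual degrees of freedom
n        <- df_resid + p + 1  # 1000 observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variation per predictor over residual variation per df
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(round(adj_r2, 4))
print(round(f_stat, 1))
```

With only three predictors and 1000 observations, the adjustment is tiny, which is why the two R-squared values in the summary are nearly identical.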
twitchmod$predicted <- predict(lreg)
ggplot(twitchmod, aes(x = predicted, y = followers)) +
geom_point() +
geom_smooth(method="lm", col="#6441a5") +
labs(
title = "Predicted LM v. Follower Count",
x = "Predicted LM",
y = "Total Follower Count"
)
## `geom_smooth()` using formula = 'y ~ x'
Once again, the majority of the points are clustered in the bottom left corner. This graph, however, shows a clear upward trend between the model’s predictions and the actual follower counts. The correlation between predicted and actual values is the square root of the multiple R-squared, sqrt(0.4274), or about 0.65, which is moderately strong. It shows that these variables, taken together, do help explain follower counts for Twitch streamers, although the stream time coefficient is negative, so not every variable pushes follower count upward.
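For a least-squares model with an intercept, the correlation between fitted and observed values equals the square root of the multiple R-squared, so it does not have to be estimated by eye from the plot. The sketch below checks this identity on simulated data (the variables `x1`, `x2`, and `y` are made up) and then applies it to the R-squared reported in the model summary.

```r
set.seed(1)  # reproducible simulated data (not the Twitch data)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(200)

fit <- lm(y ~ x1 + x2)

# Correlation between fitted and observed values
r_fit <- cor(fitted(fit), y)

# Its square matches the model's multiple R-squared
print(all.equal(r_fit^2, summary(fit)$r.squared))

# Applied to the Twitch model: the implied correlation for R-squared = 0.4274
print(round(sqrt(0.4274), 2))
```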
Overall, average viewers, minutes watched, and minutes streamed are all significantly associated with my metric of success for streamers, follower count. Average viewership has the weakest evidence of the three, but it still matters. If someone wants to become an insanely popular streamer on Twitch, the model suggests focusing on building watch time rather than simply streaming longer: watch time has the strongest link to followers, while stream time has a negative coefficient once watch time and average viewers are held constant. Longer streams may still help indirectly if they give viewers more chances to drop in and come back later, but this data cannot confirm that, since these are associations rather than proven causes. If I were to study Twitch analytics again, I would love to look into categorical variables such as language and partner status. Communities vary massively across languages, so that would likely produce different results. Partner status is said to help streamers rise further in the algorithm and gives more incentive to stream, as streamers are able to earn a cut of the profits by joining the Partner program.
Sources:
“Top Streamers on Twitch.” Kaggle, 24 Aug. 2020, www.kaggle.com/datasets/aayushmishra1512/twitchdata.
Twitch Stats and Analytics - Games and Channels - SullyGnome. sullygnome.com.