DATA 101 Final Project

Introduction

Live streaming has become a major part of social media, especially after the massive spike in popularity it saw during the COVID-19 quarantine. Many streamers grew in popularity along with it, gaining large fan bases and building careers in the content creation industry. I want to know which variables factor into being a successful content creator, and which would matter most to someone hoping to become one. My data set was pulled from Twitch analytics via sullygnome.com, a publicly accessible Twitch statistics tracker that has been running for multiple years. The data is from 2020, at the peak of the COVID-19 pandemic. This is relevant because the pandemic brought Twitch the highest viewership it had seen up to that point, since people had so much time on their hands and were able to sit through a full multiple-hour stream. The data was somewhat simplified and reworked on Kaggle by user Aayush Mishra (https://www.kaggle.com/datasets/aayushmishra1512/twitchdata). It includes watch time, stream time, peak viewership, follower count, language spoken, and stream tags (being partnered or a “mature” streamer). The variables I will be focusing on for the majority of this study are average viewers, followers, stream time, and watch time; however, a few others are included for fun:

Categorical Variables

language: The main language the streamer speaks; English, Portuguese, Spanish, German, Korean, French, Russian, Japanese, Chinese, Czech, Turkish, Italian, Polish, Thai, Arabic, Slovak, Hungarian, Greek, Finnish, Swedish, or Other

mature: Either True or False; represents whether or not the streamer chose to mark their stream as mature (18+)

partnered: Either True or False; represents whether the streamer is part of Twitch’s Partner program, an initiative that allows creators to receive more revenue from earning subscriptions.

Numeric Variables

followers: The active follower count of the streamer (at the time this dataset was made)

stream_time_minutes_: the number of minutes a streamer has been live

watch_time_minutes_: the total number of minutes the streamer has been watched, summed across all individual viewers

average_viewers: the mean number of viewers a streamer gets per stream

# Calling the data and libraries
twitch <- read.csv("twitchdata-update.csv")
head(twitch)

##     Channel Watch.time.Minutes. Stream.time.minutes. Peak.viewers
## 1     xQcOW          6196161750               215250       222720
## 2  summit1g          6091677300               211845       310998
## 3    Gaules          5644590915               515280       387315
## 4  ESL_CSGO          3970318140               517740       300575
## 5      Tfue          3671000070               123660       285644
## 6 Asmongold          3668799075                82260       263720
##   Average.viewers Followers Followers.gained Views.gained Partnered Mature
## 1           27716   3246298          1734810     93036735      True  False
## 2           25610   5310163          1370184     89705964      True  False
## 3           10976   1767635          1023779    102611607      True   True
## 4            7714   3944850           703986    106546942      True  False
## 5           29602   8938903          2068424     78998587      True  False
## 6           42414   1563438           554201     61715781      True  False
##     Language
## 1    English
## 2    English
## 3 Portuguese
## 4    English
## 5    English
## 6    English

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales) # For graphing

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Data Analysis

A lot of the data analysis that I am focusing on revolves around conversion and open-ended exploration. Many of the variable names needed to be changed for consistency, as they used periods and uppercase letters, but thankfully there were no NAs. I also chose to convert some of the variables that were in minutes to hours just so they were more digestible and comprehensible, especially when graphed. Many of the smaller things I did were just to satisfy my curiosity. Specifically, looking at follower count in order was incredibly interesting, as there were a lot of streamers in languages other than English that I was not familiar with, which led to my curiosity about language as a variable.

# Reformatting data
names(twitch) <- tolower(names(twitch))
names(twitch) <- gsub("\\.", "_", names(twitch)) 

# Taking a sneak peek at the data
head(twitch)

##     channel watch_time_minutes_ stream_time_minutes_ peak_viewers
## 1     xQcOW          6196161750               215250       222720
## 2  summit1g          6091677300               211845       310998
## 3    Gaules          5644590915               515280       387315
## 4  ESL_CSGO          3970318140               517740       300575
## 5      Tfue          3671000070               123660       285644
## 6 Asmongold          3668799075                82260       263720
##   average_viewers followers followers_gained views_gained partnered mature
## 1           27716   3246298          1734810     93036735      True  False
## 2           25610   5310163          1370184     89705964      True  False
## 3           10976   1767635          1023779    102611607      True   True
## 4            7714   3944850           703986    106546942      True  False
## 5           29602   8938903          2068424     78998587      True  False
## 6           42414   1563438           554201     61715781      True  False
##     language
## 1    English
## 2    English
## 3 Portuguese
## 4    English
## 5    English
## 6    English
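
The trailing underscore in names such as watch_time_minutes_ comes from the period at the end of the original column name. A small sketch of the same renaming on a plain character vector, with one optional extra step that trims the trailing underscore. The trimming step is an assumption for illustration and is not something this analysis does, and `raw`, `clean`, and `trimmed` are hypothetical names:

```r
# Three of the raw column names as printed by the first head() call
raw <- c("Watch.time.Minutes.", "Stream.time.minutes.", "Average.viewers")

# Same two steps as the report: lowercase, then periods to underscores
clean <- gsub("\\.", "_", tolower(raw))
clean

# Optional extra step (not in the original analysis): drop a trailing underscore
trimmed <- sub("_$", "", clean)
trimmed
```

The rest of this report keeps the trailing underscore, so applying the trimming step to the real data would mean updating every later column reference as well.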

# Check for NAs ! (NONE !! YAY)
colSums(is.na(twitch))

##              channel  watch_time_minutes_ stream_time_minutes_ 
##                    0                    0                    0 
##         peak_viewers      average_viewers            followers 
##                    0                    0                    0 
##     followers_gained         views_gained            partnered 
##                    0                    0                    0 
##               mature             language 
##                    0                    0
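
The printed rows show partnered and mature as the text values True and False rather than R's logical TRUE and FALSE, which suggests they are stored as text. A minimal sketch of converting them to logical columns, using a hand-typed data frame (`demo`, a hypothetical stand-in built from three of the rows printed above) so that it runs on its own:

```r
# Hypothetical stand-in for `twitch`: three rows copied from the head() output
demo <- data.frame(
  channel   = c("xQcOW", "summit1g", "Gaules"),
  partnered = c("True", "True", "True"),
  mature    = c("False", "False", "True"),
  stringsAsFactors = FALSE
)

# trimws() guards against stray spaces; comparing to "True" yields a logical column
demo$partnered <- trimws(demo$partnered) == "True"
demo$mature    <- trimws(demo$mature) == "True"

str(demo)
sum(demo$mature)  # logical columns can be counted directly
```

Logical columns make it easy to count or average a flag, for example mean(demo$partnered) for the share of partnered channels.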

twitchmod <- twitch |>
  mutate(
    hours_streamed = stream_time_minutes_ / 60,   # Convert minutes to hours
    watch_time_hours = watch_time_minutes_ / 60   # Convert minutes to hours
  ) |>
  arrange(desc(followers)) # View data in order from most -> least follower count (popularity factor)

head(twitchmod)

##    channel watch_time_minutes_ stream_time_minutes_ peak_viewers
## 1     Tfue          3671000070               123660       285644
## 2   shroud           888505170                30240       471281
## 3     Myth          1479214575               134760       122552
## 4   Rubius          2588632635                58275       240096
## 5 pokimane           964334055                56505       112160
## 6 summit1g          6091677300               211845       310998
##   average_viewers followers followers_gained views_gained partnered mature
## 1           29602   8938903          2068424     78998587      True  False
## 2           29612   7744066           833587     30621257      True  False
## 3            9396   6726893          1421811     37384058      True  False
## 4           42948   5751354          3820532     58599449      True  False
## 5           16026   5367605          2085831     45579002      True  False
## 6           25610   5310163          1370184     89705964      True  False
##   language hours_streamed watch_time_hours
## 1  English        2061.00         61183335
## 2  English         504.00         14808420
## 3  English        2246.00         24653576
## 4  Spanish         971.25         43143877
## 5  English         941.75         16072234
## 6  English        3530.75        101527955

# Collecting my numeric variables (for correlation)
numtwitch <- twitchmod |>
  select(average_viewers, followers, watch_time_minutes_, stream_time_minutes_)

# Just looking at language statistics for fun !
langtwitch <- twitchmod |> 
  arrange(desc(language))

# Scatter plot for hours watched vs. hours streamed
ggplot(twitchmod, aes(x = watch_time_hours, y = hours_streamed)) +
  geom_point(alpha = 0.5, color = "#6441a5") + # Fun to note- this color is directly picked off of the Twitch logo
  labs(title = "Hours Watched vs. Hours Streamed", x = "Hours Watched", y = "Hours Streamed") +
  theme_minimal() +
  scale_x_continuous(labels = label_number())

There is a slight suggestion that more hours watched goes with more hours streamed, but not a strong one (the correlation matrix later in this report puts the relationship between watch time and stream time at only about 0.15). Mostly it is a dense cluster of points between 0-4250 on the y-axis and 0-17500000 on the x-axis. There are also many outliers, representing streamers who have a large amount of one without the other. This could mean someone who does short streams with high viewership or long streams with lower viewership, as well as people with different career pacing. If you have been a streamer for a long time but only recently started gaining viewers, you may have a large number of streamed hours while your hours watched are just now picking up in pace.

# Scatter plot for viewership vs. followers
ggplot(twitchmod, aes(x = average_viewers, y = followers)) +
  geom_point(alpha = 0.5, color = "#6441a5") +
  labs(title = "Average Viewership vs. Followers", x = "Avg Viewers", y = "Followers") +
  theme_minimal()

There does appear to be a direct relationship here, because even though the points are clustered, they suggest a clearly positive slope (the correlation matrix later in this report puts it at about 0.43). Most points remain clustered in the bottom left corner, implying that the majority of streamers sit near a common baseline, but there are others with completely outlandish numbers. With Twitch channels, though, you must take into account things such as event channels. Channels for the Olympics, presidential inaugurations, and similar events would likely have a lower follower count but high viewers, because viewers do not return to those channels frequently, but for the short times they are active, many people flock to them.

# Stacked bar graph analyzing maturity against partnered status
ggplot(twitchmod, aes(x = mature, fill = partnered)) +
  geom_bar(position = "stack", color = "black") +
  labs(title = "Streamer Partner Status Distribution by Maturity Status",
       x = "Mature Status",
       y = "Count of Streamers",
       fill = "Partnered status") +
  scale_fill_manual(
    values = c("#9bbe5a", "#6441a5")) + # Green is the direct inverse color of the twitch logo purple!
  theme_grey()

Partnered status is something I find interesting, just because it has been shown to boost analytics and the algorithm within Twitch. I wanted to see if the distribution of partnered status was in some way related to maturity status as set by the streamers, but with such an uneven sample of mature vs. family-friendly streams, it is hard to tell.
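
Because the two maturity groups are very different in size, comparing within-group proportions may be easier to read than raw counts. A minimal sketch with made-up values (the `toy` data frame below is hypothetical and is not the Twitch data):

```r
# Hypothetical toy data: 8 non-mature streamers and 2 mature ones
toy <- data.frame(
  mature    = c(rep("False", 8), rep("True", 2)),
  partnered = c(rep("True", 6), rep("False", 2), "True", "False")
)

# Cross-tabulate, then convert each row (maturity group) to proportions
counts <- table(mature = toy$mature, partnered = toy$partnered)
props  <- prop.table(counts, margin = 1)

props
```

In the bar chart itself, swapping position = "stack" for position = "fill" in geom_bar() should give the same within-group view.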

cor(numtwitch)

##                      average_viewers   followers watch_time_minutes_
## average_viewers            1.0000000  0.42830322           0.4761650
## followers                  0.4283032  1.00000000           0.6202339
## watch_time_minutes_        0.4761650  0.62023388           1.0000000
## stream_time_minutes_      -0.2492478 -0.09129851           0.1505879
##                      stream_time_minutes_
## average_viewers               -0.24924779
## followers                     -0.09129851
## watch_time_minutes_            0.15058790
## stream_time_minutes_           1.00000000

Strongest correlation: followers and watch_time_minutes_ at 0.620 -> Makes sense, as when your follower count grows, you will likely have more people watching you, and you will accumulate more watched minutes in return.

Weakest correlation (closest to zero): stream_time_minutes_ and followers at -0.091 -> Makes sense, as you can be an incredibly well-known and adored streamer while still only streaming for very little time.
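
Picking the strongest and weakest pairs by eye works for a 4x4 matrix, but it can also be done programmatically. A sketch that types in the rounded values from the cor(numtwitch) output above and ranks each pair by absolute correlation; `cm`, `idx`, and `ranked` are hypothetical names used only for this illustration:

```r
# Correlation matrix typed in from the cor(numtwitch) output (rounded)
vars <- c("average_viewers", "followers", "watch_time_minutes_", "stream_time_minutes_")
cm <- matrix(
  c( 1.0000,  0.4283,  0.4762, -0.2492,
     0.4283,  1.0000,  0.6202, -0.0913,
     0.4762,  0.6202,  1.0000,  0.1506,
    -0.2492, -0.0913,  0.1506,  1.0000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by taking only the upper triangle (this also drops the diagonal)
idx <- which(upper.tri(cm), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Rank by absolute value so strong negative correlations are not overlooked
ranked <- ranked[order(-abs(ranked$r)), ]
ranked
```

The first row of `ranked` is the strongest pair and the last row is the pair closest to zero.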

#linear regression model
lreg <- lm(followers ~ average_viewers + watch_time_minutes_ + stream_time_minutes_, data = twitchmod)

# Display summary stats
summary(lreg)

## 
## Call:
## lm(formula = followers ~ average_viewers + watch_time_minutes_ + 
##     stream_time_minutes_, data = twitchmod)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2821931  -239130  -117028   122856  6380544 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.326e+05  3.805e+04   8.741  < 2e-16 ***
## average_viewers       1.030e+01  2.792e+00   3.691 0.000236 ***
## watch_time_minutes_   8.661e-04  4.206e-05  20.590  < 2e-16 ***
## stream_time_minutes_ -1.446e+00  2.459e-01  -5.879 5.62e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 609600 on 996 degrees of freedom
## Multiple R-squared:  0.4274, Adjusted R-squared:  0.4257 
## F-statistic: 247.8 on 3 and 996 DF,  p-value: < 2.2e-16

followers = 3.326e+5 + 1.03e+1(average viewers) + 8.661e-4(watch time in minutes) - 1.446(stream time in minutes)

P-value < 2.2e-16 (almost zero)

Adjusted R-squared = 0.4257
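
To make the fitted equation concrete, here is a small sketch that plugs one channel's values into it by hand. The coefficients are the rounded estimates from the summary above, and the inputs are Tfue's row from the earlier head() output. `predict_followers` is a hypothetical helper name and is not part of the original analysis:

```r
# Rounded coefficients copied from summary(lreg)
b0        <- 3.326e+05   # intercept
b_viewers <- 1.030e+01   # average_viewers
b_watch   <- 8.661e-04   # watch_time_minutes_
b_stream  <- -1.446      # stream_time_minutes_

predict_followers <- function(avg_viewers, watch_min, stream_min) {
  b0 + b_viewers * avg_viewers + b_watch * watch_min + b_stream * stream_min
}

# Tfue's values from the printed data
pred   <- predict_followers(29602, 3671000070, 123660)
actual <- 8938903

pred            # model's estimate of followers
actual - pred   # residual: how far the model undershoots this channel
```

With these rounded coefficients the estimate comes out to roughly 3.6 million followers against an actual 8.9 million, which shows how far the largest channels can sit from the fitted line.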

It can be clearly seen that the model as a whole is significant, as shown by the incredibly small p-value on the F-statistic. The adjusted R-squared stands at 0.4257, which, all things considered, is not a bad R-squared. It means about 42.57% of the variation in follower count is explained by my variables in this data set, which is a moderate amount. To get into the finer details, it is interesting to note that the p-value for average viewers (0.000236) is larger than those of the other two variables. This does not make the variable insignificant, as it is still an extremely small p-value; it just means the evidence for its contribution is somewhat weaker than for watch time and stream time. That does make sense, as having high viewers does not mean all of those viewers will end up following you. You may only have high viewers due to large-scale events. It is also worth noting that the coefficient for stream time is negative: holding average viewers and watch time fixed, more minutes streamed is associated with fewer followers.
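
The ordering of the p-values in the coefficient table follows from the t values, each of which is an estimate divided by its standard error. A sketch that recomputes them from the rounded numbers printed above, so the results will match the table only approximately:

```r
# Estimates and standard errors copied from the coefficient table
est <- c(average_viewers      = 1.030e+01,
         watch_time_minutes_  = 8.661e-04,
         stream_time_minutes_ = -1.446e+00)
se  <- c(2.792e+00, 4.206e-05, 2.459e-01)

# t value = estimate / standard error
t_val <- est / se

# Two-sided p-value from a t distribution with the residual degrees of freedom (996)
p_val <- 2 * pt(-abs(t_val), df = 996)

round(t_val, 2)
signif(p_val, 3)
```

Average viewers has the smallest absolute t value of the three predictors, which is why its p-value is the largest.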

twitchmod$predicted <- predict(lreg)

ggplot(twitchmod, aes(x = predicted, y = followers)) + 
  geom_point() + 
  geom_smooth(method="lm", col="#6441a5") + 
  labs(
    title = "Predicted LM v. Follower Count",
    x = "Predicted LM",
    y = "Total Follower Count"
  )

## `geom_smooth()` using formula = 'y ~ x'

Ah yes, yet another graph where a majority of the points are clustered into the bottom left corner. With this graph, however, you can see the positive slope between the model's predictions and the actual follower counts. There is a positive correlation, and a moderately strong one at that: since the multiple R-squared is 0.4274, the correlation between predicted and actual followers is its square root, about 0.65. That is good enough for what needs to be understood regarding this data. It shows that these variables, taken together, do help explain follower counts for Twitch streamers.
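
The strength of the relationship in this plot does not have to be estimated by eye: for a linear model with an intercept, the squared correlation between fitted and observed values equals the multiple R-squared. A sketch of that identity on R's built-in mtcars data (used here only as a stand-in so the example runs without the CSV file), followed by the value implied by the R-squared reported above:

```r
# Stand-in model on a built-in data set (not the Twitch data)
m <- lm(mpg ~ wt + hp, data = mtcars)

r_fitted  <- cor(fitted(m), mtcars$mpg)   # correlation of predicted vs. actual
r_squared <- summary(m)$r.squared

all.equal(r_fitted^2, r_squared)          # the two quantities agree

# Applying the same identity to the Twitch model's reported R-squared
twitch_r <- sqrt(0.4274)
round(twitch_r, 2)
```

On the real data, cor(twitchmod$predicted, twitchmod$followers) would be the direct way to get the same number.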

Conclusion

Overall, average viewers, minutes watched, and minutes streamed all have an effect on my metric of success for streamers, follower count. Average viewership is a bit less impactful, but has an impact nonetheless. If someone desires to become an insanely popular streamer on Twitch, the model suggests that minutes watched is the priority, since it had the strongest relationship with followers. Longer streams can help accumulate minutes watched, as viewers may drop in, catch certain parts, and come back in the future, but stream time on its own had a negative coefficient, so streaming longer without gaining viewers is not enough. If I did a study on Twitch analytics again, I would love to look into some categorical variables such as language and partner status. Communities across different languages vary so massively that this would likely produce different results. Partner status helps streamers rise further in the algorithm and places more incentive on streaming, as streamers are able to gain a larger cut of profits from joining the partner program.

Sources:

“Top Streamers on Twitch.” Kaggle, 24 Aug. 2020, www.kaggle.com/datasets/aayushmishra1512/twitchdata.

Twitch Stats and Analytics - Games and Channels - SullyGnome. sullygnome.com.
