Introduction

Gaming is now a very large industry, with streamers and professional eSports players dedicating not only their careers but their lives to the content they produce. Millions of dollars are invested in eSports every year, and new companies continue to enter the scene. One way that eSports organizations and teams recruit is by watching people who stream video games on Twitch. Twitch is the world’s leading live streaming platform for gamers and the things gamers love. It features brands, content creators, and video game developers who live stream themselves playing video games, producing music, or even just chatting with their followers and viewers. Organizations and teams use a creator’s streamed content to assess the creator’s in-game tendencies, professionalism, and knowledge of in-game mechanics. With all of this in mind, streaming video games on Twitch can be not only a reasonable career but also an attainable one. We concluded that, to make money, a streamer needs to focus on gaining as many followers as possible in a given year, both to reach partnered status and to keep watch time high. This raised the question: “What gives a streamer the most followers gained in a 1 year period?”

Data

The data we used came from Kaggle, an online community of data scientists and machine learning practitioners that hosts datasets, tools, and resources. Specifically, the dataset was published by a user named Aayush Mishra, who gathered the data from Twitch. It consists of 11 columns, each describing a different attribute of a streamer. Channel is a character vector containing the channel names of the top 1,000 Twitch streamers. Watch time (minutes) is the total amount of time that viewers spent watching the streamer. For example, if one person watched a streamer for 5 minutes and another watched for 10, the total watch time is 15 minutes. Stream time (minutes) is the total amount of time that a streamer has streamed. Peak viewers is the maximum number of viewers that a streamer has had at any one point. Average viewers is the average number of viewers that a streamer has on a stream. Followers is the number of followers that the streamer had at the end of 2020. Followers gained is the number of followers the streamer gained between Jan. 1, 2020 and Dec. 31, 2020. Views gained is the number of views that a streamer received between Jan. 1, 2020 and Dec. 31, 2020. Partnered describes the partnership status of a streamer. “The Twitch Partnership Program is for those who are committed to streaming and are ready to level up from Affiliate. Twitch Partners are creators who stream a variety of content, from games, music, talk shows, art, to just about anything else you can imagine. Twitch Partners can earn revenue by accepting subscriptions from their viewers. Bits are a virtual good viewers can buy to Cheer on your channel, allowing them to support you without having to use an external website or third party. Twitch provides participating Partners a share of the revenue Twitch receives from Bits equal to 1 cent per Bit used to Cheer for them. Partners earn a share of the revenue generated from any ads played on their channel. Partners can determine the length and frequency of mid-roll advertisements through their dashboard.” Mature indicates whether or not a streamer streams mature content. Finally, language is the primary language that the streamer streams in.

#Data Exploration

Before graphing any of the data, we ran some simple commands to gather more information about the variables in the dataset.

Rows: 1,000
Columns: 11
$ channel             <chr> "xQcOW", "summit1g", "Gaules", "ESL_CSGO", "Tfue", "Asmongold", "NICKM…
$ watch_time_minutes  <dbl> 6196161750, 6091677300, 5644590915, 3970318140, 3671000070, 3668799075…
$ stream_time_minutes <dbl> 215250, 211845, 515280, 517740, 123660, 82260, 136275, 147885, 122490,…
$ peak_viewers        <dbl> 222720, 310998, 387315, 300575, 285644, 263720, 115633, 68795, 89387, …
$ average_viewers     <dbl> 27716, 25610, 10976, 7714, 29602, 42414, 24181, 18985, 22381, 12377, 2…
$ followers           <dbl> 3246298, 5310163, 1767635, 3944850, 8938903, 1563438, 4074287, 508816,…
$ followers_gained    <dbl> 1734810, 1370184, 1023779, 703986, 2068424, 554201, 1089824, 425468, 9…
$ views_gained        <dbl> 93036735, 89705964, 102611607, 106546942, 78998587, 61715781, 46084211…
$ partnered           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ mature              <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, T…
$ language            <chr> "English", "English", "Portuguese", "English", "English", "English", "…

The glimpse() command was run to see the first few values and the type of each column. No major observations were made here, but it allowed us to familiarize ourselves with the data and the data types we would be working with throughout the project. One thing we did gather is that four columns are non-numeric and seven are numeric, which is very helpful later on. We also ran is.na() to look for missing values in the dataset, but none were present. Finally, we ran summary() on the dataset to look at some simple statistics.

   channel          watch_time_minutes  stream_time_minutes  peak_viewers    average_viewers 
 Length:1000        Min.   :1.222e+08   Min.   :  3465      Min.   :   496   Min.   :   235  
 Class :character   1st Qu.:1.632e+08   1st Qu.: 73759      1st Qu.:  9114   1st Qu.:  1458  
 Mode  :character   Median :2.350e+08   Median :108240      Median : 16676   Median :  2425  
                    Mean   :4.184e+08   Mean   :120515      Mean   : 37065   Mean   :  4781  
                    3rd Qu.:4.337e+08   3rd Qu.:141844      3rd Qu.: 37570   3rd Qu.:  4786  
                    Max.   :6.196e+09   Max.   :521445      Max.   :639375   Max.   :147643  
   followers       followers_gained   views_gained       partnered         mature       
 Min.   :   3660   Min.   : -15772   Min.   :   175788   Mode :logical   Mode :logical  
 1st Qu.: 170546   1st Qu.:  43758   1st Qu.:  3880602   FALSE:22        FALSE:770      
 Median : 318063   Median :  98352   Median :  6456324   TRUE :978       TRUE :230      
 Mean   : 570054   Mean   : 205519   Mean   : 11668166                                  
 3rd Qu.: 624332   3rd Qu.: 236131   3rd Qu.: 12196762                                  
 Max.   :8938903   Max.   :3966525   Max.   :670137548                                  
   language        
 Length:1000       
 Class :character  
 Mode  :character  
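
The checks above were run in R. As a rough illustration of what summary() and is.na() report for one numeric column, here is a minimal Python sketch. It uses only the first eight followers_gained values visible in the glimpse output, so its numbers will not match the full 1,000-row summary. The function names are our own and purely illustrative, and None stands in for R's NA.

```python
import statistics

# First eight followers_gained values shown in the glimpse output.
# The full dataset has 1,000 rows, so these results differ from summary().
followers_gained = [1734810, 1370184, 1023779, 703986,
                    2068424, 554201, 1089824, 425468]


def six_number_summary(values):
    """Return min, quartiles, mean, and max for a numeric column."""
    # method="inclusive" interpolates between sorted values with both
    # endpoints included, the same convention as R's default quantiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


def count_missing(values):
    """Count entries that are None (the stand-in here for R's NA)."""
    return sum(v is None for v in values)


summary = six_number_summary(followers_gained)
missing = count_missing(followers_gained)
print(summary, missing)
```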
                   
                   
                   

As we began digging into the data, we decided it would be best to start with our non-numeric variables: channel name, language, partnered, and mature. To start, below is a plot of the top 10 languages among the top 1,000 streamers.

Not surprisingly, the most common language on Twitch was English. What did surprise us was that the next two were Korean and Russian, as we had expected Spanish and French. Having plotted the top languages and seen that the bulk of the top 1,000 streamers were English speaking, we moved on to partnership and maturity status.
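
The language ranking comes down to counting how often each language appears and sorting by count. A minimal Python sketch of that step follows, using only the six language values visible in the glimpse output (the original analysis was done in R, and the helper name top_languages is hypothetical):

```python
from collections import Counter

# Languages of the first six channels shown in the glimpse output.
languages = ["English", "English", "Portuguese",
             "English", "English", "English"]


def top_languages(values, n=10):
    """Return the n most common languages with counts, most frequent first."""
    return Counter(values).most_common(n)


top = top_languages(languages)
print(top)
```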

From watching a lot of Twitch, we both expected that partnered, non-mature streamers would be the ones who gained more followers, and the graphs largely bore that out. A non-mature stream can reach a broader, more diverse group of viewers. Pairing that with being partnered, and therefore being featured more often on the Twitch home page, means that these streamers are more likely to be recognized and to have more followers than streamers who are unpartnered or stream mature content. Based on our non-numeric variables, then, the best shot at increasing followers in a year is to be an English-speaking, non-mature, partnered streamer.

We then moved on to our quantitative variables, starting with a correlation heat map. This shows the correlation, or the degree to which two variables move in coordination with one another, between the variables on each axis. Values close to zero mean there is little or no linear trend between the two variables, values close to 1 indicate a strong positive correlation, and values close to -1 indicate a strong negative correlation.

The heat map shows the correlation between all of the numeric variables in the dataset, but we were mostly interested in the correlation between followers gained and the other numeric variables. We wanted to select the variables most highly correlated with followers gained so that we could perform linear regression using followers gained as the dependent variable and each of the three most highly correlated variables as an independent variable. Looking at the heat map, the three variables most correlated with followers gained are followers, watch time (minutes), and peak viewers. Having identified these three, we moved on to simple linear regression to figure out which of them has the highest rate of change.
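
Each cell of a correlation heat map is a correlation coefficient between two columns. Assuming the usual Pearson coefficient (the report does not state which method was used), the variable selection can be sketched in Python as below. The data are only the first six rows visible in the glimpse output, so the ranking printed here need not agree with the one from the full dataset, and the function name pearson is our own.

```python
import math

# First six rows visible in the glimpse output (not the full dataset).
columns = {
    "watch_time_minutes": [6196161750, 6091677300, 5644590915,
                           3970318140, 3671000070, 3668799075],
    "followers": [3246298, 5310163, 1767635, 3944850, 8938903, 1563438],
    "peak_viewers": [222720, 310998, 387315, 300575, 285644, 263720],
}
followers_gained = [1734810, 1370184, 1023779, 703986, 2068424, 554201]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlation of each candidate predictor with followers gained.
correlations = {name: pearson(values, followers_gained)
                for name, values in columns.items()}

# Rank predictors by the strength (absolute value) of the correlation.
ranked = sorted(correlations.items(),
                key=lambda item: abs(item[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: {r:.3f}")
```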

#Linear Regression

Initially, when we started our linear regression model, the data was not a good fit for linear regression. As shown in the graph above, the relationship between followers gained and followers was not visually linear, which would have led to a very poor linear model and poor predictions. This is especially visible in the plot of the residuals, which are mainly clustered to the left of the graph, with some starting to fan out to the right. To improve the model, we decided to transform our data. We found that the best transformation was a logarithmic one, which resulted in the following plot for followers gained vs. followers.

The logarithmic transformation greatly improved our model, as shown above. A linear relationship between followers gained and followers can now clearly be seen. The transformation had a similar effect on the residuals, fixing the fanning and clustering issues they had before the transformation.
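
To make the transformation step concrete, here is a minimal Python sketch of a simple linear regression on log-transformed data, together with its R^2 value. Several things here are assumptions, since the report does not spell them out: the natural log is used, both variables are transformed, and rows with non-positive values are dropped (followers gained has a negative minimum in the summary, and the log of such a value is undefined). The data are the first eight rows from the glimpse output, so the result will differ from the models fit on the full dataset.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def log_log_fit(predictor, response):
    """Fit log(response) against log(predictor).

    Assumption: rows where either value is <= 0 are dropped, because
    the logarithm is undefined for them.
    """
    pairs = [(math.log(x), math.log(y))
             for x, y in zip(predictor, response) if x > 0 and y > 0]
    log_x = [p[0] for p in pairs]
    log_y = [p[1] for p in pairs]
    slope, intercept = fit_line(log_x, log_y)
    return {
        "n_used": len(pairs),
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared(log_x, log_y, slope, intercept),
    }


# First eight rows visible in the glimpse output (not the full dataset).
followers = [3246298, 5310163, 1767635, 3944850,
             8938903, 1563438, 4074287, 508816]
followers_gained = [1734810, 1370184, 1023779, 703986,
                    2068424, 554201, 1089824, 425468]

model = log_log_fit(followers, followers_gained)
print(model)
```

On the log-log scale the slope reads as an elasticity: roughly, the percentage change in followers gained associated with a one percent change in followers.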

The same process was followed for the remaining variables, producing the following graphs.

Table of R^2 values for linear regression models

Model 1      Model 2      Model 3
0.4662537    0.2083597    0.3396691

Next, we looked at how well our models performed by evaluating their R^2 values. Model 1 performed the best, Model 3 the second best, and Model 2 the worst. None of the models had an R^2 value above 0.50, which tells us that none of them explains even half of the variance in followers gained. This is most likely because the variables we used did not have a very strong linear relationship with followers gained. The residuals for Models 1 and 3 were not that bad, but in Model 2 there was still some definite clustering to the left, with some slight fanning, which is the most likely cause of its low R^2 score. To improve the models further, we could have subsetted the data before performing our linear analysis. In our initial data exploration we found the most common values of the non-numeric variables, so we could have filtered out streamers whose language was not English, streamers who were not partnered, and streamers who had their channel set to mature audiences. This might have removed some potential outliers from the data and improved our models.
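
The subsetting suggested above was not part of our analysis, but as a minimal Python sketch it would look something like the following. The rows are the first six channels from the glimpse output, and the helper name subset_streamers is hypothetical.

```python
# First six channels from the glimpse output, with their
# language, partnered, and mature values.
rows = [
    {"channel": "xQcOW", "language": "English",
     "partnered": True, "mature": False},
    {"channel": "summit1g", "language": "English",
     "partnered": True, "mature": False},
    {"channel": "Gaules", "language": "Portuguese",
     "partnered": True, "mature": True},
    {"channel": "ESL_CSGO", "language": "English",
     "partnered": True, "mature": False},
    {"channel": "Tfue", "language": "English",
     "partnered": True, "mature": False},
    {"channel": "Asmongold", "language": "English",
     "partnered": True, "mature": False},
]


def subset_streamers(data, language="English"):
    """Keep partnered, non-mature streamers who stream in `language`."""
    return [row for row in data
            if row["language"] == language
            and row["partnered"]
            and not row["mature"]]


kept = [row["channel"] for row in subset_streamers(rows)]
print(kept)
```

The regression would then be refit on the reduced rows only. Whether that actually raises R^2 would have to be checked on the full dataset.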

Finally, we finished our linear modeling by graphing the predictions generated by each of our models. As the graph above shows, the variable that has the most influence on followers gained is peak viewers.

Conclusion

To start the conclusion, I’d like to restate our question, which was “What gives a streamer the most followers gained in a 1 year period?” After deciding which qualitative factors were most important, which turned out to be English speaking, partnered, and not mature, we compared the correlation, or relationship, between all of the quantitative variables. We did this by creating a correlation heat map and picking one variable from which to draw a conclusion. The variable that best fit the question was followers gained. We then chose the three variables most highly correlated with followers gained: followers, watch time (minutes), and peak viewers. Next, we performed simple linear regression to decide which variable had the highest rate of change. After linear regression, we came to the final conclusion that the followers variable gives a streamer the most followers gained in a 1 year period. This means that a streamer who wants to gain the most followers in a year should already have a follower base. I (Logan) want to keep looking at updated versions of this dataset over the next few years to see whether this result holds from year to year.

Bibliography

Twitch, https://www.twitch.tv/

Twitch Partner Program Overview, help.twitch.tv/s/article/partner-program-overview?language=en_US#:~:text=The%20Twitch%20Partnership%20Program%20is,anything%20else%20you%20can%20imagine.

Kaggle, https://www.kaggle.com/

Top Streamers on Twitch, https://www.kaggle.com/aayushmishra1512/twitchdata