Data Preparation

# load data
library(data.table)
library(tidyverse)

url <- "https://raw.githubusercontent.com/folushoa/Data-Science/Data-606/twitchdata.csv"
proposal_data <- fread(url, header = TRUE)

head(proposal_data, 10)

Research question

You should phrase your research question in a way that matches up with the scope of inference your data set allows for. Does the average number of views gained differ between Twitch channels flagged as mature and channels that are not?

Cases

What are the cases, and how many are there?

glimpse(proposal_data)
## Rows: 1,000
## Columns: 11
## $ Channel                <chr> "xQcOW", "summit1g", "Gaules", "ESL_CSGO", "Tfu…
## $ `Watch time(Minutes)`  <int64> 6196161750, 6091677300, 5644590915, 397031814…
## $ `Stream time(minutes)` <int> 215250, 211845, 515280, 517740, 123660, 82260, …
## $ `Peak viewers`         <int> 222720, 310998, 387315, 300575, 285644, 263720,…
## $ `Average viewers`      <int> 27716, 25610, 10976, 7714, 29602, 42414, 24181,…
## $ Followers              <int> 3246298, 5310163, 1767635, 3944850, 8938903, 15…
## $ `Followers gained`     <int> 1734810, 1370184, 1023779, 703986, 2068424, 554…
## $ `Views gained`         <int> 93036735, 89705964, 102611607, 106546942, 78998…
## $ Partnered              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ Mature                 <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
## $ Language               <chr> "English", "English", "Portuguese", "English", …

There are 1,000 cases. Each case is the channel of a Twitch streamer.

Data collection

Describe the method of data collection. The data were collected by web scraping.

Type of study

What type of study is this (observational/experiment)? This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link. The data come from Kaggle: https://www.kaggle.com/datasets/aayushmishra1512/twitchdata/

Dependent Variable

What is the response variable? Is it quantitative or qualitative? The response variable is Views gained. It is quantitative.

Independent Variable(s)

The independent variable is Mature. It is qualitative (a TRUE/FALSE flag).

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(proposal_data)
##    Channel          Watch time(Minutes)  Stream time(minutes)  Peak viewers   
##  Length:1000        Min.   : 122192850   Min.   :  3465       Min.   :   496  
##  Class :character   1st Qu.: 163208925   1st Qu.: 73759       1st Qu.:  9114  
##  Mode  :character   Median : 234726075   Median :108240       Median : 16676  
##                     Mean   : 418427930   Mean   :120515       Mean   : 37065  
##                     3rd Qu.: 433329345   3rd Qu.:141844       3rd Qu.: 37570  
##                     Max.   :6196161750   Max.   :521445       Max.   :639375  
##  Average viewers    Followers       Followers gained   Views gained      
##  Min.   :   235   Min.   :   3660   Min.   : -15772   Min.   :   175788  
##  1st Qu.:  1458   1st Qu.: 170546   1st Qu.:  43758   1st Qu.:  3880602  
##  Median :  2425   Median : 318063   Median :  98352   Median :  6456324  
##  Mean   :  4781   Mean   : 570054   Mean   : 205519   Mean   : 11668166  
##  3rd Qu.:  4786   3rd Qu.: 624332   3rd Qu.: 236131   3rd Qu.: 12196762  
##  Max.   :147643   Max.   :8938903   Max.   :3966525   Max.   :670137548  
##  Partnered         Mature          Language        
##  Mode :logical   Mode :logical   Length:1000       
##  FALSE:22        FALSE:770       Class :character  
##  TRUE :978       TRUE :230       Mode  :character  
##                                                    
##                                                    
## 
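# Rescale views gained; note that 10e6 equals 1e7, so the unit is tens of millions of views
proposal_data <- proposal_data %>% 
  mutate(Scaled_Views_Gained = `Views gained` / 10e6)

# Boxplot of scaled views gained for each value of Mature
proposal_data %>% 
  ggplot(aes(x = Mature, y = Scaled_Views_Gained, color = Mature)) +
  geom_boxplot() +
  scale_color_manual(values = c("FALSE" = 'tomato', "TRUE" = 'blue')) +
  labs(title = "Views gained by Mature flag", y = "Views gained (tens of millions)")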
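
The boxplot compares the two groups visually; group-wise summary statistics give the same comparison in numbers. The following is a minimal sketch, assuming a small made-up data frame (`toy_data`, whose values are illustrative and not taken from the Twitch data) in place of `proposal_data`:

```r
library(dplyr)

# Hypothetical toy rows standing in for proposal_data; the values are
# illustrative only and are NOT taken from the Twitch data set.
toy_data <- data.frame(
  Mature = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE),
  `Views gained` = c(4e6, 6e6, 8e6, 3e6, 5e6, 10e6),
  check.names = FALSE  # keep the space in the column name
)

# Count, mean, median and standard deviation of views gained per group
views_by_mature <- toy_data %>%
  group_by(Mature) %>%
  summarise(
    n      = n(),
    mean   = mean(`Views gained`),
    median = median(`Views gained`),
    sd     = sd(`Views gained`)
  )

print(views_by_mature)
```

Swapping `toy_data` for `proposal_data` should give the per-group figures for the real data, since the column names match.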

# Checking conditions for inference

# Number of channels in each Mature group
proposal_data %>% 
  group_by(Mature) %>% 
  summarise(count = n())

# Histogram of scaled views gained, faceted by Mature
proposal_data %>% 
  drop_na(Scaled_Views_Gained, Mature) %>% 
  ggplot(aes(x = Scaled_Views_Gained, fill = Mature)) +
  geom_histogram(position = 'identity', bins = 40) +
  #xlim(0, 15) +
  facet_wrap(~Mature)

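  1. Independence: each group's n is less than 10% of all the streamers on Twitch, and each channel belongs to only one group, so the observations can be treated as independent.
  2. Near-normality: this still needs to be checked, but if we ignore the one extreme point above 600 million views gained (about 67 on the scaled axis), the histograms show no extreme skew.
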
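
One way to put a number on the skew mentioned in point 2 is a sample skewness coefficient, computed before and after a log transform. The sketch below is only an illustration: `sample_skewness` is a helper defined here (base R has no built-in skewness function), and `toy_views` is simulated right-skewed data, not the Twitch data.

```r
# Moment-based sample skewness; this helper is defined here for
# illustration and is not part of base R or the original analysis.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^(3 / 2)
}

set.seed(606)
# Hypothetical right-skewed values standing in for `Views gained`
toy_views <- rlnorm(200, meanlog = 15.5, sdlog = 1)

skew_raw <- sample_skewness(toy_views)           # skewness on the raw scale
skew_log <- sample_skewness(log(toy_views + 1))  # skewness after log(x + 1)

print(c(raw = skew_raw, log = skew_log))
```

A value near 0 suggests a roughly symmetric distribution, while a large positive value indicates a long right tail.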
# Log-transform views gained to pull in the long right tail (+1 guards against log(0))
proposal_data <- proposal_data %>% 
  mutate(Log_Views_Gained = log(`Views gained` + 1))

proposal_data %>% 
  drop_na(Log_Views_Gained, Mature) %>% 
  ggplot(aes(x = Log_Views_Gained, fill = Mature)) +
  geom_histogram(position = 'identity', bins = 30) +
  #xlim(0, 15) +
  facet_wrap(~Mature)
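
If the conditions are judged to hold, the research question could be addressed with a two-sample t-test on the log scale. The following is a hedged sketch rather than the analysis itself: it assumes a Welch test (the default of `t.test()`), and `toy_data` is simulated with made-up values in place of `proposal_data`.

```r
set.seed(606)

# Hypothetical toy data; the values are simulated for illustration and
# are NOT taken from the Twitch data set.
toy_data <- data.frame(
  Mature = factor(rep(c(FALSE, TRUE), times = c(40, 20))),
  Views_gained = round(rlnorm(60, meanlog = 15.5, sdlog = 1))
)
toy_data$Log_Views_Gained <- log(toy_data$Views_gained + 1)

# Welch two-sample t-test: t.test() uses var.equal = FALSE by default,
# so equal variances in the two groups are not assumed.
welch_result <- t.test(Log_Views_Gained ~ Mature, data = toy_data)

print(welch_result)
```

Note that a test on the log scale compares the means of log views gained, which is a statement about geometric rather than arithmetic means of the original variable.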