# load data
library(data.table)
library(tidyverse)
url <- "https://raw.githubusercontent.com/folushoa/Data-Science/Data-606/twitchdata.csv"
proposal_data <- fread(url, header = TRUE)
head(proposal_data, 10)
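You should phrase your research question in a way that matches up with the scope of inference your data set allows for. Does the average number of views gained differ between channels flagged as mature and channels that are not? (The data set does not record streamer ages; the Mature column only indicates whether a channel is marked as mature content.)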
What are the cases, and how many are there?
glimpse(proposal_data)
## Rows: 1,000
## Columns: 11
## $ Channel <chr> "xQcOW", "summit1g", "Gaules", "ESL_CSGO", "Tfu…
## $ `Watch time(Minutes)` <int64> 6196161750, 6091677300, 5644590915, 397031814…
## $ `Stream time(minutes)` <int> 215250, 211845, 515280, 517740, 123660, 82260, …
## $ `Peak viewers` <int> 222720, 310998, 387315, 300575, 285644, 263720,…
## $ `Average viewers` <int> 27716, 25610, 10976, 7714, 29602, 42414, 24181,…
## $ Followers <int> 3246298, 5310163, 1767635, 3944850, 8938903, 15…
## $ `Followers gained` <int> 1734810, 1370184, 1023779, 703986, 2068424, 554…
## $ `Views gained` <int> 93036735, 89705964, 102611607, 106546942, 78998…
## $ Partnered <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ Mature <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
## $ Language <chr> "English", "English", "Portuguese", "English", …
There are 1,000 cases. Each case is the channel of a Twitch streamer.
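One quick way to confirm that each row really is a distinct channel is to compare the row count with the number of unique channel names. The sketch below runs on a small made-up table (the `toy` data frame and its values are illustrative assumptions, not rows from the dataset) and uses base R only.

```r
# Toy stand-in for proposal_data; values are made up for illustration.
toy <- data.frame(
  Channel = c("chan_a", "chan_b", "chan_c"),
  Mature  = c(FALSE, TRUE, FALSE)
)

n_cases    <- nrow(toy)                    # number of rows (cases)
n_channels <- length(unique(toy$Channel))  # number of distinct channel names

# TRUE when every row corresponds to exactly one channel
one_row_per_channel <- n_cases == n_channels

print(c(cases = n_cases, channels = n_channels))
print(one_row_per_channel)
```

On the real data, the same comparison would be `nrow(proposal_data) == length(unique(proposal_data$Channel))`.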
Describe the method of data collection. The data were collected by web scraping.
What type of study is this (observational/experiment)? This is an observational study.
If you collected the data, state self-collected. If not, provide a citation/link. The data were not self-collected; they come from Kaggle: https://www.kaggle.com/datasets/aayushmishra1512/twitchdata/
What is the response variable? Is it quantitative or qualitative? The response variable is Views gained. It is quantitative.
The independent variable is Mature. It is qualitative.
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(proposal_data)
##    Channel          Watch time(Minutes)  Stream time(minutes)  Peak viewers
##  Length:1000        Min.   : 122192850   Min.   :  3465        Min.   :   496
##  Class :character   1st Qu.: 163208925   1st Qu.: 73759        1st Qu.:  9114
##  Mode  :character   Median : 234726075   Median :108240        Median : 16676
##                     Mean   : 418427930   Mean   :120515        Mean   : 37065
##                     3rd Qu.: 433329345   3rd Qu.:141844        3rd Qu.: 37570
##                     Max.   :6196161750   Max.   :521445        Max.   :639375
##  Average viewers   Followers          Followers gained   Views gained
##  Min.   :   235    Min.   :   3660    Min.   : -15772    Min.   :   175788
##  1st Qu.:  1458    1st Qu.: 170546    1st Qu.:  43758    1st Qu.:  3880602
##  Median :  2425    Median : 318063    Median :  98352    Median :  6456324
##  Mean   :  4781    Mean   : 570054    Mean   : 205519    Mean   : 11668166
##  3rd Qu.:  4786    3rd Qu.: 624332    3rd Qu.: 236131    3rd Qu.: 12196762
##  Max.   :147643    Max.   :8938903    Max.   :3966525    Max.   :670137548
##  Partnered         Mature            Language
##  Mode :logical     Mode :logical     Length:1000
##  FALSE:22          FALSE:770         Class :character
##  TRUE :978         TRUE :230         Mode  :character
##
##
##
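The overall summary above does not break the response down by group, which is what the research question compares. A minimal base R sketch of per-group statistics follows; it runs on a small made-up table (the `toy` data frame, its `views_gained` column, and all values are illustrative assumptions), so the printed numbers say nothing about the real data.

```r
# Toy stand-in for proposal_data; values are made up for illustration.
toy <- data.frame(
  views_gained = c(100, 200, 300, 1000, 50, 150),
  Mature       = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Per-group n, mean, median, and standard deviation of the response
group_stats <- do.call(rbind, lapply(split(toy, toy$Mature), function(g) {
  data.frame(
    Mature = g$Mature[1],
    n      = nrow(g),
    mean   = mean(g$views_gained),
    median = median(g$views_gained),
    sd     = sd(g$views_gained)
  )
}))

print(group_stats)
```

With the tidyverse already loaded in this document, the equivalent on the real data would be a `group_by(Mature)` followed by `summarise()` on `Views gained`.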
# Rescale views gained to units of 10 million (10e6 is the same as 1e7)
proposal_data <- proposal_data %>%
  mutate(Scaled_Views_Gained = `Views gained` / 1e7)
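# Boxplot of scaled views gained by mature-content flag
proposal_data %>%
  ggplot(aes(x = Mature, y = Scaled_Views_Gained, color = Mature)) +
  geom_boxplot(show.legend = FALSE) +
  scale_color_manual(values = c("FALSE" = "tomato", "TRUE" = "blue")) +
  labs(title = "Views gained by mature-content flag",
       x = "Mature", y = "Views gained (tens of millions)")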
# Check conditions for inference: sample size in each group
proposal_data %>%
  group_by(Mature) %>%
  summarise(count = n())
# Distribution of scaled views gained within each group
proposal_data %>%
  drop_na(Scaled_Views_Gained, Mature) %>%
  ggplot(aes(x = Scaled_Views_Gained, fill = Mature)) +
  geom_histogram(position = 'identity', bins = 40) +
  #xlim(0, 15) +
  facet_wrap(~Mature)
# Log-transform views gained to reduce right skew
proposal_data <- proposal_data %>%
  mutate(Log_Views_Gained = log(`Views gained` + 1))
# Distribution of log views gained within each group
proposal_data %>%
  drop_na(Log_Views_Gained, Mature) %>%
  ggplot(aes(x = Log_Views_Gained, fill = Mature)) +
  geom_histogram(position = 'identity', bins = 30) +
  #xlim(0, 15) +
  facet_wrap(~Mature)
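If the log-transformed distributions look roughly symmetric and both groups are large, one possible next step is a two-sample comparison of means on the log scale. The sketch below uses simulated data as a stand-in: the group sizes mirror the 770/230 split reported by `summary()`, but the log-normal parameters, the `sim` data frame, and its column names are arbitrary assumptions. Welch's t-test is shown as one reasonable choice, not necessarily the method the final analysis will use.

```r
set.seed(606)  # make the simulated data reproducible

# Simulated stand-in for proposal_data: 770 non-mature and 230 mature channels.
# The log-normal parameters are arbitrary and NOT estimated from the real data.
sim <- data.frame(
  Mature       = rep(c(FALSE, TRUE), times = c(770, 230)),
  views_gained = c(rlnorm(770, meanlog = 15.7, sdlog = 1),
                   rlnorm(230, meanlog = 15.5, sdlog = 1))
)
sim$log_views <- log(sim$views_gained)

# Welch two-sample t-test (does not assume equal variances) on the log scale
welch <- t.test(log_views ~ Mature, data = sim, var.equal = FALSE)
print(welch)
```

Because the test is run on log values, a difference in means corresponds to a ratio of geometric means on the original scale, which is worth keeping in mind when interpreting the result.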