Rows: 1000 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Channel, Language
dbl (7): Watch time(Minutes), Stream time(minutes), Peak viewers, Average vi...
lgl (2): Partnered, Mature
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
   followers       average_viewers
 Min.   :   3660   Min.   :   235
 1st Qu.: 170546   1st Qu.:  1458
 Median : 318063   Median :  2425
 Mean   : 570054   Mean   :  4781
 3rd Qu.: 624332   3rd Qu.:  4786
 Max.   :8938903   Max.   :147643
Describe the results in a few words. Does anything capture your attention?
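I noticed that the maximum values of followers and average viewers are far above their third quartiles (8,938,903 vs. 624,332 followers and 147,643 vs. 4,786 average viewers), and both means are well above the medians. The distributions are therefore right-skewed, and the largest channels are probably outliers.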
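As a rough check on the outlier remark, the sketch below applies Tukey's 1.5 × IQR rule to the quartiles printed by `summary()` above. The choice of this rule is my own assumption (the question does not name one), and the data frame `q` is just a hand-typed copy of the printed values.

```r
# Quartiles copied by hand from the summary() output above
q <- data.frame(
  variable = c("followers", "average_viewers"),
  min      = c(3660, 235),
  q1       = c(170546, 1458),
  q3       = c(624332, 4786),
  max      = c(8938903, 147643)
)

# Tukey's rule (an assumption here): points beyond 1.5 * IQR from the
# quartiles are flagged as potential outliers
q$iqr         <- q$q3 - q$q1
q$lower_fence <- q$q1 - 1.5 * q$iqr
q$upper_fence <- q$q3 + 1.5 * q$iqr

# Does the minimum / maximum of each variable fall outside the fences?
q$min_flagged <- q$min < q$lower_fence
q$max_flagged <- q$max > q$upper_fence

print(q)
```

With these numbers the upper fences come out to 1,305,011 followers and 9,778 average viewers, so both maxima are flagged. The lower fences are negative, so neither minimum is.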
twitch_data |>
  ggplot(aes(x = followers, y = average_viewers)) +
  geom_point(alpha = 0.5) +
  labs(
    x = "Followers",
    y = "Average Viewers",
    title = "Followers vs. Average Viewers for Twitch Streamers"
  )

twitch_data |>
  ggplot(aes(x = followers, y = average_viewers)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    x = "Followers (log scale)",
    y = "Average Viewers (log scale)",
    title = "Followers vs. Average Viewers for Twitch Streamers"
  )
What I noticed:
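With log scales on both axes the graph looks “zoomed in”: the points are spread across the plot instead of being bunched in one corner. This makes the relationship clearer and easier on the eye, and there now appears to be a positive correlation between followers and average viewers.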
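To see why the log axes help, here is a small sketch of what `log10()` does to the ranges reported by `summary()`. The two vectors are hand-copied from that output; nothing else is assumed about the data.

```r
# Minimum and maximum of each variable, copied from the summary() output
followers_range <- c(min = 3660, max = 8938903)
viewers_range   <- c(min = 235,  max = 147643)

# On the raw scale the largest value is thousands of times the smallest
raw_ratio <- c(
  followers = unname(followers_range["max"] / followers_range["min"]),
  viewers   = unname(viewers_range["max"] / viewers_range["min"])
)

# On the log10 scale the same ranges span only a few units
log_followers_range <- log10(followers_range)
log_viewers_range   <- log10(viewers_range)

print(round(raw_ratio))
print(round(log_followers_range, 2))
print(round(log_viewers_range, 2))
```

Followers span roughly 3.6 to 7.0 on the log10 scale and average viewers roughly 2.4 to 5.2, so the few very large channels no longer squeeze everyone else into a corner.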
pred_data |>
  ggplot(aes(x = log_followers, y = log_viewers)) +
  geom_jitter(alpha = 0.4) +
  geom_line(aes(x = log_followers, y = .fitted), col = "orange") +
  theme_minimal() +
  labs(
    subtitle = "Fitted Model and Raw Data",
    title = "Followers & Average Viewership",
    x = "log(followers)",
    y = "log(viewers)"
  )
The model is indeed pretty good: it captures the basic trend of the scatterplot and passes through where the dots are most clustered.
pred_data |>
  ggplot(aes(x = log_followers, y = .resid)) +
  geom_point(alpha = 0.4) +
  theme_minimal() +
  labs(
    title = "Residuals vs. log(followers)",
    x = "log(followers)",
    y = "Residuals"
  )
Most of the points are dense around residual = 0, which is good. But I do see some big residuals above 1. The largest residuals appear when log(followers) is between 5 and 6. Also, for log(followers) between 6 and 7, the residuals are generally larger than at lower values, even though they do not go over 1.
# A tibble: 10 × 2
language average_viewers
<chr> <dbl>
1 Other 1149
2 Portuguese 1736
3 English 2675
4 Russian 3969
5 Russian 19753
6 English 2401
7 Arabic 2726
8 Portuguese 2655
9 English 1713
10 Swedish 1027
Summaries of the variable
twitch_data |>
  count(language, sort = TRUE)
# A tibble: 21 × 2
language n
<chr> <int>
1 English 485
2 Korean 77
3 Russian 74
4 Spanish 68
5 French 66
6 Portuguese 61
7 German 49
8 Chinese 30
9 Turkish 22
10 Italian 17
# ℹ 11 more rows
According to the model, the estimated standard deviations for Russian and Arabic are higher than for English. English has by far the most channels (485 in the count above), but that does not make it the language with the highest viewership per channel, as the average viewer bar graph in Question 4 also suggests.
Question 6
library(broom)

pred_data <- augment(fit1)

ggplot(pred_data, aes(x = .fitted, y = .std.resid)) +
  geom_point(alpha = 0.4) +
  theme_minimal() +
  labs(
    title = "Residuals vs Fitted Values",
    x = "Fitted values (Predicted viewers)",
    y = "Standardized Residuals"
  ) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red")
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
In the residual plot, there are some outliers around channels with about 5,000–6,000 predicted viewers. Their standardized residuals are greater than 10, meaning that the model fails to predict these values accurately.
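A plot shows that extreme points exist but not which rows they are. Below is a minimal sketch of how such points could be listed. `flag_large_residuals()` is a hypothetical helper, the cutoff of 3 is an arbitrary choice, and the data are synthetic (not the Twitch data). It uses base R's `rstandard()` instead of the `.std.resid` column from `augment()`.

```r
# Hypothetical helper: return the rows of a fitted lm whose standardized
# residual exceeds a cutoff in absolute value
flag_large_residuals <- function(fit, cutoff = 3) {
  out <- data.frame(
    fitted    = fitted(fit),
    std_resid = rstandard(fit)
  )
  out[abs(out$std_resid) > cutoff, , drop = FALSE]
}

# Synthetic data for illustration only (NOT the Twitch data)
set.seed(1)
x <- 1:50
y <- 2 * x + rnorm(50)
y[25] <- y[25] + 40   # plant one extreme observation

toy_fit <- lm(y ~ x)
flagged <- flag_large_residuals(toy_fit)
print(flagged)
```

On the real model, the same idea would be a filter on `abs(.std.resid)` in `pred_data`, which would give the channels behind the extreme points instead of just their position on the plot.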