DATA 605 Week 12 Discussion
INSTRUCTIONS
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
LOAD PACKAGES
pkges <- c("tidyverse", "readr", "dplyr")
# Loop through the packages: install any that are missing, then load each one
for (p in pkges) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Import Dataset from Github and load in R
#The base R read.csv() function was used to read the dataset from GitHub.
sug_users_vids <- read.csv("https://raw.githubusercontent.com/BeshkiaKvarnstrom/MSDS-DATA-605/main/sug_users_vids_all.csv", check.names = FALSE)
head(sug_users_vids)
## id create_time user_name
## 1 6.892428e+18 1604768557 john.cena10
## 2 6.891790e+18 1604619960 john.cena10
## 3 6.891265e+18 1604497592 john.cena10
## 4 6.891050e+18 1604447622 john.cena10
## 5 6.890886e+18 1604409445 john.cena10
## 6 6.890500e+18 1604319627 john.cena10
## hashtags
## 1 ['johncena', 'love', 'tiktok', 'fyp', 'foryoupage', 'vibes']
## 2 ['johncena', 'love', 'tiktok', 'halloween', 'queen', 'roblox', 'bts', 'comedy']
## 3 ['johncena', 'love', 'fyp', 'foryoupage']
## 4 ['johncena', 'fyp', 'foryoupage', 'viral', 'comedy', 'charlidamelio', 'i']
## 5 ['johncena', 'foryoupage', 'fyp', 'viral']
## 6 ['johncena', 'foryoupage', 'fyp', 'viral', 'comedy', 'vibes']
## song video_length n_likes n_shares n_comments n_plays
## 1 الصوت الأصلي 8 1984 3 18 12800
## 2 الصوت الأصلي 6 7372 9 51 52800
## 3 The Time Is Now (John Cena) 5 4623 11 27 37700
## 4 الصوت الأصلي 6 7931 6 24 51200
## 5 الصوت الأصلي 15 3229 9 14 24700
## 6 الصوت الأصلي 23 8021 24 54 49600
## n_followers n_total_likes n_total_vids
## 1 1000000 4700000 211
## 2 1000000 4700000 211
## 3 1000000 4700000 211
## 4 1000000 4700000 211
## 5 1000000 4700000 211
## 6 1000000 4700000 211
Build the multiple regression model
The following code builds the multiple regression model using the lm() function in R. The formula uses n_plays as the dependent variable and n_followers, n_likes, n_shares, and n_comments as the independent variables. The I() function adds the quadratic terms for followers and likes, and n_followers*n_likes adds an interaction term between followers and likes.
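One detail of R's formula syntax is worth checking here: `a * b` expands to `a + b + a:b`, so writing the main effects separately and then `n_followers*n_likes` does not add duplicate terms. The sketch below only inspects the formulas with `terms()`, so it needs no data; the `short` formula is an alternative spelling shown for comparison, not the one used in this analysis.

```r
# Formula as written in this analysis
full <- n_plays ~ n_followers + n_likes + n_shares + n_comments +
  I(n_followers^2) + I(n_likes^2) + n_followers * n_likes

# Alternative spelling (for comparison only) that lets `*` supply both main effects
short <- n_plays ~ n_followers * n_likes + n_shares + n_comments +
  I(n_followers^2) + I(n_likes^2)

# terms() expands the formula into the individual model terms
full_terms  <- attr(terms(full), "term.labels")
short_terms <- attr(terms(short), "term.labels")

print(full_terms)
# TRUE if both spellings describe the same set of terms
print(setequal(full_terms, short_terms))
```

Both spellings should yield the same seven terms, including a single `n_followers:n_likes` interaction.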
sug_model <- lm(n_plays ~ n_followers + n_likes + n_shares + n_comments +
  I(n_followers^2) + I(n_likes^2) + n_followers*n_likes, data = sug_users_vids)
Interpret the coefficients
summary(sug_model)
##
## Call:
## lm(formula = n_plays ~ n_followers + n_likes + n_shares + n_comments +
## I(n_followers^2) + I(n_likes^2) + n_followers * n_likes,
## data = sug_users_vids)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128755658 -141213 -50531 55924 288915553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.990e+04 2.378e+04 2.519 0.0118 *
## n_followers -2.921e-02 6.017e-03 -4.854 1.21e-06 ***
## n_likes 6.080e+00 5.457e-02 111.420 < 2e-16 ***
## n_shares -1.036e+01 1.752e+00 -5.914 3.37e-09 ***
## n_comments 1.188e+02 1.353e+00 87.838 < 2e-16 ***
## I(n_followers^2) 2.038e-09 2.031e-10 10.032 < 2e-16 ***
## I(n_likes^2) 1.610e-07 5.218e-09 30.861 < 2e-16 ***
## n_followers:n_likes -5.898e-08 2.311e-09 -25.525 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3513000 on 41694 degrees of freedom
## Multiple R-squared: 0.7022, Adjusted R-squared: 0.7022
## F-statistic: 1.405e+04 on 7 and 41694 DF, p-value: < 2.2e-16
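The instructions also ask for a dichotomous term and a dichotomous vs. quantitative interaction, which the model fitted above does not contain. A minimal sketch of one way such terms could be added follows. It runs on simulated data, not the TikTok dataset, and the indicator `long_video` with its 15-second cutoff on `video_length` is a hypothetical choice made only for illustration.

```r
set.seed(605)
n <- 200

# Simulated stand-in for the real data (all values here are made up)
sim <- data.frame(
  n_followers  = runif(n, 1e3, 1e6),
  n_likes      = runif(n, 0, 1e4),
  video_length = sample(5:60, n, replace = TRUE)
)

# Hypothetical dichotomous term: 1 if the video is longer than 15 seconds
sim$long_video <- as.integer(sim$video_length > 15)

# Simulated response so that lm() has something to fit
sim$n_plays <- 1000 + 6 * sim$n_likes + 0.01 * sim$n_followers +
  500 * sim$long_video + 2 * sim$long_video * sim$n_likes +
  rnorm(n, sd = 1000)

# Quadratic term, dichotomous term, and dichotomous x quantitative interaction
sketch_model <- lm(
  n_plays ~ n_likes + I(n_likes^2) + long_video + long_video:n_likes + n_followers,
  data = sim
)
print(names(coef(sketch_model)))
```

In a model of this shape, the `long_video` coefficient shifts the intercept for long videos, and the interaction coefficient is the difference in the likes slope between long and short videos.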
Create the plot
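To conduct residual analysis, we can call the plot() function on the fitted model. For an lm object this produces four diagnostic plots: residuals (the differences between the actual values of n_plays and the values predicted by the model) against fitted values, a normal Q-Q plot of the standardized residuals, a scale-location plot, and residuals against leverage. Setting par(mfrow = c(2,2)) arranges the four plots in a 2 x 2 grid.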
par(mfrow = c(2,2))
plot(sug_model,col="purple")
Determine whether the linear model is appropriate
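A numeric check can complement the diagnostic plots. The sketch below compares the full range of the residuals with their interquartile range; `tail_ratio` is a hypothetical helper name, not part of any package. It is applied first to the five-number residual summary printed by summary(sug_model), then to a simulated fit with heavy-tailed noise to show how it would be used on a model object.

```r
# Ratio of the residual range to the interquartile range (hypothetical helper)
tail_ratio <- function(res) diff(range(res)) / IQR(res)

# Same ratio computed from the residual summary printed for sug_model
printed_ratio <- (288915553 - (-128755658)) / (55924 - (-141213))
print(printed_ratio)

# Illustration on simulated data with heavy-tailed errors (an assumption,
# not the TikTok data); with the real fit, pass resid(sug_model) instead
set.seed(605)
x <- runif(500, 0, 100)
y <- 2 + 3 * x + 10 * rt(500, df = 2)
fit <- lm(y ~ x)

sim_ratio <- tail_ratio(resid(fit))
# Standardized residuals beyond |3| are candidate outliers
n_extreme <- sum(abs(rstandard(fit)) > 3)
print(c(sim_ratio = sim_ratio, n_extreme = n_extreme))
```

For roughly normal residuals this ratio would typically stay in the single digits, so a value in the thousands, as the printed summary implies, suggests very heavy tails and a few extremely influential videos.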
Now let's interpret the coefficients of the model:
The intercept (5.990e+04) is the predicted number of plays when followers, likes, shares, and comments are all zero. This interpretation is not practical or meaningful in this context.
The coefficient for n_followers (-2.921e-02) is the slope of plays with respect to followers when followers and likes are both zero. Because the model also contains I(n_followers^2) and n_followers:n_likes, the effect of an additional follower at any other point depends on the current number of followers and likes.
The coefficient for n_likes (6.080) is the slope of plays with respect to likes when likes and followers are both zero. For the same reason, the effect of an additional like elsewhere depends on the current number of likes and followers.
The coefficient for n_shares (-10.36) is the change in predicted plays for a one-unit increase in shares, holding all other variables constant.
The coefficient for n_comments (118.8) is the change in predicted plays for a one-unit increase in comments, holding all other variables constant.
The coefficient for I(n_followers^2) (2.038e-09) describes the curvature of the relationship between followers and plays. It is positive, so the rate of change of plays increases as followers increases (the curve is concave up).
The coefficient for I(n_likes^2) (1.610e-07) describes the curvature of the relationship between likes and plays. It is also positive, so the rate of change of plays increases as likes increases (the curve is concave up).
The coefficient for n_followers:n_likes (-5.898e-08) is the interaction effect between followers and likes: the change in the effect of followers on plays for a one-unit increase in likes, and equally the change in the effect of likes for a one-unit increase in followers. It is negative, so the effect of followers on plays is weaker for videos with more likes.
Every coefficient is statistically significant at the 0.05 level (p = 0.0118 for the intercept and p < 0.001 for the rest), and the model explains about 70% of the variance in plays (adjusted R-squared 0.7022).
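Because the model contains quadratic and interaction terms, the effect of one more like or one more follower is the partial derivative of the fitted equation rather than a single coefficient. A minimal sketch, using the rounded estimates copied from the summary() output and evaluated at the first row shown by head() (1984 likes, 1,000,000 followers); the function names are hypothetical:

```r
# Estimates copied from the printed summary (rounded as displayed there)
b_followers    <- -2.921e-02
b_likes        <-  6.080e+00
b_followers_sq <-  2.038e-09
b_likes_sq     <-  1.610e-07
b_interaction  <- -5.898e-08

# d(plays)/d(likes) = b_likes + 2 * b_likes_sq * likes + b_interaction * followers
marginal_likes <- function(n_likes, n_followers) {
  b_likes + 2 * b_likes_sq * n_likes + b_interaction * n_followers
}

# d(plays)/d(followers) = b_followers + 2 * b_followers_sq * followers + b_interaction * likes
marginal_followers <- function(n_followers, n_likes) {
  b_followers + 2 * b_followers_sq * n_followers + b_interaction * n_likes
}

me_likes     <- marginal_likes(n_likes = 1984, n_followers = 1e6)
me_followers <- marginal_followers(n_followers = 1e6, n_likes = 1984)
print(c(me_likes = me_likes, me_followers = me_followers))
```

With the fitted object available, the same quantities could be computed from coef(sug_model) instead of the hand-copied values, which avoids rounding.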