DATA 605 Week11 Discussion
Instructions
Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
# Load required libraries
library(dplyr)
library(ggplot2)
# Load the data from the URL
data <- read.csv("https://raw.githubusercontent.com/BeshkiaKvarnstrom/MSDS-DATA-605/main/sug_users_vids4.csv")
# View the structure of the data
str(data)
## 'data.frame': 5201 obs. of 13 variables:
## $ id : num 6.89e+18 6.89e+18 6.89e+18 6.89e+18 6.89e+18 ...
## $ create_time : int 1604701219 1604625377 1604259328 1603982672 1603844741 1603664756 1603482959 1603409337 1603234334 1603212421 ...
## $ user_name : chr "thetoiley" "thetoiley" "thetoiley" "thetoiley" ...
## $ hashtags : chr "['2prettybestfriends']" "['balloons']" "['duet']" "['dinosaur', 'scary', 'chase']" ...
## $ song : chr "original sound" "positions" "The mummy" "original sound" ...
## $ video_length : int 14 21 37 8 4 14 11 12 10 15 ...
## $ n_likes : int 765 812 1955 13100 22500 21600 3412 16000 2767 26000 ...
## $ n_shares : int 2 4 8 16 17 189 9 15 13 84 ...
## $ n_comments : int 54 132 97 150 429 521 112 169 148 268 ...
## $ n_plays : int 4353 5797 12500 66600 106800 304100 27400 76000 23000 166100 ...
## $ n_followers : int 2000000 2000000 2000000 2000000 2000000 2000000 2000000 2000000 2000000 2000000 ...
## $ n_total_likes: int 22400000 22400000 22400000 22400000 22400000 22400000 22400000 22400000 22400000 22400000 ...
## $ n_total_vids : int 577 577 577 577 577 577 577 577 577 577 ...
The code above loads the data from GitHub and displays its structure. The names() call below lists the column names.
names(data) # print the names of the columns in the data frame
## [1] "id" "create_time" "user_name" "hashtags"
## [5] "song" "video_length" "n_likes" "n_shares"
## [9] "n_comments" "n_plays" "n_followers" "n_total_likes"
## [13] "n_total_vids"
Now I will build the regression model. I want to predict the n_total_vids variable based on the n_total_likes and n_followers variables. The data argument specifies the dataset to use. The summary() function provides an overview of the model’s statistics.
# Create a linear regression model
model <- lm(n_total_vids ~ n_total_likes + n_followers, data = data)
# Display the summary of the model
summary(model)
##
## Call:
## lm(formula = n_total_vids ~ n_total_likes + n_followers, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1560.55 -190.79 -87.90 84.45 2328.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.889e+02 7.380e+00 39.15 <2e-16 ***
## n_total_likes 8.099e-06 1.288e-07 62.88 <2e-16 ***
## n_followers -8.043e-05 3.257e-06 -24.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 406 on 5198 degrees of freedom
## Multiple R-squared: 0.4744, Adjusted R-squared: 0.4742
## F-statistic: 2346 on 2 and 5198 DF, p-value: < 2.2e-16
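To make the coefficient table concrete, the fitted equation can be evaluated by hand for the first row shown in the str() output (n_total_likes = 22,400,000, n_followers = 2,000,000, n_total_vids = 577). The sketch below uses the rounded coefficients printed by summary(), so the result is approximate. The variable names b0, b_likes, and b_followers are introduced here for illustration only.

```r
# Rounded coefficients copied from the summary() output above
b0          <- 2.889e+02
b_likes     <- 8.099e-06
b_followers <- -8.043e-05

# Values for the first row of the data, as shown by str(data)
n_total_likes <- 22400000
n_followers   <- 2000000
n_total_vids  <- 577

# Predicted value from the fitted equation, and the resulting residual
predicted <- b0 + b_likes * n_total_likes + b_followers * n_followers
residual  <- n_total_vids - predicted

print(predicted)
print(residual)
```

For this row the model predicts roughly 309 videos against an observed 577, a residual of about 268, which is within the interquartile-to-maximum range reported in the residual summary.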
To conduct the residual analysis, I will examine the distribution of the residuals and check for any patterns or deviations. The following code creates a residual plot:
# Create a residual plot
residuals <- residuals(model)
plot(residuals, pch = 16, ylab = "Residuals", main = "Residual Plot")
abline(h = 0, col = "purple")
model_summary <- summary(model)
# Retrieve R-squared value
r_squared <- model_summary$r.squared
# Retrieve p-values for independent variables
p_values <- model_summary$coefficients[, "Pr(>|t|)"]
print(r_squared)
## [1] 0.4743827
print(p_values)
## (Intercept) n_total_likes n_followers
## 4.848131e-294 0.000000e+00 2.167512e-127
residuals <- residuals(model)
# Histogram of residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")
# Q-Q plot of residuals
qqnorm(residuals)
qqline(residuals)
The first plot above is a scatter plot of the model’s residuals against the observation indices. The residuals() function extracts the residuals from the model object, and the abline() function adds a purple horizontal line at zero, the value around which well-behaved residuals should scatter. The histogram and Q-Q plot show how closely the residuals follow a normal distribution.
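An index plot mainly reveals patterns tied to row order. As a complementary sketch, residuals can also be plotted against the fitted values, where a funnel shape or a curve would point to non-constant variance or a missing nonlinear term. The example below uses simulated data and the hypothetical names sim and fit so that it runs without downloading anything; it is not the analysis from this post.

```r
# Simulated stand-in data (hypothetical; not the data set used in this post)
set.seed(605)
sim <- data.frame(x1 = runif(200, 0, 10), x2 = runif(200, 0, 5))
sim$y <- 3 + 1.5 * sim$x1 - 2 * sim$x2 + rnorm(200, sd = 2)

fit <- lm(y ~ x1 + x2, data = sim)

# Residuals against fitted values, with a zero reference line
fitted_vals <- fitted(fit)
res <- residuals(fit)
plot(fitted_vals, res, pch = 16,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "purple")

# A lowess smoother that stays close to zero suggests no systematic trend
lines(lowess(fitted_vals, res), col = "red")
```

For a fitted lm object such as model, plot(model, which = 1) draws the same kind of plot directly.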
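The residual plot shows no obvious pattern and roughly constant variance, and the histogram and Q-Q plot indicate that the residuals are approximately normally distributed. This suggests that the linear model is appropriate for the data, although with an R-squared of 0.4744 it explains only about 47% of the variation in n_total_vids.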
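The visual impression can be complemented with numeric checks. The sketch below is one possible approach and is not part of the original analysis. It defines a hypothetical helper, check_residuals(), that runs a Shapiro-Wilk test on the residuals (subsampling when needed, because shapiro.test() accepts at most 5000 values and this data set has 5201 rows) and computes a hand-rolled version of the studentized Breusch-Pagan statistic, that is, n times the R-squared from regressing the squared residuals on the predictors. It is demonstrated on simulated data so that it runs on its own.

```r
# Hypothetical helper: numeric checks for normality and constant variance
check_residuals <- function(fit, max_n = 5000) {
  res <- residuals(fit)

  # shapiro.test() only accepts 3 to 5000 values, so subsample larger sets
  res_sw <- if (length(res) > max_n) sample(res, max_n) else res
  sw <- shapiro.test(res_sw)

  # Auxiliary regression of the squared residuals on the predictors
  X <- model.matrix(fit)[, -1, drop = FALSE]
  res_sq <- res^2
  aux <- lm(res_sq ~ X)

  # n * R-squared is compared with a chi-squared distribution,
  # with one degree of freedom per predictor
  bp_stat <- length(res) * summary(aux)$r.squared
  bp_p <- pchisq(bp_stat, df = ncol(X), lower.tail = FALSE)

  list(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
}

# Simulated stand-in data (hypothetical; not the data set used in this post)
set.seed(605)
sim <- data.frame(x1 = runif(500), x2 = runif(500))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(500)
fit <- lm(y ~ x1 + x2, data = sim)

checks <- check_residuals(fit)
print(checks)
```

Calling check_residuals(model) would apply the same checks to the model from this post. With thousands of rows, such tests can flag even small departures from the assumptions, so the p-values are best read alongside the plots.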