DATA 605 Week11 Discussion

Instructions

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

# Load required libraries
library(dplyr)
library(ggplot2)

# Load the data from the URL
data <- read.csv("https://raw.githubusercontent.com/BeshkiaKvarnstrom/MSDS-DATA-605/main/sug_users_vids4.csv")

# View the structure of the data
str(data)

## 'data.frame':    5201 obs. of  13 variables:
##  $ id           : num  6.89e+18 6.89e+18 6.89e+18 6.89e+18 6.89e+18 ...
##  $ create_time  : int  1604701219 1604625377 1604259328 1603982672 1603844741 1603664756 1603482959 1603409337 1603234334 1603212421 ...
##  $ user_name    : chr  "thetoiley" "thetoiley" "thetoiley" "thetoiley" ...
##  $ hashtags     : chr  "['2prettybestfriends']" "['balloons']" "['duet']" "['dinosaur', 'scary', 'chase']" ...
##  $ song         : chr  "original sound" "positions" "The mummy" "original sound" ...
##  $ video_length : int  14 21 37 8 4 14 11 12 10 15 ...
##  $ n_likes      : int  765 812 1955 13100 22500 21600 3412 16000 2767 26000 ...
##  $ n_shares     : int  2 4 8 16 17 189 9 15 13 84 ...
##  $ n_comments   : int  54 132 97 150 429 521 112 169 148 268 ...
##  $ n_plays      : int  4353 5797 12500 66600 106800 304100 27400 76000 23000 166100 ...
##  $ n_followers  : int  2000000 2000000 2000000 2000000 2000000 2000000 2000000 2000000 2000000 2000000 ...
##  $ n_total_likes: int  22400000 22400000 22400000 22400000 22400000 22400000 22400000 22400000 22400000 22400000 ...
##  $ n_total_vids : int  577 577 577 577 577 577 577 577 577 577 ...

The code above loads the data from GitHub and displays its structure: 5,201 observations of 13 variables.

names(data)  # Print the column names of the data frame

##  [1] "id"            "create_time"   "user_name"     "hashtags"     
##  [5] "song"          "video_length"  "n_likes"       "n_shares"     
##  [9] "n_comments"    "n_plays"       "n_followers"   "n_total_likes"
## [13] "n_total_vids"

Now I will build the regression model. I want to predict n_total_vids from n_total_likes and n_followers. In the lm() call, the data argument specifies the dataset to use, and summary() reports the fitted coefficients and the overall fit statistics.

# Create a linear regression model
model <- lm(n_total_vids ~ n_total_likes + n_followers, data = data) 

# Display the summary of the model
summary(model)

## 
## Call:
## lm(formula = n_total_vids ~ n_total_likes + n_followers, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1560.55  -190.79   -87.90    84.45  2328.80 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.889e+02  7.380e+00   39.15   <2e-16 ***
## n_total_likes  8.099e-06  1.288e-07   62.88   <2e-16 ***
## n_followers   -8.043e-05  3.257e-06  -24.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 406 on 5198 degrees of freedom
## Multiple R-squared:  0.4744, Adjusted R-squared:  0.4742 
## F-statistic:  2346 on 2 and 5198 DF,  p-value: < 2.2e-16
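
To make the coefficient table concrete, the sketch below plugs the first row of the dataset (n_total_likes = 22,400,000 and n_followers = 2,000,000, from the str() output) into the fitted equation, using the rounded estimates printed above. The variable names are my own, and because the coefficients are rounded, the result only approximates what predict(model) would return.

```r
# Rounded coefficient estimates copied from the summary(model) output
intercept     <- 2.889e+02
b_total_likes <- 8.099e-06
b_followers   <- -8.043e-05

# First observation, as shown in the str(data) output
first_row <- list(n_total_likes = 22400000,
                  n_followers   = 2000000,
                  n_total_vids  = 577)

# Fitted value: intercept plus each slope times its predictor
fitted_vids <- intercept +
  b_total_likes * first_row$n_total_likes +
  b_followers * first_row$n_followers

# Residual: observed minus fitted
resid_vids <- first_row$n_total_vids - fitted_vids

print(round(fitted_vids, 1))  # about 309.5
print(round(resid_vids, 1))   # about 267.5
```

The residual of about 267.5 is positive and lies above the third quartile of the residuals (84.45), so the model under-predicts the video count for this user.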

To conduct the residual analysis, I will examine the distribution of the residuals and check for patterns or deviations. The following code creates a residual plot:

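# Create a residual plot (residuals against observation index)
# Use a name that does not shadow the residuals() function
model_residuals <- residuals(model)
plot(model_residuals, pch = 16, ylab = "Residuals", main = "Residual Plot")
abline(h = 0, col = "purple")  # Reference line at zero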

# Store the model summary so individual statistics can be retrieved
model_summary <- summary(model)

# Retrieve R-squared value
r_squared <- model_summary$r.squared

# Retrieve p-values for independent variables
p_values <- model_summary$coefficients[, "Pr(>|t|)"]

print(r_squared)

## [1] 0.4743827

print(p_values)

##   (Intercept) n_total_likes   n_followers 
## 4.848131e-294  0.000000e+00 2.167512e-127

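# Extract the residuals again so this chunk runs on its own
model_residuals <- residuals(model)

# Histogram of residuals
hist(model_residuals, main = "Histogram of Residuals", xlab = "Residuals")

# Q-Q plot of residuals against a normal distribution
qqnorm(model_residuals)
qqline(model_residuals)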

The first chunk above creates a scatter plot of the model’s residuals against the observation indices. The residuals() function extracts the residuals from the model object, and abline() adds a purple horizontal line at zero, the value around which the residuals should scatter. The histogram and Q-Q plot then show how closely the residuals follow a normal distribution.
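
A plot against observation index mainly reveals ordering effects. A common complement is to plot the residuals against the fitted values, where curvature or a funnel shape would point to non-linearity or non-constant variance. The sketch below wraps that in a helper. The function name plot_resid_vs_fitted and the simulated demo data are my own assumptions so that the block runs on its own; passing the model object fitted earlier would be the intended use.

```r
# Hypothetical helper: residuals-versus-fitted plot for any lm fit
plot_resid_vs_fitted <- function(fit) {
  fitted_values <- fitted(fit)
  fit_residuals <- residuals(fit)
  plot(fitted_values, fit_residuals, pch = 16,
       xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs Fitted")
  abline(h = 0, col = "purple")
  # Smoothed trend; a flat line near zero supports linearity
  lines(lowess(fitted_values, fit_residuals), col = "red")
  # Return the plotted values invisibly for further inspection
  invisible(data.frame(fitted = fitted_values, residual = fit_residuals))
}

# Simulated data (not the dataset above) so the example is runnable as-is
set.seed(605)
demo <- data.frame(x1 = runif(200), x2 = runif(200))
demo$y <- 1 + 2 * demo$x1 - 3 * demo$x2 + rnorm(200, sd = 0.5)
demo_fit <- lm(y ~ x1 + x2, data = demo)

plot_data <- plot_resid_vs_fitted(demo_fit)
# In the report itself: plot_resid_vs_fitted(model)
```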

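The residual plot displayed a random pattern with roughly constant variance, and the residuals are approximately normally distributed. This suggests that the linear model above is appropriate for the data, although with an R-squared of 0.4744 it explains only about 47% of the variance in n_total_vids.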
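
A numeric check can complement the visual read of the histogram and Q-Q plot, since the residual quantiles in the model summary (median -87.90, maximum 2328.80 against minimum -1560.55) are not perfectly symmetric. Base R's shapiro.test() accepts at most 5000 values, fewer than the 5,201 observations here, so the sketch below computes moment-based skewness and excess kurtosis instead. The helper name shape_summary and the simulated inputs are my own assumptions, not part of the original analysis.

```r
# Hypothetical helper: moment-based skewness and excess kurtosis
# (both are close to 0 for normally distributed values)
shape_summary <- function(x) {
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m4 <- mean(centered^4)
  c(skewness = m3 / m2^1.5, excess_kurtosis = m4 / m2^2 - 3)
}

# Simulated values for illustration, not the model's residuals
set.seed(605)
symmetric_values <- rnorm(5000)
skewed_values <- rexp(5000) - 1  # Right-skewed, mean near 0

print(shape_summary(symmetric_values))
print(shape_summary(skewed_values))
# In the report itself: shape_summary(residuals(model))
```

Values far from zero on the actual residuals would be a reason to revisit the normality claim.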