library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(pwr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
df <- read.csv("~/Documents/STAT 2024/udemy_courses.csv")
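# Baseline: simple regression of subscribers on price alone
linear_model <- lm(num_subscribers ~ price, data = df)
# Encode is_paid as a 0/1 indicator
df$is_paid_numeric <- ifelse(df$is_paid == TRUE, 1, 0)
# Keep only the modelling columns and drop rows with missing values
df <- na.omit(df[, c("num_subscribers", "price", "num_reviews", "content_duration", "is_paid_numeric")])
# Extended model: adds num_reviews, content_duration, and a price-by-duration interaction
linear_model_extended <- lm(num_subscribers ~ price + num_reviews + content_duration + price:content_duration, data = df)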
model_summary <- summary(linear_model_extended)
print(model_summary)
##
## Call:
## lm(formula = num_subscribers ~ price + num_reviews + content_duration +
## price:content_duration, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61574 -2130 -1590 -217 208677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2127.4500 215.4661 9.874 <2e-16 ***
## price -2.4222 2.3977 -1.010 0.3125
## num_reviews 6.6220 0.1325 49.963 <2e-16 ***
## content_duration 93.1035 39.8908 2.334 0.0197 *
## price:content_duration -0.4918 0.2854 -1.723 0.0850 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7218 on 3673 degrees of freedom
## Multiple R-squared: 0.4239, Adjusted R-squared: 0.4232
## F-statistic: 675.6 on 4 and 3673 DF, p-value: < 2.2e-16
vif_data <- vif(linear_model_extended, type = "predictor")
## GVIFs computed for predictors
print(vif_data)
## GVIF Df GVIF^(1/(2*Df)) Interacts With
## price 1.084911 3 1.013676 content_duration
## num_reviews 1.084911 1 1.041590 --
## content_duration 1.084911 3 1.013676 price
## Other Predictors
## price num_reviews
## num_reviews price, content_duration
## content_duration num_reviews
cor_matrix <- cor(df[, c("price", "num_reviews", "content_duration")])
print(cor_matrix)
## price num_reviews content_duration
## price 1.0000000 0.1136959 0.2934496
## num_reviews 0.1136959 1.0000000 0.2288893
## content_duration 0.2934496 0.2288893 1.0000000
Price is a key factor for any course on a subscription-based platform. It could influence demand directly; a lower price might attract more subscribers, while a higher price could discourage them.
The number of reviews often serves as a proxy for course popularity or credibility, which can influence a user’s decision to subscribe.
Course length or total content duration could be a selling point, as longer courses might appear more comprehensive or valuable to subscribers.
The interaction between price and content_duration tests whether the effect of price on subscriber count changes with the length of the course.
is_paid_numeric was left out of this final model because it was either constant in the data or redundant.
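To make the interaction term concrete, the sketch below computes the slope of price implied by the fitted coefficients at a few course lengths. The coefficients are copied from the printed summary; the durations in `durations` are hypothetical values chosen only for illustration, in whatever units content_duration is recorded in.

```r
# Coefficients copied from the printed summary of linear_model_extended
b_price       <- -2.4222   # main effect of price
b_interaction <- -0.4918   # price:content_duration

# Hypothetical course lengths (assumed values, same units as content_duration)
durations <- c(0, 1, 5, 10)

# Implied change in num_subscribers per one-unit increase in price,
# evaluated at each course length
price_slope <- b_price + b_interaction * durations

print(data.frame(content_duration = durations, price_slope = price_slope))
```

Under these estimates the price slope becomes more negative for longer courses. Neither the price coefficient (p = 0.3125) nor the interaction (p = 0.0850) is significant at the 5% level, so the slopes should be read as rough indications rather than precise effects.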
GVIF for price, num_reviews, and content_duration: all are approximately 1.08 (GVIF^(1/(2*Df)) between 1.01 and 1.04), which is very low and indicates almost no multicollinearity.
The low GVIF values, together with pairwise correlations of at most 0.29, suggest that each predictor contributes information the others do not.
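As a rough cross-check, ordinary VIFs for the three main effects can be read off the diagonal of the inverse correlation matrix. The sketch below retypes the printed correlation matrix by hand. It ignores the price:content_duration term, so it is not expected to reproduce the GVIF values reported by `vif(..., type = "predictor")`; it only shows that the main effects overlap very little.

```r
# Correlation matrix retyped from the printed output above
vars <- c("price", "num_reviews", "content_duration")
cor_matrix <- matrix(
  c(1.0000000, 0.1136959, 0.2934496,
    0.1136959, 1.0000000, 0.2288893,
    0.2934496, 0.2288893, 1.0000000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# For a main-effects-only model, VIF_j is the j-th diagonal element
# of the inverse of the predictors' correlation matrix
main_effect_vif <- diag(solve(cor_matrix))

print(round(main_effect_vif, 3))
```

Values this close to 1 are consistent with the low GVIFs reported for the full model.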
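# Same specification as linear_model_extended, refit under a new name for the diagnostics
linear_model_interaction <- lm(num_subscribers ~ price + num_reviews + content_duration + price:content_duration, data = df)
# Arrange the four standard lm diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(linear_model_interaction)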
Residuals vs Fitted: checks linearity and equal variance of the residuals. The residuals are scattered around the zero line, but there is some deviation as fitted values increase, with a few points showing very large residuals. The red line trends slightly away from zero as fitted values grow, which points to some misfit in the mean, and the wider scatter at high fitted values suggests heteroscedasticity (unequal variance).
Q-Q Residuals: checks whether the residuals are normally distributed. The residuals show mild to moderate deviation from normality, primarily in the tails.
Scale-Location: checks homoscedasticity (constant variance of residuals). The spread of the residuals appears to increase slightly as the fitted values increase, as indicated by the upward trend in the red line.
Residuals vs Leverage: identifies influential observations. Observations such as 2828 combine high leverage with a large residual, so they may have a strong influence on the fitted coefficients.
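The visual readings above can be backed up with numbers. The sketch below is a minimal illustration on simulated data, not the Udemy data: `x`, `y`, and `fit` are made-up names, and the noise is built to grow with `x`. It runs a studentized Breusch-Pagan test by hand (regress the squared residuals on the predictors and compare n times R-squared with a chi-squared distribution) and flags points whose Cook's distance exceeds 4/n, a common rule of thumb rather than a strict cutoff.

```r
set.seed(1)

# Simulated data (assumption): noise standard deviation grows with x
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 + 0.5 * x)
fit <- lm(y ~ x)

# Studentized Breusch-Pagan test, computed by hand:
# auxiliary regression of squared residuals on the predictor
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = number of predictors

# Influence measures from base R
cooks   <- cooks.distance(fit)
lev     <- hatvalues(fit)
flagged <- which(cooks > 4 / n)  # rule-of-thumb threshold, not a formal test

cat("Breusch-Pagan statistic:", round(bp_stat, 2), " p-value:", signif(bp_p, 3), "\n")
cat("Observations flagged by Cook's distance:", length(flagged), "\n")
```

For the model in this report, the same calls would take `linear_model_interaction` in place of `fit`, with the auxiliary regression using all of its predictors and `df` set to their count. Since `car` is already loaded, `ncvTest(linear_model_interaction)` is another option for testing non-constant variance.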