library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(pwr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
df <- read.csv("~/Documents/STAT 2024/udemy_courses.csv")

Previous linear regression model

linear_model <- lm(num_subscribers ~ price, data = df)

Extending the linear regression model with num_reviews, content_duration, and a price:content_duration interaction term (is_paid is recoded to numeric but left out of the final model)

df$is_paid_numeric <- ifelse(df$is_paid == TRUE, 1, 0)
df <- na.omit(df[, c("num_subscribers", "price", "num_reviews", "content_duration", "is_paid_numeric")])

linear_model_extended <- lm(num_subscribers ~ price + num_reviews + content_duration + price:content_duration, data = df)

model_summary <- summary(linear_model_extended)
print(model_summary)
## 
## Call:
## lm(formula = num_subscribers ~ price + num_reviews + content_duration + 
##     price:content_duration, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61574  -2130  -1590   -217 208677 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2127.4500   215.4661   9.874   <2e-16 ***
## price                    -2.4222     2.3977  -1.010   0.3125    
## num_reviews               6.6220     0.1325  49.963   <2e-16 ***
## content_duration         93.1035    39.8908   2.334   0.0197 *  
## price:content_duration   -0.4918     0.2854  -1.723   0.0850 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7218 on 3673 degrees of freedom
## Multiple R-squared:  0.4239, Adjusted R-squared:  0.4232 
## F-statistic: 675.6 on 4 and 3673 DF,  p-value: < 2.2e-16
vif_data <- vif(linear_model_extended, type = "predictor")
## GVIFs computed for predictors
print(vif_data)
##                      GVIF Df GVIF^(1/(2*Df))   Interacts With
## price            1.084911  3        1.013676 content_duration
## num_reviews      1.084911  1        1.041590             --  
## content_duration 1.084911  3        1.013676            price
##                         Other Predictors
## price                        num_reviews
## num_reviews      price, content_duration
## content_duration             num_reviews
cor_matrix <- cor(df[, c("price", "num_reviews", "content_duration")])
print(cor_matrix)
##                      price num_reviews content_duration
## price            1.0000000   0.1136959        0.2934496
## num_reviews      0.1136959   1.0000000        0.2288893
## content_duration 0.2934496   0.2288893        1.0000000

Price is a key factor for any course on a subscription-based platform. It could influence demand directly; a lower price might attract more subscribers, while a higher price could discourage them.

The number of reviews often serves as a proxy for course popularity or credibility, which can influence a user’s decision to subscribe.

Course length or total content duration could be a selling point, as longer courses might appear more comprehensive or valuable to subscribers.

The interaction between price and content_duration tests whether the effect of price on subscriber count changes with the length of the course.
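As a rough illustration of what the interaction term implies, the sketch below plugs the price and price:content_duration estimates from the model summary into the implied price slope, -2.4222 - 0.4918 * content_duration. The three durations are arbitrary illustrative values, not taken from the data, and neither coefficient is significant at the 5% level (p = 0.3125 and p = 0.0850), so the numbers are indicative only.

```r
# Coefficients copied from the printed model summary
b_price <- -2.4222        # price
b_interaction <- -0.4918  # price:content_duration

# Implied change in num_subscribers for a one-unit price increase,
# evaluated at a given content_duration
price_slope <- function(duration) b_price + b_interaction * duration

# Illustrative durations (hypothetical values, not taken from the data)
durations <- c(1, 5, 20)
slopes <- price_slope(durations)

print(data.frame(content_duration = durations, price_slope = slopes))
```

Under these estimates the price slope becomes more negative as content_duration grows, which is the pattern the interaction term is meant to capture.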

is_paid_numeric was left out of the final model because it was either constant or redundant once price was included.

GVIF for price, num_reviews, and content_duration: all are approximately 1.08, and the adjusted values GVIF^(1/(2*Df)) lie between 1.01 and 1.04, which is very low. Values this close to 1 indicate almost no multicollinearity, so each predictor contributes largely unique information to the model. The pairwise correlations, all below 0.30, point the same way.
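For intuition about where values of this size come from, the sketch below rebuilds ordinary VIFs from the correlation matrix printed above. For a model with main effects only, the VIFs are the diagonal of the inverse correlation matrix. This is a simplification: it ignores the interaction term, so the results are not expected to match the GVIFs reported by car::vif() exactly.

```r
# Correlation matrix typed in from the printed output above
vars <- c("price", "num_reviews", "content_duration")
cor_printed <- matrix(
  c(1.0000000, 0.1136959, 0.2934496,
    0.1136959, 1.0000000, 0.2288893,
    0.2934496, 0.2288893, 1.0000000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# VIF_j = j-th diagonal element of the inverse correlation matrix
# (valid for a main-effects-only model; interaction term ignored here)
vif_main_effects <- diag(solve(cor_printed))

print(round(vif_main_effects, 3))
```

All three values land only slightly above 1, in line with the low GVIFs reported for the full model.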

Plots

# Same specification as linear_model_extended, refit under a new name for the diagnostic plots
linear_model_interaction <- lm(num_subscribers ~ price + num_reviews + content_duration + price:content_duration, data = df)

par(mfrow = c(2, 2))  
plot(linear_model_interaction)

Residuals vs Fitted: checks linearity and equal variance of the residuals. The residuals scatter around the zero line, but they deviate more as the fitted values increase, and a few points have very large residuals. The red line drifts slightly away from zero as fitted values grow, which hints at some non-linearity, while the growing deviation at higher fitted values indicates potential heteroscedasticity (unequal variance).

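Q-Q Residuals: checks whether the residuals are normally distributed. The residuals show mild to moderate deviation from normality, primarily in the tails. This agrees with the residual summary, where the maximum (208677) lies much further from zero than the minimum (-61574).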

Scale-Location: verifies homoscedasticity (constant variance of the residuals). The spread of the residuals appears to increase slightly with the fitted values, as shown by the upward trend in the red line.
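The visual impression of unequal variance can be backed up with a formal check. The sketch below is a minimal, hand-rolled Breusch-Pagan-style test (the studentized variant: squared residuals regressed on the fitted values, with n * R-squared compared to a chi-squared distribution). It runs on simulated data so that it is self-contained; the variables x, y, and fit are hypothetical and stand in for the report's data and model.

```r
set.seed(1)

# Simulated data whose noise grows with x (heteroscedastic by construction)
n <- 500
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 + 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on fitted values
e2 <- resid(fit)^2
fv <- fitted(fit)
aux <- lm(e2 ~ fv)

# Test statistic n * R^2 is approximately chi-squared with 1 df
# under the null hypothesis of constant variance
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")
```

To apply the same idea to this report, fit could be swapped for linear_model_interaction. The car package, already loaded above, also provides ncvTest() as a related score test.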

Residuals vs Leverage: identifies influential observations. Observations such as 2828 combine high leverage with large residuals, so they may have a significant influence on the fitted model.
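Influence can also be quantified numerically with Cook's distance, which is what the contour lines in this plot are based on. The sketch below uses simulated data with one deliberately planted outlier, so the names x, y, and fit are hypothetical. The 4/n cutoff is a common rule of thumb, not a formal test.

```r
set.seed(42)

# 50 well-behaved points plus one planted high-leverage outlier (row 51)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x <- c(x, 10)
y <- c(y, -20)

fit <- lm(y ~ x)

# Cook's distance for every observation
cd <- cooks.distance(fit)

# Flag observations above the 4/n rule-of-thumb cutoff
threshold <- 4 / length(cd)
flagged <- which(cd > threshold)
most_influential <- unname(which.max(cd))

cat("Most influential row:", most_influential, "\n")
print(round(cd[flagged], 3))
```

Running cooks.distance(linear_model_interaction) in the same way would show whether observation 2828 and similar points stand out numerically, and refitting without them would show how much the coefficients move.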