library(readr)
library(car)
## Loading required package: carData
library(ggplot2)
udemy_data <- read_csv("/Users/umakasi/Documents/STAT 2024/udemy_courses.csv")
## Rows: 3678 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): course_title, url, level, subject
## dbl (6): course_id, price, num_subscribers, num_reviews, num_lectures, cont...
## lgl (1): is_paid
## dttm (1): published_timestamp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model <- lm(num_subscribers ~ price + num_reviews + num_lectures + content_duration, data = udemy_data)
summary(model)
##
## Call:
## lm(formula = num_subscribers ~ price + num_reviews + num_lectures +
## content_duration, data = udemy_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62243 -2177 -1577 -184 209158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2363.9529 184.1364 12.838 <2e-16 ***
## price -4.2269 2.0715 -2.041 0.0414 *
## num_reviews 6.5993 0.1315 50.180 <2e-16 ***
## num_lectures -4.2707 4.0271 -1.060 0.2890
## content_duration 61.6613 32.9959 1.869 0.0617 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7220 on 3673 degrees of freedom
## Multiple R-squared: 0.4236, Adjusted R-squared: 0.4229
## F-statistic: 674.8 on 4 and 3673 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model)
vif_values <- vif(model)
print(vif_values)
## price num_reviews num_lectures content_duration
## 1.126544 1.067660 2.904128 2.814724
outlier_test <- outlierTest(model)
print(outlier_test)
## rstudent unadjusted p-value Bonferroni p
## 2828 33.523023 3.4333e-215 1.2628e-211
## 3033 23.255478 1.0242e-111 3.7671e-108
## 1897 13.051399 4.3373e-38 1.5953e-34
## 2784 11.160087 1.8258e-28 6.7154e-25
## 3231 -9.972215 3.9598e-23 1.4564e-19
## 2620 9.095868 1.5017e-19 5.5232e-16
## 3205 -8.571546 1.4819e-17 5.4506e-14
## 3666 8.308587 1.3467e-16 4.9532e-13
## 493 7.917220 3.1909e-15 1.1736e-11
## 2590 7.510612 7.3481e-14 2.7026e-10
Residual Standard Error (7220 on 3673 degrees of freedom): The typical size of a residual. Observed subscriber counts deviate from the model's predictions by roughly 7,220 subscribers.
R-squared (0.4236): The model explains about 42% of the variance in subscriber numbers, leaving roughly 58% unexplained.
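Both summary statistics above can be recomputed from the residuals, which makes their meaning concrete. The following is a minimal sketch on simulated stand-in data (the variables `x1`, `x2`, `y`, and `fit` are hypothetical and are not the Udemy data); the same steps should apply to `model`.

```r
# Simulated stand-in data (hypothetical; not the Udemy data set)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - x2 + rnorm(n, sd = 1.5)
fit <- lm(y ~ x1 + x2)

res <- residuals(fit)
rss <- sum(res^2)              # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares around the mean

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse_manual <- sqrt(rss / df.residual(fit))

# Multiple R-squared: share of total variance explained by the model
r2_manual <- 1 - rss / tss

# Compare with the values reported by summary()
print(c(manual = rse_manual, summary = summary(fit)$sigma))
print(c(manual = r2_manual,  summary = summary(fit)$r.squared))
```

Because the residual standard error is in the units of the response, it can be read directly as "subscribers" in the model above.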
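Variance Inflation Factors (VIFs): All values fall between about 1.07 and 2.90, well below the common rule-of-thumb thresholds of 5 to 10, so multicollinearity is not a concern here.
Number of reviews: Has by far the strongest statistically significant positive association with subscribers (t = 50.18, p < 2e-16). Price has a small negative effect that is significant at the 5% level, while num_lectures and content_duration are not significant at that level.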
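A VIF is 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below computes it by hand on simulated data; `manual_vif`, `dat`, and the `x1` to `x3` columns are hypothetical names introduced only for illustration.

```r
# Simulated stand-in data (hypothetical; not the Udemy data set)
set.seed(2)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
x3 <- rnorm(n)                        # unrelated to x1 and x2
y  <- 1 + x1 + x2 + x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Hypothetical helper: VIF of each predictor from an auxiliary regression
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- manual_vif(dat, c("x1", "x2", "x3"))
print(vifs)
```

For a model whose predictors are all single numeric columns, as in the regression above, these values should agree with what `car::vif()` reports.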
Residuals vs Fitted: Shows some deviation from randomness, with potential non-linearity and a few outliers (e.g., points 3033, 2828, and 1897), suggesting that the model may not fully capture the data’s pattern.
Q-Q Plot: The residuals deviate from the line at the extremes, indicating potential non-normality, especially due to large outliers (e.g., 2828 and 3033).
Scale-Location: Shows an upward trend in the spread of residuals, suggesting heteroscedasticity (non-constant variance). The coefficient estimates are still usable, but the reported standard errors and p-values may be unreliable.
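The visual impression of heteroscedasticity can be backed by a formal check. The sketch below hand-computes a studentized Breusch-Pagan (Koenker) statistic on simulated data in which the error spread grows with the predictor; the data and the names `fit`, `aux`, `lm_stat`, and `p_value` are assumptions made for illustration. This is one possible check, not the method used in the analysis above.

```r
# Simulated stand-in data (hypothetical): error spread increases with x
set.seed(3)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x)

# Koenker's statistic: n * R^2 of the auxiliary regression,
# compared with a chi-square whose df is the number of predictors (1 here)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat("LM statistic:", lm_stat, " p-value:", p_value, "\n")
```

A small p-value is evidence against constant variance. Since `car` is already loaded in this notebook, `ncvTest(model)` offers a related score test without the manual steps.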
Residuals vs Leverage: Highlights influential points (like 2828), which have high leverage and can disproportionately impact the model. This suggests that these points should be reviewed as they may distort results.
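Influence can also be quantified numerically with Cook's distance and hat values. The sketch below plants one extreme point in simulated data and flags it; the data are hypothetical, and the 4/n cutoff is only a common rule of thumb, not a formal test.

```r
# Simulated stand-in data (hypothetical) with one planted influential point
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 8      # far from the other x values: high leverage
y[n] <- -20    # far from the trend: large residual
fit <- lm(y ~ x)

cooks <- cooks.distance(fit)   # overall influence of each observation
lev   <- hatvalues(fit)        # leverage of each observation

# Rule-of-thumb screen: Cook's distance above 4 / n
flagged <- which(cooks > 4 / n)
worst   <- unname(which.max(cooks))

cat("Most influential row:", worst, "\n")
print(head(sort(cooks, decreasing = TRUE), 5))
```

Applied to `model`, the same two calls would give a ranked list of rows to inspect, after which the regression can be refit without them to see how much the coefficients move.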
coefficients <- coef(model)
num_reviews_coeff <- coefficients["num_reviews"]
cat("The coefficient for num_reviews is:", num_reviews_coeff, "\n")
## The coefficient for num_reviews is: 6.599299
cat("Interpretation: For each additional review, the model predicts an increase of", round(num_reviews_coeff, 2),
"subscribers, holding all other factors constant.")
## Interpretation: For each additional review, the model predicts an increase of 6.6 subscribers, holding all other factors constant.
The coefficient represents the predicted change in subscribers for each additional review, assuming the other variables remain the same.
In summary, a coefficient of about 6.6 for num_reviews means that, all else being equal, each additional review is associated with approximately 6.6 more subscribers. This describes an association within the model rather than proof that reviews bring in subscribers, since courses with more subscribers may simply collect more reviews.
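The "holding all other factors constant" reading can be verified directly: predictions for two otherwise identical courses that differ by one review differ by exactly the coefficient. The sketch below shows this on simulated data and adds a 95% confidence interval; the data frame `dat` and its columns `subs`, `reviews`, and `lectures` are hypothetical stand-ins.

```r
# Simulated stand-in data (hypothetical; not the Udemy data set)
set.seed(5)
n <- 200
reviews  <- rpois(n, 50)
lectures <- rpois(n, 30)
subs <- 500 + 6 * reviews - 2 * lectures + rnorm(n, sd = 40)
dat  <- data.frame(subs, reviews, lectures)
fit  <- lm(subs ~ reviews + lectures, data = dat)

# Two hypothetical courses that differ only by one review
base <- data.frame(reviews = 50, lectures = 30)
plus <- data.frame(reviews = 51, lectures = 30)
diff_pred <- unname(predict(fit, plus) - predict(fit, base))

# The difference in predictions equals the reviews coefficient
print(c(difference = diff_pred, coefficient = unname(coef(fit)["reviews"])))

# A 95% confidence interval conveys the uncertainty in that estimate
ci <- confint(fit, "reviews", level = 0.95)
print(ci)
```

For the regression above, `confint(model, "num_reviews")` would give the corresponding interval around 6.6.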