This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
#1. Why are we concerned with multicollinearity?
# Multicollinearity happens when two or more predictors have a strong correlation with each other. This complicates the model's ability to disentangle individual impacts on outcomes, leading to unstable coefficient estimates, inflated standard errors, and unpredictable findings with minor data changes. In summary, even if the model appears to fit well, it can become unreliable.
#We detect it with the Variance Inflation Factor (VIF). VIF > 5 (or > 10) indicates an issue.
#2. If we run an ANOVA(model1, model2) and the p-value is greater than .05 what does this mean?
#ANOVA(model1, model2) compares two nested models to determine if adding predictors to the bigger model improves fit. If the p-value is greater than.05, we cannot reject the null hypothesis, indicating that the additional predictors do not significantly increase the variance. We should favor the simple model.
#3.
#What combination of ad type, viewer age, and sex best predicts a viewer's rating for an advertisement?
#H0: None of the predictors (ad, age, or sex) have a significant effect on rating.
#H1: At least one predictor (or combination) strongly predicts Rating
library(readxl)
library(car)
## Loading required package: carData
library(MASS)
library(corrplot)
## corrplot 0.95 loaded
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
data <- read_excel("/Users/sahilahlawat/Documents/University/Principles and Application 2/RegressionExample.xlsx")
data$Sex <- ifelse(data$Sex == "male", 1, 0)
data$Ad <- as.numeric(as.character(data$Ad))
str(data)
## tibble [504 × 4] (S3: tbl_df/tbl/data.frame)
## $ Ad : num [1:504] 0 0 1 1 2 2 0 0 1 1 ...
## $ Age : num [1:504] 55 56 56 54 36 36 75 42 61 41 ...
## $ Sex : num [1:504] 1 0 1 0 1 0 1 0 1 0 ...
## $ Rating: num [1:504] 4.5 19.4 36.4 49.6 75.6 81.6 5.5 28.2 35.9 60.1 ...
cor_matrix <- cor(data)
print(round(cor_matrix, 3))
## Ad Age Sex Rating
## Ad 1.000 -0.041 0.000 0.907
## Age -0.041 1.000 -0.013 -0.176
## Sex 0.000 -0.013 1.000 -0.276
## Rating 0.907 -0.176 -0.276 1.000
corrplot(cor_matrix, method = "number", type = "upper",
addCoef.col = "black", tl.col = "black", tl.srt = 45,
title = "Correlation Matrix", mar = c(0,0,2,0),
col = colorRampPalette(c("#D73027","white","#1A6FB5"))(200))
# P-values
rcorr_result <- rcorr(as.matrix(data))
print(round(rcorr_result$P, 4))
## Ad Age Sex Rating
## Ad NA 0.3561 1.0000 0e+00
## Age 0.3561 NA 0.7792 1e-04
## Sex 1.0000 0.7792 NA 0e+00
## Rating 0.0000 0.0001 0.0000 NA
par(mfrow = c(1, 3))
plot(density(data$Ad), main = "Ad Type", xlab = "Ad (0/1/2)", lwd = 2, col = "#E6550D")
plot(density(data$Age), main = "Viewer Age", xlab = "Age (yrs)", lwd = 2, col = "#3182BD")
plot(density(data$Rating), main = "Rating", xlab = "Rating", lwd = 2, col = "#31A354")
par(mfrow = c(1, 1))
# Check variance
cat("Variances:\n")
## Variances:
cat("Ad:", round(var(data$Ad), 2), "\n")
## Ad: 0.67
cat("Sex:", round(var(data$Sex), 2), "\n")
## Sex: 0.25
#linearity and homoscedasticity
par(mfrow = c(1, 3))
plot(data$Ad, data$Rating, main = "Ad vs Rating", xlab = "Ad", ylab = "Rating", pch = 19, col = "#E6550D")
abline(lm(Rating ~ Ad, data = data), col = "black", lwd = 2)
plot(data$Age, data$Rating, main = "Age vs Rating", xlab = "Age", ylab = "Rating", pch = 19, col = "#3182BD")
abline(lm(Rating ~ Age, data = data), col = "black", lwd = 2)
plot(data$Sex, data$Rating, main = "Sex vs Rating", xlab = "Sex (1=M)", ylab = "Rating", pch = 19, col = "#9E9AC8")
abline(lm(Rating ~ Sex, data = data), col = "black", lwd = 2)
full_model <- lm(Rating ~ Ad + Age + Sex, data = data)
vif(full_model)
## Ad Age Sex
## 1.001700 1.001857 1.000157
model1 <- lm(Rating ~ Ad, data = data) # simpler
model2 <- lm(Rating ~ Ad + Age + Sex, data = data) # more complex
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: Rating ~ Ad
## Model 2: Rating ~ Ad + Age + Sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 502 63184
## 2 500 28904 2 34280 296.49 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(step_model)
##
## Call:
## lm(formula = Rating ~ Ad + Age + Sex, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.932 -4.075 0.204 4.633 18.569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.82141 1.17140 29.73 <2e-16 ***
## Ad 29.32176 0.41514 70.63 <2e-16 ***
## Age -0.23426 0.02097 -11.17 <2e-16 ***
## Sex -14.75157 0.67740 -21.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.603 on 500 degrees of freedom
## Multiple R-squared: 0.9188, Adjusted R-squared: 0.9183
## F-statistic: 1885 on 3 and 500 DF, p-value: < 2.2e-16
#compare all models
m1 <- lm(Rating ~ Ad, data = data)
m2 <- lm(Rating ~ Ad + Age, data = data)
m3 <- lm(Rating ~ Ad + Sex, data = data)
m4 <- lm(Rating ~ Ad + Age + Sex, data = data)
AIC(m1, m2, m3, m4)
## df AIC
## m1 3 3871.232
## m2 4 3815.258
## m3 4 3591.348
## m4 5 3481.072
#Model 4 (Ad + Age + Sex) appears optimal, balancing fit and complexity.
# Final model
final_model <- m4
# Extract coefficients
coefs <- coef(final_model)
cat("Final Regression Equation:\n")
## Final Regression Equation:
cat("Final Regression Equation:\n")
## Final Regression Equation:
cat("Rating =", round(coefs[1], 2),
"+", round(coefs[2], 2), "* Ad",
"-", abs(round(coefs[3], 4)), "* Age",
"-", abs(round(coefs[4], 2)), "* Sex\n\n")
## Rating = 34.82 + 29.32 * Ad - 0.2343 * Age - 14.75 * Sex
par(mfrow = c(1, 1))
plot(step_model)
shapiro_result <- shapiro.test(residuals(step_model))
cat("W =", round(shapiro_result$statistic, 4), "\n")
## W = 0.8876
cat("p-value =", round(shapiro_result$p.value, 4), "\n\n")
## p-value = 0
if(shapiro_result$p.value > 0.05) {
cat("Conclusion: Residuals appear normally distributed (p > 0.05)\n")
} else {
cat("Conclusion: Some evidence of non-normality (p < 0.05)\n")
}
## Conclusion: Some evidence of non-normality (p < 0.05)
#Summary: A multiple linear regression was used to see if ad type, viewer age, and viewer sex predicted advertisement ratings. Prior to modeling, a correlation study indicated that all three predictors were strongly associated with Rating (all p <.001). However, the predictors themselves were essentially uncorrelated, ruling out multicollinearity.
#A significant regression equation was found, F(3, 500) = 1885 with an R-square of.92, suggesting that the model explains 92% of the variance in evaluations. Ad type was the most significant predictor, followed by sex and age (p <.001). Male viewers rated advertisements lower on average, while older viewers provided somewhat lower scores. Each of the predictions is significant. According to the results, each extra ad level raises a viewer's rating by 29.32 points, while each additional year of age lowers it by 0.23 points, and male viewers rate commercials 14.75 points lower than female viewers, assuming all other factors remain constant.