Multiple Regression

Including Plots

You can also embed plots, for example:

#1. Why are we concerned with multicollinearity?
# Multicollinearity happens when two or more predictors have a strong correlation with each other. This complicates the model's ability to disentangle individual impacts on outcomes, leading to unstable coefficient estimates, inflated standard errors, and unpredictable findings with minor data changes. In summary, even if the model appears to fit well, it can become unreliable.

#We detect it with the Variance Inflation Factor (VIF). VIF > 5 (or > 10) indicates an issue.  

#2. If we run an ANOVA(model1, model2) and the p-value is greater than .05 what does this mean?
#ANOVA(model1, model2) compares two nested models to determine if adding predictors to the bigger model improves fit. If the p-value is greater than.05, we cannot reject the null hypothesis, indicating that the additional predictors do not significantly increase the variance. We should favor the simple model.

#3.
#What combination of ad type, viewer age, and sex best predicts a viewer's rating for an advertisement?

#H0: None of the predictors (ad, age, or sex) have a significant effect on rating.
#H1: At least one predictor (or combination) strongly predicts Rating

library(readxl)
library(car)

## Loading required package: carData

library(MASS)   
library(corrplot)

## corrplot 0.95 loaded

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

data <- read_excel("/Users/sahilahlawat/Documents/University/Principles and Application 2/RegressionExample.xlsx")
data$Sex <- ifelse(data$Sex == "male", 1, 0)
data$Ad  <- as.numeric(as.character(data$Ad))
str(data)

## tibble [504 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Ad    : num [1:504] 0 0 1 1 2 2 0 0 1 1 ...
##  $ Age   : num [1:504] 55 56 56 54 36 36 75 42 61 41 ...
##  $ Sex   : num [1:504] 1 0 1 0 1 0 1 0 1 0 ...
##  $ Rating: num [1:504] 4.5 19.4 36.4 49.6 75.6 81.6 5.5 28.2 35.9 60.1 ...

cor_matrix <- cor(data)
print(round(cor_matrix, 3))

##            Ad    Age    Sex Rating
## Ad      1.000 -0.041  0.000  0.907
## Age    -0.041  1.000 -0.013 -0.176
## Sex     0.000 -0.013  1.000 -0.276
## Rating  0.907 -0.176 -0.276  1.000

corrplot(cor_matrix, method = "number", type = "upper",
         addCoef.col = "black", tl.col = "black", tl.srt = 45,
         title = "Correlation Matrix", mar = c(0,0,2,0),
         col = colorRampPalette(c("#D73027","white","#1A6FB5"))(200))

# P-values
rcorr_result <- rcorr(as.matrix(data))
print(round(rcorr_result$P, 4))

##            Ad    Age    Sex Rating
## Ad         NA 0.3561 1.0000  0e+00
## Age    0.3561     NA 0.7792  1e-04
## Sex    1.0000 0.7792     NA  0e+00
## Rating 0.0000 0.0001 0.0000     NA

par(mfrow = c(1, 3))
plot(density(data$Ad),     main = "Ad Type",    xlab = "Ad (0/1/2)", lwd = 2, col = "#E6550D")
plot(density(data$Age),    main = "Viewer Age",  xlab = "Age (yrs)", lwd = 2, col = "#3182BD")
plot(density(data$Rating), main = "Rating",      xlab = "Rating",    lwd = 2, col = "#31A354")

par(mfrow = c(1, 1))

# Check variance
cat("Variances:\n")

## Variances:

cat("Ad:", round(var(data$Ad), 2), "\n")

## Ad: 0.67

cat("Sex:", round(var(data$Sex), 2), "\n")

## Sex: 0.25

#linearity and homoscedasticity
par(mfrow = c(1, 3))
plot(data$Ad,  data$Rating, main = "Ad vs Rating",  xlab = "Ad",         ylab = "Rating", pch = 19, col = "#E6550D")
abline(lm(Rating ~ Ad,  data = data), col = "black", lwd = 2)

plot(data$Age, data$Rating, main = "Age vs Rating", xlab = "Age",        ylab = "Rating", pch = 19, col = "#3182BD")
abline(lm(Rating ~ Age, data = data), col = "black", lwd = 2)

plot(data$Sex, data$Rating, main = "Sex vs Rating", xlab = "Sex (1=M)",  ylab = "Rating", pch = 19, col = "#9E9AC8")
abline(lm(Rating ~ Sex, data = data), col = "black", lwd = 2)

full_model <- lm(Rating ~ Ad + Age + Sex, data = data)
vif(full_model)

##       Ad      Age      Sex 
## 1.001700 1.001857 1.000157

model1 <- lm(Rating ~ Ad,              data = data)   # simpler
model2 <- lm(Rating ~ Ad + Age + Sex,  data = data)   # more complex

anova(model1, model2)

## Analysis of Variance Table
## 
## Model 1: Rating ~ Ad
## Model 2: Rating ~ Ad + Age + Sex
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    502 63184                                  
## 2    500 28904  2     34280 296.49 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(step_model)

## 
## Call:
## lm(formula = Rating ~ Ad + Age + Sex, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.932  -4.075   0.204   4.633  18.569 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  34.82141    1.17140   29.73   <2e-16 ***
## Ad           29.32176    0.41514   70.63   <2e-16 ***
## Age          -0.23426    0.02097  -11.17   <2e-16 ***
## Sex         -14.75157    0.67740  -21.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.603 on 500 degrees of freedom
## Multiple R-squared:  0.9188, Adjusted R-squared:  0.9183 
## F-statistic:  1885 on 3 and 500 DF,  p-value: < 2.2e-16

#compare all models
m1 <- lm(Rating ~ Ad,             data = data)
m2 <- lm(Rating ~ Ad + Age,       data = data)
m3 <- lm(Rating ~ Ad + Sex,       data = data)
m4 <- lm(Rating ~ Ad + Age + Sex, data = data)
AIC(m1, m2, m3, m4)

##    df      AIC
## m1  3 3871.232
## m2  4 3815.258
## m3  4 3591.348
## m4  5 3481.072

#Model 4 (Ad + Age + Sex) appears optimal, balancing fit and complexity.

# Final model
final_model <- m4

# Extract coefficients
coefs <- coef(final_model)

cat("Final Regression Equation:\n")

## Final Regression Equation:

cat("Final Regression Equation:\n")

## Final Regression Equation:

cat("Rating =", round(coefs[1], 2),
    "+", round(coefs[2], 2), "* Ad",
    "-", abs(round(coefs[3], 4)), "* Age",
    "-", abs(round(coefs[4], 2)), "* Sex\n\n")

## Rating = 34.82 + 29.32 * Ad - 0.2343 * Age - 14.75 * Sex

par(mfrow = c(1, 1))
plot(step_model)

shapiro_result <- shapiro.test(residuals(step_model))

cat("W =", round(shapiro_result$statistic, 4), "\n")

## W = 0.8876

cat("p-value =", round(shapiro_result$p.value, 4), "\n\n")

## p-value = 0

if(shapiro_result$p.value > 0.05) {
  cat("Conclusion: Residuals appear normally distributed (p > 0.05)\n")
} else {
  cat("Conclusion: Some evidence of non-normality (p < 0.05)\n")
}

## Conclusion: Some evidence of non-normality (p < 0.05)

#Summary: A multiple linear regression was used to see if ad type, viewer age, and viewer sex predicted advertisement ratings. Prior to modeling, a correlation study indicated that all three predictors were strongly associated with Rating (all p <.001). However, the predictors themselves were essentially uncorrelated, ruling out multicollinearity.
#A significant regression equation was found, F(3, 500) = 1885 with an R-square of.92, suggesting that the model explains 92% of the variance in evaluations.  Ad type was the most significant predictor, followed by sex and age (p <.001). Male viewers rated advertisements lower on average, while older viewers provided somewhat lower scores. Each of the predictions is significant. According to the results, each extra ad level raises a viewer's rating by 29.32 points, while each additional year of age lowers it by 0.23 points, and male viewers rate commercials 14.75 points lower than female viewers, assuming all other factors remain constant.

Multiple Regression

Sahil

2026-03-05

R Markdown

Including Plots