library(ggplot2)
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni)
## 'data.frame':    48 obs. of  5 variables:
##  $ school                     : chr  "Boston College" "Brandeis University " "Brown University" "California Institute of Technology" ...
##  $ percent_of_classes_under_20: int  39 68 60 65 67 52 45 69 72 61 ...
##  $ student_faculty_ratio      : int  13 8 8 3 10 8 12 7 13 10 ...
##  $ alumni_giving_rate         : int  25 33 40 46 28 31 27 31 35 53 ...
##  $ private                    : int  1 1 1 1 1 1 1 1 1 1 ...

Question 1. (10 points) Alumni Donation Data (Multiple Linear Regression). Continuing with the alumni donation data, fit a multiple linear regression model to the data, where the alumni giving rate (alumni_giving_rate) is the response variable (Y) and the percentage of classes with fewer than 20 students (percent_of_classes_under_20) and the student/faculty ratio (student_faculty_ratio) are the predictors.
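The residual diagnostics that follow reference a fitted object multiple_model. One way to fit the model described above (a minimal sketch using the alumni data loaded earlier):

multiple_model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio,
                     data = alumni)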

  a. Does the assumption of error normality appear to be violated?
res <- rstandard(multiple_model)

qqnorm(res, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(res, col = 2)

The assumption of error normality does not appear to be violated, since the points in the Q-Q plot fall roughly along the reference line. A strong S-shaped pattern would indicate skewness or heavier tails, which would be evidence of a violation of the error-normality assumption.

  b. Does the assumption of constant error variance appear to be violated? The residuals scatter randomly with no clear pattern, so there is no strong evidence that the constant-variance assumption is violated (see the plot sketch below). If the spread of the residuals increased or decreased with the fitted values, that would be an indication of heteroscedasticity, which would violate the constant-variance assumption.
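A minimal sketch of that plot, using the standardized residuals from multiple_model:

plot(fitted(multiple_model), rstandard(multiple_model),
     xlab = "Fitted values", ylab = "Standardized residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)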

  c. Does there appear to be any Y or X outliers or influential points?

res_std <- rstandard(multiple_model)
which(abs(res_std) > 2)
## 10 21 37 
## 10 21 37
which(abs(res_std) > 3)
## named integer(0)
lev <- hatvalues(multiple_model)
which(lev > 2 * mean(lev))
##  9 33 34 
##  9 33 34
cook <- cooks.distance(multiple_model)
which(cook > 4 / length(cook))
## 21 33 43 
## 21 33 43
plot(res_std, ylab = "Standardized residuals", main = "Standardized residuals")
abline(h = c(-2, 2), lty = 2)

plot(lev, ylab = "Leverage", main = "Leverage values")
p <- length(coef(multiple_model))
n <- nrow(alumni)
abline(h = 2 * p/n, lty = 2)

plot(cook, ylab = "Cook's distance", main = "Cook's distance")
abline(h = 4 / n, lty = 2)

Influential points are those with enough weight to produce a noticeable change in the fitted regression line; observations with Cook's distance above 4/n (here 21, 33, and 43) are flagged as potentially influential. Observations 10, 21, and 37 have standardized residuals beyond ±2 (possible Y outliers, though none exceed ±3), and observations 9, 33, and 34 have high leverage (possible X outliers).

  d. Is there any concern about multicollinearity?
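The VIF values cited in the answer below can be computed with the car package (assuming the same multiple_model fit as above):

library(car)
vif(multiple_model)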

- There is no concern about multicollinearity in this model. The two predictors, percent_of_classes_under_20 and student_faculty_ratio, are not strongly related, and both VIF values are only about 2.6, well within the usual 1-to-5 range, so the model's coefficient estimates are stable.

Question 2. (10 points) Simulation Study (Simple Linear Regression). Assume mean function E(Y|X)=10+5X-2X^2

  a. Generate data with X_1∼N(μ=3,σ=0.5), sample size n=100, and error term ϵ∼N(μ=0,σ=0.5).
set.seed(763)

n  <- 100
mean_u <- 3
std_x <- 0.5
std_e <- 0.5

x1 <- rnorm(n, mean = mean_u, sd = std_x)
eps <- rnorm(n, mean = 0, sd = std_e)

y <- 10 + 5 * x1 - 2 * x1^2 + eps

data <- data.frame(x1, y)
head(data)
##         x1           y
## 1 3.760477 -0.02238723
## 2 2.969125  7.17280974
## 3 3.527328  1.92402641
## 4 2.591405  9.26482119
## 5 3.519658  2.61638487
## 6 3.371328  4.75517652
  b. Fit a simple linear regression using just X. What is the estimated regression equation? Please conduct model estimation, inference, and residual diagnostics. What do you conclude?
simple_fit_regress <- lm(y ~ x1, data = data)
summary(simple_fit_regress)
## 
## Call:
## lm(formula = y ~ x1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9736 -0.3826  0.2150  0.5902  1.4070 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.5297     0.5126   51.75   <2e-16 ***
## x1           -6.6608     0.1725  -38.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8487 on 98 degrees of freedom
## Multiple R-squared:  0.9383, Adjusted R-squared:  0.9377 
## F-statistic:  1491 on 1 and 98 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2), mar = c(4, 4, 2, 1))
plot(simple_fit_regress)

  c. Update the model from part b) by adding a quadratic term. Conduct model estimation, inference, and residual diagnostics. What do you conclude? Does this model seem to fit the data better? Please explain.
simple_fit_regress_2 <- lm(y ~ x1 + I(x1^2), data = data)
summary(simple_fit_regress_2)
## 
## Call:
## lm(formula = y ~ x1 + I(x1^2), data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92744 -0.32269 -0.02659  0.40100  0.90015 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.1659     1.2657   6.452 4.35e-09 ***
## x1            6.2403     0.8717   7.158 1.57e-10 ***
## I(x1^2)      -2.2020     0.1479 -14.889  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4706 on 97 degrees of freedom
## Multiple R-squared:  0.9812, Adjusted R-squared:  0.9808 
## F-statistic:  2535 on 2 and 97 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2), mar = c(4, 4, 2, 1))
plot(simple_fit_regress_2)
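To support the comparison, the two fits can also be compared formally; a short sketch using a partial F-test and AIC (both model objects are defined above):

anova(simple_fit_regress, simple_fit_regress_2)  # partial F-test for the added quadratic term
AIC(simple_fit_regress, simple_fit_regress_2)    # lower AIC indicates the better-fitting model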

  d. What is the variance inflation factor (VIF) for the quadratic model? Any concern of multicollinearity?
library(car)
## Loading required package: carData
car::vif(simple_fit_regress_2)
##       x1  I(x1^2) 
## 83.03021 83.03021

- Yes, the VIF values are very large. A VIF above 10 is generally taken to indicate multicollinearity, so values around 83 represent a severe case: x1 and x1^2 are almost perfectly linearly related.

  e. Now center the X variable and compare the VIF from d). What did you find? Which VIF is smaller? Please briefly explain the reason. Hint: you can center the X variable in R using the following pseudocode:
fit <- lm(y ~ x1 + I(x1^2), data = data)
vif(fit)
##       x1  I(x1^2) 
## 83.03021 83.03021
fit_centered <- lm(y ~ I(x1 - mean(x1)) + I((x1 - mean(x1))^2), data = data)
vif(fit_centered)
##     I(x1 - mean(x1)) I((x1 - mean(x1))^2) 
##             1.000007             1.000007
data$x1_centered <- data$x1 - mean(data$x1)
fit_centered <- lm(y ~ x1_centered + I(x1_centered^2), data = data)
vif(fit_centered)
##      x1_centered I(x1_centered^2) 
##         1.000007         1.000007
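The centered VIFs are far smaller. Because the mean of x1 is around 3, x1 and x1^2 are almost perfectly linearly related, which is what inflates the uncentered VIFs; centering moves the mean to zero, leaving the linear and quadratic terms nearly uncorrelated. A quick check of the correlations (using the x1_centered column created above):

cor(data$x1, data$x1^2)                    # close to 1: near-perfect collinearity
cor(data$x1_centered, data$x1_centered^2)  # close to 0 after centering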

Question 3. (10 points)

  a. What are the assumptions of the (normal) linear regression model? (Be sure to include the assumptions of the random error term ϵ.) Please carefully state using bulleted points.

- The relationship between the response and the predictors is linear: E(Y|X) = β_0 + β_1 X_1 + ⋯ + β_p X_p.
- The errors ϵ_i have mean zero.
- The errors have constant variance σ^2 (homoscedasticity).
- The errors are independent of one another.
- The errors are normally distributed: ϵ_i ∼ N(0, σ^2).
- The predictor values are fixed (measured without error).

Under these conditions the least-squares estimates, and the usual t- and F-based inferences, are reliable and valid.

  b. Are the residuals from linear regression uncorrelated in general? Please explain. What is the distribution of residuals?

- Residuals from linear regression are not uncorrelated in general. The residual vector is e = (I − H)ϵ, where H is the hat matrix, so the residuals depend on all of the errors through the fitted line; with an intercept in the model they also sum to zero, which induces mild negative correlation among them. Each diagonal entry h_ii of the hat matrix measures the degree to which the ith observation influences its own fitted value, and high-leverage points (large h_ii) have residuals with smaller variance. If the errors are normally distributed, the ith residual is normally distributed with mean 0 and variance σ^2(1 − h_ii).
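As a small numerical illustration of the variance formula (a sketch that reuses the quadratic fit simple_fit_regress_2 from Question 2):

h <- hatvalues(simple_fit_regress_2)              # leverage values h_ii
sigma_hat <- summary(simple_fit_regress_2)$sigma  # estimate of sigma
head(sigma_hat * sqrt(1 - h))                     # estimated standard deviation of each residual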

  c. Why do we plot the residuals against the fitted values (Ŷ_i) rather than against the observed response values (Y_i)? Please explain.

- Plotting the residuals against the fitted values shows how the model's predictions compare with the data, so patterns such as curvature or changing spread point directly to problems with the model; by construction, least-squares residuals are uncorrelated with the fitted values, so a correctly specified model produces a patternless plot. If we plotted the residuals against the observed values Y_i instead, the residuals would be correlated with Y_i even when the model is correct, so an apparent trend would show up that tells us nothing about model adequacy.
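This can be checked numerically (again a sketch reusing the quadratic fit from Question 2): the residuals are essentially uncorrelated with the fitted values but noticeably correlated with the observed responses.

cor(resid(simple_fit_regress_2), fitted(simple_fit_regress_2))  # ~0 by construction of least squares
cor(resid(simple_fit_regress_2), data$y)                        # positive, roughly sqrt(1 - R^2)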

  d. What is multicollinearity? Computationally, what kind of problems can multicollinearity cause when using ordinary least squares? Please explain.

- Multicollinearity occurs when the predictors are highly correlated with one another. Computationally, it makes the X'X matrix nearly singular (ill-conditioned), so inverting it in the ordinary least-squares solution is numerically unstable: the coefficient estimates become unstable and can change dramatically with small changes in the data, and their standard errors become large. This weakens the reliability of the individual coefficient estimates and tests.
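One way to see the computational problem (a sketch, using the uncentered quadratic fit from Question 2, where x1 and x1^2 are highly correlated): the cross-product matrix X'X that ordinary least squares must invert is ill-conditioned.

X <- model.matrix(simple_fit_regress_2)  # design matrix with intercept, x1, and x1^2
kappa(crossprod(X))                      # a large condition number signals a nearly singular X'X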