library(ggplot2)
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni)
## 'data.frame': 48 obs. of 5 variables:
## $ school : chr "Boston College" "Brandeis University " "Brown University" "California Institute of Technology" ...
## $ percent_of_classes_under_20: int 39 68 60 65 67 52 45 69 72 61 ...
## $ student_faculty_ratio : int 13 8 8 3 10 8 12 7 13 10 ...
## $ alumni_giving_rate : int 25 33 40 46 28 31 27 31 35 53 ...
## $ private : int 1 1 1 1 1 1 1 1 1 1 ...
Question 1. (10 points) Alumni Donation Data (Multiple Linear Regression). Continuing with the alumni donation data, fit a multiple linear regression model to the data, where the alumni giving rate (alumni_giving_rate) is the response variable (Y), and the percentage of classes with fewer than 20 students (percent_of_classes_under_20) and the student/faculty ratio (student_faculty_ratio) are the predictors.
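The fitting call itself is not shown above, but the multiple_model object used in the diagnostics below can be obtained along these lines (a sketch based on the variable names in str(alumni)):
# Multiple linear regression of giving rate on the two predictors
multiple_model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 +
                       student_faculty_ratio, data = alumni)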
res <- rstandard(multiple_model)
qqnorm(res, main = "Normal Q-Q Plot of Standardized Residuals")
qqline(res, col = 2)
The assumption of error normality does not appear to be violated: the points in the Q-Q plot fall roughly along the straight reference line. A strong S-shaped pattern, by contrast, would indicate skewness or heavier-than-normal tails and would suggest a violation of the error normality assumption.
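A formal test can complement the visual check; the Shapiro-Wilk test below is an added illustration, not part of the original output:
# Shapiro-Wilk test of the standardized residuals; a large p-value is
# consistent with the normality assumption
shapiro.test(res)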
Does the assumption of constant error variance appear to be
violated?
The residuals scatter randomly around zero with no clear pattern, so there is no strong evidence that the constant variance assumption is violated. If the spread of the residuals increased or decreased systematically with the fitted values, that would be an indication of heteroscedasticity, which is a violation of the constant variance assumption.
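The residuals-versus-fitted-values plot this answer describes can be produced as follows (the original plotting call is not shown above):
# Standardized residuals against fitted values; a random scatter around zero
# supports the constant variance assumption
plot(fitted(multiple_model), res,
     xlab = "Fitted values", ylab = "Standardized residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)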
Does there appear to be any Y or X outliers or influential points?
res_std <- rstandard(multiple_model)
which(abs(res_std) > 2)
## 10 21 37
## 10 21 37
which(abs(res_std) >3)
## named integer(0)
lev <- hatvalues(multiple_model)
which(lev > 2 * mean(lev))
## 9 33 34
## 9 33 34
cook <- cooks.distance(multiple_model)
which(cook > 4 / length(cook))
## 21 33 43
## 21 33 43
plot(res_std, ylab = "Standardized residuals", main = "Standardized residuals")
abline(h = c(-2, 2), lty = 2)
plot(lev, ylab = "Leverage", main = "Leverage values")
p <- length(coef(multiple_model))
n <- nrow(alumni)
abline(h = 2 * p/n, lty = 2)
plot(cook, ylab = "Cook's distance", main = "Cook's distance")
abline(h = 4 / n, lty = 2)
Influential points are those that carry enough weight to produce a noticeable change in the fitted regression line; here, observations whose Cook's distance exceeds the 4/n cutoff are flagged as potentially influential.
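To see which schools are flagged by that cutoff (using the objects already computed above):
# Observations whose Cook's distance exceeds 4/n
alumni[which(cook > 4 / n), c("school", "alumni_giving_rate")]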
There is no concern about multicollinearity in this model. The two predictors, percent_of_classes_under_20 and student_faculty_ratio, are not excessively correlated: their VIF values are about 2.6, well within the commonly cited acceptable range of 1 to 5, so the model's coefficient estimates are stable.
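The VIF values cited above can be computed with the car package (the call is not shown in the original output):
# Variance inflation factors for the Question 1 model; the answer above
# reports values of about 2.6 for both predictors
car::vif(multiple_model)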
Question 2. (10 points) Simulation Study (Simple Linear Regression). Assume mean function E(Y|X)=10+5X-2X^2
set.seed(763)
n <- 100
mean_u <- 3
std_x <- 0.5
std_e <- 0.5
x1 <- rnorm(n, mean = mean_u, sd = std_x)  # predictor SD is std_x (here equal to std_e)
eps <- rnorm(n, mean = 0, sd = std_e)
y <- 10 + 5 * x1 - 2 * x1^2 + eps
data <- data.frame(x1, y)
head(data)
## x1 y
## 1 3.760477 -0.02238723
## 2 2.969125 7.17280974
## 3 3.527328 1.92402641
## 4 2.591405 9.26482119
## 5 3.519658 2.61638487
## 6 3.371328 4.75517652
simple_fit_regress <- lm(y ~ x1, data = data)
summary(simple_fit_regress)
##
## Call:
## lm(formula = y ~ x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9736 -0.3826 0.2150 0.5902 1.4070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.5297 0.5126 51.75 <2e-16 ***
## x1 -6.6608 0.1725 -38.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8487 on 98 degrees of freedom
## Multiple R-squared: 0.9383, Adjusted R-squared: 0.9377
## F-statistic: 1491 on 1 and 98 DF, p-value: < 2.2e-16
par(mfrow = c(2,2), mar = c(4, 4, 2, 1))
plot(simple_fit_regress)
simple_fit_regress_2 <- lm(y ~ x1 + I(x1^2), data = data)
summary(simple_fit_regress_2)
##
## Call:
## lm(formula = y ~ x1 + I(x1^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92744 -0.32269 -0.02659 0.40100 0.90015
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.1659 1.2657 6.452 4.35e-09 ***
## x1 6.2403 0.8717 7.158 1.57e-10 ***
## I(x1^2) -2.2020 0.1479 -14.889 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4706 on 97 degrees of freedom
## Multiple R-squared: 0.9812, Adjusted R-squared: 0.9808
## F-statistic: 2535 on 2 and 97 DF, p-value: < 2.2e-16
par(mfrow = c(2,2), mar = c(4, 4, 2, 1))
plot(simple_fit_regress_2)
library(car)
## Loading required package: carData
car::vif(simple_fit_regress_2)
## x1 I(x1^2)
## 83.03021 83.03021
Yes, the VIF values are very large. A VIF above 10 is commonly taken to indicate serious multicollinearity, so values of about 83 point to a severe case. This is expected because x1 and x1^2 are highly correlated when x1 is not centered; centering x1 before squaring, as shown below, removes the problem.
fit <- lm(y ~ x1 + I(x1^2), data = data)
vif(fit)
## x1 I(x1^2)
## 83.03021 83.03021
fit_centered <- lm(y ~ I(x1 - mean(x1)) + I((x1 - mean(x1))^2), data = data)
vif(fit_centered)
## I(x1 - mean(x1)) I((x1 - mean(x1))^2)
## 1.000007 1.000007
data$x1_centered <- data$x1 - mean(data$x1)
fit_centered <- lm(y ~ x1_centered + I(x1_centered^2), data = data)
vif(fit_centered)
## x1_centered I(x1_centered^2)
## 1.000007 1.000007
Question 3. (10 points)
We need to assume that the errors are independent and normally distributed with mean 0 and constant variance, i.e., e_i ~ N(0, sigma^2); under these conditions the model's estimates, standard errors, and hypothesis tests are reliable and valid.
-Residuals from a linear regression are not uncorrelated in general. The fitted values depend on the hat matrix, and each diagonal entry h_ii measures the degree to which the ith observation influences its own fitted value (its leverage); the residuals are also constrained (for example, they sum to zero when an intercept is included), which induces correlation among them. If the errors are normally distributed, the ith residual has mean 0 and variance sigma^2(1 - h_ii), so high-leverage points have residuals with smaller variance; see the sketch below.
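A minimal sketch (reusing the Question 1 fit) showing that each standardized residual is the raw residual scaled by its own standard error, which depends on the leverage h_ii:
# rstandard() divides each raw residual by sigma-hat * sqrt(1 - h_ii), so
# higher-leverage observations have residuals with smaller variance
h <- hatvalues(multiple_model)
s <- summary(multiple_model)$sigma
all.equal(unname(rstandard(multiple_model)),
          unname(residuals(multiple_model) / (s * sqrt(1 - h))))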
-We plot residuals against the fitted values because the fitted values summarize what the model predicts, so systematic patterns in that plot (curvature, a widening spread) point to problems such as a misspecified mean function or non-constant variance. Plotting residuals against the observed values is less useful: least-squares residuals are uncorrelated with the fitted values but correlated with the observed response by construction, so a residuals-versus-observed plot shows an apparent trend even when the model is correct, making genuine issues hard to spot (a quick check appears below).
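A quick numerical check of that claim using the Question 1 fit (an added illustration, not part of the original answer):
# OLS residuals are orthogonal to the fitted values (correlation ~ 0) but
# positively correlated with the observed response, which creates spurious
# trends in a residuals-vs-observed plot
cor(residuals(multiple_model), fitted(multiple_model))
cor(residuals(multiple_model), alumni$alumni_giving_rate)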
-Multicollinearity occurs when predictors are highly correlated with one another. When it is present, the coefficient estimates become unstable: small changes in the data can produce large changes in the estimates, and their standard errors become large, which makes the individual coefficients unreliable and hard to interpret.