library(ggplot2)
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)
str(alumni) # print structure of the alumni data frame
## 'data.frame': 48 obs. of 5 variables:
## $ school : chr "Boston College" "Brandeis University " "Brown University" "California Institute of Technology" ...
## $ percent_of_classes_under_20: int 39 68 60 65 67 52 45 69 72 61 ...
## $ student_faculty_ratio : int 13 8 8 3 10 8 12 7 13 10 ...
## $ alumni_giving_rate : int 25 33 40 46 28 31 27 31 35 53 ...
## $ private : int 1 1 1 1 1 1 1 1 1 1 ...
Question 1 Alumni Donation Data (Multiple Linear Regression). Continuing with the alumni donation data, fit a multiple linear regression model where the alumni giving rate (alumni_giving_rate) is the response variable (Y), and the percentage of classes with fewer than 20 students (percent_of_classes_under_20) and the student/faculty ratio (student_faculty_ratio) are the predictors.
multiple_model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio, data = alumni)
multiple_summary <- summary(multiple_model)
print(multiple_summary)
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
## student_faculty_ratio, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.00 -6.57 -1.95 4.42 24.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6556 13.5076 2.936 0.005225 **
## percent_of_classes_under_20 0.1662 0.1626 1.022 0.312128
## student_faculty_ratio -1.7021 0.4421 -3.850 0.000371 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.098 on 45 degrees of freedom
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.5418
## F-statistic: 28.79 on 2 and 45 DF, p-value: 8.869e-09
# Build a display string for the full fitted regression equation
get_eq <- function(model) {
  b <- round(coef(model), 2)
  paste0("Y = ", b[1], " + ", b[2], " * X1 ", ifelse(b[3] < 0, "- ", "+ "), abs(b[3]), " * X2")
}
students_faculty_alumni_ratio_rate <- ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, col = "blue") +
labs(title = "Alumni giving rate vs. % of classes under 20",
x = "% of Classes Under 20",
y = "Alumni Giving Rate") +
  annotate("text", x = 40, y = 50, label = get_eq(multiple_model), color = "blue")
print(students_faculty_alumni_ratio_rate)
## `geom_smooth()` using formula = 'y ~ x'
students_faculty_alumni_ratio_rate <- ggplot(alumni, aes(x = student_faculty_ratio, y = alumni_giving_rate)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, col = "blue") +
labs(title = "Alumni giving rate vs. Student Faculty Ratio",
x = "Student Faculty Ratio",
y = "Alumni Giving Rate") +
  annotate("text", x = 10, y = 50, label = get_eq(multiple_model), color = "blue")
print(students_faculty_alumni_ratio_rate)
## `geom_smooth()` using formula = 'y ~ x'
# Stack the two predictors into one long data frame so both relationships
# can be shown side by side with facet_wrap()
student_faculty_data <- rbind(
transform(alumni, variable = "% of Classes Under 20"),
transform(alumni, variable = "Student Faculty Ratio", percent_of_classes_under_20 = student_faculty_ratio)
)
students_faculty_alumni_ratio_rate <- ggplot(student_faculty_data, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
geom_point(aes(color = variable)) +
geom_smooth(method = "lm", se = FALSE, aes(group = variable, color = variable), show.legend = FALSE) +
facet_wrap(~ variable, scales = "free") +
  labs(title = "Alumni Giving Rate vs. Each Predictor",
       x = "Predictor value",
       y = "Alumni Giving Rate")
print(students_faculty_alumni_ratio_rate)
## `geom_smooth()` using formula = 'y ~ x'
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(
  data = alumni,
  x = ~percent_of_classes_under_20,
  y = ~student_faculty_ratio,
  z = ~alumni_giving_rate,
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    scene = list(
      xaxis = list(title = "% of Classes < 20"),
      yaxis = list(title = "Student-Faculty Ratio"),
      zaxis = list(title = "Alumni Giving Rate")
    )
  )
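To see how the two predictors jointly determine the fitted values, here is a sketch that overlays the fitted regression plane on the 3D scatter (assuming the multiple_model object fit earlier; the 25-point grid is an arbitrary choice):

# Build a prediction grid over the observed predictor ranges
grid_x <- seq(min(alumni$percent_of_classes_under_20),
              max(alumni$percent_of_classes_under_20), length.out = 25)
grid_y <- seq(min(alumni$student_faculty_ratio),
              max(alumni$student_faculty_ratio), length.out = 25)
# outer() returns a length(grid_x) x length(grid_y) matrix of fitted values
grid_z <- outer(grid_x, grid_y, function(x, y) {
  predict(multiple_model,
          newdata = data.frame(percent_of_classes_under_20 = x,
                               student_faculty_ratio = y))
})
plot_ly(alumni,
        x = ~percent_of_classes_under_20,
        y = ~student_faculty_ratio,
        z = ~alumni_giving_rate) %>%
  add_markers() %>%
  # plotly expects the surface z with rows indexed by y, so transpose
  add_surface(x = grid_x, y = grid_y, z = t(grid_z),
              opacity = 0.5, showscale = FALSE)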
The fitted regression is Y = 39.6556 + 0.1662 X1 - 1.7021 X2, where Y is the alumni giving rate, X1 is the percentage of classes under 20 students, and X2 is the student/faculty ratio. The model has R^2 = 0.5613, and the coefficient t-tests show that X2 is statistically significant while X1 is not.
# Same model as multiple_model, refit under the name used for prediction below
combined_model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio,
                     data = alumni)
cat(sprintf("Predicted Alumni Giving Rate: %.2f\n",
            predict(combined_model,
                    newdata = data.frame(percent_of_classes_under_20 = 50,
                                         student_faculty_ratio = 10))))
## Predicted Alumni Giving Rate: 30.94
The predicted alumni giving rate is 30.94 when the percentage of classes under 20 is 50% and the student/faculty ratio is 10.
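A point estimate alone hides the uncertainty around it; predict() can also return 95% confidence and prediction intervals. A sketch, using a hypothetical new_school data frame:

new_school <- data.frame(percent_of_classes_under_20 = 50,
                         student_faculty_ratio = 10)
predict(combined_model, newdata = new_school, interval = "confidence")  # mean response
predict(combined_model, newdata = new_school, interval = "prediction")  # single new school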
summary(multiple_model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6555835 13.5075774 2.935803 0.0052247868
## percent_of_classes_under_20 0.1661686 0.1625520 1.022249 0.3121275033
## student_faculty_ratio -1.7021103 0.4421271 -3.849821 0.0003709425
At α = 0.05, we test the null hypothesis that each coefficient equals zero. The t-test gives p = 0.31 for the percentage of classes under 20 and p = 0.00037 for the student/faculty ratio. The percentage of classes under 20 is therefore not significant (we fail to reject the null hypothesis), while the student/faculty ratio is significant (we reject the null hypothesis). In conclusion, the student/faculty ratio shows a significant relationship with the alumni giving rate, whereas the percentage of classes under 20 does not.
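To make the test mechanics explicit, here is a sketch that recomputes each t statistic and two-sided p-value from the coefficient table (the residual degrees of freedom here are 45):

coefs <- summary(multiple_model)$coefficients
t_vals <- coefs[, "Estimate"] / coefs[, "Std. Error"]  # t = estimate / standard error
p_vals <- 2 * pt(abs(t_vals), df = df.residual(multiple_model), lower.tail = FALSE)
cbind(t_value = t_vals, p_value = p_vals)  # matches the table above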
f_statistic_multiple_model <- summary(multiple_model)$fstatistic
p_value_f_stat_multiple_model <- pf(f_statistic_multiple_model[1],
                                    f_statistic_multiple_model[2],
                                    f_statistic_multiple_model[3],
                                    lower.tail = FALSE)
f_statistic_multiple_model
## value numdf dendf
## 28.79264 2.00000 45.00000
p_value_f_stat_multiple_model
## value
## 8.868867e-09
The overall F-test evaluates the null hypothesis that both slope coefficients are zero. Here F = 28.79 on 2 and 45 degrees of freedom, with a p-value of 8.869e-09. Since p < 0.05, we reject the null hypothesis: the model as a whole is statistically significant.
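The same F statistic falls out of comparing the fitted model against an intercept-only null model; a sketch:

null_model <- lm(alumni_giving_rate ~ 1, data = alumni)  # intercept-only baseline
anova(null_model, multiple_model)  # reproduces F = 28.79 and p = 8.869e-09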
r_squared_multiple_model <- summary(multiple_model)$r.squared
r_squared_multiple_model
## [1] 0.5613406
The R^2 of 0.5613 means that 56.13% of the variance in the alumni giving rate is explained by the percentage of classes under 20 and the student/faculty ratio; the remaining 43.87% is left unexplained by these two predictors. This is a better fit than either predictor achieves in a simple regression alone.
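As a sanity check, R^2 can be recomputed directly from the residual and total sums of squares:

sse <- sum(residuals(multiple_model)^2)  # sum of squared residuals
sst <- sum((alumni$alumni_giving_rate - mean(alumni$alumni_giving_rate))^2)  # total SS
1 - sse / sst  # reproduces 0.5613406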
cor_percent_agr_poc_20 <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
cor_sfr_agr <- cor(alumni$student_faculty_ratio, alumni$alumni_giving_rate)
cor_percent_agr_poc_20
## [1] 0.6456504
cor_sfr_agr
## [1] -0.7423975
The correlation with the alumni giving rate is r1 = 0.6457 for the percentage of classes under 20 and r2 = -0.7424 for the student/faculty ratio. Recall that R^2 = 0.5613. R^2 reflects the predictors' joint contribution and is not a simple function of the individual correlations (for instance, r1^2 + r2^2 ≈ 0.97, far above 0.5613), because the predictors share explanatory power.
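A sketch comparing each predictor's squared correlation (the R^2 it would achieve in a simple regression) against the joint R^2, along with the correlation between the two predictors:

c(r1_squared = cor_percent_agr_poc_20^2,  # simple-regression R^2 for X1
  r2_squared = cor_sfr_agr^2,             # simple-regression R^2 for X2
  joint_r2 = summary(multiple_model)$r.squared)
# The predictors are correlated with each other, so their individual
# contributions overlap rather than add up
cor(alumni$percent_of_classes_under_20, alumni$student_faculty_ratio)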
Question 2 Simulation Study (Simple Linear Regression).
set.seed(123)
N <- 1000
X1 <- rnorm(N, mean = 2, sd = 0.1)
X2 <- rnorm(N, mean = 0, sd = 0.4)
epsilon <- rnorm(N, mean = 0, sd = 0.5)
Y <- 10 + 5*X1 - 2*X2 + epsilon
sim_study_a <- lm(Y ~ X1 + X2)
sim_study_b <- summary(sim_study_a)
print(sim_study_a)
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Coefficients:
## (Intercept) X1 X2
## 10.204 4.893 -1.966
To answer parts a) and b): the estimated equation is Y = 10.204 + 4.893 X1 - 1.966 X2, close to the true model Y = 10 + 5 X1 - 2 X2. From the model summary, the intercept has standard error 0.314, t value 32.50, and p value < 2e-16; X1 has standard error 0.156, t value 31.22, and p value < 2e-16; X2 has standard error 0.03848, t value -51.08, and p value < 2e-16. We therefore reject the null hypothesis for every term: X1 has a positive effect on Y and X2 a negative one. The residual standard error is 0.4894 (so MSE ≈ 0.4894^2 ≈ 0.24), consistent with the true error standard deviation of 0.5.
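The standard errors, t values, and p values quoted above come from the model summary, which the chunk stores in sim_study_b but never prints; a short sketch to surface them:

print(sim_study_b)    # coefficient table: estimates, standard errors, t and p values
sigma(sim_study_a)    # residual standard error, close to the true error sd of 0.5
sigma(sim_study_a)^2  # its square, often reported as the MSE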
# Part c): increase the error standard deviation from 0.5 to 1
epsilon_new <- rnorm(N, mean = 0, sd = 1)
Y_new <- 10 + 5 * X1 - 2 * X2 + epsilon_new
sim_model_c <- lm(Y_new ~ X1+X2)
summary(sim_model_c)
##
## Call:
## lm(formula = Y_new ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.11699 -0.62664 0.00437 0.66561 2.90740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.03929 0.63737 15.75 <2e-16 ***
## X1 4.97594 0.31810 15.64 <2e-16 ***
## X2 -2.01676 0.07811 -25.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9933 on 997 degrees of freedom
## Multiple R-squared: 0.4596, Adjusted R-squared: 0.4585
## F-statistic: 423.9 on 2 and 997 DF, p-value: < 2.2e-16
# Part d): smaller sample of n = 400 with the original error sd of 0.5
N_new <- 400
X1_new <- rnorm(N_new, mean = 2, sd = 0.1)
X2_new <- rnorm(N_new, mean = 0, sd = 0.4)
epsilon_new <- rnorm(N_new, mean = 0, sd = 0.5)
Y_new <- 10 + 5 * X1_new - 2 * X2_new + epsilon_new
sim_model_d <- lm(Y_new ~ X1_new + X2_new)
summary(sim_model_d)
##
## Call:
## lm(formula = Y_new ~ X1_new + X2_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52153 -0.31078 0.02841 0.32596 1.08097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.5572 0.4768 20.04 <2e-16 ***
## X1_new 5.2195 0.2386 21.87 <2e-16 ***
## X2_new -1.9189 0.0605 -31.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4772 on 397 degrees of freedom
## Multiple R-squared: 0.7931, Adjusted R-squared: 0.7921
## F-statistic: 761 on 2 and 397 DF, p-value: < 2.2e-16
The estimated model is Y = 9.5572 + 5.2195 X1 - 1.9189 X2; all coefficients are highly significant (p < 2e-16) and R^2 = 0.7931. Comparing the scenarios: when the error variance decreases, the standard errors shrink and the estimates become more precise, with a lower MSE of about 0.228. As the sample size increases, the parameter estimates approach their true values, standard errors decrease, and the model becomes more reliable.
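A small Monte Carlo sketch makes the sample-size effect concrete: the empirical standard deviation of the X1 slope estimate shrinks roughly like 1/sqrt(n). The helper name se_of_slope, reps = 200, and the sample sizes are arbitrary illustration choices:

# Empirical standard error of the X1 slope across repeated simulations
se_of_slope <- function(n, sigma, reps = 200) {
  slopes <- replicate(reps, {
    x1 <- rnorm(n, mean = 2, sd = 0.1)
    x2 <- rnorm(n, mean = 0, sd = 0.4)
    y <- 10 + 5*x1 - 2*x2 + rnorm(n, mean = 0, sd = sigma)
    coef(lm(y ~ x1 + x2))["x1"]
  })
  sd(slopes)
}
set.seed(42)
c(n_400 = se_of_slope(400, 0.5), n_1000 = se_of_slope(1000, 0.5))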
Question 3
beta <- coefficients(combined_model)
x <- model.matrix(~percent_of_classes_under_20 + student_faculty_ratio, data = alumni)
message("Multiple Linear Regression Model with Normal Errors:")
## Multiple Linear Regression Model with Normal Errors:
print(beta)
## (Intercept) percent_of_classes_under_20
## 39.6555835 0.1661686
## student_faculty_ratio
## -1.7021103
The multiple linear regression model assumes a linear relationship between the predictors and the response, and that the errors are independent and normally distributed with mean zero and constant variance; we also assume no influential outliers are distorting the overall fit.
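These assumptions can be checked visually with R's built-in diagnostic plots; a quick sketch:

par(mfrow = c(2, 2))   # arrange the four diagnostic panels in a grid
plot(combined_model)   # residuals vs fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(1, 1))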
message("Model Matrix (X):")
## Model Matrix (X):
print(x)
## (Intercept) percent_of_classes_under_20 student_faculty_ratio
## 1 1 39 13
## 2 1 68 8
## 3 1 60 8
## 4 1 65 3
## 5 1 67 10
## 6 1 52 8
## 7 1 45 12
## 8 1 69 7
## 9 1 72 13
## 10 1 61 10
## 11 1 68 8
## 12 1 65 7
## 13 1 54 10
## 14 1 73 8
## 15 1 64 9
## 16 1 55 11
## 17 1 65 6
## 18 1 63 13
## 19 1 66 8
## 20 1 32 19
## 21 1 68 5
## 22 1 62 8
## 23 1 69 7
## 24 1 67 9
## 25 1 56 12
## 26 1 58 17
## 27 1 32 19
## 28 1 42 20
## 29 1 41 18
## 30 1 48 19
## 31 1 45 20
## 32 1 65 4
## 33 1 31 23
## 34 1 29 15
## 35 1 51 15
## 36 1 40 16
## 37 1 53 13
## 38 1 65 7
## 39 1 63 10
## 40 1 53 13
## 41 1 39 21
## 42 1 44 13
## 43 1 37 12
## 44 1 37 13
## 45 1 68 9
## 46 1 59 11
## 47 1 73 7
## 48 1 77 7
## attr(,"assign")
## [1] 0 1 2
beta <- coef(combined_model)
print(beta)
## (Intercept) percent_of_classes_under_20
## 39.6555835 0.1661686
## student_faculty_ratio
## -1.7021103
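Because the model matrix X was built above, the same estimates can be reproduced by hand from the normal equations, beta = (X'X)^(-1) X'y; a minimal sketch (beta_hat is an illustrative name):

y <- alumni$alumni_giving_rate
beta_hat <- solve(t(x) %*% x, t(x) %*% y)  # solve the normal equations for beta
print(beta_hat)                            # identical to coef(combined_model)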
The least squares estimates are 39.656 for the intercept, 0.166 for percent_of_classes_under_20, and -1.702 for student_faculty_ratio.
e. What does it mean for the estimate in part d) to be unbiased?
An estimator is unbiased when its expected value equals the true population parameter. Over repeated samples, the least squares coefficient estimates average out to the true coefficient values, provided the model assumptions hold (in particular, that the errors have mean zero and the model is correctly specified). The estimate from part d) is therefore unbiased: it does not systematically over- or under-estimate the population parameters, even though any single sample's estimate will differ from them.
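A brief simulation sketch illustrates this: averaging the X1 slope estimate over many simulated datasets recovers the true value of 5 (1000 replications and n = 100 are arbitrary illustration choices):

set.seed(99)
slope_estimates <- replicate(1000, {
  x1 <- rnorm(100, mean = 2, sd = 0.1)
  x2 <- rnorm(100, mean = 0, sd = 0.4)
  y <- 10 + 5*x1 - 2*x2 + rnorm(100, mean = 0, sd = 0.5)
  coef(lm(y ~ x1 + x2))["x1"]
})
mean(slope_estimates)  # very close to the true slope of 5: the estimator is unbiased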