##Question 1. (12 points) Alumni Donation Data (Simple Linear Regression). Continue with the same data from homework 1 and fit a simple linear regression on the data, where the alumni giving rate is the response variable Y of interest and the percentage of classes with fewer than 20 students is the predictor variable X.
alumni <- read.csv("alumni.csv")
alumni
## school percent_of_classes_under_20
## 1 Boston College 39
## 2 Brandeis University 68
## 3 Brown University 60
## 4 California Institute of Technology 65
## 5 Carnegie Mellon University 67
## 6 Case Western Reserve Univ. 52
## 7 College of William and Mary 45
## 8 Columbia University 69
## 9 Cornell University 72
## 10 Dartmouth College 61
## 11 Duke University 68
## 12 Emory University 65
## 13 Georgetown University 54
## 14 Harvard University 73
## 15 John Hopkins University 64
## 16 Lehigh University 55
## 17 Massachusetts Inst. of Technology 65
## 18 New York University 63
## 19 Northwestern University 66
## 20 Pennsylvania State Univ. 32
## 21 Princeton University 68
## 22 Rice University 62
## 23 Stanford University 69
## 24 Tufts University 67
## 25 Tulane University 56
## 26 U. of California-Berleley 58
## 27 U. of California-Davis 32
## 28 U. of California-Irvine 42
## 29 U. of California-Los Angeles 41
## 30 U. of California-San Diego 48
## 31 U. of California-Santa Barbara 45
## 32 U. of Chicago 65
## 33 U. of Florida 31
## 34 U. of Illinois-Urbana Champaign 29
## 35 U. of Michigan-Ann Arbor 51
## 36 U. of North Carolina-Chapel Hill 40
## 37 U. of Notre Dame 53
## 38 U. of Pennsylvania 65
## 39 U. of Rochester 63
## 40 U. of Southern California 53
## 41 U. of Texas-Austin 39
## 42 U. of Virginia 44
## 43 U. of Washington 37
## 44 U. of Wisconsin-Madison 37
## 45 Vanderbuilt University 68
## 46 Wake Forest University 59
## 47 Washington University-St. Louis 73
## 48 Yale University 77
## student_faculty_ratio alumni_giving_rate private
## 1 13 25 1
## 2 8 33 1
## 3 8 40 1
## 4 3 46 1
## 5 10 28 1
## 6 8 31 1
## 7 12 27 1
## 8 7 31 1
## 9 13 35 1
## 10 10 53 1
## 11 8 45 1
## 12 7 37 1
## 13 10 29 1
## 14 8 46 1
## 15 9 27 1
## 16 11 40 1
## 17 6 44 1
## 18 13 13 1
## 19 8 30 1
## 20 19 21 0
## 21 5 67 1
## 22 8 40 1
## 23 7 34 1
## 24 9 29 1
## 25 12 17 1
## 26 17 18 0
## 27 19 7 0
## 28 20 9 0
## 29 18 13 0
## 30 19 8 0
## 31 20 12 0
## 32 4 36 1
## 33 23 19 0
## 34 15 23 0
## 35 15 13 0
## 36 16 26 0
## 37 13 49 1
## 38 7 41 1
## 39 10 23 1
## 40 13 22 1
## 41 21 13 0
## 42 13 28 0
## 43 12 12 0
## 44 13 13 0
## 45 9 31 1
## 46 11 38 1
## 47 7 33 1
## 48 7 50 1
linear_model <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
t_test <- summary(linear_model)
t_test
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
##a. What is the estimated slope? Is it significant at the α=0.05 level? Clearly write out the null and alternative hypotheses, observed t-statistic, p-value, and interpret the estimate and test results.
##Estimated slope = 0.6578. The slope is significantly different from 0 at α = 0.05.
##Null hypothesis H0: β1 = 0 (no linear relation between X and Y); alternative Ha: β1 ≠ 0.
##Observed t-statistic: 5.734; p-value: 7.228 x 10^-7.
##Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant (positive) linear relation between the percentage of classes with fewer than 20 students and the alumni giving rate: each additional percentage point of small classes is associated with an estimated 0.6578-point increase in the giving rate.
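##As a quick cross-check (a minimal sketch reusing the linear_model object fit above), the slope estimate, standard error, t-statistic, and p-value can be pulled directly from the coefficient table:
slope_row <- coef(summary(linear_model))["percent_of_classes_under_20", ]
slope_row # Estimate, Std. Error, t value, Pr(>|t|)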
##b. Repeat part a, but test whether the slope significantly differs from 1 at the α=0.1 level.
c0 <- 1       # hypothesized slope under the null, H0: beta1 = 1
m <- 48       # sample size
b1 <- 0.6578  # estimated slope
sb1 <- 0.1147 # standard error of the slope
zb1 <- (b1 - c0)/sb1 # number of standard errors away from the null
print(zb1)
## [1] -2.983435
print('pvalue:')
## [1] "pvalue:"
2 * pt(zb1, m-2) # two-sided p-value on n-2 = 46 df (zb1 is negative, so the lower tail is doubled)
## [1] 0.004549874
##From the above results, the p-value (0.00455) is less than 0.1, so we reject the null hypothesis H0: β1 = 1 in favor of Ha: β1 ≠ 1.
##The slope differs significantly from 1 at α = 0.1 (observed t = -2.98).
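##The same test can be run without hard-coding the rounded estimates, pulling them from the fitted model instead (a minimal sketch reusing the linear_model object from above):
est <- coef(summary(linear_model))["percent_of_classes_under_20", ]
t_stat <- unname((est["Estimate"] - 1) / est["Std. Error"]) # test H0: beta1 = 1
p_val <- 2 * pt(-abs(t_stat), df = df.residual(linear_model)) # two-sided p-value
c(t_statistic = t_stat, p_value = p_val)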
##c.What is the value of R2? Please interpret.
##The R-squared value is 0.4169 (Multiple R-squared; 0.4042 is the adjusted R-squared).
##This means that about 41.7% of the variation in the alumni giving rate is explained by the percentage of classes with fewer than 20 students; the remaining ~58% is unexplained, so the predictor has moderate explanatory power.
##d. What is the correlation coefficient r between X and Y? What is the relationship between r and R2?
cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
## [1] 0.6456504
sqrt(0.4169)
## [1] 0.6456779
##In simple linear regression, R-squared is the square of the correlation coefficient: R² = r², so r = ±sqrt(R²), with the sign matching the sign of the estimated slope.
##Here the slope is positive, so r = sqrt(0.4169) ≈ 0.6457, which matches cor(X, Y) computed above.
##The correlation coefficient r measures the strength and direction of the linear relationship between X and Y. It ranges from -1 to +1: a negative r means the variables are inversely related (as X increases, Y decreases), r = 0 means no linear relationship, and a positive r means they are directly related (as X increases, Y increases).
##Squaring r gives R². If r = -1 or +1, then r² = 1 and the model explains all of the variation in Y with respect to X.
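##A quick numeric check of this relationship (a sketch using the objects defined above):
r <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
R2 <- summary(linear_model)$r.squared
c(r_squared = r^2, R2 = R2)            # the two quantities agree
sign(coef(linear_model)[2]) * sqrt(R2) # recovers r, with the sign taken from the slope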
##e.Plot the training data (i.e., the data used to fit the model) with the fitted regression line and include a 95% (pointwise) confidence band for the mean responses. What do you observe about the confidence band at the point (X¯,Y¯)? Is it narrower or wider compared to the rest?
predictions <- predict(linear_model, interval = "confidence", level = 0.95)
plot(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate,xlab = "Percent of Classes with < 20 Students", ylab = "Alumni Giving Rate", main = "Training Data and Fitted Regression Line")
abline(linear_model, col = "blue")
lines(alumni$percent_of_classes_under_20, predictions[, "fit"], col = "blue")
lines(alumni$percent_of_classes_under_20, predictions[, "upr"], col = "red", lty = 2)
lines(alumni$percent_of_classes_under_20, predictions[, "lwr"], col = "red", lty = 2)
X_bar <- mean(alumni$percent_of_classes_under_20)
Y_bar <- mean(alumni$alumni_giving_rate)
points(X_bar, Y_bar, pch = 19, col = "green")
legend("topleft",
legend = c("Fitted Line", "95% Confidence Band", "Point (X¯, Y¯)"),
col = c("blue", "red", "green"),
lty = c(1, 2, NA), pch = c(NA, NA, 19))
##From the scatter plot we can see that the 95% confidence band for the mean response is narrowest at (X¯, Y¯) and widens toward the extremes of X, so it is narrower at (X¯, Y¯) than at the rest of the points.
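##This can also be checked numerically by comparing the width of the confidence interval at X¯ with the widths at the smallest and largest observed X (a sketch using the fitted model above):
new_x <- data.frame(percent_of_classes_under_20 = c(min(alumni$percent_of_classes_under_20),
                                                    mean(alumni$percent_of_classes_under_20),
                                                    max(alumni$percent_of_classes_under_20)))
ci <- predict(linear_model, newdata = new_x, interval = "confidence", level = 0.95)
ci[, "upr"] - ci[, "lwr"] # band width: smallest at the middle value (X-bar), largest at the extremes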
##Question 2. Assume mean function E(Y|X) = 10 + 5*X. For this exercise, use set.seed(7052) to ensure reproducibility.
#a. Generate data with X ~ N(μ=2, σ=0.1), sample size n=100, and error term ε ~ N(μ=0, σ=0.5).
set.seed(7052)
n <- 100
mu_X <- 2
sigma_X <- 0.1
mu_epsilon <- 0
sigma_epsilon <- 0.5
X <- rnorm(n, mean = mu_X, sd = sigma_X)
epsilon <- rnorm(n, mean = mu_epsilon, sd = sigma_epsilon)
Y <- 10 + 5 * X + epsilon
simulated_data <- data.frame(X, Y)
head(simulated_data)
## X Y
## 1 1.907630 20.46643
## 2 1.949162 20.08051
## 3 2.029797 20.32815
## 4 2.101782 20.39166
## 5 2.000072 20.16090
## 6 1.926412 19.51731
##b. Fit a simple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistic(s), p-value(s), and interpret the estimates and test results. What is fitted model’s MSE?
model <- lm(Y ~ X, data = simulated_data)
summary_model <- summary(model)
summary_model
##
## Call:
## lm(formula = Y ~ X, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## X 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
sse <- sum(model$residuals^2)
mse <- sse/(n-2)
mse
## [1] 0.2032934
##The estimated prediction equation is Ŷ = 9.0218 + 5.5652 X.
##The estimated coefficients are 9.0218 (intercept) and 5.5652 (slope), with standard errors 0.8336 and 0.4155, respectively.
##Both p-values are < 2 x 10^-16; the t-statistics are 10.82 (intercept) and 13.39 (slope). The fitted model's MSE is 0.2033.
##For each coefficient, the null hypothesis is that the coefficient equals zero (X has no effect on Y); the alternative is that the coefficient is not zero (Y depends on X).
##Since both p-values are far below 0.05, the null hypotheses are rejected: both coefficients are significant, and the estimates are close to the true values (intercept 10, slope 5).
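##As a side note, the MSE computed above is just the square of the residual standard error reported by summary(); a one-line check (a sketch using the model object above):
sigma(model)^2 # residual standard error squared = SSE / (n - 2) = MSE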
##c.Repeat part b), but re-simulate the data and change the error term to ϵ∼N(0,σ=1)
set.seed(7052)
n1 <- 100
mu_X1 <- 2
sigma_X1 <- 0.1
mu_epsilon1 <- 0
sigma_epsilon1 <- 1
X <- rnorm(n1, mean = mu_X1, sd = sigma_X1)
epsilon1 <- rnorm(n1, mean = mu_epsilon1, sd = sigma_epsilon1)
Y <- 10 + 5 * X + epsilon1
simulated_data1 <- data.frame(X, Y)
head(simulated_data1)
## X Y
## 1 1.907630 21.39470
## 2 1.949162 20.41522
## 3 2.029797 20.50732
## 4 2.101782 20.27442
## 5 2.000072 20.32145
## 6 1.926412 19.40257
model1 <- lm(Y ~ X, data = simulated_data1)
summary_model1 <- summary(model1)
summary_model1
##
## Call:
## lm(formula = Y ~ X, data = simulated_data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4146 -0.6058 0.0186 0.6066 2.7090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0436 1.6673 4.824 5.16e-06 ***
## X 6.1303 0.8309 7.378 5.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9018 on 98 degrees of freedom
## Multiple R-squared: 0.3571, Adjusted R-squared: 0.3505
## F-statistic: 54.43 on 1 and 98 DF, p-value: 5.253e-11
sse1 <- sum(model1$residuals^2)
mse1 <- sse1/(n1-2)
mse1
## [1] 0.8131737
##The estimated prediction equation is Ŷ = 8.0436 + 6.1303 X.
##The estimated coefficients are 8.0436 (intercept) and 6.1303 (slope), with standard errors 1.6673 and 0.8309, respectively.
##The p-values are 5.16 x 10^-6 (intercept) and 5.25 x 10^-11 (slope); the t-statistics are 4.824 and 7.378. The fitted model's MSE is 0.8132.
##As in part b, the null hypothesis for each coefficient is that it equals zero (X has no effect on Y) and the alternative is that it is not zero.
##Since both p-values are well below 0.05, the null hypotheses are rejected; with the larger error variance, however, the standard errors roughly doubled and the estimates are farther from the true values (intercept 10, slope 5).
##d. Repeat parts a)–c) using n=400. What do you conclude? What is the effect on the model parameter estimates when error variance gets smaller? What is the effect when sample size gets bigger?
set.seed(7052)
n2 <- 400
mu_X2 <- 2
sigma_X2 <- 0.1
mu_epsilon2 <- 0
sigma_epsilon2 <- 0.5
X <- rnorm(n2, mean = mu_X2, sd = sigma_X2)
epsilon2 <- rnorm(n2, mean = mu_epsilon2, sd = sigma_epsilon2)
Y <- 10 + 5 * X + epsilon2
simulated_data2 <- data.frame(X, Y)
head(simulated_data2)
## X Y
## 1 1.907630 18.94756
## 2 1.949162 19.78539
## 3 2.029797 20.37686
## 4 2.101782 21.28714
## 5 2.000072 19.61935
## 6 1.926412 19.60554
model2 <- lm(Y ~ X, data = simulated_data2)
summary_model2 <- summary(model2)
summary_model2
##
## Call:
## lm(formula = Y ~ X, data = simulated_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76214 -0.33740 0.03615 0.32077 1.63021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7466 0.5015 19.44 <2e-16 ***
## X 5.1177 0.2490 20.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4887 on 398 degrees of freedom
## Multiple R-squared: 0.5149, Adjusted R-squared: 0.5137
## F-statistic: 422.5 on 1 and 398 DF, p-value: < 2.2e-16
sse2 <- sum(model2$residuals^2)
mse2 <- sse2/(n2-2)
mse2
## [1] 0.2388269
##The estimated prediction equation is Ŷ = 9.7466 + 5.1177 X.
##The estimated coefficients are 9.7466 (intercept) and 5.1177 (slope), with standard errors 0.5015 and 0.2490, respectively.
##Both p-values are < 2 x 10^-16; the t-statistics are 19.44 (intercept) and 20.55 (slope). The fitted model's MSE is 0.2388.
##As before, the null hypothesis for each coefficient is that it equals zero (X has no effect on Y) and the alternative is that it is not zero.
##Since both p-values are well below 0.05, the null hypotheses are rejected; with the larger sample, the standard errors are smaller and the estimates are closer to the true values (intercept 10, slope 5).
set.seed(7052)
n3 <- 400
mu_X3 <- 2
sigma_X3 <- 0.1
mu_epsilon3 <- 0
sigma_epsilon3 <- 1
X <- rnorm(n3, mean = mu_X3, sd = sigma_X3)
epsilon3 <- rnorm(n3, mean = mu_epsilon3, sd = sigma_epsilon3)
Y <- 10 + 5 * X + epsilon3
simulated_data3 <- data.frame(X, Y)
head(simulated_data3)
## X Y
## 1 1.907630 18.35697
## 2 1.949162 19.82497
## 3 2.029797 20.60473
## 4 2.101782 22.06537
## 5 2.000072 19.23834
## 6 1.926412 19.57903
model3 <- lm(Y ~ X, data = simulated_data3)
summary_model3 <- summary(model3)
summary_model3
##
## Call:
## lm(formula = Y ~ X, data = simulated_data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5243 -0.6748 0.0723 0.6415 3.2604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4933 1.0029 9.466 <2e-16 ***
## X 5.2355 0.4979 10.514 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9774 on 398 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2154
## F-statistic: 110.5 on 1 and 398 DF, p-value: < 2.2e-16
sse3 <- sum(model3$residuals^2)
mse3 <- sse3/(n3-2)
mse3
## [1] 0.9553077
##The estimated prediction equation is Ŷ = 9.4933 + 5.2355 X (the error term is not part of the fitted prediction equation).
##The estimated coefficients are 9.4933 (intercept) and 5.2355 (slope), with standard errors 1.0029 and 0.4979, respectively.
##Both p-values are < 2 x 10^-16; the t-statistics are 9.466 (intercept) and 10.514 (slope). The fitted model's MSE is 0.9553.
##As before, the null hypothesis for each coefficient is that it equals zero (X has no effect on Y) and the alternative is that it is not zero.
##Since both p-values are well below 0.05, the null hypotheses are rejected.
##Holding the sample size fixed, increasing the error standard deviation from 0.5 to 1 roughly doubled the standard errors of both coefficients and pushed the estimates farther from the true values (intercept 10, slope 5); the distortion was much larger for the small sample (n=100) than for n=400, where the estimates changed only by small amounts.
##When the sample size increased from 100 to 400 (for a given error variance), the standard errors shrank and the estimates moved closer to the true parameter values. Smaller error variance and larger sample size both make the estimates more precise; see the comparison below.
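##The pattern is easiest to see by collecting the slope estimates and standard errors from the four fits side by side (a sketch reusing the model objects fit above):
fits <- list(n100_sd0.5 = model, n100_sd1 = model1, n400_sd0.5 = model2, n400_sd1 = model3)
t(sapply(fits, function(m) coef(summary(m))["X", c("Estimate", "Std. Error")]))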
##e. What about the MSE from each model?
mse
## [1] 0.2032934
mse1
## [1] 0.8131737
mse2
## [1] 0.2388269
mse3
## [1] 0.9553077
##Doubling the error standard deviation from 0.5 to 1 roughly quadrupled the MSE at both sample sizes (about 0.20 vs 0.81 for n=100, and 0.24 vs 0.96 for n=400). This is expected, because MSE estimates the error variance σ², which goes from 0.25 to 1.
##Increasing the sample size, by contrast, did not systematically reduce the MSE: MSE estimates σ² rather than the estimation error of the coefficients, so a larger n only makes that estimate more stable, not smaller. The comparison below summarizes this.
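##A compact comparison of the four MSEs against the true error variances (a sketch reusing the values computed above):
data.frame(n = c(100, 100, 400, 400),
           error_sd = c(0.5, 1, 0.5, 1),
           true_error_var = c(0.25, 1, 0.25, 1),
           mse = c(mse, mse1, mse2, mse3))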