##Question 1. (12 points) Alumni Donation Data (Simple Linear Regression). Continue with the same data from homework 1 and fit a simple linear regression on the data, where the alumni giving rate is the response variable Y of interest and the percentage of classes with fewer than 20 students is the predictor variable X.
alumni <- read.csv("alumni.csv")
alumni
## school percent_of_classes_under_20
## 1 Boston College 39
## 2 Brandeis University 68
## 3 Brown University 60
## 4 California Institute of Technology 65
## 5 Carnegie Mellon University 67
## 6 Case Western Reserve Univ. 52
## 7 College of William and Mary 45
## 8 Columbia University 69
## 9 Cornell University 72
## 10 Dartmouth College 61
## 11 Duke University 68
## 12 Emory University 65
## 13 Georgetown University 54
## 14 Harvard University 73
## 15 John Hopkins University 64
## 16 Lehigh University 55
## 17 Massachusetts Inst. of Technology 65
## 18 New York University 63
## 19 Northwestern University 66
## 20 Pennsylvania State Univ. 32
## 21 Princeton University 68
## 22 Rice University 62
## 23 Stanford University 69
## 24 Tufts University 67
## 25 Tulane University 56
## 26 U. of California-Berleley 58
## 27 U. of California-Davis 32
## 28 U. of California-Irvine 42
## 29 U. of California-Los Angeles 41
## 30 U. of California-San Diego 48
## 31 U. of California-Santa Barbara 45
## 32 U. of Chicago 65
## 33 U. of Florida 31
## 34 U. of Illinois-Urbana Champaign 29
## 35 U. of Michigan-Ann Arbor 51
## 36 U. of North Carolina-Chapel Hill 40
## 37 U. of Notre Dame 53
## 38 U. of Pennsylvania 65
## 39 U. of Rochester 63
## 40 U. of Southern California 53
## 41 U. of Texas-Austin 39
## 42 U. of Virginia 44
## 43 U. of Washington 37
## 44 U. of Wisconsin-Madison 37
## 45 Vanderbuilt University 68
## 46 Wake Forest University 59
## 47 Washington University-St. Louis 73
## 48 Yale University 77
## student_faculty_ratio alumni_giving_rate private
## 1 13 25 1
## 2 8 33 1
## 3 8 40 1
## 4 3 46 1
## 5 10 28 1
## 6 8 31 1
## 7 12 27 1
## 8 7 31 1
## 9 13 35 1
## 10 10 53 1
## 11 8 45 1
## 12 7 37 1
## 13 10 29 1
## 14 8 46 1
## 15 9 27 1
## 16 11 40 1
## 17 6 44 1
## 18 13 13 1
## 19 8 30 1
## 20 19 21 0
## 21 5 67 1
## 22 8 40 1
## 23 7 34 1
## 24 9 29 1
## 25 12 17 1
## 26 17 18 0
## 27 19 7 0
## 28 20 9 0
## 29 18 13 0
## 30 19 8 0
## 31 20 12 0
## 32 4 36 1
## 33 23 19 0
## 34 15 23 0
## 35 15 13 0
## 36 16 26 0
## 37 13 49 1
## 38 7 41 1
## 39 10 23 1
## 40 13 22 1
## 41 21 13 0
## 42 13 28 0
## 43 12 12 0
## 44 13 13 0
## 45 9 31 1
## 46 11 38 1
## 47 7 33 1
## 48 7 50 1
linear_model <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
t_test <- summary(linear_model)
t_test
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
## data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
##a. What is the estimated slope? Is it significant at the α=0.05 level? Clearly write out the null and alternative hypotheses, observed t-statistic, p-value, and interpret the estimate and test results.
##Estimated slope = 0.6578. The slope is significantly different from 0 at α = 0.05.
##Null hypothesis H0: β1 = 0 (no linear relation between X and Y); alternative Ha: β1 ≠ 0.
##Observed t-statistic: 5.734; p-value: 7.228 x 10^-7.
##Since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a significant (positive) linear relation between the percentage of classes with fewer than 20 students and the alumni giving rate: each additional percentage point of small classes is associated with an estimated 0.6578-point increase in the giving rate.
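##As a quick cross-check (a minimal sketch reusing the linear_model object fit above), the slope estimate, standard error, t-statistic, and p-value can be pulled directly from the coefficient table:
slope_row <- coef(summary(linear_model))["percent_of_classes_under_20", ]
slope_row # Estimate, Std. Error, t value, Pr(>|t|)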
##b. Repeat part a, but test whether the slope significantly differs from 1 at the α=0.1 level.
c0 <- 1       # hypothesized slope under the null, H0: beta1 = 1
m <- 48       # sample size
b1 <- 0.6578  # estimated slope
sb1 <- 0.1147 # standard error of the slope
zb1 <- (b1 - c0)/sb1 # number of standard errors away from the null
print(zb1)
## [1] -2.983435
print('pvalue:')
## [1] "pvalue:"
2 * pt(zb1, m-2) # two-sided p-value on n-2 = 46 df (zb1 is negative, so the lower tail is doubled)
## [1] 0.004549874
##From the above results, the p-value (0.00455) is less than 0.1, so we reject the null hypothesis H0: β1 = 1 in favor of Ha: β1 ≠ 1.
##The slope differs significantly from 1 at α = 0.1 (observed t = -2.98).
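##The same test can be run without hard-coding the rounded estimates, pulling them from the fitted model instead (a minimal sketch reusing the linear_model object from above):
est <- coef(summary(linear_model))["percent_of_classes_under_20", ]
t_stat <- unname((est["Estimate"] - 1) / est["Std. Error"]) # test H0: beta1 = 1
p_val <- 2 * pt(-abs(t_stat), df = df.residual(linear_model)) # two-sided p-value
c(t_statistic = t_stat, p_value = p_val)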
##c.What is the value of R2? Please interpret.
##The R-squared value is 0.4169 (Multiple R-squared; 0.4042 is the adjusted R-squared).
##This means that about 41.7% of the variation in the alumni giving rate is explained by the percentage of classes with fewer than 20 students; the remaining ~58% is unexplained, so the predictor has moderate explanatory power.
##d. What is the correlation coefficient r between X and Y? What is the relationship between r and R2?
cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
## [1] 0.6456504
sqrt(0.4169)
## [1] 0.6456779
##In simple linear regression, R-squared is the square of the correlation coefficient: R² = r², so r = ±sqrt(R²), with the sign matching the sign of the estimated slope.
##Here the slope is positive, so r = sqrt(0.4169) ≈ 0.6457, which matches cor(X, Y) computed above.
##The correlation coefficient r measures the strength and direction of the linear relationship between X and Y. It ranges from -1 to +1: a negative r means the variables are inversely related (as X increases, Y decreases), r = 0 means no linear relationship, and a positive r means they are directly related (as X increases, Y increases).
##Squaring r gives R². If r = -1 or +1, then r² = 1 and the model explains all of the variation in Y with respect to X.
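##A quick numeric check of this relationship (a sketch using the objects defined above):
r <- cor(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate)
R2 <- summary(linear_model)$r.squared
c(r_squared = r^2, R2 = R2)            # the two quantities agree
sign(coef(linear_model)[2]) * sqrt(R2) # recovers r, with the sign taken from the slope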
##e.Plot the training data (i.e., the data used to fit the model) with the fitted regression line and include a 95% (pointwise) confidence band for the mean responses. What do you observe about the confidence band at the point (X¯,Y¯)? Is it narrower or wider compared to the rest?
predictions <- predict(linear_model, interval = "confidence", level = 0.95)
plot(alumni$percent_of_classes_under_20, alumni$alumni_giving_rate,xlab = "Percent of Classes with < 20 Students", ylab = "Alumni Giving Rate", main = "Training Data and Fitted Regression Line")
abline(linear_model, col = "blue")
lines(alumni$percent_of_classes_under_20, predictions[, "fit"], col = "blue")
lines(alumni$percent_of_classes_under_20, predictions[, "upr"], col = "red", lty = 2)
lines(alumni$percent_of_classes_under_20, predictions[, "lwr"], col = "red", lty = 2)
X_bar <- mean(alumni$percent_of_classes_under_20)
Y_bar <- mean(alumni$alumni_giving_rate)
points(X_bar, Y_bar, pch = 19, col = "green")
legend("topleft",
legend = c("Fitted Line", "95% Confidence Band", "Point (X¯, Y¯)"),
col = c("blue", "red", "green"),
lty = c(1, 2, NA), pch = c(NA, NA, 19))
##From the scatter plot we can see that the 95% confidence band for the mean response is narrowest at (X¯, Y¯) and widens toward the extremes of X, so it is narrower at (X¯, Y¯) than at the rest of the points.
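##This can also be checked numerically by comparing the width of the confidence interval at X¯ with the widths at the smallest and largest observed X (a sketch using the fitted model above):
new_x <- data.frame(percent_of_classes_under_20 = c(min(alumni$percent_of_classes_under_20),
                                                    mean(alumni$percent_of_classes_under_20),
                                                    max(alumni$percent_of_classes_under_20)))
ci <- predict(linear_model, newdata = new_x, interval = "confidence", level = 0.95)
ci[, "upr"] - ci[, "lwr"] # band width: smallest at the middle value (X-bar), largest at the extremes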
##Question 2. Assume mean function E(Y|X) = 10 + 5*X. For this exercise, use set.seed(7052) to ensure reproducibility.
#a. Generate data with X ~ N(μ=2, σ=0.1), sample size n=100, and error term ε ~ N(μ=0, σ=0.5).
set.seed(7052)
n <- 100
mu_X <- 2
sigma_X <- 0.1
mu_epsilon <- 0
sigma_epsilon <- 0.5
X <- rnorm(n, mean = mu_X, sd = sigma_X)
epsilon <- rnorm(n, mean = mu_epsilon, sd = sigma_epsilon)
Y <- 10 + 5 * X + epsilon
simulated_data <- data.frame(X, Y)
head(simulated_data)
## X Y
## 1 1.907630 20.46643
## 2 1.949162 20.08051
## 3 2.029797 20.32815
## 4 2.101782 20.39166
## 5 2.000072 20.16090
## 6 1.926412 19.51731
##b. Fit a simple linear regression to the simulated data from part a. What is the estimated prediction equation? Report the estimated coefficients and their standard errors. Are they significant? Clearly write out the null and alternative hypotheses, observed t-statistic(s), p-value(s), and interpret the estimates and test results. What is fitted model’s MSE?
model <- lm(Y ~ X, data = simulated_data)
summary_model <- summary(model)
summary_model
##
## Call:
## lm(formula = Y ~ X, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## X 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
sse <- sum(model$residuals^2)
mse <- sse/(n-2)
mse
## [1] 0.2032934
##The estimated prediction equation is Ŷ = 9.0218 + 5.5652 X.
##The estimated coefficients are 9.0218 (intercept) and 5.5652 (slope), with standard errors 0.8336 and 0.4155, respectively.
##Both p-values are < 2 x 10^-16; the t-statistics are 10.82 (intercept) and 13.39 (slope). The fitted model's MSE is 0.2033.
##For each coefficient, the null hypothesis is that the coefficient equals zero (X has no effect on Y); the alternative is that the coefficient is not zero (Y depends on X).
##Since both p-values are far below 0.05, the null hypotheses are rejected: both coefficients are significant, and the estimates are close to the true values (intercept 10, slope 5).
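##As a side note, the MSE computed above is just the square of the residual standard error reported by summary(); a one-line check (a sketch using the model object above):
sigma(model)^2 # residual standard error squared = SSE / (n - 2) = MSE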
##c.Repeat part b), but re-simulate the data and change the error term to ϵ∼N(0,σ=1)
set.seed(7052)
n1 <- 100
mu_X1 <- 2
sigma_X1 <- 0.1
mu_epsilon1 <- 0
sigma_epsilon1 <- 1
X <- rnorm(n1, mean = mu_X1, sd = sigma_X1)
epsilon1 <- rnorm(n1, mean = mu_epsilon1, sd = sigma_epsilon1)
Y <- 10 + 5 * X + epsilon1
simulated_data1 <- data.frame(X, Y)
head(simulated_data1)
## X Y
## 1 1.907630 21.39470
## 2 1.949162 20.41522
## 3 2.029797 20.50732
## 4 2.101782 20.27442
## 5 2.000072 20.32145
## 6 1.926412 19.40257
model1 <- lm(Y ~ X, data = simulated_data1)
summary_model1 <- summary(model1)
summary_model1
##
## Call:
## lm(formula = Y ~ X, data = simulated_data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4146 -0.6058 0.0186 0.6066 2.7090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0436 1.6673 4.824 5.16e-06 ***
## X 6.1303 0.8309 7.378 5.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9018 on 98 degrees of freedom
## Multiple R-squared: 0.3571, Adjusted R-squared: 0.3505
## F-statistic: 54.43 on 1 and 98 DF, p-value: 5.253e-11
sse1 <- sum(model1$residuals^2)
mse1 <- sse1/(n1-2)
mse1
## [1] 0.8131737
##The estimated prediction equation is Ŷ = 8.0436 + 6.1303 X.
##The estimated coefficients are 8.0436 (intercept) and 6.1303 (slope), with standard errors 1.6673 and 0.8309, respectively.
##The p-values are 5.16 x 10^-6 (intercept) and 5.25 x 10^-11 (slope); the t-statistics are 4.824 and 7.378. The fitted model's MSE is 0.8132.
##As in part b, the null hypothesis for each coefficient is that it equals zero (X has no effect on Y) and the alternative is that it is not zero.
##Since both p-values are well below 0.05, the null hypotheses are rejected; with the larger error variance, however, the standard errors roughly doubled and the estimates are farther from the true values (intercept 10, slope 5).
##d. Repeat parts a)–c) using n=400. What do you conclude? What is the effect on the model parameter estimates when error variance gets smaller? What is the effect when sample size gets bigger?
set.seed(7052)
n2 <- 400
mu_X2 <- 2
sigma_X2 <- 0.1
mu_epsilon2 <- 0
sigma_epsilon2 <- 0.5
X <- rnorm(n2, mean = mu_X2, sd = sigma_X2)
epsilon2 <- rnorm(n2, mean = mu_epsilon2, sd = sigma_epsilon2)
Y <- 10 + 5 * X + epsilon2
simulated_data2 <- data.frame(X, Y)
head(simulated_data2)
## X Y
## 1 1.907630 18.94756
## 2 1.949162 19.78539
## 3 2.029797 20.37686
## 4 2.101782 21.28714
## 5 2.000072 19.61935
## 6 1.926412 19.60554
model2 <- lm(Y ~ X, data = simulated_data2)
summary_model2 <- summary(model2)
summary_model2
##
## Call:
## lm(formula = Y ~ X, data = simulated_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76214 -0.33740 0.03615 0.32077 1.63021
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7466 0.5015 19.44 <2e-16 ***
## X 5.1177 0.2490 20.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4887 on 398 degrees of freedom
## Multiple R-squared: 0.5149, Adjusted R-squared: 0.5137
## F-statistic: 422.5 on 1 and 398 DF, p-value: < 2.2e-16
sse2 <- sum(model2$residuals^2)
mse2 <- sse2/(n2-2)
mse2
## [1] 0.2388269
##The estimated prediction equation is Ŷ = 9.7466 + 5.1177 X.
##The estimated coefficients are 9.7466 (intercept) and 5.1177 (slope), with standard errors 0.5015 and 0.2490, respectively.
##Both p-values are < 2 x 10^-16; the t-statistics are 19.44 (intercept) and 20.55 (slope). The fitted model's MSE is 0.2388.
##As before, the null hypothesis for each coefficient is that it equals zero (X has no effect on Y) and the alternative is that it is not zero.
##Since both p-values are well below 0.05, the null hypotheses are rejected; with the larger sample, the standard errors are smaller and the estimates are closer to the true values (intercept 10, slope 5).
set.seed(7052)
n3 <- 400
mu_X3 <- 2
sigma_X3 <- 0.1
mu_epsilon3 <- 0
sigma_epsilon3 <- 1
X <- rnorm(n3, mean = mu_X3, sd = sigma_X3)
epsilon3 <- rnorm(n3, mean = mu_epsilon3, sd = sigma_epsilon3)
Y <- 10 + 5 * X + epsilon3
simulated_data3 <- data.frame(X, Y)
head(simulated_data3)
## X Y
## 1 1.907630 18.35697
## 2 1.949162 19.82497
## 3 2.029797 20.60473
## 4 2.101782 22.06537
## 5 2.000072 19.23834
## 6 1.926412 19.57903
model3 <- lm(Y ~ X, data = simulated_data3)
summary_model3 <- summary(model3)
summary_model3
##
## Call:
## lm(formula = Y ~ X, data = simulated_data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5243 -0.6748 0.0723 0.6415 3.2604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4933 1.0029 9.466 <2e-16 ***
## X 5.2355 0.4979 10.514 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9774 on 398 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2154
## F-statistic: 110.5 on 1 and 398 DF, p-value: < 2.2e-16
sse3 <- sum(model3$residuals^2)
mse3 <- sse3/(n3-2)
mse3
## [1] 0.9553077
##The estimated prediction equation is Ŷ = 9.4933 + 5.2355 X (the error term is not part of the fitted prediction equation).
##The estimated coefficients are 9.4933 (intercept) and 5.2355 (slope), with standard errors 1.0029 and 0.4979, respectively.
##Both p-values are < 2 x 10^-16; the t-statistics are 9.466 (intercept) and 10.514 (slope). The fitted model's MSE is 0.9553.
##As before, the null hypothesis for each coefficient is that it equals zero (X has no effect on Y) and the alternative is that it is not zero.
##Since both p-values are well below 0.05, the null hypotheses are rejected.
##Holding the sample size fixed, increasing the error standard deviation from 0.5 to 1 roughly doubled the standard errors of both coefficients and pushed the estimates farther from the true values (intercept 10, slope 5); the distortion was much larger for the small sample (n=100) than for n=400, where the estimates changed only by small amounts.
##When the sample size increased from 100 to 400 (for a given error variance), the standard errors shrank and the estimates moved closer to the true parameter values. Smaller error variance and larger sample size both make the estimates more precise; see the comparison below.
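##The pattern is easiest to see by collecting the slope estimates and standard errors from the four fits side by side (a sketch reusing the model objects fit above):
fits <- list(n100_sd0.5 = model, n100_sd1 = model1, n400_sd0.5 = model2, n400_sd1 = model3)
t(sapply(fits, function(m) coef(summary(m))["X", c("Estimate", "Std. Error")]))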
##e. What about the MSE from each model?
mse
## [1] 0.2032934
mse1
## [1] 0.8131737
mse2
## [1] 0.2388269
mse3
## [1] 0.9553077
##Doubling the error standard deviation from 0.5 to 1 roughly quadrupled the MSE at both sample sizes (about 0.20 vs 0.81 for n=100, and 0.24 vs 0.96 for n=400). This is expected, because MSE estimates the error variance σ², which goes from 0.25 to 1.
##Increasing the sample size, by contrast, did not systematically reduce the MSE: MSE estimates σ² rather than the estimation error of the coefficients, so a larger n only makes that estimate more stable, not smaller. The comparison below summarizes this.
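##A compact comparison of the four MSEs against the true error variances (a sketch reusing the values computed above):
data.frame(n = c(100, 100, 400, 400),
           error_sd = c(0.5, 1, 0.5, 1),
           true_error_var = c(0.25, 1, 0.25, 1),
           mse = c(mse, mse1, mse2, mse3))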