Data Loading and Initial Exploration

##     school          percent_of_classes_under_20 student_faculty_ratio
##  Length:48          Min.   :29.00               Min.   : 3.00        
##  Class :character   1st Qu.:44.75               1st Qu.: 8.00        
##  Mode  :character   Median :59.50               Median :10.50        
##                     Mean   :55.73               Mean   :11.54        
##                     3rd Qu.:66.25               3rd Qu.:13.50        
##                     Max.   :77.00               Max.   :23.00        
##  alumni_giving_rate    private      
##  Min.   : 7.00      Min.   :0.0000  
##  1st Qu.:18.75      1st Qu.:0.0000  
##  Median :29.00      Median :1.0000  
##  Mean   :29.27      Mean   :0.6875  
##  3rd Qu.:38.50      3rd Qu.:1.0000  
##  Max.   :67.00      Max.   :1.0000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   44.75   59.50   55.73   66.25   77.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   18.75   29.00   29.27   38.50   67.00

Question 1 ———————————————————————-

1a.

The fitted least-squares regression equation for predicting the alumni giving rate (Y) from the percent of classes under 20 students (X1) and the student-faculty ratio (X2) is:
Y = 39.6556 + 0.1662*X1 - 1.7021*X2.
The residuals range from -15.00 to 24.56, so a few schools are fit poorly at the extremes. The middle half of the residuals lies between -6.57 and 4.42, with a median of -1.95, which supports this as an acceptable, though not tight, fit.

The significance codes indicate that the student-faculty ratio is a statistically significant predictor of the alumni giving rate (p = 0.00037), while the percent of classes under 20 students is not (p = 0.312) once the student-faculty ratio is in the model. The model explains 56.13% of the variation in the giving rate (R² = 0.5613), leaving a substantial share unexplained. The low p-value of the overall F-test (8.869e-09) confirms that the model as a whole is statistically significant.

## 
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 + 
##     student_faculty_ratio, data = alumni)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.00  -6.57  -1.95   4.42  24.56 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  39.6556    13.5076   2.936 0.005225 ** 
## percent_of_classes_under_20   0.1662     0.1626   1.022 0.312128    
## student_faculty_ratio        -1.7021     0.4421  -3.850 0.000371 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.098 on 45 degrees of freedom
## Multiple R-squared:  0.5613, Adjusted R-squared:  0.5418 
## F-statistic: 28.79 on 2 and 45 DF,  p-value: 8.869e-09

1b.

## Predicted Alumni Giving Rate: 30.94291

For a school where 50% of classes have fewer than 20 students and the student-faculty ratio is 10, the model predicts that about 30.94% of alumni donate to the college.
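
The prediction can be checked by hand by plugging the two values into the fitted equation. The Python sketch below does that arithmetic with the coefficients from the 1c table. It is independent of the R session that produced the output above, and the helper name `predict_giving_rate` is an assumption made for illustration.

```python
# Coefficients copied from the regression output (1c table).
INTERCEPT = 39.6555835
B_CLASSES_UNDER_20 = 0.1661686
B_STUDENT_FACULTY = -1.7021103


def predict_giving_rate(pct_classes_under_20, student_faculty_ratio):
    """Hypothetical helper: evaluate the fitted equation at one point."""
    return (INTERCEPT
            + B_CLASSES_UNDER_20 * pct_classes_under_20
            + B_STUDENT_FACULTY * student_faculty_ratio)


# 50% of classes under 20 students, student-faculty ratio of 10.
predicted = predict_giving_rate(50, 10)
print(round(predicted, 5))
```

The result should agree with the printed predicted giving rate up to rounding of the coefficients.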

1c.

##                               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                 39.6555835 13.5075774  2.935803 0.0052247868
## percent_of_classes_under_20  0.1661686  0.1625520  1.022249 0.3121275033
## student_faculty_ratio       -1.7021103  0.4421271 -3.849821 0.0003709425

• Null Hypothesis: Each regression coefficient (slope) is equal to zero
If the p-value for a coefficient is less than 0.05, we reject the null hypothesis and conclude that the predictor has a statistically significant relationship with the response variable. Each null hypothesis is tested at α = 0.05.
In testing the statistical significance of the regression coefficients with a t-test, the null hypothesis for each predictor is that its coefficient is zero, meaning that the predictor has no linear effect on the alumni giving rate once the other predictor is in the model.
The intercept’s p-value of 0.005 is below the 0.05 threshold, so the intercept differs significantly from zero; this says nothing about whether either predictor matters. The p-value for the percent of classes under 20 is 0.31, which is above 0.05, so we fail to reject the null hypothesis: there is no evidence that this variable affects giving rates after accounting for the student-faculty ratio. However, the student-faculty ratio’s p-value of 0.00037 is less than 0.05, which leads us to reject the null hypothesis and conclude that there is a statistically significant relationship between the student-faculty ratio and the alumni giving rate.
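
Each t value in the table is the estimate divided by its standard error, and the decision at α = 0.05 can be reproduced by comparing |t| with a critical value. The Python sketch below illustrates this. The critical value of about 2.014 (two-sided, 45 degrees of freedom) is an assumption taken from a standard t table rather than from the output.

```python
# (estimate, standard error) pairs copied from the coefficient table.
coefficients = {
    "(Intercept)": (39.6555835, 13.5075774),
    "percent_of_classes_under_20": (0.1661686, 0.1625520),
    "student_faculty_ratio": (-1.7021103, 0.4421271),
}

# Assumption: two-sided 5% critical value of the t distribution with
# 45 degrees of freedom, read from a standard t table.
T_CRITICAL = 2.014

t_values = {}
for name, (estimate, std_error) in coefficients.items():
    t = estimate / std_error          # t statistic = estimate / SE
    t_values[name] = t
    decision = "reject H0" if abs(t) > T_CRITICAL else "fail to reject H0"
    print(f"{name}: t = {t:.3f} -> {decision}")
```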

1d.

## F-statistic: 28.79264
## Degrees of Freedom: 2 / 45
## P-value: 8.868867e-09

• Null Hypothesis: The model with predictors does not improve prediction over a model with no predictors
If the p-value is less than 0.05, we reject the null hypothesis and conclude that the model is statistically significant as a whole.

For the F-test, the null hypothesis assumes that all slope coefficients are zero, indicating no impact on alumni giving rates. The F-statistic is 28.79 with 2 and 45 degrees of freedom: 2 for the predictors and 45 for the residuals, which corresponds to the 48 schools in the data (48 - 2 - 1 = 45). A value this large implies that the combined effect of the percent of classes under 20 and the student-faculty ratio explains a substantial portion of the variation in the giving rate.

The p-value is 8.868867e-09, which is effectively zero, indicating that the overall model is statistically significant. Since this p-value is well below 0.05, we reject the null hypothesis and conclude that the variables together significantly influence the giving rate.
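
The overall F-statistic is tied to R² and the degrees of freedom, so it can be recomputed from values already printed in this report. The Python sketch below shows that relationship as a cross-check; the variable names are illustrative.

```python
r_squared = 0.5613406   # R-squared value printed in the output
num_df = 2              # numerator df: number of predictors
den_df = 45             # denominator df: residual degrees of freedom

# Explained variation per predictor divided by unexplained variation
# per residual degree of freedom.
f_statistic = (r_squared / num_df) / ((1 - r_squared) / den_df)

# Sample size implied by the degrees of freedom:
# predictors + residual df + 1 for the intercept.
n_observations = num_df + den_df + 1

print(round(f_statistic, 2), n_observations)
```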

1e

## R-Squared Value: 0.5613406

The R^2 value indicates the proportion of the variance in the response variable (alumni giving rate) explained by the predictors. A higher R^2 value means the model fits the observed data more closely.

The coefficient of determination is 0.5613406, meaning that 56.13% of the variation in the alumni giving rate is explained jointly by the percent of classes under 20 students and the student faculty ratio.
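
Because R² never decreases when a predictor is added, the adjusted R² in the model summary is a useful companion figure. The Python sketch below recomputes it from R², the sample size (48, from the data summary), and the number of predictors; the variable names are illustrative.

```python
r_squared = 0.5613406
n_observations = 48   # "Length:48" in the data summary
n_predictors = 2

# Adjusted R^2 penalizes R^2 for the number of predictors used.
adjusted_r_squared = 1 - (1 - r_squared) * (n_observations - 1) / (
    n_observations - n_predictors - 1
)
print(round(adjusted_r_squared, 4))
```

The result should match the adjusted R-squared reported in the model summary for 1a, up to rounding.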

1f

## Correlation coefficient between percent_of_classes_under_20 and alumni_giving_rate (r_1): 0.6456504
## Correlation coefficient between student_faculty_ratio and alumni_giving_rate (r_2): -0.7423975

The correlation coefficients r1 and r2 show the strength and direction of the linear relationship between each predictor and the response variable. R^2 represents the combined explanatory power of both predictors and is related to r1 and r2, although not by a simple sum when the predictors are correlated with each other.

The correlation coefficient between percent of classes under 20 and alumni giving rate (r1) is 0.6456504. As a moderately strong positive value, it indicates a positive linear relationship between percent of classes under 20 and giving rate. The correlation coefficient between student faculty ratio and alumni giving rate (r2) is -0.7423975.
As a strong negative value, it indicates a strong negative linear relationship between student faculty ratio and alumni giving rate. The coefficient of determination (R^2) is 0.5613 and, as stated before, gives the share of the variation in the giving rate explained by the two predictors together. If the two predictors were uncorrelated with each other, R^2 would equal (r1)^2 + (r2)^2 = 0.4169 + 0.5512 = 0.9680 (the square root of this sum is 0.983879). The observed R^2 of 0.5613 is far below that and only slightly above (r2)^2 = 0.5512, which indicates that the two predictors are correlated with each other and explain largely the same variation. This is consistent with percent of classes under 20 not being significant once student faculty ratio is in the model. Checking the correlation between the two predictors would confirm this.
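
For two predictors, R² can be written in terms of the three pairwise correlations, which makes the role of the predictor-to-predictor correlation explicit. The Python sketch below implements that textbook formula and evaluates the special case of uncorrelated predictors. The correlation between the two predictors (`r_12`) is not reported in the output above, so it is deliberately not filled in; the function name is an assumption for illustration.

```python
import math

r1 = 0.6456504    # corr(percent_of_classes_under_20, alumni_giving_rate)
r2 = -0.7423975   # corr(student_faculty_ratio, alumni_giving_rate)


def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 of a two-predictor regression from the pairwise correlations.

    r_y1, r_y2: correlation of each predictor with the response.
    r_12: correlation between the two predictors (must not be +/-1).
    """
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


# Special case: uncorrelated predictors (r_12 = 0) gives r1^2 + r2^2.
sum_of_squares = r_squared_two_predictors(r1, r2, 0.0)

print(round(sum_of_squares, 4))             # r1^2 + r2^2
print(round(math.sqrt(sum_of_squares), 4))  # square root of that sum
print(round(r2**2, 4))                      # R^2 using student_faculty_ratio alone
```

Comparing these values with the reported R² of 0.5613 shows how much the two predictors overlap.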

Question 2 ———————————————————————-

2a and 2b

## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.41801 -0.31386 -0.01849  0.32692  1.68933 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.20445    0.31401   32.50   <2e-16 ***
## X1           4.89254    0.15672   31.22   <2e-16 ***
## X2          -1.96562    0.03848  -51.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4894 on 997 degrees of freedom
## Multiple R-squared:  0.7697, Adjusted R-squared:  0.7693 
## F-statistic:  1666 on 2 and 997 DF,  p-value: < 2.2e-16

The estimated prediction equation is Y=10.20445 + 4.89254X1 – 1.96562X2.

The intercept (10.2045) has a standard error of 0.3140, the X1 coefficient (4.8925) has a standard error of 0.1567, and the X2 coefficient (-1.9656) has a standard error of 0.0385. Each standard error is small relative to its estimate, so the coefficients are estimated precisely. Significance is judged from the t-values and p-values, not by comparing the standard errors with 0.05.

The null hypothesis assumes that the intercept, X1, and X2 coefficients are zero, implying no effect on the outcome. We reject this hypothesis based on the model’s results. The intercept has a t-value of 32.50 and a p-value of <2e-16, X1 has a t-value of 31.22 and a p-value of <2e-16, and X2 has a t-value of -51.08 with a p-value of <2e-16. The low p-values indicate high significance for each coefficient, with X1 positively associated and X2 negatively associated with Y.

The model’s residual standard error is 0.4894, which corresponds to a Mean Squared Error (MSE) of about 0.24 (0.4894^2 ≈ 0.2395), suggesting relatively accurate predictions.

With the larger error term, the prediction equation is now: Y = 10.03929 + 4.97594X1 - 2.01676X2 (the Y_new model printed under 2c).

Here, the intercept (10.0393) has a standard error of 0.6374, the X1 coefficient (4.9759) has a standard error of 0.3181, and the X2 coefficient (-2.0168) has a standard error of 0.0781. Each standard error is roughly twice its value in the first model, so the estimates are less precise, but all three coefficients remain highly significant.
The null hypothesis assumes that the intercept, X1, and X2 coefficients are zero, indicating no effect on Y. Based on this model, we reject the null hypothesis. The extremely low p-values (<2e-16) indicate high significance, with X1 positively associated and X2 negatively associated with Y. However, the residual standard error is now 0.9933 and the printed MSE is 0.9837, compared with a residual standard error of 0.4894 in the previous model, so these predictions deviate more from the actual values.
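
The residual standard error in the model summary and the MSE are related but not identical: the first divides the sum of squared residuals by the residual degrees of freedom, the second by the number of observations. The Python sketch below recovers one from the other for the Y_new model. It assumes that the value printed as `[1] 0.9837228` in the 2c output is the mean of the squared residuals, which the output itself does not state.

```python
residual_standard_error = 0.9933  # from the lm summary for Y_new
residual_df = 997                 # n minus 3 estimated coefficients
n_observations = residual_df + 3  # 1000 observations

# RSE^2 = SSE / residual_df, so SSE is recovered up to rounding of the RSE.
sse = residual_standard_error**2 * residual_df

# Assumption: the printed MSE is SSE divided by n.
mse = sse / n_observations
print(round(mse, 4))
```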

2b

• Null Hypothesis (H0): Each coefficient is equal to zero.
• Alternative Hypothesis (H1): Each coefficient is not equal to zero.
• The t-statistics and p-values indicate if the coefficients are significantly different from zero at a significance level (typically 0.05).
• Interpret the significance of coefficients based on p-values and report the MSE as a measure of model fit.

2c

##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 10.039291 0.63736589  15.75122  4.376192e-50
## X1           4.975936 0.31809587  15.64288  1.720637e-49
## X2          -2.016760 0.07810789 -25.82019 5.576910e-113
## [1] 0.9837228
## 
## Call:
## lm(formula = Y_new ~ X1 + X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.11699 -0.62664  0.00437  0.66561  2.90740 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.03929    0.63737   15.75   <2e-16 ***
## X1           4.97594    0.31810   15.64   <2e-16 ***
## X2          -2.01676    0.07811  -25.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9933 on 997 degrees of freedom
## Multiple R-squared:  0.4596, Adjusted R-squared:  0.4585 
## F-statistic: 423.9 on 2 and 997 DF,  p-value: < 2.2e-16

• Compare the new estimated coefficients, standard errors, t-statistics, and p-values with the original model.
• Note if the increased variance in the error term affects the significance and accuracy of the estimates.

2d

##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)  9.921176 0.48089488  20.63065  8.645805e-65
## X1           5.043677 0.23874980  21.12537  6.221337e-67
## X2          -1.978523 0.06208384 -31.86857 1.795837e-111
## [1] 0.2419806
##              Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  8.577889  0.9428544   9.097788 4.504768e-18
## X1           5.714599  0.4680988  12.208105 2.526879e-29
## X2          -2.009332  0.1217231 -16.507401 5.559728e-47
## [1] 0.9301854

2e

  1. Effect of Increasing Error Variance:
    • When the error standard deviation increases (from 0.5 to 1), the standard errors of the coefficients increase roughly in proportion, making estimates less precise.
    • Larger error variance can result in smaller t-statistics and higher p-values, potentially affecting the significance of the predictors.
  2. Effect of Decreasing Sample Size:
    • Reducing the sample size from 1000 to 400 generally results in less reliable coefficient estimates (higher standard errors) and potentially lower statistical significance.
    • Smaller sample sizes also lead to higher variability in the Mean Squared Error (MSE).
  3. Effect on MSE:
    • MSE tends to increase when error variance is higher, as there is more noise in the data.
    • Sample size mainly affects how stable the MSE is. The MSE estimates the error variance, so it stays near that variance at either sample size (the printed values at the larger error are 0.98 in 2c and 0.93 in 2d).

From models c and d, we observe that a larger error term leads to larger standard errors, smaller t-statistics, and a higher Mean Squared Error (MSE), although every coefficient remains significant in all of the fits shown. This aligns with the expectation that as the error variance increases, the coefficients are estimated less precisely. Additionally, smaller sample sizes result in less precise estimates. When the error variance is small, we can expect highly significant coefficients and a low mean squared error.
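
As a rough rule, a coefficient's standard error is proportional to the error standard deviation divided by the square root of the sample size, provided the predictors keep the same distribution. The Python sketch below turns that rule into a scale factor and compares it with the X1 standard errors from the two 1000-observation fits above. The error standard deviations of 0.5 and 1 are assumptions inferred from the residual standard errors (0.4894 and 0.9933), and the function name is illustrative.

```python
import math


def se_scale_factor(sigma_old, sigma_new, n_old, n_new):
    """Approximate factor by which coefficient standard errors change.

    Assumes the predictors keep the same distribution, so that the
    standard error is proportional to sigma / sqrt(n).
    """
    return (sigma_new / sigma_old) * math.sqrt(n_old / n_new)


# Error standard deviation doubles, sample size unchanged.
double_noise = se_scale_factor(0.5, 1.0, 1000, 1000)

# Sample size drops from 1000 to 400, noise unchanged.
smaller_sample = se_scale_factor(0.5, 0.5, 1000, 400)

# Observed ratio of the X1 standard errors in the two n = 1000 fits.
observed_x1_ratio = 0.31810 / 0.15672

print(double_noise, round(smaller_sample, 3), round(observed_x1_ratio, 3))
```

The observed ratio will not match the rule exactly because each fit uses a different random draw of the errors.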

Question 3 ———————————————————————-

3a

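The multiple linear regression model with normal errors in matrix form is expressed as: Y = Xβ + ε, where Y is the n×1 response vector, X is the n×(p+1) model matrix whose first column is all ones, β is the (p+1)×1 coefficient vector, and ε ~ N(0, σ²I) is the vector of independent normal errors.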

## Multiple Linear Regression Model with Normal Errors:
## Y = (39.6555834726146 * X(Intercept) + 0.166168629593652 * Xpercent_of_classes_under_20 + -1.70211027228975 * Xstudent_faculty_ratio) + ε
##               X1           X2
##  [1,] 1 1.937355  0.604712467
##  [2,] 1 2.018364  0.155937295
##  [3,] 1 1.916437 -0.248496232
##  [4,] 1 2.159528 -0.885879955
##  [5,] 1 2.032951  0.449972367
##  [6,] 1 1.917953 -0.017973444
##  [7,] 1 2.048743 -0.006476105
##  [8,] 1 2.073832  0.377534484
##  [9,] 1 2.057578  0.328488478
## [10,] 1 1.969461  0.237560528
##           [,1]
##  [1,] 18.93684
##  [2,] 20.17102
##  [3,] 20.11646
##  [4,] 21.57472
##  [5,] 19.57472
##  [6,] 19.59765
##  [7,] 20.17877
##  [8,] 18.87872
##  [9,] 19.39184
## [10,] 19.58116

3b

The multiple linear regression model is Y = β0 + β1*X1 + β2*X2 + ε, where Y is the alumni giving rate, X1 is the percent of classes under 20, and X2 is the student faculty ratio. β0 is the intercept, β1 and β2 are the slope coefficients as defined above, and ε is the normally distributed error term. In R, I combined the intercept column, X1, and X2 into a single model matrix X to simplify the linear equation.
The resulting equation using the data provided is: Y = 39.6555835 + 0.1661686 * percent_of_classes_under_20 - 1.7021103 * student_faculty_ratio + ε.
With this model, we assume that the relationship between each predictor and Y is linear and that the errors are independent and normally distributed with mean zero and constant variance. We also assume that there are no highly influential outliers within the data set.

3c

## Model Matrix (X):
##               X1           X2
##  [1,] 1 1.937355  0.604712467
##  [2,] 1 2.018364  0.155937295
##  [3,] 1 1.916437 -0.248496232
##  [4,] 1 2.159528 -0.885879955
##  [5,] 1 2.032951  0.449972367
##  [6,] 1 1.917953 -0.017973444
##  [7,] 1 2.048743 -0.006476105
##  [8,] 1 2.073832  0.377534484
##  [9,] 1 2.057578  0.328488478
## [10,] 1 1.969461  0.237560528

Please find the model matrix X as printed above.

3d

##                 (Intercept) percent_of_classes_under_20 
##                  39.6555835                   0.1661686 
##       student_faculty_ratio 
##                  -1.7021103
##                  [,1]
## (Intercept) 16.902316
## X1           1.516789
## X2          -1.564796

The least squares estimate of beta is the vector [39.6555835, 0.1661686, -1.7021103], shown in the first block above: the intercept and the two slopes of the alumni model. The second block is the estimate obtained from the ten-row X and Y matrices printed in 3a.
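
The least squares estimate solves the normal equations (X'X) beta = X'Y. The Python sketch below solves them for the ten rows of X and Y printed in 3a, using only the standard library. The helper `solve_linear_system` is written for this illustration and is not the method used in the R session.

```python
# Rows of X (intercept column, X1, X2) and Y as printed in 3a.
X = [
    [1.0, 1.937355, 0.604712467],
    [1.0, 2.018364, 0.155937295],
    [1.0, 1.916437, -0.248496232],
    [1.0, 2.159528, -0.885879955],
    [1.0, 2.032951, 0.449972367],
    [1.0, 1.917953, -0.017973444],
    [1.0, 2.048743, -0.006476105],
    [1.0, 2.073832, 0.377534484],
    [1.0, 2.057578, 0.328488478],
    [1.0, 1.969461, 0.237560528],
]
Y = [18.93684, 20.17102, 20.11646, 21.57472, 19.57472,
     19.59765, 20.17877, 18.87872, 19.39184, 19.58116]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


p = len(X[0])
# Build X'X and X'Y, then solve the normal equations.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(p)]
beta_hat = solve_linear_system(xtx, xty)

# Residuals of the fit; they are orthogonal to every column of X.
residuals = [y - sum(b * x for b, x in zip(beta_hat, row)) for row, y in zip(X, Y)]
print([round(b, 4) for b in beta_hat])
```

For larger or poorly conditioned problems, a QR-based solver is usually preferred over forming X'X directly.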

3e

An estimator is unbiased if its expected value is equal to the true parameter value. In the context of linear regression, beta_hat is an unbiased estimator of beta if E[beta_hat] = beta. This means that, on average across repeated samples, the estimate beta_hat equals the true coefficient vector β, provided the model is correctly specified and the errors have mean zero given the predictors (normality and constant variance are not needed for unbiasedness). Unbiasedness is a property of the estimation procedure over many samples: any single estimate can fall above or below the true value, but the estimates do not systematically overshoot or undershoot it. It does not by itself guarantee that one particular sample is a good representation of the population.
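
The unbiasedness of the least squares estimator follows in a few lines from the matrix form of the model. The derivation below is a sketch that assumes X has full column rank and that the errors have mean zero given X.

```latex
\begin{aligned}
\hat{\beta} &= (X^\top X)^{-1} X^\top Y \\
            &= (X^\top X)^{-1} X^\top (X\beta + \epsilon) \\
            &= \beta + (X^\top X)^{-1} X^\top \epsilon, \\
E[\hat{\beta} \mid X] &= \beta + (X^\top X)^{-1} X^\top \, E[\epsilon \mid X] = \beta .
\end{aligned}
```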