## school percent_of_classes_under_20 student_faculty_ratio
## Length:48 Min. :29.00 Min. : 3.00
## Class :character 1st Qu.:44.75 1st Qu.: 8.00
## Mode :character Median :59.50 Median :10.50
## Mean :55.73 Mean :11.54
## 3rd Qu.:66.25 3rd Qu.:13.50
## Max. :77.00 Max. :23.00
## alumni_giving_rate private
## Min. : 7.00 Min. :0.0000
## 1st Qu.:18.75 1st Qu.:0.0000
## Median :29.00 Median :1.0000
## Mean :29.27 Mean :0.6875
## 3rd Qu.:38.50 3rd Qu.:1.0000
## Max. :67.00 Max. :1.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 44.75 59.50 55.73 66.25 77.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 18.75 29.00 29.27 38.50 67.00
The final regression equation for predicting alumni donation rate (Y)
based on the percent of classes under 20 students (X1) and the
student-faculty ratio (X2) is:
Y = 39.6556 + 0.1662 * X1 - 1.7021 * X2.
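While this model is the best fit, the extreme residuals are large: the
minimum is -15.00 and the maximum is 24.56, against a residual standard
error of 9.098. The middle of the distribution is tighter. The first
quartile (-6.57), median (-1.95), and third quartile (4.42) all lie
within about 7 percentage points of zero, supporting this as an
acceptable fit.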
The significance codes indicate that the student-faculty ratio has a statistically significant association with the alumni donation rate, while the percent of classes under 20 students does not once the student-faculty ratio is in the model. The model explains 56.13% of the variation in the donation rate (R² = 0.5613), so a substantial share remains unexplained. The low p-value of the overall F-test (8.869e-09) confirms that the model as a whole is statistically significant in predicting the donation rate.
##
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20 +
## student_faculty_ratio, data = alumni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.00 -6.57 -1.95 4.42 24.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6556 13.5076 2.936 0.005225 **
## percent_of_classes_under_20 0.1662 0.1626 1.022 0.312128
## student_faculty_ratio -1.7021 0.4421 -3.850 0.000371 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.098 on 45 degrees of freedom
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.5418
## F-statistic: 28.79 on 2 and 45 DF, p-value: 8.869e-09
## Predicted Alumni Giving Rate: 30.94291
For an observation where the percent of classes under 20 is 50% and the
student-faculty ratio is 10, we would expect about 30.94% of the alumni
to donate back to the college.
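The prediction can be reproduced by hand from the fitted coefficients. The sketch below copies the estimates from the regression output; the variable names `beta_hat`, `x_new`, and `predicted_rate` are illustrative and not taken from the original code.

```r
# Coefficients copied from the regression output above
beta_hat <- c(intercept = 39.6555835,
              percent_of_classes_under_20 = 0.1661686,
              student_faculty_ratio = -1.7021103)

# New observation: the leading 1 multiplies the intercept
x_new <- c(1, 50, 10)

# Predicted value is the sum of coefficient * predictor value
predicted_rate <- sum(beta_hat * x_new)
cat("Predicted Alumni Giving Rate:", predicted_rate, "\n")
```

With a fitted `lm` object, `predict()` with a `newdata` data frame would give the same value without copying coefficients.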
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6555835 13.5075774 2.935803 0.0052247868
## percent_of_classes_under_20 0.1661686 0.1625520 1.022249 0.3121275033
## student_faculty_ratio -1.7021103 0.4421271 -3.849821 0.0003709425
• Null Hypothesis: Each regression coefficient (slope) is equal to
zero.
If the p-value for a coefficient is less than 0.05, we reject
the null hypothesis and conclude that the predictor has a statistically
significant relationship with the response variable. Based on the
p-values, conclude whether to reject each null hypothesis at α = 0.05.
In testing the statistical significance of the regression coefficients
with a t-test, each null hypothesis assumes that the corresponding
coefficient is zero, implying that the percent of classes under 20
students (or the student-faculty ratio) has no effect on alumni
donation rates.
The intercept's p-value of 0.005 is below the 0.05 threshold, so we
reject the hypothesis that the intercept is zero. This test concerns
only the intercept and says nothing about whether the predictors matter.
The p-value for the percent of classes under 20 is 0.31, which is above
0.05, so we fail to reject the null hypothesis: there is no evidence
that this variable affects donation rates after accounting for the
student-faculty ratio. However, the student-faculty ratio's p-value of
0.00037 is less than 0.05, so we reject the null hypothesis and conclude
that there is a statistically significant relationship between the
student-faculty ratio and the alumni donation rate.
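As a check on the table, each t value is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with the residual degrees of freedom (45 here). This is a minimal sketch using the printed estimates; the object names are illustrative.

```r
# Estimates and standard errors copied from the coefficient table above
estimate  <- c(39.6555835, 0.1661686, -1.7021103)
std_error <- c(13.5075774, 0.1625520, 0.4421271)
names(estimate) <- c("(Intercept)",
                     "percent_of_classes_under_20",
                     "student_faculty_ratio")

df_resid <- 45  # residual degrees of freedom from the model summary

# t statistic and two-sided p-value for H0: coefficient = 0
t_value <- estimate / std_error
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# Decision at alpha = 0.05
reject_h0 <- p_value < 0.05
print(data.frame(t_value, p_value, reject_h0))
```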
## F-statistic: 28.79264
## Degrees of Freedom: 2 / 45
## P-value: 8.868867e-09
• Null Hypothesis: The model with predictors does not improve
prediction over a model with no predictors.
If the p-value is less than 0.05, we reject the null hypothesis and
conclude that the model is statistically significant as a whole.
For the F-test, the null hypothesis assumes that all slope coefficients
are zero, indicating no impact on alumni donation rates. The F-statistic
is 28.79 with 2 and 45 degrees of freedom: 2 for the predictors and 45
for the residuals (48 observations minus 3 estimated coefficients). A
value this large implies that the percent of classes under 20 and the
student-faculty ratio together explain a substantial portion of the
donation rate variation.
The p-value is 8.868867e-09, which is effectively zero, indicating
that the overall model is statistically significant. Since this p-value
is well below 0.05, we reject the null hypothesis and conclude that the
variables together significantly influence the donation rate.
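The F-statistic and its p-value can be recovered from R-squared and the sample size alone. The sketch below uses the reported R-squared (0.5613406) and n = 48 from the data summary; the variable names are illustrative.

```r
r_squared <- 0.5613406  # from the model summary
n <- 48                 # number of schools
k <- 2                  # number of predictors

df1 <- k          # numerator degrees of freedom
df2 <- n - k - 1  # denominator (residual) degrees of freedom

# Overall F statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r_squared / df1) / ((1 - r_squared) / df2)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

cat("F-statistic:", f_stat, "\n")
cat("Degrees of Freedom:", df1, "/", df2, "\n")
cat("P-value:", p_value, "\n")
```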
## R-Squared Value: 0.5613406
The R^2 value indicates the proportion of the variance in the
response variable (alumni giving rate) explained by the predictors. A
higher R^2 value means a better fit.
The coefficient of determination is 0.5613406, meaning 56.13% of the variance in the alumni giving rate is explained jointly by the percent of classes under 20 students and the student-faculty ratio.
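Because R^2 never decreases when predictors are added, the summary also reports an adjusted value (0.5418) that penalizes the number of predictors. A short sketch of that adjustment, using the reported R^2 and n = 48, with illustrative variable names:

```r
r_squared <- 0.5613406  # multiple R-squared from the summary
n <- 48                 # observations
k <- 2                  # predictors

# Adjusted R-squared rescales the unexplained share by (n - 1) / (n - k - 1)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
cat("Adjusted R-Squared:", adj_r_squared, "\n")
```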
## Correlation coefficient between percent_of_classes_under_20 and alumni_giving_rate (r_1): 0.6456504
## Correlation coefficient between student_faculty_ratio and alumni_giving_rate (r_2): -0.7423975
The correlation coefficients r1 and r2 show the strength and
direction of the relationship between each predictor and the response
variable. Additionally, R^2 represents the combined effect of both
predictors on the response variable, and its value can often be linked
to the strength of r1 and r2.
The correlation coefficient between percent of classes under 20 and
alumni giving rate (r1) is 0.6456504. As it is a moderately positive
value, we can expect a positive linear relationship between percent of
classes under 20 and donation rate. The correlation coefficient between
student-faculty ratio and alumni giving rate (r2) is -0.7423975. As a
strong negative value, it indicates a strong negative relationship
between student-faculty ratio and alumni giving rate. The coefficient of
determination (R^2) is 0.5613 and, as stated before, shows how much of
the variation in the giving rate is explained by the two variables
together. If the two predictors were uncorrelated with each other, R^2
would equal (r1)^2 + (r2)^2 = 0.9680, whose square root is 0.983879.
The observed R^2 of 0.5613 is far lower, which shows that the predictors
are correlated with each other and explain overlapping portions of the
variation. This is not an inconsistency in the results, but it does call
for additional analysis of how the two predictors relate to each other.
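For two predictors, R^2 depends on r1, r2, and the correlation between the predictors themselves (called r12 here). The sketch below states that standard identity and checks it against `lm` on simulated data. The simulated data, the seed, and the helper name `r2_from_correlations` are assumptions for illustration; the correlation between the two alumni predictors is not reported above, so it is not filled in.

```r
# R^2 of a two-predictor regression from the three pairwise correlations
r2_from_correlations <- function(r1, r2, r12) {
  (r1^2 + r2^2 - 2 * r1 * r2 * r12) / (1 - r12^2)
}

# With uncorrelated predictors (r12 = 0) this reduces to r1^2 + r2^2
r2_if_uncorrelated <- r2_from_correlations(0.6456504, -0.7423975, 0)
cat("R^2 if predictors were uncorrelated:", r2_if_uncorrelated, "\n")

# Check the identity on simulated data with correlated predictors
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- -0.6 * x1 + rnorm(n)          # x2 is built to correlate with x1
y  <- 1 + 0.5 * x1 - 1.0 * x2 + rnorm(n)

r2_formula <- r2_from_correlations(cor(x1, y), cor(x2, y), cor(x1, x2))
r2_lm      <- summary(lm(y ~ x1 + x2))$r.squared
cat("Formula:", r2_formula, " lm:", r2_lm, "\n")
```

Applying the same function to the alumni data would need `cor()` between the two predictor columns as the third argument.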
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.41801 -0.31386 -0.01849 0.32692 1.68933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.20445 0.31401 32.50 <2e-16 ***
## X1 4.89254 0.15672 31.22 <2e-16 ***
## X2 -1.96562 0.03848 -51.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4894 on 997 degrees of freedom
## Multiple R-squared: 0.7697, Adjusted R-squared: 0.7693
## F-statistic: 1666 on 2 and 997 DF, p-value: < 2.2e-16
The estimated prediction equation is Y = 10.20445 + 4.89254 X1 -
1.96562 X2.
The intercept (10.2045) has a standard error of 0.3140, the X1
coefficient (4.8925) has a standard error of 0.1567, and the X2
coefficient (-1.9656) has a standard error of 0.0385. Each standard
error is small relative to its estimate, so all three coefficients are
estimated precisely. Standard errors are not compared against 0.05;
significance is judged from the t-statistics and p-values.
The null hypothesis assumes that the intercept, X1, and X2
coefficients are zero, implying no effect on the outcome. We reject this
hypothesis for each coefficient based on the model's results. The
intercept has a t-value of 32.50 and a p-value of <2e-16, X1 has a
t-value of 31.22 and a p-value of <2e-16, and X2 has a t-value of
-51.08 with a p-value of <2e-16. The low p-values indicate high
significance for each coefficient, with a positive coefficient for X1
and a negative coefficient for X2.
The model's residual standard error is 0.4894, which corresponds to a
Mean Squared Error (MSE) of about 0.24, suggesting relatively accurate
predictions.
With a revised error term, the prediction equation is now: Y = 10.03929
+ 4.97594 X1 - 2.01676 X2.
Here, the intercept (10.0393) has a standard error of 0.6374, the X1
coefficient (4.9759) has a standard error of 0.3181, and the X2
coefficient (-2.0168) has a standard error of 0.0781. The standard
errors are roughly double those of the first model, yet all three
coefficients remain highly significant (p < 2e-16).
The null hypothesis assumes that the intercept, X1, and X2 coefficients
are zero, indicating no effect on Y. Based on this model, we reject the
null hypothesis for each coefficient. The extremely low p-values
indicate high significance, with a positive coefficient for X1 and a
negative coefficient for X2. However, the residual standard error is now
0.9933 and the Mean Squared Error (MSE) is 0.9837, so these predictions
deviate more from the actual values than those of the previous model.
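The residual standard error and the MSE are related but not identical. The sketch below assumes the printed MSE (0.9837228) was computed as the mean of the squared residuals, which divides by n rather than by the residual degrees of freedom; that assumption and the variable names are not confirmed by the original code.

```r
sigma_hat <- 0.9933  # residual standard error from the summary
n <- 1000            # 997 residual df + 3 estimated coefficients
p <- 3               # intercept plus two slopes

# Residual variance estimate: RSS / (n - p)
mse_df_corrected <- sigma_hat^2

# Mean of squared residuals: RSS / n (assumed definition of the printed MSE)
mse_mean_sq_resid <- sigma_hat^2 * (n - p) / n

cat("RSS / (n - p):", mse_df_corrected, "\n")
cat("RSS / n      :", mse_mean_sq_resid, "\n")
```

The second value agrees with the printed 0.9837228 up to the rounding of the residual standard error.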
• Null Hypothesis (H0): Each coefficient is equal to zero.
• Alternative Hypothesis (H1): Each coefficient is not equal to zero.
• The t-statistics and p-values indicate if the coefficients are
significantly different from zero at a significance level (typically
0.05).
• Interpret the significance of coefficients based on p-values and
report the MSE as a measure of model fit.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.039291 0.63736589 15.75122 4.376192e-50
## X1 4.975936 0.31809587 15.64288 1.720637e-49
## X2 -2.016760 0.07810789 -25.82019 5.576910e-113
## [1] 0.9837228
##
## Call:
## lm(formula = Y_new ~ X1 + X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.11699 -0.62664 0.00437 0.66561 2.90740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.03929 0.63737 15.75 <2e-16 ***
## X1 4.97594 0.31810 15.64 <2e-16 ***
## X2 -2.01676 0.07811 -25.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9933 on 997 degrees of freedom
## Multiple R-squared: 0.4596, Adjusted R-squared: 0.4585
## F-statistic: 423.9 on 2 and 997 DF, p-value: < 2.2e-16
• Compare the new estimated coefficients, standard errors,
t-statistics, and p-values with the original model.
• Note if the increased variance in the error term affects the
significance and accuracy of the estimates.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.921176 0.48089488 20.63065 8.645805e-65
## X1 5.043677 0.23874980 21.12537 6.221337e-67
## X2 -1.978523 0.06208384 -31.86857 1.795837e-111
## [1] 0.2419806
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.577889 0.9428544 9.097788 4.504768e-18
## X1 5.714599 0.4680988 12.208105 2.526879e-29
## X2 -2.009332 0.1217231 -16.507401 5.559728e-47
## [1] 0.9301854
From models c and d, we observe that a larger error variance leads to larger standard errors, smaller t-statistics, and a higher Mean Squared Error (MSE). In the outputs above, the standard errors roughly double and the MSE rises from about 0.24 to more than 0.93 when the error variance increases. The coefficient estimates themselves stay close to the same values, so the added noise reduces the precision of the estimates rather than changing the underlying effects. Additionally, smaller sample sizes result in less precise estimates. Even so, with a constant error variance the coefficients remain clearly significant in every model fitted here.
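The effect of error variance and sample size on the standard errors can be illustrated with a small simulation. Everything in this sketch is an assumption chosen to resemble the outputs above (true coefficients 10, 5, and -2, the predictor distributions, the error standard deviations 0.5 and 1, the seed, and the subsample size); it is not the original simulation code.

```r
set.seed(42)                          # arbitrary seed
n  <- 1000
x1 <- rnorm(n, mean = 2, sd = 0.1)    # assumed predictor distributions
x2 <- rnorm(n, mean = 0, sd = 0.4)
z  <- rnorm(n)                        # standardized noise, reused below

signal  <- 10 + 5 * x1 - 2 * x2       # assumed true regression function
y_small <- signal + 0.5 * z           # error sd = 0.5
y_large <- signal + 1.0 * z           # error sd = 1 (same noise, doubled)

fit_small <- lm(y_small ~ x1 + x2)
fit_large <- lm(y_large ~ x1 + x2)
fit_sub   <- lm(y_small ~ x1 + x2, subset = 1:250)  # smaller sample

se_small <- coef(summary(fit_small))[, "Std. Error"]
se_large <- coef(summary(fit_large))[, "Std. Error"]
se_sub   <- coef(summary(fit_sub))[, "Std. Error"]

# Doubling the error sd doubles every standard error and quadruples the MSE
print(se_large / se_small)
print(mean(residuals(fit_large)^2) / mean(residuals(fit_small)^2))

# A quarter of the sample gives noticeably larger standard errors
print(se_sub / se_small)
```

Reusing the same noise vector `z` for both fits isolates the effect of the error scale, so the ratio of standard errors is exactly 2 up to floating-point error.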
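The multiple linear regression model with normal errors in matrix form is expressed as: Y = Xβ + ε, where Y is the vector of responses, X is the model matrix (a column of ones followed by the predictor columns), β is the vector of coefficients, and ε is the vector of independent normal errors with mean zero and constant variance.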
## Multiple Linear Regression Model with Normal Errors:
## Y = (39.6555834726146 * X(Intercept) + 0.166168629593652 * Xpercent_of_classes_under_20 + -1.70211027228975 * Xstudent_faculty_ratio) + ε
## X1 X2
## [1,] 1 1.937355 0.604712467
## [2,] 1 2.018364 0.155937295
## [3,] 1 1.916437 -0.248496232
## [4,] 1 2.159528 -0.885879955
## [5,] 1 2.032951 0.449972367
## [6,] 1 1.917953 -0.017973444
## [7,] 1 2.048743 -0.006476105
## [8,] 1 2.073832 0.377534484
## [9,] 1 2.057578 0.328488478
## [10,] 1 1.969461 0.237560528
## [,1]
## [1,] 18.93684
## [2,] 20.17102
## [3,] 20.11646
## [4,] 21.57472
## [5,] 19.57472
## [6,] 19.59765
## [7,] 20.17877
## [8,] 18.87872
## [9,] 19.39184
## [10,] 19.58116
The multiple linear regression model is Y = β0 + β1 X1 + β2 X2 + ε,
where Y is the alumni giving rate, X1 is the percent of classes under 20,
and X2 is the student-faculty ratio. β0 is the intercept, β1 and β2 are
the slope coefficients as defined above, and ε is the normally
distributed error term. In R, I combined a column of ones, X1, and X2
into a single matrix X to express the linear equation in matrix form.
The resulting equation using the data provided is: Y =
39.6555834726146 + 0.166168629593652 * percent_of_classes_under_20 -
1.70211027228975 * student_faculty_ratio + ε.
With this model, we assume that the relationship between each predictor
and Y is linear and that the residuals are independent and normally
distributed with constant variance. We also assume that there are no
significant outliers within the data set.
## Model Matrix (X):
## X1 X2
## [1,] 1 1.937355 0.604712467
## [2,] 1 2.018364 0.155937295
## [3,] 1 1.916437 -0.248496232
## [4,] 1 2.159528 -0.885879955
## [5,] 1 2.032951 0.449972367
## [6,] 1 1.917953 -0.017973444
## [7,] 1 2.048743 -0.006476105
## [8,] 1 2.073832 0.377534484
## [9,] 1 2.057578 0.328488478
## [10,] 1 1.969461 0.237560528
Please find the model matrix X as printed above.
## (Intercept) percent_of_classes_under_20
## 39.6555835 0.1661686
## student_faculty_ratio
## -1.7021103
## [,1]
## (Intercept) 16.902316
## X1 1.516789
## X2 -1.564796
The least squares estimate of beta is the vector [39.6555835 // 0.1661686 // -1.7021103], that is, the estimated intercept and the two slopes of the model.
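In matrix form the least squares estimate solves the normal equations, beta_hat = (X'X)^(-1) X'Y. The sketch below computes it directly and compares the result with `lm`. The simulated data, seed, and object names are assumptions for illustration, since the alumni data file is not shown here.

```r
set.seed(7)                           # arbitrary seed
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

# Model matrix: a column of ones for the intercept, then the predictors
X <- cbind(1, x1, x2)

# Solve (X'X) beta = X'y rather than inverting X'X explicitly
beta_hat_matrix <- solve(crossprod(X), crossprod(X, y))

# Same estimate from lm()
beta_hat_lm <- coef(lm(y ~ x1 + x2))

print(cbind(matrix_form = as.vector(beta_hat_matrix), lm = beta_hat_lm))
```

For a fitted model, `model.matrix(fit)` returns the same X that `lm` used, so the comparison can be run on real data without building the matrix by hand.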
An estimator is unbiased if its expected value equals the true parameter value. In the context of linear regression, beta_hat is an unbiased estimator of beta if E[beta_hat] = beta. This means that, on average across repeated samples, beta_hat equals the true coefficient vector β, assuming the model assumptions (linearity, independence, homoscedasticity, normality of errors) hold. Unbiasedness is a statement about the estimator over many samples, not about any single sample: an individual estimate can fall above or below the true value, but the estimator does not systematically overestimate or underestimate it. It does not by itself guarantee that a particular sample is a good representation of the population.
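The property can be shown in a few lines by substituting the model into the least squares formula. This derivation treats X as fixed (or conditions on it) and uses only the assumption that the errors have mean zero, which is implied by the assumptions listed above.

```latex
\begin{aligned}
\hat{\beta} &= (X^{\top}X)^{-1}X^{\top}Y \\
            &= (X^{\top}X)^{-1}X^{\top}(X\beta + \varepsilon) \\
            &= \beta + (X^{\top}X)^{-1}X^{\top}\varepsilon, \\[4pt]
E[\hat{\beta} \mid X] &= \beta + (X^{\top}X)^{-1}X^{\top}\,E[\varepsilon \mid X] = \beta
\quad \text{since } E[\varepsilon \mid X] = 0 .
\end{aligned}
```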