Question 1: Alumni Donation Data (Multiple Linear Regression)

Data Import and Model Fitting

Load the data and fit the model. Summary statistics for the predictors (percentage of classes under 20 students, student/faculty ratio) and the response (alumni giving rate) are shown below.

##     school          percent_of_classes_under_20 student_faculty_ratio
##  Length:48          Min.   :29.00               Min.   : 3.00        
##  Class :character   1st Qu.:44.75               1st Qu.: 8.00        
##  Mode  :character   Median :59.50               Median :10.50        
##                     Mean   :55.73               Mean   :11.54        
##                     3rd Qu.:66.25               3rd Qu.:13.50        
##                     Max.   :77.00               Max.   :23.00        
##  alumni_giving_rate    private      
##  Min.   : 7.00      Min.   :0.0000  
##  1st Qu.:18.75      1st Qu.:0.0000  
##  Median :29.00      Median :1.0000  
##  Mean   :29.27      Mean   :0.6875  
##  3rd Qu.:38.50      3rd Qu.:1.0000  
##  Max.   :67.00      Max.   :1.0000
## R-squared for the model: 0.5613406
## P-values for model coefficients:
##                 (Intercept) percent_of_classes_under_20 
##                0.0052247868                0.3121275033 
##       student_faculty_ratio 
##                0.0003709425
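
A minimal sketch of the import and fitting step; the file name and the object names (alumni, model) are assumptions, since only the resulting output is shown above.

    # Load the data and fit the multiple regression (file and object names are hypothetical)
    alumni <- read.csv("alumni_giving.csv")
    model  <- lm(alumni_giving_rate ~ percent_of_classes_under_20 + student_faculty_ratio,
                 data = alumni)
    summary(alumni)                            # summary statistics shown above
    summary(model)$r.squared                   # R-squared for the model
    summary(model)$coefficients[, "Pr(>|t|)"]  # p-values for model coefficients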

Residual Diagnostics

The regression model summary shows the following key points:

  • R-squared: 0.561, indicating that about 56.1% of the variability in alumni giving rates is explained by the two predictors.
  • student_faculty_ratio has a negative, statistically significant effect on alumni giving rates (p < 0.001).
  • percent_of_classes_under_20 is not statistically significant (p = 0.312).

1a. Normality of Errors

Normality of Residuals: The Q-Q plot suggests some deviation from normality, particularly in the tails, and the Shapiro-Wilk test below rejects normality at the 5% level (W = 0.946, p = 0.027). With n = 48 observations this mild departure may not severely affect the validity of the fitted model, but it should be kept in mind when interpreting p-values and intervals.

## 
## Shapiro-Wilk Test for Normality:
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals_model
## W = 0.94577, p-value = 0.02721
## 
## Summary of Residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -15.00   -6.57   -1.95    0.00    4.42   24.56
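
A sketch of how the normality checks can be produced, assuming the fitted model object is named model; the residual vector name residuals_model matches the test output above.

    residuals_model <- residuals(model)               # extract residuals from the fitted model
    qqnorm(residuals_model); qqline(residuals_model)  # Q-Q plot (1a)
    shapiro.test(residuals_model)                     # Shapiro-Wilk normality test
    summary(residuals_model)                          # residual summary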

1b. Constant Error Variance (Homoscedasticity)

Constant Error Variance: The residuals vs. fitted values plot shows no clear pattern, supporting the assumption of homoscedasticity (constant variance).

1c. Outliers and Influential Points

Outliers and Influential Points: The Cook’s distance plot shows no values exceeding 1, which suggests that there are no highly influential points.
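
A sketch of how the two plots referenced in 1b and 1c can be drawn, again assuming the fitted model object is named model.

    # 1b: residuals vs fitted values -- look for funnels or curvature
    plot(fitted(model), residuals(model), xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)
    # 1c: Cook's distance -- flag observations near or above the rule-of-thumb cutoff of 1
    plot(cooks.distance(model), type = "h", ylab = "Cook's distance")
    abline(h = 1, lty = 2)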


1d. Multicollinearity

Multicollinearity: The VIFs for percent_of_classes_under_20 and student_faculty_ratio are both 2.61; with only two predictors the VIFs are necessarily equal, since each equals 1/(1 − r²), where r is the correlation between the two predictors. This indicates moderate multicollinearity, but it is below the common threshold of 5 usually considered problematic.

## Variance Inflation Factors (VIF) for predictors:
## percent_of_classes_under_20       student_faculty_ratio 
##                    2.611671                    2.611671
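
The VIF values above can be computed with vif() from the car package (a sketch, assuming the fitted model object is named model).

    library(car)
    vif(model)   # one VIF per predictor; values above 5-10 are usually flagged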

1e. Prediction and Interpretation (Multicollinearity)

The predicted alumni giving rate for an institution with 40% of classes under 20 students and a student/faculty ratio of 5 is approximately 37.79%.

Prediction Concerns: Both predictor values lie within their observed ranges (29–77 for percent_of_classes_under_20 and 3–23 for student_faculty_ratio), so the prediction does not require gross extrapolation. However, because the two predictors are correlated, the combination of 40% of classes under 20 students with a very low student/faculty ratio of 5 may be unusual relative to the joint distribution of the data (hidden extrapolation). Given the model’s moderate R-squared (0.561), unobserved factors also affect alumni giving, so the prediction should be interpreted cautiously: the model explains only part of the variability in giving rates.

## R-squared: 0.5613406
## P-values for model coefficients:
##                 (Intercept) percent_of_classes_under_20 
##                0.0052247868                0.3121275033 
##       student_faculty_ratio 
##                0.0003709425
## Predicted Alumni Giving Rate for percent_of_classes_under_20 = 40 and student_faculty_ratio = 5:
##        fit      lwr      upr
## 1 37.79178 16.56997 59.01358
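
A sketch of the prediction step, assuming the fitted model object is named model; a prediction interval is assumed, which is consistent with the wide lwr/upr limits above.

    new_obs <- data.frame(percent_of_classes_under_20 = 40, student_faculty_ratio = 5)
    predict(model, newdata = new_obs, interval = "prediction")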

Question 2: Simulation Study (Simple Linear Regression)

2a: Generate Data
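
A minimal sketch of how the data below could have been generated. The seed and the distribution of X reproduce the printed X values; the coefficients and error standard deviation in the quadratic mean function are illustrative assumptions, since only the simulated values are shown.

    set.seed(123)
    n <- 100
    X <- rnorm(n, mean = 3, sd = 0.5)   # predictor
    # Quadratic mean function plus Gaussian noise; these coefficients and the
    # error SD are assumptions, not taken from the assignment text.
    Y <- 10 + 5 * X - 2 * X^2 + rnorm(n, sd = 0.5)
    my_data <- data.frame(X = X, Y = Y)
    my_data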

##            X          Y
## 1   2.719762  8.4493950
## 2   2.884911  7.9075722
## 3   3.779354  0.2063892
## 4   3.035254  6.5769636
## 5   3.064644  6.0633260
## 6   3.857532 -0.4959653
## 7   3.230458  4.8881192
## 8   2.367469  9.7935534
## 9   2.656574  8.9779883
## 10  2.777169  8.9200079
## 11  3.612041  1.6788521
## 12  3.179907  5.9799008
## 13  3.200386  4.7080497
## 14  3.055341  6.5787042
## 15  2.722079  9.0506679
## 16  3.893457 -0.7001486
## 17  3.248925  5.1864339
## 18  2.016691 11.6290155
## 19  3.350678  3.8744521
## 20  2.763604  8.0309397
## 21  2.466088 10.2260825
## 22  2.891013  7.2654184
## 23  2.486998  9.8193943
## 24  2.635554  9.1574320
## 25  2.687480  9.9142314
## 26  2.156653 11.1549845
## 27  3.418894  3.8344951
## 28  3.076687  6.4904129
## 29  2.430932  9.8548731
## 30  3.626907  1.7899678
## 31  3.213232  6.1387148
## 32  2.852464  8.2149686
## 33  3.447563  3.4870517
## 34  3.439067  3.3297252
## 35  3.410791  2.7603449
## 36  3.344320  4.9183150
## 37  3.276959  4.1775558
## 38  2.969044  7.5847482
## 39  2.847019  8.9786145
## 40  2.809764  7.5373228
## 41  2.652647  9.5410577
## 42  2.896041  7.5749969
## 43  2.367302  9.8422012
## 44  4.084478 -3.7008647
## 45  3.603981  1.2417788
## 46  2.438446 10.0347403
## 47  2.798558  7.5980610
## 48  2.766672  8.8683685
## 49  3.389983  5.0160038
## 50  2.958315  6.6448013
## 51  3.126659  6.4751695
## 52  2.985727  7.4840273
## 53  2.978565  7.3152290
## 54  3.684301  0.7691676
## 55  2.887115  7.7049859
## 56  3.758235  0.4023137
## 57  2.225624 11.5028120
## 58  3.292307  4.5967459
## 59  3.061927  7.0473269
## 60  3.107971  6.0335987
## 61  3.189820  6.1255545
## 62  2.748838  8.1073792
## 63  2.833396  7.4806346
## 64  2.490712 11.6667859
## 65  2.464104  9.9684723
## 66  3.151764  6.0406987
## 67  3.224105  5.6491046
## 68  3.026502  6.5711902
## 69  3.461134  3.6052062
## 70  4.025042 -2.0922377
## 71  2.754484  8.4903630
## 72  1.845416 12.4486071
## 73  3.502869  2.9571265
## 74  2.645400 10.2949458
## 75  2.655996  8.8006842
## 76  3.512786  2.3366038
## 77  2.857613  7.9750519
## 78  2.389641 10.6826765
## 79  3.090652  6.5672641
## 80  2.930554  7.2472917
## 81  3.002882  6.4481457
## 82  3.192640  6.2088907
## 83  2.814670  8.0537905
## 84  3.322188  4.1043151
## 85  2.889757  7.6292560
## 86  3.165891  5.6851355
## 87  3.548420  3.1144957
## 88  3.217591  5.4245420
## 89  2.837034  8.4646717
## 90  3.574404  2.0696479
## 91  3.496752  3.1364342
## 92  3.274198  4.7678981
## 93  3.119366  6.1832343
## 94  2.686047  8.5528566
## 95  3.680326  0.6566281
## 96  2.699870  9.9193595
## 97  4.093666 -2.7475239
## 98  3.766305 -0.1642205
## 99  2.882150  7.4915910
## 100 2.486790  9.4729632

2b: Simple Linear Regression Using Only X

Simple linear regression analysis for Y ~ X:

  • R-squared: 0.9482, indicating a strong linear fit.
  • Intercept: 28.4074, the estimated value of Y when X is zero.
  • Slope: -7.2882, indicating a negative relationship between X and Y.
  • The F-statistic is highly significant (p < 0.001), suggesting that X is a significant predictor of Y.

Let’s proceed with residual diagnostics to check for non-linearity, non-constant variance, and non-normality of the residuals, which will help confirm whether a straight-line model is appropriate for data generated by a quadratic function. After this, I’ll update the model to include a quadratic term and repeat the diagnostics.

## 
## Call:
## lm(formula = Y ~ X, data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50899 -0.40558  0.08469  0.51438  1.41226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.4074     0.5300   53.60   <2e-16 ***
## X            -7.2882     0.1721  -42.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7817 on 98 degrees of freedom
## Multiple R-squared:  0.9482, Adjusted R-squared:  0.9476 
## F-statistic:  1793 on 1 and 98 DF,  p-value: < 2.2e-16
## Estimated Regression Equation: Y = 28.40741 + -7.288231 * X
## R-squared: 0.9481725
## P-values for model coefficients:
##  (Intercept)            X 
## 2.031255e-74 8.511546e-65
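
A sketch of the fit and the residual diagnostics described above (the object name model_lin is an assumption).

    model_lin <- lm(Y ~ X, data = my_data)   # simple linear fit
    summary(model_lin)
    # Standard diagnostics: residuals vs fitted, Q-Q, scale-location, Cook's distance
    par(mfrow = c(2, 2)); plot(model_lin); par(mfrow = c(1, 1))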

2c: Add Quadratic Term

Quadratic Model Analysis for Y ~ X + X^2:

  • R-squared: 0.9802, a notable improvement over the simple linear model, confirming that including the quadratic term enhances the model’s explanatory power.
  • Coefficients: the intercept is 8.4735, close to the true value in the generating function; the coefficient for X is 6.0390 and the coefficient for X^2 is -2.1784, both significant and closely aligned with the parameters used to generate the data.

The quadratic model fits the data better than the linear model, capturing the non-linear relationship between X and Y as expected.

## 
## Call:
## lm(formula = Y ~ X + I(X^2), data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96841 -0.36457 -0.05954  0.32720  1.66598 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.4735     1.6230   5.221 1.01e-06 ***
## X             6.0390     1.0679   5.655 1.57e-07 ***
## I(X^2)       -2.1784     0.1737 -12.543  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4852 on 97 degrees of freedom
## Multiple R-squared:  0.9802, Adjusted R-squared:  0.9798 
## F-statistic:  2405 on 2 and 97 DF,  p-value: < 2.2e-16
## Estimated Quadratic Regression Equation: Y = 8.473529 + 6.038958 * X + -2.1784 * X^2
## R-squared (Quadratic Model): 0.9802325
## P-values for quadratic model coefficients:
##  (Intercept)            X       I(X^2) 
## 1.013841e-06 1.570875e-07 5.088183e-22
## R-squared (Simple Linear Model): 0.9481725
## R-squared Improvement: 0.03206002
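
A sketch reproducing the comparison above, with a nested-model F-test added as a supplementary check (object names are assumptions).

    model_lin  <- lm(Y ~ X, data = my_data)
    model_quad <- lm(Y ~ X + I(X^2), data = my_data)
    summary(model_quad)$r.squared - summary(model_lin)$r.squared  # R-squared improvement
    anova(model_lin, model_quad)   # F-test for the added quadratic term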

2d: Variance Inflation Factor (VIF) for Quadratic Model

Variance Inflation Factor (VIF) results: Without centering, both X and X^2 have a VIF of approximately 99.89, indicating severe multicollinearity. With centering, the VIFs for X_centered and X_centered^2 are nearly 1, showing that centering effectively eliminates the multicollinearity.

The quadratic model captures the non-linear relationship in the data better than the simple linear model. Centering the predictor resolved multicollinearity issues, making the quadratic model more reliable.

## VIF for Quadratic Model:
##        X   I(X^2) 
## 99.89091 99.89091
## Interpretation:
## There may be moderate to high multicollinearity as some VIF values exceed the threshold of 5.
## VIF for Centered Quadratic Model:
##      X_centered I(X_centered^2) 
##        1.001994        1.001994
## Interpretation of Centered Model:
## Centering has reduced multicollinearity to acceptable levels, as VIF values are within limits.

2e: Centering X to Reduce Variance Inflation Factor (VIF) in Quadratic Model

Results and Interpretation

Without Centering:

VIF for X: 99.89
VIF for X^2: 99.89
High VIF values indicate severe multicollinearity, likely due to the high correlation between X and X^2. This can inflate standard errors, making coefficient estimates unreliable.

With Centering:

VIF for Centered X: ~1
VIF for Centered X^2: ~1
Centering reduces the VIF values to approximately 1, effectively eliminating the multicollinearity. This happens because X and X^2 are almost perfectly correlated when X takes only positive values far from zero; after subtracting the mean, the centered predictor and its square are nearly uncorrelated, so each term contributes essentially independent information.
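
A sketch of the centering step and the VIF comparison; the name X_centered matches the 2d output, and the remaining object names are assumptions.

    library(car)
    my_data$X_centered <- my_data$X - mean(my_data$X)            # center the predictor
    vif(lm(Y ~ X + I(X^2), data = my_data))                      # uncentered: VIFs near 100
    vif(lm(Y ~ X_centered + I(X_centered^2), data = my_data))    # centered: VIFs near 1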

Question 3:



3a. Assumptions of Linear Regression Model

  • Linearity: The relationship between the predictor(s) and the response variable is linear. This implies that the expected value of the response variable Y can be expressed as a linear function of the predictors.
  • Independence: Observations are independent of each other. This assumption means that the residuals (errors) for each observation are uncorrelated with each other. For time-series data, this assumption might not hold due to serial correlation, but for cross-sectional data, independence is typically assumed.
  • Homoscedasticity: The variance of the error term 𝜖 is constant across all levels of the predictor variable(s). This condition implies that the spread of the residuals should not change as the predicted values change. When this assumption holds, it is known as homoscedasticity; if it is violated, we have heteroscedasticity, which may affect the reliability of statistical tests.
  • Normality of errors: The error terms 𝜖 are normally distributed with mean zero. This assumption is particularly important for constructing confidence intervals and conducting hypothesis tests about the regression coefficients; the errors’ distribution should be approximately normal to make valid inferences from the model.
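
In symbols, these four assumptions describe the standard linear model:

    Y_i = β0 + β1·x_i1 + … + βp·x_ip + 𝜖_i,   with the 𝜖_i independent N(0, σ²) errors, i = 1, …, n.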

3b. Distribution of Residuals

In general, if the assumptions of the linear regression model hold, the residuals should be (approximately) uncorrelated. This stems from the independence assumption: the error for one observation does not influence or relate to the errors of the others. In ordinary least squares (OLS) regression, if the model is correctly specified, the residuals, which estimate these errors, should likewise look random and uncorrelated.

Residual Uncorrelation
Residuals are generally uncorrelated if the model assumptions hold.

Distribution of Residuals
The residuals should ideally follow a normal distribution centered around zero. This follows from the normality assumption on the error term, which implies that the residuals are randomly scattered around a mean of zero with constant variance across observations.

3c. Residuals vs Fitted Values

Plotting the residuals against the fitted values helps diagnose potential issues with the model, such as non-linearity, heteroscedasticity, and outliers. The fitted values represent the model’s predictions, so plotting the residuals against them shows how well the model fits across the full range of predicted values.

If we plotted the residuals against the observed values instead, we would lose this interpretative insight. In OLS the residuals are uncorrelated with the fitted values by construction, but they are positively correlated with the observed values, so a residuals-versus-observed plot would show a systematic trend even when the model is correct. The observed values contain the unexplained variability, while the fitted values reflect the model’s systematic component; plotting residuals against fitted values therefore directly tests the model’s fit and whether the assumptions are met.

3d. Multicollinearity

Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. In such cases, the predictors do not provide independent information, which inflates the variances of the OLS coefficient estimates, makes them unstable, and makes it difficult to isolate each predictor’s individual effect on the response variable.

Computational Problems Caused by Multicollinearity

  • Unstable estimates: Multicollinearity makes the X′X matrix nearly singular, which inflates the variances of the estimated regression coefficients and makes them highly sensitive to small changes in the model or data. This instability can lead to unreliable coefficient estimates that vary substantially with small alterations in the data.

  • Interpretational issues: When predictors are highly correlated, it becomes difficult to interpret the effect of each predictor independently. The coefficients may not accurately reflect each variable’s unique contribution to the response variable.

  • Increased standard errors: Multicollinearity inflates the standard errors of the coefficients, which weakens statistical tests. Large standard errors make it hard to determine whether predictors are statistically significant, reducing the model’s explanatory power and potentially obscuring meaningful relationships.

To diagnose multicollinearity, metrics such as the Variance Inflation Factor (VIF) are used. High VIF values indicate a high degree of multicollinearity, and centering predictors (subtracting the mean before forming powers or products) can sometimes reduce it substantially, especially for polynomial terms.