The data used for this analysis contains 5 variables and 48 observations. For the purposes of this analysis, we consider the following four variables:
| Variable | Description | Observations/Changes Made |
|---|---|---|
| alumni_giving_rate | The rate at which alumni give donations to their school | This is the output variable (Y) |
| percent_of_classes_under_20 | Percent of classes at the school with fewer than 20 students | One of the independent variables |
| student_faculty_ratio | Ratio of students to faculty at the school | One of the independent variables |
| private | Indicates whether the school is private or public | One of the independent variables (categorical). To be analyzed as a factor |
Below is a preliminary data analysis of the variables we will use for our linear regression model. Looking at the two scatter plots in column 1 (Alumni Giving Rate), we can see that the points show an association between the output variable and both continuous independent variables, i.e. “Percent of Classes under 20” and “Student Faculty Ratio”.
The following data distributions can be inferred from the above plots:
| Variable | Data Distribution | Correlation with Y |
|---|---|---|
| Alumni Giving Rate (Y) | Right skewed | - |
| Percent of Classes under 20 | Left skewed | 0.646 |
| Student Faculty Ratio | Right skewed | -0.742 |
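The "Correlation with Y" column reports pairwise correlation coefficients. As a minimal sketch, assuming these are Pearson correlations, the calculation looks as follows in plain Python. The function name `pearson` and the sample values are illustrative assumptions and are not taken from the alumni data set.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for illustration only (NOT the alumni data)
student_faculty_ratio = [5, 8, 10, 13, 17, 20]
alumni_giving_rate = [45, 40, 33, 25, 18, 12]

r = pearson(student_faculty_ratio, alumni_giving_rate)
print(round(r, 3))  # negative: giving rate falls as the ratio rises
```

A value near -1 or +1 indicates a strong linear association, and the sign gives its direction, which is how the -0.742 for Student Faculty Ratio should be read.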
For the one categorical variable (Private), we can first check whether it introduces any noticeable change in our output variable Y by plotting a color-coded scatter plot.
From the above scatter plot we can infer that our categorical variable does have an effect when Y is plotted against “Percent of Classes under 20”. Most of the public schools appear in the bottom-left section, whereas private schools sit toward the top center and right.
To explore this further, we can compare the two groups visually via the box plots above. The boxes differ in both location and size (interquartile range), and so do their medians. Private schools generally have higher alumni giving rates than public schools.
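The quantities a box plot displays can also be computed directly. The sketch below uses Python's `statistics` module to obtain the quartiles and interquartile range per group. The helper `box_summary` and the sample values are hypothetical and are not the alumni data.

```python
import statistics

def box_summary(values):
    """Quartiles and interquartile range, i.e. the box of a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

# Hypothetical giving rates per school type (illustration only)
giving_rate = {
    "private": [22, 28, 31, 35, 40, 46, 53],
    "public": [8, 11, 13, 16, 19, 24],
}

summary = {group: box_summary(vals) for group, vals in giving_rate.items()}
for group, stats in summary.items():
    print(group, stats)
```

Comparing the medians and interquartile ranges of the two groups gives the same reading as the box plots, without relying on visual judgment.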
From the above analysis, we can see that all three independent variables help describe our output variable, Alumni Giving Rate. As we move forward to assess different models, our universe of variables will consist of all three independent variables.
In this section, we evaluate different models and choose the best fit for our output variable.
The following model, containing all three independent variables, was fitted first:
\[ Y = {b_0} + {b_1} * PctOfClassesUnd.20 + {b_2}* StdFacRatio + {b_3} * Private + \epsilon \]
| Variable | P-value |
|---|---|
| Intercept | 0.01005 |
| Percent of Classes < 20 | 0.66768 |
| Student Faculty Ratio | 0.00889 |
| Private | 0.24693 |
The above model describes Y reasonably well, as indicated by its Adjusted R-squared value of 0.55. However, one variable, Percent of Classes under 20, does not appear to be significant: the p-value of its t-test is far above 0.05 (the target alpha). Moving forward, we evaluate models that combine Percent of Classes under 20 and Private with some interactions.
| Model | Regression Equation |
|---|---|
| A | \[ Y = {b_0} + {b_1} * PctOfClassesUnd.20 + {b_2}* StdFacRatio + {b_3} * Private + \epsilon \] |
| B | \[ Y = {b_0} + {b_1} * StdFacRatio + {b_2} * Private + \epsilon \] |
| C | \[ Y = {b_0} + {b_1} * StdFacRatio + {b_2} * Private + {b_3} * (PctOfClassesUnd.20 * Private) + \epsilon \] |
| D | \[ Y = {b_0} + {b_1} * StdFacRatio + {b_2} * Private + {b_3} * (PctOfClassesUnd.20 * StdFacRatio) + \epsilon \] |
| E | \[ Y = {b_0} + {b_1} * StdFacRatio + {b_2} * Private + {b_3} * (StdFacRatio * Private) + \epsilon \] |
Of the models above, Model D gave the highest R-squared value, 0.61. However, the p-values for almost all of its variables were above 0.05 and therefore insignificant. For now, we move forward with Model B, which has an R-squared value of 0.57 and significant p-values for all included variables.
\[ Y = 41.4294 - 1.4863 * StdFacRatio + 7.2669 * Private + \epsilon \]
Here the error term \(\epsilon\) has an estimated variance of 80.6. In this model, the Adjusted R-squared is 0.55 and the p-value for the F-statistic is 4.877e-09. Both indicate that the regression model is significant as a whole. The individual t-test p-values are also significant for the intercept, student/faculty ratio, and private.
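The reported R-squared (0.57) and Adjusted R-squared (0.55) of Model B are linked by the standard adjustment for the number of predictors. The short check below assumes n = 48 observations and p = 2 predictors (student/faculty ratio and private); the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Model B: R-squared 0.57, 48 observations, 2 predictors
print(round(adjusted_r2(0.57, n=48, p=2), 2))  # → 0.55
```

The penalty grows with the number of predictors, which is why Adjusted R-squared is the fairer figure when comparing models of different sizes.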
Before proceeding, we also need to analyze residuals for our selected model.
From the above plots, it is evident that the residuals follow a linear pattern in the left-side graph but not in the right-side graph. This indicates that our best-fitting model needs to be improved to capture the trend seen in the actual data.
One remedy is to transform Y via the Box-Cox method. Using maximum likelihood estimation, this method gives us a lambda value for the transformation of Y.
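To illustrate how Box-Cox selects lambda, the sketch below runs a grid search over the profile log-likelihood. It is deliberately simplified: it uses an intercept-only model, whereas the analysis here applies the method to the regression model, and the data values are hypothetical. The function names `boxcox` and `profile_loglik` are illustrative.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform: (y^lam - 1) / lam, or log(y) when lam is 0."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_loglik(y, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    n = len(y)
    z = boxcox(y, lam)
    mean_z = sum(z) / n
    sigma2 = sum((v - mean_z) ** 2 for v in z) / n  # ML variance estimate
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(sigma2) + jacobian

# Hypothetical right-skewed positive values (NOT the alumni data)
y = [7, 9, 12, 13, 18, 21, 25, 29, 31, 40, 46, 67]

# Evaluate lambda on a grid from -2 to 2 and keep the maximizer
grid = [i / 100 for i in range(-200, 201)]
best_lambda = max(grid, key=lambda lam: profile_loglik(y, lam))
print(best_lambda)
```

For a regression model the same idea applies, except that the variance estimate comes from the residuals of the regression fitted to the transformed response rather than from deviations around the mean.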
From the above transformation, we get a lambda of 0.34. After we apply it to our output variable, Alumni Giving Rate (Y^lambda), we get the following equation for our transformed model:
\[ Y^{0.34} = 3.47 - 0.056 * StdFacRatio + 0.354 * Private + \epsilon \]
Here the error term \(\epsilon\) has an estimated variance of 0.11. For this model, we get an R-squared value of 0.63 and an Adjusted R-squared of 0.61, which is the best value we have obtained so far. The p-value for the F-statistic is 1.927e-10, which is significant. Further, the table below shows that all individual t-test p-values are less than 0.05 and therefore significant.
| Variable | P-value |
|---|---|
| Intercept | 7.63e-15 |
| Student-Faculty ratio | 0.00187 |
| Private | 0.04881 |
Let’s check whether the transformation resolved the non-linearity in the residuals-versus-fitted plot.
Comparing the right-side graph to the one before transformation, we can see that our residuals are now much closer to the zero line (note that the y-axis scales of the two graphs differ).
As a final step in our analysis, we can run stepwise selection by AIC (the step function), starting from all variables, and see which model it proposes.
## Start: AIC=-102.48
## tranf.alumni_giving_rate ~ student_faculty_ratio + private +
## percent_of_classes_under_20
##
## Df Sum of Sq RSS AIC
## - percent_of_classes_under_20 1 0.00240 4.8072 -104.452
## <none> 4.8048 -102.476
## - private 1 0.33491 5.1397 -101.242
## - student_faculty_ratio 1 0.94339 5.7482 -95.871
##
## Step: AIC=-104.45
## tranf.alumni_giving_rate ~ student_faculty_ratio + private
##
## Df Sum of Sq RSS AIC
## <none> 4.8072 -104.452
## - private 1 0.43812 5.2453 -102.266
## - student_faculty_ratio 1 1.16694 5.9741 -96.021
As can be seen from the above results, the model we evaluated is in fact the model proposed by the AIC step function.
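The AIC values in the stepwise output can be reproduced from the residual sums of squares (RSS). The sketch below assumes the form used for linear models in stepwise selection, n * log(RSS / n) + 2 * (number of estimated coefficients), with n = 48. The function name `step_aic` is illustrative.

```python
import math

def step_aic(rss, n, n_coef, k=2.0):
    """AIC up to an additive constant: n * log(RSS / n) + k * n_coef."""
    return n * math.log(rss / n) + k * n_coef

n = 48
# Full model: intercept + 3 predictors, RSS taken from the output above
full = step_aic(4.8048, n, n_coef=4)
# After dropping percent_of_classes_under_20: intercept + 2 predictors
reduced = step_aic(4.8072, n, n_coef=3)

print(round(full, 2), round(reduced, 2))  # → -102.48 -104.45
```

Dropping percent_of_classes_under_20 barely changes the RSS but saves one coefficient, so the AIC decreases and the smaller model is preferred.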
\[ Y^{0.34} = 3.47 - 0.056 * StudentFacultyRatio + 0.354 * Private + \epsilon \]
Above is the regression equation of our final proposed model, where the error term \(\epsilon\) has an estimated variance of 0.11. It must be used carefully: since we have transformed the Y variable, predictions need to be transformed back (raised to the power 1/0.34) for real-world applications.
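As a minimal sketch of the back-transformation, the code below evaluates the fitted equation on the transformed scale and raises the result to the power 1/lambda. The coefficients and lambda = 0.34 come from the model above; the function name `predict_giving_rate` and the example inputs are hypothetical.

```python
LAMBDA = 0.34
B0, B_RATIO, B_PRIVATE = 3.47, -0.056, 0.354

def predict_giving_rate(student_faculty_ratio, private):
    """Predict alumni giving rate on the original scale.

    `private` is 1 for a private school and 0 for a public one.
    """
    transformed = B0 + B_RATIO * student_faculty_ratio + B_PRIVATE * private
    if transformed <= 0:
        # A fractional power of a non-positive number is not a real number
        raise ValueError("prediction is outside the range of the transform")
    return transformed ** (1.0 / LAMBDA)

# Hypothetical schools with a student/faculty ratio of 10
private_rate = predict_giving_rate(10, private=1)
public_rate = predict_giving_rate(10, private=0)
print(round(private_rate, 1), round(public_rate, 1))
```

Because the transformation is non-linear, the coefficients no longer describe constant changes in the giving rate: the effect of one extra student per faculty member depends on where on the curve the school sits.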
To improve the R-squared value of our final model, we could explore adding interaction terms. As part of our analysis, we found that Model D had an R-squared value of 0.61, which was quite high for a base model, but we could not justify it because the p-values of almost all its individual variables were insignificant. This case needs to be investigated to further improve our model.
“Box-Cox Transformation - Cornell University.” Accessed November 23, 2022. https://www.css.cornell.edu/faculty/dgr2/_static/files/R_html/Transformations.html.
Alumni Donation Data