Introduction

The data used for this analysis contains 5 variables and 48 observations. For the purposes of data analysis we will be considering the following variables:

Data Source: Alumni Donation Data
Variable	Description	Observations/Changes Made
alumni_giving_rate	The rate at which alumni give donations to their school	This is the output variable (Y)
percent_of_classes_under_20	Percent of classes at the school with fewer than 20 students	One of the independent variables
student_faculty_ratio	Ratio of students to faculty at the school	One of the independent variables
private	Indicates if school is private or public	One of the independent variables - categorical. To be analyzed as a factor

Data Analysis and Description

Alumni Giving Rate vs Numerical Variables

Below is a preliminary data analysis of the variables we will be using for our Linear Regression model. Looking at the two scatter plots in column 1 (Alumni Giving Rate), we can see that the points show an association with both independent variables, i.e. “Percent of Classes under 20” and “Student Faculty Ratio”.

The following data distributions can be inferred from the above plots:

None of the distributions are normal. The correlations of both independent variables with Y are high enough to warrant further exploration.
Variable	Data Distribution	Correlation with Y
Alumni Giving Rate (Y)	Right skewed	-
Percent of Classes under 20	Left skewed	0.646
Student Faculty Ratio	Right skewed	-0.742
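As a rough illustration of how the correlations in the table are defined, the sketch below computes a Pearson correlation coefficient by hand. The data values are made up for the example and are not taken from the alumni data, so the result will not match the table.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only; not the alumni data.
student_faculty_ratio = [3, 8, 10, 13, 19, 23]
alumni_giving_rate = [46, 40, 31, 28, 13, 8]

r = pearson_r(student_faculty_ratio, alumni_giving_rate)
print(round(r, 3))
```

A negative value here mirrors the sign reported for Student Faculty Ratio: as the ratio goes up, the giving rate tends to go down.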

Alumni Giving Rate vs Categorical Variable

For the one categorical variable (Private), we can first check whether it introduces any notable change in our output variable Y by plotting a color-coded scatter plot.

From the above scatter plot we can infer that our categorical variable does have an effect when Y is plotted in relation to “Percent of Classes”. Most of the public schools can be seen in the bottom left section, whereas private schools are toward the top center and right.

To explore further, we can compare the two groups visually via the box plots above. We can see the difference in the location and size of the boxes (the interquartile ranges) and in the center lines of the two box plots. Private schools generally have higher alumni giving rates than public schools.

Final Remarks on Variables

From the above analysis, we can see that all the independent variables help describe our output variable, Alumni Giving Rate. As we move forward to assess different models, our universe of variables will consist of all three independent variables.

Modeling Methods and Results

In this section, we evaluate different models and choose the best fit for our output variable.

The following models were fitted to find the best one:

Model A: All variables were included

\[ Y = {b_0} + {b_1} * PctOfClassesUnd.20 + {b_2}* StdFacRatio + {b_3} * Private + \epsilon \]

**R-squared**: 0.57 | **Adjusted R-squared**: 0.55 | **P-value for F-statistic**: 2.818e-08
Variable	P-value
Intercept	0.01005
Percent of Classes < 20	0.66768
Student Faculty Ratio	0.00889
Private	0.24693

The above model indicates a reasonably good fit for Y, with an Adjusted R-squared value of 0.55. However, Percent of Classes under 20 does not appear to be significant: its p-value for the t-test (0.67) is far above the target alpha of 0.05. Private is also above 0.05 in this model (0.25). Moving forward we will evaluate models with combinations of Percent of Classes under 20 and Private, including some interactions.

Model	Regression Equation
A	\[ Y = {b_0} + {b_1} * PctOfClassesUnd.20 + {b_2}* StdFacRatio + {b_3} * Private + \epsilon \]

B	\[ Y = {b_0} + {b_1} * StdFacRatio + {b_2} * Private + \epsilon \]

C	\[ Y = {b_0} + {b_1} * StdFacRatio + {b_2} * Private + {b_3} * (PctOfClassesUnd.20 * Private) + \epsilon \]

D	\[ Y = {b_0} + {b_1} * StdFacRatio + {b_2} * Private + {b_3} * (PctOfClassesUnd.20 * StdFacRatio) + \epsilon \]

E	\[ Y = {b_0} + {b_1} * StdFacRatio + {b_2} * Private + {b_3} * (StdFacRatio * Private) + \epsilon \]

From the above, the highest R-squared value, 0.61, was given by Model D. However, the p-values for almost all of its variables were above 0.05 and therefore not significant. For now, we will move forward with Model B, which has an R-squared value of 0.57 and significant p-values for all included variables.
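To make the interaction terms in Models C to E concrete, the sketch below builds one design-matrix row for Model E. It assumes Private is coded as 1 for private schools and 0 for public schools; the function name is hypothetical and only serves the illustration.

```python
def model_e_row(student_faculty_ratio, private):
    """One design-matrix row for Model E.

    Columns: intercept, StdFacRatio, Private, StdFacRatio * Private.
    Assumes private is coded 1 (private) or 0 (public).
    """
    return [1.0, student_faculty_ratio, private, student_faculty_ratio * private]

# Two hypothetical schools with the same student/faculty ratio.
public_row = model_e_row(12, 0)
private_row = model_e_row(12, 1)
print(public_row, private_row)
```

Under this coding the interaction column is zero for every public school, so its coefficient only changes the slope of StdFacRatio for private schools.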

Model B:

\[ Y = 41.4294 - 1.4863 * StdFacRatio + 7.2669 * Private + \epsilon \]

Here the estimated variance of the error term is about 80.6. In this model, the Adjusted R-squared is 0.55 and the p-value for the F-statistic is 4.877e-09. Both indicate that the regression model is significant as a whole. The individual p-values for the t-tests are also significant for the intercept, student/faculty ratio, and private.
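As a quick consistency check, the Adjusted R-squared can be recomputed from the reported R-squared with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below assumes all 48 observations were used and counts 2 predictors for Model B; because the reported R-squared of 0.57 is itself rounded, the result is only approximate.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared from R-squared, sample size, and predictor count."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Model B: R-squared 0.57, 48 observations, 2 predictors (StdFacRatio, Private).
adj_b = adjusted_r_squared(0.57, 48, 2)
print(round(adj_b, 2))
```

The penalty factor (n - 1)/(n - p - 1) grows with each added predictor, which is why Adjusted R-squared is the fairer number when comparing models of different size.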

Before proceeding, we also need to analyze residuals for our selected model.

From the above plots, it is evident that the residuals follow a linear relationship in the left-side graph but not in the right-side graph. This indicates that our best-fitting model needs to be improved to capture the trend seen in the actual data.

Transformation:

One remedy is a transformation via the Box-Cox method. Using maximum likelihood estimation, this method gives us a lambda value for the transformation of Y.
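The idea behind the maximum likelihood step can be sketched with a simple grid search. This is a simplified illustration, not the procedure used in this analysis: it assumes an intercept-only model rather than the regression on StdFacRatio and Private, and it uses made-up values instead of the alumni data, so it will not reproduce the lambda reported in this section.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of a positive value for a given lambda."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def profile_loglik(ys, lam):
    """Profile log-likelihood of lambda for an intercept-only normal model."""
    n = len(ys)
    z = [boxcox(y, lam) for y in ys]
    mean_z = sum(z) / n
    sigma2 = sum((v - mean_z) ** 2 for v in z) / n  # ML variance estimate
    return -0.5 * n * math.log(sigma2) + (lam - 1.0) * sum(math.log(y) for y in ys)

# Made-up, right-skewed positive values; not the alumni data.
toy_y = [7, 9, 12, 13, 18, 22, 25, 31, 46, 67]

# Candidate lambdas from -2 to 2 in steps of 0.01.
grid = [k / 100 for k in range(-200, 201)]
best_lambda = max(grid, key=lambda lam: profile_loglik(toy_y, lam))
print(best_lambda)
```

For a nonzero lambda, the transform (Y^lambda - 1)/lambda differs from the plain power Y^lambda only by a shift and a rescaling, so a regression on either version gives the same R-squared and the same slope p-values.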

From the above transformation, we get a lambda of 0.34. After we apply it to our output variable, Alumni Giving Rate (Y^lambda), we get the following equation for our transformed model:

\[ Y^{0.34} = 3.47 - 0.056 * StdFacRatio + 0.354 * Private + \epsilon \]

Here the estimated variance of the error term is about 0.11. For this model, we get an R-squared value of 0.63 and an Adjusted R-squared of 0.61, which is the best value we have obtained so far. The p-value for the F-statistic is 1.927e-10, which is significant. The table below shows that all individual p-values for the t-tests are less than 0.05 and therefore significant.

P-values - Final Model
Variable	P-value
Intercept	7.63e-15
Student-Faculty ratio	0.00187
Private	0.04881

Let’s check whether the transformation improved the non-linearity seen in the residuals-versus-fitted plot.

Comparing the right-side graph to the one before the transformation, we can see that our residuals are now much closer to the zero line (note that the y-axis scales of the two graphs are different).

Final Model

As a final step in our analysis, we can run the AIC step function, starting from all variables, and see which model it proposes as the best.

## Start:  AIC=-102.48
## tranf.alumni_giving_rate ~ student_faculty_ratio + private + 
##     percent_of_classes_under_20
## 
##                               Df Sum of Sq    RSS      AIC
## - percent_of_classes_under_20  1   0.00240 4.8072 -104.452
## <none>                                     4.8048 -102.476
## - private                      1   0.33491 5.1397 -101.242
## - student_faculty_ratio        1   0.94339 5.7482  -95.871
## 
## Step:  AIC=-104.45
## tranf.alumni_giving_rate ~ student_faculty_ratio + private
## 
##                         Df Sum of Sq    RSS      AIC
## <none>                               4.8072 -104.452
## - private                1   0.43812 5.2453 -102.266
## - student_faculty_ratio  1   1.16694 5.9741  -96.021

As can be seen from the above results, the model we evaluated is in fact the model proposed by the AIC step function.
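The AIC values in the step output appear consistent with the formula n * log(RSS / n) + 2 * k, where k counts the fitted coefficients including the intercept. The sketch below recomputes two of them from the printed RSS values, assuming n = 48 observations; small differences can arise because the printed RSS is rounded.

```python
import math

def step_aic(rss, n_obs, n_params):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * k."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

# Full model: intercept + 3 predictors, RSS taken from the <none> row of the first step.
aic_full = step_aic(4.8048, 48, 4)

# Reduced model: intercept + 2 predictors, RSS taken from the <none> row of the second step.
aic_reduced = step_aic(4.8072, 48, 3)

print(round(aic_full, 2), round(aic_reduced, 2))
```

Dropping percent_of_classes_under_20 barely changes the RSS but removes one parameter, so the penalty term falls by 2 and the AIC improves.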

\[ Y^{0.34} = 3.47 - 0.056 * StudentFacultyRatio + 0.354 * Private + \epsilon, \quad \hat{\sigma}^2 = 0.11 \]

Above is the regression equation of our final proposed model. It has to be used carefully: since we have transformed the Y variable, predictions must be transformed back (raised to the power 1/0.34) for real-world applications.
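A minimal sketch of the back-transformation is shown below. It assumes Private is coded as 1 for private schools and 0 for public schools, uses the rounded coefficients from the final equation, and evaluates two hypothetical schools; the function name is illustrative only.

```python
LAMBDA = 0.34  # Box-Cox lambda reported for the final model

def predict_giving_rate(student_faculty_ratio, private):
    """Predict the alumni giving rate on the original scale.

    Assumes private is coded 1 (private) or 0 (public).
    """
    # Prediction on the transformed scale (Y ** LAMBDA).
    transformed = 3.47 - 0.056 * student_faculty_ratio + 0.354 * private
    # Undo the power transformation to get back to the original scale.
    return transformed ** (1.0 / LAMBDA)

# Hypothetical schools, not rows from the data set.
private_school = predict_giving_rate(10, 1)
public_school = predict_giving_rate(20, 0)
print(round(private_school, 1), round(public_school, 1))
```

Because the transformation is non-linear, a back-transformed prediction is not the expected giving rate in the strict sense, so it is best read as a typical value rather than a mean.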

Topics for Further Discussion/Exploration

To improve the R-squared value of our final model, we could explore adding interaction variables. As part of our analysis, we found that Model D had an R-squared value of 0.61, which was quite high for a base model, but we could not justify it because of the insignificant p-values for its individual variables. This case needs to be investigated to further improve our model.

References

“Box-Cox Transformation - Cornell University.” Accessed November 23, 2022. https://www.css.cornell.edu/faculty/dgr2/_static/files/R_html/Transformations.html.
Alumni Donation Data

Alumni Donation Case Study

Homework 5 - Linear Regression

Group work

2023-02-20