Introduction

Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that influence the percentage of alumni who donate, they might be able to implement policies that lead to increased revenues. Research shows that students who are more satisfied with their contact with teachers are more likely to graduate. As a result, one might suspect that smaller class sizes and lower student-faculty ratios lead to a higher percentage of satisfied graduates, which in turn might lead to a higher alumni giving rate. To examine these and other factors that can affect the alumni donation rate, we take a dataset of 48 national universities (America’s Best Colleges, Year 2000 Edition) and fit several linear regression models to find the one that best answers this question.

Data Preparation and EDA

Let us read the dataset and take a quick look at the data types of the variables.

## Observations: 48
## Variables: 5
## $ school                      <chr> "Boston College", "Brandeis Univer...
## $ percent_of_classes_under_20 <int> 39, 68, 60, 65, 67, 52, 45, 69, 72...
## $ student_faculty_ratio       <int> 13, 8, 8, 3, 10, 8, 12, 7, 13, 10,...
## $ alumni_giving_rate          <int> 25, 33, 40, 46, 28, 31, 27, 31, 35...
## $ private                     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

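Now, let us look at the individual summaries of the variables. The outputs below are, in order, percent_of_classes_under_20, student_faculty_ratio, alumni_giving_rate, and a frequency table of private; school is an identifier and is not summarized.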

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   44.75   59.50   55.73   66.25   77.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    8.00   10.50   11.54   13.50   23.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   18.75   29.00   29.27   38.50   67.00
## 
##  0  1 
## 15 33

Before looking at the distributions, let us check whether there are any outliers.

We see that there are no outliers. Let us move on to the distributions.

Correlations among the predictors do not seem to be an issue here. We see a somewhat non-linear relationship between alumni giving rate and percent of classes under 20. This might be a problem, and we might have to apply a transformation to correct it.

The variable student faculty ratio is skewed to the right, with the bulk of the observations having a student faculty ratio between 5 and 10. The percent of classes under 20 is skewed to the left, with most schools having between 60% and 70% of their classes under 20. Also, most of the schools have roughly 20% to 40% of their alumni making donations.

Now, let us look at how the categorical variable behaves.

The slopes do not look parallel across the two levels of private for either predictor, percent of classes under 20 or student faculty ratio. So we might have to include interaction terms between percent of classes under 20 and private, as well as between student faculty ratio and private.
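
Written out, a model with both interaction terms would take roughly the following form. The symbols are assumptions introduced here for illustration: y is alumni_giving_rate, x_1 is percent_of_classes_under_20, x_2 is student_faculty_ratio, and p is the 0/1 indicator private.

```latex
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 p
    + \beta_4 (x_1 p) + \beta_5 (x_2 p) + \varepsilon
\]
% Slope of x_1: \beta_1 when p = 0, and \beta_1 + \beta_4 when p = 1.
% Slope of x_2: \beta_2 when p = 0, and \beta_2 + \beta_5 when p = 1.
```

Non-parallel slopes in the plots correspond to \beta_4 or \beta_5 being different from zero; if both are zero, the two groups differ only by the intercept shift \beta_3.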

Modeling

Let us use forward selection, backward elimination and step-wise selection to decide on the best performing model. We use BIC as the selection criterion for this part of the modeling.

Now that we have built the 3 models using forward selection, backward elimination and step-wise selection, let us compare the model metrics to decide how to proceed. For this, let us define a function that provides the metrics used to compare the models.

##          fit_be   fit_fs fit_step
## AIC     352.196  352.196  352.196
## BIC     357.810  357.810  357.810
## adjR2     0.541    0.541    0.541
## RMSE      9.103    9.103    9.103
## PRESS  4138.880 4138.880 4138.880
## nterms    2.000    2.000    2.000
## 
## Call:
## lm(formula = alumni_giving_rate ~ student_faculty_ratio, data = alumni_copy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.328  -5.692  -1.471   4.058  24.272 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            53.0138     3.4215  15.495  < 2e-16 ***
## student_faculty_ratio  -2.0572     0.2737  -7.516 1.54e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.103 on 46 degrees of freedom
## Multiple R-squared:  0.5512, Adjusted R-squared:  0.5414 
## F-statistic: 56.49 on 1 and 46 DF,  p-value: 1.544e-09
## [1] 9.102516
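
As a cross-check on the comparison table, the AIC and BIC values can be recomputed from the residual standard error alone. The sketch below is in Python for illustration (the report itself was produced in R) and assumes the usual Gaussian log-likelihood for a linear model, with the error variance counted as one extra parameter; the function name `aic_bic_from_rse` is hypothetical.

```python
import math


def aic_bic_from_rse(rse, n, n_coef):
    """Return (AIC, BIC) of a Gaussian linear model.

    rse    : residual standard error reported by the model summary
    n      : number of observations
    n_coef : number of regression coefficients, including the intercept
    """
    df_resid = n - n_coef
    rss = rse ** 2 * df_resid  # residual sum of squares
    # Maximized Gaussian log-likelihood, using the ML variance estimate rss / n.
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    k = n_coef + 1  # assumption: the error variance counts as a parameter
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + math.log(n) * k
    return aic, bic


# Values taken from the output above: RSE 9.102516, 48 observations, 2 terms.
aic, bic = aic_bic_from_rse(rse=9.102516, n=48, n_coef=2)
print(f"AIC = {aic:.3f}, BIC = {bic:.3f}")
```

The two criteria share the same likelihood term and differ only in the penalty per parameter (2 for AIC, log(48) for BIC), which is why BIC tends to favor the smaller model here.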

Residual Diagnostics

All 3 approaches arrive at the same model, in which student_faculty_ratio is the only significant variable contributing to the variability of the response. Now let us check the diagnostics.

We do not see any mean structure, but there might be a problem with non-constant variance. The distribution is skewed to the right, which might be problematic given that we have only 48 observations. Serial correlation is not a problem here, as the gaps between the points look random and show no pattern. The scatter appears random: we do not see any shape other than the smoothing line trying to follow the points. We do see an outlier that changes the direction of the fitted line; this point is influential.

Transformation (Box-Cox Procedure)

As we see a problem with non-constant variance as well as with non-normality, we will apply a Box-Cox transformation to check whether we can fix this.

## [1] 0.4242424
## 
## Call:
## lm(formula = alumni_giving_rate2 ~ student_faculty_ratio, data = alumni)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6870 -0.7874 -0.2050  0.6987  3.1568 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           10.95027    0.49036  22.331  < 2e-16 ***
## student_faculty_ratio -0.32134    0.03923  -8.192 1.55e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.305 on 46 degrees of freedom
## Multiple R-squared:  0.5933, Adjusted R-squared:  0.5844 
## F-statistic:  67.1 on 1 and 46 DF,  p-value: 1.546e-10
## [1] 1.30457

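The lambda value for the transformation is 0.42424. The proportion of variance explained increased by about 4 percentage points (R-squared from 0.5512 to 0.5933). The residual standard error is now 1.30, compared with 9.10 before. Note that these two values are on different scales (transformed versus original response), so the drop of ~85% should not be read as an equivalent gain in predictive accuracy.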
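
One caveat when reading the two residual standard errors side by side: the transformation compresses the response, so errors shrink simply because the units change. The Python sketch below (for illustration only) assumes the standard Box-Cox form (y^lambda - 1) / lambda and applies it to the minimum and maximum of alumni_giving_rate from the summary output; the helper name `box_cox` is hypothetical.

```python
import math

LAMBDA = 0.4242424  # lambda reported by the Box-Cox procedure


def box_cox(y, lam=LAMBDA):
    """Box-Cox transform of a positive value y (log(y) when lam is 0)."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive response")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# Minimum and maximum of alumni_giving_rate from the summary output.
y_min, y_max = 7.0, 67.0

range_original = y_max - y_min
range_transformed = box_cox(y_max) - box_cox(y_min)

print(f"range on original scale:    {range_original:.2f}")
print(f"range on transformed scale: {range_transformed:.2f}")
```

The full range of the response goes from 60 percentage points to under 9 transformed units, so a smaller residual standard error is expected regardless of fit quality. A like-for-like comparison would back-transform the fitted values and compute the error on the original scale.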

Residual Diagnostics

Let us look at the model diagnostics again.

We do not see any mean structure, and the Box-Cox transformation seems to have fixed the non-constant variance. The distribution is still skewed to the right, which might be problematic given that we have only 48 observations; we were not able to fix this through the Box-Cox method. Serial correlation is not a problem here, as the gaps between the points look random and show no pattern. The scatter appears random: we do not see any shape other than the smoothing line trying to follow the points. We do see an outlier that changes the direction of the fitted line; this point is influential.

So our final model is alumni_giving_rate = (0.42424 * (10.95027 - 0.32134 * student_faculty_ratio) + 1) ^ (1 / 0.42424).
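
To make the final equation concrete, the Python sketch below (for illustration only) evaluates it at a few student faculty ratios. The coefficients and lambda are copied from the model output above; the function name `predict_giving_rate` is hypothetical, and the ratios 3 and 23 are the minimum and maximum seen in the summary.

```python
LAMBDA = 0.42424      # Box-Cox lambda from the report
INTERCEPT = 10.95027  # fitted intercept on the transformed scale
SLOPE = -0.32134      # fitted slope for student_faculty_ratio


def predict_giving_rate(student_faculty_ratio):
    """Back-transform the fitted line to the alumni giving rate in percent."""
    z = INTERCEPT + SLOPE * student_faculty_ratio  # transformed-scale prediction
    base = LAMBDA * z + 1.0
    if base <= 0:
        raise ValueError("prediction falls outside the range of the inverse transform")
    return base ** (1.0 / LAMBDA)  # inverse of (y**LAMBDA - 1) / LAMBDA


for ratio in (3, 10, 23):
    print(ratio, round(predict_giving_rate(ratio), 1))
```

Because the slope is negative, the predicted giving rate falls as the student faculty ratio rises. The back-transformed value is better read as a typical (median-like) giving rate than as a mean, since the inverse transform is non-linear.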

Model summary: adjusted R^2 is 0.5844 and the residual standard error is 1.30 on the Box-Cox transformed scale.