Content

EDA - Understanding the Data

##   ï..school         percent_of_classes_under_20 student_faculty_ratio
##  Length:48          Min.   :29.00               Min.   : 3.00        
##  Class :character   1st Qu.:44.75               1st Qu.: 8.00        
##  Mode  :character   Median :59.50               Median :10.50        
##                     Mean   :55.73               Mean   :11.54        
##                     3rd Qu.:66.25               3rd Qu.:13.50        
##                     Max.   :77.00               Max.   :23.00        
##  alumni_giving_rate    private      
##  Min.   : 7.00      Min.   :0.0000  
##  1st Qu.:18.75      1st Qu.:0.0000  
##  Median :29.00      Median :1.0000  
##  Mean   :29.27      Mean   :0.6875  
##  3rd Qu.:38.50      3rd Qu.:1.0000  
##  Max.   :67.00      Max.   :1.0000

The alumni giving dataset has five variables: the school name, the percentage of classes with fewer than 20 students, the student/faculty ratio, the alumni giving rate, and a binary indicator for whether the school is private. There are no missing values in the dataset. We examine the distribution of each variable using boxplots and histograms.
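As a minimal sketch of these first checks, the code below builds a small stand-in data frame and summarizes it. The data frame `alumni_demo` and all of its values are hypothetical; only the column names follow the summary output above.

```r
# Hypothetical stand-in for the alumni data (values are illustrative only).
alumni_demo <- data.frame(
  school = c("School A", "School B", "School C", "School D", "School E"),
  percent_of_classes_under_20 = c(39, 68, 60, 65, 44),
  student_faculty_ratio = c(13, 8, 8, 3, 19),
  alumni_giving_rate = c(25, 33, 40, 46, 12),
  private = c(1, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Per-column summaries, in the same layout as the output above.
print(summary(alumni_demo))

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column <- colSums(is.na(alumni_demo))
print(missing_per_column)

# Numeric ingredients of a histogram and a boxplot for one variable.
# Dropping plot = FALSE (or calling boxplot()) draws the figures instead.
bins <- hist(alumni_demo$percent_of_classes_under_20, plot = FALSE)
box <- boxplot.stats(alumni_demo$percent_of_classes_under_20)
print(bins$counts)
print(box$stats)
```

A column name such as `ï..school` typically appears when a CSV file that starts with a byte-order mark is read without declaring its encoding. Passing `fileEncoding = "UTF-8-BOM"` to `read.csv()` is the usual way to avoid it.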

None of the numeric variables follows a clearly recognizable distribution. The percentage of classes under 20 may be bimodal, so it is worth investigating whether one of our other variables, most likely the binary “private” indicator, accounts for a substantial difference between groups.

##   rank_giving alumni_giving_rate                          ï..school private
## 1         1.0                 67               Princeton University       1
## 2         2.0                 53                  Dartmouth College       1
## 3         3.0                 50                    Yale University       1
## 4         4.0                 49                   U. of Notre Dame       1
## 5         5.5                 46 California Institute of Technology       1
## 6         5.5                 46                 Harvard University       1
## 7         7.0                 45                    Duke University       1
## 8         8.0                 44  Massachusetts Inst. of Technology       1
## 9         9.0                 41                 U. of Pennsylvania       1

##    rank_giving alumni_giving_rate                      ï..school private
## 1          1.0                  7         U. of California-Davis       0
## 2          2.0                  8     U. of California-San Diego       0
## 3          3.0                  9        U. of California-Irvine       0
## 4          4.5                 12 U. of California-Santa Barbara       0
## 5          4.5                 12               U. of Washington       0
## 6          8.0                 13            New York University       1
## 7          8.0                 13   U. of California-Los Angeles       0
## 8          8.0                 13       U. of Michigan-Ann Arbor       0
## 9          8.0                 13             U. of Texas-Austin       0
## 10         8.0                 13        U. of Wisconsin-Madison       0

The highest-ranked schools for giving listed above are all private, and only one of the ten lowest-ranked schools (New York University) is private. This suggests a likely difference in alumni giving rates between private and public schools.
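The `rank_giving` column can be reproduced with base R's `rank()`, which averages tied positions by default. The sketch below uses the giving rates from the first seven rows of the top-schools table; the data frame name `top` is an assumption, and the original report may have built its ranking differently.

```r
# Giving rates copied from the first seven rows of the top-schools table.
top <- data.frame(
  school = c("Princeton University", "Dartmouth College", "Yale University",
             "U. of Notre Dame", "California Institute of Technology",
             "Harvard University", "Duke University"),
  alumni_giving_rate = c(67, 53, 50, 49, 46, 46, 45),
  private = 1,
  stringsAsFactors = FALSE
)

# Negate the rate so the highest giver gets rank 1. Ties share the average
# of their positions, so the two schools at 46 both receive rank 5.5.
top$rank_giving <- rank(-top$alumni_giving_rate)
print(top[order(top$rank_giving), c("rank_giving", "alumni_giving_rate", "school")])

# Share of private schools among the listed rows.
share_private <- mean(top$private == 1)
print(share_private)
```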

There is a clear distinction between private and public schools, both in the student/faculty ratio and in the % of classes with fewer than 20 students. The plotted relationship appears positive and linear, and the two groups do not seem to need a different intercept from a single linear fit. We therefore will not fit separate linear models for public and private schools, although the indicator should still be helpful in building our predictive model.

library(plotly)  # for interactive plotting

# 3-D scatter of the three continuous variables, colored by the private indicator
fig <- plot_ly(alumni,
               x = ~student_faculty_ratio,
               y = ~percent_of_classes_under_20,
               z = ~alumni_giving_rate,
               color = ~private, colors = c('#BF382A', '#0C4B8E'))
fig <- fig %>% add_markers()
# Axis titles match the variables mapped to x, y, and z
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Student/faculty ratio'),
                     yaxis = list(title = '% of classes under 20'),
                     zaxis = list(title = 'Alumni giving rate')))

fig

When viewed in 3-D, there is a clear linear trend across our continuous variables. The plot is interactive: zoom in and out, or rotate the axes, to view it from multiple angles.

We can see a negative linear relationship between the student/faculty ratio and the % of classes under 20. This makes intuitive sense: the greater the student/faculty ratio, the greater the number of classes with more than 20 students. Our binary variable separates the schools at the extremes of each continuous variable, with varying degrees of overlap in between.

We can see some strong positive and negative linear relationships between our numeric variables. However, multicollinearity does not appear to be a concern, as no pair of variables has a correlation with an absolute value greater than 0.9.
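The multicollinearity screen described above can be sketched with `cor()`. The data here are simulated purely for illustration (the name `demo` and every value are assumptions), and the 0.9 cutoff is the one used in the text.

```r
set.seed(1)
n <- 48

# Simulated stand-in data; only the column names mirror the real dataset.
sfr <- round(runif(n, 3, 23))
demo <- data.frame(
  student_faculty_ratio = sfr,
  percent_of_classes_under_20 = 80 - 2 * sfr + rnorm(n, sd = 10),
  alumni_giving_rate = 45 - 1.5 * sfr + rnorm(n, sd = 8)
)

# Pairwise Pearson correlations between the numeric variables.
cor_mat <- cor(demo)
print(round(cor_mat, 2))

# List variable pairs whose absolute correlation exceeds 0.9.
# upper.tri() keeps each pair once and drops the diagonal.
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
print(nrow(high))  # number of flagged pairs
```

An empty `high` means no pair crosses the threshold. This does not rule out milder collinearity among three or more predictors.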

B. Analyze the data set using linear regression models. Carry out model diagnostic analysis. If there are any violations of the model assumptions, propose and carry out possible remedies. Select the “best” model for the data set.

The minimum requirement for the data analysis includes: exploratory data analysis of your data set (summaries, plots, etc.), linear regression models and model diagnostic analysis, and appropriate remedies (e.g., transformations, if necessary). You will use the alumni giving rate as the response variable (Y) of interest. The potential predictors should include the percentage of classes with fewer than 20 students (X1), student/faculty ratio (X2), and the indicator variable private (X3) (i.e., a 1 indicates a private school).

## 
## Call:
## lm(formula = alumni_giving_rate ~ ., data = numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.757  -6.320  -2.273   5.152  25.669 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                 36.78364   13.67220   2.690  0.01005 * 
## percent_of_classes_under_20  0.07725    0.17873   0.432  0.66768   
## student_faculty_ratio       -1.39835    0.51075  -2.738  0.00889 **
## private                      6.28534    5.35633   1.173  0.24693   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.06 on 44 degrees of freedom
## Multiple R-squared:  0.5747, Adjusted R-squared:  0.5457 
## F-statistic: 19.81 on 3 and 44 DF,  p-value: 2.818e-08

Our multiple linear regression model has an adjusted R-squared of 0.5457, which leaves room for improvement. The only predictor that is significant at the 5% level is the student/faculty ratio (p = 0.009). Holding the other predictors fixed, each additional student per faculty member is associated with a drop of about 1.4 percentage points in the alumni giving rate.
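A model of this form can be fitted and read as in the sketch below. The data are simulated for illustration (the name `demo` and the generating coefficients are assumptions), so the numbers it prints will not match the output above.

```r
set.seed(42)
n <- 48

# Simulated stand-in data with the same column names as the real predictors.
demo <- data.frame(
  student_faculty_ratio = runif(n, 3, 23),
  private = rbinom(n, 1, 0.7)
)
demo$percent_of_classes_under_20 <-
  80 - 2 * demo$student_faculty_ratio + rnorm(n, sd = 8)
demo$alumni_giving_rate <-
  40 - 1.4 * demo$student_faculty_ratio + 6 * demo$private + rnorm(n, sd = 9)

# "~ ." regresses the response on every other column in the data frame.
fit <- lm(alumni_giving_rate ~ ., data = demo)
fit_summary <- summary(fit)

# Coefficient table: estimate, standard error, t value, and p-value per term.
coefs <- fit_summary$coefficients
print(round(coefs, 4))

# Multiple and adjusted R-squared, reported separately.
print(c(r_squared = fit_summary$r.squared,
        adj_r_squared = fit_summary$adj.r.squared))
```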

Based on the residuals-versus-fitted plot, the residuals appear evenly scattered about 0, with only some clustering at the upper end of the x-axis giving cause for concern. The Q-Q plot shows no major deviations from normality that should cause us to rethink the model. The most important variables yielded by this model are the % of classes under 20, the student/faculty ratio, and the public/private distinction.
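The two diagnostic plots discussed above come from `plot()` on an `lm` object. This sketch uses simulated data (all names and values are assumptions) and writes the plots to a temporary PDF so that it also runs non-interactively. The Shapiro-Wilk test is added here as a numeric companion to the Q-Q plot; it is not part of the original analysis.

```r
set.seed(7)
n <- 48

# Simulated stand-in data for illustration only.
demo <- data.frame(student_faculty_ratio = runif(n, 3, 23))
demo$alumni_giving_rate <-
  45 - 1.4 * demo$student_faculty_ratio + rnorm(n, sd = 9)

fit <- lm(alumni_giving_rate ~ student_faculty_ratio, data = demo)

# Write both diagnostic plots to a temporary PDF file.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(fit, which = 1)  # residuals vs fitted: look for curvature or funnels
plot(fit, which = 2)  # normal Q-Q: look for departures from the line
invisible(dev.off())

# Shapiro-Wilk test of residual normality; a small p-value signals departure.
sw <- shapiro.test(resid(fit))
print(sw$p.value)
```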

## percent_of_classes_under_20       student_faculty_ratio 
##                    3.183919                    3.514477 
##                     private 
##                    3.604326

Considering the effect of each variable on the alumni giving rate, we can see that as the student/faculty ratio decreases, the alumni giving rate increases, supporting the conclusions of our linear model. We can also see a general, if noisy, increase in the alumni giving rate as the % of classes with fewer than 20 students increases.

Optimization

An excellent case study needs to work on selecting the “best” model for the data and/or carrying out appropriate remedies to improve the statistical inferences (e.g., you can try Box-Cox transformation if necessary).
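As a hedged illustration of the Box-Cox remedy mentioned above, the sketch below uses `MASS::boxcox()` to profile the transformation parameter on simulated, strictly positive data. The data frame `demo`, the helper `transform_y`, and the lambda grid are assumptions; the report does not state whether a Box-Cox transformation was actually applied.

```r
set.seed(11)
n <- 48

# Simulated, strictly positive response (Box-Cox requires y > 0).
demo <- data.frame(student_faculty_ratio = runif(n, 3, 23))
demo$alumni_giving_rate <-
  exp(3.9 - 0.06 * demo$student_faculty_ratio + rnorm(n, sd = 0.3))

# Keep y and the QR decomposition in the fit, since boxcox() uses both.
fit <- lm(alumni_giving_rate ~ student_faculty_ratio, data = demo,
          y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot drawn).
bc <- MASS::boxcox(fit, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# Hypothetical helper: apply the Box-Cox transform, using log(y) at lambda = 0.
transform_y <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Refit on the transformed response.
demo$y_bc <- transform_y(demo$alumni_giving_rate, lambda_hat)
fit_bc <- lm(y_bc ~ student_faculty_ratio, data = demo)
print(summary(fit_bc)$r.squared)
```

A value of `lambda_hat` near 1 suggests no transformation is needed, while a value near 0 points toward a log transform.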

## Call: earth(formula=alumni_giving_rate~., data=alumni, degree=1)
## 
##                                   coefficients
## (Intercept)                         22.3162169
## ï..schoolDartmouth College          25.5731456
## ï..schoolLehigh University          15.1284643
## ï..schoolPrinceton University       23.8361682
## ï..schoolU. of Notre Dame           27.7555301
## h(percent_of_classes_under_20-65)    0.9867945
## h(12-student_faculty_ratio)          2.5553188
## h(student_faculty_ratio-12)         -1.0717470
## 
## Selected 8 of 23 terms, and 6 of 50 predictors
## Termination condition: GRSq -10 at 23 terms
## Importance: student_faculty_ratio, ï..schoolU. of Notre Dame, ...
## Number of terms at each degree of interaction: 1 7 (additive model)
## GCV 68.93795    RSS 1564.03    GRSq 0.6263805    RSq 0.8158119

# Binary indicators (1/0) for the four schools selected by the earth() model above
alumni$notre_dame <- ifelse(alumni$ï..school == "U. of Notre Dame", 1, 0)
alumni$princeton <- ifelse(alumni$ï..school == "Princeton University", 1, 0)
alumni$dartmouth <- ifelse(alumni$ï..school == "Dartmouth College", 1, 0)
alumni$lehigh <- ifelse(alumni$ï..school == "Lehigh University", 1, 0)

Using earth(), an R implementation of multivariate adaptive regression splines (MARS), I searched for a better model. MARS automatically selects predictors and builds piecewise-linear hinge terms of the form h(x - c) = max(0, x - c), so it handles both variable selection and any necessary transformation of the predictors. The fitted model added hinge terms for the student/faculty ratio and the % of classes under 20, along with indicator terms for certain high-donation schools. To tune this model, I created binary variables for the schools it flagged, then ran 10-fold cross-validation (repeated 3 times) while varying the maximum degree of interaction, yielding this solution.

## Multivariate Adaptive Regression Spline 
## 
## 48 samples
##  7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 43, 44, 43, 43, 44, 43, ... 
## Resampling results across tuning parameters:
## 
##   degree  RMSE      Rsquared   MAE     
##   1       9.365351  0.5552942  7.752930
##   2       9.338507  0.5635940  7.694028
##   3       9.338507  0.5635940  7.694028
##   4       9.338507  0.5635940  7.694028
##   5       9.338507  0.5635940  7.694028
## 
## Tuning parameter 'nprune' was held constant at a value of 5
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were nprune = 5 and degree = 2.

The cross-validated R-squared was highest at degree = 2 and identical for degrees 3 through 5. In earth(), degree is the maximum degree of interaction between terms, so the selected model allows pairwise interactions, and permitting higher-order interactions brought no further improvement.
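The resampling step can be illustrated in base R. The sketch below runs a single round of 10-fold cross-validation on a plain linear model, swapped in for the MARS model so that it needs no extra packages; it is not the procedure that produced the output above, and the data are simulated. With 48 rows and 10 folds, the training sets have 43 or 44 rows, matching the sample sizes reported in the output.

```r
set.seed(2021)
n <- 48

# Simulated stand-in data for illustration only.
demo <- data.frame(student_faculty_ratio = runif(n, 3, 23))
demo$alumni_giving_rate <-
  45 - 1.4 * demo$student_faculty_ratio + rnorm(n, sd = 9)

# Assign each row to one of k folds at random, with fold sizes of 4 or 5.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, then score the held-out fold.
cv_rmse <- sapply(seq_len(k), function(i) {
  train <- demo[fold_id != i, ]
  test <- demo[fold_id == i, ]
  fit <- lm(alumni_giving_rate ~ student_faculty_ratio, data = train)
  pred <- predict(fit, newdata = test)
  sqrt(mean((test$alumni_giving_rate - pred)^2))
})

# Average held-out RMSE across the folds.
mean_cv_rmse <- mean(cv_rmse)
print(mean_cv_rmse)
```

Repeating the fold assignment several times and averaging, as in the "repeated 3 times" setting above, reduces the dependence of the estimate on one random split.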

The variable importance calculated by our MARS model is shown in the accompanying plot. The student/faculty ratio is the most important variable in the model, followed by our identifier variables for Notre Dame, Dartmouth, and Princeton.

summary(alumni_mars_tune$finalModel)

## Call: earth(x=data.frame[48,7], y=c(25,33,40,46,2...), keepxy=TRUE, degree=2,
##             nprune=5)
## 
##                                          coefficients
## (Intercept)                                 15.938967
## princeton                                   23.901531
## h(15-student_faculty_ratio)                  2.715950
## h(15-student_faculty_ratio) * notre_dame    13.814566
## h(15-student_faculty_ratio) * dartmouth      4.696256
## 
## Selected 5 of 14 terms, and 4 of 7 predictors (nprune=5)
## Termination condition: Reached nk 21
## Importance: student_faculty_ratio, notre_dame, dartmouth, princeton, ...
## Number of terms at each degree of interaction: 1 2 2
## GCV 72.65474    RSS 2072.174    GRSq 0.6062369    RSq 0.7559702

The final model generated from this analysis has an R-squared of 0.756 on the full data, an improvement over the multiple R-squared of 0.575 (adjusted 0.546) for our linear regression model. The cross-validated R-squared of about 0.56 in the tuning output suggests that the gain on unseen data is likely smaller. What we can surmise from this analysis is that a large share of the variation in alumni giving is tied to being an alumnus of specific schools, all of them private: Notre Dame, Dartmouth, and Princeton. These universities have a strong alumni donation base that places them above their competition, while for the other universities the model associates higher giving with a lower student/faculty ratio, specifically with each student per faculty member below 15. What causes this extraordinary level of giving at these schools is unclear based on this analysis, though it would be an interesting topic to investigate in the future.
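To make the final model concrete, the sketch below evaluates its fitted equation by hand, using the coefficients from the summary output above. The function names `hinge` and `predict_final` are hypothetical helpers, not part of earth, and the two schools evaluated are made-up cases.

```r
# Hinge function used by MARS terms: h(x) = max(0, x).
hinge <- function(x) pmax(0, x)

# Fitted equation of the final model, with coefficients copied from the output.
predict_final <- function(student_faculty_ratio,
                          princeton = 0, notre_dame = 0, dartmouth = 0) {
  h15 <- hinge(15 - student_faculty_ratio)
  15.938967 +
    23.901531 * princeton +
    2.715950 * h15 +
    13.814566 * h15 * notre_dame +
    4.696256 * h15 * dartmouth
}

# Hypothetical school with a ratio of 10: 15.938967 + 2.715950 * 5.
rate_at_10 <- predict_final(10)
print(rate_at_10)

# Hypothetical school with a ratio of 18: the hinge is zero, leaving the intercept.
rate_at_18 <- predict_final(18)
print(rate_at_18)
```

Under this equation, a ratio at or above 15 contributes nothing, and each unit below 15 adds about 2.7 points to the predicted giving rate for a school without one of the three indicator terms.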

A university could use this analysis to inform hiring practices aimed at increasing alumni donations. For example, a university that wanted to lower its student/faculty ratio to reach a certain level of alumni donations could use this model as a baseline for understanding how much hiring it would need to commit to. This may ultimately not favor a university’s recruitment strategy. Many public universities, such as the University of Cincinnati, compete on experiential-learning and co-op opportunities, particularly in STEM, and a lower student/faculty ratio may not have the desired effect for schools with a different consumer base. Even so, this analysis allows all stakeholders to approach the problem from an informed perspective.

Linear Regression with R

John Trygier

11/14/2021

Alumni Donations Analysis
