Alumni Donation Data Analysis

Problem Statement

College is a launching pad for countless students, helping them acquire the skills needed for a successful career after graduation. Some graduates show their gratitude by giving back, donating to various causes at their alma mater. Given data on 48 schools, we want to find the factors that influence the donation rate.

Data Description:

  • School : Name of the school
  • X1 : % of Classes Under 20
  • X2 : Student/Faculty Ratio
  • X3 : Private
  • y : Alumni Giving Rate

Data Analysis

Let's look at the structure of the data:

library(readxl)  # read_xls()
library(dplyr)   # %>% and mutate()

data = read_xls("./alumni1.xls")
colnames(data)[2:5] = c("x1", "x2", "y", "x3")
data = data %>% mutate(x3 = as.factor(x3),
                       School = as.factor(School))
str(data)
## tibble[,5] [48 x 5] (S3: tbl_df/tbl/data.frame)
##  $ School: Factor w/ 48 levels "Boston College",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ x1    : num [1:48] 39 68 60 65 67 52 45 69 72 61 ...
##  $ x2    : num [1:48] 13 8 8 3 10 8 12 7 13 10 ...
##  $ y     : num [1:48] 25 33 40 46 28 31 27 31 35 53 ...
##  $ x3    : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

and the high-level summary statistics of the data:

summary(data)
##                                 School         x1              x2       
##  Boston College                    : 1   Min.   :29.00   Min.   : 3.00  
##  Brandeis University               : 1   1st Qu.:44.75   1st Qu.: 8.00  
##  Brown University                  : 1   Median :59.50   Median :10.50  
##  California Institute of Technology: 1   Mean   :55.73   Mean   :11.54  
##  Carnegie Mellon University        : 1   3rd Qu.:66.25   3rd Qu.:13.50  
##  Case Western Reserve Univ.        : 1   Max.   :77.00   Max.   :23.00  
##  (Other)                           :42                                  
##        y         x3    
##  Min.   : 7.00   0:15  
##  1st Qu.:18.75   1:33  
##  Median :29.00         
##  Mean   :29.27         
##  3rd Qu.:38.50         
##  Max.   :67.00         
## 

% of Classes Under 20 vs Donation Rate

library(ggplot2)
library(ggthemes)   # theme_fivethirtyeight()
library(gridExtra)  # grid.arrange()

plt_1 <- ggplot(data, aes(x1)) +
            geom_histogram() + 
            ggtitle("Distribution of \n'% of Classes Under 20'") + 
            theme_fivethirtyeight() + 
            theme(plot.title = element_text(size=13))

plt_2 <- ggplot(data, aes(y = x1)) + 
            geom_boxplot() + 
            ggtitle("Boxplot of \n'% of Classes Under 20'") + 
            theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))

plt_3 <- ggplot(data, aes(x1, y)) + 
            geom_point() + 
            geom_smooth(method = "lm", se = FALSE) +
            ggtitle("Y vs X1") + 
            theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))

plt_4 <- ggplot(data, aes(x1)) +
          geom_density() + 
          ggtitle("Density of \n'% of Classes Under 20'") + 
          theme_fivethirtyeight() + 
          theme(plot.title = element_text(size=13))

grid.arrange(plt_1, plt_2, plt_4, plt_3, nrow = 2)

fit1 = lm(y ~ x1, data = data)
summary(fit1)
## 
## Call:
## lm(formula = y ~ x1, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.053  -7.158  -1.660   6.734  29.658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.3861     6.5655  -1.125    0.266    
## x1            0.6578     0.1147   5.734 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared:  0.4169, Adjusted R-squared:  0.4042 
## F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07
cor.test(data$x1, data$y)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x1 and data$y
## t = 5.7344, df = 46, p-value = 7.228e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4427365 0.7856553
## sample estimates:
##       cor 
## 0.6456504

From the plot, we see a positive correlation between "% of Classes Under 20" and the alumni giving rate: the Pearson correlation coefficient between Y and X1 is 0.65. X1 is not normally distributed, but there are no outliers. With this we can conclude a positive linear relationship between X1 and Y. Fitting a simple linear model gives an adjusted R-squared of 0.404; the slope estimate is highly significant (p ≈ 7.2e-07), although the intercept is not.
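
The normality remark can also be checked formally; a minimal sketch using base R's Shapiro-Wilk test (an illustrative addition, not part of the original output):

# Shapiro-Wilk normality test for x1; a small p-value would
# support the visual impression that x1 is not normally distributed
shapiro.test(data$x1)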

Student/Faculty Ratio vs Donation Rate

plt_1 <- ggplot(data, aes(x2)) +
  geom_histogram() + 
  ggtitle("Distribution of \n'Student/Faculty Ratio'") + 
  theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))

plt_2 <- ggplot(data, aes(y = x2)) + 
  geom_boxplot() + 
  ggtitle("Boxplot of \n'Student/Faculty Ratio'") + 
  theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))

plt_3 <- ggplot(data, aes(x2, y)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  ggtitle("Y vs X2") + 
  theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))
plt_4 <- ggplot(data, aes(x2)) +
  geom_density() + 
  ggtitle("Density of \n'Student/Faculty Ratio'") + 
  theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))

grid.arrange(plt_1, plt_2, plt_4, plt_3, nrow = 2)

fit2 = lm(y ~ x2, data = data)
summary(fit2)
## 
## Call:
## lm(formula = y ~ x2, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.328  -5.692  -1.471   4.058  24.272 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53.0138     3.4215  15.495  < 2e-16 ***
## x2           -2.0572     0.2737  -7.516 1.54e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.103 on 46 degrees of freedom
## Multiple R-squared:  0.5512, Adjusted R-squared:  0.5414 
## F-statistic: 56.49 on 1 and 46 DF,  p-value: 1.544e-09
cor.test(data$x2, data$y)
## 
##  Pearson's product-moment correlation
## 
## data:  data$x2 and data$y
## t = -7.5157, df = 46, p-value = 1.544e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8477145 -0.5807711
## sample estimates:
##        cor 
## -0.7423975

From the plot, we see there is a negative correlation between X2 and Y: the Pearson correlation coefficient is -0.742. X2 is right-skewed and has one outlier. With this we can conclude a negative linear relationship between X2 and Y. Fitting a linear model gives an adjusted R-squared of 0.541. Both regression estimates are significant, and the fitted equation is \[ Y = 53.01 - 2.06*X2 \]
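
The outlier claim can be made concrete with the boxplot rule; a quick base-R check (added here for illustration, not in the original output):

# x2 values flagged as outliers by the 1.5*IQR boxplot rule
boxplot.stats(data$x2)$out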

College Type vs Donation Rate

plt_1 <- ggplot(data, aes(x3)) + 
          geom_bar() + 
          ggtitle("Private university \nCount") + 
          theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))

plt_2 <- ggplot(data, aes(group = x3, color = x3)) + 
            geom_point(aes(x=x1, y = y), size = 3, alpha = 0.5 ) +
            geom_smooth(method = "lm", aes(x = x1, y = y), se = FALSE) +
            ggtitle("Y vs X1 for private \nand state schools") + 
            theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))

plt_4 <- ggplot(data, aes(group = x3, color = x3)) + 
          geom_point(aes(x=x3, y = y), size = 3, alpha = 0.7 ) +
          ggtitle("Variation of 'Alumni Donation'\n w.r.t school type") + 
          theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))

plt_3 <- ggplot(data, aes(group = x3, color = x3)) + 
          geom_point(aes(x=x2, y = y), size = 3, alpha = 0.5 ) +
          geom_smooth(method = "lm", aes(x = x2, y = y), se = FALSE) +
          ggtitle("Y vs X2 for private \nand state schools") + 
          theme_fivethirtyeight()+ 
            theme(plot.title = element_text(size=13))
grid.arrange(plt_1, plt_2, plt_4, plt_3,nrow = 2)

t.test(y~x3, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  y by x3
## t = -7.7775, df = 42.568, p-value = 1.05e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -24.92037 -14.65539
## sample estimates:
## mean in group 0 mean in group 1 
##        15.66667        35.45455
fit3 = lm(y~x3, data = data)
summary(fit3)
## 
## Call:
## lm(formula = y ~ x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -6.455  -2.454   5.386  31.546 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   15.667      2.541   6.167 1.63e-07 ***
## x31           19.788      3.064   6.458 5.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.839 on 46 degrees of freedom
## Multiple R-squared:  0.4755, Adjusted R-squared:  0.4641 
## F-statistic: 41.71 on 1 and 46 DF,  p-value: 5.939e-08

The class balance between private and public schools is acceptable (33 private vs. 15 public). The relationship of Y with X1, and of Y with X2, changes markedly with school type, which suggests we should include interaction terms when adding the "private" feature to the linear model. The adjusted R-squared for y ~ x3 alone is 0.464.

The Welch two-sample t-test comparing the mean of Y across school types gives p ≈ 1.05e-09, so we can conclude that the school-type feature carries significant information.
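
For reference, the group statistics behind this test can be reproduced with a short dplyr summary (a sketch added for illustration):

# Mean, spread, and size of the alumni giving rate per school type
data %>%
  group_by(x3) %>%
  summarise(mean_y = mean(y), sd_y = sd(y), n = n())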

Linear Models

Model 1: Full Model

model1 = lm(y ~ x1 + x2 + x3 ,data=data)
summary(model1)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.757  -6.320  -2.273   5.152  25.669 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 36.78364   13.67220   2.690  0.01005 * 
## x1           0.07725    0.17873   0.432  0.66768   
## x2          -1.39835    0.51075  -2.738  0.00889 **
## x31          6.28534    5.35633   1.173  0.24693   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.06 on 44 degrees of freedom
## Multiple R-squared:  0.5747, Adjusted R-squared:  0.5457 
## F-statistic: 19.81 on 3 and 44 DF,  p-value: 2.818e-08

The estimated model is \[ Y = 36.78 + 0.077*X1 - 1.40*X2 + 6.29*X3 \] Among the predictors, only the x2 coefficient is significant. We may explore further with X2 alone.
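
One plausible reason x1 loses significance in the full model is collinearity with x2 (smaller classes tend to go with lower student/faculty ratios). Assuming the car package is available, variance inflation factors give a quick check:

# Variance inflation factors for the full model; values well above 1
# indicate that predictors overlap in the information they carry
# (illustrative check, not part of the original analysis)
car::vif(model1)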

Model Evaluation Metrics

data.frame("BIC" = BIC(model1), "AIC" = AIC(model1), "RMSE" = rmse(model1, data=data))
##        BIC      AIC     RMSE
## 1 362.9711 353.6151 8.674479

Residual Analysis

par(mfrow=c(1,2))
plot(model1, which = 1:2)

From the Q-Q plot we see that the residuals are not normally distributed, and the residuals-vs-fitted plot shows non-constant variance across fitted values, which implies the model can be improved further.
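
The non-constant-variance impression can also be tested formally; a sketch assuming the lmtest package is installed:

# Breusch-Pagan test; a small p-value indicates heteroscedastic residuals
lmtest::bptest(model1)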

Model 2: With Interaction Terms

model2 = lm(y ~ x1 + x2 + x3 + x1:x3 + x2:x3, data = data)
summary(model2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x1:x3 + x2:x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.624  -5.768  -1.769   5.323  23.564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  32.0206    18.9423   1.690   0.0984 .
## x1           -0.1363     0.3018  -0.452   0.6538  
## x2           -0.6258     0.7466  -0.838   0.4067  
## x31          14.9514    26.7319   0.559   0.5789  
## x1:x31        0.2255     0.3806   0.593   0.5566  
## x2:x31       -1.2949     1.0335  -1.253   0.2171  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.999 on 42 degrees of freedom
## Multiple R-squared:  0.5994, Adjusted R-squared:  0.5517 
## F-statistic: 12.57 on 5 and 42 DF,  p-value: 1.739e-07

By including the interaction terms, we obtain the estimated model \[ Y = 32.02 - 0.136*X1 - 0.626*X2 + 14.95*X3 + 0.2255*X1*X3 - 1.2949*X2*X3 \]
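
Because X3 is binary, this interaction model amounts to two separate regression equations. Substituting X3 = 0 (public) and X3 = 1 (private) into the equation above gives \[ Y = 32.02 - 0.136*X1 - 0.626*X2 \] and \[ Y = 46.97 + 0.089*X1 - 1.921*X2 \] respectively, which makes the slope change across school types explicit.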

Model Evaluation Metrics

print(data.frame("BIC" = BIC(model2), "AIC" = AIC(model2), "RMSE" = rmse(model2, data=data)))
##        BIC      AIC     RMSE
## 1 367.8352 354.7368 8.418261

Residual Analysis

par(mfrow=c(1,2))
plot(model2, which = 1:2)

From the residual plots, we see the variance across fitted values is somewhat stabilized by the interaction terms, but normality is still not achieved. We could try variance-stabilizing transformations. Moreover, none of the estimated coefficients in this model is significant, and the model looks overly complicated, so we should try fitting another model.
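
A formal way to decide whether the interaction terms earn their keep is a nested-model F-test comparing model1 and model2 (an illustrative check, not in the original analysis):

# F-test for the joint contribution of the two interaction terms;
# a large p-value supports dropping them for a simpler model
anova(model1, model2)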

Model 3: With School Type and Student/Faculty Ratio

model3 = lm(y ~ x2 + x3 ,data=data)
summary(model3)
## 
## Call:
## lm(formula = y ~ x2 + x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.374  -6.148  -2.189   5.633  25.735 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.4294     8.3734   4.948 1.09e-05 ***
## x2           -1.4863     0.4642  -3.202  0.00251 ** 
## x31           7.2669     4.8071   1.512  0.13761    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.978 on 45 degrees of freedom
## Multiple R-squared:  0.5728, Adjusted R-squared:  0.5539 
## F-statistic: 30.17 on 2 and 45 DF,  p-value: 4.877e-09

Model Evaluation Metrics

data.frame("BIC" = BIC(model3), "AIC" = AIC(model3), "RMSE" = rmse(model3, data=data))
##        BIC      AIC     RMSE
## 1 359.3033 351.8185 8.692876

Residual Analysis

par(mfrow=c(1,2))
plot(model3, which = 1:2)

This model looks fairly promising: its AIC and BIC are lower than those of the previous models. But the residual plots still show non-constant variance across the fitted values. We can try transforming the data and fitting the model again.

Box-Cox Transformation

par(mfrow=c(1,1))
bc <- MASS::boxcox(y ~ x2 + x3, data = data)  # plots the profile log-likelihood

lambda <- bc$x[which.max(bc$y)]  # lambda that maximizes the log-likelihood
print(lambda)
## [1] 0.3434343

From the Box-Cox procedure, the optimal value of lambda is 0.343. Let's transform Y accordingly and build the model on the transformed response.
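
For reference, the transformation applied below is the standard Box-Cox form for \( \lambda \neq 0 \), together with its inverse for mapping predictions back to the original scale: \[ y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}, \qquad y = \left(\lambda\, y^{(\lambda)} + 1\right)^{1/\lambda} \]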

Model 4: Transformed Y

data$y_trs = (data$y ^ lambda - 1) / lambda  # Box-Cox transform of y
model4 = lm(y_trs ~ x2 + x3 ,data=data)
summary(model4)
## 
## Call:
## lm(formula = y_trs ~ x2 + x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1031 -0.6559 -0.2420  0.7583  1.9529 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.32445    0.90662   8.079 2.65e-10 ***
## x2          -0.16613    0.05026  -3.305  0.00187 ** 
## x31          1.05289    0.52049   2.023  0.04905 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9721 on 45 degrees of freedom
## Multiple R-squared:  0.6298, Adjusted R-squared:  0.6134 
## F-statistic: 38.28 on 2 and 45 DF,  p-value: 1.946e-10

Model Evaluation Metrics

data.frame("BIC" = BIC(model4), "AIC" = AIC(model4), "RMSE" = rmse(model4, data=data))
##        BIC      AIC      RMSE
## 1 145.8865 138.4017 0.9412111

Residual Analysis

par(mfrow=c(1,2))
plot(model4, which = 1:2)

From the residual plots, we can see that the variance is now nearly constant across the fitted values and the residuals are approximately normally distributed. This model also achieves the best evaluation metrics.
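
To compare the four candidates side by side, the metrics can be collected into one frame (a sketch; note that model4's values are on the Box-Cox-transformed scale and are not directly comparable with models 1-3):

# Collect adjusted R-squared, AIC, and BIC for all four fitted models
models <- list(model1 = model1, model2 = model2, model3 = model3, model4 = model4)
data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC)
)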

Result

Of the four models above, model4 is the best-performing model. Its adjusted R² of 0.61 is the highest among the models, and its AIC (138.4), BIC (145.88), and RMSE (0.94) are the lowest, although these last three are computed on the Box-Cox-transformed response and so are not directly comparable with those of models 1-3. It also satisfies all the assumptions of the linear regression model, with significant coefficients. We must be careful, though, as it predicts the transformed Y. With that caveat, model4 may be used for predicting the alumni donation rate.
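
Since model4 predicts the transformed response, predictions must be mapped back to the original giving-rate scale by inverting the Box-Cox transform; a minimal sketch:

# Predict on the transformed scale, then invert the Box-Cox transform
# (lambda was estimated above as ~0.343)
pred_trs <- predict(model4, newdata = data)
pred_y <- (lambda * pred_trs + 1)^(1 / lambda)
head(pred_y)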