Alumni Donation Data Analysis
Problem Statement
College is a launching pad for countless students, helping them acquire the skills needed for a successful career after graduation. Some graduates show their gratitude by giving back and donating to various causes at their alma mater. Given data on 48 schools, find the factors influencing the donation rate.
Data Description:
- School : Name of the school
- X1 : % of Classes Under 20
- X2 : Student/Faculty Ratio
- X3 : Private (1) or state (0) school
- y : Alumni Giving Rate
Data Analysis
Let's look at the structure of the data:
# Packages used throughout: readxl (read_xls), dplyr, ggplot2, ggthemes
# (theme_fivethirtyeight), gridExtra (grid.arrange), modelr (rmse)
library(readxl); library(dplyr); library(ggplot2)
library(ggthemes); library(gridExtra); library(modelr)
data = read_xls("./alumni1.xls")
colnames(data)[2:5] = c("x1", "x2", "y", "x3")
data = data %>% mutate(x3 = as.factor(x3),
                       School = as.factor(School))
str(data)
## tibble[,5] [48 x 5] (S3: tbl_df/tbl/data.frame)
## $ School: Factor w/ 48 levels "Boston College",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ x1 : num [1:48] 39 68 60 65 67 52 45 69 72 61 ...
## $ x2 : num [1:48] 13 8 8 3 10 8 12 7 13 10 ...
## $ y : num [1:48] 25 33 40 46 28 31 27 31 35 53 ...
## $ x3 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
and the high-level summary statistics of the data:
summary(data)
## School x1 x2
## Boston College : 1 Min. :29.00 Min. : 3.00
## Brandeis University : 1 1st Qu.:44.75 1st Qu.: 8.00
## Brown University : 1 Median :59.50 Median :10.50
## California Institute of Technology: 1 Mean :55.73 Mean :11.54
## Carnegie Mellon University : 1 3rd Qu.:66.25 3rd Qu.:13.50
## Case Western Reserve Univ. : 1 Max. :77.00 Max. :23.00
## (Other) :42
## y x3
## Min. : 7.00 0:15
## 1st Qu.:18.75 1:33
## Median :29.00
## Mean :29.27
## 3rd Qu.:38.50
## Max. :67.00
##
% of Classes Under 20 vs Donation Rate
plt_1 <- ggplot(data, aes(x1)) +
geom_histogram() +
ggtitle("Distribution of \n'% of Classes Under 20'") +
theme_fivethirtyeight() +
theme(plot.title = element_text(size=13))
plt_2 <- ggplot(data, aes(y = x1)) +
geom_boxplot() +
ggtitle("Boxplot of \n'% of Classes Under 20'") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
plt_3 <- ggplot(data, aes(x1, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Y vs X1") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
plt_4 <- ggplot(data, aes(x1)) +
geom_density() +
ggtitle("Distribution of \n'% of Classes Under 20'") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
grid.arrange(plt_1, plt_2, plt_4, plt_3, nrow = 2)

fit1 = lm(y ~ x1, data = data)
summary(fit1)
##
## Call:
## lm(formula = y ~ x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## x1 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
cor.test(data$x1, data$y)
##
## Pearson's product-moment correlation
##
## data: data$x1 and data$y
## t = 5.7344, df = 46, p-value = 7.228e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4427365 0.7856553
## sample estimates:
## cor
## 0.6456504
From the plots, we see a positive correlation between "% of Classes Under 20" and "Alumni Giving Rate". The Pearson correlation coefficient between X1 and Y is 0.65. X1 does not look normally distributed, but there are no outliers. From this we can conclude that X1 and Y have a positive linear relation. Fitting a linear model gives an adjusted R-squared of 0.404, and the slope estimate for X1 is highly significant (p = 7.23e-07), although the intercept is not.
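The normality and outlier remarks above can be backed with formal checks; a minimal sketch using base R (not part of the original output):
# Shapiro-Wilk test of normality for x1: a small p-value
# indicates departure from normality
shapiro.test(data$x1)
# Points outside the 1.5*IQR whiskers (boxplot outlier rule);
# an empty result means no outliers
boxplot.stats(data$x1)$out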
Student/Faculty Ratio vs Donation Rate
plt_1 <- ggplot(data, aes(x2)) +
geom_histogram() +
ggtitle("Distribution of \n'Student/Faculty Ratio'") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
plt_2 <- ggplot(data, aes(y = x2)) +
geom_boxplot() +
ggtitle("Boxplot of \n'Student/Faculty Ratio'") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
plt_3 <- ggplot(data, aes(x2, y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Y vs X2") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
plt_4 <- ggplot(data, aes(x2)) +
geom_density() +
ggtitle("Distribution of \n'Student/Faculty Ratio'") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
grid.arrange(plt_1, plt_2, plt_4, plt_3, nrow = 2)

fit2 = lm(y ~ x2, data = data)
summary(fit2)
##
## Call:
## lm(formula = y ~ x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.328 -5.692 -1.471 4.058 24.272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.0138 3.4215 15.495 < 2e-16 ***
## x2 -2.0572 0.2737 -7.516 1.54e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.103 on 46 degrees of freedom
## Multiple R-squared: 0.5512, Adjusted R-squared: 0.5414
## F-statistic: 56.49 on 1 and 46 DF, p-value: 1.544e-09
cor.test(data$x2, data$y)
##
## Pearson's product-moment correlation
##
## data: data$x2 and data$y
## t = -7.5157, df = 46, p-value = 1.544e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8477145 -0.5807711
## sample estimates:
## cor
## -0.7423975
From the plots, we see a negative correlation between X2 and Y. The Pearson correlation coefficient between X2 and Y is -0.742. X2 is right-skewed and has one outlier. From this we can conclude that X2 and Y have a negative linear relation. Fitting a linear model gives an adjusted R-squared of 0.541. The regression estimates are significant, and the fitted equation is: \[ \hat{Y} = 53.01 - 2.06 X_2 \]
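To identify which school drives the outlier noted above, a minimal sketch using the same 1.5*IQR boxplot rule (base R, not part of the original output):
# x2 values flagged as outliers by the boxplot whisker rule
out_vals = boxplot.stats(data$x2)$out
# Look up the corresponding school(s)
data$School[data$x2 %in% out_vals]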
College Type vs Donation Rate
plt_1 <- ggplot(data, aes(x3)) +
geom_bar() +
ggtitle("Private university \nCount") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
plt_2 <- ggplot(data, aes(group = x3, color = x3)) +
geom_point(aes(x=x1, y = y), size = 3, alpha = 0.5 ) +
geom_smooth(method = "lm", aes(x = x1, y = y), se = FALSE) +
ggtitle("Y vs X1 for private \nand state schools") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
plt_4 <- ggplot(data, aes(group = x3, color = x3)) +
geom_point(aes(x=x3, y = y), size = 3, alpha = 0.7 ) +
ggtitle("Variation of 'Alumni Donation'\n w.r.t school type") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
plt_3 <- ggplot(data, aes(group = x3, color = x3)) +
geom_point(aes(x=x2, y = y), size = 3, alpha = 0.5 ) +
geom_smooth(method = "lm", aes(x = x2, y = y), se = FALSE) +
ggtitle("Y vs X2 for private \nand state schools") +
theme_fivethirtyeight()+
theme(plot.title = element_text(size=13))
grid.arrange(plt_1, plt_2, plt_4, plt_3, nrow = 2)

t.test(y ~ x3, data = data)
##
## Welch Two Sample t-test
##
## data: y by x3
## t = -7.7775, df = 42.568, p-value = 1.05e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -24.92037 -14.65539
## sample estimates:
## mean in group 0 mean in group 1
## 15.66667 35.45455
fit3 = lm(y~x3, data = data)
summary(fit3)##
## Call:
## lm(formula = y ~ x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -6.455 -2.454 5.386 31.546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.667 2.541 6.167 1.63e-07 ***
## x31 19.788 3.064 6.458 5.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.839 on 46 degrees of freedom
## Multiple R-squared: 0.4755, Adjusted R-squared: 0.4641
## F-statistic: 41.71 on 1 and 46 DF, p-value: 5.939e-08
The private/state split (33 vs. 15 schools) is not severely imbalanced. The relation between Y and X1, and between Y and X2, changes drastically depending on the school type, which shows we should include interaction terms when adding the "private" feature to the linear model. The adjusted R-squared for y ~ x3 is 0.464.
From the Welch t-test for the mean of Y by school type, we get a p-value near zero (1.05e-09), so we can conclude the school-type feature carries significant information. Note also that the regression coefficient for x31 (19.79) is exactly the difference between the two group means (35.45 - 15.67), as expected for a single-factor model.
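To make the slope differences concrete, here is a minimal sketch (base R only, not part of the original output) fitting Y on X2 separately within each school type:
# Per-group intercept and slope of y on x2; diverging slopes
# are what motivate the interaction terms in Model 2 below
by(data, data$x3, function(d) coef(lm(y ~ x2, data = d)))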
Linear Models
Model 1: Full Model
model1 = lm(y ~ x1 + x2 + x3, data = data)
summary(model1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.757 -6.320 -2.273 5.152 25.669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.78364 13.67220 2.690 0.01005 *
## x1 0.07725 0.17873 0.432 0.66768
## x2 -1.39835 0.51075 -2.738 0.00889 **
## x31 6.28534 5.35633 1.173 0.24693
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.06 on 44 degrees of freedom
## Multiple R-squared: 0.5747, Adjusted R-squared: 0.5457
## F-statistic: 19.81 on 3 and 44 DF, p-value: 2.818e-08
The estimated model is \[ \hat{Y} = 36.78 + 0.077 X_1 - 1.40 X_2 + 6.29 X_3 \] We observe that, among the predictors, only the x2 coefficient is significant. We may explore further with only X2, as sketched below.
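One quick way to check which predictors earn their place is a single-term deletion F-test; a minimal sketch using base R's drop1 (not part of the original output):
# F-test for dropping each term from the full model; x2 should be
# the only term whose removal significantly worsens the fit
drop1(model1, test = "F")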
Model Evaluation Metrics
data.frame("BIC" = BIC(model1), "AIC" = AIC(model1), "RMSE" = rmse(model1, data=data))## BIC AIC RMSE
## 1 362.9711 353.6151 8.674479
Residual Analysis
par(mfrow=c(1,2))
plot(model1, which = 1:2)

From the residual plots, we see that the residuals are not normally distributed, and the residuals-vs-fitted plot shows non-constant variance across the fitted values, which implies the model can be improved further.
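These visual diagnoses can be backed with formal tests; a minimal sketch, assuming the lmtest package is available for the Breusch-Pagan test (shapiro.test is base R). The same two tests can be rerun on the later models.
# Shapiro-Wilk test on the residuals: a small p-value => non-normal
shapiro.test(residuals(model1))
# Breusch-Pagan test: a small p-value => heteroscedastic residuals
lmtest::bptest(model1)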
Model 2: With Interaction Terms
model2 = lm(y ~ x1 + x2 + x3 + x1:x3 + x2:x3, data = data)
summary(model2)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x1:x3 + x2:x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.624 -5.768 -1.769 5.323 23.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.0206 18.9423 1.690 0.0984 .
## x1 -0.1363 0.3018 -0.452 0.6538
## x2 -0.6258 0.7466 -0.838 0.4067
## x31 14.9514 26.7319 0.559 0.5789
## x1:x31 0.2255 0.3806 0.593 0.5566
## x2:x31 -1.2949 1.0335 -1.253 0.2171
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.999 on 42 degrees of freedom
## Multiple R-squared: 0.5994, Adjusted R-squared: 0.5517
## F-statistic: 12.57 on 5 and 42 DF, p-value: 1.739e-07
By including the interaction terms, we obtain the estimated model \[ \hat{Y} = 32.02 - 0.136 X_1 - 0.626 X_2 + 14.95 X_3 + 0.226 X_1 X_3 - 1.295 X_2 X_3 \]
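Whether the interaction terms are jointly worth keeping can be checked with a nested-model F-test; a minimal sketch using base R's anova (not part of the original output):
# F-test of model1 (no interactions) vs model2 (with interactions);
# a large p-value means the interactions add little beyond model1
anova(model1, model2)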
Model Evaluation Metrics
print(data.frame("BIC" = BIC(model2), "AIC" = AIC(model2), "RMSE" = rmse(model2, data=data)))## BIC AIC RMSE
## 1 367.8352 354.7368 8.418261
Residual Analysis
par(mfrow=c(1,2))
plot(model2, which = 1:2)

From the residual plots, we see the variance across the fitted values is somewhat stabilized by the addition of interaction terms, but normality is still not achieved. We could try variance-stabilizing transformations. However, none of the estimated coefficients of this model are significant, and the model looks overcomplicated, so we should try fitting another model.
Model 3: With Institute type and Student Faculty Ratio
model3 = lm(y ~ x2 + x3, data = data)
summary(model3)
##
## Call:
## lm(formula = y ~ x2 + x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.374 -6.148 -2.189 5.633 25.735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.4294 8.3734 4.948 1.09e-05 ***
## x2 -1.4863 0.4642 -3.202 0.00251 **
## x31 7.2669 4.8071 1.512 0.13761
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.978 on 45 degrees of freedom
## Multiple R-squared: 0.5728, Adjusted R-squared: 0.5539
## F-statistic: 30.17 on 2 and 45 DF, p-value: 4.877e-09
Model Evaluation Metrics
data.frame("BIC" = BIC(model3), "AIC" = AIC(model3), "RMSE" = rmse(model3, data=data))## BIC AIC RMSE
## 1 359.3033 351.8185 8.692876
Residual Analysis
par(mfrow=c(1,2))
plot(model3, which = 1:2)

This model looks fairly promising: its AIC is lower than that of the previous models. But the residual plots still show non-constant variance across the fitted values. We can try transforming the data and fitting the model again.
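For a compact side-by-side comparison of the three candidates so far, base R's AIC and BIC accept several fitted models at once; a minimal sketch (not part of the original output):
# Information criteria for the three candidate models; lower is better
AIC(model1, model2, model3)
BIC(model1, model2, model3)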
Box-Cox Transformation
par(mfrow = c(1, 1))
bc <- MASS::boxcox(y ~ x2 + x3, data = data)
lambda <- bc$x[which.max(bc$y)]
print(lambda)
## [1] 0.3434343
From the Box-Cox procedure, we get an optimal value for lambda of 0.343. Let's transform Y and build the model on the transformed response.
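For reference, the Box-Cox family of power transformations applied in the next chunk is the standard \[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \] so with lambda = 0.343 the code below computes (y^0.343 - 1) / 0.343.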
Model 4: Transformed Y
data$y_trs = (data$y ^ lambda - 1) / lambda
model4 = lm(y_trs ~ x2 + x3, data = data)
summary(model4)
##
## Call:
## lm(formula = y_trs ~ x2 + x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1031 -0.6559 -0.2420 0.7583 1.9529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.32445 0.90662 8.079 2.65e-10 ***
## x2 -0.16613 0.05026 -3.305 0.00187 **
## x31 1.05289 0.52049 2.023 0.04905 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9721 on 45 degrees of freedom
## Multiple R-squared: 0.6298, Adjusted R-squared: 0.6134
## F-statistic: 38.28 on 2 and 45 DF, p-value: 1.946e-10
Model Evaluation Metrics
data.frame("BIC" = BIC(model4), "AIC" = AIC(model4), "RMSE" = rmse(model4, data=data))## BIC AIC RMSE
## 1 145.8865 138.4017 0.9412111
Residual Analysis
par(mfrow=c(1,2))
plot(model4, which = 1:2)

From the residual plots, we can see that we have achieved nearly constant variance across the fitted values, and the residuals are also approximately normally distributed. The model metrics obtained are the best for this model.
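As before, the normality claim can be double-checked formally; a minimal sketch using base R (not part of the original output):
# Shapiro-Wilk test on model4 residuals; a large p-value is
# consistent with the normality seen in the Q-Q plot
shapiro.test(residuals(model4))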
Result
Of the four models above, model4 is the best-performing model. Its adjusted R-squared of 0.61 is the highest among the models, its AIC (138.4) and BIC (145.88) are the lowest, and its RMSE (0.94) is also the lowest, although these last three are computed on the transformed response and so are not directly comparable with models 1-3. It also satisfies all the assumptions of the linear regression model, with significant coefficients. But we must be careful, as it predicts the transformed Y, and predictions must be back-transformed to the original scale. With that caveat, model4 may be used for predicting the Alumni Giving Rate.
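A minimal sketch of using model4 for prediction on the original scale, inverting the Box-Cox transform (the new_school values are purely illustrative):
# Hypothetical new school: student/faculty ratio of 10, private
new_school = data.frame(x2 = 10, x3 = factor("1", levels = c("0", "1")))
# Predict on the transformed scale, then invert the Box-Cox transform
y_trs_hat = predict(model4, newdata = new_school)
y_hat = (lambda * y_trs_hat + 1)^(1 / lambda)
y_hat  # predicted Alumni Giving Rate on the original scale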