```
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lubridate)
library(tibble)
library(knitr)
library(statsr)
library(SignifReg)
library(tidyverse)
library(car)
```

Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that influence increased donations among alumni, they might be able to implement policies that lead to increased revenues. Research shows that students who are more satisfied with their contact with teachers are more likely to graduate. As a result, one might suspect that smaller class sizes and lower student-faculty ratios lead to a higher percentage of satisfied graduates, which in turn might lead to increased alumni donations. **We have taken a dataset containing information on 48 national universities (America's Best Colleges, Year 2000 Edition) and studied how the different factors affect the alumni giving rate. We have implemented a multiple linear regression model to answer this question.**

The alumni data set has **5 variables and 48 observations**:

- **School**: Name of the school.
- **Percent_of_classes_under_20**: The percentage of classes offered with fewer than 20 students.
- **Student_faculty_ratio**: The ratio of students enrolled to the number of faculty at the school.
- **Alumni_giving_rate**: The percentage of alumni that donated to the university.
- **Private**: An indicator variable for whether the school is private (1) or public (0).

- **alumni_giving_rate** ranges from 7% to 67% with an average of about 29% and no missing values.
- **percent_of_classes_under_20** ranges from 29% to 77% with an average of about 56% and no missing values.
- **student_faculty_ratio** ranges from 3 to 23 with an average of about 12 and no missing values.

**All values are reasonable and there appear to be no outliers.**

```
# Reading the data
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"
alumni <- read.csv(url)

# Helper that computes summary statistics for a numeric vector
summary_fn <- function(x) {
  alumni %>%
    summarise(min = min(x),
              max = max(x),
              range = diff(range(x)),
              mean = round(mean(x), 2),
              median = median(x),
              missing = sum(is.na(x)),
              Q1 = quantile(x, probs = 0.25),
              Q3 = quantile(x, probs = 0.75))
}

summary_data <- rbind(summary_fn(alumni$alumni_giving_rate),
                      summary_fn(alumni$percent_of_classes_under_20),
                      summary_fn(alumni$student_faculty_ratio))

summary_data <- summary_data %>%
  mutate(Variable = c("alumni_giving_rate",
                      "percent_of_classes_under_20",
                      "student_faculty_ratio")) %>%
  select(Variable, everything())

kable(summary_data)
```

Variable | min | max | range | mean | median | missing | Q1 | Q3 |
---|---|---|---|---|---|---|---|---|
alumni_giving_rate | 7 | 67 | 60 | 29.27 | 29.0 | 0 | 18.75 | 38.50 |
percent_of_classes_under_20 | 29 | 77 | 48 | 55.73 | 59.5 | 0 | 44.75 | 66.25 |
student_faculty_ratio | 3 | 23 | 20 | 11.54 | 10.5 | 0 | 8.00 | 13.50 |

- The distribution of **alumni_giving_rate** seems bi-modal and right skewed.
- The distribution of **percent_of_classes_under_20** seems bi-modal as well. It has a strong positive correlation of 0.646 with the response variable.
- The distribution of **student_faculty_ratio** seems right skewed. It has a strong negative correlation of -0.742 with the response variable.
- Roughly **70% of the sampled schools are Private**. Private schools also seem to have a positive association with alumni_giving_rate.
- All predictor variables seem to be strongly correlated with one another, so we will need to **check for multi-collinearity post model building** (a VIF check appears after the final model diagnostics).

```
# Pairwise scatterplots, correlations, and distributions for the key variables
alumni %>%
  select(alumni_giving_rate, percent_of_classes_under_20,
         student_faculty_ratio, private) %>%
  ggpairs()
```

Based on the **Exploratory Data Analysis**, these are the predictor variables that show some association with the alumni giving rate:

- student_faculty_ratio
- percent_of_classes_under_20
- private

Now we will run a **forward-selection** algorithm to determine the best possible predictor variables based on \(R^2_{adj}\). The methodology is as follows:

- Start with single predictor regression of response vs each explanatory variable
- Pick the model with the highest \(R^2_{adj}\)
- Add remaining variables one at a time to the existing model, and pick the model with the highest \(R^2_{adj}\)
- Repeat until the addition of any of the remaining variables does not result in a higher \(R^2_{adj}\)

Note: The alpha used is 0.1, and a false discovery rate (FDR) correction is applied to control the Type-I error rate.
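To make the procedure concrete, here is a minimal hand-rolled sketch of the same idea (illustrative only; the actual selection below uses `SignifReg`). At each step it adds the candidate that raises \(R^2_{adj}\) the most and stops when no candidate improves it; it omits the significance/FDR screening that `SignifReg` applies.

```
# A hand-rolled sketch of forward selection on adjusted R-squared
# (illustrative only; the actual selection below uses SignifReg)
candidates <- c("percent_of_classes_under_20", "student_faculty_ratio", "private")
selected <- character(0)
best_adj_r2 <- -Inf
repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break
  # Adjusted R-squared of the current model plus each remaining candidate
  adj_r2 <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "alumni_giving_rate")
    summary(lm(f, data = alumni))$adj.r.squared
  })
  if (max(adj_r2) <= best_adj_r2) break  # no improvement, so stop
  best_adj_r2 <- max(adj_r2)
  selected <- c(selected, names(which.max(adj_r2)))
}
selected
```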

The model selects only **student_faculty_ratio** as the predictor variable with an \(R^2_{adj} =\) 0.54

```
# The scope of predictors that the forward-selection algorithm will examine
scope <- alumni_giving_rate ~
  percent_of_classes_under_20 +
  student_faculty_ratio +
  private

# Building the model with the forward-selection algorithm
model <- SignifReg(scope = scope,
                   data = alumni,
                   alpha = 0.1,
                   direction = "forward",
                   criterion = "r-adj",
                   correction = "FDR")
summary(model)
```

```
##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.328 -5.692 -1.471 4.058 24.272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.0138 3.4215 15.495 < 2e-16 ***
## student_faculty_ratio -2.0572 0.2737 -7.516 1.54e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.103 on 46 degrees of freedom
## Multiple R-squared: 0.5512, Adjusted R-squared: 0.5414
## F-statistic: 56.49 on 1 and 46 DF, p-value: 1.544e-09
```

Using the residuals vs. fitted-values plot, we can conclude that the **constant variance assumption is being violated** for this model. The residuals fan out, i.e. their spread increases as the fitted values increase.

```
# Constructing a data frame containing model attributes
model_attributes <- data.frame(index = 1:nrow(alumni),
                               residuals = model$residuals,
                               fitted_values = model$fitted.values)

# Residuals vs. fitted-values plot; the blue lines sketch the fanning envelope
ggplot(model_attributes, aes(x = fitted_values, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  geom_abline(intercept = 0, slope = 0.7, color = "blue") +
  geom_abline(intercept = 0, slope = -0.7, color = "blue") +
  ylim(-25, 25)
```
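The fanning pattern can also be checked formally. Below is a minimal sketch, assuming the selected model is equivalent to the plain `lm` refit shown, using `car::ncvTest()` (a score test for non-constant error variance); a small p-value supports the violation seen in the plot.

```
# Formal check of non-constant error variance (supplementary sketch)
# Assumes the selected model is equivalent to this lm refit
fit_sfr <- lm(alumni_giving_rate ~ student_faculty_ratio, data = alumni)
car::ncvTest(fit_sfr)  # small p-value suggests non-constant variance
```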

To fix the non-constant variance problem we can apply a **Box-Cox transformation** to the response variable (Y), where:

\[Y^{(\lambda)}_{i} = (Y^{\lambda}_{i}-1)/\lambda \quad \text{for} \; \lambda \neq 0\]

\[Y^{(\lambda)}_{i} = \ln(Y_{i}) \quad \text{for} \; \lambda = 0\]

The \(\lambda\) obtained by the Box-Cox function is approximately 0.5

`MASS::boxcox(alumni_giving_rate ~ student_faculty_ratio, data = alumni)`
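As a minimal sketch, the \(\lambda\) that maximizes the profile log-likelihood can also be read off directly from the object returned by `MASS::boxcox()`:

```
# Extract the lambda that maximizes the Box-Cox profile log-likelihood
bc <- MASS::boxcox(alumni_giving_rate ~ student_faculty_ratio,
                   data = alumni, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # approximately 0.5 for these data
```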

Applying the **forward-selection** algorithm using the transformed response variable, i.e. \(Y^{(\lambda)}_{i} = (Y^{\lambda}_{i}-1)/\lambda\) for \(\lambda =\) 0.5

The **final model** selects **student_faculty_ratio** and **private** as the predictor variables with an \(R^2_{adj} =\) 0.60.

```
# Applying the Box-Cox transformation with lambda = 0.5
lambda <- 0.5
alumni$alumni_giving_rate_boxcox <- (alumni$alumni_giving_rate^lambda - 1) / lambda

# The scope of predictors for the forward-selection algorithm
scope <- alumni_giving_rate_boxcox ~
  percent_of_classes_under_20 +
  student_faculty_ratio +
  private

# Building the model on the transformed response
model1 <- SignifReg(scope = scope,
                    data = alumni,
                    alpha = 0.1,
                    direction = "forward",
                    criterion = "r-adj",
                    correction = "FDR")
summary(model1)
```

```
##
## Call:
## lm(formula = reg, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.418 -1.082 -0.439 1.188 3.517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.57852 1.51584 6.979 1.1e-08 ***
## student_faculty_ratio -0.27803 0.08403 -3.309 0.00185 **
## private 1.66528 0.87023 1.914 0.06204 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.625 on 45 degrees of freedom
## Multiple R-squared: 0.6207, Adjusted R-squared: 0.6039
## F-statistic: 36.82 on 2 and 45 DF, p-value: 3.363e-10
```

We will check whether our **final model** satisfies all the assumptions:

**Errors are normally distributed with mean=0**

Based on the histogram, the errors appear to be normally distributed and centred at 0.

```
# Constructing a data frame containing model attributes for the final model
model_attributes1 <- data.frame(index = 1:nrow(alumni),
                                residuals = model1$residuals,
                                fitted_values = model1$fitted.values)

# Plotting a histogram of the residuals
model_attributes1 %>%
  ggplot(aes(x = residuals)) +
  geom_histogram(binwidth = 2)
```

The Q-Q plot also suggests that the errors are approximately normally distributed.

```
# Constructing Q-Q Plot
qqnorm(model_attributes1$residuals)
qqline(model_attributes1$residuals, col='red')
```
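As a supplementary check (not part of the original analysis), a Shapiro-Wilk test on the residuals gives a formal assessment of normality; a large p-value is consistent with what the histogram and Q-Q plot show.

```
# Formal normality check of the residuals (supplementary)
shapiro.test(model_attributes1$residuals)
```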

**Uncorrelated Errors**

There seems to be no pattern in the errors over the observation index, so we can reasonably assume that the errors are uncorrelated.

```
# Plotting residuals against the observation index
model_attributes1 %>%
  ggplot(aes(x = index, y = residuals)) +
  geom_point()
```
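A formal check, sketched below under the assumption that the final model is equivalent to the `lm` refit shown, is the Durbin-Watson test from the already-loaded `car` package; a statistic near 2 is consistent with uncorrelated errors.

```
# Durbin-Watson test for serial correlation in the residuals (supplementary sketch)
# Assumes the final model is equivalent to this lm refit
fit_final <- lm(alumni_giving_rate_boxcox ~ student_faculty_ratio + private,
                data = alumni)
car::durbinWatsonTest(fit_final)  # statistic near 2 suggests uncorrelated errors
```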

**Constant Variance**

The residuals show roughly constant spread across the fitted values, so the constant variance assumption now appears to be satisfied.

```
# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
geom_hline(yintercept = 3, color = "blue") +
geom_hline(yintercept = -3, color = "blue")
```
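Finally, as noted in the exploratory analysis, we should check for multicollinearity among the selected predictors. A minimal sketch, again assuming the final model is equivalent to an ordinary `lm` refit, uses `car::vif()`; values near 1 (and well below the common threshold of 5) would indicate that multicollinearity is not a concern for the final model.

```
# Multicollinearity check via variance inflation factors (supplementary sketch)
# Assumes the final model is equivalent to this lm refit
fit_final <- lm(alumni_giving_rate_boxcox ~ student_faculty_ratio + private,
                data = alumni)
car::vif(fit_final)  # VIFs near 1 indicate little multicollinearity
```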