## Setup

library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lubridate)
library(tibble)
library(knitr)
library(statsr)
library(SignifReg)
library(tidyverse)
library(car)

## 1. Introduction

Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that influence increased donation among alumniâ€™s, they might be able to implement policies that could lead to increased revenues. Research shows that students who are more satisfied with their contact with teachers are more likely to graduate. As a result, one might suspect that smaller class sizes and lower student-faculty ratios might leads to a higher percentage of satisfied graduates, which in turn might lead to increased alumni donations. We have taken a dataset containing information of 48 national universities (Americaâ€™s Best Colleges, Year 2000 Edition) and studied how the different factors affect the alumni giving rate. We have implemented a multiple linear regression model to answer this question.

## 2. Data Description

The alumni data set has 5 variables and 48 observations:

• School: Name of the school.
• Percent_of_classes_under_20: The percentage of classes offered with fewer than 20 students.
• Student_faculty_ratio: The ratio of the students enrolled to the number of faculty in school.
• Alumni_giving_rate: The percentage of alumni that donated to the university.
• Private: This is an indicator variable indicating if the school is a private (1) or public institute (0).

### Summary Statistics:

• alumni_giving_rate ranges from 7% to 67% with an average of 30% and no missing values.
• percent_of_classes_under_20 ranges from 29% to 77% with an average of 48% and no missing values.
• student_faculty_ratio ranges from 3 to 23 with an average of 12 and no missing values.

All values are reasonable and there seems to be no outlier.

# Reading Data
url <- "https://bgreenwell.github.io/uc-bana7052/data/alumni.csv"

summary_fn <- function(x)
{
alumni %>%
summarise(min=min(x),
max=max(x),
range=diff(range(x)),
mean=round(mean(x),2),
median=median(x),
missing=sum(is.na(x)),
Q1=quantile(x,probs=0.25),
Q2=quantile(x,probs=0.75))
}

summary_data <- rbind(summary_fn(alumni$alumni_giving_rate), summary_fn(alumni$percent_of_classes_under_20),
fitted_values = model$fitted.values) # Residuals vs Fitted-Values Plot ggplot(model_attributes, aes(x=fitted_values,y=residuals)) + geom_point() + geom_hline(yintercept = 0, color = "red") + geom_abline(intercept = 0, slope = 0.7, color = "blue") + geom_abline(intercept = 0, slope = -0.7, color = "blue") + ylim(-25, 25) To fix the non-constant variance problem we can apply Box-Cox Transformation to the response variable (Y) where: $Y^\lambda_{i} = (Y^\lambda_{i}-1)/\lambda\; for \; \lambda \neq 0$ $Y^\lambda_{i} = ln(Y{}i)\; for \; \lambda = 0$ The $$\lambda$$ obtained by the Box-Cox function is approximately 0.5 MASS::boxcox(alumni_giving_rate ~ student_faculty_ratio, data = alumni) Applying forward-selection algorithm using the transformed response variable i.e. $$Y^\lambda_{i} = (Y^\lambda_{i}-1)/\lambda$$ for $$\lambda =$$ 0.5 The final model selects student_faculty_ratio and private as the predictor variables with an $$R^2_{adj} =$$ 0.60. lambda=0.5 alumni$alumni_giving_rate_boxcox <- (alumni$alumni_giving_rate^lambda-1)/lambda scope <- alumni_giving_rate_boxcox ~ percent_of_classes_under_20 + student_faculty_ratio + private model1 <- SignifReg(scope=scope, data=alumni, alpha=0.1, direction="forward", criterion="r-adj", correction="FDR") summary(model1) ## ## Call: ## lm(formula = reg, data = data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.418 -1.082 -0.439 1.188 3.517 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.57852 1.51584 6.979 1.1e-08 *** ## student_faculty_ratio -0.27803 0.08403 -3.309 0.00185 ** ## private 1.66528 0.87023 1.914 0.06204 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.625 on 45 degrees of freedom ## Multiple R-squared: 0.6207, Adjusted R-squared: 0.6039 ## F-statistic: 36.82 on 2 and 45 DF, p-value: 3.363e-10 ### Model Diagnostics We will check if our final model satistfies all the assumptions: 1. Errors are normally distributed with mean=0 Using a Histogram, errors seem to be normally distributed and centred at 0. # Constructing a dataframe containing model attributes model_attributes1 <- data.frame(index=1:nrow(alumni), residuals = model1$residuals,
fitted_values = model1$fitted.values) # Plotting Histogram of Residuals model_attributes1 %>% ggplot(aes(x=residuals)) + geom_histogram(binwidth=2)  Using a Q-Q Plot, errors seem to be normally distributed as well. # Constructing Q-Q Plot qqnorm(model_attributes1$residuals)
qqline(model_attributes1\$residuals, col='red')

1. Uncorrelated Errors

There seems to be no pattern for the errors over time (index). Thus we can safely assume that the errors are uncorrelated.

# Plotting Residuals over Time
model_attributes1 %>%
ggplot(aes(x=index,y=residuals)) +
geom_point()

1. Constance Variance

We can clearly see that the residuals are constantly varied across the fitted values.

# Residuals vs Fitted-Value Plot
ggplot(model_attributes1, aes(x=fitted_values,y=residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
geom_hline(yintercept = 3, color = "blue") +
geom_hline(yintercept = -3, color = "blue")