Overview

In this analysis, I illustrate a univariate linear regression following the approach outlined in the textbook “Linear Regression Using R” by David Lilja. I use a data set from the ISLR R package. ISLR stands for the textbook “An Introduction to Statistical Learning: with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. The package contains a number of public data sets, and I use its College data set on college admissions to illustrate the linear regression approach.

In the next section, I describe the data set and the method to gather the data. Then I visualize the data set with scatter plots and exploratory data analysis. Next, I run the regression and describe the model outputs. Finally, I conduct residual analysis to see if a univariate linear model can explain the data well.

Obtaining and Describing the Data

We load the R package ISLR and then construct a data frame that keeps all the observations but only the columns needed for the univariate regression. According to the College{ISLR} documentation, the College data set is derived from the 1995 issue of US News and World Report.

library(ISLR)

school_name = labels(College)[[1]]  # Get the school names (the row names of the College data frame) as a column of strings.
Apps = College$Apps   # This column is the number of applications received in 1995
Enroll = College$Enroll # This column is the number of new students enrolled in 1995 at each college
# This column is the percentage of entering students 
# who were in the top 10 percent in their high school class.
Top10 = College$Top10perc

dat = data.frame( school_name, Apps, Enroll, Admission = Enroll/Apps, Top10 )
head(dat)

##                    school_name Apps Enroll Admission Top10
## 1 Abilene Christian University 1660    721 0.4343373    23
## 2           Adelphi University 2186    512 0.2342177    16
## 3               Adrian College 1428    336 0.2352941    22
## 4          Agnes Scott College  417    137 0.3285372    60
## 5    Alaska Pacific University  193     55 0.2849741    16
## 6            Albertson College  587    158 0.2691652    38

str(dat)

## 'data.frame':    777 obs. of  5 variables:
##  $ school_name: Factor w/ 777 levels "Abilene Christian University",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Admission  : num  0.434 0.234 0.235 0.329 0.285 ...
##  $ Top10      : num  23 16 22 60 16 38 17 37 30 21 ...

We explain the columns above by using the first row as an example. Abilene Christian University is the first entry in the College data frame. It received 1660 applications in 1995, and 721 of those applicants enrolled. The Admission column is the ratio of enrolled students to applications, which this analysis treats as the admission rate: \[\frac{721}{1660} \approx 43.43\%\] In addition, \(23\%\) of its entering students came from the top decile of their high school class.
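As a quick check of how the Admission column is built, the sketch below recomputes the ratio for the six schools shown by head(dat). The numbers are copied from that output; the vector names `apps`, `enroll` and `admission` are only illustrative and are not part of the original analysis.

```r
# Values copied from the head(dat) output above (first six schools)
apps   <- c(1660, 2186, 1428, 417, 193, 587)   # applications received
enroll <- c(721, 512, 336, 137, 55, 158)       # new students enrolled

# R divides vectors element by element, so one expression handles every school
admission <- enroll / apps
round(admission, 7)

# The first school expressed as a percentage
round(100 * admission[1], 2)
```

The rounded ratios should agree with the Admission column printed by head(dat).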

Visualizing and Describing the Dataset

Now we plot the admission rate against the percentage of entering students from the top 10 percent of their high school class (Top10). A more elite college can be expected to have a lower admission rate. We explore whether more elite colleges also have a higher percentage of students from the top 10 percent of their high school classes.

The plot below shows a wide spread of admission rates at a given level of Top10. For example, among colleges where 20 percent of the entering class comes from the top 10 percent of their high schools, admission rates range from under 10 percent to 80 percent.

However, when a college draws a high share of its class from the top 10 percent, the range of admission rates narrows and the rates themselves are lower. This suggests that a college with many elite entering freshmen tends to have tougher admissions standards.

plot( dat$Top10, dat$Admission, xlab="Top10 Pct", ylab = "Admission Rate")
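One way to put numbers on the spread visible in the scatter plot is to group schools into bands of Top10 and summarize the admission rate within each band. This is only a sketch: the 20-point band edges and the names `band` and `spread` are arbitrary choices made for illustration, not part of the original analysis.

```r
library(ISLR)

# Rebuild the two columns used in the scatter plot
dat <- data.frame(Top10 = College$Top10perc,
                  Admission = College$Enroll / College$Apps)

# Assumed band edges: (0,20], (20,40], (40,60], (60,80], (80,100]
dat$band <- cut(dat$Top10, breaks = seq(0, 100, by = 20))

# Count, minimum, maximum and standard deviation of the admission rate per band
spread <- aggregate(Admission ~ band, data = dat,
                    FUN = function(x) c(n = length(x), min = min(x),
                                        max = max(x), sd = sd(x)))
spread
```

If the pattern described above holds, the bands with higher Top10 should show a smaller maximum and a narrower range than the lower bands.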

Now let’s explore the distribution of Top10 through summary statistics and a histogram. The median percentage of top 10 students is 23 percent and the distribution is skewed to the right. At the extremes, one college draws only 1 percent of its students from the top 10 percent, while another draws 96 percent.

summary(dat$Top10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   15.00   23.00   27.56   35.00   96.00

hist(dat$Top10)

The distribution of admission rates is also right skewed. The median admission rate is 29 percent, and many schools have rates close to it. The right skew appears to be less prominent than for the Top10 variable. The most selective school has a 6.9 percent admission rate and the most permissive school has an 83.7 percent admission rate.

summary(dat$Admission)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06892 0.22011 0.29152 0.30937 0.38268 0.83705

hist(dat$Admission)
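The comparison of the two right skews can also be made numerically. The sketch below computes a moment-based skewness for both variables. Base R has no built-in skewness function, so `skew` is a hypothetical helper written for this illustration, and other definitions of sample skewness would give slightly different values.

```r
library(ISLR)

# Moment-based skewness: the mean cubed deviation divided by the
# 1.5 power of the mean squared deviation (hypothetical helper)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

top10     <- College$Top10perc
admission <- College$Enroll / College$Apps

# Positive values indicate a longer right tail;
# a larger value means a stronger right skew
c(Top10 = skew(top10), Admission = skew(admission))
```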

Constructing a linear regression

We now construct a linear regression model to explain admission rates in terms of the percentage of top 10 students. We expect the model to show a negative slope for the fitted line because a higher percentage of good students should imply a more demanding admissions process and lower admissions rate.

mod1 = lm( Admission ~ Top10 , data=dat)  # Use column names in the formula; data=dat supplies the columns
summary(mod1)

## 
## Call:
## lm(formula = Admission ~ Top10, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28501 -0.08276 -0.01843  0.06577  0.48362 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.3785252  0.0077077   49.11   <2e-16 ***
## Top10       -0.0025095  0.0002356  -10.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1158 on 775 degrees of freedom
## Multiple R-squared:  0.1277, Adjusted R-squared:  0.1266 
## F-statistic: 113.4 on 1 and 775 DF,  p-value: < 2.2e-16

The linear regression model shows:

\[ \text{Admission} = 0.3785252 - 0.0025095 \times \text{Top10} \]

Evaluating the Quality of the Model

The high F-statistic (F = 113.4) and extremely small p-value (below 2.2e-16) suggest that the linear regression model is statistically significant. If the top 10 percentage had no linear relationship with the admission rate, an F-statistic this large would be very unlikely.
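In a regression with a single predictor, the F-statistic equals the square of the slope's t-statistic, so the two tests in the summary agree. The sketch below checks this with the rounded values printed in the summary output; small differences come from that rounding. The variable names are illustrative only.

```r
# Values copied from the summary(mod1) output
t_slope <- -10.65   # t value for the Top10 slope
f_stat  <- 113.4    # F-statistic
df1 <- 1            # numerator degrees of freedom
df2 <- 775          # denominator degrees of freedom

# With one predictor, F should match t^2 up to rounding
t_slope^2

# Upper-tail probability of the F distribution, i.e. the p-value of the F-test
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value < 2.2e-16
```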

The R-squared of 0.1277 says that 12.77% of the variation in admission rates is explained by the model. This is a very low value, but it reflects the difficulty of explaining College admissions policies and outcomes with a single variable.
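Two relationships help in reading these numbers. With a single predictor, R-squared is the squared correlation between the predictor and the response, and the adjusted R-squared follows from R-squared, the sample size and the number of predictors. The sketch below works both out from the values reported in the summary; the variable names are illustrative.

```r
# Values copied from the summary(mod1) output
r_squared <- 0.1277
n <- 777   # number of colleges
p <- 1     # number of predictors

# With one predictor, R-squared is the squared correlation between Top10 and
# Admission; the sign of the correlation matches the negative slope
r <- -sqrt(r_squared)
round(r, 3)   # -0.357

# Adjusted R-squared penalizes the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)   # 0.1266
```

A correlation of about -0.36 is consistent with a real but weak negative association.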

In the plot below, the fitted line (in red) appears sensible and does not seem to be driven by outliers. The slope of the line is negative (-0.0025), as predicted, and both coefficients are statistically significant.

plot( dat$Top10, dat$Admission, xlab="Top10 Pct", ylab = "Admission Rate", main="Top 10 Pct vs. Admission Rates: With Linear Regression line" )
abline(mod1, col='red', lwd=3)
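Points on the red line can be computed directly from the reported coefficients. The sketch below evaluates the fitted equation at three Top10 values chosen arbitrarily for illustration; `fitted_rate` is a hypothetical helper, not a function from the original analysis.

```r
# Coefficients copied from the summary(mod1) output
intercept <- 0.3785252
slope     <- -0.0025095

# Hypothetical helper: fitted admission rate for a given Top10 percentage
fitted_rate <- function(top10) intercept + slope * top10

fitted_rate(c(10, 50, 90))   # 0.3534302 0.2530502 0.1526702
```

Each additional percentage point of Top10 lowers the fitted rate by about 0.0025, so moving from 10 to 90 percent lowers it by roughly 0.20.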

Residual Analysis

Below we show the four standard diagnostic plots that R produces for a linear model. The scale-location plot suggests that the assumption of equal variance is violated: the spread of the residuals at fitted admission rates near 0.35 is much greater than at the lowest fitted rates.

The residuals vs. leverage plot shows no significant influential observations driving the regression fit.

The Normal Q-Q plot shows the residuals are skewed to the right. Where fitted admission rates are high, that is, where Top10 is low, the actual admission rates show the greatest variation.

par(mfrow=c(2,2))  # Change the panel layout to 2 by 2

plot(mod1)
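The visual impressions from the diagnostic plots can be backed by numbers. The sketch below is one possible check, not the method used in the original analysis: it hand-computes a studentized Breusch-Pagan style statistic for unequal variance and lists the largest Cook's distances. The names `fit`, `aux`, `bp_stat`, `bp_p` and `cd` are illustrative.

```r
library(ISLR)

# Rebuild the data and the model so this block runs on its own
dat <- data.frame(Top10 = College$Top10perc,
                  Admission = College$Enroll / College$Apps,
                  row.names = rownames(College))
fit <- lm(Admission ~ Top10, data = dat)

# Studentized Breusch-Pagan style check, written out by hand:
# regress the squared residuals on the predictor and compare n * R^2
# with a chi-squared distribution on 1 degree of freedom
dat$sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ Top10, data = dat)
bp_stat <- nrow(dat) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)

# Cook's distance for every school; larger values mean more influence on the fit
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 5)
```

A small p-value would support the reading of unequal variance, and Cook's distances that all stay small would support the reading that no single school drives the fit.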

Let’s examine the histogram of the residuals. It confirms the right skew described above.

par(mfrow=c(1,1))

hist(mod1$residuals)
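A formal test can complement the histogram. The sketch below applies the Shapiro-Wilk test from base R to the residuals. It is offered as one possible check under the assumption that the model is refit as shown, and the name `sw` is illustrative.

```r
library(ISLR)

# Rebuild the data and the model so this block runs on its own
dat <- data.frame(Top10 = College$Top10perc,
                  Admission = College$Enroll / College$Apps)
fit <- lm(Admission ~ Top10, data = dat)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit))
sw
```

A small p-value would be consistent with the skew seen in the histogram and the Q-Q plot.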

Data 605 Discussion Week 11

Alexander Ng

11/7/2019
