In this analysis, I illustrate a univariate linear regression following the approach outlined in the textbook “Linear Regression Using R” by David Lilja. I use a data set from the ISLR R package. ISLR stands for the textbook “An Introduction to Statistical Learning: with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. The package contains a number of public data sets, and I use its College data set to illustrate the linear regression approach.
In the next section, I describe the data set and the method to gather the data. Then I visualize the data set with scatter plots and exploratory data analysis. Next, I run the regression and describe the model outputs. Finally, I conduct residual analysis to see if a univariate linear model can explain the data well.
We load the R package ISLR and then construct a data frame that keeps all the observations but only the columns needed for the univariate regression. According to the College{ISLR} documentation, the College data set is derived from the 1995 issue of US News and World Report.
library(ISLR)
school_name = labels(College)[[1]] # Get the school names (the row labels of the College data frame) as a column of strings.
Apps = College$Apps # This column is the number of applications submitted in 1995
Enroll = College$Enroll # This column is the number of enrolled students in 1995 in each college
# This column is the percentage of entering students
# who were in the top 10 percent in their high school class.
Top10 = College$Top10perc
dat = data.frame( school_name, Apps, Enroll, Admission = Enroll/Apps, Top10 )
head(dat)
## school_name Apps Enroll Admission Top10
## 1 Abilene Christian University 1660 721 0.4343373 23
## 2 Adelphi University 2186 512 0.2342177 16
## 3 Adrian College 1428 336 0.2352941 22
## 4 Agnes Scott College 417 137 0.3285372 60
## 5 Alaska Pacific University 193 55 0.2849741 16
## 6 Albertson College 587 158 0.2691652 38
str(dat)
## 'data.frame': 777 obs. of 5 variables:
## $ school_name: Factor w/ 777 levels "Abilene Christian University",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Admission : num 0.434 0.234 0.235 0.329 0.285 ...
## $ Top10 : num 23 16 22 60 16 38 17 37 30 21 ...
We explain the columns above by using the first row as an example. Abilene Christian University is the first entry in the College data frame. It received 1660 applications in 1995, and 721 students enrolled. With the admission rate defined here as Enroll/Apps, this yields \[\frac{721}{1660} = 43.43\%\] and \(23\%\) of its entering students came from the top decile of their high school class.
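As a quick check of this arithmetic, the short sketch below recomputes the Admission column for the six schools shown in the `head(dat)` output. The vector names `apps`, `enroll` and `admission` are illustrative and do not appear in the original code.

```r
# Values copied from the first six rows of the head(dat) output above
apps   <- c(1660, 2186, 1428, 417, 193, 587)
enroll <- c(721, 512, 336, 137, 55, 158)

# Admission rate as defined in this analysis: enrolled students per application
admission <- enroll / apps
print(round(admission, 7))
```

The rounded values should agree with the Admission column printed by `head(dat)`.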
Now we plot the admissions rate versus the top 10 percent. A more elite college can be expected to have a lower admissions rate. We explore whether more elite colleges have a higher percentage of students in the top 10 percent of their high school classes.
We see in the plot below that admission rates vary widely for a given percentage of students from the top 10 percent of their high school class. For example, when a college has 20 percent of its entering class coming from the top 10 percent of their high schools, admission rates range from under 10 percent to 80 percent.
However, when a college has a high share of its class coming from the top 10 percent, the range of admission rates narrows and the level is lower. This suggests that a college with many elite entering freshmen tends to have tougher admissions standards.
plot( dat$Top10, dat$Admission, xlab="Top10 Pct", ylab = "Admission Rate")
Now let’s explore the distribution of the top 10 variable through summary statistics and a histogram. The median percentage of top 10 students is 23 percent and the distribution is skewed to the right. At one extreme, a college has only 1 percent of its students from the top 10 percent; at the other, a college has 96 percent.
summary(dat$Top10)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 23.00 27.56 35.00 96.00
hist(dat$Top10)
The distribution of admission rates is also right skewed. Many schools have admission rates around the median of 29 percent. The right skew appears to be less prominent than for the top 10 percent variable. The most elite school has a 6.9 percent admission rate and the most permissive school has an 83.7 percent admission rate.
summary(dat$Admission)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06892 0.22011 0.29152 0.30937 0.38268 0.83705
hist(dat$Admission)
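One rough way to compare the two skews numerically is the quartile (Bowley) skewness, which uses only the quartiles already printed by `summary()`. This measure is my addition and not part of the original analysis; the helper name `bowley_skewness` is hypothetical, and the inputs are the rounded quartiles from the two summaries above.

```r
# Quartile (Bowley) skewness: positive values indicate a right skew.
# Hypothetical helper, not part of the original analysis.
bowley_skewness <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Rounded quartiles copied from the summary() outputs above
skew_top10     <- bowley_skewness(q1 = 15.00,   med = 23.00,   q3 = 35.00)
skew_admission <- bowley_skewness(q1 = 0.22011, med = 0.29152, q3 = 0.38268)

print(c(Top10 = skew_top10, Admission = skew_admission))
```

Both values are positive, and the one for Top10 is larger, which is consistent with the histograms: both variables are right skewed, the top 10 variable more so.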
We now construct a linear regression model to explain admission rates in terms of the percentage of top 10 students. We expect the model to show a negative slope for the fitted line because a higher percentage of good students should imply a more demanding admissions process and lower admissions rate.
mod1 = lm( dat$Admission ~ dat$Top10 , data=dat)
summary(mod1)
##
## Call:
## lm(formula = dat$Admission ~ dat$Top10, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28501 -0.08276 -0.01843 0.06577 0.48362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3785252 0.0077077 49.11 <2e-16 ***
## dat$Top10 -0.0025095 0.0002356 -10.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1158 on 775 degrees of freedom
## Multiple R-squared: 0.1277, Adjusted R-squared: 0.1266
## F-statistic: 113.4 on 1 and 775 DF, p-value: < 2.2e-16
The linear regression model shows:
\[ \text{admissions} = 0.3785252 - 0.0025095 \times \text{top10} \]
# Evaluating the Quality of the Model
The high F-statistic (F = 113.4) and the extremely small p-value (below 2.2e-16) suggest that the linear regression model is statistically significant. If the top 10 percentage were truly unrelated to the admission rate, an F-statistic this large would be extremely unlikely.
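In a regression with a single predictor, the F-statistic is the square of the slope's t-statistic, so the two tests carry the same information. The sketch below reproduces this from the rounded numbers in the `summary(mod1)` output; the variable names are illustrative, and small differences from the printed values come from rounding.

```r
# Slope estimate and standard error copied from summary(mod1)
estimate  <- -0.0025095
std_error <- 0.0002356

# t-statistic for the slope and its square
t_value <- estimate / std_error
f_value <- t_value^2

# Upper-tail probability of the reported F-statistic on 1 and 775 degrees of freedom
p_value <- pf(113.4, df1 = 1, df2 = 775, lower.tail = FALSE)

print(c(t = t_value, F = f_value, p = p_value))
```

The t-value comes out close to the reported -10.65, and its square is close to the reported F of 113.4.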
The R-squared says 12.77% of the variation can be explained by the model. This is a very low value but reflects the challenge of explaining all College admissions policy and results with a single variable.
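The adjusted R-squared in the summary can be recovered from the multiple R-squared and the sample size. The sketch below uses the standard adjustment formula with the rounded values from the output above; `n` and `p` are illustrative names for the number of observations and predictors.

```r
r_squared <- 0.1277  # Multiple R-squared from summary(mod1)
n <- 777             # number of colleges, from str(dat)
p <- 1               # one predictor (Top10)

# Adjusted R-squared penalizes the fit for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))
```

With one predictor and 777 observations the penalty is tiny, which is why the two values in the summary are nearly identical.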
Looking at the plot of the fitted line below in red, the line does appear sensible and not driven by outliers. The slope of the line is negative ( -0.0025 ) as I predicted and both coefficients are statistically significant.
plot( dat$Top10, dat$Admission, xlab="Top10 Pct", ylab = "Admission Rate", main="Top 10 Pct vs. Admission Rates: With Linear Regression line" )
abline(mod1, col='red', lwd=3)
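To read the red line numerically, we can evaluate the fitted equation at a few values of the top 10 percentage. This is a minimal sketch that hard-codes the coefficients from `summary(mod1)`; the function name `predict_admission` is hypothetical and the chosen inputs are the minimum, median and maximum of Top10 from the earlier summary.

```r
# Fitted line with coefficients copied from summary(mod1)
predict_admission <- function(top10) {
  0.3785252 - 0.0025095 * top10
}

# Minimum, median and maximum of Top10 from summary(dat$Top10)
top10_values <- c(1, 23, 96)
fitted_rates <- predict_admission(top10_values)
print(data.frame(Top10 = top10_values, Fitted = fitted_rates))
```

Over the observed range of Top10, the fitted admission rate falls from about 0.376 to about 0.138.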
Below we show the standard diagnostic plots for the model. The scale-location plot shows that the assumption of equal variance is violated: the spread of the points for admission rates near 0.35 is much greater than for those near 0.10.
The residuals vs. leverage plot shows no significant influential observations driving the regression fit.
The Normal Q-Q plot shows the residuals are skewed to the right. When admissions rates are high, the range of top 10 students shows the greatest variation.
par(mfrow=c(2,2)) # Change the panel layout to 2 by 2
plot(mod1)
Let’s examine the histogram of residuals. The plot of the residuals confirms the right skew as explained previously.
par(mfrow=c(1,1))
hist(mod1$residuals)
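The right skew seen in the histogram can also be summarized with a single number. The sketch below defines a simple moment-based skewness (a hypothetical helper, `sample_skewness`, not part of the original analysis) and, if the ISLR package is available, applies it to the residuals of the same model refitted with the formula interface. A positive result would agree with the histogram.

```r
# Moment-based sample skewness: positive for right-skewed data.
# Hypothetical helper, not part of the original analysis.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Refit the model only if the ISLR package is installed
if (requireNamespace("ISLR", quietly = TRUE)) {
  data("College", package = "ISLR")
  dat2 <- data.frame(Admission = College$Enroll / College$Apps,
                     Top10 = College$Top10perc)
  fit <- lm(Admission ~ Top10, data = dat2)
  print(sample_skewness(residuals(fit)))
}
```

Writing the formula as `Admission ~ Top10` with `data = dat2` fits the same line as `mod1`, only with shorter coefficient names.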
The univariate linear regression model of admission rates in terms of the top 10 percent is useful but does not tell the full story. Further work with other variables may explain college admission rates more effectively.