Welcome to Lab 3! In this assignment you will use linear regression to explore which institutional characteristics predict post-graduation earnings using the College Scorecard dataset.
What is “knitting”? When you are finished, you will “knit” this document to HTML. Click the Knit button at the top of RStudio (or press Ctrl+Shift+K). This will run all your code and produce an HTML report. Submit both this .Rmd file AND the knitted HTML file.
Important: Make sure all your code runs without errors before knitting. If a chunk has an error, knitting will fail. A good practice is to use Run All (Ctrl+Alt+R) before knitting to check that everything works.
First, let’s load the College Scorecard dataset. This file contains data on 1,326 institutions with variables covering admissions, enrollment, finances, and outcomes.
scorecard <- read.csv("collegescorecard.csv")
# View(scorecard)
# NOTE: View() opens an interactive table in RStudio but does NOT work
# when knitting. Keep it commented out. You can run it manually in
# your console to browse the data.
str(scorecard)
## 'data.frame': 1326 obs. of 19 variables:
## $ unitid : int 100654 100663 100706 100724 100751 100830 100858 100937 101435 101480 ...
## $ instname : chr "Alabama A & M University" "University of Alabama at Birmingham" "University of Alabama in Huntsville" "Alabama State University" ...
## $ city : chr "Normal" "Birmingham" "Huntsville" "Montgomery" ...
## $ state : chr "AL" "AL" "AL" "AL" ...
## $ control : int 1 1 1 1 1 1 1 2 2 1 ...
## $ adm_rate : num 89.9 86.7 80.6 51.2 56.5 ...
## $ sat_avg : int 823 1146 1180 830 1171 970 1215 1177 999 1036 ...
## $ enrollment : int 4051 11200 5525 5354 28692 4322 19761 1181 1100 7195 ...
## $ pctpell : num 71.2 35 32.8 82.7 21.1 ...
## $ pctfloan : num 82 54 47.3 87.3 41.5 ...
## $ avg_net_price : num 13415 14805 17520 11936 20916 ...
## $ grad_rate : num 29.1 53.8 48.4 25.2 66.7 ...
## $ retention_rate : num 63.1 80.2 81 62.2 87 ...
## $ earn10yr : int 31400 40300 46600 27800 42400 34800 45400 41900 37100 35100 ...
## $ pct_edu_program: num 14.9 8.62 1.73 21.5 8.4 ...
## $ avg_facsalary : num 7079 10170 9341 6557 9605 ...
## $ ft_faculty : num 88.6 91.1 65.5 66.4 71.1 ...
## $ instr_expend : num 7459 17208 9352 7393 9817 ...
## $ pell_cat : int 3 2 1 3 1 2 1 1 2 2 ...
head(scorecard)
## unitid instname city state control adm_rate
## 1 100654 Alabama A & M University Normal AL 1 89.89
## 2 100663 University of Alabama at Birmingham Birmingham AL 1 86.73
## 3 100706 University of Alabama in Huntsville Huntsville AL 1 80.62
## 4 100724 Alabama State University Montgomery AL 1 51.25
## 5 100751 The University of Alabama Tuscaloosa AL 1 56.55
## 6 100830 Auburn University at Montgomery Montgomery AL 1 83.71
## sat_avg enrollment pctpell pctfloan avg_net_price grad_rate retention_rate
## 1 823 4051 71.15 82.04 13415 29.14 63.14
## 2 1146 11200 35.05 53.97 14805 53.77 80.16
## 3 1180 5525 32.81 47.28 17520 48.35 80.98
## 4 830 5354 82.65 87.35 11936 25.17 62.19
## 5 1171 28692 21.07 41.48 20916 66.65 87.00
## 6 970 4322 40.06 64.76 11915 27.05 63.21
## earn10yr pct_edu_program avg_facsalary ft_faculty instr_expend pell_cat
## 1 31400 14.90 7079 88.56 7459 3
## 2 40300 8.62 10170 91.06 17208 2
## 3 46600 1.73 9341 65.55 9352 1
## 4 27800 21.50 6557 66.41 7393 3
## 5 42400 8.40 9605 71.09 9817 1
## 6 34800 14.06 7173 92.62 6817 2
Question: How many observations and variables do we have? What types of variables do you see (numeric, character, factor)? Take a moment to familiarize yourself with the column names.
We have 1,326 observations and 19 variables in the dataset. The variables include a mixture of integer, numeric, and character types. Six variables are stored as integers, including unitid, control, sat_avg, enrollment, earn10yr, and pell_cat. Ten variables are stored as numeric values, including adm_rate, pctpell, pctfloan, avg_net_price, grad_rate, retention_rate, pct_edu_program, avg_facsalary, ft_faculty, and instr_expend. Three variables are stored as character strings: instname, city, and state. If a variable such as state were to be used in a regression model, it would typically be converted to a factor variable so that it can be included as a categorical variable. The column names indicate that the dataset contains information on institutional characteristics, student demographics, and financial variables that may influence earnings outcomes.
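The point about converting coded or character variables to factors can be sketched as follows. This is a minimal illustration, assuming a toy data frame (`toy`, a hypothetical name) built from the six rows printed by head(scorecard). The category labels follow the variable table in this lab, and the new column names (`control_f`, `pell_cat_f`, `state_f`) are illustrative only.

```r
# Toy data: values copied from the six rows printed by head(scorecard)
toy <- data.frame(
  state    = c("AL", "AL", "AL", "AL", "AL", "AL"),
  control  = c(1L, 1L, 1L, 1L, 1L, 1L),
  pell_cat = c(3L, 2L, 1L, 3L, 1L, 2L)
)

# Integer codes become labelled categories. Levels are listed explicitly
# so that categories absent from the toy data (e.g. control = 2) are kept.
toy$control_f  <- factor(toy$control, levels = c(1, 2),
                         labels = c("Public", "Private nonprofit"))
toy$pell_cat_f <- factor(toy$pell_cat, levels = 1:3,
                         labels = c("Low", "Medium", "High"))

# Character variables can be converted directly
toy$state_f <- factor(toy$state)

str(toy)
table(toy$pell_cat_f)
```

Stored as integers, control and pell_cat would be treated as continuous numbers by lm(); as factors, R creates indicator variables for each category instead.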
Our dependent variable (outcome) is
earn10yr — the median earnings of students 10 years after
entering the institution. Your task is to identify which institutional
characteristics best predict post-graduation earnings.
Here is a reminder of the available variables you can use as independent variables (predictors):
| Variable | Description |
|---|---|
| unitid | Unique institution identifier |
| instname | Institution name |
| city | City |
| state | State |
| control | 1 = public, 2 = private nonprofit |
| adm_rate | Admission rate (0-100) |
| sat_avg | Average SAT score (~850-1540) |
| enrollment | Total student enrollment |
| pctpell | Percent Pell grant recipients (0-100, proxy for socioeconomic status) |
| pctfloan | Percent receiving federal loans (0-100) |
| avg_net_price | Average net price in dollars |
| grad_rate | 6-year graduation rate (0-100); an interesting predictor to consider! |
| retention_rate | First-year retention rate (0-100) |
| earn10yr | Median earnings 10 years after entry (DV) |
| pct_edu_program | Percent education programs |
| avg_facsalary | Average faculty monthly salary |
| ft_faculty | Percent full-time faculty (0-100) |
| instr_expend | Instructional expenditure per FTE student |
| pell_cat | 1 = low Pell (<=33%), 2 = medium (33-66%), 3 = high (>66%) |
2a. To what extent are differences in median earnings ten years after college entry associated with institutional selectivity, student socioeconomic composition, and institutional outcomes, and do institutions that enroll larger proportions of Pell grant recipients produce different earnings outcomes after accounting for selectivity, graduation rates, retention rates, and net price?
2b. List 5-6 candidate independent variables from the table above. For each one, briefly explain why you think it might be related to post-graduation earnings.
2c. Rank your chosen variables in order of expected importance (strongest predictor first). This is your hypothesis — we will check it against the data later!
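Before we run any regressions, we need to make sure the data are
clean. Let’s use the describe() function to get a detailed summary of
every variable. Both the psych and Hmisc packages export a function
named describe(); the output below (n, missing, distinct, Info, Gmd, and
quantiles) is in the Hmisc format. If you need a specific version, call
it explicitly as psych::describe() or Hmisc::describe().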
describe(scorecard)
## scorecard
##
## 19 Variables 1326 Observations
## --------------------------------------------------------------------------------
## unitid
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1326 1 182728 182415 46374 110637
## .10 .25 .50 .75 .90 .95
## 128248 152539 183066 214650 230786 236995
##
## lowest : 100654 100663 100706 100724 100751, highest: 436818 436827 436836 442356 448840
## --------------------------------------------------------------------------------
## instname
## n missing distinct
## 1326 0 1315
##
## lowest : Abilene Christian University Abraham Baldwin Agricultural College Adams State University Adelphi University Adrian College
## highest: Yale University Yeshiva University York College York College Pennsylvania Young Harris College
## --------------------------------------------------------------------------------
## city
## n missing distinct
## 1326 0 867
##
## lowest : Aberdeen Abilene Abington Ada Adrian
## highest: Worcester Yankton York Young Harris Ypsilanti
## --------------------------------------------------------------------------------
## state
## n missing distinct
## 1326 0 52
##
## lowest : AK AL AR AZ CA, highest: VT WA WI WV WY
## --------------------------------------------------------------------------------
## control
## n missing distinct Info Mean
## 1326 0 2 0.706 1.621
##
## Value 1 2
## Frequency 502 824
## Proportion 0.379 0.621
## --------------------------------------------------------------------------------
## adm_rate
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1212 1 64.18 65.24 20.77 28.76
## .10 .25 .50 .75 .90 .95
## 39.58 53.81 66.16 76.95 86.68 92.10
##
## lowest : 5.69 5.84 7.05 7.41 7.42 , highest: 99.75 99.77 99.81 99.87 100
## --------------------------------------------------------------------------------
## sat_avg
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 466 1 1061 1050 142.2 878.0
## .10 .25 .50 .75 .90 .95
## 920.5 976.0 1043.0 1122.0 1235.0 1322.8
##
## lowest : 666 716 723 749 750, highest: 1497 1501 1503 1504 1534
## --------------------------------------------------------------------------------
## enrollment
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1234 1 5579 3686 6450 568
## .10 .25 .50 .75 .90 .95
## 794 1348 2512 6499 15735 21734
##
## lowest : 126 158 160 161 195, highest: 39460 43139 43931 47079 50919
## --------------------------------------------------------------------------------
## pctpell
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1177 1 36.49 35.81 15.86 14.90
## .10 .25 .50 .75 .90 .95
## 18.61 26.21 35.58 44.89 54.17 61.68
##
## lowest : 6.16 7.84 9.11 9.58 9.83 , highest: 84.98 87.58 92.46 92.59 94.51
## --------------------------------------------------------------------------------
## pctfloan
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1177 1 60.23 61.05 18.43 31.84
## .10 .25 .50 .75 .90 .95
## 38.61 49.71 61.89 72.03 79.75 84.38
##
## lowest : 0 2.9 3.24 4.82 8.62 , highest: 93.34 93.69 93.95 94.57 100
## --------------------------------------------------------------------------------
## avg_net_price
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1284 1 18848 18526 7336 9129
## .10 .25 .50 .75 .90 .95
## 11096 14236 18118 22790 28120 31168
##
## lowest : 1776 2035 2452 3698 4299, highest: 37971 38225 38809 40306 41414
## --------------------------------------------------------------------------------
## grad_rate
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1184 1 54.75 54.3 19.71 28.43
## .10 .25 .50 .75 .90 .95
## 33.48 42.49 53.70 66.27 79.38 86.47
##
## lowest : 4.84 4.9 5.36 8.38 10.17, highest: 95.76 95.78 96.94 97.47 97.79
## --------------------------------------------------------------------------------
## retention_rate
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1111 1 75.75 75.98 12.84 57.08
## .10 .25 .50 .75 .90 .95
## 61.11 68.37 75.98 83.62 90.72 94.15
##
## lowest : 0 21.82 33.13 42.5 43.33, highest: 98.61 98.67 99.34 99.49 100
## --------------------------------------------------------------------------------
## earn10yr
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 378 1 42939 41900 10496 30200
## .10 .25 .50 .75 .90 .95
## 32600 36500 41400 47475 54800 60875
##
## lowest : 17600 22400 22900 24600 24700, highest: 85800 87200 91600 110600 116400
## --------------------------------------------------------------------------------
## pct_edu_program
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 786 0.989 7.147 6.485 7.562 0.00
## .10 .25 .50 .75 .90 .95
## 0.00 0.53 5.57 11.23 16.48 20.07
##
## lowest : 0 0.02 0.03 0.07 0.08 , highest: 38.33 39.68 50 58.51 64.97
## --------------------------------------------------------------------------------
## avg_facsalary
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1187 1 7526 7336 2268 4741
## .10 .25 .50 .75 .90 .95
## 5252 6112 7199 8509 10216 11548
##
## lowest : 1476 2660 2709 2978 2985, highest: 16042 16120 16589 17861 19862
## --------------------------------------------------------------------------------
## ft_faculty
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1000 0.995 70.82 71.52 26.58 29.31
## .10 .25 .50 .75 .90 .95
## 38.34 53.15 71.93 94.44 100.00 100.00
##
## lowest : 4.03 14.23 15.71 16 16.24, highest: 99.49 99.52 99.69 99.72 100
## --------------------------------------------------------------------------------
## instr_expend
## n missing distinct Info Mean pMedian Gmd .05
## 1326 0 1253 1 10135 8668 5976 4655
## .10 .25 .50 .75 .90 .95
## 5260 6514 8243 10790 15112 20956
##
## lowest : 1889 2036 2167 2238 2480, highest: 78147 88200 92854 97613 106214
## --------------------------------------------------------------------------------
## pell_cat
## n missing distinct Info Mean pMedian Gmd
## 1326 0 3 0.76 1.619 1.5 0.5498
##
## Value 1 2 3
## Frequency 549 733 44
## Proportion 0.414 0.553 0.033
## --------------------------------------------------------------------------------
colSums(is.na(scorecard))
## unitid instname city state control
## 0 0 0 0 0
## adm_rate sat_avg enrollment pctpell pctfloan
## 0 0 0 0 0
## avg_net_price grad_rate retention_rate earn10yr pct_edu_program
## 0 0 0 0 0
## avg_facsalary ft_faculty instr_expend pell_cat
## 0 0 0 0
If there are missing values, we will remove them using
na.omit(). This is a simple approach — in a more advanced
course you might learn about imputation methods instead.
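To see what na.omit() does before applying it, here is a small sketch. The data frame `toy` is hypothetical: the values are taken from the first rows of the dataset, but the two NA entries are inserted artificially for illustration (the real file has no missing values).

```r
# Hypothetical toy data: two NAs are inserted on purpose for illustration
toy <- data.frame(
  unitid   = c(100654, 100663, 100706),
  sat_avg  = c(823, NA, 1180),
  earn10yr = c(31400, 40300, NA)
)

# Count missing values per column, as done for scorecard above
colSums(is.na(toy))

# na.omit() drops every row that has an NA in ANY column (listwise deletion)
toy_complete <- na.omit(toy)

nrow(toy)           # rows before
nrow(toy_complete)  # rows after
```

Because a single missing cell removes the whole row, it is worth comparing nrow() before and after to see how much data the step costs.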
scorecard <- na.omit(scorecard)
# Check descriptives again after removing missing data
describe(scorecard)
Each institution should appear only once, identified by
unitid. Let’s verify there are no duplicates.
scorecard$dup_unitid <- duplicated(scorecard$unitid)
table(scorecard$dup_unitid)
##
## FALSE
## 1326
summary(scorecard)
## unitid instname city state
## Min. :100654 Length:1326 Length:1326 Length:1326
## 1st Qu.:152539 Class :character Class :character Class :character
## Median :183066 Mode :character Mode :character Mode :character
## Mean :182728
## 3rd Qu.:214650
## Max. :448840
## control adm_rate sat_avg enrollment
## Min. :1.000 Min. : 5.69 Min. : 666 Min. : 126
## 1st Qu.:1.000 1st Qu.: 53.81 1st Qu.: 976 1st Qu.: 1348
## Median :2.000 Median : 66.16 Median :1043 Median : 2512
## Mean :1.621 Mean : 64.18 Mean :1061 Mean : 5579
## 3rd Qu.:2.000 3rd Qu.: 76.95 3rd Qu.:1122 3rd Qu.: 6499
## Max. :2.000 Max. :100.00 Max. :1534 Max. :50919
## pctpell pctfloan avg_net_price grad_rate
## Min. : 6.16 Min. : 0.00 Min. : 1776 Min. : 4.84
## 1st Qu.:26.21 1st Qu.: 49.71 1st Qu.:14236 1st Qu.:42.49
## Median :35.58 Median : 61.90 Median :18118 Median :53.70
## Mean :36.49 Mean : 60.23 Mean :18848 Mean :54.75
## 3rd Qu.:44.89 3rd Qu.: 72.03 3rd Qu.:22790 3rd Qu.:66.27
## Max. :94.51 Max. :100.00 Max. :41414 Max. :97.79
## retention_rate earn10yr pct_edu_program avg_facsalary
## Min. : 0.00 Min. : 17600 Min. : 0.000 Min. : 1476
## 1st Qu.: 68.37 1st Qu.: 36500 1st Qu.: 0.530 1st Qu.: 6112
## Median : 75.98 Median : 41400 Median : 5.570 Median : 7199
## Mean : 75.75 Mean : 42939 Mean : 7.147 Mean : 7526
## 3rd Qu.: 83.61 3rd Qu.: 47475 3rd Qu.:11.227 3rd Qu.: 8509
## Max. :100.00 Max. :116400 Max. :64.970 Max. :19862
## ft_faculty instr_expend pell_cat dup_unitid
## Min. : 4.03 Min. : 1889 Min. :1.000 Mode :logical
## 1st Qu.: 53.15 1st Qu.: 6514 1st Qu.:1.000 FALSE:1326
## Median : 71.93 Median : 8243 Median :2.000
## Mean : 70.82 Mean : 10135 Mean :1.619
## 3rd Qu.: 94.44 3rd Qu.: 10790 3rd Qu.:2.000
## Max. :100.00 Max. :106214 Max. :3.000
Question: Are there any missing value or out-of-range data problems? How many observations remain after removing missing data? Did you find any duplicates?
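There were no missing values in the dataset, so all 1,326 observations remain after na.omit(). The summary() output shows no out-of-range values: every percentage variable falls between 0 and 100. A few extremes are still worth a closer look: retention_rate and pctfloan each have a minimum of 0, and the minimum sat_avg (666) is below the approximate lower bound of 850 listed in the variable table. I also checked for duplicate institutions using the unitid variable and found none, meaning each institution appears only once in the data.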
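The range and duplicate checks described above can also be done programmatically rather than by reading summary() output. The sketch below assumes a toy data frame (`toy`, hypothetical) holding the six rows printed by head(scorecard); the helper names `pct_vars`, `n_out_of_range`, and `n_dup_ids` are illustrative.

```r
# Toy data: values copied from the six rows printed by head(scorecard)
toy <- data.frame(
  unitid         = c(100654, 100663, 100706, 100724, 100751, 100830),
  adm_rate       = c(89.89, 86.73, 80.62, 51.25, 56.55, 83.71),
  pctpell        = c(71.15, 35.05, 32.81, 82.65, 21.07, 40.06),
  grad_rate      = c(29.14, 53.77, 48.35, 25.17, 66.65, 27.05),
  retention_rate = c(63.14, 80.16, 80.98, 62.19, 87.00, 63.21)
)

# Variables documented as percentages should lie in [0, 100]
pct_vars <- c("adm_rate", "pctpell", "grad_rate", "retention_rate")

# Count values outside the valid range, one count per variable
n_out_of_range <- sapply(toy[pct_vars],
                         function(x) sum(x < 0 | x > 100, na.rm = TRUE))
n_out_of_range

# Count repeated institution IDs without adding a column to the data
n_dup_ids <- sum(duplicated(toy$unitid))
n_dup_ids
```

On the full dataset, the same two expressions applied to scorecard would give one count per percentage variable and a single duplicate count.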
Let’s create some quick scatterplots to get a visual sense of how key
variables relate to post-graduation earnings. Notice that
earn10yr is on the x-axis here — we are just exploring
patterns.
plot(scorecard$earn10yr, scorecard$pctpell,
main = "Median Earnings vs. Pct Pell",
xlab = "Median Earnings 10 Years After Entry", ylab = "Percent Pell Grant Recipients")
plot(scorecard$earn10yr, scorecard$sat_avg,
main = "Median Earnings vs. SAT Average",
xlab = "Median Earnings 10 Years After Entry", ylab = "SAT Average")
plot(scorecard$earn10yr, scorecard$avg_facsalary,
main = "Median Earnings vs. Avg Faculty Salary",
xlab = "Median Earnings 10 Years After Entry", ylab = "Average Faculty Salary")
plot(scorecard$earn10yr, scorecard$ft_faculty,
main = "Median Earnings vs. Full-Time Faculty",
xlab = "Median Earnings 10 Years After Entry", ylab = "Percent Full-Time Faculty")
Question: What patterns do you notice in these scatterplots? Which variables appear to have the strongest relationship with median earnings? Are the relationships positive or negative?
Looking at the scatterplots, the strongest relationships with median earnings appear to be average SAT score and percent of Pell grant recipients. The SAT plot shows a clear upward trend, where institutions with higher average SAT scores tend to have higher median earnings ten years after entry. This suggests a positive relationship between SAT scores and earnings.
The percent of Pell grant recipients shows the opposite pattern. As the percentage of Pell recipients increases, median earnings tend to decrease. This indicates a negative relationship between the share of low-income students at an institution and earnings outcomes.
Average faculty salary appears to have a moderate positive relationship with median earnings. Institutions with higher faculty salaries tend to have somewhat higher earnings, although the pattern is more scattered and less clear than the relationship with SAT scores or Pell grant recipients.
The percent of full time faculty shows very little clear relationship with median earnings. The points are widely spread with no strong upward or downward trend.
Overall, average SAT score has the strongest positive relationship with median earnings, while percent Pell recipients has the strongest negative relationship. The other variables show weaker patterns.
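A quick numeric companion to these visual impressions is the Pearson correlation of each plotted variable with earn10yr. The sketch below is illustrative only: `toy` is a hypothetical data frame holding just the six rows printed by head(scorecard), all from one state, so its values will not match the full-sample correlations.

```r
# Toy data: values copied from the six rows printed by head(scorecard)
toy <- data.frame(
  earn10yr      = c(31400, 40300, 46600, 27800, 42400, 34800),
  sat_avg       = c(823, 1146, 1180, 830, 1171, 970),
  pctpell       = c(71.15, 35.05, 32.81, 82.65, 21.07, 40.06),
  avg_facsalary = c(7079, 10170, 9341, 6557, 9605, 7173),
  ft_faculty    = c(88.56, 91.06, 65.55, 66.41, 71.09, 92.62)
)

predictors <- c("sat_avg", "pctpell", "avg_facsalary", "ft_faculty")

# Correlation of each predictor with the outcome
cors <- sapply(toy[predictors], function(x) cor(x, toy$earn10yr))

# Order by strength (absolute value), strongest first
cors[order(abs(cors), decreasing = TRUE)]
```

The sign of each correlation gives the direction of the relationship and the absolute value gives its strength, which is the same information the scatterplots convey visually.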
Now let’s look at the correlations among all numeric variables. We
will use rcorr() from the Hmisc package, which
gives us both correlation coefficients and p-values in one step.
# Select only numeric columns for correlation
numeric_cols <- scorecard[, sapply(scorecard, is.numeric)]
corr_results <- rcorr(as.matrix(numeric_cols))
# View correlation coefficients (rounded to 3 decimal places)
round(corr_results$r, 3)
## unitid control adm_rate sat_avg enrollment pctpell pctfloan
## unitid 1.000 -0.042 0.070 -0.048 -0.075 -0.052 0.035
## control -0.042 1.000 -0.152 0.140 -0.553 -0.087 0.272
## adm_rate 0.070 -0.152 1.000 -0.362 -0.032 0.079 0.269
## sat_avg -0.048 0.140 -0.362 1.000 0.224 -0.742 -0.526
## enrollment -0.075 -0.553 -0.032 0.224 1.000 -0.137 -0.376
## pctpell -0.052 -0.087 0.079 -0.742 -0.137 1.000 0.504
## pctfloan 0.035 0.272 0.269 -0.526 -0.376 0.504 1.000
## avg_net_price -0.042 0.614 -0.088 0.392 -0.231 -0.457 0.122
## grad_rate 0.014 0.229 -0.295 0.832 0.173 -0.719 -0.339
## retention_rate -0.078 0.012 -0.258 0.766 0.302 -0.628 -0.420
## earn10yr 0.002 0.081 -0.269 0.638 0.178 -0.569 -0.343
## pct_edu_program -0.013 0.004 0.165 -0.266 -0.133 0.171 0.207
## avg_facsalary -0.025 -0.152 -0.326 0.662 0.412 -0.515 -0.501
## ft_faculty 0.028 -0.075 -0.039 0.136 0.053 -0.082 -0.063
## instr_expend -0.050 0.122 -0.452 0.648 0.094 -0.410 -0.425
## pell_cat -0.019 -0.088 0.053 -0.636 -0.112 0.851 0.428
## avg_net_price grad_rate retention_rate earn10yr pct_edu_program
## unitid -0.042 0.014 -0.078 0.002 -0.013
## control 0.614 0.229 0.012 0.081 0.004
## adm_rate -0.088 -0.295 -0.258 -0.269 0.165
## sat_avg 0.392 0.832 0.766 0.638 -0.266
## enrollment -0.231 0.173 0.302 0.178 -0.133
## pctpell -0.457 -0.719 -0.628 -0.569 0.171
## pctfloan 0.122 -0.339 -0.420 -0.343 0.207
## avg_net_price 1.000 0.496 0.323 0.382 -0.220
## grad_rate 0.496 1.000 0.797 0.633 -0.216
## retention_rate 0.323 0.797 1.000 0.615 -0.289
## earn10yr 0.382 0.633 0.615 1.000 -0.368
## pct_edu_program -0.220 -0.216 -0.289 -0.368 1.000
## avg_facsalary 0.242 0.617 0.652 0.690 -0.383
## ft_faculty -0.076 0.103 0.064 -0.003 0.041
## instr_expend 0.246 0.549 0.488 0.513 -0.264
## pell_cat -0.436 -0.631 -0.565 -0.499 0.187
## avg_facsalary ft_faculty instr_expend pell_cat
## unitid -0.025 0.028 -0.050 -0.019
## control -0.152 -0.075 0.122 -0.088
## adm_rate -0.326 -0.039 -0.452 0.053
## sat_avg 0.662 0.136 0.648 -0.636
## enrollment 0.412 0.053 0.094 -0.112
## pctpell -0.515 -0.082 -0.410 0.851
## pctfloan -0.501 -0.063 -0.425 0.428
## avg_net_price 0.242 -0.076 0.246 -0.436
## grad_rate 0.617 0.103 0.549 -0.631
## retention_rate 0.652 0.064 0.488 -0.565
## earn10yr 0.690 -0.003 0.513 -0.499
## pct_edu_program -0.383 0.041 -0.264 0.187
## avg_facsalary 1.000 0.031 0.646 -0.450
## ft_faculty 0.031 1.000 0.089 -0.078
## instr_expend 0.646 0.089 1.000 -0.340
## pell_cat -0.450 -0.078 -0.340 1.000
# View p-values (rounded to 4 decimal places)
round(corr_results$P, 4)
## unitid control adm_rate sat_avg enrollment pctpell pctfloan
## unitid NA 0.1222 0.0112 0.0796 0.0065 0.0592 0.2062
## control 0.1222 NA 0.0000 0.0000 0.0000 0.0015 0.0000
## adm_rate 0.0112 0.0000 NA 0.0000 0.2483 0.0039 0.0000
## sat_avg 0.0796 0.0000 0.0000 NA 0.0000 0.0000 0.0000
## enrollment 0.0065 0.0000 0.2483 0.0000 NA 0.0000 0.0000
## pctpell 0.0592 0.0015 0.0039 0.0000 0.0000 NA 0.0000
## pctfloan 0.2062 0.0000 0.0000 0.0000 0.0000 0.0000 NA
## avg_net_price 0.1219 0.0000 0.0014 0.0000 0.0000 0.0000 0.0000
## grad_rate 0.6055 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## retention_rate 0.0045 0.6634 0.0000 0.0000 0.0000 0.0000 0.0000
## earn10yr 0.9423 0.0033 0.0000 0.0000 0.0000 0.0000 0.0000
## pct_edu_program 0.6475 0.8976 0.0000 0.0000 0.0000 0.0000 0.0000
## avg_facsalary 0.3717 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## ft_faculty 0.3110 0.0060 0.1567 0.0000 0.0536 0.0029 0.0228
## instr_expend 0.0708 0.0000 0.0000 0.0000 0.0006 0.0000 0.0000
## pell_cat 0.4915 0.0013 0.0547 0.0000 0.0000 0.0000 0.0000
## avg_net_price grad_rate retention_rate earn10yr pct_edu_program
## unitid 0.1219 0.6055 0.0045 0.9423 0.6475
## control 0.0000 0.0000 0.6634 0.0033 0.8976
## adm_rate 0.0014 0.0000 0.0000 0.0000 0.0000
## sat_avg 0.0000 0.0000 0.0000 0.0000 0.0000
## enrollment 0.0000 0.0000 0.0000 0.0000 0.0000
## pctpell 0.0000 0.0000 0.0000 0.0000 0.0000
## pctfloan 0.0000 0.0000 0.0000 0.0000 0.0000
## avg_net_price NA 0.0000 0.0000 0.0000 0.0000
## grad_rate 0.0000 NA 0.0000 0.0000 0.0000
## retention_rate 0.0000 0.0000 NA 0.0000 0.0000
## earn10yr 0.0000 0.0000 0.0000 NA 0.0000
## pct_edu_program 0.0000 0.0000 0.0000 0.0000 NA
## avg_facsalary 0.0000 0.0000 0.0000 0.0000 0.0000
## ft_faculty 0.0054 0.0002 0.0195 0.8997 0.1316
## instr_expend 0.0000 0.0000 0.0000 0.0000 0.0000
## pell_cat 0.0000 0.0000 0.0000 0.0000 0.0000
## avg_facsalary ft_faculty instr_expend pell_cat
## unitid 0.3717 0.3110 0.0708 0.4915
## control 0.0000 0.0060 0.0000 0.0013
## adm_rate 0.0000 0.1567 0.0000 0.0547
## sat_avg 0.0000 0.0000 0.0000 0.0000
## enrollment 0.0000 0.0536 0.0006 0.0000
## pctpell 0.0000 0.0029 0.0000 0.0000
## pctfloan 0.0000 0.0228 0.0000 0.0000
## avg_net_price 0.0000 0.0054 0.0000 0.0000
## grad_rate 0.0000 0.0002 0.0000 0.0000
## retention_rate 0.0000 0.0195 0.0000 0.0000
## earn10yr 0.0000 0.8997 0.0000 0.0000
## pct_edu_program 0.0000 0.1316 0.0000 0.0000
## avg_facsalary NA 0.2628 0.0000 0.0000
## ft_faculty 0.2628 NA 0.0011 0.0043
## instr_expend 0.0000 0.0011 NA 0.0000
## pell_cat 0.0000 0.0043 0.0000 NA
Let’s save the correlation matrix to a CSV file so you can examine it more easily in Excel or Google Sheets.
write.csv(round(corr_results$r, 3), file = "Correlation Matrix College Scorecard.csv")
corrplot(corr_results$r, method = "color", type = "upper",
tl.cex = 0.7, tl.col = "black",
addCoef.col = "black", number.cex = 0.5,
title = "Correlation Matrix - College Scorecard",
mar = c(0, 0, 2, 0))
This plot shows how median earnings distributions differ across institutions grouped by the percentage of Pell grant recipients (a proxy for the socioeconomic status of the student body).
scorecard$pell_cat_label <- factor(scorecard$pell_cat,
labels = c("Low", "Medium", "High"))
ggplot(scorecard, aes(x = earn10yr, group = pell_cat_label, col = pell_cat_label)) +
geom_density() +
labs(title = "Median Earnings Distribution by Pell Grant Category",
x = "Median Earnings 10 Years After Entry",
y = "Density",
color = "Pell Category") +
theme_minimal()
Question: Which variables have the strongest
correlations with earn10yr? Are there any pairs of
independent variables that are very highly correlated with each
other (potential multicollinearity concerns)? How do median
earnings distributions differ across Pell categories?
Looking at the correlation matrix, several variables have strong relationships with earn10yr. The strongest positive correlations with earnings are average faculty salary (0.690), average SAT score (0.638), and graduation rate (0.633). Retention rate (0.615) and instructional expenditure (0.513) also show positive relationships with earnings. In contrast, percent of Pell grant recipients has a fairly strong negative correlation with earnings (-0.569), suggesting that institutions with higher shares of Pell recipients tend to have lower median earnings ten years after entry.
There are also some independent variables that are highly correlated with each other, which could raise multicollinearity concerns. For example, average SAT score and graduation rate have a correlation of 0.832, and SAT score and retention rate have a correlation of 0.766. Graduation rate and retention rate are also strongly related (0.797). These variables measure different things conceptually, but they move together in the data, likely because more selective institutions admit academically stronger students, who also tend to have higher completion rates. Percent Pell recipients is strongly negatively correlated with SAT score (-0.742) and graduation rate (-0.719), which suggests that institutions enrolling fewer low-income students tend to be more selective and have higher completion rates.
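Scanning a large matrix by eye is error-prone, so the search for highly correlated predictor pairs can be automated. The sketch below rebuilds a 4 x 4 slice of the correlation matrix from the coefficients reported above; the cutoff of 0.75 is an assumption chosen for illustration, not a rule from this lab, and the names `r`, `cutoff`, and `high_pairs` are hypothetical.

```r
# Correlations copied from the rcorr() output above
vars <- c("sat_avg", "grad_rate", "retention_rate", "pctpell")
r <- matrix(c(
   1.000,  0.832,  0.766, -0.742,
   0.832,  1.000,  0.797, -0.719,
   0.766,  0.797,  1.000, -0.628,
  -0.742, -0.719, -0.628,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Assumed cutoff for "very highly correlated" (illustrative only)
cutoff <- 0.75

# Look only above the diagonal so each pair is reported once
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
high_pairs
```

With the full matrix, replacing `r` by corr_results$r would apply the same filter to every pair of numeric variables.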
When looking at the distribution of earnings across the Pell groups, schools with larger shares of Pell grant recipients tend to have lower median earnings, while schools with fewer Pell recipients tend to have higher earnings outcomes.
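The screening for highly correlated predictor pairs described above can be sketched in a few lines. This is only an illustration on simulated data, not the scorecard file: the data frame `toy`, its column names, and the 0.7 cutoff are all assumptions made for the example.

```r
# Simulated data: two predictors that share a common driver, one that does not
set.seed(1)
n <- 300
selectivity <- rnorm(n)
toy <- data.frame(
  sat   = selectivity + rnorm(n, sd = 0.4),
  grad  = selectivity + rnorm(n, sd = 0.4),
  price = rnorm(n)
)

r <- cor(toy)

# Blank out the diagonal and upper triangle so each pair appears once
r[upper.tri(r, diag = TRUE)] <- NA

# Locate pairs whose absolute correlation exceeds the (assumed) 0.7 cutoff
high <- which(abs(r) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r    = r[high],
  stringsAsFactors = FALSE
)
print(high_pairs)
```

With the real data, `toy` would be replaced by the numeric columns of scorecard; the cutoff is a judgment call rather than a fixed rule.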
Now we move from exploration to modeling. We will start with a
bivariate regression — one predictor and one outcome.
In R, we use the lm() function (short for “linear model”).
The syntax is:
lm(dependent_variable ~ independent_variable, data = your_data)
The tilde (~) means “predicted by.” So
earn10yr ~ pctpell reads as “median earnings predicted by
percent Pell recipients.”
Here is what the bivariate regression looks like using
pctpell as the predictor. This is just an
example. You should replace pctpell with YOUR
first-choice independent variable from Section 2.
# EXAMPLE ONLY — this shows you the syntax
regression_1 <- lm(earn10yr ~ pctpell, data = scorecard)
summary(regression_1)
Replace pctpell below with the first variable from your
ranked list in Section 2.
# TODO: Replace pctpell with YOUR first independent variable
regression_1 <- lm(earn10yr ~ sat_avg, data = scorecard)
summary(regression_1)
##
## Call:
## lm(formula = earn10yr ~ sat_avg, data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23281 -4836 -991 3789 71815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9444.869 1751.497 -5.392 0.0000000822 ***
## sat_avg 49.387 1.639 30.135 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7807 on 1324 degrees of freedom
## Multiple R-squared: 0.4068, Adjusted R-squared: 0.4064
## F-statistic: 908.1 on 1 and 1324 DF, p-value: < 0.00000000000000022
Question: Is the result in line with your expectations? Interpret the following:
The result is generally in line with my expectations. In the earlier scatterplots and correlations, average SAT score appeared to have a strong positive relationship with median earnings, and the regression results show the same pattern. The slope is 49.387 which means that for every one point increase in average SAT score, median earnings ten years after entry increase by about $49 on average. For example, if one school has an average SAT score that is 100 points higher than another, the model predicts about $4,939 in higher median earnings. The predictor is statistically significant, with a p-value less than 2e-16. The R-squared value is 0.4068 which means that average SAT score alone explains about 40.7% of the variation in median earnings across the institutions in the dataset. This suggests that institutional selectivity, as measured by SAT scores, is strongly associated with differences in earnings outcomes.
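The 100-point comparison in the answer above can be reproduced directly from the reported coefficients. The sketch below hard-codes the intercept and slope from the summary(regression_1) output; the SAT values of 1000 and 1100 and the helper `predict_earn` are arbitrary choices made for illustration.

```r
# Coefficients copied from the summary(regression_1) output above
intercept <- -9444.869
slope_sat <- 49.387

# Predicted median earnings for a given average SAT score
predict_earn <- function(sat) intercept + slope_sat * sat

# Two illustrative institutions, 100 SAT points apart
pred_1000 <- predict_earn(1000)
pred_1100 <- predict_earn(1100)
gap_100   <- pred_1100 - pred_1000

print(c(pred_1000 = pred_1000, pred_1100 = pred_1100, gap_100 = gap_100))
```

The gap works out to 100 times the slope, about $4,939. The negative intercept is the fitted value at an SAT average of 0, far outside the observed range, so it has no practical interpretation.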
Before we trust our regression results, we need to check several assumptions. Let’s run through the key diagnostics for your bivariate model.
If the residuals are roughly normally distributed, the histogram should look approximately bell-shaped.
hist(scale(regression_1$residuals),
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "lightblue",
     breaks = 20)
R provides four built-in diagnostic plots for regression models. These help you assess linearity, normality, homoscedasticity (equal variance), and influential observations.
par(mfrow = c(2, 2))
plot(regression_1)
par(mfrow = c(1, 1))
What to look for in each plot:
Standardized coefficients allow you to compare the relative importance of predictors (especially useful later with multiple regression). We standardize by scaling both variables to z-scores.
# TODO: Replace pctpell with YOUR variable
lm(scale(earn10yr) ~ scale(sat_avg), data = scorecard)
##
## Call:
## lm(formula = scale(earn10yr) ~ scale(sat_avg), data = scorecard)
##
## Coefficients:
## (Intercept) scale(sat_avg)
## -0.0000000000000003479 0.6378355198648106850
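As a minimal sketch of what the scale() call is doing, the example below checks on simulated data that the standardized slope equals the raw slope multiplied by sd(x) / sd(y), and that with a single predictor it also equals Pearson's r. The variables `x` and `y` are hypothetical and are not taken from the scorecard file.

```r
set.seed(7)
n <- 500
x <- rnorm(n, mean = 1050, sd = 120)        # hypothetical predictor
y <- 5000 + 40 * x + rnorm(n, sd = 6000)    # hypothetical outcome

fit_raw <- lm(y ~ x)
fit_std <- lm(scale(y) ~ scale(x))

b_raw    <- unname(coef(fit_raw)["x"])
beta_std <- unname(coef(fit_std)[2])

# Standardized slope = raw slope rescaled by sd(x) / sd(y)
beta_by_hand <- b_raw * sd(x) / sd(y)

# With one predictor, the standardized slope equals Pearson's r,
# and its square equals R-squared
r_xy <- cor(x, y)
r2   <- summary(fit_raw)$r.squared

print(c(beta_std = beta_std, beta_by_hand = beta_by_hand, r_xy = r_xy, r2 = r2))
```

The same identity can be checked against the output above: 0.6378 squared is about 0.4068, the Multiple R-squared reported for regression_1.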
A common rule of thumb: observations whose standardized residuals exceed 2 in absolute value may be outliers worth investigating.
regression_1$standardized.residuals <- rstandard(regression_1)
regression_1$large_residual <- abs(regression_1$standardized.residuals) > 2
# How many potential outliers?
sum(regression_1$large_residual)
## [1] 50
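To put that count in context, it helps to compare it with what normally distributed errors would produce by chance. The arithmetic below is a rough benchmark only; it assumes standardized residuals behave like draws from a standard normal distribution.

```r
n_obs     <- 1326   # institutions in the dataset
n_flagged <- 50     # count reported above for regression_1

observed_share <- n_flagged / n_obs

# Share of a standard normal distribution with absolute value above 2
expected_share <- 2 * pnorm(-2)
expected_count <- expected_share * n_obs

print(round(c(observed_share = observed_share,
              expected_share = expected_share,
              expected_count = expected_count), 4))
```

By this benchmark, about 60 flagged observations would be expected from chance alone, so a count of 50 is not alarming by itself. The size of the largest residuals matters more: the maximum residual of 71815 is roughly nine residual standard errors (7807) above zero.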
The Durbin-Watson test checks for autocorrelation in the residuals (whether errors are independent). Values close to 2 indicate no autocorrelation. Values significantly below 2 suggest positive autocorrelation, and values above 2 suggest negative autocorrelation.
dwt(regression_1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.2134893 1.572985 0
## Alternative hypothesis: rho != 0
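For intuition, the Durbin-Watson statistic can be computed by hand from the residuals. The sketch below uses simulated data and base R only (it does not call dwt()), and it also shows that the statistic depends on the order of the rows, which matters for cross-sectional data. All object names are hypothetical.

```r
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x)
e <- unname(residuals(fit))

# Durbin-Watson: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

# Exact link between the two; the last term shrinks as n grows,
# which gives the familiar approximation DW ~ 2 * (1 - r1)
dw_from_r1 <- 2 * (1 - r1) - (e[1]^2 + e[n]^2) / sum(e^2)

# The statistic depends on row order: shuffling the rows changes it
e_shuffled  <- e[sample(n)]
dw_shuffled <- sum(diff(e_shuffled)^2) / sum(e_shuffled^2)

print(c(dw = dw, dw_from_r1 = dw_from_r1, dw_shuffled = dw_shuffled))
```

In the output above, 2 × (1 − 0.2135) is about 1.573, which matches the reported D-W statistic.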
Question: Did you find any major assumption violations? Specifically comment on:
No severe assumption violations were observed, though two mild issues are worth noting. The histogram and Q-Q plot suggest that the residuals are approximately normally distributed, although a few observations in the upper tail deviate from the line. The Scale-Location plot shows a somewhat increasing spread in the residuals at higher fitted values, which suggests mild heteroscedasticity, but the pattern is not severe. A small number of observations have larger residuals, but none appear to be extremely influential in the leverage plot. The Durbin-Watson statistic is 1.57 with a significant p-value, which suggests some positive autocorrelation in the residuals. Because the dataset is cross-sectional rather than time-series, the statistic reflects the order of the rows rather than time. The file appears to be sorted by unitid, which places institutions from the same state next to each other, so the result more likely reflects clustering among neighboring institutions than temporal dependence, and it is less concerning here.
Now comes the fun part! You will build your regression model sequentially, adding one predictor at a time. This lets you see how each new variable contributes to explaining post-graduation earnings.
Instructions:
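Start from your bivariate model (regression_1), add one predictor at a time (regression_2, regression_3, and so on), and name your last model regression_final.
Replace YOUR_VAR_1, YOUR_VAR_2, etc. with your actual variable names.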
# TODO: Replace YOUR_VAR_1 and YOUR_VAR_2 with your variable names
regression_2 <- lm(earn10yr ~ sat_avg + pctpell, data = scorecard)
summary(regression_2)
##
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell, data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21499 -4723 -932 3613 71259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8949.893 3200.266 2.797 0.00524 **
## sat_avg 37.218 2.404 15.479 < 0.0000000000000002 ***
## pctpell -150.387 22.053 -6.819 0.0000000000139 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7676 on 1323 degrees of freedom
## Multiple R-squared: 0.427, Adjusted R-squared: 0.4261
## F-statistic: 492.9 on 2 and 1323 DF, p-value: < 0.00000000000000022
# TODO: Replace with your variable names
regression_3 <- lm(earn10yr ~ sat_avg + pctpell + grad_rate, data = scorecard)
summary(regression_3)
##
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell + grad_rate, data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20265 -4413 -965 3553 69436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13743.660 3203.309 4.290 0.000019128650303 ***
## sat_avg 22.642 3.072 7.372 0.000000000000296 ***
## pctpell -105.099 22.467 -4.678 0.000003195105989 ***
## grad_rate 164.651 22.247 7.401 0.000000000000239 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7524 on 1322 degrees of freedom
## Multiple R-squared: 0.4498, Adjusted R-squared: 0.4485
## F-statistic: 360.2 on 3 and 1322 DF, p-value: < 0.00000000000000022
# TODO: Replace with your variable names
regression_4 <- lm(earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate, data = scorecard)
summary(regression_4)
##
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate,
## data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20269 -4187 -966 3324 68121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8408.834 3270.142 2.571 0.0102 *
## sat_avg 17.114 3.153 5.427 0.000000067944 ***
## pctpell -102.443 22.152 -4.625 0.000004120461 ***
## grad_rate 99.042 24.294 4.077 0.000048374510 ***
## retention_rate 193.974 30.902 6.277 0.000000000467 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7417 on 1321 degrees of freedom
## Multiple R-squared: 0.4657, Adjusted R-squared: 0.4641
## F-statistic: 287.9 on 4 and 1321 DF, p-value: < 0.00000000000000022
# TODO: Replace with your variable names
regression_5 <- lm(earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate + avg_net_price, data = scorecard)
summary(regression_5)
##
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate +
## avg_net_price, data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20679 -4005 -785 3092 66018
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4045.52154 3397.28187 1.191 0.233942
## sat_avg 18.13862 3.14053 5.776 0.0000000095483 ***
## pctpell -82.56622 22.46408 -3.675 0.000247 ***
## grad_rate 64.77687 25.36554 2.554 0.010769 *
## retention_rate 212.36399 30.97733 6.855 0.0000000000109 ***
## avg_net_price 0.16100 0.03677 4.379 0.0000128630470 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7367 on 1320 degrees of freedom
## Multiple R-squared: 0.4734, Adjusted R-squared: 0.4714
## F-statistic: 237.3 on 5 and 1320 DF, p-value: < 0.00000000000000022
# TODO: Replace with your variable names
regression_final <- lm(earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate + avg_net_price + instr_expend, data = scorecard)
summary(regression_final)
##
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate +
## avg_net_price + instr_expend, data = scorecard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20882 -3967 -714 3057 66432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14236.83859 3596.76007 3.958 0.000079529629077 ***
## sat_avg 7.34118 3.39891 2.160 0.0310 *
## pctpell -107.32303 22.25809 -4.822 0.000001588249506 ***
## grad_rate 53.07165 24.90273 2.131 0.0333 *
## retention_rate 218.06299 30.36159 7.182 0.000000000001143 ***
## avg_net_price 0.15887 0.03603 4.410 0.000011185332031 ***
## instr_expend 0.23817 0.03184 7.480 0.000000000000135 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7218 on 1319 degrees of freedom
## Multiple R-squared: 0.4948, Adjusted R-squared: 0.4925
## F-statistic: 215.3 on 6 and 1319 DF, p-value: < 0.00000000000000022
Here is an example of what a completed final model might look like. Yours will likely be different!
# EXAMPLE ONLY — do not copy this blindly!
regression_final <- lm(earn10yr ~ avg_facsalary + ft_faculty + instr_expend + enrollment + pctpell, data = scorecard)
summary(regression_final)
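One way to check whether an added predictor contributes significantly is to compare the smaller and larger models with anova() from base R. The sketch below uses simulated data with hypothetical variables `x1` and `x2`; it is not the scorecard analysis. When the models differ by a single predictor, the F statistic equals the square of that predictor's t value in the larger model, so the coefficient table already carries the same information.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2)

m1 <- lm(y ~ x1, data = toy)        # smaller model
m2 <- lm(y ~ x1 + x2, data = toy)   # adds one predictor

# F-test for the improvement in fit from adding x2
cmp <- anova(m1, m2)
print(cmp)

f_stat <- cmp$F[2]
t_stat <- summary(m2)$coefficients["x2", "t value"]

print(c(f_stat = f_stat, t_squared = t_stat^2))
```

The same pattern would apply to any two adjacent models in the sequence above, for example anova(regression_1, regression_2).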
Question: How does R-squared change as you add variables? Does each new variable contribute significantly (check the p-value for each coefficient)? Did any variable become non-significant after adding others?
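The explanatory power of the model increases as additional variables are added, though each step adds only about 0.01 to 0.02 to R-squared. The first model using only sat_avg had an R-squared of 0.4068. After adding pctpell, the R-squared increased to 0.427. Including grad_rate raised it further to 0.4498, and adding retention_rate increased it to 0.4657. When avg_net_price was included, the R-squared rose to 0.4734, and the final model with instr_expend reached an R-squared of 0.4948. The gains are not strictly shrinking: avg_net_price added the least (0.0077), while instr_expend added 0.0214. The final model explains about 49.5% of the variation in median earnings across institutions. Each new variable was statistically significant when it entered (all p < 0.001), and no variable became non-significant after others were added. However, sat_avg and grad_rate weakened considerably: in the final model their p-values are 0.031 and 0.033, and the sat_avg coefficient fell from 49.4 in the bivariate model to 7.3.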
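The step-by-step gains can be tabulated from the R-squared values reported in the model summaries. The sketch below hard-codes those values; it also recomputes the adjusted R-squared of the final model from the standard formula as a consistency check.

```r
# Multiple R-squared values copied from the model summaries above
r2 <- c(regression_1 = 0.4068, regression_2 = 0.4270, regression_3 = 0.4498,
        regression_4 = 0.4657, regression_5 = 0.4734, regression_final = 0.4948)

# Gain in R-squared from each added predictor, in order of entry
added <- c("pctpell", "grad_rate", "retention_rate", "avg_net_price", "instr_expend")
gain  <- setNames(diff(r2), added)
print(round(gain, 4))

# Adjusted R-squared for the final model: n institutions, p predictors
n <- 1326
p <- 6
adj_r2_final <- 1 - (1 - r2[["regression_final"]]) * (n - 1) / (n - p - 1)
print(round(adj_r2_final, 4))
```

The gains are 0.0202, 0.0228, 0.0159, 0.0077, and 0.0214, and the recomputed adjusted R-squared of about 0.4925 agrees with the summary output.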
Now repeat the diagnostic checks for your final model. This is critical — a model is only trustworthy if its assumptions are reasonably met.
hist(scale(regression_final$residuals),
     main = "Residuals - Final Model",
     xlab = "Standardized Residuals",
     col = "lightblue",
     breaks = 20)
par(mfrow = c(2, 2))
plot(regression_final)
par(mfrow = c(1, 1))
regression_final$standardized.residuals <- rstandard(regression_final)
regression_final$large_residual <- abs(regression_final$standardized.residuals) > 2
# How many potential outliers in the final model?
sum(regression_final$large_residual)
## [1] 56
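Counting flagged observations is a first step; looking at which rows they are is usually more informative. The sketch below is one possible approach on simulated data with a planted outlier. It stores the diagnostics in the data frame rather than in the model object, which is a design choice for the example and not the lab's own convention; `toy` and its columns are hypothetical.

```r
set.seed(11)
n <- 100
toy <- data.frame(id = seq_len(n), x = rnorm(n))
toy$y <- 3 + 2 * toy$x + rnorm(n)
toy$y[10] <- toy$y[10] + 15     # plant one obvious outlier in row 10

fit <- lm(y ~ x, data = toy)

# Keep the diagnostics next to the data so rows can be inspected directly
toy$std_resid   <- rstandard(fit)
toy$large_resid <- abs(toy$std_resid) > 2

# List flagged rows, largest absolute residual first
flagged <- toy[toy$large_resid, c("id", "x", "y", "std_resid")]
flagged <- flagged[order(-abs(flagged$std_resid)), ]
print(flagged)
```

With the scorecard data, the same idea would let you see institution names next to their standardized residuals, assuming no rows were dropped for missing values when the model was fit.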
Standardized coefficients tell you which predictors have the largest effect in standard deviation units. Replace the variable names below with your actual variables.
# TODO: Replace with YOUR variables from the final model
lm(scale(earn10yr) ~ scale(sat_avg) + scale(pctpell) + scale(grad_rate) +
     scale(retention_rate) + scale(avg_net_price) + scale(instr_expend),
   data = scorecard)
##
## Call:
## lm(formula = scale(earn10yr) ~ scale(sat_avg) + scale(pctpell) +
## scale(grad_rate) + scale(retention_rate) + scale(avg_net_price) +
## scale(instr_expend), data = scorecard)
##
## Coefficients:
## (Intercept) scale(sat_avg) scale(pctpell)
## 0.000000000000000586 0.094811280558475930 -0.151123894332738984
## scale(grad_rate) scale(retention_rate) scale(avg_net_price)
## 0.091239088130648238 0.246742178438977527 0.102414618198945331
## scale(instr_expend)
## 0.194454116456967752
# EXAMPLE ONLY: matches the example final model
lm(scale(earn10yr) ~ scale(avg_facsalary) + scale(ft_faculty) + scale(instr_expend) +
     scale(enrollment) + scale(pctpell),
   data = scorecard)
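To compare predictors, the standardized coefficients reported for the final model can be ranked by absolute size. The sketch below hard-codes those values, rounded to four decimals.

```r
# Standardized coefficients from the final-model output (intercept omitted)
beta <- c(sat_avg = 0.0948, pctpell = -0.1511, grad_rate = 0.0912,
          retention_rate = 0.2467, avg_net_price = 0.1024, instr_expend = 0.1945)

# Rank by absolute size; the sign gives direction, not importance
ranked <- beta[order(abs(beta), decreasing = TRUE)]
print(ranked)
```

Retention rate and instructional expenditure come out on top, while sat_avg and grad_rate have the smallest standardized effects once the other predictors are in the model.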
VIF checks for multicollinearity. As a rule of thumb:
vif(regression_final)
## sat_avg pctpell grad_rate retention_rate avg_net_price
## 5.030848 2.564664 4.785257 3.081387 1.408157
## instr_expend
## 1.764496
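A VIF has a direct interpretation: for predictor j it equals 1 / (1 − R²), where R² comes from regressing predictor j on all the other predictors. The sketch below computes one VIF by hand on simulated data using base R only (it does not call vif()), and then inverts the formula for the sat_avg value reported above. The data frame `toy` and its columns are hypothetical.

```r
set.seed(5)
n <- 400
common <- rnorm(n)
toy <- data.frame(
  x1 = common + rnorm(n, sd = 0.5),   # x1 and x2 share a common driver
  x2 = common + rnorm(n, sd = 0.5),
  x3 = rnorm(n)                       # unrelated predictor
)

# VIF for x1: regress it on the other predictors, then 1 / (1 - R^2)
r2_x1  <- summary(lm(x1 ~ x2 + x3, data = toy))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Going the other way with the VIF reported above for sat_avg
vif_sat <- 5.030848
r2_sat  <- 1 - 1 / vif_sat

print(round(c(vif_x1 = vif_x1, r2_sat = r2_sat), 4))
```

A VIF of 5.03 therefore means that about 80% of the variation in sat_avg is shared with the other five predictors in the final model.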
dwt(regression_final)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1045195 1.790548 0
## Alternative hypothesis: rho != 0
Question: Discuss the following for your final model:
Multicollinearity does not appear to be a major issue in the final model. The VIF values are sat_avg = 5.03, pctpell = 2.56, grad_rate = 4.79, retention_rate = 3.08, avg_net_price = 1.41, and instr_expend = 1.76. The highest value is for sat_avg at just over 5, which suggests some overlap with other predictors. Looking back at the correlations, this likely reflects the fact that more selective institutions tend to have higher graduation and retention rates. However, none of the values approach 10, so multicollinearity does not appear severe.
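The Durbin-Watson statistic is 1.79 with a reported p-value of 0 (that is, too small to display, not literally zero), which suggests some positive autocorrelation in the residuals. The statistic is closer to 2 than in the bivariate model (1.57), so the added predictors appear to absorb part of the pattern. Since the dataset consists of different universities rather than observations over time, this result is less concerning than it would be in a time series setting.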
The residual diagnostics suggest that the normality assumption is mostly satisfied. The histogram of standardized residuals is roughly centered around zero and has a generally bell shaped distribution, though there is a longer tail on the upper end. The Q-Q plot follows the diagonal line for most observations but bends slightly in the upper tail, indicating a few schools where earnings are higher than the model predicts.
The Scale-Location plot shows that the spread of the residuals increases slightly at higher fitted values. This suggests some mild heteroscedasticity, but the pattern is not extreme.
Overall, the model appears reasonably trustworthy. The assumptions are mostly satisfied, and the model explains about 49.5% of the variation in median earnings across institutions. While there are clearly other factors affecting earnings that are not captured here, the model provides a useful picture of how selectivity, socioeconomic composition, completion rates, and institutional resources relate to earnings outcomes.
Take a step back and summarize what you found.
9a. The final model shows that institutional characteristics related to student outcomes and institutional investment are strongly associated with earnings ten years after entry. Schools with higher retention rates and higher instructional spending tend to produce higher median earnings, while institutions enrolling larger shares of Pell Grant recipients tend to have lower earnings outcomes even after accounting for the other variables in the model. Average SAT scores, graduation rates, and net price also remain positively related to earnings, though their effects are smaller once institutional performance measures are included. The model explains about 49.5% of the variation in earnings across institutions, which suggests these institutional factors capture a substantial portion of the differences in long term earnings outcomes.
9b. Were your initial hypotheses (from Section 2) supported? What surprised you?
The results mostly supported the initial hypotheses. Institutions with higher SAT scores, higher graduation rates, and higher retention rates were all associated with higher earnings outcomes, while institutions with larger shares of Pell Grant recipients tended to have lower median earnings. However, one result differed from the initial expectations. SAT scores were the strongest predictor in the bivariate model, but once other institutional variables were added, retention rate and instructional expenditure became stronger predictors. This suggests that institutional performance and investment in instruction may play an important role in shaping earnings outcomes beyond the academic preparation of incoming students.
9c. What are the limitations of this analysis? (Think about: causation vs. correlation, omitted variables, generalizability, etc.)
This analysis has several limitations. The model identifies associations between institutional characteristics and post graduation earnings, but it does not establish causal mechanisms. Students are sorted into institutions through long running processes that the dataset does not measure, so coefficients on variables such as SAT averages, retention, and instructional spending can reflect selection as much as institutional impact. The dataset also does not measure conditions that shape earnings trajectories before college, including childhood poverty, housing instability, food insecurity, domestic violence, and unequal access to primary and secondary education. Because these conditions are missing, the regression explains variation in earnings using institutional indicators while leaving the production of disadvantage largely outside the model.
The negative relationship between pctpell and earn10yr is a clear example. The result suggests that institutions serving larger shares of low income students sit within resource constraints and labor market structures that shape long term outcomes. This dataset is still useful because it flags where the earnings gap is most strongly patterned, and it motivates deeper study of how programs like Pell interact with institutional funding, student support, and local opportunity structures.
author field at the top
of this document