Section 0: Setup and Knitting Instructions

Welcome to Lab 3! In this assignment you will use linear regression to explore which institutional characteristics predict post-graduation earnings using the College Scorecard dataset.

What is “knitting”? When you are finished, you will “knit” this document to HTML. Click the Knit button at the top of RStudio (or press Ctrl+Shift+K). This will run all your code and produce an HTML report. Submit both this .Rmd file AND the knitted HTML file.

Important: Make sure all your code runs without errors before knitting. If a chunk has an error, knitting will fail. A good practice is to use Run All (Ctrl+Alt+R) before knitting to check that everything works.

Section 1: Load and Inspect the Data

First, let’s load the College Scorecard dataset. This file contains data on 1,326 institutions with variables covering admissions, enrollment, finances, and outcomes.

scorecard <- read.csv("collegescorecard.csv")

# View(scorecard)
# NOTE: View() opens an interactive table in RStudio but does NOT work
# when knitting. Keep it commented out. You can run it manually in
# your console to browse the data.

str(scorecard)

## 'data.frame':    1326 obs. of  19 variables:
##  $ unitid         : int  100654 100663 100706 100724 100751 100830 100858 100937 101435 101480 ...
##  $ instname       : chr  "Alabama A & M University" "University of Alabama at Birmingham" "University of Alabama in Huntsville" "Alabama State University" ...
##  $ city           : chr  "Normal" "Birmingham" "Huntsville" "Montgomery" ...
##  $ state          : chr  "AL" "AL" "AL" "AL" ...
##  $ control        : int  1 1 1 1 1 1 1 2 2 1 ...
##  $ adm_rate       : num  89.9 86.7 80.6 51.2 56.5 ...
##  $ sat_avg        : int  823 1146 1180 830 1171 970 1215 1177 999 1036 ...
##  $ enrollment     : int  4051 11200 5525 5354 28692 4322 19761 1181 1100 7195 ...
##  $ pctpell        : num  71.2 35 32.8 82.7 21.1 ...
##  $ pctfloan       : num  82 54 47.3 87.3 41.5 ...
##  $ avg_net_price  : num  13415 14805 17520 11936 20916 ...
##  $ grad_rate      : num  29.1 53.8 48.4 25.2 66.7 ...
##  $ retention_rate : num  63.1 80.2 81 62.2 87 ...
##  $ earn10yr       : int  31400 40300 46600 27800 42400 34800 45400 41900 37100 35100 ...
##  $ pct_edu_program: num  14.9 8.62 1.73 21.5 8.4 ...
##  $ avg_facsalary  : num  7079 10170 9341 6557 9605 ...
##  $ ft_faculty     : num  88.6 91.1 65.5 66.4 71.1 ...
##  $ instr_expend   : num  7459 17208 9352 7393 9817 ...
##  $ pell_cat       : int  3 2 1 3 1 2 1 1 2 2 ...

head(scorecard)

##   unitid                            instname       city state control adm_rate
## 1 100654            Alabama A & M University     Normal    AL       1    89.89
## 2 100663 University of Alabama at Birmingham Birmingham    AL       1    86.73
## 3 100706 University of Alabama in Huntsville Huntsville    AL       1    80.62
## 4 100724            Alabama State University Montgomery    AL       1    51.25
## 5 100751           The University of Alabama Tuscaloosa    AL       1    56.55
## 6 100830     Auburn University at Montgomery Montgomery    AL       1    83.71
##   sat_avg enrollment pctpell pctfloan avg_net_price grad_rate retention_rate
## 1     823       4051   71.15    82.04         13415     29.14          63.14
## 2    1146      11200   35.05    53.97         14805     53.77          80.16
## 3    1180       5525   32.81    47.28         17520     48.35          80.98
## 4     830       5354   82.65    87.35         11936     25.17          62.19
## 5    1171      28692   21.07    41.48         20916     66.65          87.00
## 6     970       4322   40.06    64.76         11915     27.05          63.21
##   earn10yr pct_edu_program avg_facsalary ft_faculty instr_expend pell_cat
## 1    31400           14.90          7079      88.56         7459        3
## 2    40300            8.62         10170      91.06        17208        2
## 3    46600            1.73          9341      65.55         9352        1
## 4    27800           21.50          6557      66.41         7393        3
## 5    42400            8.40          9605      71.09         9817        1
## 6    34800           14.06          7173      92.62         6817        2

Question: How many observations and variables do we have? What types of variables do you see (numeric, character, factor)? Take a moment to familiarize yourself with the column names.

We have 1,326 observations and 19 variables. The variables are a mix of integer, numeric, and character types; none are stored as factors. Six variables are stored as integers: unitid, control, sat_avg, enrollment, earn10yr, and pell_cat. Ten are stored as numeric (double) values: adm_rate, pctpell, pctfloan, avg_net_price, grad_rate, retention_rate, pct_edu_program, avg_facsalary, ft_faculty, and instr_expend. Three are stored as character strings: instname, city, and state. To use a variable such as state in a regression model, it would typically be converted to a factor so that it enters the model as a categorical predictor. The same applies to control and pell_cat, which are stored as integers but are really category codes. The column names indicate that the dataset covers institutional characteristics, student demographics, and financial variables that may be related to earnings outcomes.
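A minimal sketch of the factor conversion mentioned above, using a few columns from the six rows printed by head() rather than the full file. The data frame name toy and the label text are assumptions made for illustration; the codes 1 = public and 2 = private nonprofit come from the variable table in Section 2.

```r
# Illustrative subset: values copied from the head(scorecard) output
toy <- data.frame(
  state   = c("AL", "AL", "AL", "AL", "AL", "AL"),
  control = c(1L, 1L, 1L, 1L, 1L, 1L)
)

# Character -> factor, so a model would treat it as categorical
toy$state_f <- factor(toy$state)

# Integer codes -> labelled factor (label text is an assumption)
toy$control_f <- factor(toy$control,
                        levels = c(1, 2),
                        labels = c("public", "private nonprofit"))

str(toy)
```

Declaring both levels explicitly keeps "private nonprofit" as a level even when, as in these six rows, only public institutions are present.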

Section 2: Research Question

Background

Our dependent variable (outcome) is earn10yr — the median earnings of students 10 years after entering the institution. Your task is to identify which institutional characteristics best predict post-graduation earnings.

Here is a reminder of the available variables you can use as independent variables (predictors):

Variable	Description
`unitid`	Unique institution identifier
`instname`	Institution name
`city`	City
`state`	State
`control`	1 = public, 2 = private nonprofit
`adm_rate`	Admission rate (0-100)
`sat_avg`	Average SAT score (~850-1540)
`enrollment`	Total student enrollment
`pctpell`	Percent Pell grant recipients (0-100, proxy for socioeconomic status)
`pctfloan`	Percent receiving federal loans (0-100)
`avg_net_price`	Average net price in dollars
`grad_rate`	6-year graduation rate (0-100) — an interesting predictor to consider!
`retention_rate`	First-year retention rate (0-100)
`earn10yr`	Median earnings 10 years after entry (DV)
`pct_edu_program`	Percent education programs
`avg_facsalary`	Average faculty monthly salary
`ft_faculty`	Percent full-time faculty (0-100)
`instr_expend`	Instructional expenditure per FTE student
`pell_cat`	1 = low Pell (<=33%), 2 = medium (33-66%), 3 = high (>66%)
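The pell_cat coding in the last row of the table can be reproduced from pctpell with cut(). This is only a sketch: the exact cut points (33 and 66, with the upper bound of each bin included) are an assumption read off the table, and the six pctpell values are the ones printed by head() in Section 1.

```r
# pctpell for the six institutions shown by head(scorecard)
pctpell <- c(71.15, 35.05, 32.81, 82.65, 21.07, 40.06)

# Assumed bins: (-Inf, 33], (33, 66], (66, Inf)
# labels = FALSE returns the bin number (1, 2, 3) instead of a factor
pell_cat_check <- cut(pctpell, breaks = c(-Inf, 33, 66, Inf), labels = FALSE)

pell_cat_check
# 3 2 1 3 1 2
```

These codes agree with the pell_cat column printed for the same six rows.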

Your Task

2a. To what extent are differences in median earnings ten years after college entry associated with institutional selectivity, student socioeconomic composition, and institutional outcomes? In particular, do institutions that enroll larger proportions of Pell grant recipients have different earnings outcomes after accounting for selectivity, graduation rates, retention rates, and net price?

2b. List 5-6 candidate independent variables from the table above. For each one, briefly explain why you think it might be related to post-graduation earnings.

pctpell: This variable represents the proportion of students from lower-income backgrounds at an institution. Colleges that enroll higher percentages of Pell grant recipients may have a student body with fewer financial resources and professional networks, which may be associated with lower median earnings after graduation.
sat_avg: Institutions with higher average SAT scores tend to admit students with stronger academic preparation. Students entering college with higher levels of preparation may be more likely to enter competitive occupations or graduate programs that lead to higher long-term earnings.
adm_rate: Admission rate reflects institutional selectivity. Institutions with lower admission rates are generally more selective and may provide students with stronger academic environments, reputational advantages, and professional networks that can influence earnings outcomes.
grad_rate: Graduation rate measures the proportion of students who successfully complete their degree programs. Institutions with higher graduation rates may provide stronger academic support systems and better pathways to degree completion, which may improve long-term earnings outcomes.
instr_expend: Instructional expenditure per student reflects the level of institutional investment in teaching and academic programs. Institutions that spend more on instruction may provide greater academic resources and educational quality, which may influence graduates’ earnings outcomes.
avg_net_price: Net price represents the average cost students pay after financial aid. Institutions with higher net prices may have greater financial resources, stronger reputations, or more selective admissions processes, which may influence graduates’ earnings opportunities.

2c. Rank your chosen variables in order of expected importance (strongest predictor first). This is your hypothesis — we will check it against the data later!

sat_avg
pctpell
grad_rate
adm_rate
instr_expend
avg_net_price

Section 3: Data Cleaning and Checks

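Before we run any regressions, we need to make sure the data are clean. Let’s use the describe() function to get a detailed summary of every variable. Note that the output below is in the format produced by describe() from the Hmisc package. The psych package also exports a function named describe(), so if both packages are loaded, call Hmisc::describe() or psych::describe() explicitly to choose one.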

Descriptive Statistics

describe(scorecard)

## scorecard 
## 
##  19  Variables      1326  Observations
## --------------------------------------------------------------------------------
## unitid 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1326        1   182728   182415    46374   110637 
##      .10      .25      .50      .75      .90      .95 
##   128248   152539   183066   214650   230786   236995 
## 
## lowest : 100654 100663 100706 100724 100751, highest: 436818 436827 436836 442356 448840
## --------------------------------------------------------------------------------
## instname 
##        n  missing distinct 
##     1326        0     1315 
## 
## lowest : Abilene Christian University         Abraham Baldwin Agricultural College Adams State University               Adelphi University                   Adrian College                      
## highest: Yale University                      Yeshiva University                   York College                         York College Pennsylvania            Young Harris College                
## --------------------------------------------------------------------------------
## city 
##        n  missing distinct 
##     1326        0      867 
## 
## lowest : Aberdeen     Abilene      Abington     Ada          Adrian      
## highest: Worcester    Yankton      York         Young Harris Ypsilanti   
## --------------------------------------------------------------------------------
## state 
##        n  missing distinct 
##     1326        0       52 
## 
## lowest : AK AL AR AZ CA, highest: VT WA WI WV WY
## --------------------------------------------------------------------------------
## control 
##        n  missing distinct     Info     Mean 
##     1326        0        2    0.706    1.621 
##                       
## Value          1     2
## Frequency    502   824
## Proportion 0.379 0.621
## --------------------------------------------------------------------------------
## adm_rate 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1212        1    64.18    65.24    20.77    28.76 
##      .10      .25      .50      .75      .90      .95 
##    39.58    53.81    66.16    76.95    86.68    92.10 
## 
## lowest : 5.69  5.84  7.05  7.41  7.42 , highest: 99.75 99.77 99.81 99.87 100  
## --------------------------------------------------------------------------------
## sat_avg 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0      466        1     1061     1050    142.2    878.0 
##      .10      .25      .50      .75      .90      .95 
##    920.5    976.0   1043.0   1122.0   1235.0   1322.8 
## 
## lowest :  666  716  723  749  750, highest: 1497 1501 1503 1504 1534
## --------------------------------------------------------------------------------
## enrollment 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1234        1     5579     3686     6450      568 
##      .10      .25      .50      .75      .90      .95 
##      794     1348     2512     6499    15735    21734 
## 
## lowest :   126   158   160   161   195, highest: 39460 43139 43931 47079 50919
## --------------------------------------------------------------------------------
## pctpell 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1177        1    36.49    35.81    15.86    14.90 
##      .10      .25      .50      .75      .90      .95 
##    18.61    26.21    35.58    44.89    54.17    61.68 
## 
## lowest : 6.16  7.84  9.11  9.58  9.83 , highest: 84.98 87.58 92.46 92.59 94.51
## --------------------------------------------------------------------------------
## pctfloan 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1177        1    60.23    61.05    18.43    31.84 
##      .10      .25      .50      .75      .90      .95 
##    38.61    49.71    61.89    72.03    79.75    84.38 
## 
## lowest : 0     2.9   3.24  4.82  8.62 , highest: 93.34 93.69 93.95 94.57 100  
## --------------------------------------------------------------------------------
## avg_net_price 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1284        1    18848    18526     7336     9129 
##      .10      .25      .50      .75      .90      .95 
##    11096    14236    18118    22790    28120    31168 
## 
## lowest :  1776  2035  2452  3698  4299, highest: 37971 38225 38809 40306 41414
## --------------------------------------------------------------------------------
## grad_rate 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1184        1    54.75     54.3    19.71    28.43 
##      .10      .25      .50      .75      .90      .95 
##    33.48    42.49    53.70    66.27    79.38    86.47 
## 
## lowest : 4.84  4.9   5.36  8.38  10.17, highest: 95.76 95.78 96.94 97.47 97.79
## --------------------------------------------------------------------------------
## retention_rate 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1111        1    75.75    75.98    12.84    57.08 
##      .10      .25      .50      .75      .90      .95 
##    61.11    68.37    75.98    83.62    90.72    94.15 
## 
## lowest : 0     21.82 33.13 42.5  43.33, highest: 98.61 98.67 99.34 99.49 100  
## --------------------------------------------------------------------------------
## earn10yr 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0      378        1    42939    41900    10496    30200 
##      .10      .25      .50      .75      .90      .95 
##    32600    36500    41400    47475    54800    60875 
## 
## lowest :  17600  22400  22900  24600  24700, highest:  85800  87200  91600 110600 116400
## --------------------------------------------------------------------------------
## pct_edu_program 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0      786    0.989    7.147    6.485    7.562     0.00 
##      .10      .25      .50      .75      .90      .95 
##     0.00     0.53     5.57    11.23    16.48    20.07 
## 
## lowest : 0     0.02  0.03  0.07  0.08 , highest: 38.33 39.68 50    58.51 64.97
## --------------------------------------------------------------------------------
## avg_facsalary 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1187        1     7526     7336     2268     4741 
##      .10      .25      .50      .75      .90      .95 
##     5252     6112     7199     8509    10216    11548 
## 
## lowest :  1476  2660  2709  2978  2985, highest: 16042 16120 16589 17861 19862
## --------------------------------------------------------------------------------
## ft_faculty 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1000    0.995    70.82    71.52    26.58    29.31 
##      .10      .25      .50      .75      .90      .95 
##    38.34    53.15    71.93    94.44   100.00   100.00 
## 
## lowest : 4.03  14.23 15.71 16    16.24, highest: 99.49 99.52 99.69 99.72 100  
## --------------------------------------------------------------------------------
## instr_expend 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1253        1    10135     8668     5976     4655 
##      .10      .25      .50      .75      .90      .95 
##     5260     6514     8243    10790    15112    20956 
## 
## lowest :   1889   2036   2167   2238   2480, highest:  78147  88200  92854  97613 106214
## --------------------------------------------------------------------------------
## pell_cat 
##        n  missing distinct     Info     Mean  pMedian      Gmd 
##     1326        0        3     0.76    1.619      1.5   0.5498 
##                             
## Value          1     2     3
## Frequency    549   733    44
## Proportion 0.414 0.553 0.033
## --------------------------------------------------------------------------------

Check for Missing Values

colSums(is.na(scorecard))

##          unitid        instname            city           state         control 
##               0               0               0               0               0 
##        adm_rate         sat_avg      enrollment         pctpell        pctfloan 
##               0               0               0               0               0 
##   avg_net_price       grad_rate  retention_rate        earn10yr pct_edu_program 
##               0               0               0               0               0 
##   avg_facsalary      ft_faculty    instr_expend        pell_cat 
##               0               0               0               0

If there are missing values, we will remove them using na.omit(). This is a simple approach — in a more advanced course you might learn about imputation methods instead.
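To see what na.omit() does when missing values are present, here is a small sketch on made-up data. The toy data frame and its values are hypothetical and are not taken from the College Scorecard file, which has no missing values.

```r
# Hypothetical toy data with one missing SAT score
toy <- data.frame(
  unitid   = c(1L, 2L, 3L),
  sat_avg  = c(1010L, NA, 1180L),
  earn10yr = c(38000L, 41000L, 46600L)
)

n_before  <- nrow(toy)
toy_clean <- na.omit(toy)               # drops every row with at least one NA
n_dropped <- n_before - nrow(toy_clean)

n_dropped                               # number of rows removed
```

Comparing the row count before and after is a quick way to report how many observations listwise deletion cost you.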

scorecard <- na.omit(scorecard)

# Check descriptives again after removing missing data
describe(scorecard)

## scorecard 
## 
##  19  Variables      1326  Observations
## --------------------------------------------------------------------------------
## unitid 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1326        1   182728   182415    46374   110637 
##      .10      .25      .50      .75      .90      .95 
##   128248   152539   183066   214650   230786   236995 
## 
## lowest : 100654 100663 100706 100724 100751, highest: 436818 436827 436836 442356 448840
## --------------------------------------------------------------------------------
## instname 
##        n  missing distinct 
##     1326        0     1315 
## 
## lowest : Abilene Christian University         Abraham Baldwin Agricultural College Adams State University               Adelphi University                   Adrian College                      
## highest: Yale University                      Yeshiva University                   York College                         York College Pennsylvania            Young Harris College                
## --------------------------------------------------------------------------------
## city 
##        n  missing distinct 
##     1326        0      867 
## 
## lowest : Aberdeen     Abilene      Abington     Ada          Adrian      
## highest: Worcester    Yankton      York         Young Harris Ypsilanti   
## --------------------------------------------------------------------------------
## state 
##        n  missing distinct 
##     1326        0       52 
## 
## lowest : AK AL AR AZ CA, highest: VT WA WI WV WY
## --------------------------------------------------------------------------------
## control 
##        n  missing distinct     Info     Mean 
##     1326        0        2    0.706    1.621 
##                       
## Value          1     2
## Frequency    502   824
## Proportion 0.379 0.621
## --------------------------------------------------------------------------------
## adm_rate 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1212        1    64.18    65.24    20.77    28.76 
##      .10      .25      .50      .75      .90      .95 
##    39.58    53.81    66.16    76.95    86.68    92.10 
## 
## lowest : 5.69  5.84  7.05  7.41  7.42 , highest: 99.75 99.77 99.81 99.87 100  
## --------------------------------------------------------------------------------
## sat_avg 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0      466        1     1061     1050    142.2    878.0 
##      .10      .25      .50      .75      .90      .95 
##    920.5    976.0   1043.0   1122.0   1235.0   1322.8 
## 
## lowest :  666  716  723  749  750, highest: 1497 1501 1503 1504 1534
## --------------------------------------------------------------------------------
## enrollment 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1234        1     5579     3686     6450      568 
##      .10      .25      .50      .75      .90      .95 
##      794     1348     2512     6499    15735    21734 
## 
## lowest :   126   158   160   161   195, highest: 39460 43139 43931 47079 50919
## --------------------------------------------------------------------------------
## pctpell 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1177        1    36.49    35.81    15.86    14.90 
##      .10      .25      .50      .75      .90      .95 
##    18.61    26.21    35.58    44.89    54.17    61.68 
## 
## lowest : 6.16  7.84  9.11  9.58  9.83 , highest: 84.98 87.58 92.46 92.59 94.51
## --------------------------------------------------------------------------------
## pctfloan 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1177        1    60.23    61.05    18.43    31.84 
##      .10      .25      .50      .75      .90      .95 
##    38.61    49.71    61.89    72.03    79.75    84.38 
## 
## lowest : 0     2.9   3.24  4.82  8.62 , highest: 93.34 93.69 93.95 94.57 100  
## --------------------------------------------------------------------------------
## avg_net_price 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1284        1    18848    18526     7336     9129 
##      .10      .25      .50      .75      .90      .95 
##    11096    14236    18118    22790    28120    31168 
## 
## lowest :  1776  2035  2452  3698  4299, highest: 37971 38225 38809 40306 41414
## --------------------------------------------------------------------------------
## grad_rate 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1184        1    54.75     54.3    19.71    28.43 
##      .10      .25      .50      .75      .90      .95 
##    33.48    42.49    53.70    66.27    79.38    86.47 
## 
## lowest : 4.84  4.9   5.36  8.38  10.17, highest: 95.76 95.78 96.94 97.47 97.79
## --------------------------------------------------------------------------------
## retention_rate 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1111        1    75.75    75.98    12.84    57.08 
##      .10      .25      .50      .75      .90      .95 
##    61.11    68.37    75.98    83.62    90.72    94.15 
## 
## lowest : 0     21.82 33.13 42.5  43.33, highest: 98.61 98.67 99.34 99.49 100  
## --------------------------------------------------------------------------------
## earn10yr 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0      378        1    42939    41900    10496    30200 
##      .10      .25      .50      .75      .90      .95 
##    32600    36500    41400    47475    54800    60875 
## 
## lowest :  17600  22400  22900  24600  24700, highest:  85800  87200  91600 110600 116400
## --------------------------------------------------------------------------------
## pct_edu_program 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0      786    0.989    7.147    6.485    7.562     0.00 
##      .10      .25      .50      .75      .90      .95 
##     0.00     0.53     5.57    11.23    16.48    20.07 
## 
## lowest : 0     0.02  0.03  0.07  0.08 , highest: 38.33 39.68 50    58.51 64.97
## --------------------------------------------------------------------------------
## avg_facsalary 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1187        1     7526     7336     2268     4741 
##      .10      .25      .50      .75      .90      .95 
##     5252     6112     7199     8509    10216    11548 
## 
## lowest :  1476  2660  2709  2978  2985, highest: 16042 16120 16589 17861 19862
## --------------------------------------------------------------------------------
## ft_faculty 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1000    0.995    70.82    71.52    26.58    29.31 
##      .10      .25      .50      .75      .90      .95 
##    38.34    53.15    71.93    94.44   100.00   100.00 
## 
## lowest : 4.03  14.23 15.71 16    16.24, highest: 99.49 99.52 99.69 99.72 100  
## --------------------------------------------------------------------------------
## instr_expend 
##        n  missing distinct     Info     Mean  pMedian      Gmd      .05 
##     1326        0     1253        1    10135     8668     5976     4655 
##      .10      .25      .50      .75      .90      .95 
##     5260     6514     8243    10790    15112    20956 
## 
## lowest :   1889   2036   2167   2238   2480, highest:  78147  88200  92854  97613 106214
## --------------------------------------------------------------------------------
## pell_cat 
##        n  missing distinct     Info     Mean  pMedian      Gmd 
##     1326        0        3     0.76    1.619      1.5   0.5498 
##                             
## Value          1     2     3
## Frequency    549   733    44
## Proportion 0.414 0.553 0.033
## --------------------------------------------------------------------------------

Check for Duplicates

Each institution should appear only once, identified by unitid. Let’s verify there are no duplicates.

scorecard$dup_unitid <- duplicated(scorecard$unitid)
table(scorecard$dup_unitid)

## 
## FALSE 
##  1326
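A duplicate check that does not add a helper column to the data frame is anyDuplicated(), which returns 0 when a vector has no repeated values. The sketch below uses the six unitid values printed by head() in Section 1; the vector ids_dup, with one ID deliberately repeated, is hypothetical and exists only to show what a positive result looks like.

```r
# unitid values from the head(scorecard) output
ids_ok  <- c(100654L, 100663L, 100706L, 100724L, 100751L, 100830L)

# Hypothetical vector with one ID repeated on purpose
ids_dup <- c(ids_ok, 100663L)

anyDuplicated(ids_ok)     # 0: no duplicates
anyDuplicated(ids_dup)    # 7: position of the first repeated value
sum(duplicated(ids_dup))  # 1: how many entries are repeats
```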

Summary Statistics

summary(scorecard)

##      unitid         instname             city              state          
##  Min.   :100654   Length:1326        Length:1326        Length:1326       
##  1st Qu.:152539   Class :character   Class :character   Class :character  
##  Median :183066   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :182728                                                           
##  3rd Qu.:214650                                                           
##  Max.   :448840                                                           
##     control         adm_rate         sat_avg       enrollment   
##  Min.   :1.000   Min.   :  5.69   Min.   : 666   Min.   :  126  
##  1st Qu.:1.000   1st Qu.: 53.81   1st Qu.: 976   1st Qu.: 1348  
##  Median :2.000   Median : 66.16   Median :1043   Median : 2512  
##  Mean   :1.621   Mean   : 64.18   Mean   :1061   Mean   : 5579  
##  3rd Qu.:2.000   3rd Qu.: 76.95   3rd Qu.:1122   3rd Qu.: 6499  
##  Max.   :2.000   Max.   :100.00   Max.   :1534   Max.   :50919  
##     pctpell         pctfloan      avg_net_price     grad_rate    
##  Min.   : 6.16   Min.   :  0.00   Min.   : 1776   Min.   : 4.84  
##  1st Qu.:26.21   1st Qu.: 49.71   1st Qu.:14236   1st Qu.:42.49  
##  Median :35.58   Median : 61.90   Median :18118   Median :53.70  
##  Mean   :36.49   Mean   : 60.23   Mean   :18848   Mean   :54.75  
##  3rd Qu.:44.89   3rd Qu.: 72.03   3rd Qu.:22790   3rd Qu.:66.27  
##  Max.   :94.51   Max.   :100.00   Max.   :41414   Max.   :97.79  
##  retention_rate      earn10yr      pct_edu_program  avg_facsalary  
##  Min.   :  0.00   Min.   : 17600   Min.   : 0.000   Min.   : 1476  
##  1st Qu.: 68.37   1st Qu.: 36500   1st Qu.: 0.530   1st Qu.: 6112  
##  Median : 75.98   Median : 41400   Median : 5.570   Median : 7199  
##  Mean   : 75.75   Mean   : 42939   Mean   : 7.147   Mean   : 7526  
##  3rd Qu.: 83.61   3rd Qu.: 47475   3rd Qu.:11.227   3rd Qu.: 8509  
##  Max.   :100.00   Max.   :116400   Max.   :64.970   Max.   :19862  
##    ft_faculty      instr_expend       pell_cat     dup_unitid     
##  Min.   :  4.03   Min.   :  1889   Min.   :1.000   Mode :logical  
##  1st Qu.: 53.15   1st Qu.:  6514   1st Qu.:1.000   FALSE:1326     
##  Median : 71.93   Median :  8243   Median :2.000                  
##  Mean   : 70.82   Mean   : 10135   Mean   :1.619                  
##  3rd Qu.: 94.44   3rd Qu.: 10790   3rd Qu.:2.000                  
##  Max.   :100.00   Max.   :106214   Max.   :3.000

Question: Are there any missing value or out-of-range data problems? How many observations remain after removing missing data? Did you find any duplicates?

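There were no missing values in the dataset, so all 1,326 observations remain after na.omit(). I also checked for duplicate institutions using the unitid variable and found none, meaning each institution appears only once in the data. Every percentage variable lies between 0 and 100, so there are no impossible values. A few values are worth a second look, though: retention_rate and pctfloan both have a minimum of 0, and the minimum sat_avg (666) is below the approximate 850-1540 range given in the variable table.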
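Out-of-range values can also be checked programmatically rather than by reading the summary output. The sketch below counts values outside 0-100 in four of the percentage columns, using the six rows printed by head() in Section 1; the names toy and out_of_range are assumptions made for illustration.

```r
# Values copied from the head(scorecard) output
toy <- data.frame(
  adm_rate       = c(89.89, 86.73, 80.62, 51.25, 56.55, 83.71),
  pctpell        = c(71.15, 35.05, 32.81, 82.65, 21.07, 40.06),
  grad_rate      = c(29.14, 53.77, 48.35, 25.17, 66.65, 27.05),
  retention_rate = c(63.14, 80.16, 80.98, 62.19, 87.00, 63.21)
)

# For each column, count values below 0 or above 100
out_of_range <- sapply(toy, function(x) sum(x < 0 | x > 100))
out_of_range
```

A count of 0 in every column means no percentage falls outside its documented range. The same call would work on the corresponding columns of the full data frame.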

Initial Scatterplots

Let’s create some quick scatterplots to get a visual sense of how key variables relate to post-graduation earnings. Notice that earn10yr is on the x-axis here — we are just exploring patterns.

plot(scorecard$earn10yr, scorecard$pctpell,
     main = "Median Earnings vs. Pct Pell",
     xlab = "Median Earnings 10 Years After Entry", ylab = "Percent Pell Grant Recipients")

plot(scorecard$earn10yr, scorecard$sat_avg,
     main = "Median Earnings vs. SAT Average",
     xlab = "Median Earnings 10 Years After Entry", ylab = "SAT Average")

plot(scorecard$earn10yr, scorecard$avg_facsalary,
     main = "Median Earnings vs. Avg Faculty Salary",
     xlab = "Median Earnings 10 Years After Entry", ylab = "Average Faculty Salary")

plot(scorecard$earn10yr, scorecard$ft_faculty,
     main = "Median Earnings vs. Full-Time Faculty",
     xlab = "Median Earnings 10 Years After Entry", ylab = "Percent Full-Time Faculty")

Question: What patterns do you notice in these scatterplots? Which variables appear to have the strongest relationship with median earnings? Are the relationships positive or negative?

Looking at the scatterplots, the strongest relationships with median earnings appear to be average SAT score and percent of Pell grant recipients. The SAT plot shows a clear upward trend, where institutions with higher average SAT scores tend to have higher median earnings ten years after entry. This suggests a positive relationship between SAT scores and earnings.

The percent of Pell grant recipients shows the opposite pattern. As the percentage of Pell recipients increases, median earnings tend to decrease. This indicates a negative relationship between the share of low-income students at an institution and earnings outcomes.

Average faculty salary appears to have a moderate positive relationship with median earnings. Institutions with higher faculty salaries tend to have somewhat higher earnings, although the pattern is more scattered and less clear than the relationship with SAT scores or Pell grant recipients.

The percent of full-time faculty shows very little clear relationship with median earnings. The points are widely spread with no strong upward or downward trend.

Overall, average SAT score has the strongest positive relationship with median earnings, while percent Pell recipients has the strongest negative relationship. The other variables show weaker patterns.
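A fitted line can make the direction and tightness of a scatterplot pattern easier to judge. The sketch below is illustrative only: the data frame `toy` and its values are simulated, not taken from the Scorecard file. It also puts the outcome on the vertical axis, which is the usual convention when the next step is a regression.

```r
# Toy data standing in for the Scorecard file; all values are simulated.
set.seed(1)
toy <- data.frame(sat_avg = round(runif(200, 800, 1400)))
toy$earn10yr <- 50 * toy$sat_avg - 10000 + rnorm(200, sd = 7000)

# Outcome on the y-axis, predictor on the x-axis
plot(toy$sat_avg, toy$earn10yr,
     main = "Median Earnings vs. SAT Average (simulated)",
     xlab = "SAT Average", ylab = "Median Earnings 10 Years After Entry")

# Overlay the least-squares line to show direction and rough strength
toy_fit <- lm(earn10yr ~ sat_avg, data = toy)
abline(toy_fit, col = "red", lwd = 2)

# The correlation gives a single-number summary of the same pattern
toy_r <- cor(toy$sat_avg, toy$earn10yr)
toy_r
```

The same three steps (plot, `abline()`, `cor()`) could be repeated for each predictor to compare the plots on a common footing.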

Section 4: Exploratory Data Analysis

Correlation Matrix

Now let’s look at the correlations among all numeric variables. We will use rcorr() from the Hmisc package, which gives us both correlation coefficients and p-values in one step.

# Select only numeric columns for correlation
numeric_cols <- scorecard[, sapply(scorecard, is.numeric)]

corr_results <- rcorr(as.matrix(numeric_cols))

# View correlation coefficients (rounded to 3 decimal places)
round(corr_results$r, 3)

##                 unitid control adm_rate sat_avg enrollment pctpell pctfloan
## unitid           1.000  -0.042    0.070  -0.048     -0.075  -0.052    0.035
## control         -0.042   1.000   -0.152   0.140     -0.553  -0.087    0.272
## adm_rate         0.070  -0.152    1.000  -0.362     -0.032   0.079    0.269
## sat_avg         -0.048   0.140   -0.362   1.000      0.224  -0.742   -0.526
## enrollment      -0.075  -0.553   -0.032   0.224      1.000  -0.137   -0.376
## pctpell         -0.052  -0.087    0.079  -0.742     -0.137   1.000    0.504
## pctfloan         0.035   0.272    0.269  -0.526     -0.376   0.504    1.000
## avg_net_price   -0.042   0.614   -0.088   0.392     -0.231  -0.457    0.122
## grad_rate        0.014   0.229   -0.295   0.832      0.173  -0.719   -0.339
## retention_rate  -0.078   0.012   -0.258   0.766      0.302  -0.628   -0.420
## earn10yr         0.002   0.081   -0.269   0.638      0.178  -0.569   -0.343
## pct_edu_program -0.013   0.004    0.165  -0.266     -0.133   0.171    0.207
## avg_facsalary   -0.025  -0.152   -0.326   0.662      0.412  -0.515   -0.501
## ft_faculty       0.028  -0.075   -0.039   0.136      0.053  -0.082   -0.063
## instr_expend    -0.050   0.122   -0.452   0.648      0.094  -0.410   -0.425
## pell_cat        -0.019  -0.088    0.053  -0.636     -0.112   0.851    0.428
##                 avg_net_price grad_rate retention_rate earn10yr pct_edu_program
## unitid                 -0.042     0.014         -0.078    0.002          -0.013
## control                 0.614     0.229          0.012    0.081           0.004
## adm_rate               -0.088    -0.295         -0.258   -0.269           0.165
## sat_avg                 0.392     0.832          0.766    0.638          -0.266
## enrollment             -0.231     0.173          0.302    0.178          -0.133
## pctpell                -0.457    -0.719         -0.628   -0.569           0.171
## pctfloan                0.122    -0.339         -0.420   -0.343           0.207
## avg_net_price           1.000     0.496          0.323    0.382          -0.220
## grad_rate               0.496     1.000          0.797    0.633          -0.216
## retention_rate          0.323     0.797          1.000    0.615          -0.289
## earn10yr                0.382     0.633          0.615    1.000          -0.368
## pct_edu_program        -0.220    -0.216         -0.289   -0.368           1.000
## avg_facsalary           0.242     0.617          0.652    0.690          -0.383
## ft_faculty             -0.076     0.103          0.064   -0.003           0.041
## instr_expend            0.246     0.549          0.488    0.513          -0.264
## pell_cat               -0.436    -0.631         -0.565   -0.499           0.187
##                 avg_facsalary ft_faculty instr_expend pell_cat
## unitid                 -0.025      0.028       -0.050   -0.019
## control                -0.152     -0.075        0.122   -0.088
## adm_rate               -0.326     -0.039       -0.452    0.053
## sat_avg                 0.662      0.136        0.648   -0.636
## enrollment              0.412      0.053        0.094   -0.112
## pctpell                -0.515     -0.082       -0.410    0.851
## pctfloan               -0.501     -0.063       -0.425    0.428
## avg_net_price           0.242     -0.076        0.246   -0.436
## grad_rate               0.617      0.103        0.549   -0.631
## retention_rate          0.652      0.064        0.488   -0.565
## earn10yr                0.690     -0.003        0.513   -0.499
## pct_edu_program        -0.383      0.041       -0.264    0.187
## avg_facsalary           1.000      0.031        0.646   -0.450
## ft_faculty              0.031      1.000        0.089   -0.078
## instr_expend            0.646      0.089        1.000   -0.340
## pell_cat               -0.450     -0.078       -0.340    1.000

# View p-values (rounded to 4 decimal places)
round(corr_results$P, 4)

##                 unitid control adm_rate sat_avg enrollment pctpell pctfloan
## unitid              NA  0.1222   0.0112  0.0796     0.0065  0.0592   0.2062
## control         0.1222      NA   0.0000  0.0000     0.0000  0.0015   0.0000
## adm_rate        0.0112  0.0000       NA  0.0000     0.2483  0.0039   0.0000
## sat_avg         0.0796  0.0000   0.0000      NA     0.0000  0.0000   0.0000
## enrollment      0.0065  0.0000   0.2483  0.0000         NA  0.0000   0.0000
## pctpell         0.0592  0.0015   0.0039  0.0000     0.0000      NA   0.0000
## pctfloan        0.2062  0.0000   0.0000  0.0000     0.0000  0.0000       NA
## avg_net_price   0.1219  0.0000   0.0014  0.0000     0.0000  0.0000   0.0000
## grad_rate       0.6055  0.0000   0.0000  0.0000     0.0000  0.0000   0.0000
## retention_rate  0.0045  0.6634   0.0000  0.0000     0.0000  0.0000   0.0000
## earn10yr        0.9423  0.0033   0.0000  0.0000     0.0000  0.0000   0.0000
## pct_edu_program 0.6475  0.8976   0.0000  0.0000     0.0000  0.0000   0.0000
## avg_facsalary   0.3717  0.0000   0.0000  0.0000     0.0000  0.0000   0.0000
## ft_faculty      0.3110  0.0060   0.1567  0.0000     0.0536  0.0029   0.0228
## instr_expend    0.0708  0.0000   0.0000  0.0000     0.0006  0.0000   0.0000
## pell_cat        0.4915  0.0013   0.0547  0.0000     0.0000  0.0000   0.0000
##                 avg_net_price grad_rate retention_rate earn10yr pct_edu_program
## unitid                 0.1219    0.6055         0.0045   0.9423          0.6475
## control                0.0000    0.0000         0.6634   0.0033          0.8976
## adm_rate               0.0014    0.0000         0.0000   0.0000          0.0000
## sat_avg                0.0000    0.0000         0.0000   0.0000          0.0000
## enrollment             0.0000    0.0000         0.0000   0.0000          0.0000
## pctpell                0.0000    0.0000         0.0000   0.0000          0.0000
## pctfloan               0.0000    0.0000         0.0000   0.0000          0.0000
## avg_net_price              NA    0.0000         0.0000   0.0000          0.0000
## grad_rate              0.0000        NA         0.0000   0.0000          0.0000
## retention_rate         0.0000    0.0000             NA   0.0000          0.0000
## earn10yr               0.0000    0.0000         0.0000       NA          0.0000
## pct_edu_program        0.0000    0.0000         0.0000   0.0000              NA
## avg_facsalary          0.0000    0.0000         0.0000   0.0000          0.0000
## ft_faculty             0.0054    0.0002         0.0195   0.8997          0.1316
## instr_expend           0.0000    0.0000         0.0000   0.0000          0.0000
## pell_cat               0.0000    0.0000         0.0000   0.0000          0.0000
##                 avg_facsalary ft_faculty instr_expend pell_cat
## unitid                 0.3717     0.3110       0.0708   0.4915
## control                0.0000     0.0060       0.0000   0.0013
## adm_rate               0.0000     0.1567       0.0000   0.0547
## sat_avg                0.0000     0.0000       0.0000   0.0000
## enrollment             0.0000     0.0536       0.0006   0.0000
## pctpell                0.0000     0.0029       0.0000   0.0000
## pctfloan               0.0000     0.0228       0.0000   0.0000
## avg_net_price          0.0000     0.0054       0.0000   0.0000
## grad_rate              0.0000     0.0002       0.0000   0.0000
## retention_rate         0.0000     0.0195       0.0000   0.0000
## earn10yr               0.0000     0.8997       0.0000   0.0000
## pct_edu_program        0.0000     0.1316       0.0000   0.0000
## avg_facsalary              NA     0.2628       0.0000   0.0000
## ft_faculty             0.2628         NA       0.0011   0.0043
## instr_expend           0.0000     0.0011           NA   0.0000
## pell_cat               0.0000     0.0043       0.0000       NA

Let’s save the correlation matrix to a CSV file so you can examine it more easily in Excel or Google Sheets.

write.csv(round(corr_results$r, 3), file = "Correlation Matrix College Scorecard.csv")

Visualization: Correlation Plot

corrplot(corr_results$r, method = "color", type = "upper",
         tl.cex = 0.7, tl.col = "black",
         addCoef.col = "black", number.cex = 0.5,
         title = "Correlation Matrix - College Scorecard",
         mar = c(0, 0, 2, 0))

Density Plot: Median Earnings by Pell Category

This plot shows how median earnings distributions differ across institutions grouped by the percentage of Pell grant recipients (a proxy for the socioeconomic status of the student body).

scorecard$pell_cat_label <- factor(scorecard$pell_cat,
                                    labels = c("Low", "Medium", "High"))

ggplot(scorecard, aes(x = earn10yr, group = pell_cat_label, col = pell_cat_label)) +
  geom_density() +
  labs(title = "Median Earnings Distribution by Pell Grant Category",
       x = "Median Earnings 10 Years After Entry",
       y = "Density",
       color = "Pell Category") +
  theme_minimal()

Question: Which variables have the strongest correlations with earn10yr? Are there any pairs of independent variables that are very highly correlated with each other (potential multicollinearity concerns)? How do median earnings distributions differ across Pell categories?

Looking at the correlation matrix, several variables have strong relationships with earn10yr. The strongest positive correlations with earnings are average faculty salary (0.690), average SAT score (0.638), and graduation rate (0.633). Retention rate (0.615) and instructional expenditure (0.513) also show positive relationships with earnings. In contrast, percent of Pell grant recipients has a fairly strong negative correlation with earnings (-0.569), suggesting that institutions with higher shares of Pell recipients tend to have lower median earnings ten years after entry.

There are also some independent variables that are highly correlated with each other, which could raise multicollinearity concerns. For example, average SAT score and graduation rate have a correlation of 0.832, and SAT score and retention rate have a correlation of 0.766. Graduation rate and retention rate are also strongly related (0.797). These variables measure different things conceptually, but they appear to move together in the data, likely because more selective institutions tend to admit academically stronger students, who also end up having higher completion rates. Percent Pell recipients is strongly negatively correlated with SAT score (-0.742) and graduation rate (-0.719), which suggests that institutions enrolling fewer low-income students tend to be more selective and have higher completion rates.

When looking at the distribution of earnings across the Pell groups, schools with larger shares of Pell grant recipients tend to have lower median earnings, while schools with fewer Pell recipients tend to have higher earnings outcomes.
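Scanning a 16-by-16 matrix by eye is error-prone, so it can help to list the highly correlated pairs programmatically. The following is a minimal sketch on simulated data; the column names mirror the lab, but the values and the 0.7 cutoff are assumptions chosen for illustration.

```r
# Toy data; column names mirror the lab but the values are simulated.
set.seed(2)
n <- 300
sat_avg   <- rnorm(n, 1100, 120)
grad_rate <- 0.1 * sat_avg - 50 + rnorm(n, sd = 6)   # built to track sat_avg
pctpell   <- rnorm(n, 40, 15)                        # unrelated by construction
toy <- data.frame(sat_avg, grad_rate, pctpell)

r <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above a chosen cutoff
cutoff <- 0.7   # arbitrary threshold for illustration
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 3)
)
high_pairs
```

Applied to the lab's own matrix, the same indexing should work on `corr_results$r`, since that object is also a square matrix with row and column names.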

Section 5: Bivariate Regression

Now we move from exploration to modeling. We will start with a bivariate regression — one predictor and one outcome. In R, we use the lm() function (short for “linear model”). The syntax is:

lm(dependent_variable ~ independent_variable, data = your_data)

The tilde (~) means “predicted by.” So earn10yr ~ pctpell reads as “median earnings predicted by percent Pell recipients.”

Example: earn10yr predicted by pctpell

Here is what the bivariate regression looks like using pctpell as the predictor. This is just an example. You should replace pctpell with YOUR first-choice independent variable from Section 2.

# EXAMPLE ONLY — this shows you the syntax
regression_1 <- lm(earn10yr ~ pctpell, data = scorecard)
summary(regression_1)

Your Bivariate Regression

Replace pctpell below with the first variable from your ranked list in Section 2.

# First independent variable from the ranked list in Section 2: sat_avg
regression_1 <- lm(earn10yr ~ sat_avg, data = scorecard)
summary(regression_1)

## 
## Call:
## lm(formula = earn10yr ~ sat_avg, data = scorecard)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23281  -4836   -991   3789  71815 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -9444.869   1751.497  -5.392         0.0000000822 ***
## sat_avg        49.387      1.639  30.135 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7807 on 1324 degrees of freedom
## Multiple R-squared:  0.4068, Adjusted R-squared:  0.4064 
## F-statistic: 908.1 on 1 and 1324 DF,  p-value: < 0.00000000000000022

Question: Is the result in line with your expectations? Interpret the following:

What is the slope (coefficient) for your predictor, and what does it mean in plain English?
Is the predictor statistically significant?
What is the R-squared value, and what does it tell you about how much variance in median earnings is explained?

The result is generally in line with my expectations. In the earlier scatterplots and correlations, average SAT score appeared to have a strong positive relationship with median earnings, and the regression results show the same pattern. The slope is 49.387, which means that each one-point increase in average SAT score is associated with about $49 higher median earnings ten years after entry, on average. For example, if one school has an average SAT score that is 100 points higher than another, the model predicts about $4,939 in higher median earnings. The predictor is statistically significant, with a p-value less than 2e-16. The R-squared value is 0.4068, which means that average SAT score alone explains about 40.7% of the variation in median earnings across the institutions in the dataset. This suggests that institutional selectivity, as measured by SAT scores, is strongly associated with differences in earnings outcomes.
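The 100-point comparison can be reproduced by hand from the reported coefficients. This sketch copies the intercept and slope from the regression output; the two SAT averages (1000 and 1100) are hypothetical values picked only to illustrate the arithmetic.

```r
# Coefficients copied from the summary(regression_1) output
intercept <- -9444.869
slope     <- 49.387

# Hypothetical SAT averages chosen only for illustration
sat_low  <- 1000
sat_high <- 1100

pred_low  <- intercept + slope * sat_low
pred_high <- intercept + slope * sat_high

pred_low               # predicted median earnings at SAT 1000
pred_high              # predicted median earnings at SAT 1100
pred_high - pred_low   # equals slope * 100
```

With the fitted object available, `predict(regression_1, newdata = data.frame(sat_avg = c(1000, 1100)))` should return the same two predictions up to rounding of the copied coefficients.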

Section 6: Regression Diagnostics – Bivariate

Before we trust our regression results, we need to check several assumptions. Let’s run through the key diagnostics for your bivariate model.

Histogram of Standardized Residuals

If the residuals are roughly normally distributed, the histogram should look approximately bell-shaped.

hist(scale(regression_1$residuals),
     main = "Histogram of Standardized Residuals",
     xlab = "Standardized Residuals",
     col = "lightblue",
     breaks = 20)

Diagnostic Plots

Calling plot() on a fitted regression model shows four diagnostic plots by default. These help you assess linearity, normality, homoscedasticity (equal variance), and influential observations.

par(mfrow = c(2, 2))
plot(regression_1)

par(mfrow = c(1, 1))

What to look for in each plot:

Residuals vs Fitted: Points should be randomly scattered around 0. Patterns suggest nonlinearity.
Normal Q-Q: Points should fall along the diagonal line. Deviations indicate non-normal residuals.
Scale-Location: The spread of residuals should be roughly constant across fitted values. A funnel shape indicates heteroscedasticity.
Residuals vs Leverage: Look for points with high leverage AND large residuals (upper or lower right corners). These are influential observations.
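Visual checks can be paired with simple numeric ones. The sketch below is one possible approach on simulated data, not part of the lab's required workflow: `shapiro.test()` is in base R, and the constant-variance check is a hand-written auxiliary regression in the spirit of the studentized Breusch-Pagan test, spelled out so that no extra package is assumed.

```r
# Simulated data whose error spread grows with x (heteroscedastic by design)
set.seed(42)
n <- 500
x <- runif(n, 800, 1400)
y <- 10000 + 40 * x + rnorm(n, sd = 10 * (x - 700))
toy_fit <- lm(y ~ x)

# Normality of residuals: a small p-value suggests departure from normality
shapiro_p <- shapiro.test(resid(toy_fit))$p.value

# Constant variance: regress squared residuals on the predictor.
# n * R-squared of this auxiliary model is compared to a chi-squared
# distribution with df = number of predictors (here 1).
aux     <- lm(resid(toy_fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(shapiro_p = shapiro_p, bp_stat = bp_stat, bp_p = bp_p)
```

With more than a thousand observations, tests like these can flag departures too small to matter in practice, so they are best read alongside the plots rather than instead of them.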

Standardized Coefficients

Standardized coefficients allow you to compare the relative importance of predictors (especially useful later with multiple regression). We standardize by scaling both variables to z-scores.

# Standardized version of the bivariate model (predictor: sat_avg)
lm(scale(earn10yr) ~ scale(sat_avg), data = scorecard)

## 
## Call:
## lm(formula = scale(earn10yr) ~ scale(sat_avg), data = scorecard)
## 
## Coefficients:
##            (Intercept)          scale(sat_avg)  
## -0.0000000000000003479   0.6378355198648106850
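In a regression with a single predictor, the standardized slope equals the Pearson correlation between the two variables, and its square equals R-squared. The sketch below checks this on simulated data (the variables `x` and `y` are hypothetical) and then applies the same arithmetic to the value reported in this lab.

```r
# Simulated predictor and outcome
set.seed(3)
x <- rnorm(150, 1100, 120)
y <- 45 * x + rnorm(150, sd = 8000)

std_slope <- coef(lm(scale(y) ~ scale(x)))[2]   # standardized slope
r         <- cor(x, y)                          # Pearson correlation
r2        <- summary(lm(y ~ x))$r.squared       # R-squared of the raw model

c(std_slope = unname(std_slope), r = r, r_squared = r2, r_times_r = r^2)

# Same check with the standardized slope reported in this lab
0.6378355^2   # compare with the Multiple R-squared of 0.4068
```

This is why the standardized coefficient of 0.638 matches the sat_avg and earn10yr entry in the correlation matrix.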

Standardized Residuals — Identifying Outliers

A common rule of thumb: observations whose standardized residuals exceed 2 in absolute value may be outliers worth investigating.

regression_1$standardized.residuals <- rstandard(regression_1)
regression_1$large_residual <- abs(regression_1$standardized.residuals) > 2

# How many potential outliers?
sum(regression_1$large_residual)

## [1] 50
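For context, a count like this can be compared with what a normal distribution would produce. The sketch below uses only the sample size and the reported count; the comparison with a standard normal is a rough benchmark, not a formal test.

```r
n_obs     <- 1326   # institutions in the dataset
n_flagged <- 50     # count reported for the bivariate model

share_flagged  <- n_flagged / n_obs
share_expected <- 2 * pnorm(-2)        # P(|Z| > 2) for a standard normal
count_expected <- share_expected * n_obs

round(c(share_flagged = share_flagged,
        share_expected = share_expected,
        count_expected = count_expected), 4)
```

About 3.8% of institutions are flagged, slightly fewer than the roughly 4.6% expected under normality, so the count alone is not alarming; the size of the largest residuals matters more than how many cross the threshold.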

Durbin-Watson Test

The Durbin-Watson test checks for autocorrelation in the residuals (whether errors are independent). Values close to 2 indicate no autocorrelation. Values significantly below 2 suggest positive autocorrelation, and values above 2 suggest negative autocorrelation.

dwt(regression_1)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.2134893      1.572985       0
##  Alternative hypothesis: rho != 0

Question: Did you find any major assumption violations? Specifically comment on:

Normality of residuals (histogram and Q-Q plot)
Homoscedasticity (Scale-Location plot)
Any influential outliers?
Durbin-Watson result — is there autocorrelation?

No severe assumption violations were observed, although two checks deserve a note. The histogram and Q-Q plot suggest that the residuals are approximately normally distributed, apart from a few observations in the upper tail that deviate from the line; the residual summary shows the same right skew, with a maximum of 71,815 against a minimum of -23,281. The Scale-Location plot shows a somewhat increasing spread in the residuals at higher fitted values, which suggests mild heteroscedasticity, but the pattern is not severe. A small number of observations have larger residuals, but none appear to be extremely influential in the leverage plot. The Durbin–Watson statistic is 1.57 with a p-value reported as 0, which suggests some positive autocorrelation in the residuals. Because the dataset is cross-sectional rather than time-series, this statistic depends on the order of the rows, so the result is less concerning and may reflect similar institutions sitting next to each other in the file (the first rows, for example, are all Alabama schools) rather than temporal dependence.
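In cross-sectional data the Durbin-Watson statistic depends on how the rows happen to be ordered. The sketch below illustrates this on simulated data: the 50 "states", the shared state-level shock, and the helper `dw_stat()` are all hypothetical constructions, and the statistic is computed by hand rather than with `dwt()`.

```r
# Durbin-Watson statistic computed by hand from a residual vector
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

# Simulated data: 50 hypothetical "states" with 20 institutions each,
# stored in state order, with a shared state-level shock in the outcome
set.seed(7)
state <- rep(1:50, each = 20)
x     <- rnorm(1000)
y     <- 2 * x + rnorm(50, sd = 2)[state] + rnorm(1000)
toy   <- data.frame(state, x, y)

# Rows in state order: neighbouring residuals share a state shock
dw_sorted <- dw_stat(resid(lm(y ~ x, data = toy)))

# Same data, rows shuffled: the fit is identical, only the order changes
shuffled    <- toy[sample(nrow(toy)), ]
dw_shuffled <- dw_stat(resid(lm(y ~ x, data = shuffled)))

c(dw_sorted = dw_sorted, dw_shuffled = dw_shuffled)
```

The sorted version falls well below 2 while the shuffled version sits near 2, even though the coefficients are the same. A low statistic in a file ordered by state therefore points toward clustering, not toward time dependence.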

Section 7: Sequential Model Building

Now comes the fun part! You will build your regression model sequentially, adding one predictor at a time. This lets you see how each new variable contributes to explaining post-graduation earnings.

Instructions:

Start with your best single predictor from Section 5 (that is regression_1).
Add your second-ranked variable to create regression_2.
Continue adding variables one at a time through regression_final.
After each model, look at the R-squared — is it increasing meaningfully? Is the new variable significant?

Replace YOUR_VAR_1, YOUR_VAR_2, etc. with your actual variable names.

Model 2: Two Predictors

# TODO: Replace YOUR_VAR_1 and YOUR_VAR_2 with your variable names
regression_2 <- lm(earn10yr ~ sat_avg + pctpell, data = scorecard)
summary(regression_2)

## 
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell, data = scorecard)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21499  -4723   -932   3613  71259 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 8949.893   3200.266   2.797              0.00524 ** 
## sat_avg       37.218      2.404  15.479 < 0.0000000000000002 ***
## pctpell     -150.387     22.053  -6.819      0.0000000000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7676 on 1323 degrees of freedom
## Multiple R-squared:  0.427,  Adjusted R-squared:  0.4261 
## F-statistic: 492.9 on 2 and 1323 DF,  p-value: < 0.00000000000000022

Model 3: Three Predictors

# TODO: Replace with your variable names
regression_3 <- lm(earn10yr ~ sat_avg + pctpell + grad_rate, data = scorecard)
summary(regression_3)

## 
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell + grad_rate, data = scorecard)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20265  -4413   -965   3553  69436 
## 
## Coefficients:
##              Estimate Std. Error t value          Pr(>|t|)    
## (Intercept) 13743.660   3203.309   4.290 0.000019128650303 ***
## sat_avg        22.642      3.072   7.372 0.000000000000296 ***
## pctpell      -105.099     22.467  -4.678 0.000003195105989 ***
## grad_rate     164.651     22.247   7.401 0.000000000000239 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7524 on 1322 degrees of freedom
## Multiple R-squared:  0.4498, Adjusted R-squared:  0.4485 
## F-statistic: 360.2 on 3 and 1322 DF,  p-value: < 0.00000000000000022

Model 4: Four Predictors

# TODO: Replace with your variable names
regression_4 <- lm(earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate, data = scorecard)
summary(regression_4)

## 
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate, 
##     data = scorecard)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20269  -4187   -966   3324  68121 
## 
## Coefficients:
##                Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)    8408.834   3270.142   2.571         0.0102 *  
## sat_avg          17.114      3.153   5.427 0.000000067944 ***
## pctpell        -102.443     22.152  -4.625 0.000004120461 ***
## grad_rate        99.042     24.294   4.077 0.000048374510 ***
## retention_rate  193.974     30.902   6.277 0.000000000467 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7417 on 1321 degrees of freedom
## Multiple R-squared:  0.4657, Adjusted R-squared:  0.4641 
## F-statistic: 287.9 on 4 and 1321 DF,  p-value: < 0.00000000000000022

Model 5: Five Predictors

# TODO: Replace with your variable names
regression_5 <- lm(earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate + avg_net_price, data = scorecard)
summary(regression_5)

## 
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate + 
##     avg_net_price, data = scorecard)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20679  -4005   -785   3092  66018 
## 
## Coefficients:
##                  Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)    4045.52154 3397.28187   1.191        0.233942    
## sat_avg          18.13862    3.14053   5.776 0.0000000095483 ***
## pctpell         -82.56622   22.46408  -3.675        0.000247 ***
## grad_rate        64.77687   25.36554   2.554        0.010769 *  
## retention_rate  212.36399   30.97733   6.855 0.0000000000109 ***
## avg_net_price     0.16100    0.03677   4.379 0.0000128630470 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7367 on 1320 degrees of freedom
## Multiple R-squared:  0.4734, Adjusted R-squared:  0.4714 
## F-statistic: 237.3 on 5 and 1320 DF,  p-value: < 0.00000000000000022

Final Model: Six (or More) Predictors

# TODO: Replace with your variable names
regression_final <- lm(earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate + avg_net_price + instr_expend, data = scorecard)
summary(regression_final)

## 
## Call:
## lm(formula = earn10yr ~ sat_avg + pctpell + grad_rate + retention_rate + 
##     avg_net_price + instr_expend, data = scorecard)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20882  -3967   -714   3057  66432 
## 
## Coefficients:
##                   Estimate  Std. Error t value          Pr(>|t|)    
## (Intercept)    14236.83859  3596.76007   3.958 0.000079529629077 ***
## sat_avg            7.34118     3.39891   2.160            0.0310 *  
## pctpell         -107.32303    22.25809  -4.822 0.000001588249506 ***
## grad_rate         53.07165    24.90273   2.131            0.0333 *  
## retention_rate   218.06299    30.36159   7.182 0.000000000001143 ***
## avg_net_price      0.15887     0.03603   4.410 0.000011185332031 ***
## instr_expend       0.23817     0.03184   7.480 0.000000000000135 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7218 on 1319 degrees of freedom
## Multiple R-squared:  0.4948, Adjusted R-squared:  0.4925 
## F-statistic: 215.3 on 6 and 1319 DF,  p-value: < 0.00000000000000022

Example Final Model (for reference)

Here is an example of what a completed final model might look like. Yours will likely be different!

# EXAMPLE ONLY — do not copy this blindly!
# Stored under a separate name so it cannot overwrite your own regression_final
regression_example <- lm(earn10yr ~ avg_facsalary + ft_faculty + instr_expend + enrollment + pctpell, data = scorecard)
summary(regression_example)

Question: How does R-squared change as you add variables? Does each new variable contribute significantly (check the p-value for each coefficient)? Did any variable become non-significant after adding others?

The explanatory power of the model increases as additional variables are added, though each step adds only a modest amount. The first model using only sat_avg had an R-squared of 0.4068. After adding pctpell, the R-squared increased to 0.427. Including grad_rate raised it further to 0.4498, and adding retention_rate increased it to 0.4657. When avg_net_price was included, the R-squared rose to 0.4734, the smallest gain of any step, and the final model with instr_expend reached an R-squared of 0.4948. This means the final model explains about 49.5% of the variation in median earnings across institutions.

Each new variable was statistically significant when added and remained so in later models. In the final model, sat_avg (p = 0.031), pctpell (p = 1.59e-06), grad_rate (p = 0.033), retention_rate (p = 1.14e-12), avg_net_price (p = 1.12e-05), and instr_expend (p = 1.35e-13) all have p-values below 0.05. Some coefficients shrank as additional variables were added (the sat_avg slope fell from 49.4 in the bivariate model to 7.3 in the final model), but none of the predictors became non-significant once the full set of variables was included.
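The step-by-step gains in R-squared can be read off directly from the reported values. This sketch only copies the Multiple R-squared figures from the six model summaries and takes successive differences.

```r
# Multiple R-squared values copied from the six model summaries
r2 <- c(regression_1 = 0.4068, regression_2 = 0.4270, regression_3 = 0.4498,
        regression_4 = 0.4657, regression_5 = 0.4734, regression_final = 0.4948)

# Gain in R-squared contributed by each added predictor
added   <- c("pctpell", "grad_rate", "retention_rate",
             "avg_net_price", "instr_expend")
r2_gain <- setNames(round(diff(r2), 4), added)
r2_gain
```

The gains are roughly two percentage points per step, except for avg_net_price at under one point. On the fitted objects, a call such as `anova(regression_5, regression_final)` gives an F-test of whether the added term improves the fit.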

Section 8: Final Model Diagnostics

Now repeat the diagnostic checks for your final model. This is critical — a model is only trustworthy if its assumptions are reasonably met.

Histogram of Residuals

hist(scale(regression_final$residuals),
     main = "Residuals - Final Model",
     xlab = "Standardized Residuals",
     col = "lightblue",
     breaks = 20)

Diagnostic Plots

par(mfrow = c(2, 2))
plot(regression_final)

par(mfrow = c(1, 1))

Standardized Residuals

regression_final$standardized.residuals <- rstandard(regression_final)
regression_final$large_residual <- abs(regression_final$standardized.residuals) > 2

# How many potential outliers in the final model?
sum(regression_final$large_residual)

## [1] 56

Standardized Coefficients

Standardized coefficients tell you which predictors have the largest effect in standard deviation units. Replace the variable names below with your actual variables.

# TODO: Replace with YOUR variables from the final model
lm(scale(earn10yr) ~ scale(sat_avg) + scale(pctpell) + scale(grad_rate) +
     scale(retention_rate) + scale(avg_net_price) + scale(instr_expend),
   data = scorecard)

## 
## Call:
## lm(formula = scale(earn10yr) ~ scale(sat_avg) + scale(pctpell) + 
##     scale(grad_rate) + scale(retention_rate) + scale(avg_net_price) + 
##     scale(instr_expend), data = scorecard)
## 
## Coefficients:
##           (Intercept)         scale(sat_avg)         scale(pctpell)  
##  0.000000000000000586   0.094811280558475930  -0.151123894332738984  
##      scale(grad_rate)  scale(retention_rate)   scale(avg_net_price)  
##  0.091239088130648238   0.246742178438977527   0.102414618198945331  
##   scale(instr_expend)  
##  0.194454116456967752
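Because the printed coefficients wrap across several lines, ranking them by absolute size makes the comparison easier. The sketch below copies the six standardized coefficients from the output (rounded to four decimals) and sorts them.

```r
# Standardized coefficients copied from the output (rounded)
std_coef <- c(sat_avg = 0.0948, pctpell = -0.1511, grad_rate = 0.0912,
              retention_rate = 0.2467, avg_net_price = 0.1024,
              instr_expend = 0.1945)

# Rank by absolute size; the sign shows direction, not importance
ranked <- std_coef[order(abs(std_coef), decreasing = TRUE)]
ranked
```

In standard deviation units, retention_rate and instr_expend carry the largest coefficients and grad_rate the smallest.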

Example Standardized Coefficients (for reference)

# EXAMPLE ONLY — matches the example final model
lm(scale(earn10yr) ~ scale(avg_facsalary) + scale(ft_faculty) + scale(instr_expend) + scale(enrollment) + scale(pctpell),
   data = scorecard)

Variance Inflation Factor (VIF)

VIF checks for multicollinearity. As a rule of thumb:

VIF < 5: No concern
VIF 5-10: Moderate concern, investigate further
VIF > 10: Serious multicollinearity problem — consider removing a variable

vif(regression_final)

##        sat_avg        pctpell      grad_rate retention_rate  avg_net_price 
##       5.030848       2.564664       4.785257       3.081387       1.408157 
##   instr_expend 
##       1.764496
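A VIF can be unpacked with the usual definition: regress one predictor on all the others and compute 1 / (1 - R-squared). The sketch below is a by-hand illustration on simulated predictors (`x1`, `x2`, `x3` are hypothetical) and is not a replacement for `vif()`; the last line reads the reported sat_avg value backwards.

```r
# Simulated predictors; x2 is built to overlap heavily with x1
set.seed(5)
n  <- 400
x1 <- rnorm(n)
x2 <- 0.85 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF for x1: regress it on the other predictors, then 1 / (1 - R^2)
r2_x1  <- summary(lm(x1 ~ x2 + x3, data = X))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_all <- diag(solve(cor(X)))

vif_x1
vif_all

# Reading the reported VIF for sat_avg backwards:
1 - 1 / 5.030848   # share of sat_avg variance explained by the other predictors
```

A VIF of about 5.03 therefore means that roughly 80% of the variation in sat_avg is shared with the other five predictors.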

Durbin-Watson Test

dwt(regression_final)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.1045195      1.790548   0.002
##  Alternative hypothesis: rho != 0
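The two columns of this output are linked by a large-sample approximation: the D-W statistic is roughly 2 × (1 − r), where r is the lag-1 autocorrelation. The sketch below applies it to the values reported for the bivariate and final models.

```r
# Lag-1 autocorrelations and D-W statistics copied from the dwt() outputs
r_lag1 <- c(bivariate = 0.2134893, final = 0.1045195)
dw     <- c(bivariate = 1.572985,  final = 1.790548)

# Large-sample approximation: DW is roughly 2 * (1 - r)
dw_approx <- 2 * (1 - r_lag1)

round(rbind(reported = dw, approx = dw_approx), 3)
```

The lag-1 autocorrelation roughly halves between the two models (0.21 to 0.10), which is why the statistic moves closer to 2.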

Question: Discuss the following for your final model:

Multicollinearity: What are the VIF values? Are any concerning?
Independence of errors: What does the Durbin-Watson test tell you?
Normality: Do the residuals appear normally distributed (histogram and Q-Q plot)?
Homoscedasticity: Is the variance of residuals roughly constant (Scale-Location plot)?
Overall: Would you trust this model’s predictions? Why or why not?

Multicollinearity does not appear to be a major issue in the final model. The VIF values are sat_avg = 5.03, pctpell = 2.56, grad_rate = 4.79, retention_rate = 3.08, avg_net_price = 1.41, and instr_expend = 1.76. The highest value is for sat_avg at just over 5, which places it at the bottom edge of the 5-10 “moderate concern” range and suggests some overlap with other predictors; grad_rate is close behind. Looking back at the correlations, this likely reflects the fact that more selective institutions tend to have higher graduation and retention rates. However, none of the values approach 10, so multicollinearity does not appear severe.

The Durbin–Watson statistic is 1.79 with a p-value of 0.002, which suggests some positive autocorrelation in the residuals, though less than in the bivariate model (1.57). Since the dataset consists of different universities rather than observations over time, this result is less concerning than it would be in a time series setting.

The residual diagnostics suggest that the normality assumption is mostly satisfied. The histogram of standardized residuals is roughly centered around zero and has a generally bell-shaped distribution, though there is a longer tail on the upper end. The Q-Q plot follows the diagonal line for most observations but bends away from it in the upper tail, indicating a few schools where earnings are higher than the model predicts.

The Scale-Location plot shows that the spread of the residuals increases slightly at higher fitted values. This suggests some mild heteroscedasticity, but the pattern is not extreme.

Overall, the model appears reasonably trustworthy. The assumptions are mostly satisfied, and the model explains about 49.5% of the variation in median earnings across institutions. While there are clearly other factors affecting earnings that are not captured here, the model provides a useful picture of how selectivity, socioeconomic composition, completion rates, and institutional resources relate to earnings outcomes.

Section 9: Summary and Interpretation

Take a step back and summarize what you found.

9a. The final model shows that institutional characteristics related to student outcomes and institutional investment are strongly associated with earnings ten years after entry. Schools with higher retention rates and higher instructional spending tend to produce higher median earnings, while institutions enrolling larger shares of Pell Grant recipients tend to have lower earnings outcomes even after accounting for the other variables in the model. Average SAT scores, graduation rates, and net price also remain positively related to earnings, though their effects are smaller once institutional performance measures are included. The model explains about 49.5% of the variation in earnings across institutions, which suggests these institutional factors capture a substantial portion of the differences in long-term earnings outcomes.

9b. Were your initial hypotheses (from Section 2) supported? What surprised you?

The results mostly supported the initial hypotheses. Institutions with higher SAT scores, higher graduation rates, and higher retention rates were all associated with higher earnings outcomes, while institutions with larger shares of Pell Grant recipients tended to have lower median earnings. However, one result differed from the initial expectations. SAT scores were the strongest predictor in the bivariate model, but once other institutional variables were added, retention rate and instructional expenditure became stronger predictors. This suggests that institutional performance and investment in instruction may play an important role in shaping earnings outcomes beyond the academic preparation of incoming students.

9c. What are the limitations of this analysis? (Think about: causation vs. correlation, omitted variables, generalizability, etc.)

This analysis has several limitations. The model identifies associations between institutional characteristics and post-graduation earnings, but it does not establish causal mechanisms. Students are sorted into institutions through long-running processes that the dataset does not measure, so coefficients on variables such as SAT averages, retention, and instructional spending can reflect selection as much as institutional impact. The dataset also does not measure conditions that shape earnings trajectories before college, including childhood poverty, housing instability, food insecurity, domestic violence, and unequal access to primary and secondary education. Because these conditions are missing, the regression explains variation in earnings using institutional indicators while leaving the production of disadvantage largely outside the model.

The negative relationship between pctpell and earn10yr is a clear example. The result suggests that institutions serving larger shares of low-income students sit within resource constraints and labor market structures that shape long-term outcomes. This dataset is still useful because it flags where the earnings gap is most strongly patterned, and it motivates deeper study of how programs like Pell interact with institutional funding, student support, and local opportunity structures.
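The omitted-variable concern raised in this answer can be made concrete with a small simulation. The sketch below is purely hypothetical: `background` stands for unmeasured pre-college conditions, `inst_spend` is an invented column name for instructional spending, and the "true" effects are assumptions chosen for the demonstration, not estimates from the College Scorecard data.

```r
# Illustrative simulation only: no real institutions or students.
# "background" stands for unmeasured pre-college conditions (hypothetical).
set.seed(7)
n <- 2000
background <- rnorm(n)

# Assumption: schools with more advantaged students spend more on instruction
# (inst_spend is a hypothetical column name).
inst_spend <- 10 + 2 * background + rnorm(n)

# Assumed truth for this simulation: spending adds 500 per unit,
# background adds 4000 per unit.
earn10yr <- 30000 + 500 * inst_spend + 4000 * background +
  rnorm(n, sd = 3000)

naive    <- lm(earn10yr ~ inst_spend)               # background omitted
adjusted <- lm(earn10yr ~ inst_spend + background)  # background observed

slope_naive    <- coef(naive)[["inst_spend"]]
slope_adjusted <- coef(adjusted)[["inst_spend"]]

cat("Slope with background omitted: ", round(slope_naive), "\n")
cat("Slope with background included:", round(slope_adjusted), "\n")
```

Under these assumptions the omitted-variable formula gives an expected naive slope of 500 + 4000 × (2/5) = 2100, more than four times the assumed true effect of 500. The real data offer no way to run the adjusted model, which is why the coefficients in this lab are best read as associations.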

Before You Submit

Make sure your name is in the author field at the top of this document
Verify all code chunks run without errors (try Run All with Ctrl+Alt+R)
Answer ALL written questions in the markdown sections (search for “YOUR ANSWER HERE”)
Click Knit -> Knit to HTML (or press Ctrl+Shift+K)
Submit BOTH your .Rmd file AND the knitted .html file
Don’t forget to complete at least one Advanced Extension (separate file)

Lab 3: Linear Regression - College Scorecard Earnings Analysis

Dillon deKalands

2026-03-07
