1 - Introduction

Given the current uncertainty centered around the future of U.S Higher Education Financial Aid programs, it is appropriate to explore the relationship between the coverage of financial aid(percentage of students who receive aid) and the depth of financial aid (the amount of aid provided). One of the key metrics utilized for ascertaining an institutions “success” is graduation rate. Financial aid programs are generally offered with a purpose of bridging the gaps in access to higher education for students of different income levels.

The data is sample of 1392 undergraduate degree granting public institutions located within the United States of America extracted from the 2022-2023 IPEDS survey. The observational study attempts to quantify the relationship between the precentage of students that receive financial aid and the average aid package awarded with an institutions graduation rate while accounting for variables including Pell Grant share, state, and institutions headcount.

The result of the multiple linear regression model chosen explains about 55.5% of the variance observed in graduation rate. R^2 = 55.5%, adjusted R^2 53.7% and a standard error = 11.5%. Given the variables included, a $1000 increase in the average grant award is associated with a 2.9 percentage‑point rise in the graduation rate. Expanding grant coverage by 10 percentage‑points predicts an additional 2 percentage‑point gain. A negative association is observed with Pell share and graduation rate. The model shows that both the percentage of students as well as the average aid package together while taking into consideration the rest of the variables, have a positive association with graduation rate.

2 - Importing The Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing the financial aid data

ipeds_fa2223 <- read.csv('https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/sfa2223.csv')

How many rows and columns are included in the FA dataset?

dim(ipeds_fa2223)
## [1] 5653  691

Takeaway: There are 5,563 rows and 691 columns included in the 22-23 IPEDS FA dataset

What does the data look like?

head(ipeds_fa2223,5)
##   UNITID XSCUGRAD SCUGRAD XSCUGDGSK SCUGDGSK XSCUGNDGS SCUGNDGS XSCUGFFN
## 1 100654        R    5206         R     5201         R        5        R
## 2 100663        R   13032         R    12776         R      256        R
## 3 100690        R     228         R      228         R        0        R
## 4 100706        R    7169         R     6985         R      184        R
## 5 100724        R    3296         R     3296         R        0        R
##   SCUGFFN XSCUGFFP SCUGFFP XSCFA2 SCFA2 XSCFA2DG SCFA2DG XSCFA2ND SCFA2ND
## 1    1547        R      30      R  5206        R    5201        R       5
## 2    2172        R      17      R 13032        R   12776        R     256
## 3       0        R       0      R   228        R     228        R       0
## 4    1154        R      16      R  7169        R    6985        R     184
## 5     938        R      28      R  3296        R    3296        R       0
##   XSCFA1N SCFA1N XSCFA1P SCFA1P XSCFA11N SCFA11N XSCFA11P SCFA11P XSCFA12N
## 1       R   1547       R     30        R       0        R       0        R
## 2       R   2172       R     17        R       0        R       0        R
## 3       R      0       R      0        A      NA        A      NA        A
## 4       R   1154       R     16        R       0        R       0        R
## 5       R    938       R     28        R       0        R       0        R
##   SCFA12N XSCFA12P SCFA12P XSCFA13N SCFA13N XSCFA13P SCFA13P XSCFA14N SCFA14N
## 1     810        R      52        R     737        R      48        R       0
## 2    1881        R      87        R     291        R      13        R       0
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4     969        R      84        R     185        R      16        R       0
## 5     382        R      41        R     556        R      59        R       0
##   XSCFA14P SCFA14P XSCFY2 SCFY2 XSCFY2DG SCFY2DG XSCFY2ND SCFY2ND XSCFY1N
## 1        R       0      A    NA        A      NA        A      NA       A
## 2        R       0      A    NA        A      NA        A      NA       A
## 3        A      NA      A    NA        A      NA        A      NA       A
## 4        R       0      A    NA        A      NA        A      NA       A
## 5        R       0      A    NA        A      NA        A      NA       A
##   SCFY1N XSCFY1P SCFY1P XUAGRNTN UAGRNTN XUAGRNTP UAGRNTP XUAGRNTT   UAGRNTT
## 1     NA       A     NA        R    4326        R      83        R  50168163
## 2     NA       A     NA        R    9696        R      74        R 101942282
## 3     NA       A     NA        R     216        R      95        R    473411
## 4     NA       A     NA        R    4940        R      69        R  47598105
## 5     NA       A     NA        R    3159        R      96        R  32218913
##   XUAGRNTA UAGRNTA XUDGAGRNTN UDGAGRNTN XUDGAGRNTP UDGAGRNTP XUDGAGRNTT
## 1        R   11597          R      4326          R        83          R
## 2        R   10514          R      9645          R        75          R
## 3        R    2192          R       216          R        95          R
## 4        R    9635          R      4790          R        69          R
## 5        R   10199          R      3159          R        96          R
##   UDGAGRNTT XUDGAGRNTA UDGAGRNTA XUNDAGRNTN UNDAGRNTN XUNDAGRNTP UNDAGRNTP
## 1  50168163          R     11597          R         0          R         0
## 2 101800616          R     10555          R        51          R        20
## 3    473411          R      2192          R         0          Z         0
## 4  47009835          R      9814          R       150          R        82
## 5  32218913          R     10199          R         0          Z         0
##   XUNDAGRNTT UNDAGRNTT XUNDAGRNTA UNDAGRNTA XUPGRNTN UPGRNTN XUPGRNTP UPGRNTP
## 1          R         0          A        NA        R    3353        R      64
## 2          R    141666          R      2778        R    4324        R      33
## 3          R         0          A        NA        R     156        R      68
## 4          R    588270          R      3922        R    1613        R      22
## 5          R         0          A        NA        R    2374        R      72
##   XUPGRNTT  UPGRNTT XUPGRNTA UPGRNTA XUDGPGRNTN UDGPGRNTN XUDGPGRNTP UDGPGRNTP
## 1        R 19363202        R    5775          R      3353          R        64
## 2        R 26529097        R    6135          R      4324          R        34
## 3        R   329636        R    2113          R       156          R        68
## 4        R  7561775        R    4688          R      1612          R        23
## 5        R 11648634        R    4907          R      2374          R        72
##   XUDGPGRNTT UDGPGRNTT XUDGPGRNTA UDGPGRNTA XUNDPGRNTN UNDPGRNTN XUNDPGRNTP
## 1          R  19363202          R      5775          R         0          R
## 2          R  26529097          R      6135          R         0          R
## 3          R    329636          R      2113          R         0          Z
## 4          R   7559208          R      4689          R         1          R
## 5          R  11648634          R      4907          R         0          Z
##   UNDPGRNTP XUNDPGRNTT UNDPGRNTT XUNDPGRNTA UNDPGRNTA XUFLOANN UFLOANN XUFLOANP
## 1         0          R         0          A        NA        R    3050        R
## 2         0          R         0          A        NA        R    5339        R
## 3         0          R         0          A        NA        R     165        R
## 4         1          R      2567          R      2567        R    2431        R
## 5         0          R         0          A        NA        R    2563        R
##   UFLOANP XUFLOANT  UFLOANT XUFLOANA UFLOANA XUDGFLOANN UDGFLOANN XUDGFLOANP
## 1      59        R 20538077        R    6734          R      3050          R
## 2      41        R 36242972        R    6788          R      5339          R
## 3      72        R  1317500        R    7985          R       165          R
## 4      34        R 15656787        R    6440          R      2429          R
## 5      78        R 21589661        R    8424          R      2563          R
##   UDGFLOANP XUDGFLOANT UDGFLOANT XUDGFLOANA UDGFLOANA XUNDFLOANN UNDFLOANN
## 1        59          R  20538077          R      6734          R         0
## 2        42          R  36242972          R      6788          R         0
## 3        72          R   1317500          R      7985          R         0
## 4        35          R  15645787          R      6441          R         2
## 5        78          R  21589661          R      8424          R         0
##   XUNDFLOANP UNDFLOANP XUNDFLOANT UNDFLOANT XUNDFLOANA UNDFLOANA XANYAIDN
## 1          R         0          R         0          A        NA        R
## 2          R         0          R         0          A        NA        R
## 3          Z         0          R         0          A        NA        A
## 4          R         1          R     11000          R      5500        R
## 5          Z         0          R         0          A        NA        R
##   ANYAIDN XANYAIDP ANYAIDP XAIDFSIN AIDFSIN XAIDFSIP AIDFSIP XAGRNT_N AGRNT_N
## 1    1359        R      88        R    1359        R      88        R    1323
## 2    2107        R      97        R    2101        R      97        R    2059
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4    1121        R      97        R    1116        R      97        R    1101
## 5     862        R      92        R     862        R      92        R     827
##   XAGRNT_P AGRNT_P XAGRNT_T  AGRNT_T XAGRNT_A AGRNT_A XFGRNT_N FGRNT_N XFGRNT_P
## 1        R      86        R 15362449        R   11612        R    1033        R
## 2        R      95        R 23333975        R   11333        R     891        R
## 3        A      NA        A       NA        A      NA        A      NA        A
## 4        R      95        R  9619087        R    8737        R     279        R
## 5        R      88        R  9073315        R   10971        R     685        R
##   FGRNT_P XFGRNT_T FGRNT_T XFGRNT_A FGRNT_A XPGRNT_N PGRNT_N XPGRNT_P PGRNT_P
## 1      67        R 6366369        R    6163        R    1033        R      67
## 2      41        R 5581018        R    6264        R     891        R      41
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4      24        R 1490308        R    5342        R     279        R      24
## 5      73        R 4041503        R    5900        R     685        R      73
##   XPGRNT_T PGRNT_T XPGRNT_A PGRNT_A XOFGRT_N OFGRT_N XOFGRT_P OFGRT_P XOFGRT_T
## 1        R 6185419        R    5988        R      91        R       6        R
## 2        R 5335268        R    5988        R     177        R       8        R
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        R 1362862        R    4885        R      68        R       6        R
## 5        R 3782171        R    5521        R     265        R      28        R
##   OFGRT_T XOFGRT_A OFGRT_A XSGRNT_N SGRNT_N XSGRNT_P SGRNT_P XSGRNT_T SGRNT_T
## 1  180950        R    1988        R      29        R       2        R   50810
## 2  245750        R    1388        R      95        R       4        R   78369
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4  127446        R    1874        R      13        R       1        R   46582
## 5  259332        R     979        R     143        R      15        R  212371
##   XSGRNT_A SGRNT_A XIGRNT_N IGRNT_N XIGRNT_P IGRNT_P XIGRNT_T  IGRNT_T XIGRNT_A
## 1        R    1752        R    1024        R      66        R  8945270        R
## 2        R     825        R    1915        R      88        R 17674588        R
## 3        A      NA        A      NA        A      NA        A       NA        A
## 4        R    3583        R    1058        R      92        R  8082197        R
## 5        R    1485        R     556        R      59        R  4819441        R
##   IGRNT_A XLOAN_N LOAN_N XLOAN_P LOAN_P XLOAN_T  LOAN_T XLOAN_A LOAN_A XFLOAN_N
## 1    8736       R    887       R     57       R 5664417       R   6386        R
## 2    9230       R   1106       R     51       R 7271857       R   6575        R
## 3      NA       A     NA       A     NA       A      NA       A     NA        A
## 4    7639       R    455       R     39       R 3533785       R   7767        R
## 5    8668       R    640       R     68       R 4088888       R   6389        R
##   FLOAN_N XFLOAN_P FLOAN_P XFLOAN_T FLOAN_T XFLOAN_A FLOAN_A XOLOAN_N OLOAN_N
## 1     880        R      57        R 5322609        R    6048        R      26
## 2    1087        R      50        R 5565993        R    5121        R     124
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4     432        R      37        R 2198883        R    5090        R      97
## 5     635        R      68        R 3666952        R    5775        R      40
##   XOLOAN_P OLOAN_P XOLOAN_T OLOAN_T XOLOAN_A OLOAN_A XGISTN2 GISTN2 XGISTON2
## 1        R       2        R  341808        R   13146       R    712        R
## 2        R       6        R 1705864        R   13757       R   1779        R
## 3        A      NA        A      NA        A      NA       A     NA        A
## 4        R       8        R 1334902        R   13762       R    920        R
## 5        R       4        R  421936        R   10548       R    296        R
##   GISTON2 XGISTWF2 GISTWF2 XGISTOF2 GISTOF2 XGISTUN2 GISTUN2 XGISTT2   GISTT2
## 1     619        R      70        R      22        R       1       R  6598275
## 2    1522        R     145        R      70        R      42       R 18643927
## 3      NA        A      NA        A      NA        A      NA       A       NA
## 4     552        R     209        R      49        R     110       R  6588101
## 5     224        R      24        R      25        R      23       R  2516716
##   XGISTA2 GISTA2 XGISTN1 GISTN1 XGISTON1 GISTON1 XGISTWF1 GISTWF1 XGISTOF1
## 1       R   9267       R    612        R     471        R      78        R
## 2       R  10480       R   1800        R    1535        R     183        R
## 3       A     NA       A     NA        A      NA        A      NA        A
## 4       R   7161       R    919        R     559        R     184        R
## 5       R   8502       R    206        R     168        R      20        R
##   GISTOF1 XGISTUN1 GISTUN1 XGISTT1   GISTT1 XGISTA1 GISTA1 XGISTN0 GISTN0
## 1      63        R       0       R  5180341       R   8465       R    693
## 2      54        R      28       R 17145431       R   9525       R   1585
## 3      NA        A      NA       A       NA       A     NA       A     NA
## 4      44        R     132       R  6987147       R   7603       R    902
## 5      14        R       4       R  1795518       R   8716       R    430
##   XGISTON0 GISTON0 XGISTWF0 GISTWF0 XGISTOF0 GISTOF0 XGISTUN0 GISTUN0 XGISTT0
## 1        R     482        R     158        R      53        R       0       R
## 2        R    1325        R     142        R      56        R      62       R
## 3        A      NA        A      NA        A      NA        A      NA       A
## 4        R     408        R     226        R      52        R     216       R
## 5        R     367        R      26        R      29        R       8       R
##     GISTT0 XGISTA0 GISTA0 XGIS4N2 GIS4N2 XGIS4ON2 GIS4ON2 XGIS4WF2 GIS4WF2
## 1  6240985       R   9006       R    645        R     592        R      34
## 2 15046376       R   9493       R   1260        R    1095        R     112
## 3       NA       A     NA       A     NA        A      NA        A      NA
## 4  6724413       R   7455       R    464        R     298        R     111
## 5  3895022       R   9058       R    291        R     243        R      23
##   XGIS4OF2 GIS4OF2 XGIS4UN2 GIS4UN2 XGIS4G2 GIS4G2 XGIS4T2   GIS4T2 XGIS4A2
## 1        R      19        R       0       R    636       R  5928907       R
## 2        R      53        R       0       R   1217       R 12724450       R
## 3        A      NA        A      NA       A     NA       A       NA       A
## 4        R      35        R      20       R    452       R  3349380       R
## 5        R      25        R       0       R    283       R  2368392       R
##   GIS4A2 XGIS4N12 GIS4N12 XGIS4G12 GIS4G12 XGIS4T12 GIS4T12 XGIS4A12 GIS4A12
## 1   9192        R     328        R     328        R 3271951        R    9975
## 2  10099        R     410        R     410        R 5333964        R   13010
## 3     NA        A      NA        A      NA        A      NA        A      NA
## 4   7218        R      93        R      93        R  983620        R   10577
## 5   8139        R     174        R     173        R 1481759        R    8516
##   XGIS4N22 GIS4N22 XGIS4G22 GIS4G22 XGIS4T22 GIS4T22 XGIS4A22 GIS4A22 XGIS4N32
## 1        R     176        R     176        R 1729702        R    9828        R
## 2        R     210        R     209        R 2509572        R   11950        R
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        R      84        R      84        R  769664        R    9163        R
## 5        R      62        R      62        R  494836        R    7981        R
##   GIS4N32 XGIS4G32 GIS4G32 XGIS4T32 GIS4T32 XGIS4A32 GIS4A32 XGIS4N42 GIS4N42
## 1      94        R      91        R  684364        R    7280        R      21
## 2     216        R     214        R 2055503        R    9516        R     168
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4      79        R      77        R  466937        R    5911        R      76
## 5      33        R      29        R  226797        R    6873        R      15
##   XGIS4G42 GIS4G42 XGIS4T42 GIS4T42 XGIS4A42 GIS4A42 XGIS4N52 GIS4N52 XGIS4G52
## 1        R      18        R   98566        R    4694        R      26        R
## 2        R     149        R 1161418        R    6913        R     256        R
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        R      73        R  447853        R    5893        R     132        R
## 5        R      12        R   93848        R    6257        R       7        R
##   GIS4G52 XGIS4T52 GIS4T52 XGIS4A52 GIS4A52 XGIS4N1 GIS4N1 XGIS4ON1 GIS4ON1
## 1      23        R  144324        R    5551       R    567        R     439
## 2     235        R 1663993        R    6500       R   1264        R    1074
## 3      NA        A      NA        A      NA       A     NA        A      NA
## 4     125        R  681306        R    5161       R    485        R     265
## 5       7        R   71152        R   10165       R    210        R     176
##   XGIS4WF1 GIS4WF1 XGIS4OF1 GIS4OF1 XGIS4UN1 GIS4UN1 XGIS4G1 GIS4G1 XGIS4T1
## 1        R      66        R      62        R       0       R    549       R
## 2        R     141        R      49        R       0       R   1200       R
## 3        A      NA        A      NA        A      NA       A     NA       A
## 4        R     113        R      37        R      70       R    451       R
## 5        R      20        R      14        R       0       R    204       R
##     GIS4T1 XGIS4A1 GIS4A1 XGIS4N11 GIS4N11 XGIS4G11 GIS4G11 XGIS4T11 GIS4T11
## 1  4640730       R   8185        R     311        R     307        R 2787106
## 2 12010869       R   9502        R     426        R     426        R 4957183
## 3       NA       A     NA        A      NA        A      NA        A      NA
## 4  3655336       R   7537        R      98        R      98        R 1163769
## 5  1758424       R   8373        R     115        R     115        R 1015884
##   XGIS4A11 GIS4A11 XGIS4N21 GIS4N21 XGIS4G21 GIS4G21 XGIS4T21 GIS4T21 XGIS4A21
## 1        R    8962        R     143        R     142        R 1235673        R
## 2        R   11637        R     222        R     222        R 2470169        R
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        R   11875        R      66        R      66        R  664885        R
## 5        R    8834        R      46        R      46        R  375439        R
##   GIS4A21 XGIS4N31 GIS4N31 XGIS4G31 GIS4G31 XGIS4T31 GIS4T31 XGIS4A31 GIS4A31
## 1    8641        R      57        R      53        R  378171        R    6635
## 2   11127        R     217        R     210        R 1926464        R    8878
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4   10074        R      94        R      86        R  655627        R    6975
## 5    8162        R      21        R      21        R  176298        R    8395
##   XGIS4N41 GIS4N41 XGIS4G41 GIS4G41 XGIS4T41 GIS4T41 XGIS4A41 GIS4A41 XGIS4N51
## 1        R      29        R      22        R   87821        R    3028        R
## 2        R     145        R     124        R  900763        R    6212        R
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        R      83        R      73        R  406468        R    4897        R
## 5        R      12        R      11        R   89836        R    7486        R
##   GIS4N51 XGIS4G51 GIS4G51 XGIS4T51 GIS4T51 XGIS4A51 GIS4A51 XGIS4N0 GIS4N0
## 1      27        R      25        R  151959        R    5628       R    657
## 2     254        R     218        R 1756290        R    6915       R   1053
## 3      NA        A      NA        A      NA        A      NA       A     NA
## 4     144        R     128        R  764587        R    5310       R    515
## 5      16        R      11        R  100967        R    6310       R    418
##   XGIS4ON0 GIS4ON0 XGIS4WF0 GIS4WF0 XGIS4OF0 GIS4OF0 XGIS4UN0 GIS4UN0 XGIS4G0
## 1        R     457        R     150        R      50        R       0       R
## 2        R     883        R     118        R      50        R       2       R
## 3        A      NA        A      NA        A      NA        A      NA       A
## 4        R     264        R     140        R      35        R      76       R
## 5        R     362        R      25        R      28        R       3       R
##   GIS4G0 XGIS4T0   GIS4T0 XGIS4A0 GIS4A0 XGIS4N10 GIS4N10 XGIS4G10 GIS4G10
## 1    642       R  5825299       R   8867        R     381        R     380
## 2    992       R 10175657       R   9663        R     322        R     321
## 3     NA       A       NA       A     NA        A      NA        A      NA
## 4    492       R  3923644       R   7619        R     141        R     141
## 5    389       R  3488167       R   8345        R     284        R     284
##   XGIS4T10 GIS4T10 XGIS4A10 GIS4A10 XGIS4N20 GIS4N20 XGIS4G20 GIS4G20 XGIS4T20
## 1        R 3713735        R    9747        R     141        R     141        R
## 2        R 3988192        R   12386        R     195        R     195        R
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        R 1440379        R   10215        R      72        R      71        R
## 5        R 2573442        R    9061        R      87        R      81        R
##   GIS4T20 XGIS4A20 GIS4A20 XGIS4N30 GIS4N30 XGIS4G30 GIS4G30 XGIS4T30 GIS4T30
## 1 1357002        R    9624        R      76        R      73        R  487604
## 2 2121276        R   10878        R     161        R     155        R 1513529
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4  736070        R   10223        R      99        R      94        R  623404
## 5  675785        R    7768        R      28        R      23        R  222495
##   XGIS4A30 GIS4A30 XGIS4N40 GIS4N40 XGIS4G40 GIS4G40 XGIS4T40 GIS4T40 XGIS4A40
## 1        R    6416        R      40        R      34        R  208688        R
## 2        R    9401        R     126        R     112        R 1010300        R
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        R    6297        R      84        R      77        R  475706        R
## 5        R    7946        R      12        R       1        R   16445        R
##   GIS4A40 XGIS4N50 GIS4N50 XGIS4G50 GIS4G50 XGIS4T50 GIS4T50 XGIS4A50 GIS4A50
## 1    5217        R      19        R      14        R   58270        R    3067
## 2    8018        R     249        R     209        R 1542360        R    6194
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4    5663        R     119        R     109        R  648085        R    5446
## 5    1370        R       7        R       0        R       0        R       0
##   XNPIST2 NPIST2 XNPIST1 NPIST1 XNPIST0 NPIST0 XNPIS412 NPIS412 XNPIS422
## 1       R  14064       R  14600       R  12921        R   13776        R
## 2       R  17413       R  16834       R  16990        R   14816        R
## 3       A     NA       A     NA       A     NA        A      NA        A
## 4       R  19847       R  18588       R  17302        R   16521        R
## 5       R  13504       R  13176       R  12875        R   13512        R
##   NPIS422 XNPIS432 NPIS432 XNPIS442 NPIS442 XNPIS452 NPIS452 XNPIS411 NPIS411
## 1   13923        R   16471        R   19057        R   18200        R   14205
## 2   15876        R   18310        R   20913        R   21326        R   14620
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4   17935        R   21187        R   21205        R   21937        R   13902
## 5   14047        R   15155        R   15771        R   11863        R   13066
##   XNPIS421 NPIS421 XNPIS431 NPIS431 XNPIS441 NPIS441 XNPIS451 NPIS451 XNPIS410
## 1        R   14526        R   16532        R   20139        R   17539        R
## 2        R   15130        R   17379        R   20045        R   19342        R
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        R   15703        R   18802        R   20880        R   20467        R
## 5        R   13738        R   13505        R   14414        R   15590        R
##   NPIS410 XNPIS420 NPIS420 XNPIS430 NPIS430 XNPIS440 NPIS440 XNPIS450 NPIS450
## 1   12177        R   12300        R   15508        R   16707        R   18857
## 2   13862        R   15370        R   16847        R   18230        R   20054
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4   14654        R   14646        R   18572        R   19206        R   19423
## 5   12877        R   14170        R   13992        R   20568        R   21938
##   XGRNTN2 GRNTN2 XGRNTON2 GRNTON2 XGRNTWF2 GRNTWF2 XGRNTOF2 GRNTOF2 XGRNTUN2
## 1       A     NA        A      NA        A      NA        A      NA        A
## 2       A     NA        A      NA        A      NA        A      NA        A
## 3       A     NA        A      NA        A      NA        A      NA        A
## 4       A     NA        A      NA        A      NA        A      NA        A
## 5       A     NA        A      NA        A      NA        A      NA        A
##   GRNTUN2 XGRNTT2 GRNTT2 XGRNTA2 GRNTA2 XGRNTN1 GRNTN1 XGRNTON1 GRNTON1
## 1      NA       A     NA       A     NA       A     NA        A      NA
## 2      NA       A     NA       A     NA       A     NA        A      NA
## 3      NA       A     NA       A     NA       A     NA        A      NA
## 4      NA       A     NA       A     NA       A     NA        A      NA
## 5      NA       A     NA       A     NA       A     NA        A      NA
##   XGRNTWF1 GRNTWF1 XGRNTOF1 GRNTOF1 XGRNTUN1 GRNTUN1 XGRNTT1 GRNTT1 XGRNTA1
## 1        A      NA        A      NA        A      NA       A     NA       A
## 2        A      NA        A      NA        A      NA       A     NA       A
## 3        A      NA        A      NA        A      NA       A     NA       A
## 4        A      NA        A      NA        A      NA       A     NA       A
## 5        A      NA        A      NA        A      NA       A     NA       A
##   GRNTA1 XGRNTN0 GRNTN0 XGRNTON0 GRNTON0 XGRNTWF0 GRNTWF0 XGRNTOF0 GRNTOF0
## 1     NA       A     NA        A      NA        A      NA        A      NA
## 2     NA       A     NA        A      NA        A      NA        A      NA
## 3     NA       A     NA        A      NA        A      NA        A      NA
## 4     NA       A     NA        A      NA        A      NA        A      NA
## 5     NA       A     NA        A      NA        A      NA        A      NA
##   XGRNTUN0 GRNTUN0 XGRNTT0 GRNTT0 XGRNTA0 GRNTA0 XGRN4N2 GRN4N2 XGRN4ON2
## 1        A      NA       A     NA       A     NA       A     NA        A
## 2        A      NA       A     NA       A     NA       A     NA        A
## 3        A      NA       A     NA       A     NA       A     NA        A
## 4        A      NA       A     NA       A     NA       A     NA        A
## 5        A      NA       A     NA       A     NA       A     NA        A
##   GRN4ON2 XGRN4WF2 GRN4WF2 XGRN4OF2 GRN4OF2 XGRN4UN2 GRN4UN2 XGRN4G2 GRN4G2
## 1      NA        A      NA        A      NA        A      NA       A     NA
## 2      NA        A      NA        A      NA        A      NA       A     NA
## 3      NA        A      NA        A      NA        A      NA       A     NA
## 4      NA        A      NA        A      NA        A      NA       A     NA
## 5      NA        A      NA        A      NA        A      NA       A     NA
##   XGRN4T2 GRN4T2 XGRN4A2 GRN4A2 XGRN4N12 GRN4N12 XGRN4G12 GRN4G12 XGRN4T12
## 1       A     NA       A     NA        A      NA        A      NA        A
## 2       A     NA       A     NA        A      NA        A      NA        A
## 3       A     NA       A     NA        A      NA        A      NA        A
## 4       A     NA       A     NA        A      NA        A      NA        A
## 5       A     NA       A     NA        A      NA        A      NA        A
##   GRN4T12 XGRN4A12 GRN4A12 XGRN4N22 GRN4N22 XGRN4G22 GRN4G22 XGRN4T22 GRN4T22
## 1      NA        A      NA        A      NA        A      NA        A      NA
## 2      NA        A      NA        A      NA        A      NA        A      NA
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4      NA        A      NA        A      NA        A      NA        A      NA
## 5      NA        A      NA        A      NA        A      NA        A      NA
##   XGRN4A22 GRN4A22 XGRN4N32 GRN4N32 XGRN4G32 GRN4G32 XGRN4T32 GRN4T32 XGRN4A32
## 1        A      NA        A      NA        A      NA        A      NA        A
## 2        A      NA        A      NA        A      NA        A      NA        A
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        A      NA        A      NA        A      NA        A      NA        A
## 5        A      NA        A      NA        A      NA        A      NA        A
##   GRN4A32 XGRN4N42 GRN4N42 XGRN4G42 GRN4G42 XGRN4T42 GRN4T42 XGRN4A42 GRN4A42
## 1      NA        A      NA        A      NA        A      NA        A      NA
## 2      NA        A      NA        A      NA        A      NA        A      NA
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4      NA        A      NA        A      NA        A      NA        A      NA
## 5      NA        A      NA        A      NA        A      NA        A      NA
##   XGRN4N52 GRN4N52 XGRN4G52 GRN4G52 XGRN4T52 GRN4T52 XGRN4A52 GRN4A52 XGRN4N1
## 1        A      NA        A      NA        A      NA        A      NA       A
## 2        A      NA        A      NA        A      NA        A      NA       A
## 3        A      NA        A      NA        A      NA        A      NA       A
## 4        A      NA        A      NA        A      NA        A      NA       A
## 5        A      NA        A      NA        A      NA        A      NA       A
##   GRN4N1 XGRN4ON1 GRN4ON1 XGRN4WF1 GRN4WF1 XGRN4OF1 GRN4OF1 XGRN4UN1 GRN4UN1
## 1     NA        A      NA        A      NA        A      NA        A      NA
## 2     NA        A      NA        A      NA        A      NA        A      NA
## 3     NA        A      NA        A      NA        A      NA        A      NA
## 4     NA        A      NA        A      NA        A      NA        A      NA
## 5     NA        A      NA        A      NA        A      NA        A      NA
##   XGRN4G1 GRN4G1 XGRN4T1 GRN4T1 XGRN4A1 GRN4A1 XGRN4N11 GRN4N11 XGRN4G11
## 1       A     NA       A     NA       A     NA        A      NA        A
## 2       A     NA       A     NA       A     NA        A      NA        A
## 3       A     NA       A     NA       A     NA        A      NA        A
## 4       A     NA       A     NA       A     NA        A      NA        A
## 5       A     NA       A     NA       A     NA        A      NA        A
##   GRN4G11 XGRN4T11 GRN4T11 XGRN4A11 GRN4A11 XGRN4N21 GRN4N21 XGRN4G21 GRN4G21
## 1      NA        A      NA        A      NA        A      NA        A      NA
## 2      NA        A      NA        A      NA        A      NA        A      NA
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4      NA        A      NA        A      NA        A      NA        A      NA
## 5      NA        A      NA        A      NA        A      NA        A      NA
##   XGRN4T21 GRN4T21 XGRN4A21 GRN4A21 XGRN4N31 GRN4N31 XGRN4G31 GRN4G31 XGRN4T31
## 1        A      NA        A      NA        A      NA        A      NA        A
## 2        A      NA        A      NA        A      NA        A      NA        A
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        A      NA        A      NA        A      NA        A      NA        A
## 5        A      NA        A      NA        A      NA        A      NA        A
##   GRN4T31 XGRN4A31 GRN4A31 XGRN4N41 GRN4N41 XGRN4G41 GRN4G41 XGRN4T41 GRN4T41
## 1      NA        A      NA        A      NA        A      NA        A      NA
## 2      NA        A      NA        A      NA        A      NA        A      NA
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4      NA        A      NA        A      NA        A      NA        A      NA
## 5      NA        A      NA        A      NA        A      NA        A      NA
##   XGRN4A41 GRN4A41 XGRN4N51 GRN4N51 XGRN4G51 GRN4G51 XGRN4T51 GRN4T51 XGRN4A51
## 1        A      NA        A      NA        A      NA        A      NA        A
## 2        A      NA        A      NA        A      NA        A      NA        A
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        A      NA        A      NA        A      NA        A      NA        A
## 5        A      NA        A      NA        A      NA        A      NA        A
##   GRN4A51 XGRN4N0 GRN4N0 XGRN4ON0 GRN4ON0 XGRN4WF0 GRN4WF0 XGRN4OF0 GRN4OF0
## 1      NA       A     NA        A      NA        A      NA        A      NA
## 2      NA       A     NA        A      NA        A      NA        A      NA
## 3      NA       A     NA        A      NA        A      NA        A      NA
## 4      NA       A     NA        A      NA        A      NA        A      NA
## 5      NA       A     NA        A      NA        A      NA        A      NA
##   XGRN4UN0 GRN4UN0 XGRN4G0 GRN4G0 XGRN4T0 GRN4T0 XGRN4A0 GRN4A0 XGRN4N10
## 1        A      NA       A     NA       A     NA       A     NA        A
## 2        A      NA       A     NA       A     NA       A     NA        A
## 3        A      NA       A     NA       A     NA       A     NA        A
## 4        A      NA       A     NA       A     NA       A     NA        A
## 5        A      NA       A     NA       A     NA       A     NA        A
##   GRN4N10 XGRN4G10 GRN4G10 XGRN4T10 GRN4T10 XGRN4A10 GRN4A10 XGRN4N20 GRN4N20
## 1      NA        A      NA        A      NA        A      NA        A      NA
## 2      NA        A      NA        A      NA        A      NA        A      NA
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4      NA        A      NA        A      NA        A      NA        A      NA
## 5      NA        A      NA        A      NA        A      NA        A      NA
##   XGRN4G20 GRN4G20 XGRN4T20 GRN4T20 XGRN4A20 GRN4A20 XGRN4N30 GRN4N30 XGRN4G30
## 1        A      NA        A      NA        A      NA        A      NA        A
## 2        A      NA        A      NA        A      NA        A      NA        A
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        A      NA        A      NA        A      NA        A      NA        A
## 5        A      NA        A      NA        A      NA        A      NA        A
##   GRN4G30 XGRN4T30 GRN4T30 XGRN4A30 GRN4A30 XGRN4N40 GRN4N40 XGRN4G40 GRN4G40
## 1      NA        A      NA        A      NA        A      NA        A      NA
## 2      NA        A      NA        A      NA        A      NA        A      NA
## 3      NA        A      NA        A      NA        A      NA        A      NA
## 4      NA        A      NA        A      NA        A      NA        A      NA
## 5      NA        A      NA        A      NA        A      NA        A      NA
##   XGRN4T40 GRN4T40 XGRN4A40 GRN4A40 XGRN4N50 GRN4N50 XGRN4G50 GRN4G50 XGRN4T50
## 1        A      NA        A      NA        A      NA        A      NA        A
## 2        A      NA        A      NA        A      NA        A      NA        A
## 3        A      NA        A      NA        A      NA        A      NA        A
## 4        A      NA        A      NA        A      NA        A      NA        A
## 5        A      NA        A      NA        A      NA        A      NA        A
##   GRN4T50 XGRN4A50 GRN4A50 XNPGRN2 NPGRN2 XNPGRN1 NPGRN1 XNPGRN0 NPGRN0 XNPT412
## 1      NA        A      NA       A     NA       A     NA       A     NA       A
## 2      NA        A      NA       A     NA       A     NA       A     NA       A
## 3      NA        A      NA       A     NA       A     NA       A     NA       A
## 4      NA        A      NA       A     NA       A     NA       A     NA       A
## 5      NA        A      NA       A     NA       A     NA       A     NA       A
##   NPT412 XNPT422 NPT422 XNPT432 NPT432 XNPT442 NPT442 XNPT452 NPT452 XNPT411
## 1     NA       A     NA       A     NA       A     NA       A     NA       A
## 2     NA       A     NA       A     NA       A     NA       A     NA       A
## 3     NA       A     NA       A     NA       A     NA       A     NA       A
## 4     NA       A     NA       A     NA       A     NA       A     NA       A
## 5     NA       A     NA       A     NA       A     NA       A     NA       A
##   NPT411 XNPT421 NPT421 XNPT431 NPT431 XNPT441 NPT441 XNPT451 NPT451 XNPT410
## 1     NA       A     NA       A     NA       A     NA       A     NA       A
## 2     NA       A     NA       A     NA       A     NA       A     NA       A
## 3     NA       A     NA       A     NA       A     NA       A     NA       A
## 4     NA       A     NA       A     NA       A     NA       A     NA       A
## 5     NA       A     NA       A     NA       A     NA       A     NA       A
##   NPT410 XNPT420 NPT420 XNPT430 NPT430 XNPT440 NPT440 XNPT450 NPT450
## 1     NA       A     NA       A     NA       A     NA       A     NA
## 2     NA       A     NA       A     NA       A     NA       A     NA
## 3     NA       A     NA       A     NA       A     NA       A     NA
## 4     NA       A     NA       A     NA       A     NA       A     NA
## 5     NA       A     NA       A     NA       A     NA       A     NA

Importing the institutional characteristics

ipeds_inst2223 <- read.csv('https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/hd2023.csv')

How many rows and columns are included in the Inst Characteristic Dataset?

dim(ipeds_inst2223)
## [1] 6163   73

Takeaway: 6,163 rows and 73 columns in the 22-23 IPEDS Institutional Characteristic Dataset

What does the data look like?

head(ipeds_inst2223,5)
##   UNITID                              INSTNM
## 1 100654            Alabama A & M University
## 2 100663 University of Alabama at Birmingham
## 3 100690                  Amridge University
## 4 100706 University of Alabama in Huntsville
## 5 100724            Alabama State University
##                                              IALIAS
## 1                                              AAMU
## 2                                               UAB
## 3 Southern Christian University  Regions University
## 4             UAH  University of Alabama Huntsville
## 5                                                  
##                             ADDR       CITY STABBR        ZIP FIPS OBEREG
## 1           4900 Meridian Street     Normal     AL      35762    1      5
## 2 Administration Bldg Suite 1070 Birmingham     AL 35294-0110    1      5
## 3                 1200 Taylor Rd Montgomery     AL 36117-3553    1      5
## 4                301 Sparkman Dr Huntsville     AL      35899    1      5
## 5           915 S Jackson Street Montgomery     AL 36104-0271    1      5
##                CHFNM  CHFTITLE      GENTELE       EIN         UEIS   OPEID
## 1 Dr. Daniel K. Wims President 2.563725e+09 636001109 JDVGS67MSLH7  100200
## 2       Ray L. Watts President 2.059344e+09 636005396 YND4PLMC9AN7  105200
## 3   Michael C.Turner President 3.343874e+13 237034324 RB27R4GLDKE7 2503400
## 4         Chuck Karr President 2.568246e+09 630520830 HB6KNGVNJRU1  105500
## 5    Quinton T. Ross President 3.342294e+09 636001101 DLJWLMSNK627  100500
##   OPEFLAG                            WEBADDR
## 1       1                      www.aamu.edu/
## 2       1               https://www.uab.edu/
## 3       1 https://www.amridgeuniversity.edu/
## 4       1                       www.uah.edu/
## 5       1                     www.alasu.edu/
##                                         ADMINURL
## 1 https://www.aamu.edu/admissions-aid/index.html
## 2                https://www.uab.edu/admissions/
## 3  https://www.amridgeuniversity.edu/admissions/
## 4                 https://www.uah.edu/admissions
## 5            www.alasu.edu/admissions/index.aspx
##                                                                          FAIDURL
## 1                             https://www.aamu.edu/admissions-aid/financial-aid/
## 2                                                  https://www.uab.edu/cost-aid/
## 3                                https://www.amridgeuniversity.edu/financialaid/
## 4                                                                finaid.uah.edu/
## 5 www.alasu.edu/undergraduate/expensesandfinancialaid/expenses-and-financial-aid
##                                                                         APPLURL
## 1 https://www.aamu.edu/admissions-aid/undergraduate-admissions/apply-today.html
## 2                                          https://www.uab.edu/admissions/apply
## 3                                  https://www.amridgeuniversity.edu/myamridge/
## 4                                                             register.uah.edu/
## 5                      www.alasu.edu/admissions/undergrad-admissions/index.aspx
##                                                                          NPRICURL
## 1              www.aamu.edu/admissions-aid/tuition-fees/net-price-calculator.html
## 2 https://tcc.ruffalonl.com/University of Alabama at Birmingham/Freshman-Students
## 3                                        https://www2.amridgeuniversity.edu:9091/
## 4                                                                 finaid.uah.edu/
## 5                       www.alasu.edu/cost-aid/tuition-costs/net-price-calculator
##                                                   VETURL
## 1                                                       
## 2                  https://www.uab.edu/students/veterans
## 3 https://www.amridgeuniversity.edu/admissions/military/
## 4 www.uah.edu/admissions/graduate/financial-aid/veterans
## 5                                                       
##                                                      ATHURL
## 1                                                          
## 2                    https://www.uab.edu/registrar/students
## 3                                                          
## 4                                          www.uah.edu/heoa
## 5 https://www.alasu.edu/administration/consumer-information
##                                                                          DISAURL
## 1 https://www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Services.aspx
## 2                                       https://www.uab.edu/students/disability/
## 3                             https://www.amridgeuniversity.edu/studentservices/
## 4                             www.uah.edu/health-and-wellness/disability-support
## 5 www.alasu.edu/academics/researchcenters/alabama-alliance-students-disabilities
##   SECTOR ICLEVEL CONTROL HLOFFER UGOFFER GROFFER HDEGOFR1 DEGGRANT HBCU
## 1      1       1       1       9       1       1       12        1    1
## 2      1       1       1       9       1       1       11        1    2
## 3      2       1       2       9       1       1       12        1    2
## 4      1       1       1       9       1       1       11        1    2
## 5      1       1       1       9       1       1       11        1    1
##   HOSPITAL MEDICAL TRIBAL LOCALE OPENPUBL ACT NEWID DEATHYR CLOSEDAT CYACTIVE
## 1        2       2      2     12        1  A     -2      -2       -2        1
## 2        1       1      2     12        1  A     -2      -2       -2        1
## 3        2       2      2     12        1  A     -2      -2       -2        1
## 4        2       2      2     12        1  A     -2      -2       -2        1
## 5        2       2      2     12        1  A     -2      -2       -2        1
##   POSTSEC PSEFLAG PSET4FLG RPTMTH INSTCAT C21BASIC C21IPUG C21IPGRD C21UGPRF
## 1       1       1        1      1       2       18      16       18       10
## 2       1       1        1      1       2       15      14       14        9
## 3       1       1        1      1       2       20      20       18        5
## 4       1       1        1      1       2       15      17       17       15
## 5       1       1        1      1       2       17      13       18       10
##   C21ENPRF C21SZSET C18BASIC C15BASIC CCBASIC CARNEGIE LANDGRNT INSTSIZE
## 1        4       14       18       18      18       16        1        3
## 2        5       15       15       15      15       15        2        5
## 3        6        6       20       20      21       51        2        1
## 4        4       13       16       16      15       16        2        3
## 5        3       14       19       19      18       21        2        2
##   F1SYSTYP                         F1SYSNAM F1SYSCOD  CBSA CBSATYPE CSA
## 1        2                               -2       -2 26620        1 290
## 2        1 The University of Alabama System   101050 13820        1 142
## 3        2                               -2       -2 33860        1 388
## 4        1 The University of Alabama System   101050 26620        1 290
## 5        2                               -2       -2 33860        1 388
##   COUNTYCD          COUNTYNM CNGDSTCD  LONGITUD LATITUDE DFRCGID DFRCUSCG
## 1     1089    Madison County      105 -86.56850 34.78337     106        1
## 2     1073  Jefferson County      107 -86.79935 33.50570      91        1
## 3     1101 Montgomery County      102 -86.17401 32.36261     124        2
## 4     1089    Madison County      105 -86.64045 34.72456      91        1
## 5     1101 Montgomery County      107 -86.29568 32.36432      97        1

Importing the graduation rate data

ipeds_grad2223 <- read.csv('https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/2223_IPEDS_CUSTOM_DATA_FILE.csv')

How many rows and columns are included in the Grad Rate Dataset?

dim(ipeds_grad2223)
## [1] 1563   38

Takeaway: 1,563 rows and 38 columns in the 22-23 IPEDS Dataset

3 - Cleaning The Data

Using the data dictionary provided by IPEDS, I will first trim down each dataframe to the columns I intend to use for the project.

Trimming down the FA columns

Any counts below are for undergraduate, degree seeking students.

fa_clean <- ipeds_fa2223 |> 
  select(
    UNITID,                        # Inst ID
    tot_degr_ugrd    = SCUGDGSK,   # Tot Nbr Degr Seeking Ugrd
    nbr_awd_any_aid  = UDGAGRNTN,  # Nbr Ugrd Degr Awarded Any Aid
    tot_aid_awd      = UDGAGRNTT,  # Total aid awarded
    nbr_awd_pell_aid = UDGPGRNTN,  # Nbr Stdnts awarded Pell Grant
    tot_pell_awd     = UDGPGRNTT,  # Total Pell awarded
    nbr_awd_loans    = UDGFLOANN,  # Nbr Stdnts awarded Loans
    tot_loans_awd    = UDGFLOANT,  # Total Loans awarded
     ) |>
  mutate(
    # Creating columns that show what percent of degree seeking undergraduate 
    # students received each type of aid
    perc_awd_any_aid = round(nbr_awd_any_aid  / tot_degr_ugrd , 2),
    perc_awd_pell    = round(nbr_awd_pell_aid / tot_degr_ugrd , 2),
    perc_awd_loans   = round(nbr_awd_loans    / tot_degr_ugrd , 2),
    # Creating columns that show the average aid package a student received
    avg_awd_any_aid  = round(tot_aid_awd      / nbr_awd_any_aid, 2),
    avg_awd_pell     = round(tot_pell_awd     / nbr_awd_pell_aid, 2),
    avg_awd_loans    = round(tot_loans_awd    / nbr_awd_loans, 2)
  )
head(fa_clean,5)
##   UNITID tot_degr_ugrd nbr_awd_any_aid tot_aid_awd nbr_awd_pell_aid
## 1 100654          5201            4326    50168163             3353
## 2 100663         12776            9645   101800616             4324
## 3 100690           228             216      473411              156
## 4 100706          6985            4790    47009835             1612
## 5 100724          3296            3159    32218913             2374
##   tot_pell_awd nbr_awd_loans tot_loans_awd perc_awd_any_aid perc_awd_pell
## 1     19363202          3050      20538077             0.83          0.64
## 2     26529097          5339      36242972             0.75          0.34
## 3       329636           165       1317500             0.95          0.68
## 4      7559208          2429      15645787             0.69          0.23
## 5     11648634          2563      21589661             0.96          0.72
##   perc_awd_loans avg_awd_any_aid avg_awd_pell avg_awd_loans
## 1           0.59        11596.89      5774.89       6733.80
## 2           0.42        10554.76      6135.31       6788.34
## 3           0.72         2191.72      2113.05       7984.85
## 4           0.35         9814.16      4689.33       6441.25
## 5           0.78        10199.09      4906.75       8423.59

Selecting the columns from the institutional characteristics

This table will later be used to with joins to limit which colleges are included in the model

inst_clean <- ipeds_inst2223 |> 
  filter(
    SECTOR    %in%   c(1, 4),   # To limit public 2 year and 4 year
    OBEREG     !=    9,         # To filter out territories that are not the 50 states and DC
    ICLEVEL   %in%   c(1, 2),   # Keep 2 and 4 year colleges
    UGOFFER    ==     1,        # To limit that they offer undergraduate degrees
    DEGGRANT   ==     1,        # Indicates the inst is degr granting
    OPENPUBL   ==     1         # Denotes the college is open to the public
  ) |>
  select(
    UNITID,
    name       =      INSTNM,
    state      =      STABBR,
    sector     =      SECTOR  
  ) |>
  # This will make it easier to interpret the values 
  mutate(sector = case_when(
      sector == 1 ~ "Public, 4‑Year",
      sector == 4 ~ "Public, 2‑Year"
    ))

head(inst_clean,5)
##   UNITID                                name state         sector
## 1 100654            Alabama A & M University    AL Public, 4‑Year
## 2 100663 University of Alabama at Birmingham    AL Public, 4‑Year
## 3 100706 University of Alabama in Huntsville    AL Public, 4‑Year
## 4 100724            Alabama State University    AL Public, 4‑Year
## 5 100751           The University of Alabama    AL Public, 4‑Year

Cleaning the graduation rate dataset

grad_clean <- ipeds_grad2223 |>
  select(
    unitid,
    tot_grad_rt = `DRVGR2023.Graduation.rate..total.cohort`
  ) |>
  mutate(
    # convert 0–100 percentages → 0–1 proportions which matches what was done in FA
    tot_grad_rt = tot_grad_rt / 100
  ) |>
  filter(
    !is.na(tot_grad_rt)
  )

head(grad_clean,5)
##   unitid tot_grad_rt
## 1 100654        0.29
## 2 100663        0.63
## 3 100706        0.62
## 4 100724        0.30
## 5 100751        0.74

Now the tables should be joined to create a comprehensive dataset for our regression model

full_data <- inst_clean |>
  inner_join(grad_clean, by = c("UNITID" = "unitid")) |>
  inner_join(fa_clean,   by = "UNITID")

head(full_data)
##   UNITID                                name state         sector tot_grad_rt
## 1 100654            Alabama A & M University    AL Public, 4‑Year        0.29
## 2 100663 University of Alabama at Birmingham    AL Public, 4‑Year        0.63
## 3 100706 University of Alabama in Huntsville    AL Public, 4‑Year        0.62
## 4 100724            Alabama State University    AL Public, 4‑Year        0.30
## 5 100751           The University of Alabama    AL Public, 4‑Year        0.74
## 6 100760   Central Alabama Community College    AL Public, 2‑Year        0.24
##   tot_degr_ugrd nbr_awd_any_aid tot_aid_awd nbr_awd_pell_aid tot_pell_awd
## 1          5201            4326    50168163             3353     19363202
## 2         12776            9645   101800616             4324     26529097
## 3          6985            4790    47009835             1612      7559208
## 4          3296            3159    32218913             2374     11648634
## 5         31360           21244   333218271             5834     30350429
## 6           963             666     3442170              502      2275601
##   nbr_awd_loans tot_loans_awd perc_awd_any_aid perc_awd_pell perc_awd_loans
## 1          3050      20538077             0.83          0.64           0.59
## 2          5339      36242972             0.75          0.34           0.42
## 3          2429      15645787             0.69          0.23           0.35
## 4          2563      21589661             0.96          0.72           0.78
## 5         10896      67924516             0.68          0.19           0.35
## 6           283       1058559             0.69          0.52           0.29
##   avg_awd_any_aid avg_awd_pell avg_awd_loans
## 1        11596.89      5774.89       6733.80
## 2        10554.76      6135.31       6788.34
## 3         9814.16      4689.33       6441.25
## 4        10199.09      4906.75       8423.59
## 5        15685.29      5202.34       6233.89
## 6         5168.42      4533.07       3740.49

what shape did the combined data take?

dim(full_data)
## [1] 1536   18

There are 1,536 rows and 18 columns

What do the variables look like?

summary(full_data)
##      UNITID           name              state              sector         
##  Min.   :100654   Length:1536        Length:1536        Length:1536       
##  1st Qu.:148194   Class :character   Class :character   Class :character  
##  Median :185386   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :194418                                                           
##  3rd Qu.:218726                                                           
##  Max.   :498571                                                           
##                                                                           
##   tot_grad_rt     tot_degr_ugrd   nbr_awd_any_aid  tot_aid_awd       
##  Min.   :0.0000   Min.   :   23   Min.   :   23   Min.   :   264791  
##  1st Qu.:0.2900   1st Qu.: 1480   1st Qu.: 1063   1st Qu.:  6133458  
##  Median :0.3800   Median : 3494   Median : 2266   Median : 13582987  
##  Mean   :0.4072   Mean   : 6441   Mean   : 4296   Mean   : 34632662  
##  3rd Qu.:0.5100   3rd Qu.: 7754   3rd Qu.: 5130   3rd Qu.: 33574125  
##  Max.   :1.0000   Max.   :64778   Max.   :50410   Max.   :625717141  
##                                                                      
##  nbr_awd_pell_aid  tot_pell_awd       nbr_awd_loans     tot_loans_awd      
##  Min.   :   12    Min.   :    55293   Min.   :    0.0   Min.   :        0  
##  1st Qu.:  649    1st Qu.:  2949220   1st Qu.:  179.8   1st Qu.:   976710  
##  Median : 1316    Median :  6193948   Median :  618.0   Median :  3464485  
##  Mean   : 2266    Mean   : 11096009   Mean   : 1681.2   Mean   : 10483300  
##  3rd Qu.: 2871    3rd Qu.: 13498318   3rd Qu.: 2053.2   3rd Qu.: 12670293  
##  Max.   :21132    Max.   :120317715   Max.   :18597.0   Max.   :123504447  
##                                                                            
##  perc_awd_any_aid perc_awd_pell    perc_awd_loans   avg_awd_any_aid
##  Min.   :0.1100   Min.   :0.0500   Min.   :0.0000   Min.   : 1689  
##  1st Qu.:0.6000   1st Qu.:0.3100   1st Qu.:0.0700   1st Qu.: 4841  
##  Median :0.7100   Median :0.4000   Median :0.2500   Median : 6057  
##  Mean   :0.7055   Mean   :0.4068   Mean   :0.2616   Mean   : 6882  
##  3rd Qu.:0.8200   3rd Qu.:0.5000   3rd Qu.:0.4200   3rd Qu.: 8427  
##  Max.   :1.0000   Max.   :0.9300   Max.   :0.8600   Max.   :26499  
##                                                                    
##   avg_awd_pell  avg_awd_loans  
##  Min.   :1824   Min.   : 1075  
##  1st Qu.:4213   1st Qu.: 5388  
##  Median :4790   Median : 6043  
##  Mean   :4811   Mean   : 5910  
##  3rd Qu.:5253   3rd Qu.: 6516  
##  Max.   :8791   Max.   :11902  
##                 NA's   :144

Below I do a check to see if there are NAs in the newly combined dataset

missing_summary <- full_data |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(full_data), 2)) |>
  filter(pct_missing > 0) |>
  arrange(desc(pct_missing))

missing_summary
## # A tibble: 1 × 3
##   variable      n_missing pct_missing
##   <chr>             <int>       <dbl>
## 1 avg_awd_loans       144        9.38

Are these just colleges where no loans were awarded?

full_data |> 
  filter(is.na(avg_awd_loans)) |>               
  summarise(
    all_zero = all(nbr_awd_loans == 0 | is.na(nbr_awd_loans)),
    min_tot_loans = min(tot_loans_awd, na.rm = TRUE),
    max_tot_loans = max(tot_loans_awd, na.rm = TRUE)
  )
##   all_zero min_tot_loans max_tot_loans
## 1     TRUE             0             0

This presents a challenge because we hit a point where we must decide whether to impute values or delete the rows. To simply replace these values with $0 may bias the analysis as the fact the loan column was NA could be an indication that these colleges do not offer loans, this is an indication of access to loans. Deleting these rows however also introduces its own bias as now I am limiting the analysis to colleges that offered loans. What I will do is create a variable for “offered_loans” and then replace these missing values with 0.

full_data <- full_data %>%
  mutate(
    offered_loans     = nbr_awd_loans > 0,
    avg_awd_loans = if_else(offered_loans,
                            tot_loans_awd / nbr_awd_loans,
                            0)
  )
head(full_data,5)
##   UNITID                                name state         sector tot_grad_rt
## 1 100654            Alabama A & M University    AL Public, 4‑Year        0.29
## 2 100663 University of Alabama at Birmingham    AL Public, 4‑Year        0.63
## 3 100706 University of Alabama in Huntsville    AL Public, 4‑Year        0.62
## 4 100724            Alabama State University    AL Public, 4‑Year        0.30
## 5 100751           The University of Alabama    AL Public, 4‑Year        0.74
##   tot_degr_ugrd nbr_awd_any_aid tot_aid_awd nbr_awd_pell_aid tot_pell_awd
## 1          5201            4326    50168163             3353     19363202
## 2         12776            9645   101800616             4324     26529097
## 3          6985            4790    47009835             1612      7559208
## 4          3296            3159    32218913             2374     11648634
## 5         31360           21244   333218271             5834     30350429
##   nbr_awd_loans tot_loans_awd perc_awd_any_aid perc_awd_pell perc_awd_loans
## 1          3050      20538077             0.83          0.64           0.59
## 2          5339      36242972             0.75          0.34           0.42
## 3          2429      15645787             0.69          0.23           0.35
## 4          2563      21589661             0.96          0.72           0.78
## 5         10896      67924516             0.68          0.19           0.35
##   avg_awd_any_aid avg_awd_pell avg_awd_loans offered_loans
## 1        11596.89      5774.89      6733.796          TRUE
## 2        10554.76      6135.31      6788.345          TRUE
## 3         9814.16      4689.33      6441.246          TRUE
## 4        10199.09      4906.75      8423.590          TRUE
## 5        15685.29      5202.34      6233.895          TRUE

Now we run the missing value check again, and see that we have handled the missing values.

missing_summary <- full_data |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(full_data), 2)) |>
  filter(pct_missing > 0) |>
  arrange(desc(pct_missing))

missing_summary
## # A tibble: 0 × 3
## # ℹ 3 variables: variable <chr>, n_missing <int>, pct_missing <dbl>

4 - Exploratory Data Analysis

Now I want to dive into what kind of story this data tells.

To start off, what kind of information do we see at the sector level? 2-Year vs 4-Year

sector_summary <- full_data |>
  group_by(sector) |>             
  summarise(
    nbr_colleges = n(), 
    avg_fin_aid_pkg  = round(mean(avg_awd_any_aid, na.rm = TRUE),2),
    med_fin_aid_pkg  = round(median(avg_awd_any_aid, na.rm = TRUE),2),
    avg_grad_rate = round(mean(tot_grad_rt,na.rm = TRUE),2),
    med_grad_rate = round(median(tot_grad_rt,na.rm = TRUE),2)
  )

sector_summary
## # A tibble: 2 × 6
##   sector         nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate
##   <chr>                 <int>           <dbl>           <dbl>         <dbl>
## 1 Public, 2‑Year          786           5291.           5135.          0.35
## 2 Public, 4‑Year          750           8551.           8351.          0.47
## # ℹ 1 more variable: med_grad_rate <dbl>

Takeaway: There are 36 more 2-Year institutions included in the data. 4-Year colleges on average award more in aid pakcages to a student, and have a higher graduation rate

What are the 5 colleges that provide aid to the most students (% of students offered any aid)?

full_data |>
  select(UNITID, name,tot_degr_ugrd,tot_grad_rt, perc_awd_any_aid,avg_awd_any_aid) |>
  arrange(desc (perc_awd_any_aid),desc(avg_awd_any_aid)) |>
  slice_head(n = 5) 
##   UNITID                                 name tot_degr_ugrd tot_grad_rt
## 1 215309  University of Pittsburgh-Titusville            23        0.00
## 2 380377             Northwest Indian College           595        0.27
## 3 163338 University of Maryland Eastern Shore          1874        0.35
## 4 219277                Oglala Lakota College          1110        0.11
## 5 199926             Wilkes Community College          1250        0.52
##   perc_awd_any_aid avg_awd_any_aid
## 1                1        11512.65
## 2                1         9116.75
## 3                1         8679.81
## 4                1         8355.33
## 5                1         8335.52

We see that there are colleges where 100% of the students included in the 2023 data were awarded some type of aid. This is interesting because it raises the question of whether these are true outliers, data errors, or just a representation of reality. It also points out a limitation of the data, as its possible that the aid being provided is loans or unique institutional scholarships. I will be removing the college where only 23 students were reported for the college as the sample size seems too small and their 100% of students receiving aid can skew the data.

full_data <- full_data |>
  filter(UNITID != 215309)

What are the 5 colleges that provide aid to the least students (% of students offered any aid)?

full_data |>
  select(UNITID, name,tot_degr_ugrd,tot_grad_rt, perc_awd_any_aid,avg_awd_any_aid) |>
  arrange(perc_awd_any_aid,avg_awd_any_aid) |>
  slice_head(n = 5) 
##   UNITID                                       name tot_degr_ugrd tot_grad_rt
## 1 197027      United States Merchant Marine Academy           931        0.81
## 2 187976 University of New Mexico-Los Alamos Campus           867        0.30
## 3 204440          Northwest State Community College          1028        0.47
## 4 234669                           Bellevue College          7823        0.33
## 5 240125          Waukesha County Technical College          4180        0.46
##   perc_awd_any_aid avg_awd_any_aid
## 1             0.11         4509.29
## 2             0.19         3408.54
## 3             0.21         5010.46
## 4             0.25         6894.19
## 5             0.26         4169.63

What are the 5 colleges that provided the most average $ amount in aid packaged to a student?

full_data |>
  select(UNITID, name,tot_degr_ugrd,tot_grad_rt, perc_awd_any_aid,avg_awd_any_aid) |>
  arrange(desc(avg_awd_any_aid),desc(perc_awd_any_aid)) |>
  slice_head(n = 5) 
##   UNITID                               name tot_degr_ugrd tot_grad_rt
## 1 234076 University of Virginia-Main Campus         17125        0.95
## 2 170976   University of Michigan-Ann Arbor         32448        0.93
## 3 445188    University of California-Merced          8343        0.69
## 4 110635  University of California-Berkeley         32475        0.93
## 5 110680 University of California-San Diego         33092        0.88
##   perc_awd_any_aid avg_awd_any_aid
## 1             0.42        26498.53
## 2             0.52        24544.15
## 3             0.85        23874.35
## 4             0.57        22994.79
## 5             0.57        21788.43

What are the 5 colleges that provided the least average $ amount in aid packaged to a student?

full_data |>
  select(UNITID, name,tot_degr_ugrd,tot_grad_rt, perc_awd_any_aid,avg_awd_any_aid) |>
  arrange(avg_awd_any_aid,perc_awd_any_aid) |>
  slice_head(n = 5) 
##   UNITID                        name tot_degr_ugrd tot_grad_rt perc_awd_any_aid
## 1 120953          Palo Verde College          4053        0.14             0.56
## 2 147800              Oakton College          5918        0.31             0.94
## 3 112385 Coastline Community College          6567        0.28             0.63
## 4 108667          College of Alameda          4077        0.34             0.62
## 5 230746 Salt Lake Community College         17290        0.30             1.00
##   avg_awd_any_aid
## 1         1688.87
## 2         1753.73
## 3         1760.99
## 4         1825.57
## 5         1870.00

What can we tell about the data at the state level

What 5 states have the most colleges present?

# The 5 states with the most number of colleges included

state_summary <- full_data |>
  group_by(state) |>             
  summarise(
    nbr_colleges = n(), 
    avg_fin_aid_pkg  = round(mean(avg_awd_any_aid, na.rm = TRUE),2),
    med_fin_aid_pkg  = round(median(avg_awd_any_aid, na.rm = TRUE),2),
    avg_grad_rate    = round(mean(tot_grad_rt,na.rm = TRUE),2),
    med_grad_rate    = round(median(tot_grad_rt,na.rm = TRUE),2)
  ) |>
  arrange(desc(nbr_colleges))  

head(state_summary,5)
## # A tibble: 5 × 6
##   state nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate med_grad_rate
##   <chr>        <int>           <dbl>           <dbl>         <dbl>         <dbl>
## 1 CA             148           6145.           4538.          0.42          0.38
## 2 TX              93           6258.           5439.          0.36          0.34
## 3 NC              75           6826.           6896.          0.45          0.42
## 4 NY              75           7279.           7277.          0.42          0.34
## 5 IL              60           6798.           5739.          0.41          0.41
# The 5 states with the least amount of colleges included
head(state_summary |> arrange(nbr_colleges),5)
## # A tibble: 5 × 6
##   state nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate med_grad_rate
##   <chr>        <int>           <dbl>           <dbl>         <dbl>         <dbl>
## 1 DC               1           7279.           7279.          0.26          0.26
## 2 DE               3           9728.          11951.          0.46          0.39
## 3 RI               3           7976.           6725.          0.47          0.46
## 4 VT               3          11743.          12450.          0.51          0.49
## 5 AK               4           6596.           6769.          0.32          0.32
# The 5 states with the most amount of aid awarded to a student on average
head(state_summary |> arrange(desc(avg_fin_aid_pkg)),5)
## # A tibble: 5 × 6
##   state nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate med_grad_rate
##   <chr>        <int>           <dbl>           <dbl>         <dbl>         <dbl>
## 1 VT               3          11743.          12450.          0.51          0.49
## 2 CT              10          10309.          10614.          0.55          0.57
## 3 DE               3           9728.          11951.          0.46          0.39
## 4 CO              27           8784.           8199.          0.41          0.38
## 5 PA              54           8327.           8708.          0.38          0.29
# The 5 states with the least amount of aid awarded to a student on average
head(state_summary |> arrange(avg_fin_aid_pkg),5)
## # A tibble: 5 × 6
##   state nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate med_grad_rate
##   <chr>        <int>           <dbl>           <dbl>         <dbl>         <dbl>
## 1 AZ              24           5167.           3972.          0.26          0.23
## 2 HI              10           5296.           4818.          0.36          0.32
## 3 NV               7           5369.           5046.          0.4           0.37
## 4 WI              30           5459.           5192.          0.5           0.5 
## 5 OH              59           5812.           5063.          0.34          0.29

4.1 - Univariate Analysis

I want to see the distributions of the variables I have included in the full data. I will look at the distributions for the dependent variable (grad rate), the columns that represent the share of students meeting a condition, and the average aid awarded. I am not as interested in just the count of students or the total sum of aid at a college as it does not tell as complete a story as the percents do. The proportion of students receiving aid takes into consideration the total number of students and the number of students receviing aid by college.

ggplot(full_data, aes(x = tot_degr_ugrd)) +
  geom_histogram(bins = 35,  
                 fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Degree Seeking Undergraduate Students",
       x = "Students",
       y = "Number of colleges") +
  theme_minimal()

There is a natural floor to the number of students since there cannot be less than 0 students seeking an undergraduate degree at a college. The variable is heavily right skewed which can impact the performance of the model so we will apply a log transformation.

ggplot(full_data, aes(x = log10(tot_degr_ugrd))) +
  geom_histogram(bins = 50,  
                 fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Degree Seeking Undergraduate Students",
       x = "Students (log)",
       y = "Number of colleges") +
  theme_minimal()

We see that this now takes on a more normal distrubution bell like shape. we will add a column for the log value

Then we look at how the dependent variable is distributed, the graduation rate. Graduation rate is slightly right skewed, but is generally bell shaped

ggplot(full_data, aes(x = tot_grad_rt)) +
  geom_histogram(bins = 35,  
                 fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Graduation Rate",
       x = "Graduation Rate (percentage)",
       y = "Number of colleges") +
  theme_minimal()

We have natrual ceilings and floors as you cannot offer less than 0 % of students aid, and you cannot give more than 100% of your students aid. The shape is roughly bell shaped with a left skew

ggplot(full_data, aes(x = perc_awd_any_aid)) +
  geom_histogram(bins = 35,  
                 fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Share of Students Receiving Any Aid",
       x = "Receives Any Aid (percentage)",
       y = "Number of colleges") +
  theme_minimal()

We have a nearly normal bell shape forming for this distribution of percent of students receiving pell

ggplot(full_data, aes(x = perc_awd_pell)) +
  geom_histogram(bins = 35,  
                 fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Share of Students Receiving Pell",
       x = "Receives Pell (percentage)",
       y = "Number of colleges") +
  theme_minimal()

The percent share of loans has a heavy right skew and a large proportion at 0. This is something that will be worth taking into consideration as we build the model. It may turn out to be a characteristic that should be evaluated seperately and out of the scope of this analysis.

ggplot(full_data, aes(x = perc_awd_loans)) +
  geom_histogram(bins = 35,  
                 fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Share of Students Receiving Loans",
       x = "Receives Loans (percentage)",
       y = "Number of colleges") +
  theme_minimal()

The data are generally right skewed, however we cannot just delete them as they may be indications of the real world and not errors. It is possible that the colleges have high cost of attendance and offer more scholarships to balance out the cost.

full_data |>
  select(unitid = UNITID,
         avg_any_aid  = avg_awd_any_aid,
         avg_pell     = avg_awd_pell,
         avg_loans    = avg_awd_loans) |>
  pivot_longer(
    cols      = starts_with("avg_"),             
    names_to  = "award_type",
    values_to = "amount"
  ) |>
  ggplot(aes(x = award_type,
             y = amount,
             fill = award_type)) +
  geom_boxplot(width = .5, outlier.colour = "red") +
  labs(title = "Distribution of Average Awards per Recipient",
       x = NULL, y = "Amount (USD)") +

  theme_minimal(base_size = 13)

4.2 - Multivariate Analysis

What relationship can be observed between the share of students receiving any aid and graduation rate? We do not see a very strong linear relationship between the proportion of students receiving aid and the graduation rate

ggplot(data = full_data, aes(x = perc_awd_any_aid, y = tot_grad_rt)) +
  geom_jitter() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(title = "Percent Stdnts Awarded vs Graduation Rate",
       x = "Students Awarded (percentage)",
       y = "Graduation Rate") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

What relationship can be observed between the average aid students received and graduation rate? Here we see a much stronger positive trending relationship between the two variables

ggplot(data = full_data, aes(x = avg_awd_any_aid, y = tot_grad_rt)) +
  geom_jitter() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(title = "Avg Aid Awarded vs Graduation Rate",
       x = "Avg Aid Awarded",
       y = "Graduation Rate") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

final_data <- full_data |>
  select(sector,state,tot_grad_rt,tot_degr_ugrd,perc_awd_any_aid,avg_awd_pell,avg_awd_loans,avg_awd_any_aid,perc_awd_pell,perc_awd_loans) |>
  filter(!is.na(avg_awd_loans)) |>
  transmute(
    sector,
    state,
    tot_grad_rt,
    perc_awd_any_aid,
    perc_awd_pell,
    perc_awd_loans,
    avg_aid     = avg_awd_any_aid / 1000,
    avg_pell    = avg_awd_pell / 1000,
    avg_loans   = avg_awd_loans / 1000,
    log_tot_ugrd = log10(tot_degr_ugrd)
  )

head(final_data,5)
##           sector state tot_grad_rt perc_awd_any_aid perc_awd_pell
## 1 Public, 4‑Year    AL        0.29             0.83          0.64
## 2 Public, 4‑Year    AL        0.63             0.75          0.34
## 3 Public, 4‑Year    AL        0.62             0.69          0.23
## 4 Public, 4‑Year    AL        0.30             0.96          0.72
## 5 Public, 4‑Year    AL        0.74             0.68          0.19
##   perc_awd_loans  avg_aid avg_pell avg_loans log_tot_ugrd
## 1           0.59 11.59689  5.77489  6.733796     3.716087
## 2           0.42 10.55476  6.13531  6.788345     4.106395
## 3           0.35  9.81416  4.68933  6.441246     3.844166
## 4           0.78 10.19909  4.90675  8.423590     3.517987
## 5           0.35 15.68529  5.20234  6.233895     4.496376
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
corr_vars <- final_data |>
  select(
    tot_grad_rt,
    log_tot_ugrd,
    perc_awd_any_aid,
    perc_awd_pell,
    perc_awd_loans,
    avg_aid,
    avg_pell,
    avg_loans
  )

GGally::ggpairs(corr_vars)

5 - Regression Modeling

I first want to explore some simple linear regressions with the share of students receiving any aid (% of students) vs grad rate, and the depth of aid (average amount of aid recevied) vs grad rate.

ipeds_lm_1 <- lm(tot_grad_rt ~ perc_awd_any_aid, data = final_data)
summary(ipeds_lm_1)
## 
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid, data = final_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41814 -0.11948 -0.02443  0.09962  0.60712 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.38206    0.01942  19.670   <2e-16 ***
## perc_awd_any_aid  0.03608    0.02685   1.344    0.179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.169 on 1533 degrees of freedom
## Multiple R-squared:  0.001177,   Adjusted R-squared:  0.000525 
## F-statistic: 1.806 on 1 and 1533 DF,  p-value: 0.1792

The R-squared value is near 0 This model is generally not a great predictor for graduation rate

ipeds_lm_2 <- lm(tot_grad_rt ~ avg_aid, data = final_data)
summary(ipeds_lm_2)
## 
## Call:
## lm(formula = tot_grad_rt ~ avg_aid, data = final_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53038 -0.09103 -0.00393  0.08624  0.65557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.191682   0.009076   21.12   <2e-16 ***
## avg_aid     0.031373   0.001211   25.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.141 on 1533 degrees of freedom
## Multiple R-squared:  0.3044, Adjusted R-squared:  0.304 
## F-statistic: 670.9 on 1 and 1533 DF,  p-value: < 2.2e-16

This model is a much better fit than the previous model. The R-squared value went up to .316, where we can estiamte that about 31% of the variance in graduation rate is explained by the average amount of aid. Based on this model we would expect that an increase in $1000 of aid on average would be associated with about 3.1 percentage point gain in graduation rate.

Next I want to move onto exploring some multiple linear regression models.

ipeds_lm_3 <- lm(tot_grad_rt ~ perc_awd_any_aid + avg_aid + sector + state + log_tot_ugrd + perc_awd_pell + perc_awd_loans, data = final_data)
summary(ipeds_lm_3)
## 
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid + avg_aid + sector + 
##     state + log_tot_ugrd + perc_awd_pell + perc_awd_loans, data = final_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48239 -0.07581 -0.00008  0.06834  0.72678 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.057931   0.066569   0.870 0.384311    
## perc_awd_any_aid      0.184345   0.026177   7.042 2.89e-12 ***
## avg_aid               0.027981   0.001380  20.278  < 2e-16 ***
## sectorPublic, 4‑Year -0.008776   0.009385  -0.935 0.349855    
## stateAL               0.124186   0.063496   1.956 0.050677 .  
## stateAR               0.194557   0.063842   3.047 0.002349 ** 
## stateAZ               0.013722   0.064495   0.213 0.831544    
## stateCA               0.111404   0.060843   1.831 0.067298 .  
## stateCO               0.053847   0.063882   0.843 0.399412    
## stateCT               0.152225   0.070714   2.153 0.031506 *  
## stateDC              -0.013561   0.132699  -0.102 0.918614    
## stateDE               0.041056   0.091064   0.451 0.652164    
## stateFL               0.239504   0.062744   3.817 0.000141 ***
## stateGA               0.127901   0.062618   2.043 0.041272 *  
## stateHI               0.122800   0.070378   1.745 0.081219 .  
## stateIA               0.171571   0.066414   2.583 0.009880 ** 
## stateID               0.111959   0.073173   1.530 0.126214    
## stateIL               0.134741   0.061954   2.175 0.029800 *  
## stateIN               0.117530   0.066909   1.757 0.079201 .  
## stateKS               0.208498   0.063475   3.285 0.001045 ** 
## stateKY               0.199434   0.065356   3.052 0.002317 ** 
## stateLA               0.115795   0.064798   1.787 0.074141 .  
## stateMA               0.084678   0.063977   1.324 0.185852    
## stateMD               0.102557   0.063968   1.603 0.109093    
## stateME               0.101965   0.067808   1.504 0.132864    
## stateMI               0.043068   0.062452   0.690 0.490541    
## stateMN               0.159639   0.062954   2.536 0.011322 *  
## stateMO               0.125657   0.064239   1.956 0.050642 .  
## stateMS               0.180259   0.065491   2.752 0.005988 ** 
## stateMT               0.112001   0.066177   1.692 0.090775 .  
## stateNC               0.204866   0.061619   3.325 0.000907 ***
## stateND               0.122430   0.067584   1.812 0.070263 .  
## stateNE               0.126551   0.067234   1.882 0.059998 .  
## stateNH               0.088761   0.069205   1.283 0.199838    
## stateNJ               0.132840   0.063860   2.080 0.037683 *  
## stateNM               0.016982   0.063767   0.266 0.790038    
## stateNV               0.130980   0.074605   1.756 0.079354 .  
## stateNY               0.146417   0.061699   2.373 0.017767 *  
## stateOH               0.068403   0.061684   1.109 0.267648    
## stateOK               0.034877   0.064166   0.544 0.586838    
## stateOR               0.052379   0.064426   0.813 0.416341    
## statePA               0.015858   0.062338   0.254 0.799235    
## stateRI               0.107299   0.091244   1.176 0.239798    
## stateSC               0.058581   0.063872   0.917 0.359204    
## stateSD               0.142788   0.068280   2.091 0.036679 *  
## stateTN               0.036092   0.065222   0.553 0.580095    
## stateTX               0.105593   0.061284   1.723 0.085098 .  
## stateUT               0.109848   0.073267   1.499 0.134016    
## stateVA               0.167927   0.062993   2.666 0.007764 ** 
## stateVT               0.008754   0.091385   0.096 0.923695    
## stateWA               0.060030   0.062170   0.966 0.334415    
## stateWI               0.225549   0.063694   3.541 0.000411 ***
## stateWV               0.075837   0.065608   1.156 0.247907    
## stateWY               0.104351   0.072827   1.433 0.152109    
## log_tot_ugrd          0.030241   0.007919   3.819 0.000140 ***
## perc_awd_pell        -0.507952   0.031498 -16.127  < 2e-16 ***
## perc_awd_loans        0.055341   0.023194   2.386 0.017159 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1185 on 1478 degrees of freedom
## Multiple R-squared:  0.5263, Adjusted R-squared:  0.5084 
## F-statistic: 29.33 on 56 and 1478 DF,  p-value: < 2.2e-16

We have a statistically significant P-Value, and increased the R-Squared value to .557, the model explains about 55.7% of the variance in graduation rate.

After controlling for sector, state, school size, and other aid metrics, every $1 000 increase in the average aid package predicts ~2.8 percentage-points higher graduation rates, while a larger share of Pell recipients predicts lower rates (–5.4 pp per 10 pp), and boosting the overall share of students who receive any aid still confers a modest positive benefit (~1.9 pp per 10 pp).

I will remove a variable, starting with the percent of students that receive loans.

ipeds_lm_4 <- lm(tot_grad_rt ~ perc_awd_any_aid + avg_aid + sector + state + log_tot_ugrd + perc_awd_pell , data = final_data)
summary(ipeds_lm_4)
## 
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid + avg_aid + sector + 
##     state + log_tot_ugrd + perc_awd_pell, data = final_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47379 -0.07679  0.00059  0.06781  0.73679 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.0519299  0.0666268   0.779 0.435860    
## perc_awd_any_aid      0.1881399  0.0261703   7.189 1.03e-12 ***
## avg_aid               0.0286971  0.0013490  21.273  < 2e-16 ***
## sectorPublic, 4‑Year -0.0003344  0.0087059  -0.038 0.969368    
## stateAL               0.1352966  0.0634256   2.133 0.033076 *  
## stateAR               0.2057312  0.0637712   3.226 0.001282 ** 
## stateAZ               0.0164678  0.0645869   0.255 0.798781    
## stateCA               0.1113685  0.0609390   1.828 0.067820 .  
## stateCO               0.0593746  0.0639413   0.929 0.353258    
## stateCT               0.1629975  0.0706819   2.306 0.021244 *  
## stateDC              -0.0086239  0.1328934  -0.065 0.948267    
## stateDE               0.0508534  0.0911154   0.558 0.576847    
## stateFL               0.2408940  0.0628409   3.833 0.000132 ***
## stateGA               0.1347728  0.0626507   2.151 0.031624 *  
## stateHI               0.1258651  0.0704782   1.786 0.074325 .  
## stateIA               0.1903419  0.0660511   2.882 0.004012 ** 
## stateID               0.1236531  0.0731244   1.691 0.091048 .  
## stateIL               0.1393707  0.0620221   2.247 0.024780 *  
## stateIN               0.1281417  0.0668672   1.916 0.055512 .  
## stateKS               0.2226469  0.0632974   3.517 0.000449 ***
## stateKY               0.2101991  0.0653031   3.219 0.001315 ** 
## stateLA               0.1342787  0.0644353   2.084 0.037338 *  
## stateMA               0.0996413  0.0637702   1.563 0.118383    
## stateMD               0.1127101  0.0639276   1.763 0.078092 .  
## stateME               0.1141748  0.0677217   1.686 0.092018 .  
## stateMI               0.0524525  0.0624269   0.840 0.400920    
## stateMN               0.1753116  0.0627100   2.796 0.005247 ** 
## stateMO               0.1351954  0.0642159   2.105 0.035431 *  
## stateMS               0.1911386  0.0654353   2.921 0.003542 ** 
## stateMT               0.1216786  0.0661578   1.839 0.066083 .  
## stateNC               0.2084749  0.0616981   3.379 0.000747 ***
## stateND               0.1353244  0.0674743   2.006 0.045084 *  
## stateNE               0.1378933  0.0671716   2.053 0.040263 *  
## stateNH               0.1088452  0.0688002   1.582 0.113852    
## stateNJ               0.1403861  0.0638830   2.198 0.028136 *  
## stateNM               0.0176477  0.0638674   0.276 0.782342    
## stateNV               0.1306948  0.0747228   1.749 0.080488 .  
## stateNY               0.1573859  0.0616248   2.554 0.010751 *  
## stateOH               0.0820049  0.0615177   1.333 0.182727    
## stateOK               0.0476745  0.0640428   0.744 0.456743    
## stateOR               0.0634605  0.0643603   0.986 0.324285    
## statePA               0.0357847  0.0618740   0.578 0.563117    
## stateRI               0.1193314  0.0912485   1.308 0.191157    
## stateSC               0.0693567  0.0638129   1.087 0.277269    
## stateSD               0.1593410  0.0680339   2.342 0.019309 *  
## stateTN               0.0406440  0.0652974   0.622 0.533746    
## stateTX               0.1124338  0.0613142   1.834 0.066895 .  
## stateUT               0.1111228  0.0733815   1.514 0.130159    
## stateVA               0.1732132  0.0630542   2.747 0.006086 ** 
## stateVT               0.0177906  0.0914515   0.195 0.845783    
## stateWA               0.0573604  0.0622589   0.921 0.357033    
## stateWI               0.2400187  0.0635054   3.780 0.000163 ***
## stateWV               0.0887331  0.0654890   1.355 0.175646    
## stateWY               0.1096617  0.0729085   1.504 0.132769    
## log_tot_ugrd          0.0303936  0.0079315   3.832 0.000132 ***
## perc_awd_pell        -0.5085752  0.0315466 -16.121  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1187 on 1479 degrees of freedom
## Multiple R-squared:  0.5245, Adjusted R-squared:  0.5068 
## F-statistic: 29.66 on 55 and 1479 DF,  p-value: < 2.2e-16

We do not see a major change in the R-Squared value, and still have a statistically significant model, while trimming down on the features included.

Next I will trim down on the sector

ipeds_lm_5 <- lm(tot_grad_rt ~ perc_awd_any_aid + avg_aid + state + log_tot_ugrd + perc_awd_pell , data = final_data)
summary(ipeds_lm_5)
## 
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid + avg_aid + state + 
##     log_tot_ugrd + perc_awd_pell, data = final_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47400 -0.07677  0.00063  0.06777  0.73686 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.052031   0.066552   0.782 0.434447    
## perc_awd_any_aid  0.187977   0.025814   7.282 5.33e-13 ***
## avg_aid           0.028670   0.001146  25.011  < 2e-16 ***
## stateAL           0.135532   0.063106   2.148 0.031901 *  
## stateAR           0.205947   0.063503   3.243 0.001209 ** 
## stateAZ           0.016651   0.064389   0.259 0.795982    
## stateCA           0.111631   0.060534   1.844 0.065368 .  
## stateCO           0.059517   0.063811   0.933 0.351122    
## stateCT           0.163150   0.070547   2.313 0.020880 *  
## stateDC          -0.008603   0.132847  -0.065 0.948375    
## stateDE           0.050995   0.091010   0.560 0.575343    
## stateFL           0.240917   0.062817   3.835 0.000131 ***
## stateGA           0.134944   0.062471   2.160 0.030925 *  
## stateHI           0.126015   0.070346   1.791 0.073439 .  
## stateIA           0.190630   0.065601   2.906 0.003716 ** 
## stateID           0.123812   0.072982   1.696 0.090007 .  
## stateIL           0.139650   0.061573   2.268 0.023471 *  
## stateIN           0.128248   0.066788   1.920 0.055020 .  
## stateKS           0.222884   0.062975   3.539 0.000414 ***
## stateKY           0.210467   0.064909   3.243 0.001211 ** 
## stateLA           0.134472   0.064217   2.094 0.036427 *  
## stateMA           0.099835   0.063548   1.571 0.116392    
## stateMD           0.112915   0.063683   1.773 0.076420 .  
## stateME           0.114353   0.067540   1.693 0.090647 .  
## stateMI           0.052669   0.062150   0.847 0.396881    
## stateMN           0.175522   0.062449   2.811 0.005009 ** 
## stateMO           0.135398   0.063977   2.116 0.034482 *  
## stateMS           0.191378   0.065116   2.939 0.003344 ** 
## stateMT           0.121796   0.066065   1.844 0.065447 .  
## stateNC           0.208732   0.061314   3.404 0.000681 ***
## stateND           0.135434   0.067392   2.010 0.044650 *  
## stateNE           0.138080   0.066973   2.062 0.039409 *  
## stateNH           0.109063   0.068544   1.591 0.111792    
## stateNJ           0.140648   0.063496   2.215 0.026908 *  
## stateNM           0.017880   0.063558   0.281 0.778505    
## stateNV           0.130693   0.074698   1.750 0.080390 .  
## stateNY           0.157572   0.061414   2.566 0.010394 *  
## stateOH           0.082065   0.061477   1.335 0.182114    
## stateOK           0.047862   0.063836   0.750 0.453518    
## stateOR           0.063677   0.064092   0.994 0.320615    
## statePA           0.035952   0.061700   0.583 0.560194    
## stateRI           0.119539   0.091057   1.313 0.189454    
## stateSC           0.069607   0.063458   1.097 0.272866    
## stateSD           0.159444   0.067958   2.346 0.019097 *  
## stateTN           0.040909   0.064910   0.630 0.528632    
## stateTX           0.112585   0.061167   1.841 0.065877 .  
## stateUT           0.111250   0.073282   1.518 0.129204    
## stateVA           0.173477   0.062659   2.769 0.005700 ** 
## stateVT           0.018108   0.091046   0.199 0.842373    
## stateWA           0.057436   0.062207   0.923 0.355997    
## stateWI           0.240164   0.063371   3.790 0.000157 ***
## stateWV           0.088898   0.065326   1.361 0.173774    
## stateWY           0.109808   0.072785   1.509 0.131598    
## log_tot_ugrd      0.030328   0.007745   3.916 9.42e-05 ***
## perc_awd_pell    -0.508393   0.031176 -16.307  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1187 on 1480 degrees of freedom
## Multiple R-squared:  0.5245, Adjusted R-squared:  0.5072 
## F-statistic: 30.23 on 54 and 1480 DF,  p-value: < 2.2e-16
car::vif(ipeds_lm_5)
##                      GVIF Df GVIF^(1/(2*Df))
## perc_awd_any_aid 1.874214  1        1.369019
## avg_aid          1.265029  1        1.124735
## state            2.478467 50        1.009118
## log_tot_ugrd     1.705701  1        1.306025
## perc_awd_pell    2.010798  1        1.418026

What happens if the state is removed? Can we compare the with and without models using an anova?

ipeds_lm_6 <- lm(
  tot_grad_rt ~ perc_awd_any_aid + avg_aid +
                log_tot_ugrd + perc_awd_pell,
  data = final_data
)
summary(ipeds_lm_6)
## 
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid + avg_aid + log_tot_ugrd + 
##     perc_awd_pell, data = final_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48951 -0.08609 -0.00387  0.07658  0.65970 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.116669   0.033422   3.491 0.000495 ***
## perc_awd_any_aid  0.170607   0.025581   6.669 3.58e-11 ***
## avg_aid           0.027183   0.001184  22.968  < 2e-16 ***
## log_tot_ugrd      0.041990   0.007228   5.810 7.60e-09 ***
## perc_awd_pell    -0.405228   0.030081 -13.471  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1296 on 1530 degrees of freedom
## Multiple R-squared:  0.4132, Adjusted R-squared:  0.4117 
## F-statistic: 269.4 on 4 and 1530 DF,  p-value: < 2.2e-16
car::vif(ipeds_lm_6)
## perc_awd_any_aid          avg_aid     log_tot_ugrd    perc_awd_pell 
##         1.541813         1.129703         1.244408         1.568178
anova(ipeds_lm_6, ipeds_lm_5)
## Analysis of Variance Table
## 
## Model 1: tot_grad_rt ~ perc_awd_any_aid + avg_aid + log_tot_ugrd + perc_awd_pell
## Model 2: tot_grad_rt ~ perc_awd_any_aid + avg_aid + state + log_tot_ugrd + 
##     perc_awd_pell
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1530 25.716                                  
## 2   1480 20.839 50     4.877 6.9273 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the Anova results, the state variable captures significant unique signal, so it would be better to keep the state included.

6 - Model Diagnostics

Below we plot the residuals of the model in a historgram and observe that there are nearly normal residuals. This assumption is satisfied.

ggplot(data = ipeds_lm_5, aes(x = .resid)) +
  geom_histogram(bins = 25) +
  xlab("Residuals")

Most of our residuals below fall on a line, and are nearly normal, however we do see for some values on both extremes a tail that trails off. The model may not be as reliable when handling those extreme values.

ggplot(data = ipeds_lm_5, aes(sample = .resid)) +
  stat_qq()

Here we do not see any major patterns but there are observed outliers again and a denser cloud around .3 - .5. The assumption is mildly violated but generally still seems to hold. It may be worth looking into this further in future analysis.

ggplot(data = ipeds_lm_5, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

7 - Conclusions

The final model chosen uses the proportion of students awarded any aid, the average aid amount by student, the percent of students that receive Pell, the state of the college, and the total number of students as predictors for the colleges total graduation rate.

The model produced an R-Squared value of .555 with a signiifcant p-value. About 55.5% of the variance is explained.

For every $1000 increase in the average aid awarded, it is expected that the graduation rate would rise 2.9 percentage points.

For every 10 percentage point increase in proportion of students that recevive aid attributes to 2 percentage point gain in graduation rate.

Interestingly enough we observe a drop of 5.4 percentage points when the percent of pell recipients increases 10 percentage points

Overall, the breadth of aid (proportion of students awarded) along with the depth (average aid package amount) are strongly associated with student success(graduation rate) when factoring in State effects, proportion of pell recipients, and student count.

8 - Further Considerations

It is important to note that there can be underlying factors which are influencing the results, such as net price, types of aid offered (Predominantly private scholarships, state aid, institutional scholarships? etc), admissions criteria and more that may better explain fluctuations in graduation rate. Further considerations may also be given to the data itself, are the outliers a result of error or do they truly represent the real world? What types of events may have influenced the data, for example would the years reported have accounted for any aid provided to aid with pandemic related issues, are the admissions or enrollment data influenced by post pandemic effects?