Given the current uncertainty centered around the future of U.S Higher Education Financial Aid programs, it is appropriate to explore the relationship between the coverage of financial aid(percentage of students who receive aid) and the depth of financial aid (the amount of aid provided). One of the key metrics utilized for ascertaining an institutions “success” is graduation rate. Financial aid programs are generally offered with a purpose of bridging the gaps in access to higher education for students of different income levels.
The data is sample of 1392 undergraduate degree granting public institutions located within the United States of America extracted from the 2022-2023 IPEDS survey. The observational study attempts to quantify the relationship between the precentage of students that receive financial aid and the average aid package awarded with an institutions graduation rate while accounting for variables including Pell Grant share, state, and institutions headcount.
The result of the multiple linear regression model chosen explains about 55.5% of the variance observed in graduation rate. R^2 = 55.5%, adjusted R^2 53.7% and a standard error = 11.5%. Given the variables included, a $1000 increase in the average grant award is associated with a 2.9 percentage‑point rise in the graduation rate. Expanding grant coverage by 10 percentage‑points predicts an additional 2 percentage‑point gain. A negative association is observed with Pell share and graduation rate. The model shows that both the percentage of students as well as the average aid package together while taking into consideration the rest of the variables, have a positive association with graduation rate.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Importing the financial aid data
ipeds_fa2223 <- read.csv('https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/sfa2223.csv')
How many rows and columns are included in the FA dataset?
dim(ipeds_fa2223)
## [1] 5653 691
Takeaway: There are 5,563 rows and 691 columns included in the 22-23 IPEDS FA dataset
What does the data look like?
head(ipeds_fa2223,5)
## UNITID XSCUGRAD SCUGRAD XSCUGDGSK SCUGDGSK XSCUGNDGS SCUGNDGS XSCUGFFN
## 1 100654 R 5206 R 5201 R 5 R
## 2 100663 R 13032 R 12776 R 256 R
## 3 100690 R 228 R 228 R 0 R
## 4 100706 R 7169 R 6985 R 184 R
## 5 100724 R 3296 R 3296 R 0 R
## SCUGFFN XSCUGFFP SCUGFFP XSCFA2 SCFA2 XSCFA2DG SCFA2DG XSCFA2ND SCFA2ND
## 1 1547 R 30 R 5206 R 5201 R 5
## 2 2172 R 17 R 13032 R 12776 R 256
## 3 0 R 0 R 228 R 228 R 0
## 4 1154 R 16 R 7169 R 6985 R 184
## 5 938 R 28 R 3296 R 3296 R 0
## XSCFA1N SCFA1N XSCFA1P SCFA1P XSCFA11N SCFA11N XSCFA11P SCFA11P XSCFA12N
## 1 R 1547 R 30 R 0 R 0 R
## 2 R 2172 R 17 R 0 R 0 R
## 3 R 0 R 0 A NA A NA A
## 4 R 1154 R 16 R 0 R 0 R
## 5 R 938 R 28 R 0 R 0 R
## SCFA12N XSCFA12P SCFA12P XSCFA13N SCFA13N XSCFA13P SCFA13P XSCFA14N SCFA14N
## 1 810 R 52 R 737 R 48 R 0
## 2 1881 R 87 R 291 R 13 R 0
## 3 NA A NA A NA A NA A NA
## 4 969 R 84 R 185 R 16 R 0
## 5 382 R 41 R 556 R 59 R 0
## XSCFA14P SCFA14P XSCFY2 SCFY2 XSCFY2DG SCFY2DG XSCFY2ND SCFY2ND XSCFY1N
## 1 R 0 A NA A NA A NA A
## 2 R 0 A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 R 0 A NA A NA A NA A
## 5 R 0 A NA A NA A NA A
## SCFY1N XSCFY1P SCFY1P XUAGRNTN UAGRNTN XUAGRNTP UAGRNTP XUAGRNTT UAGRNTT
## 1 NA A NA R 4326 R 83 R 50168163
## 2 NA A NA R 9696 R 74 R 101942282
## 3 NA A NA R 216 R 95 R 473411
## 4 NA A NA R 4940 R 69 R 47598105
## 5 NA A NA R 3159 R 96 R 32218913
## XUAGRNTA UAGRNTA XUDGAGRNTN UDGAGRNTN XUDGAGRNTP UDGAGRNTP XUDGAGRNTT
## 1 R 11597 R 4326 R 83 R
## 2 R 10514 R 9645 R 75 R
## 3 R 2192 R 216 R 95 R
## 4 R 9635 R 4790 R 69 R
## 5 R 10199 R 3159 R 96 R
## UDGAGRNTT XUDGAGRNTA UDGAGRNTA XUNDAGRNTN UNDAGRNTN XUNDAGRNTP UNDAGRNTP
## 1 50168163 R 11597 R 0 R 0
## 2 101800616 R 10555 R 51 R 20
## 3 473411 R 2192 R 0 Z 0
## 4 47009835 R 9814 R 150 R 82
## 5 32218913 R 10199 R 0 Z 0
## XUNDAGRNTT UNDAGRNTT XUNDAGRNTA UNDAGRNTA XUPGRNTN UPGRNTN XUPGRNTP UPGRNTP
## 1 R 0 A NA R 3353 R 64
## 2 R 141666 R 2778 R 4324 R 33
## 3 R 0 A NA R 156 R 68
## 4 R 588270 R 3922 R 1613 R 22
## 5 R 0 A NA R 2374 R 72
## XUPGRNTT UPGRNTT XUPGRNTA UPGRNTA XUDGPGRNTN UDGPGRNTN XUDGPGRNTP UDGPGRNTP
## 1 R 19363202 R 5775 R 3353 R 64
## 2 R 26529097 R 6135 R 4324 R 34
## 3 R 329636 R 2113 R 156 R 68
## 4 R 7561775 R 4688 R 1612 R 23
## 5 R 11648634 R 4907 R 2374 R 72
## XUDGPGRNTT UDGPGRNTT XUDGPGRNTA UDGPGRNTA XUNDPGRNTN UNDPGRNTN XUNDPGRNTP
## 1 R 19363202 R 5775 R 0 R
## 2 R 26529097 R 6135 R 0 R
## 3 R 329636 R 2113 R 0 Z
## 4 R 7559208 R 4689 R 1 R
## 5 R 11648634 R 4907 R 0 Z
## UNDPGRNTP XUNDPGRNTT UNDPGRNTT XUNDPGRNTA UNDPGRNTA XUFLOANN UFLOANN XUFLOANP
## 1 0 R 0 A NA R 3050 R
## 2 0 R 0 A NA R 5339 R
## 3 0 R 0 A NA R 165 R
## 4 1 R 2567 R 2567 R 2431 R
## 5 0 R 0 A NA R 2563 R
## UFLOANP XUFLOANT UFLOANT XUFLOANA UFLOANA XUDGFLOANN UDGFLOANN XUDGFLOANP
## 1 59 R 20538077 R 6734 R 3050 R
## 2 41 R 36242972 R 6788 R 5339 R
## 3 72 R 1317500 R 7985 R 165 R
## 4 34 R 15656787 R 6440 R 2429 R
## 5 78 R 21589661 R 8424 R 2563 R
## UDGFLOANP XUDGFLOANT UDGFLOANT XUDGFLOANA UDGFLOANA XUNDFLOANN UNDFLOANN
## 1 59 R 20538077 R 6734 R 0
## 2 42 R 36242972 R 6788 R 0
## 3 72 R 1317500 R 7985 R 0
## 4 35 R 15645787 R 6441 R 2
## 5 78 R 21589661 R 8424 R 0
## XUNDFLOANP UNDFLOANP XUNDFLOANT UNDFLOANT XUNDFLOANA UNDFLOANA XANYAIDN
## 1 R 0 R 0 A NA R
## 2 R 0 R 0 A NA R
## 3 Z 0 R 0 A NA A
## 4 R 1 R 11000 R 5500 R
## 5 Z 0 R 0 A NA R
## ANYAIDN XANYAIDP ANYAIDP XAIDFSIN AIDFSIN XAIDFSIP AIDFSIP XAGRNT_N AGRNT_N
## 1 1359 R 88 R 1359 R 88 R 1323
## 2 2107 R 97 R 2101 R 97 R 2059
## 3 NA A NA A NA A NA A NA
## 4 1121 R 97 R 1116 R 97 R 1101
## 5 862 R 92 R 862 R 92 R 827
## XAGRNT_P AGRNT_P XAGRNT_T AGRNT_T XAGRNT_A AGRNT_A XFGRNT_N FGRNT_N XFGRNT_P
## 1 R 86 R 15362449 R 11612 R 1033 R
## 2 R 95 R 23333975 R 11333 R 891 R
## 3 A NA A NA A NA A NA A
## 4 R 95 R 9619087 R 8737 R 279 R
## 5 R 88 R 9073315 R 10971 R 685 R
## FGRNT_P XFGRNT_T FGRNT_T XFGRNT_A FGRNT_A XPGRNT_N PGRNT_N XPGRNT_P PGRNT_P
## 1 67 R 6366369 R 6163 R 1033 R 67
## 2 41 R 5581018 R 6264 R 891 R 41
## 3 NA A NA A NA A NA A NA
## 4 24 R 1490308 R 5342 R 279 R 24
## 5 73 R 4041503 R 5900 R 685 R 73
## XPGRNT_T PGRNT_T XPGRNT_A PGRNT_A XOFGRT_N OFGRT_N XOFGRT_P OFGRT_P XOFGRT_T
## 1 R 6185419 R 5988 R 91 R 6 R
## 2 R 5335268 R 5988 R 177 R 8 R
## 3 A NA A NA A NA A NA A
## 4 R 1362862 R 4885 R 68 R 6 R
## 5 R 3782171 R 5521 R 265 R 28 R
## OFGRT_T XOFGRT_A OFGRT_A XSGRNT_N SGRNT_N XSGRNT_P SGRNT_P XSGRNT_T SGRNT_T
## 1 180950 R 1988 R 29 R 2 R 50810
## 2 245750 R 1388 R 95 R 4 R 78369
## 3 NA A NA A NA A NA A NA
## 4 127446 R 1874 R 13 R 1 R 46582
## 5 259332 R 979 R 143 R 15 R 212371
## XSGRNT_A SGRNT_A XIGRNT_N IGRNT_N XIGRNT_P IGRNT_P XIGRNT_T IGRNT_T XIGRNT_A
## 1 R 1752 R 1024 R 66 R 8945270 R
## 2 R 825 R 1915 R 88 R 17674588 R
## 3 A NA A NA A NA A NA A
## 4 R 3583 R 1058 R 92 R 8082197 R
## 5 R 1485 R 556 R 59 R 4819441 R
## IGRNT_A XLOAN_N LOAN_N XLOAN_P LOAN_P XLOAN_T LOAN_T XLOAN_A LOAN_A XFLOAN_N
## 1 8736 R 887 R 57 R 5664417 R 6386 R
## 2 9230 R 1106 R 51 R 7271857 R 6575 R
## 3 NA A NA A NA A NA A NA A
## 4 7639 R 455 R 39 R 3533785 R 7767 R
## 5 8668 R 640 R 68 R 4088888 R 6389 R
## FLOAN_N XFLOAN_P FLOAN_P XFLOAN_T FLOAN_T XFLOAN_A FLOAN_A XOLOAN_N OLOAN_N
## 1 880 R 57 R 5322609 R 6048 R 26
## 2 1087 R 50 R 5565993 R 5121 R 124
## 3 NA A NA A NA A NA A NA
## 4 432 R 37 R 2198883 R 5090 R 97
## 5 635 R 68 R 3666952 R 5775 R 40
## XOLOAN_P OLOAN_P XOLOAN_T OLOAN_T XOLOAN_A OLOAN_A XGISTN2 GISTN2 XGISTON2
## 1 R 2 R 341808 R 13146 R 712 R
## 2 R 6 R 1705864 R 13757 R 1779 R
## 3 A NA A NA A NA A NA A
## 4 R 8 R 1334902 R 13762 R 920 R
## 5 R 4 R 421936 R 10548 R 296 R
## GISTON2 XGISTWF2 GISTWF2 XGISTOF2 GISTOF2 XGISTUN2 GISTUN2 XGISTT2 GISTT2
## 1 619 R 70 R 22 R 1 R 6598275
## 2 1522 R 145 R 70 R 42 R 18643927
## 3 NA A NA A NA A NA A NA
## 4 552 R 209 R 49 R 110 R 6588101
## 5 224 R 24 R 25 R 23 R 2516716
## XGISTA2 GISTA2 XGISTN1 GISTN1 XGISTON1 GISTON1 XGISTWF1 GISTWF1 XGISTOF1
## 1 R 9267 R 612 R 471 R 78 R
## 2 R 10480 R 1800 R 1535 R 183 R
## 3 A NA A NA A NA A NA A
## 4 R 7161 R 919 R 559 R 184 R
## 5 R 8502 R 206 R 168 R 20 R
## GISTOF1 XGISTUN1 GISTUN1 XGISTT1 GISTT1 XGISTA1 GISTA1 XGISTN0 GISTN0
## 1 63 R 0 R 5180341 R 8465 R 693
## 2 54 R 28 R 17145431 R 9525 R 1585
## 3 NA A NA A NA A NA A NA
## 4 44 R 132 R 6987147 R 7603 R 902
## 5 14 R 4 R 1795518 R 8716 R 430
## XGISTON0 GISTON0 XGISTWF0 GISTWF0 XGISTOF0 GISTOF0 XGISTUN0 GISTUN0 XGISTT0
## 1 R 482 R 158 R 53 R 0 R
## 2 R 1325 R 142 R 56 R 62 R
## 3 A NA A NA A NA A NA A
## 4 R 408 R 226 R 52 R 216 R
## 5 R 367 R 26 R 29 R 8 R
## GISTT0 XGISTA0 GISTA0 XGIS4N2 GIS4N2 XGIS4ON2 GIS4ON2 XGIS4WF2 GIS4WF2
## 1 6240985 R 9006 R 645 R 592 R 34
## 2 15046376 R 9493 R 1260 R 1095 R 112
## 3 NA A NA A NA A NA A NA
## 4 6724413 R 7455 R 464 R 298 R 111
## 5 3895022 R 9058 R 291 R 243 R 23
## XGIS4OF2 GIS4OF2 XGIS4UN2 GIS4UN2 XGIS4G2 GIS4G2 XGIS4T2 GIS4T2 XGIS4A2
## 1 R 19 R 0 R 636 R 5928907 R
## 2 R 53 R 0 R 1217 R 12724450 R
## 3 A NA A NA A NA A NA A
## 4 R 35 R 20 R 452 R 3349380 R
## 5 R 25 R 0 R 283 R 2368392 R
## GIS4A2 XGIS4N12 GIS4N12 XGIS4G12 GIS4G12 XGIS4T12 GIS4T12 XGIS4A12 GIS4A12
## 1 9192 R 328 R 328 R 3271951 R 9975
## 2 10099 R 410 R 410 R 5333964 R 13010
## 3 NA A NA A NA A NA A NA
## 4 7218 R 93 R 93 R 983620 R 10577
## 5 8139 R 174 R 173 R 1481759 R 8516
## XGIS4N22 GIS4N22 XGIS4G22 GIS4G22 XGIS4T22 GIS4T22 XGIS4A22 GIS4A22 XGIS4N32
## 1 R 176 R 176 R 1729702 R 9828 R
## 2 R 210 R 209 R 2509572 R 11950 R
## 3 A NA A NA A NA A NA A
## 4 R 84 R 84 R 769664 R 9163 R
## 5 R 62 R 62 R 494836 R 7981 R
## GIS4N32 XGIS4G32 GIS4G32 XGIS4T32 GIS4T32 XGIS4A32 GIS4A32 XGIS4N42 GIS4N42
## 1 94 R 91 R 684364 R 7280 R 21
## 2 216 R 214 R 2055503 R 9516 R 168
## 3 NA A NA A NA A NA A NA
## 4 79 R 77 R 466937 R 5911 R 76
## 5 33 R 29 R 226797 R 6873 R 15
## XGIS4G42 GIS4G42 XGIS4T42 GIS4T42 XGIS4A42 GIS4A42 XGIS4N52 GIS4N52 XGIS4G52
## 1 R 18 R 98566 R 4694 R 26 R
## 2 R 149 R 1161418 R 6913 R 256 R
## 3 A NA A NA A NA A NA A
## 4 R 73 R 447853 R 5893 R 132 R
## 5 R 12 R 93848 R 6257 R 7 R
## GIS4G52 XGIS4T52 GIS4T52 XGIS4A52 GIS4A52 XGIS4N1 GIS4N1 XGIS4ON1 GIS4ON1
## 1 23 R 144324 R 5551 R 567 R 439
## 2 235 R 1663993 R 6500 R 1264 R 1074
## 3 NA A NA A NA A NA A NA
## 4 125 R 681306 R 5161 R 485 R 265
## 5 7 R 71152 R 10165 R 210 R 176
## XGIS4WF1 GIS4WF1 XGIS4OF1 GIS4OF1 XGIS4UN1 GIS4UN1 XGIS4G1 GIS4G1 XGIS4T1
## 1 R 66 R 62 R 0 R 549 R
## 2 R 141 R 49 R 0 R 1200 R
## 3 A NA A NA A NA A NA A
## 4 R 113 R 37 R 70 R 451 R
## 5 R 20 R 14 R 0 R 204 R
## GIS4T1 XGIS4A1 GIS4A1 XGIS4N11 GIS4N11 XGIS4G11 GIS4G11 XGIS4T11 GIS4T11
## 1 4640730 R 8185 R 311 R 307 R 2787106
## 2 12010869 R 9502 R 426 R 426 R 4957183
## 3 NA A NA A NA A NA A NA
## 4 3655336 R 7537 R 98 R 98 R 1163769
## 5 1758424 R 8373 R 115 R 115 R 1015884
## XGIS4A11 GIS4A11 XGIS4N21 GIS4N21 XGIS4G21 GIS4G21 XGIS4T21 GIS4T21 XGIS4A21
## 1 R 8962 R 143 R 142 R 1235673 R
## 2 R 11637 R 222 R 222 R 2470169 R
## 3 A NA A NA A NA A NA A
## 4 R 11875 R 66 R 66 R 664885 R
## 5 R 8834 R 46 R 46 R 375439 R
## GIS4A21 XGIS4N31 GIS4N31 XGIS4G31 GIS4G31 XGIS4T31 GIS4T31 XGIS4A31 GIS4A31
## 1 8641 R 57 R 53 R 378171 R 6635
## 2 11127 R 217 R 210 R 1926464 R 8878
## 3 NA A NA A NA A NA A NA
## 4 10074 R 94 R 86 R 655627 R 6975
## 5 8162 R 21 R 21 R 176298 R 8395
## XGIS4N41 GIS4N41 XGIS4G41 GIS4G41 XGIS4T41 GIS4T41 XGIS4A41 GIS4A41 XGIS4N51
## 1 R 29 R 22 R 87821 R 3028 R
## 2 R 145 R 124 R 900763 R 6212 R
## 3 A NA A NA A NA A NA A
## 4 R 83 R 73 R 406468 R 4897 R
## 5 R 12 R 11 R 89836 R 7486 R
## GIS4N51 XGIS4G51 GIS4G51 XGIS4T51 GIS4T51 XGIS4A51 GIS4A51 XGIS4N0 GIS4N0
## 1 27 R 25 R 151959 R 5628 R 657
## 2 254 R 218 R 1756290 R 6915 R 1053
## 3 NA A NA A NA A NA A NA
## 4 144 R 128 R 764587 R 5310 R 515
## 5 16 R 11 R 100967 R 6310 R 418
## XGIS4ON0 GIS4ON0 XGIS4WF0 GIS4WF0 XGIS4OF0 GIS4OF0 XGIS4UN0 GIS4UN0 XGIS4G0
## 1 R 457 R 150 R 50 R 0 R
## 2 R 883 R 118 R 50 R 2 R
## 3 A NA A NA A NA A NA A
## 4 R 264 R 140 R 35 R 76 R
## 5 R 362 R 25 R 28 R 3 R
## GIS4G0 XGIS4T0 GIS4T0 XGIS4A0 GIS4A0 XGIS4N10 GIS4N10 XGIS4G10 GIS4G10
## 1 642 R 5825299 R 8867 R 381 R 380
## 2 992 R 10175657 R 9663 R 322 R 321
## 3 NA A NA A NA A NA A NA
## 4 492 R 3923644 R 7619 R 141 R 141
## 5 389 R 3488167 R 8345 R 284 R 284
## XGIS4T10 GIS4T10 XGIS4A10 GIS4A10 XGIS4N20 GIS4N20 XGIS4G20 GIS4G20 XGIS4T20
## 1 R 3713735 R 9747 R 141 R 141 R
## 2 R 3988192 R 12386 R 195 R 195 R
## 3 A NA A NA A NA A NA A
## 4 R 1440379 R 10215 R 72 R 71 R
## 5 R 2573442 R 9061 R 87 R 81 R
## GIS4T20 XGIS4A20 GIS4A20 XGIS4N30 GIS4N30 XGIS4G30 GIS4G30 XGIS4T30 GIS4T30
## 1 1357002 R 9624 R 76 R 73 R 487604
## 2 2121276 R 10878 R 161 R 155 R 1513529
## 3 NA A NA A NA A NA A NA
## 4 736070 R 10223 R 99 R 94 R 623404
## 5 675785 R 7768 R 28 R 23 R 222495
## XGIS4A30 GIS4A30 XGIS4N40 GIS4N40 XGIS4G40 GIS4G40 XGIS4T40 GIS4T40 XGIS4A40
## 1 R 6416 R 40 R 34 R 208688 R
## 2 R 9401 R 126 R 112 R 1010300 R
## 3 A NA A NA A NA A NA A
## 4 R 6297 R 84 R 77 R 475706 R
## 5 R 7946 R 12 R 1 R 16445 R
## GIS4A40 XGIS4N50 GIS4N50 XGIS4G50 GIS4G50 XGIS4T50 GIS4T50 XGIS4A50 GIS4A50
## 1 5217 R 19 R 14 R 58270 R 3067
## 2 8018 R 249 R 209 R 1542360 R 6194
## 3 NA A NA A NA A NA A NA
## 4 5663 R 119 R 109 R 648085 R 5446
## 5 1370 R 7 R 0 R 0 R 0
## XNPIST2 NPIST2 XNPIST1 NPIST1 XNPIST0 NPIST0 XNPIS412 NPIS412 XNPIS422
## 1 R 14064 R 14600 R 12921 R 13776 R
## 2 R 17413 R 16834 R 16990 R 14816 R
## 3 A NA A NA A NA A NA A
## 4 R 19847 R 18588 R 17302 R 16521 R
## 5 R 13504 R 13176 R 12875 R 13512 R
## NPIS422 XNPIS432 NPIS432 XNPIS442 NPIS442 XNPIS452 NPIS452 XNPIS411 NPIS411
## 1 13923 R 16471 R 19057 R 18200 R 14205
## 2 15876 R 18310 R 20913 R 21326 R 14620
## 3 NA A NA A NA A NA A NA
## 4 17935 R 21187 R 21205 R 21937 R 13902
## 5 14047 R 15155 R 15771 R 11863 R 13066
## XNPIS421 NPIS421 XNPIS431 NPIS431 XNPIS441 NPIS441 XNPIS451 NPIS451 XNPIS410
## 1 R 14526 R 16532 R 20139 R 17539 R
## 2 R 15130 R 17379 R 20045 R 19342 R
## 3 A NA A NA A NA A NA A
## 4 R 15703 R 18802 R 20880 R 20467 R
## 5 R 13738 R 13505 R 14414 R 15590 R
## NPIS410 XNPIS420 NPIS420 XNPIS430 NPIS430 XNPIS440 NPIS440 XNPIS450 NPIS450
## 1 12177 R 12300 R 15508 R 16707 R 18857
## 2 13862 R 15370 R 16847 R 18230 R 20054
## 3 NA A NA A NA A NA A NA
## 4 14654 R 14646 R 18572 R 19206 R 19423
## 5 12877 R 14170 R 13992 R 20568 R 21938
## XGRNTN2 GRNTN2 XGRNTON2 GRNTON2 XGRNTWF2 GRNTWF2 XGRNTOF2 GRNTOF2 XGRNTUN2
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRNTUN2 XGRNTT2 GRNTT2 XGRNTA2 GRNTA2 XGRNTN1 GRNTN1 XGRNTON1 GRNTON1
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRNTWF1 GRNTWF1 XGRNTOF1 GRNTOF1 XGRNTUN1 GRNTUN1 XGRNTT1 GRNTT1 XGRNTA1
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRNTA1 XGRNTN0 GRNTN0 XGRNTON0 GRNTON0 XGRNTWF0 GRNTWF0 XGRNTOF0 GRNTOF0
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRNTUN0 GRNTUN0 XGRNTT0 GRNTT0 XGRNTA0 GRNTA0 XGRN4N2 GRN4N2 XGRN4ON2
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4ON2 XGRN4WF2 GRN4WF2 XGRN4OF2 GRN4OF2 XGRN4UN2 GRN4UN2 XGRN4G2 GRN4G2
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4T2 GRN4T2 XGRN4A2 GRN4A2 XGRN4N12 GRN4N12 XGRN4G12 GRN4G12 XGRN4T12
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4T12 XGRN4A12 GRN4A12 XGRN4N22 GRN4N22 XGRN4G22 GRN4G22 XGRN4T22 GRN4T22
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4A22 GRN4A22 XGRN4N32 GRN4N32 XGRN4G32 GRN4G32 XGRN4T32 GRN4T32 XGRN4A32
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4A32 XGRN4N42 GRN4N42 XGRN4G42 GRN4G42 XGRN4T42 GRN4T42 XGRN4A42 GRN4A42
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4N52 GRN4N52 XGRN4G52 GRN4G52 XGRN4T52 GRN4T52 XGRN4A52 GRN4A52 XGRN4N1
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4N1 XGRN4ON1 GRN4ON1 XGRN4WF1 GRN4WF1 XGRN4OF1 GRN4OF1 XGRN4UN1 GRN4UN1
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4G1 GRN4G1 XGRN4T1 GRN4T1 XGRN4A1 GRN4A1 XGRN4N11 GRN4N11 XGRN4G11
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4G11 XGRN4T11 GRN4T11 XGRN4A11 GRN4A11 XGRN4N21 GRN4N21 XGRN4G21 GRN4G21
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4T21 GRN4T21 XGRN4A21 GRN4A21 XGRN4N31 GRN4N31 XGRN4G31 GRN4G31 XGRN4T31
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4T31 XGRN4A31 GRN4A31 XGRN4N41 GRN4N41 XGRN4G41 GRN4G41 XGRN4T41 GRN4T41
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4A41 GRN4A41 XGRN4N51 GRN4N51 XGRN4G51 GRN4G51 XGRN4T51 GRN4T51 XGRN4A51
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4A51 XGRN4N0 GRN4N0 XGRN4ON0 GRN4ON0 XGRN4WF0 GRN4WF0 XGRN4OF0 GRN4OF0
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4UN0 GRN4UN0 XGRN4G0 GRN4G0 XGRN4T0 GRN4T0 XGRN4A0 GRN4A0 XGRN4N10
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4N10 XGRN4G10 GRN4G10 XGRN4T10 GRN4T10 XGRN4A10 GRN4A10 XGRN4N20 GRN4N20
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4G20 GRN4G20 XGRN4T20 GRN4T20 XGRN4A20 GRN4A20 XGRN4N30 GRN4N30 XGRN4G30
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4G30 XGRN4T30 GRN4T30 XGRN4A30 GRN4A30 XGRN4N40 GRN4N40 XGRN4G40 GRN4G40
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
## XGRN4T40 GRN4T40 XGRN4A40 GRN4A40 XGRN4N50 GRN4N50 XGRN4G50 GRN4G50 XGRN4T50
## 1 A NA A NA A NA A NA A
## 2 A NA A NA A NA A NA A
## 3 A NA A NA A NA A NA A
## 4 A NA A NA A NA A NA A
## 5 A NA A NA A NA A NA A
## GRN4T50 XGRN4A50 GRN4A50 XNPGRN2 NPGRN2 XNPGRN1 NPGRN1 XNPGRN0 NPGRN0 XNPT412
## 1 NA A NA A NA A NA A NA A
## 2 NA A NA A NA A NA A NA A
## 3 NA A NA A NA A NA A NA A
## 4 NA A NA A NA A NA A NA A
## 5 NA A NA A NA A NA A NA A
## NPT412 XNPT422 NPT422 XNPT432 NPT432 XNPT442 NPT442 XNPT452 NPT452 XNPT411
## 1 NA A NA A NA A NA A NA A
## 2 NA A NA A NA A NA A NA A
## 3 NA A NA A NA A NA A NA A
## 4 NA A NA A NA A NA A NA A
## 5 NA A NA A NA A NA A NA A
## NPT411 XNPT421 NPT421 XNPT431 NPT431 XNPT441 NPT441 XNPT451 NPT451 XNPT410
## 1 NA A NA A NA A NA A NA A
## 2 NA A NA A NA A NA A NA A
## 3 NA A NA A NA A NA A NA A
## 4 NA A NA A NA A NA A NA A
## 5 NA A NA A NA A NA A NA A
## NPT410 XNPT420 NPT420 XNPT430 NPT430 XNPT440 NPT440 XNPT450 NPT450
## 1 NA A NA A NA A NA A NA
## 2 NA A NA A NA A NA A NA
## 3 NA A NA A NA A NA A NA
## 4 NA A NA A NA A NA A NA
## 5 NA A NA A NA A NA A NA
Importing the institutional characteristics
ipeds_inst2223 <- read.csv('https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/hd2023.csv')
How many rows and columns are included in the Inst Characteristic Dataset?
dim(ipeds_inst2223)
## [1] 6163 73
Takeaway: 6,163 rows and 73 columns in the 22-23 IPEDS Institutional Characteristic Dataset
What does the data look like?
head(ipeds_inst2223,5)
## UNITID INSTNM
## 1 100654 Alabama A & M University
## 2 100663 University of Alabama at Birmingham
## 3 100690 Amridge University
## 4 100706 University of Alabama in Huntsville
## 5 100724 Alabama State University
## IALIAS
## 1 AAMU
## 2 UAB
## 3 Southern Christian University Regions University
## 4 UAH University of Alabama Huntsville
## 5
## ADDR CITY STABBR ZIP FIPS OBEREG
## 1 4900 Meridian Street Normal AL 35762 1 5
## 2 Administration Bldg Suite 1070 Birmingham AL 35294-0110 1 5
## 3 1200 Taylor Rd Montgomery AL 36117-3553 1 5
## 4 301 Sparkman Dr Huntsville AL 35899 1 5
## 5 915 S Jackson Street Montgomery AL 36104-0271 1 5
## CHFNM CHFTITLE GENTELE EIN UEIS OPEID
## 1 Dr. Daniel K. Wims President 2.563725e+09 636001109 JDVGS67MSLH7 100200
## 2 Ray L. Watts President 2.059344e+09 636005396 YND4PLMC9AN7 105200
## 3 Michael C.Turner President 3.343874e+13 237034324 RB27R4GLDKE7 2503400
## 4 Chuck Karr President 2.568246e+09 630520830 HB6KNGVNJRU1 105500
## 5 Quinton T. Ross President 3.342294e+09 636001101 DLJWLMSNK627 100500
## OPEFLAG WEBADDR
## 1 1 www.aamu.edu/
## 2 1 https://www.uab.edu/
## 3 1 https://www.amridgeuniversity.edu/
## 4 1 www.uah.edu/
## 5 1 www.alasu.edu/
## ADMINURL
## 1 https://www.aamu.edu/admissions-aid/index.html
## 2 https://www.uab.edu/admissions/
## 3 https://www.amridgeuniversity.edu/admissions/
## 4 https://www.uah.edu/admissions
## 5 www.alasu.edu/admissions/index.aspx
## FAIDURL
## 1 https://www.aamu.edu/admissions-aid/financial-aid/
## 2 https://www.uab.edu/cost-aid/
## 3 https://www.amridgeuniversity.edu/financialaid/
## 4 finaid.uah.edu/
## 5 www.alasu.edu/undergraduate/expensesandfinancialaid/expenses-and-financial-aid
## APPLURL
## 1 https://www.aamu.edu/admissions-aid/undergraduate-admissions/apply-today.html
## 2 https://www.uab.edu/admissions/apply
## 3 https://www.amridgeuniversity.edu/myamridge/
## 4 register.uah.edu/
## 5 www.alasu.edu/admissions/undergrad-admissions/index.aspx
## NPRICURL
## 1 www.aamu.edu/admissions-aid/tuition-fees/net-price-calculator.html
## 2 https://tcc.ruffalonl.com/University of Alabama at Birmingham/Freshman-Students
## 3 https://www2.amridgeuniversity.edu:9091/
## 4 finaid.uah.edu/
## 5 www.alasu.edu/cost-aid/tuition-costs/net-price-calculator
## VETURL
## 1
## 2 https://www.uab.edu/students/veterans
## 3 https://www.amridgeuniversity.edu/admissions/military/
## 4 www.uah.edu/admissions/graduate/financial-aid/veterans
## 5
## ATHURL
## 1
## 2 https://www.uab.edu/registrar/students
## 3
## 4 www.uah.edu/heoa
## 5 https://www.alasu.edu/administration/consumer-information
## DISAURL
## 1 https://www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Services.aspx
## 2 https://www.uab.edu/students/disability/
## 3 https://www.amridgeuniversity.edu/studentservices/
## 4 www.uah.edu/health-and-wellness/disability-support
## 5 www.alasu.edu/academics/researchcenters/alabama-alliance-students-disabilities
## SECTOR ICLEVEL CONTROL HLOFFER UGOFFER GROFFER HDEGOFR1 DEGGRANT HBCU
## 1 1 1 1 9 1 1 12 1 1
## 2 1 1 1 9 1 1 11 1 2
## 3 2 1 2 9 1 1 12 1 2
## 4 1 1 1 9 1 1 11 1 2
## 5 1 1 1 9 1 1 11 1 1
## HOSPITAL MEDICAL TRIBAL LOCALE OPENPUBL ACT NEWID DEATHYR CLOSEDAT CYACTIVE
## 1 2 2 2 12 1 A -2 -2 -2 1
## 2 1 1 2 12 1 A -2 -2 -2 1
## 3 2 2 2 12 1 A -2 -2 -2 1
## 4 2 2 2 12 1 A -2 -2 -2 1
## 5 2 2 2 12 1 A -2 -2 -2 1
## POSTSEC PSEFLAG PSET4FLG RPTMTH INSTCAT C21BASIC C21IPUG C21IPGRD C21UGPRF
## 1 1 1 1 1 2 18 16 18 10
## 2 1 1 1 1 2 15 14 14 9
## 3 1 1 1 1 2 20 20 18 5
## 4 1 1 1 1 2 15 17 17 15
## 5 1 1 1 1 2 17 13 18 10
## C21ENPRF C21SZSET C18BASIC C15BASIC CCBASIC CARNEGIE LANDGRNT INSTSIZE
## 1 4 14 18 18 18 16 1 3
## 2 5 15 15 15 15 15 2 5
## 3 6 6 20 20 21 51 2 1
## 4 4 13 16 16 15 16 2 3
## 5 3 14 19 19 18 21 2 2
## F1SYSTYP F1SYSNAM F1SYSCOD CBSA CBSATYPE CSA
## 1 2 -2 -2 26620 1 290
## 2 1 The University of Alabama System 101050 13820 1 142
## 3 2 -2 -2 33860 1 388
## 4 1 The University of Alabama System 101050 26620 1 290
## 5 2 -2 -2 33860 1 388
## COUNTYCD COUNTYNM CNGDSTCD LONGITUD LATITUDE DFRCGID DFRCUSCG
## 1 1089 Madison County 105 -86.56850 34.78337 106 1
## 2 1073 Jefferson County 107 -86.79935 33.50570 91 1
## 3 1101 Montgomery County 102 -86.17401 32.36261 124 2
## 4 1089 Madison County 105 -86.64045 34.72456 91 1
## 5 1101 Montgomery County 107 -86.29568 32.36432 97 1
Importing the graduation rate data
ipeds_grad2223 <- read.csv('https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/2223_IPEDS_CUSTOM_DATA_FILE.csv')
How many rows and columns are included in the Grad Rate Dataset?
dim(ipeds_grad2223)
## [1] 1563 38
Takeaway: 1,563 rows and 38 columns in the 22-23 IPEDS Dataset
Using the data dictionary provided by IPEDS, I will first trim down each dataframe to the columns I intend to use for the project.
Trimming down the FA columns
Any counts below are for undergraduate, degree seeking students.
fa_clean <- ipeds_fa2223 |>
select(
UNITID, # Inst ID
tot_degr_ugrd = SCUGDGSK, # Tot Nbr Degr Seeking Ugrd
nbr_awd_any_aid = UDGAGRNTN, # Nbr Ugrd Degr Awarded Any Aid
tot_aid_awd = UDGAGRNTT, # Total aid awarded
nbr_awd_pell_aid = UDGPGRNTN, # Nbr Stdnts awarded Pell Grant
tot_pell_awd = UDGPGRNTT, # Total Pell awarded
nbr_awd_loans = UDGFLOANN, # Nbr Stdnts awarded Loans
tot_loans_awd = UDGFLOANT, # Total Loans awarded
) |>
mutate(
# Creating columns that show what percent of degree seeking undergraduate
# students received each type of aid
perc_awd_any_aid = round(nbr_awd_any_aid / tot_degr_ugrd , 2),
perc_awd_pell = round(nbr_awd_pell_aid / tot_degr_ugrd , 2),
perc_awd_loans = round(nbr_awd_loans / tot_degr_ugrd , 2),
# Creating columns that show the average aid package a student received
avg_awd_any_aid = round(tot_aid_awd / nbr_awd_any_aid, 2),
avg_awd_pell = round(tot_pell_awd / nbr_awd_pell_aid, 2),
avg_awd_loans = round(tot_loans_awd / nbr_awd_loans, 2)
)
head(fa_clean,5)
## UNITID tot_degr_ugrd nbr_awd_any_aid tot_aid_awd nbr_awd_pell_aid
## 1 100654 5201 4326 50168163 3353
## 2 100663 12776 9645 101800616 4324
## 3 100690 228 216 473411 156
## 4 100706 6985 4790 47009835 1612
## 5 100724 3296 3159 32218913 2374
## tot_pell_awd nbr_awd_loans tot_loans_awd perc_awd_any_aid perc_awd_pell
## 1 19363202 3050 20538077 0.83 0.64
## 2 26529097 5339 36242972 0.75 0.34
## 3 329636 165 1317500 0.95 0.68
## 4 7559208 2429 15645787 0.69 0.23
## 5 11648634 2563 21589661 0.96 0.72
## perc_awd_loans avg_awd_any_aid avg_awd_pell avg_awd_loans
## 1 0.59 11596.89 5774.89 6733.80
## 2 0.42 10554.76 6135.31 6788.34
## 3 0.72 2191.72 2113.05 7984.85
## 4 0.35 9814.16 4689.33 6441.25
## 5 0.78 10199.09 4906.75 8423.59
Selecting the columns from the institutional characteristics
This table will later be used to with joins to limit which colleges are included in the model
inst_clean <- ipeds_inst2223 |>
filter(
SECTOR %in% c(1, 4), # To limit public 2 year and 4 year
OBEREG != 9, # To filter out territories that are not the 50 states and DC
ICLEVEL %in% c(1, 2), # Keep 2 and 4 year colleges
UGOFFER == 1, # To limit that they offer undergraduate degrees
DEGGRANT == 1, # Indicates the inst is degr granting
OPENPUBL == 1 # Denotes the college is open to the public
) |>
select(
UNITID,
name = INSTNM,
state = STABBR,
sector = SECTOR
) |>
# This will make it easier to interpret the values
mutate(sector = case_when(
sector == 1 ~ "Public, 4‑Year",
sector == 4 ~ "Public, 2‑Year"
))
head(inst_clean,5)
## UNITID name state sector
## 1 100654 Alabama A & M University AL Public, 4‑Year
## 2 100663 University of Alabama at Birmingham AL Public, 4‑Year
## 3 100706 University of Alabama in Huntsville AL Public, 4‑Year
## 4 100724 Alabama State University AL Public, 4‑Year
## 5 100751 The University of Alabama AL Public, 4‑Year
Cleaning the graduation rate dataset
grad_clean <- ipeds_grad2223 |>
select(
unitid,
tot_grad_rt = `DRVGR2023.Graduation.rate..total.cohort`
) |>
mutate(
# convert 0–100 percentages → 0–1 proportions which matches what was done in FA
tot_grad_rt = tot_grad_rt / 100
) |>
filter(
!is.na(tot_grad_rt)
)
head(grad_clean,5)
## unitid tot_grad_rt
## 1 100654 0.29
## 2 100663 0.63
## 3 100706 0.62
## 4 100724 0.30
## 5 100751 0.74
Now the tables should be joined to create a comprehensive dataset for our regression model
full_data <- inst_clean |>
inner_join(grad_clean, by = c("UNITID" = "unitid")) |>
inner_join(fa_clean, by = "UNITID")
head(full_data)
## UNITID name state sector tot_grad_rt
## 1 100654 Alabama A & M University AL Public, 4‑Year 0.29
## 2 100663 University of Alabama at Birmingham AL Public, 4‑Year 0.63
## 3 100706 University of Alabama in Huntsville AL Public, 4‑Year 0.62
## 4 100724 Alabama State University AL Public, 4‑Year 0.30
## 5 100751 The University of Alabama AL Public, 4‑Year 0.74
## 6 100760 Central Alabama Community College AL Public, 2‑Year 0.24
## tot_degr_ugrd nbr_awd_any_aid tot_aid_awd nbr_awd_pell_aid tot_pell_awd
## 1 5201 4326 50168163 3353 19363202
## 2 12776 9645 101800616 4324 26529097
## 3 6985 4790 47009835 1612 7559208
## 4 3296 3159 32218913 2374 11648634
## 5 31360 21244 333218271 5834 30350429
## 6 963 666 3442170 502 2275601
## nbr_awd_loans tot_loans_awd perc_awd_any_aid perc_awd_pell perc_awd_loans
## 1 3050 20538077 0.83 0.64 0.59
## 2 5339 36242972 0.75 0.34 0.42
## 3 2429 15645787 0.69 0.23 0.35
## 4 2563 21589661 0.96 0.72 0.78
## 5 10896 67924516 0.68 0.19 0.35
## 6 283 1058559 0.69 0.52 0.29
## avg_awd_any_aid avg_awd_pell avg_awd_loans
## 1 11596.89 5774.89 6733.80
## 2 10554.76 6135.31 6788.34
## 3 9814.16 4689.33 6441.25
## 4 10199.09 4906.75 8423.59
## 5 15685.29 5202.34 6233.89
## 6 5168.42 4533.07 3740.49
what shape did the combined data take?
dim(full_data)
## [1] 1536 18
There are 1,536 rows and 18 columns
What do the variables look like?
summary(full_data)
## UNITID name state sector
## Min. :100654 Length:1536 Length:1536 Length:1536
## 1st Qu.:148194 Class :character Class :character Class :character
## Median :185386 Mode :character Mode :character Mode :character
## Mean :194418
## 3rd Qu.:218726
## Max. :498571
##
## tot_grad_rt tot_degr_ugrd nbr_awd_any_aid tot_aid_awd
## Min. :0.0000 Min. : 23 Min. : 23 Min. : 264791
## 1st Qu.:0.2900 1st Qu.: 1480 1st Qu.: 1063 1st Qu.: 6133458
## Median :0.3800 Median : 3494 Median : 2266 Median : 13582987
## Mean :0.4072 Mean : 6441 Mean : 4296 Mean : 34632662
## 3rd Qu.:0.5100 3rd Qu.: 7754 3rd Qu.: 5130 3rd Qu.: 33574125
## Max. :1.0000 Max. :64778 Max. :50410 Max. :625717141
##
## nbr_awd_pell_aid tot_pell_awd nbr_awd_loans tot_loans_awd
## Min. : 12 Min. : 55293 Min. : 0.0 Min. : 0
## 1st Qu.: 649 1st Qu.: 2949220 1st Qu.: 179.8 1st Qu.: 976710
## Median : 1316 Median : 6193948 Median : 618.0 Median : 3464485
## Mean : 2266 Mean : 11096009 Mean : 1681.2 Mean : 10483300
## 3rd Qu.: 2871 3rd Qu.: 13498318 3rd Qu.: 2053.2 3rd Qu.: 12670293
## Max. :21132 Max. :120317715 Max. :18597.0 Max. :123504447
##
## perc_awd_any_aid perc_awd_pell perc_awd_loans avg_awd_any_aid
## Min. :0.1100 Min. :0.0500 Min. :0.0000 Min. : 1689
## 1st Qu.:0.6000 1st Qu.:0.3100 1st Qu.:0.0700 1st Qu.: 4841
## Median :0.7100 Median :0.4000 Median :0.2500 Median : 6057
## Mean :0.7055 Mean :0.4068 Mean :0.2616 Mean : 6882
## 3rd Qu.:0.8200 3rd Qu.:0.5000 3rd Qu.:0.4200 3rd Qu.: 8427
## Max. :1.0000 Max. :0.9300 Max. :0.8600 Max. :26499
##
## avg_awd_pell avg_awd_loans
## Min. :1824 Min. : 1075
## 1st Qu.:4213 1st Qu.: 5388
## Median :4790 Median : 6043
## Mean :4811 Mean : 5910
## 3rd Qu.:5253 3rd Qu.: 6516
## Max. :8791 Max. :11902
## NA's :144
Below I do a check to see if there are NAs in the newly combined dataset
missing_summary <- full_data |>
summarise(across(everything(), ~sum(is.na(.)))) |>
pivot_longer(everything(),
names_to = "variable",
values_to = "n_missing") |>
mutate(pct_missing = round(100 * n_missing / nrow(full_data), 2)) |>
filter(pct_missing > 0) |>
arrange(desc(pct_missing))
missing_summary
## # A tibble: 1 × 3
## variable n_missing pct_missing
## <chr> <int> <dbl>
## 1 avg_awd_loans 144 9.38
Are these just colleges where no loans were awarded?
full_data |>
filter(is.na(avg_awd_loans)) |>
summarise(
all_zero = all(nbr_awd_loans == 0 | is.na(nbr_awd_loans)),
min_tot_loans = min(tot_loans_awd, na.rm = TRUE),
max_tot_loans = max(tot_loans_awd, na.rm = TRUE)
)
## all_zero min_tot_loans max_tot_loans
## 1 TRUE 0 0
This presents a challenge because we hit a point where we must decide whether to impute values or delete the rows. To simply replace these values with $0 may bias the analysis as the fact the loan column was NA could be an indication that these colleges do not offer loans, this is an indication of access to loans. Deleting these rows however also introduces its own bias as now I am limiting the analysis to colleges that offered loans. What I will do is create a variable for “offered_loans” and then replace these missing values with 0.
full_data <- full_data %>%
mutate(
offered_loans = nbr_awd_loans > 0,
avg_awd_loans = if_else(offered_loans,
tot_loans_awd / nbr_awd_loans,
0)
)
head(full_data,5)
## UNITID name state sector tot_grad_rt
## 1 100654 Alabama A & M University AL Public, 4‑Year 0.29
## 2 100663 University of Alabama at Birmingham AL Public, 4‑Year 0.63
## 3 100706 University of Alabama in Huntsville AL Public, 4‑Year 0.62
## 4 100724 Alabama State University AL Public, 4‑Year 0.30
## 5 100751 The University of Alabama AL Public, 4‑Year 0.74
## tot_degr_ugrd nbr_awd_any_aid tot_aid_awd nbr_awd_pell_aid tot_pell_awd
## 1 5201 4326 50168163 3353 19363202
## 2 12776 9645 101800616 4324 26529097
## 3 6985 4790 47009835 1612 7559208
## 4 3296 3159 32218913 2374 11648634
## 5 31360 21244 333218271 5834 30350429
## nbr_awd_loans tot_loans_awd perc_awd_any_aid perc_awd_pell perc_awd_loans
## 1 3050 20538077 0.83 0.64 0.59
## 2 5339 36242972 0.75 0.34 0.42
## 3 2429 15645787 0.69 0.23 0.35
## 4 2563 21589661 0.96 0.72 0.78
## 5 10896 67924516 0.68 0.19 0.35
## avg_awd_any_aid avg_awd_pell avg_awd_loans offered_loans
## 1 11596.89 5774.89 6733.796 TRUE
## 2 10554.76 6135.31 6788.345 TRUE
## 3 9814.16 4689.33 6441.246 TRUE
## 4 10199.09 4906.75 8423.590 TRUE
## 5 15685.29 5202.34 6233.895 TRUE
Now we run the missing value check again, and see that we have handled the missing values.
missing_summary <- full_data |>
summarise(across(everything(), ~sum(is.na(.)))) |>
pivot_longer(everything(),
names_to = "variable",
values_to = "n_missing") |>
mutate(pct_missing = round(100 * n_missing / nrow(full_data), 2)) |>
filter(pct_missing > 0) |>
arrange(desc(pct_missing))
missing_summary
## # A tibble: 0 × 3
## # ℹ 3 variables: variable <chr>, n_missing <int>, pct_missing <dbl>
Now I want to dive into what kind of story this data tells.
To start off, what kind of information do we see at the sector level? 2-Year vs 4-Year
sector_summary <- full_data |>
group_by(sector) |>
summarise(
nbr_colleges = n(),
avg_fin_aid_pkg = round(mean(avg_awd_any_aid, na.rm = TRUE),2),
med_fin_aid_pkg = round(median(avg_awd_any_aid, na.rm = TRUE),2),
avg_grad_rate = round(mean(tot_grad_rt,na.rm = TRUE),2),
med_grad_rate = round(median(tot_grad_rt,na.rm = TRUE),2)
)
sector_summary
## # A tibble: 2 × 6
## sector nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Public, 2‑Year 786 5291. 5135. 0.35
## 2 Public, 4‑Year 750 8551. 8351. 0.47
## # ℹ 1 more variable: med_grad_rate <dbl>
Takeaway: There are 36 more 2-Year institutions included in the data. 4-Year colleges on average award more in aid pakcages to a student, and have a higher graduation rate
What are the 5 colleges that provide aid to the most students (% of students offered any aid)?
full_data |>
select(UNITID, name,tot_degr_ugrd,tot_grad_rt, perc_awd_any_aid,avg_awd_any_aid) |>
arrange(desc (perc_awd_any_aid),desc(avg_awd_any_aid)) |>
slice_head(n = 5)
## UNITID name tot_degr_ugrd tot_grad_rt
## 1 215309 University of Pittsburgh-Titusville 23 0.00
## 2 380377 Northwest Indian College 595 0.27
## 3 163338 University of Maryland Eastern Shore 1874 0.35
## 4 219277 Oglala Lakota College 1110 0.11
## 5 199926 Wilkes Community College 1250 0.52
## perc_awd_any_aid avg_awd_any_aid
## 1 1 11512.65
## 2 1 9116.75
## 3 1 8679.81
## 4 1 8355.33
## 5 1 8335.52
We see that there are colleges where 100% of the students included in the 2023 data were awarded some type of aid. This is interesting because it raises the question of whether these are true outliers, data errors, or just a representation of reality. It also points out a limitation of the data, as its possible that the aid being provided is loans or unique institutional scholarships. I will be removing the college where only 23 students were reported for the college as the sample size seems too small and their 100% of students receiving aid can skew the data.
full_data <- full_data |>
filter(UNITID != 215309)
What are the 5 colleges that provide aid to the least students (% of students offered any aid)?
full_data |>
select(UNITID, name,tot_degr_ugrd,tot_grad_rt, perc_awd_any_aid,avg_awd_any_aid) |>
arrange(perc_awd_any_aid,avg_awd_any_aid) |>
slice_head(n = 5)
## UNITID name tot_degr_ugrd tot_grad_rt
## 1 197027 United States Merchant Marine Academy 931 0.81
## 2 187976 University of New Mexico-Los Alamos Campus 867 0.30
## 3 204440 Northwest State Community College 1028 0.47
## 4 234669 Bellevue College 7823 0.33
## 5 240125 Waukesha County Technical College 4180 0.46
## perc_awd_any_aid avg_awd_any_aid
## 1 0.11 4509.29
## 2 0.19 3408.54
## 3 0.21 5010.46
## 4 0.25 6894.19
## 5 0.26 4169.63
What are the 5 colleges that provided the most average $ amount in aid packaged to a student?
full_data |>
select(UNITID, name,tot_degr_ugrd,tot_grad_rt, perc_awd_any_aid,avg_awd_any_aid) |>
arrange(desc(avg_awd_any_aid),desc(perc_awd_any_aid)) |>
slice_head(n = 5)
## UNITID name tot_degr_ugrd tot_grad_rt
## 1 234076 University of Virginia-Main Campus 17125 0.95
## 2 170976 University of Michigan-Ann Arbor 32448 0.93
## 3 445188 University of California-Merced 8343 0.69
## 4 110635 University of California-Berkeley 32475 0.93
## 5 110680 University of California-San Diego 33092 0.88
## perc_awd_any_aid avg_awd_any_aid
## 1 0.42 26498.53
## 2 0.52 24544.15
## 3 0.85 23874.35
## 4 0.57 22994.79
## 5 0.57 21788.43
What are the 5 colleges that provided the least average $ amount in aid packaged to a student?
full_data |>
select(UNITID, name,tot_degr_ugrd,tot_grad_rt, perc_awd_any_aid,avg_awd_any_aid) |>
arrange(avg_awd_any_aid,perc_awd_any_aid) |>
slice_head(n = 5)
## UNITID name tot_degr_ugrd tot_grad_rt perc_awd_any_aid
## 1 120953 Palo Verde College 4053 0.14 0.56
## 2 147800 Oakton College 5918 0.31 0.94
## 3 112385 Coastline Community College 6567 0.28 0.63
## 4 108667 College of Alameda 4077 0.34 0.62
## 5 230746 Salt Lake Community College 17290 0.30 1.00
## avg_awd_any_aid
## 1 1688.87
## 2 1753.73
## 3 1760.99
## 4 1825.57
## 5 1870.00
What can we tell about the data at the state level
What 5 states have the most colleges present?
# The 5 states with the most number of colleges included
state_summary <- full_data |>
group_by(state) |>
summarise(
nbr_colleges = n(),
avg_fin_aid_pkg = round(mean(avg_awd_any_aid, na.rm = TRUE),2),
med_fin_aid_pkg = round(median(avg_awd_any_aid, na.rm = TRUE),2),
avg_grad_rate = round(mean(tot_grad_rt,na.rm = TRUE),2),
med_grad_rate = round(median(tot_grad_rt,na.rm = TRUE),2)
) |>
arrange(desc(nbr_colleges))
head(state_summary,5)
## # A tibble: 5 × 6
## state nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate med_grad_rate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CA 148 6145. 4538. 0.42 0.38
## 2 TX 93 6258. 5439. 0.36 0.34
## 3 NC 75 6826. 6896. 0.45 0.42
## 4 NY 75 7279. 7277. 0.42 0.34
## 5 IL 60 6798. 5739. 0.41 0.41
# The 5 states with the least amount of colleges included
head(state_summary |> arrange(nbr_colleges),5)
## # A tibble: 5 × 6
## state nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate med_grad_rate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 DC 1 7279. 7279. 0.26 0.26
## 2 DE 3 9728. 11951. 0.46 0.39
## 3 RI 3 7976. 6725. 0.47 0.46
## 4 VT 3 11743. 12450. 0.51 0.49
## 5 AK 4 6596. 6769. 0.32 0.32
# The 5 states with the most amount of aid awarded to a student on average
head(state_summary |> arrange(desc(avg_fin_aid_pkg)),5)
## # A tibble: 5 × 6
## state nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate med_grad_rate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 VT 3 11743. 12450. 0.51 0.49
## 2 CT 10 10309. 10614. 0.55 0.57
## 3 DE 3 9728. 11951. 0.46 0.39
## 4 CO 27 8784. 8199. 0.41 0.38
## 5 PA 54 8327. 8708. 0.38 0.29
# The 5 states with the least amount of aid awarded to a student on average
head(state_summary |> arrange(avg_fin_aid_pkg),5)
## # A tibble: 5 × 6
## state nbr_colleges avg_fin_aid_pkg med_fin_aid_pkg avg_grad_rate med_grad_rate
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 AZ 24 5167. 3972. 0.26 0.23
## 2 HI 10 5296. 4818. 0.36 0.32
## 3 NV 7 5369. 5046. 0.4 0.37
## 4 WI 30 5459. 5192. 0.5 0.5
## 5 OH 59 5812. 5063. 0.34 0.29
The current full_data allows us to view each college observation in a single row, this is a wide format and can be helpful when performing the linear regression later on. How can we make this data tidy-er by taking the types of aid and moving them into a long format. This would be very beneficial as adding in new aid programs will not need new columns, it would be an addition of a row by college/aid program added.
tidy_full <- full_data |>
rename(
nbr_awd_pell = nbr_awd_pell_aid,
tot_awd_pell = tot_pell_awd,
avg_awd_pell = avg_awd_pell,
perc_awd_pell = perc_awd_pell
) |>
pivot_longer(
cols = matches("^(nbr|tot|perc|avg)_awd_"),
names_to = c(".value", "aid_type"),
names_pattern = "(nbr|tot|perc|avg)_awd_(.*)"
) |>
mutate(
aid_type = recode(aid_type,
any_aid = "Any Aid",
pell = "Pell Grant",
loans = "Federal Loans")
) |>
relocate(aid_type, .after = sector) |>
transmute(
UNITID,
name,
state,
sector,
aid_type,
tot_grad_rt,
perc,
avg_k = avg / 1000,
log_tot_ugrd = log10(tot_degr_ugrd)
)
head(tidy_full)
## # A tibble: 6 × 9
## UNITID name state sector aid_type tot_grad_rt perc avg_k log_tot_ugrd
## <int> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 100654 Alabama A &… AL Publi… Any Aid 0.29 0.83 11.6 3.72
## 2 100654 Alabama A &… AL Publi… Pell Gr… 0.29 0.64 5.77 3.72
## 3 100654 Alabama A &… AL Publi… Federal… 0.29 0.59 6.73 3.72
## 4 100663 University … AL Publi… Any Aid 0.63 0.75 10.6 4.11
## 5 100663 University … AL Publi… Pell Gr… 0.63 0.34 6.14 4.11
## 6 100663 University … AL Publi… Federal… 0.63 0.42 6.79 4.11
What kind of correlation can be observed from this long format?
aid_share_cor <- tidy_full %>%
group_by(aid_type) %>%
summarise(
n_institutions = n(),
corr_grad_vs_share = cor(perc, tot_grad_rt, use = "complete.obs")
)
aid_share_cor
## # A tibble: 3 × 3
## aid_type n_institutions corr_grad_vs_share
## <chr> <int> <dbl>
## 1 Any Aid 1535 0.0343
## 2 Federal Loans 1535 0.273
## 3 Pell Grant 1535 -0.317
It may be easier to visualize the correlation.
ggplot(tidy_full, aes(perc, tot_grad_rt)) +
geom_point(alpha = .25) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1) +
facet_wrap(~ aid_type, nrow = 1) +
labs(
title = "Graduation Rate vs. Share of Students Receiving Aid",
x = "Proportion of Undergrads Receiving Aid",
y = "Graduation Rate"
)
## `geom_smooth()` using formula = 'y ~ x'
I want to see the distributions of the variables I have included in the full data. I will look at the distributions for the dependent variable (grad rate), the columns that represent the share of students meeting a condition, and the average aid awarded. I am not as interested in just the count of students or the total sum of aid at a college as it does not tell as complete a story as the percents do. The proportion of students receiving aid takes into consideration the total number of students and the number of students receviing aid by college.
ggplot(full_data, aes(x = tot_degr_ugrd)) +
geom_histogram(bins = 35,
fill = "steelblue", colour = "white") +
labs(title = "Distribution of Degree Seeking Undergraduate Students",
x = "Students",
y = "Number of colleges") +
theme_minimal()
There is a natural floor to the number of students since there cannot be
less than 0 students seeking an undergraduate degree at a college. The
variable is heavily right skewed which can impact the performance of the
model so we will apply a log transformation.
ggplot(full_data, aes(x = log10(tot_degr_ugrd))) +
geom_histogram(bins = 50,
fill = "steelblue", colour = "white") +
labs(title = "Distribution of Degree Seeking Undergraduate Students",
x = "Students (log)",
y = "Number of colleges") +
theme_minimal()
We see that this now takes on a more normal distrubution bell like
shape. we will add a column for the log value
Then we look at how the dependent variable is distributed, the graduation rate. Graduation rate is slightly right skewed, but is generally bell shaped
ggplot(full_data, aes(x = tot_grad_rt)) +
geom_histogram(bins = 35,
fill = "steelblue", colour = "white") +
labs(title = "Distribution of Graduation Rate",
x = "Graduation Rate (percentage)",
y = "Number of colleges") +
theme_minimal()
We have natrual ceilings and floors as you cannot offer less than 0 % of students aid, and you cannot give more than 100% of your students aid. The shape is roughly bell shaped with a left skew
ggplot(full_data, aes(x = perc_awd_any_aid)) +
geom_histogram(bins = 35,
fill = "steelblue", colour = "white") +
labs(title = "Distribution of Share of Students Receiving Any Aid",
x = "Receives Any Aid (percentage)",
y = "Number of colleges") +
theme_minimal()
We have a nearly normal bell shape forming for this distribution of percent of students receiving pell
ggplot(full_data, aes(x = perc_awd_pell)) +
geom_histogram(bins = 35,
fill = "steelblue", colour = "white") +
labs(title = "Distribution of Share of Students Receiving Pell",
x = "Receives Pell (percentage)",
y = "Number of colleges") +
theme_minimal()
The percent share of loans has a heavy right skew and a large proportion at 0. This is something that will be worth taking into consideration as we build the model. It may turn out to be a characteristic that should be evaluated seperately and out of the scope of this analysis.
ggplot(full_data, aes(x = perc_awd_loans)) +
geom_histogram(bins = 35,
fill = "steelblue", colour = "white") +
labs(title = "Distribution of Share of Students Receiving Loans",
x = "Receives Loans (percentage)",
y = "Number of colleges") +
theme_minimal()
The data are generally right skewed, however we cannot just delete them as they may be indications of the real world and not errors. It is possible that the colleges have high cost of attendance and offer more scholarships to balance out the cost.
full_data |>
select(unitid = UNITID,
avg_any_aid = avg_awd_any_aid,
avg_pell = avg_awd_pell,
avg_loans = avg_awd_loans) |>
pivot_longer(
cols = starts_with("avg_"),
names_to = "award_type",
values_to = "amount"
) |>
ggplot(aes(x = award_type,
y = amount,
fill = award_type)) +
geom_boxplot(width = .5, outlier.colour = "red") +
labs(title = "Distribution of Average Awards per Recipient",
x = NULL, y = "Amount (USD)") +
theme_minimal(base_size = 13)
What relationship can be observed between the share of students receiving any aid and graduation rate? We do not see a very strong linear relationship between the proportion of students receiving aid and the graduation rate
ggplot(data = full_data, aes(x = perc_awd_any_aid, y = tot_grad_rt)) +
geom_jitter() +
stat_smooth(method = "lm", se = FALSE) +
labs(title = "Percent Stdnts Awarded vs Graduation Rate",
x = "Students Awarded (percentage)",
y = "Graduation Rate") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
What relationship can be observed between the average aid students received and graduation rate? Here we see a much stronger positive trending relationship between the two variables
ggplot(data = full_data, aes(x = avg_awd_any_aid, y = tot_grad_rt)) +
geom_jitter() +
stat_smooth(method = "lm", se = FALSE) +
labs(title = "Avg Aid Awarded vs Graduation Rate",
x = "Avg Aid Awarded",
y = "Graduation Rate") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
final_data <- full_data |>
select(sector,state,tot_grad_rt,tot_degr_ugrd,perc_awd_any_aid,avg_awd_pell,avg_awd_loans,avg_awd_any_aid,perc_awd_pell,perc_awd_loans) |>
filter(!is.na(avg_awd_loans)) |>
transmute(
sector,
state,
tot_grad_rt,
perc_awd_any_aid,
perc_awd_pell,
perc_awd_loans,
avg_aid = avg_awd_any_aid / 1000,
avg_pell = avg_awd_pell / 1000,
avg_loans = avg_awd_loans / 1000,
log_tot_ugrd = log10(tot_degr_ugrd)
)
head(final_data,5)
## sector state tot_grad_rt perc_awd_any_aid perc_awd_pell
## 1 Public, 4‑Year AL 0.29 0.83 0.64
## 2 Public, 4‑Year AL 0.63 0.75 0.34
## 3 Public, 4‑Year AL 0.62 0.69 0.23
## 4 Public, 4‑Year AL 0.30 0.96 0.72
## 5 Public, 4‑Year AL 0.74 0.68 0.19
## perc_awd_loans avg_aid avg_pell avg_loans log_tot_ugrd
## 1 0.59 11.59689 5.77489 6.733796 3.716087
## 2 0.42 10.55476 6.13531 6.788345 4.106395
## 3 0.35 9.81416 4.68933 6.441246 3.844166
## 4 0.78 10.19909 4.90675 8.423590 3.517987
## 5 0.35 15.68529 5.20234 6.233895 4.496376
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
corr_vars <- final_data |>
select(
tot_grad_rt,
log_tot_ugrd,
perc_awd_any_aid,
perc_awd_pell,
perc_awd_loans,
avg_aid,
avg_pell,
avg_loans
)
GGally::ggpairs(corr_vars)
I first want to explore some simple linear regressions with the share of students receiving any aid (% of students) vs grad rate, and the depth of aid (average amount of aid recevied) vs grad rate.
ipeds_lm_1 <- lm(tot_grad_rt ~ perc_awd_any_aid, data = final_data)
summary(ipeds_lm_1)
##
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41814 -0.11948 -0.02443 0.09962 0.60712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.38206 0.01942 19.670 <2e-16 ***
## perc_awd_any_aid 0.03608 0.02685 1.344 0.179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.169 on 1533 degrees of freedom
## Multiple R-squared: 0.001177, Adjusted R-squared: 0.000525
## F-statistic: 1.806 on 1 and 1533 DF, p-value: 0.1792
The R-squared value is near 0 This model is generally not a great predictor for graduation rate
ipeds_lm_2 <- lm(tot_grad_rt ~ avg_aid, data = final_data)
summary(ipeds_lm_2)
##
## Call:
## lm(formula = tot_grad_rt ~ avg_aid, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53038 -0.09103 -0.00393 0.08624 0.65557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.191682 0.009076 21.12 <2e-16 ***
## avg_aid 0.031373 0.001211 25.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.141 on 1533 degrees of freedom
## Multiple R-squared: 0.3044, Adjusted R-squared: 0.304
## F-statistic: 670.9 on 1 and 1533 DF, p-value: < 2.2e-16
This model is a much better fit than the previous model. The R-squared value went up to .316, where we can estiamte that about 31% of the variance in graduation rate is explained by the average amount of aid. Based on this model we would expect that an increase in $1000 of aid on average would be associated with about 3.1 percentage point gain in graduation rate.
Next I want to move onto exploring some multiple linear regression models.
ipeds_lm_3 <- lm(tot_grad_rt ~ perc_awd_any_aid + avg_aid + sector + state + log_tot_ugrd + perc_awd_pell + perc_awd_loans, data = final_data)
summary(ipeds_lm_3)
##
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid + avg_aid + sector +
## state + log_tot_ugrd + perc_awd_pell + perc_awd_loans, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48239 -0.07581 -0.00008 0.06834 0.72678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.057931 0.066569 0.870 0.384311
## perc_awd_any_aid 0.184345 0.026177 7.042 2.89e-12 ***
## avg_aid 0.027981 0.001380 20.278 < 2e-16 ***
## sectorPublic, 4‑Year -0.008776 0.009385 -0.935 0.349855
## stateAL 0.124186 0.063496 1.956 0.050677 .
## stateAR 0.194557 0.063842 3.047 0.002349 **
## stateAZ 0.013722 0.064495 0.213 0.831544
## stateCA 0.111404 0.060843 1.831 0.067298 .
## stateCO 0.053847 0.063882 0.843 0.399412
## stateCT 0.152225 0.070714 2.153 0.031506 *
## stateDC -0.013561 0.132699 -0.102 0.918614
## stateDE 0.041056 0.091064 0.451 0.652164
## stateFL 0.239504 0.062744 3.817 0.000141 ***
## stateGA 0.127901 0.062618 2.043 0.041272 *
## stateHI 0.122800 0.070378 1.745 0.081219 .
## stateIA 0.171571 0.066414 2.583 0.009880 **
## stateID 0.111959 0.073173 1.530 0.126214
## stateIL 0.134741 0.061954 2.175 0.029800 *
## stateIN 0.117530 0.066909 1.757 0.079201 .
## stateKS 0.208498 0.063475 3.285 0.001045 **
## stateKY 0.199434 0.065356 3.052 0.002317 **
## stateLA 0.115795 0.064798 1.787 0.074141 .
## stateMA 0.084678 0.063977 1.324 0.185852
## stateMD 0.102557 0.063968 1.603 0.109093
## stateME 0.101965 0.067808 1.504 0.132864
## stateMI 0.043068 0.062452 0.690 0.490541
## stateMN 0.159639 0.062954 2.536 0.011322 *
## stateMO 0.125657 0.064239 1.956 0.050642 .
## stateMS 0.180259 0.065491 2.752 0.005988 **
## stateMT 0.112001 0.066177 1.692 0.090775 .
## stateNC 0.204866 0.061619 3.325 0.000907 ***
## stateND 0.122430 0.067584 1.812 0.070263 .
## stateNE 0.126551 0.067234 1.882 0.059998 .
## stateNH 0.088761 0.069205 1.283 0.199838
## stateNJ 0.132840 0.063860 2.080 0.037683 *
## stateNM 0.016982 0.063767 0.266 0.790038
## stateNV 0.130980 0.074605 1.756 0.079354 .
## stateNY 0.146417 0.061699 2.373 0.017767 *
## stateOH 0.068403 0.061684 1.109 0.267648
## stateOK 0.034877 0.064166 0.544 0.586838
## stateOR 0.052379 0.064426 0.813 0.416341
## statePA 0.015858 0.062338 0.254 0.799235
## stateRI 0.107299 0.091244 1.176 0.239798
## stateSC 0.058581 0.063872 0.917 0.359204
## stateSD 0.142788 0.068280 2.091 0.036679 *
## stateTN 0.036092 0.065222 0.553 0.580095
## stateTX 0.105593 0.061284 1.723 0.085098 .
## stateUT 0.109848 0.073267 1.499 0.134016
## stateVA 0.167927 0.062993 2.666 0.007764 **
## stateVT 0.008754 0.091385 0.096 0.923695
## stateWA 0.060030 0.062170 0.966 0.334415
## stateWI 0.225549 0.063694 3.541 0.000411 ***
## stateWV 0.075837 0.065608 1.156 0.247907
## stateWY 0.104351 0.072827 1.433 0.152109
## log_tot_ugrd 0.030241 0.007919 3.819 0.000140 ***
## perc_awd_pell -0.507952 0.031498 -16.127 < 2e-16 ***
## perc_awd_loans 0.055341 0.023194 2.386 0.017159 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1185 on 1478 degrees of freedom
## Multiple R-squared: 0.5263, Adjusted R-squared: 0.5084
## F-statistic: 29.33 on 56 and 1478 DF, p-value: < 2.2e-16
We have a statistically significant P-Value, and increased the R-Squared value to .557, the model explains about 55.7% of the variance in graduation rate.
After controlling for sector, state, school size, and other aid metrics, every $1 000 increase in the average aid package predicts ~2.8 percentage-points higher graduation rates, while a larger share of Pell recipients predicts lower rates (–5.4 pp per 10 pp), and boosting the overall share of students who receive any aid still confers a modest positive benefit (~1.9 pp per 10 pp).
I will remove a variable, starting with the percent of students that receive loans.
ipeds_lm_4 <- lm(tot_grad_rt ~ perc_awd_any_aid + avg_aid + sector + state + log_tot_ugrd + perc_awd_pell , data = final_data)
summary(ipeds_lm_4)
##
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid + avg_aid + sector +
## state + log_tot_ugrd + perc_awd_pell, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47379 -0.07679 0.00059 0.06781 0.73679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0519299 0.0666268 0.779 0.435860
## perc_awd_any_aid 0.1881399 0.0261703 7.189 1.03e-12 ***
## avg_aid 0.0286971 0.0013490 21.273 < 2e-16 ***
## sectorPublic, 4‑Year -0.0003344 0.0087059 -0.038 0.969368
## stateAL 0.1352966 0.0634256 2.133 0.033076 *
## stateAR 0.2057312 0.0637712 3.226 0.001282 **
## stateAZ 0.0164678 0.0645869 0.255 0.798781
## stateCA 0.1113685 0.0609390 1.828 0.067820 .
## stateCO 0.0593746 0.0639413 0.929 0.353258
## stateCT 0.1629975 0.0706819 2.306 0.021244 *
## stateDC -0.0086239 0.1328934 -0.065 0.948267
## stateDE 0.0508534 0.0911154 0.558 0.576847
## stateFL 0.2408940 0.0628409 3.833 0.000132 ***
## stateGA 0.1347728 0.0626507 2.151 0.031624 *
## stateHI 0.1258651 0.0704782 1.786 0.074325 .
## stateIA 0.1903419 0.0660511 2.882 0.004012 **
## stateID 0.1236531 0.0731244 1.691 0.091048 .
## stateIL 0.1393707 0.0620221 2.247 0.024780 *
## stateIN 0.1281417 0.0668672 1.916 0.055512 .
## stateKS 0.2226469 0.0632974 3.517 0.000449 ***
## stateKY 0.2101991 0.0653031 3.219 0.001315 **
## stateLA 0.1342787 0.0644353 2.084 0.037338 *
## stateMA 0.0996413 0.0637702 1.563 0.118383
## stateMD 0.1127101 0.0639276 1.763 0.078092 .
## stateME 0.1141748 0.0677217 1.686 0.092018 .
## stateMI 0.0524525 0.0624269 0.840 0.400920
## stateMN 0.1753116 0.0627100 2.796 0.005247 **
## stateMO 0.1351954 0.0642159 2.105 0.035431 *
## stateMS 0.1911386 0.0654353 2.921 0.003542 **
## stateMT 0.1216786 0.0661578 1.839 0.066083 .
## stateNC 0.2084749 0.0616981 3.379 0.000747 ***
## stateND 0.1353244 0.0674743 2.006 0.045084 *
## stateNE 0.1378933 0.0671716 2.053 0.040263 *
## stateNH 0.1088452 0.0688002 1.582 0.113852
## stateNJ 0.1403861 0.0638830 2.198 0.028136 *
## stateNM 0.0176477 0.0638674 0.276 0.782342
## stateNV 0.1306948 0.0747228 1.749 0.080488 .
## stateNY 0.1573859 0.0616248 2.554 0.010751 *
## stateOH 0.0820049 0.0615177 1.333 0.182727
## stateOK 0.0476745 0.0640428 0.744 0.456743
## stateOR 0.0634605 0.0643603 0.986 0.324285
## statePA 0.0357847 0.0618740 0.578 0.563117
## stateRI 0.1193314 0.0912485 1.308 0.191157
## stateSC 0.0693567 0.0638129 1.087 0.277269
## stateSD 0.1593410 0.0680339 2.342 0.019309 *
## stateTN 0.0406440 0.0652974 0.622 0.533746
## stateTX 0.1124338 0.0613142 1.834 0.066895 .
## stateUT 0.1111228 0.0733815 1.514 0.130159
## stateVA 0.1732132 0.0630542 2.747 0.006086 **
## stateVT 0.0177906 0.0914515 0.195 0.845783
## stateWA 0.0573604 0.0622589 0.921 0.357033
## stateWI 0.2400187 0.0635054 3.780 0.000163 ***
## stateWV 0.0887331 0.0654890 1.355 0.175646
## stateWY 0.1096617 0.0729085 1.504 0.132769
## log_tot_ugrd 0.0303936 0.0079315 3.832 0.000132 ***
## perc_awd_pell -0.5085752 0.0315466 -16.121 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1187 on 1479 degrees of freedom
## Multiple R-squared: 0.5245, Adjusted R-squared: 0.5068
## F-statistic: 29.66 on 55 and 1479 DF, p-value: < 2.2e-16
We do not see a major change in the R-Squared value, and still have a statistically significant model, while trimming down on the features included.
Next I will trim down on the sector
ipeds_lm_5 <- lm(tot_grad_rt ~ perc_awd_any_aid + avg_aid + state + log_tot_ugrd + perc_awd_pell , data = final_data)
summary(ipeds_lm_5)
##
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid + avg_aid + state +
## log_tot_ugrd + perc_awd_pell, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47400 -0.07677 0.00063 0.06777 0.73686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.052031 0.066552 0.782 0.434447
## perc_awd_any_aid 0.187977 0.025814 7.282 5.33e-13 ***
## avg_aid 0.028670 0.001146 25.011 < 2e-16 ***
## stateAL 0.135532 0.063106 2.148 0.031901 *
## stateAR 0.205947 0.063503 3.243 0.001209 **
## stateAZ 0.016651 0.064389 0.259 0.795982
## stateCA 0.111631 0.060534 1.844 0.065368 .
## stateCO 0.059517 0.063811 0.933 0.351122
## stateCT 0.163150 0.070547 2.313 0.020880 *
## stateDC -0.008603 0.132847 -0.065 0.948375
## stateDE 0.050995 0.091010 0.560 0.575343
## stateFL 0.240917 0.062817 3.835 0.000131 ***
## stateGA 0.134944 0.062471 2.160 0.030925 *
## stateHI 0.126015 0.070346 1.791 0.073439 .
## stateIA 0.190630 0.065601 2.906 0.003716 **
## stateID 0.123812 0.072982 1.696 0.090007 .
## stateIL 0.139650 0.061573 2.268 0.023471 *
## stateIN 0.128248 0.066788 1.920 0.055020 .
## stateKS 0.222884 0.062975 3.539 0.000414 ***
## stateKY 0.210467 0.064909 3.243 0.001211 **
## stateLA 0.134472 0.064217 2.094 0.036427 *
## stateMA 0.099835 0.063548 1.571 0.116392
## stateMD 0.112915 0.063683 1.773 0.076420 .
## stateME 0.114353 0.067540 1.693 0.090647 .
## stateMI 0.052669 0.062150 0.847 0.396881
## stateMN 0.175522 0.062449 2.811 0.005009 **
## stateMO 0.135398 0.063977 2.116 0.034482 *
## stateMS 0.191378 0.065116 2.939 0.003344 **
## stateMT 0.121796 0.066065 1.844 0.065447 .
## stateNC 0.208732 0.061314 3.404 0.000681 ***
## stateND 0.135434 0.067392 2.010 0.044650 *
## stateNE 0.138080 0.066973 2.062 0.039409 *
## stateNH 0.109063 0.068544 1.591 0.111792
## stateNJ 0.140648 0.063496 2.215 0.026908 *
## stateNM 0.017880 0.063558 0.281 0.778505
## stateNV 0.130693 0.074698 1.750 0.080390 .
## stateNY 0.157572 0.061414 2.566 0.010394 *
## stateOH 0.082065 0.061477 1.335 0.182114
## stateOK 0.047862 0.063836 0.750 0.453518
## stateOR 0.063677 0.064092 0.994 0.320615
## statePA 0.035952 0.061700 0.583 0.560194
## stateRI 0.119539 0.091057 1.313 0.189454
## stateSC 0.069607 0.063458 1.097 0.272866
## stateSD 0.159444 0.067958 2.346 0.019097 *
## stateTN 0.040909 0.064910 0.630 0.528632
## stateTX 0.112585 0.061167 1.841 0.065877 .
## stateUT 0.111250 0.073282 1.518 0.129204
## stateVA 0.173477 0.062659 2.769 0.005700 **
## stateVT 0.018108 0.091046 0.199 0.842373
## stateWA 0.057436 0.062207 0.923 0.355997
## stateWI 0.240164 0.063371 3.790 0.000157 ***
## stateWV 0.088898 0.065326 1.361 0.173774
## stateWY 0.109808 0.072785 1.509 0.131598
## log_tot_ugrd 0.030328 0.007745 3.916 9.42e-05 ***
## perc_awd_pell -0.508393 0.031176 -16.307 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1187 on 1480 degrees of freedom
## Multiple R-squared: 0.5245, Adjusted R-squared: 0.5072
## F-statistic: 30.23 on 54 and 1480 DF, p-value: < 2.2e-16
car::vif(ipeds_lm_5)
## GVIF Df GVIF^(1/(2*Df))
## perc_awd_any_aid 1.874214 1 1.369019
## avg_aid 1.265029 1 1.124735
## state 2.478467 50 1.009118
## log_tot_ugrd 1.705701 1 1.306025
## perc_awd_pell 2.010798 1 1.418026
What happens if the state is removed? Can we compare the with and without models using an anova?
ipeds_lm_6 <- lm(
tot_grad_rt ~ perc_awd_any_aid + avg_aid +
log_tot_ugrd + perc_awd_pell,
data = final_data
)
summary(ipeds_lm_6)
##
## Call:
## lm(formula = tot_grad_rt ~ perc_awd_any_aid + avg_aid + log_tot_ugrd +
## perc_awd_pell, data = final_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.48951 -0.08609 -0.00387 0.07658 0.65970
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.116669 0.033422 3.491 0.000495 ***
## perc_awd_any_aid 0.170607 0.025581 6.669 3.58e-11 ***
## avg_aid 0.027183 0.001184 22.968 < 2e-16 ***
## log_tot_ugrd 0.041990 0.007228 5.810 7.60e-09 ***
## perc_awd_pell -0.405228 0.030081 -13.471 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1296 on 1530 degrees of freedom
## Multiple R-squared: 0.4132, Adjusted R-squared: 0.4117
## F-statistic: 269.4 on 4 and 1530 DF, p-value: < 2.2e-16
car::vif(ipeds_lm_6)
## perc_awd_any_aid avg_aid log_tot_ugrd perc_awd_pell
## 1.541813 1.129703 1.244408 1.568178
anova(ipeds_lm_6, ipeds_lm_5)
## Analysis of Variance Table
##
## Model 1: tot_grad_rt ~ perc_awd_any_aid + avg_aid + log_tot_ugrd + perc_awd_pell
## Model 2: tot_grad_rt ~ perc_awd_any_aid + avg_aid + state + log_tot_ugrd +
## perc_awd_pell
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1530 25.716
## 2 1480 20.839 50 4.877 6.9273 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the Anova results, the state variable captures significant unique signal, so it would be better to keep the state included.
Below we plot the residuals of the model in a historgram and observe that there are nearly normal residuals. This assumption is satisfied.
ggplot(data = ipeds_lm_5, aes(x = .resid)) +
geom_histogram(bins = 25) +
xlab("Residuals")
Most of our residuals below fall on a line, and are nearly normal, however we do see for some values on both extremes a tail that trails off. The model may not be as reliable when handling those extreme values.
ggplot(data = ipeds_lm_5, aes(sample = .resid)) +
stat_qq()
Here we do not see any major patterns but there are observed outliers again and a denser cloud around .3 - .5. The assumption is mildly violated but generally still seems to hold. It may be worth looking into this further in future analysis.
ggplot(data = ipeds_lm_5, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")
The final model chosen uses the proportion of students awarded any aid, the average aid amount by student, the percent of students that receive Pell, the state of the college, and the total number of students as predictors for the colleges total graduation rate.
The model produced an R-Squared value of .555 with a signiifcant p-value. About 55.5% of the variance is explained.
For every $1000 increase in the average aid awarded, it is expected that the graduation rate would rise 2.9 percentage points.
For every 10 percentage point increase in proportion of students that recevive aid attributes to 2 percentage point gain in graduation rate.
Interestingly enough we observe a drop of 5.4 percentage points when the percent of pell recipients increases 10 percentage points
Overall, the breadth of aid (proportion of students awarded) along with the depth (average aid package amount) are strongly associated with student success(graduation rate) when factoring in State effects, proportion of pell recipients, and student count.
It is important to note that there can be underlying factors which are influencing the results, such as net price, types of aid offered (Predominantly private scholarships, state aid, institutional scholarships? etc), admissions criteria and more that may better explain fluctuations in graduation rate. Further considerations may also be given to the data itself, are the outliers a result of error or do they truly represent the real world? What types of events may have influenced the data, for example would the years reported have accounted for any aid provided to aid with pandemic related issues, are the admissions or enrollment data influenced by post pandemic effects?