# Load the College Scorecard data
scorecard <- read.csv("https://s3.amazonaws.com/ed-college-choice-public/Most+Recent+Cohorts+(Scorecard+Elements).csv", stringsAsFactors = FALSE)
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Do students at private for-profit institutions have a lower federal loan repayment rate than students at non-profit institutions?
Predict the federal loan repayment rate of an institution's students based on data collected by the Integrated Postsecondary Education Data System (IPEDS), the National Student Loan Data System (NSLDS), and administrative earnings data from tax records.
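One possible way to examine the first question is to compare mean repayment rates between the two institution types. The sketch below uses made-up toy values rather than the Scorecard data, and the Welch t-test is only an assumed choice of method, not necessarily the analysis used later in the project. Because the study is observational, such a comparison can show association only, not causation.

```r
# Toy repayment rates (made up) for two institution types
for_profit <- c(0.42, 0.51, 0.38, 0.47, 0.55)
non_profit <- c(0.78, 0.81, 0.69, 0.90, 0.74)

# Observed difference in mean repayment rate (for-profit minus non-profit)
mean_diff <- mean(for_profit) - mean(non_profit)

# Welch two-sample t-test; alternative = "less" tests whether the
# for-profit mean is lower than the non-profit mean
result <- t.test(for_profit, non_profit, alternative = "less")
```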
What are the cases, and how many are there?
Each case is an aggregation of the following data collected for an institution for the cohort under study:
1. Education data of the institution (graduation rates, student subgroups, tuition, cost of attendance etc.) (IPEDS)
2. Federal financial aid for the institution for the cohort under consideration
3. Earnings data of the students from cohort under consideration
There are 7804 cases available. If required, I may pick a smaller sample from these cases for the project after doing initial analysis. I will check with you if I would like to do so.
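Counting the cases and drawing a smaller random sample could look like the sketch below. It uses a tiny made-up data frame in place of the real `scorecard` object; the column name `id`, the seed, and the sample size are all hypothetical.

```r
# Toy stand-in for the scorecard data (values are made up for illustration)
toy <- data.frame(
  id = 1:10,
  CONTROL = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 1)
)

# Number of cases: one row per institution
n_cases <- nrow(toy)

# Draw a simple random sample of rows without replacement
set.seed(42)       # arbitrary seed, for reproducibility only
sample_size <- 5   # hypothetical sample size
sampled <- toy[sample(n_cases, sample_size), ]
```

With the real data, `nrow(scorecard)` would give the case count in the same way.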
Describe the method of data collection.
Education Data - Collected annually through surveys administered by the Department of Education’s National Center for Education Statistics (NCES).
Federal Financial Aid - Collected from federal loan records. These data are taken from the National Student Loan Data System (NSLDS), the Department of Education’s central database for monitoring federal student aid, primarily federal student loans and Pell grants.
Earnings Data - From tax records.
What type of study is this (observational/experiment)?
This study is observational.
If you collected the data, state self-collected. If not, provide a citation/link.
Data source: https://collegescorecard.ed.gov/data/
What is the response variable, and what type is it (numerical/categorical)?
The response variable is RPY_3YR_RT_SUPP (the three-year repayment rate, counting only repayments that result in a declining balance). The response variable is numerical.
What is the explanatory variable, and what type is it (numerical/categorical)?
The main explanatory variable is CONTROL (indicates whether the institution is public, private nonprofit, or private for-profit). The explanatory variable is categorical.
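Since CONTROL is stored as a numeric code, it can help to turn it into a labeled factor before analysis. The sketch below uses made-up codes and assumes the coding 1 = public, 2 = private nonprofit, 3 = private for-profit, which is the group order used in the summary code for this project.

```r
# Toy CONTROL codes (made up); the labels assume 1 = public,
# 2 = private nonprofit, 3 = private for-profit
control_codes <- c(1, 3, 2, 3, 1)

inst_type <- factor(
  control_codes,
  levels = c(1, 2, 3),
  labels = c("Public", "Private NonProfit", "Private for-Profit")
)

# Count institutions of each type
type_counts <- table(inst_type)
```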
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
# Load knitr for kable()
library(knitr)
# Convert the response to numeric; non-numeric entries become NA
scorecard$RPY_3YR_RT_SUPP <- suppressWarnings(as.numeric(scorecard$RPY_3YR_RT_SUPP))
# Summary statistics of the repayment rate by institution type
sumstat <- aggregate(scorecard$RPY_3YR_RT_SUPP, list(scorecard$CONTROL), summary)
# CONTROL groups sort as 1, 2, 3: public, private nonprofit, private for-profit
sumstat$Group.1 <- c("Public", "Private NonProfit", "Private for-Profit")
stats <- cbind(Inst_Type = sumstat$Group.1, sumstat$x)
# Standard deviation by institution type, ignoring missing values
sds <- aggregate(scorecard$RPY_3YR_RT_SUPP, list(scorecard$CONTROL), sd, na.rm = TRUE)
stats <- cbind(stats, sd = round(sds$x, 4))
stats <- as.data.frame(stats)
# Drop the NA count column before printing
stats$`NA's` <- NULL
kable(stats, caption = "Repayment Rates")
| Inst_Type | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | sd |
|---|---|---|---|---|---|---|---|
| Public | 0.2036 | 0.5451 | 0.6601 | 0.6628 | 0.7836 | 0.979 | 0.1559 |
| Private NonProfit | 0.09091 | 0.7096 | 0.8183 | 0.7709 | 0.8998 | 1 | 0.1813 |
| Private for-Profit | 0.05051 | 0.3795 | 0.4624 | 0.4833 | 0.5909 | 1 | 0.163 |
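The summary table reports means and SDs but not the per-group sample sizes. A minimal sketch of one way to get the non-missing count alongside the mean and SD is shown below. It runs on a made-up toy data frame that reuses the `CONTROL` and `RPY_3YR_RT_SUPP` column names; the values are not from the Scorecard data.

```r
# Toy data (made up): repayment rate with missing values, by CONTROL code
toy <- data.frame(
  CONTROL = c(1, 1, 1, 2, 2, 3, 3, 3),
  RPY_3YR_RT_SUPP = c(0.61, NA, 0.70, 0.82, 0.79, NA, 0.45, 0.50)
)

# Non-missing sample size, mean, and SD for each institution type.
# The formula interface drops rows with NA by default (na.action = na.omit).
group_stats <- aggregate(
  RPY_3YR_RT_SUPP ~ CONTROL,
  data = toy,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)
```

The result holds one row per CONTROL code, with the three statistics stored as columns of the matrix `group_stats$RPY_3YR_RT_SUPP`.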