# load data
scorecard=read.csv("https://s3.amazonaws.com/ed-college-choice-public/Most+Recent+Cohorts+(Scorecard+Elements).csv",stringsAsFactors = F)

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

  1. Are students at private for-profit institution have lower rate of federal loan repayment compared to students at non-profit institutions?

  2. Predict federal loan repayment rate of students from an institution based on the data collected by Integrated Postsecondary Education Data System (IPEDS), National Student Loan Data System (NSLDS) and Administrative Earnings Data from Tax Records.

Cases

What are the cases, and how many are there?

Each case is aggregation of following data collected for an institution for the cohort under study
1. Education data of the institution (graduation rates, student subgroups, tuition, cost of attendance etc.) (IPEDS)
2. Federal financial aid for the institution for the cohort under consideration
3. Earnings data of the students from cohort under consideration

There are 7804 cases available. If required, I may pick a smaller sample from these cases for the project after doing initial analysis. I will check with you when I like to do so.

Data collection

Describe the method of data collection.
Education Data- Collected annually through surveys administered by the Department of Education’s National Center for Education Statistics (NCES)

Federal Financial Aid - Collected from federal load records. This data is taken from the National Student Loan Data System (NSLDS) which is the Department of Education’s central database for monitoring federal student aid—primarily federal student loans and Pell grants

Earnings Data - From Tax records

Type of study

What type of study is this (observational/experiment)?
This study is observational.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Data source: https://collegescorecard.ed.gov/data/

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is RPY_3YR_RT_SUPP (Three year repayment rate considering repayments that results in declining balance). The response variable is numerical

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The main explanatory variable is CONTROL (indicates if institution is for-profit, not for-profit or public). The explanatory variable is categorical

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

#Convert response to numerics
scorecard$RPY_3YR_RT_SUPP=as.numeric(scorecard$RPY_3YR_RT_SUPP)
sumstat=aggregate(scorecard$RPY_3YR_RT_SUPP,list(scorecard$CONTROL),summary)
sumstat$Group.1=c("Public","Private NonProfit","Private for-Profit")
stats=sumstat$Group.1
stats=cbind(stats,sumstat$x)
sds=aggregate(scorecard$RPY_3YR_RT_SUPP,list(scorecard$CONTROL),sd,na.rm=T)
stats=cbind(stats,sds$x)
colnames(stats)[c(1,9)]=c("Inst_Type","sd")
stats=as.data.frame(stats)
stats$`NA's`=NULL

kable(stats,caption = "Repayment Rates")
Repayment Rates
Inst_Type Min. 1st Qu. Median Mean 3rd Qu. Max. sd
Public 0.2036 0.5451 0.6601 0.6628 0.7836 0.979 0.155864606300111
Private NonProfit 0.09091 0.7096 0.8183 0.7709 0.8998 1 0.181342285034381
Private for-Profit 0.05051 0.3795 0.4624 0.4833 0.5909 1 0.162984001903471