I made this data from a previous project but didn’t complete this particular analysis with the data.
It’s NYS counties broken down by average annual school aid and graduation and drop-out rates.
combo_school_aid_url <- "https://raw.githubusercontent.com/jbrnbrg/proj_proposal_606/master/jbrnbrg-project606-proposal20170322.csv"
school_aid_grad_NYS<- read.csv(text = getURL(combo_school_aid_url), stringsAsFactors = FALSE)
head(school_aid_grad_NYS)
## X COUNTY_NAME AVG_AID STU_POP_4YRAVG USD_AID_PER_STU GRAD_RT
## 1 1 albany 235888446 41046 5746.929 0.8592629
## 2 2 allegany 106406561 7726 13772.529 0.8605809
## 3 3 broome 258815172 26419 9796.554 0.8834121
## 4 4 cattaraugus 171343074 10763 15919.639 0.8801755
## 5 5 cayuga 96665641 10002 9664.631 0.8719947
## 6 6 chautauqua 225452963 18031 12503.631 0.8780143
## DROP_RT
## 1 0.09063595
## 2 0.04613393
## 3 0.07345100
## 4 0.08180304
## 5 0.08007588
## 6 0.08269780
sm_src <- school_aid_grad_NYS
ggplot(data = sm_src, aes(x=USD_AID_PER_STU, y=DROP_RT)) +
geom_point(size = 2, alpha = .4) +
scale_x_continuous(label = dollar_format()) +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Average Annual Student Aid per Student vs. County Drop-out Rates:",
x = "Annual Student Aid per Student by County",
y = "HS Drop-out Rates by County")
There looks like there’s a trend:
cor(sm_src$DROP_RT, sm_src$USD_AID_PER_STU)
## [1] 0.2702581
I’ll create the linear model:
jmb_m1 <- lm(DROP_RT~USD_AID_PER_STU, data = sm_src)
ggplot(data = sm_src, aes(x=USD_AID_PER_STU, y=DROP_RT)) +
geom_point(size = 2, alpha = .4) +
scale_x_continuous(label = dollar_format()) +
#geom_abline(intercept = 0.9238, slope = -0.000004511, color = "blue") +
geom_smooth(method = "lm", se = FALSE) +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Average Annual Student Aid per Student vs. County Drop-out Rates:",
x = "Annual Student Aid per Student by County",
y = "HS Drop-out Rates by County")
I’ll review the residuals to see if they’re normally dispersed about zero and then review the qqnorm plot of the linear model’s residuals. This will help support a normal determination:
ggplot(data = sm_src, aes(x=USD_AID_PER_STU, y=jmb_m1$residuals)) +
geom_point(size = 2, alpha = .4) +
scale_x_continuous(label = dollar_format()) +
geom_abline(intercept = 0, slope = 0, color = "blue") +
theme(panel.grid.major = element_line(color = "gray")) +
labs(title = "Average Annual Student Aid per Student vs. Model Residuals:",
x = "Annual Student Aid per Student by County",
y = "Model Residuals")
qqnorm(jmb_m1$residuals)
qqline(jmb_m1$residuals)
hist(jmb_m1$residuals)
summary(jmb_m1)
##
## Call:
## lm(formula = DROP_RT ~ USD_AID_PER_STU, data = sm_src)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.043573 -0.017160 -0.002709 0.012247 0.088565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.979e-02 1.144e-02 4.352 5.78e-05 ***
## USD_AID_PER_STU 2.233e-06 1.063e-06 2.101 0.0402 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0259 on 56 degrees of freedom
## Multiple R-squared: 0.07304, Adjusted R-squared: 0.05649
## F-statistic: 4.412 on 1 and 56 DF, p-value: 0.04019