Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

NYC 2019 DOE High School Directory

I decided that I’d look into whether a multiple regression model could be appropriately applied to NYC high schools and the college career rates of their students. The data utilized here, 2019 DOE High School Directory, was taken from NYC Open Data. This is a continuation of the previous assignment which focused on linear regression.

Data Cleaning

json <- fromJSON("https://data.cityofnewyork.us/resource/g2qs-86ey.json")[c("boro", "pct_stu_safe", "college_career_rate")]
json["pct_stu_safe"] <- as.numeric(unlist(json["pct_stu_safe"]))
json["college_career_rate"] <- as.numeric(unlist(json["college_career_rate"]))
json <- json[complete.cases(json),]

Some very light data cleaning was performed. This time, only three columns were taken out of the original file

  • the borough
  • the percentage of students who felt safe in each high school
  • the rate in which students enrolled in a college or career program within six months of graduating

These columns were then assigned the correct data type, and any schools without information in one or more columns were removed.

Analysis

Whether the boroughs as a group or on their own, how safe students felt in their high schools, or a combination of the two have a direct influence on the college or career rate six months after graduation is yet to be seen.

Boroughs

Each of the boroughs are represented to varying degrees in the data. The boroughs bore the following labels:

  • K: Brooklyn; the K is for Kings County, the official county name of the borough
  • M: Manhattan; the M is for the first letter of Manhattan
  • Q: Queens; the Q is for the first letter of Queens
  • R: Staten Island; the R is for Richmond County, the official county name of the borough
  • X: Bronx; the X is for the final letter of Bronx
ggplot(json, aes(boro, college_career_rate, color=boro)) + 
  geom_point(alpha=I(0.3)) +
  labs(x = "Borough", y="College Career Rate", color="Borough") +
  scale_color_manual(values=c("red", "orange", "green", "blue", "purple"))

This quick overview of the data shows the rate in which students were either enrolled in college or a career program within six months of graduation for each high school, separated by borough. This preliminary display of the data shows that there were more opportunities for students in Manhattan and Queens to succeed after high school, and less opportunities for those in Brooklyn and the Bronx.

p1 <- ggplot() + 
  geom_bar(data=json[json$boro == "K",], aes(college_career_rate), fill="red") +
  scale_y_continuous(breaks=seq(0, 20, 1), limits=c(0,7)) +
  scale_x_continuous(breaks=seq(0,1,0.01), labels=function(x) ifelse(round((x*100)%%5)==0, x, ""), limits=c(0,1)) +
  guides(fill=FALSE) +
  labs(title="Brooklyn", x="College Career Rate") +
  theme(axis.text.x = element_text(angle=90), panel.grid = element_blank(), panel.grid.major.y = element_line(color="white"))
p2 <- ggplot() + 
  geom_bar(data=json[json$boro == "M",], aes(college_career_rate), fill="orange") +
  scale_y_continuous(breaks=seq(0, 20, 1), limits=c(0,7)) +
  scale_x_continuous(breaks=seq(0,1,0.01), labels=function(x) ifelse(round((x*100)%%5)==0, x, ""), limits=c(0,1)) +
  guides(fill=FALSE) +
  labs(title="Manhattan", x="College Career Rate") +
  theme(axis.text.x = element_text(angle=90), panel.grid = element_blank(), panel.grid.major.y = element_line(color="white"))
p3 <- ggplot() + 
  geom_bar(data=json[json$boro == "Q",], aes(college_career_rate), fill="green") +
  scale_y_continuous(breaks=seq(0, 20, 1), limits=c(0,7)) +
  scale_x_continuous(breaks=seq(0,1,0.01), labels=function(x) ifelse(round((x*100)%%5)==0, x, ""), limits=c(0,1)) +
  guides(fill=FALSE) +
  labs(title="Queens", x="College Career Rate") +
  theme(axis.text.x = element_text(angle=90), panel.grid = element_blank(), panel.grid.major.y = element_line(color="white"))
p4 <- ggplot() + 
  geom_bar(data=json[json$boro == "R",], aes(college_career_rate), fill="blue") +
  scale_y_continuous(breaks=seq(0, 20, 1), limits=c(0,7)) +
  scale_x_continuous(breaks=seq(0,1,0.01), labels=function(x) ifelse(round((x*100)%%5)==0, x, ""), limits=c(0,1)) +
  guides(fill=FALSE) +
  labs(title="Staten Island", x="College Career Rate") +
  theme(axis.text.x = element_text(angle=90), panel.grid = element_blank(), panel.grid.major.y = element_line(color="white"))
p5 <- ggplot() + 
  geom_bar(data=json[json$boro == "X",], aes(college_career_rate), fill="purple") +
  scale_y_continuous(breaks=seq(0, 20, 1), limits=c(0,7)) +
  scale_x_continuous(breaks=seq(0,1,0.01), labels=function(x) ifelse(round((x*100)%%5)==0, x, ""), limits=c(0,1)) +
  guides(fill=FALSE) +
  labs(title="Bronx", x="College Career Rate") +
  theme(axis.text.x = element_text(angle=90), panel.grid = element_blank(), panel.grid.major.y = element_line(color="white"))
grid.arrange(p1, p2, p3, p4, p5, ncol=2, nrow=3)

When a more in-depth observation is made of each of the boroughs, there is a clear pattern apparent in each borough. For example, in the Bronx and in Staten Island, there is a roughly normal distribution, indicating a child going to a high school in either of these boroughs has a roughly equal chance of ending up in a college or career program six months after their graduation. Meanwhile, in Brooklyn, 66 of the 110 - or 60% - students attending high school go on to a career or college program after six months. This compares with Manhattan’s 75.27% and Queens’s 77.14%.

To further this borough-by-borough analysis, producing a linear model could be beneficial.

boro_rate <- lm(college_career_rate ~ boro, data=json)
sum_boro <- summary(boro_rate)
sum_boro
## 
## Call:
## lm(formula = college_career_rate ~ boro, data = json)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42271 -0.13179 -0.01673  0.13821  0.47510 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.55673    0.01725  32.272  < 2e-16 ***
## boroM        0.09413    0.02549   3.693 0.000254 ***
## boroQ        0.10599    0.02766   3.831 0.000149 ***
## boroR        0.06627    0.05976   1.109 0.268132    
## boroX       -0.03183    0.02500  -1.273 0.203747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1809 on 378 degrees of freedom
## Multiple R-squared:  0.09301,    Adjusted R-squared:  0.08341 
## F-statistic: 9.691 on 4 and 378 DF,  p-value: 1.803e-07

The residuals, on the whole, are nearly normally distributed. The adjusted \(R^{2}\) value is 8.341078%, which means it is unlikely this linear model accounts in its entirety for the college and career rate of students from different boroughs. Those boroughs where there is a clear influence, however, are Brooklyn, Manhattan, and Queens.

Student Safety

As previously explored, how safe a student feels in a given high school has a direct impact on whether or not they go on to a post-high school program for furthering their professional or academic careers six months after graduation.

qplot(pct_stu_safe, college_career_rate, data=json, alpha=I(0.3), xlab="% Students Feel Safe", ylab="College Career Rate") + 
  geom_smooth(se = FALSE, method = "loess", formula = y ~ x)

College and career rates of recent high school graduates appears to scale exponentially with how safe these students felt while attending high school.

safe_lm <- lm(college_career_rate ~ pct_stu_safe, data=json)
sum_safe <- summary(safe_lm)
sum_safe
## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_safe, data = json)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3751 -0.1183  0.0002  0.1216  0.3969 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.40634    0.08609   -4.72 3.32e-06 ***
## pct_stu_safe  1.19448    0.10249   11.65  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1625 on 381 degrees of freedom
## Multiple R-squared:  0.2628, Adjusted R-squared:  0.2609 
## F-statistic: 135.8 on 1 and 381 DF,  p-value: < 2.2e-16

The residuals are nearly normal based off of the information provided above. Additionally, the adjusted \(R^{2}\) value is 26.088808%, indicating that this is more of a factor in determining whether or not students will pursue something more after high school.

While better than the previous model and clearly showing a correlation, it is not enough on its own.

Boroughs and Student Safety

When both the borough of a high school is considered alongside how safe students feel in a given school, a fuller picture is provided.

fit <- lm(college_career_rate ~ ., data=json)
sum_fit <- summary(fit)
sum_fit
## 
## Call:
## lm(formula = college_career_rate ~ ., data = json)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41188 -0.12005 -0.00484  0.12380  0.37003 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.34106    0.08833  -3.861 0.000133 ***
## boroM         0.03621    0.02322   1.559 0.119790    
## boroQ         0.07377    0.02466   2.992 0.002958 ** 
## boroR         0.03781    0.05291   0.715 0.475347    
## boroX        -0.02011    0.02213  -0.909 0.364099    
## pct_stu_safe  1.09486    0.10610  10.319  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.16 on 377 degrees of freedom
## Multiple R-squared:  0.2928, Adjusted R-squared:  0.2834 
## F-statistic: 31.21 on 5 and 377 DF,  p-value: < 2.2e-16

As can be seen, the adjusted \(R^{2}\) for this multiple regression model is 28.3379803%. Logistically, one might have expected the adjusted \(R^{2}\) to be equal to the sum of that for the boroughs and the percentage of students who felt safe in their high schools, but this was not the case. (8.341078 + 26.088808 \(\neq\) 28.3379803)

Conclusion

This linear model, while appropriate, is limited. A model with a higher adjusted \(R^{2}\) would be preferred. If all a parent has access to is the borough a high school is located in and how safe the students felt in each high school, this model will do well enough.