Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

NYC 2019 DOE High School Directory

I decided that I’d look into whether there was a linear model that could be applied to NYC high schools and the college career rates of their students. The data utilized here, 2019 DOE High School Directory, was taken from NYC Open Data. To start with, I had some data cleaning to do.

Data Cleaning

json <- fromJSON("https://data.cityofnewyork.us/resource/g2qs-86ey.json")[c("school_name", "boro", "total_students", "pct_stu_enough_variety", "pct_stu_safe", "college_career_rate")]
json["total_students"] <- as.numeric(unlist(json["total_students"]))
json["pct_stu_enough_variety"] <- as.numeric(unlist(json["pct_stu_enough_variety"]))
json["pct_stu_safe"] <- as.numeric(unlist(json["pct_stu_safe"]))
json["college_career_rate"] <- as.numeric(unlist(json["college_career_rate"]))
json = json[complete.cases(json),]

Some very light data cleaning was performed. Only six columns were taken out of the original file

the school name, which was used as an identifier for individual high schools schools
the borough
the total number of students in each high school
the percentage of students who felt there was enough variety in student activities in their high school
the percentage of students who felt safe in each high school
the rate in which students enrolled in a college or career program within six months of graduating

These columns were then assigned the correct data type, and any schools without information in one or more columns were removed.

Analysis

To begin with determining the linear model, visualization is necessary, followed by analysis of the residuals. There are three main comparisons to observe:

total students in a high school to the rate of those students enrolling in college or career programs within six months of graduating,
percentage of students who felt there was enough variety of student activities in a high school to the rate of those students enrolling in college or career programs within six months of graduating,
percentage of students who feel safe in their high school to the rate of those students enrolling in college or career programs within six month sof graduating

Total Students and College/Career Rate

“Total students” is a misnomer here; due to the large variety in school sizes (schools range between 135 and 5838), the numbers representing the total students had to be normalized. That is, all of the schools were compared against each other when it came to their size and assigned a number on a smaller scale to fit their relative size of student population.

boro_letters <- unlist(unique(json["boro"]))
json["z_students"] <- scale(json["total_students"])
ab_college_z_students <- data.frame(matrix(ncol=3,nrow=0))
names(ab_college_z_students) <- c("intercept", "slope", "boro_letters")
ab_college_z_students[1,] <- c(lm(college_career_rate ~ z_students, data=json)[[1]], "all")
for (boro_num in 1:length(boro_letters)){
  ab_college_z_students[nrow(ab_college_z_students)+1,] <- c(lm(college_career_rate ~ z_students, data=json[which(json["boro"] == boro_letters[[boro_num]]),])[[1]], boro_letters[boro_num])
}
ab_college_z_students["intercept"] <- as.numeric(unlist(ab_college_z_students["intercept"]))
ab_college_z_students["slope"] <- as.numeric(unlist(ab_college_z_students["slope"]))
qplot(z_students, college_career_rate, data=json, alpha=I(0.3), xlab="Normalized Number of Students", ylab="College Career Rate", color=boro) + 
  geom_abline(data=ab_college_z_students, aes(intercept=intercept, slope=slope, color=boro_letters)) +
  scale_color_manual(values=c("black","red", "orange", "green", "blue", "purple"))

Based on the above graph, should this linear model be appropriate, the following results can be expected:

Overall, the number of students in a high school has a positive correlation with the rate of its students enrolling in college or a career program
There is next to no correlation with Queens schools
The number of students in a Staten Island high school has a negative correlation with the college career rate of Staten Island high school students

Overall Validity of the Linear Model

current_lm <- lm(college_career_rate ~ z_students, data=json)
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ z_students, data = json)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42097 -0.14875 -0.02347  0.14687  0.42560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.592376   0.009428  62.832  < 2e-16 ***
## z_students  0.041951   0.009440   4.444 1.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1845 on 381 degrees of freedom
## Multiple R-squared:  0.04928,    Adjusted R-squared:  0.04678 
## F-statistic: 19.75 on 1 and 381 DF,  p-value: 1.16e-05

Right away, it can be observed that the residuals are likely to be normally distributed by their mirroring and the very low median residual. The distribution of the residuals can be observed more clearly by plotting the values.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

There is a skew present in these residuals, which indicates immediately that this model may not be the best pick, but the residauls are more normally distributed than not. This is further backed up when the coefficients for this linear model are examined. The z-score for the number of students is 4.4438933x the standard error, which does not quite make the threshold for being a decent residual analysis - it would have to be at least 5, maximum 10, times the size of the standard error to be considered.

Lastly, the chance the size of the student population in a college high school has no effect on whether or not its students go on to a college or career program is very slim at 1.16e-05, or 0.007816.

While not a terrible pick in the least, it warrants looking into each of the other boroughs.

Validity of the Linear Model for Brooklyn

current_lm <- lm(college_career_rate ~ z_students, data=json[which(json["boro"]=="K"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ z_students, data = json[which(json["boro"] == 
##     "K"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35831 -0.11025 -0.01432  0.10733  0.43129 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.55744    0.01586  35.148  < 2e-16 ***
## z_students   0.04450    0.01318   3.375  0.00103 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1663 on 108 degrees of freedom
## Multiple R-squared:  0.0954, Adjusted R-squared:  0.08702 
## F-statistic: 11.39 on 1 and 108 DF,  p-value: 0.001027

There is nearly mirroring occuring with Brooklyn’s residuals, but nowhere near the level there was for all the boroughs overall. Once again there is promise for this linear model, as there is also a small residual for the median.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

These residuals present with a tall middle peak and a clear desire to be normally distributed, though they are not quite. Whether or not this model should be used goes to the coefficients next. The z-score for the number of Brooklyn students is 3.3748478x the standard error, which fails to make the threshold for being a decent residual analysis - it would have to be at least 5, maximum 10, times the size of the standard error to be considered.

Lastly, the chance the size of the student population in a college high school has no effect on whether or not its students go on to a college or career program is unlikely at 0.00103.

Brooklyn introduces doubt to this linear model.

Validity of the Linear Model for Manhattan

current_lm <- lm(college_career_rate ~ z_students, data=json[which(json["boro"]=="M"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ z_students, data = json[which(json["boro"] == 
##     "M"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29879 -0.12237 -0.03074  0.12741  0.35168 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.66682    0.01882  35.424  < 2e-16 ***
## z_students   0.09415    0.02882   3.267  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1753 on 91 degrees of freedom
## Multiple R-squared:  0.105,  Adjusted R-squared:  0.09512 
## F-statistic: 10.67 on 1 and 91 DF,  p-value: 0.001536

With Manhattan there is another case of the residuals being almost mirrored, though the residual of the median is larger this time around. It seems they might be normally distributed.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

That is, until a graph is drawn and it becomes clear as day they are vaguely normally distributed. This implies it will be difficult to trust this model when it comes to Manhattan, and therefore it likely should not be trusted for determining which high school is most likely to have its students go on to college or a career program. Does Manhattan make the threshold for being a decent residual analysis? This linear model signals once more: not quite; the z-score for the number of Manhattan students is 3.2666429x the standard error.

Like Brooklyn, the school’s size still proves to be important for Manhattan, with there being a 0.00154 chance otherwise.

Validity of the Linear Model for Queens

current_lm <- lm(college_career_rate ~ z_students, data=json[which(json["boro"]=="Q"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ z_students, data = json[which(json["boro"] == 
##     "Q"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42170 -0.14187  0.01186  0.15504  0.33723 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.662237   0.025285  26.191   <2e-16 ***
## z_students  0.001057   0.019692   0.054    0.957    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.198 on 68 degrees of freedom
## Multiple R-squared:  4.24e-05,   Adjusted R-squared:  -0.01466 
## F-statistic: 0.002884 on 1 and 68 DF,  p-value: 0.9573

Queens has residuals that are nearly mirrored, and the smallest absolute residual number of the median thus far.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

The multiple high peaks of this graph of the residuals indicates a comb pattern. This data, while it might appear not to be the case at first glance, likely does not have a normal distribution. The coefficients do not help this, with the standard deviation being over floor(summary(current_lm)$coefficients[2,2]/summary(current_lm)$coefficients[2,1])x the estimated values, when in fact, it would have to be the other way around and not quite so intense a number in order for this linear model to be considered reliable for Queens.

There is a 0.957 chance this linear model fails to predict whether or not a student at a Queens high school will go on to a college or career program. This would be enough to throw the model out on its own, but for the sake of being thorough, there are two more boroughs to go.

Validity of the Linear Model for Staten Island

current_lm <- lm(college_career_rate ~ z_students, data=json[which(json["boro"]=="R"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ z_students, data = json[which(json["boro"] == 
##     "R"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35234 -0.10440  0.00981  0.14143  0.25953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.65375    0.09621   6.795 0.000139 ***
## z_students  -0.02176    0.04735  -0.460 0.658042    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2187 on 8 degrees of freedom
## Multiple R-squared:  0.02572,    Adjusted R-squared:  -0.09606 
## F-statistic: 0.2112 on 1 and 8 DF,  p-value: 0.658

When it comes to Staten Island, there is little to no mirroring of the residuals; the only thing the residuals have leaning towards the favor of this linear model is that the residual of the median has taken the crown from Queens for being the smallest.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

There is no question: Staten Island’s residuals are not normally distributed at all, likely to their small quantity. The standard deviation is -0.45959x the estimated value, making it the highest absolute value thus far. The key word there is absolute. That negative sign means there’s a negative correlation, so the linear model actually suggests for Staten Island that the more students there are in a high school, the less likely it is those students will go on to take part in a college or career program.

This model has a 0.658042 chance of not being relevant for determining enrolment in a college or career program.

Validity of the Linear Model for the Bronx

current_lm <- lm(college_career_rate ~ z_students, data=json[which(json["boro"]=="X"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ z_students, data = json[which(json["boro"] == 
##     "X"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35871 -0.10881 -0.02583  0.11281  0.49187 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.55720    0.01964  28.367  < 2e-16 ***
## z_students   0.11451    0.03819   2.999  0.00344 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1643 on 98 degrees of freedom
## Multiple R-squared:  0.08404,    Adjusted R-squared:  0.07469 
## F-statistic: 8.991 on 1 and 98 DF,  p-value: 0.003438

Lastly, there’s the Bronx. It’s residuals are not remotely mirrored or similar, which means the residuals are likely not normally distributed.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

The graph agrees, presenting with a right skew. At 2.9985283x, the size of the student population of a high school in the Bronx is modestly influential at best, though there is a 0.0034 chance that this model is not applicable for the Bronx.

Variety of Student Activities and College/Career Rate

Student activities get students more interested in being in school so they can participate in these activities. For example, certain sports or groups require students maintain a certain average, which inherently means students being present if they wish to continue whatever activity they are enrolled in. Activities are, on the whole, things students find enjoyable and should result in a positive correlation across the board.

boro_letters <- unlist(unique(json["boro"]))
ab_college_variety <- data.frame(matrix(ncol=3,nrow=0))
names(ab_college_variety) <- c("intercept", "slope", "boro_letters")
ab_college_variety[1,] <- c(lm(college_career_rate ~ pct_stu_enough_variety, data=json)[[1]], "all")
for (boro_num in 1:length(boro_letters)){
  ab_college_variety[nrow(ab_college_variety)+1,] <- c(lm(college_career_rate ~ pct_stu_enough_variety, data=json[which(json["boro"] == boro_letters[[boro_num]]),])[[1]], boro_letters[boro_num])
}
ab_college_variety["intercept"] <- as.numeric(unlist(ab_college_variety["intercept"]))
ab_college_variety["slope"] <- as.numeric(unlist(ab_college_variety["slope"]))
qplot(pct_stu_enough_variety, college_career_rate, data=json, alpha=I(0.3), xlab="% Enough Variety in Student Activities", ylab="College Career Rate", color=boro) + 
  geom_abline(data=ab_college_variety, aes(intercept=intercept, slope=slope, color=boro_letters)) +
  scale_color_manual(values=c("black","red", "orange", "green", "blue", "purple"))

This is why data science exists; common sense and instinct sometimes get in the way of the cold, hard facts. Looking at our previous model, Staten Island can be observed being contrary again. This particular model implies the following:

Overall, it is not a bad model - that is, it seems like it might as accurate as the first
It cannot be trusted when it comes to Brooklyn or Staten Island

Overall Validity of the Linear Model

current_lm <- lm(college_career_rate ~ pct_stu_enough_variety, data=json)
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_enough_variety, data = json)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43741 -0.14544 -0.01516  0.13445  0.42563 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.44179    0.05300   8.335 1.42e-15 ***
## pct_stu_enough_variety  0.20224    0.07002   2.888  0.00409 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1872 on 381 degrees of freedom
## Multiple R-squared:  0.02143,    Adjusted R-squared:  0.01886 
## F-statistic: 8.343 on 1 and 381 DF,  p-value: 0.004092

The residuals present in the overall model suggest near-reflective qualities - that is, the absolute value of the residuals provided are nearly equal, and the residual of the median is small.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

The residuals lack a defined peak overall, but they appear to be mostly normally distributed, which is a good sign for the validity of this model. As a reminder, a good sign for the appropriateness of a linear model is to compare the standard error of the given variable for the model to the estimated variable; if the estimate is a 5x to 10x the standard error, it hints to a good linear model. This model in particular, when compared to all the data, has am estimate that is 2.8884783x the standard error. This number is too low for it to be worthwhile overall, but again, it is worth looking into its effects borough-by-borough.

Validity of the Linear Model for Brooklyn

current_lm <- lm(college_career_rate ~ pct_stu_enough_variety, data=json[which(json["boro"]=="K"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_enough_variety, data = json[which(json["boro"] == 
##     "K"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37549 -0.10600 -0.02345  0.12312  0.41025 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.57047    0.10155   5.618 1.52e-07 ***
## pct_stu_enough_variety -0.01849    0.13477  -0.137    0.891    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1749 on 108 degrees of freedom
## Multiple R-squared:  0.0001742,  Adjusted R-squared:  -0.009083 
## F-statistic: 0.01882 on 1 and 108 DF,  p-value: 0.8911

The absolute value of the residuals given in the summary indicate there may be too great a difference for there to be a normal distribution. The only way to know for certain is to go through the residuals or to graph them.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + 
  xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

Despite its odd shape, the linear model for Brooklyn has a relatively normal distribution. It would appear with its negative coefficient ratio of -0.1371752 that, when it comes to Brooklyn, there was a negative correlation between there being a variety of student activities in a given high school and the students from that school going on to enroll in college or a career program of some kind.

There is a 0.891 chance that this is a poor linear model for Brooklyn.

Validity of the Linear Model for Manhattan

current_lm <- lm(college_career_rate ~ pct_stu_enough_variety, data=json[which(json["boro"]=="M"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_enough_variety, data = json[which(json["boro"] == 
##     "M"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30365 -0.14151  0.00921  0.14206  0.36493 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.38505    0.09933   3.877  0.00020 ***
## pct_stu_enough_variety  0.35717    0.13114   2.724  0.00774 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1782 on 91 degrees of freedom
## Multiple R-squared:  0.07537,    Adjusted R-squared:  0.06521 
## F-statistic: 7.418 on 1 and 91 DF,  p-value: 0.007743

The residuals are nearly mirrored, and the median’s residual is extremely small. This bodes well for the residuals being normally distributed.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

The residuals are a little shaky, but also are nearly normally distributed. The coefficient ratio 2.7235304 fails to fall between 5 and 10, and this indicates that it might not be the best pick for Manhattan. There is a less than 1% chance (0.7742994% chance) that this model is not relevant for determining whether or not students who graduate high school in Manhattan go on to enroll in a college or career program.

Validity of the Linear Model for Queens

current_lm <- lm(college_career_rate ~ pct_stu_enough_variety, data=json[which(json["boro"]=="Q"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_enough_variety, data = json[which(json["boro"] == 
##     "Q"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40894 -0.14386  0.00945  0.14821  0.32365 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.59911    0.12672   4.728 1.19e-05 ***
## pct_stu_enough_variety  0.08305    0.16257   0.511    0.611    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1977 on 68 degrees of freedom
## Multiple R-squared:  0.003823,   Adjusted R-squared:  -0.01083 
## F-statistic: 0.261 on 1 and 68 DF,  p-value: 0.6111

Once again the median residual is small, and the first and third quantiles are nearly mirrored; the true difference lies between the min and the max. This implies there may be something like a normal distribution, but not quite.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

The graphs confirm this; the linear model as a histogram has a comb pattern, and the meat of the points in the qqplot are normally distributed in clumps with there being great difference between the two tails. With the estimate to standard error ratio coming in at 0.5108603, this model shouldn’t be used. In addition, there is a chance of 0.611 that this linear model is not be relevant for Queens.

Validity of the Linear Model for Staten Island

current_lm <- lm(college_career_rate ~ pct_stu_enough_variety, data=json[which(json["boro"]=="R"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_enough_variety, data = json[which(json["boro"] == 
##     "R"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34611 -0.09813 -0.00390  0.13022  0.30047 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)              0.9477     0.5448   1.740    0.120
## pct_stu_enough_variety  -0.3916     0.6519  -0.601    0.565
## 
## Residual standard error: 0.2167 on 8 degrees of freedom
## Multiple R-squared:  0.04316,    Adjusted R-squared:  -0.07644 
## F-statistic: 0.3609 on 1 and 8 DF,  p-value: 0.5647

This linear model likely has failed to produce normally distributed residuals, based off of the statistical analysis above.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

There is a comb pattern in the histogram of the residuals, and the qqplot does not have the residuals across the abline. Not only does the ratio of estimate to standard error fail to fall between 5 and 10 by coming in at -0.6007174, but it is negative, indicating that for Staten Island there is a negative correlation between Staten Island students having access to a variety of school activities and those same students enrolling in college and career programs. This model is (0.565) likely not relevant in the case of Staten Island.

Validity of the Linear Model for the Bronx

current_lm <- lm(college_career_rate ~ pct_stu_enough_variety, data=json[which(json["boro"]=="X"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_enough_variety, data = json[which(json["boro"] == 
##     "X"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37417 -0.11192 -0.04321  0.11742  0.45807 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.34850    0.09032   3.858 0.000205 ***
## pct_stu_enough_variety  0.24399    0.12274   1.988 0.049620 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1683 on 98 degrees of freedom
## Multiple R-squared:  0.03876,    Adjusted R-squared:  0.02895 
## F-statistic: 3.951 on 1 and 98 DF,  p-value: 0.04962

The residuals indicate this data is not normally distributed.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

These charts back up the idea the linear model fails to produce normally distributed residuals when it comes to Bronx students. The ratio between the estimate and standard error is 1.9878371, and it handily fails to meet the 5 threshold. Even so, there is a 0.049620 chance this model isn’t relevant for the Bronx, meaning it is not wholly untrustworthy.

How Safe Students Feel and College/Career Rate

When a high school graduate is deciding whether or not to attempt to enroll in college or for a job, they recall experiences they might have had in their high school. How hard the school work was, their friends, activities they participated in, their teachers, et cetera. One thing they do not think about is how safe their school was; in fact, like a lot of things in life, no students notice how safe their high school is until it is not safe any longer. Even then, it is not a memory that comes to the surface when applying for college or a job, or sitting for an interview. Common sense dictates this should not have anything to do with the likelihood of students enrolling in a college or career program within six months of graduating.

boro_letters <- unlist(unique(json["boro"]))
ab_college_safety <- data.frame(matrix(ncol=3,nrow=0))
names(ab_college_safety) <- c("intercept", "slope", "boro_letters")
ab_college_safety[1,] <- c(lm(college_career_rate ~ pct_stu_safe, data=json)[[1]], "all")
for (boro_num in 1:length(boro_letters)){
  ab_college_safety[nrow(ab_college_safety)+1,] <- c(lm(college_career_rate ~ pct_stu_safe, data=json[which(json["boro"] == boro_letters[[boro_num]]),])[[1]], boro_letters[boro_num])
}
ab_college_safety["intercept"] <- as.numeric(unlist(ab_college_safety["intercept"]))
ab_college_safety["slope"] <- as.numeric(unlist(ab_college_safety["slope"]))
qplot(pct_stu_safe, college_career_rate, data=json, alpha=I(0.3), xlab="% Students Feel Safe", ylab="College Career Rate", color=boro) + 
  geom_abline(data=ab_college_safety, aes(intercept=intercept, slope=slope, color=boro_letters)) +
  scale_color_manual(values=c("black","red", "orange", "green", "blue", "purple"))

Again, common sense is thrown out the window. There is visually a clear positive correlation present, but how accurate it is or the intensity of it is another matter altogether. This model implies we can expect a positive correlation with every single boro to be present, and that the more students feel safe in their high school, the more likely it is they will enroll in a college or career program within six months of their graduation.

Overall Validity of the Linear Model

current_lm <- lm(college_career_rate ~ pct_stu_safe, data=json)
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_safe, data = json)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3751 -0.1183  0.0002  0.1216  0.3969 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.40634    0.08609   -4.72 3.32e-06 ***
## pct_stu_safe  1.19448    0.10249   11.65  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1625 on 381 degrees of freedom
## Multiple R-squared:  0.2628, Adjusted R-squared:  0.2609 
## F-statistic: 135.8 on 1 and 381 DF,  p-value: < 2.2e-16

When this linear model is applied to New York City high schools overall, the residuals are mirrored and the residual of the median is the smallest yet spotted in this analysis at 0.0002.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

The graphs imply a clear normal distribution of the residuals. Unusual for the data thus far, the coefficient ratio actually exceeds the 10 cap by coming in at 11.6548907, though this is the most accurate it’s been so far.

There is less than a $2.2e^{-16}$ chance (0.0000002475774) that this linear model is not relevant when determining the likelihood high school students in New York City go on to a college or career program.

Validity of the Linear Model for Brooklyn

current_lm <- lm(college_career_rate ~ pct_stu_safe, data=json[which(json["boro"]=="K"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_safe, data = json[which(json["boro"] == 
##     "K"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35409 -0.11800 -0.01286  0.11587  0.38309 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.06209    0.19802  -0.314  0.75448   
## pct_stu_safe  0.75465    0.24070   3.135  0.00221 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1674 on 108 degrees of freedom
## Multiple R-squared:  0.08342,    Adjusted R-squared:  0.07493 
## F-statistic: 9.829 on 1 and 108 DF,  p-value: 0.002212

In the case of Brooklyn, the residuals are almost mirrored and the residual of the median is small. The residuals are likely normally distributed.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

The residuals are indeed normally distributed, though not perfectly so. The estimate is 3.1351847x the standard error, which does fail to fall between 5 and 10.

There is a high likelihood (0.00221) that this linear model is relevant when it comes to Brooklyn high school students.

Validity of the Linear Model for Manhattan

current_lm <- lm(college_career_rate ~ pct_stu_safe, data=json[which(json["boro"]=="M"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_safe, data = json[which(json["boro"] == 
##     "M"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37465 -0.11660  0.00742  0.12885  0.25275 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.6301     0.1863  -3.382  0.00106 ** 
## pct_stu_safe   1.4675     0.2127   6.899 6.78e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1502 on 91 degrees of freedom
## Multiple R-squared:  0.3434, Adjusted R-squared:  0.3362 
## F-statistic:  47.6 on 1 and 91 DF,  p-value: 6.779e-10

When this linear model is applied to Manhattan high school students, it should produce slightly normal residuals as evidenced by the small residual median and first and third quantile similarities.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

This linear model’s residuals fail to be normally distributed when plotted on a histogram, but their qqplot indicates they actually might be normally distributed overall when considered theoretically. The ratio for Manhattan falls in range (5 < 6.8991725 < 10), which means this linear model is very likely to be an excellent one for determining if Manhattan high school students will go on to enroll in college or a career program.

There is a very small ($6.78e^{-10}$ or 0.0003078115) chance this linear model is inappropriate for Manhattan.

Validity of the Linear Model for Queens

current_lm <- lm(college_career_rate ~ pct_stu_safe, data=json[which(json["boro"]=="Q"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_safe, data = json[which(json["boro"] == 
##     "Q"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43501 -0.11575  0.03103  0.11225  0.34702 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.4843     0.2068  -2.342   0.0221 *  
## pct_stu_safe   1.3503     0.2423   5.572 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1641 on 68 degrees of freedom
## Multiple R-squared:  0.3135, Adjusted R-squared:  0.3034 
## F-statistic: 31.05 on 1 and 68 DF,  p-value: 4.672e-07

The residual median is a bit larger than the others so far in this linear model, but for the most part the residuals are mirrored. This should mean the residuals are normally distributed.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

The residuals are normally distributed as evidenced by the histogram and the qqplot. For the second time when observing the boroughs individually, the ratio between students going on to enroll on a college or career program within six months of their graduation and how safe those students felt when they were in their high school falls between 5 and 10 (5.5722655).

When it comes to Queens, there is a $4.67e^{-7}$ or 0.004258489 chance this linear model is not correct or accurate.

Validity of the Linear Model for Staten Island

current_lm <- lm(college_career_rate ~ pct_stu_safe, data=json[which(json["boro"]=="R"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_safe, data = json[which(json["boro"] == 
##     "R"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18097 -0.05375  0.02264  0.07182  0.14346 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -1.3048     0.3932  -3.318  0.01057 * 
## pct_stu_safe   2.2787     0.4630   4.922  0.00116 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1104 on 8 degrees of freedom
## Multiple R-squared:  0.7517, Adjusted R-squared:  0.7207 
## F-statistic: 24.22 on 1 and 8 DF,  p-value: 0.001162

When looking at the minimum, quantiles, median, and maximum of the residuals, it is obvious that the residuals are not perfectly mirrored, though they do approach it. The residual median is also small, though perhaps not as small as other examples seen thus far.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

Given the few residuals offered (10), they appear to be normally distributed with a slight left skew. When considering the college and career rate to student safety ratio, it just barely (4.921602) misses the lower bound needed (5) to be a good linear model. It is close enough to say this model may actually be right for Staten Island.

There is a small chance (0.00116) this linear model is not appropriate for Staten Island.

Validity of the Linear Model for the Bronx

current_lm <- lm(college_career_rate ~ pct_stu_safe, data=json[which(json["boro"]=="X"),])
summary(current_lm)

## 
## Call:
## lm(formula = college_career_rate ~ pct_stu_safe, data = json[which(json["boro"] == 
##     "X"), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33208 -0.10451 -0.02973  0.12477  0.36322 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.1468     0.1432  -1.025    0.308    
## pct_stu_safe   0.8300     0.1759   4.719 7.88e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1549 on 98 degrees of freedom
## Multiple R-squared:  0.1852, Adjusted R-squared:  0.1768 
## F-statistic: 22.27 on 1 and 98 DF,  p-value: 7.882e-06

Residuals generated by this linear model for the Bronx are normally distributed with a low residual median.

p1 <- ggplot() + geom_histogram(binwidth=0.05, aes(x=current_lm$residuals), fill="skyblue") + xlab("Linear Model Residuals") + guides(fill=FALSE)
p2 <- gg_qqplot(current_lm)
grid.arrange(p1, p2, ncol=2)

Though there are two spikes present in the histogram, overall the data is normally distributed, confirmed the previous observation by looking at the minimum, quantiles, median, and maximum values of the residuals. Just as with Staten Island, the ratio between Bronx college students who enrolled in college or career programs within six months of their graduation and how safe those students felt in their high schools is almost 5… 4.7189915.

This linear model results in a $7.882e^{-6}$ or 0.01953752 chance of it not being correct for this borough’s students.

Conclusion

The best-suited linear model for this data set in order to determine which high schools are likely to lead to its students continuing on for further education or enrichment six months after graduation or sooner is the one observing the percent of students who feel safe in their high schools. This is the result I did not expect, but it is also the result that had the best outcomes for Staten Island. In the other two models produced, Staten Island had a negative correlation; this was the only model that had Staten Island having a positive correlation, a nearly in-range ratio between the students who went on to those programs and the variable being observed, and a less than 50% chance of being irrelevant - in fact, this linear model had a 0.116% chance of being irrelevant for Staten Island! My gold standard was I wanted the residuals to be as normally distributed as possible, a positive correlation for all five boroughs, and a near or in range ratio, which this linear model met consistently.

Looking at the total students is instinctively what I thought would be the winner, as a resident of New York City. Growing up, a familiar complaint from my teachers was the size of each class and how there were not enough desks. I had on more than one occassion myself been the student to sit on the ground with my notebook and pencil to take notes, so I certainly though this would be the best option for all of New York’s five boroughs. The data showed that Staten Island certainly operated under such an assumption, but Queens was indifferent, and the rest of the boroughs had its students flourish the more of them there were.

I had inferred that the second linear model would be the right one, if somehow the first failed. Recent high school graduates like to talk about all the fun things they did in school, the things they did with their classmates and school friends, and applications for college and jobs encourage you to talk about these experiences, so why wouldn’t it be the right linear model? Ultimately it showed not to produce any strong or noticeable correlations that were in any agreement with each other between each of the boroughs, even if it was overall a positive correlation.

If the winning linear model was used on previous years and future ones, I suspect it would have a similar result. The only way to know for certain would be to apply it and find out. But that is a project for another day.

DATA 605 Discussion 11

Gabrielle Bartomeo

October 29, 2018

NYC 2019 DOE High School Directory

Data Cleaning

Analysis

Total Students and College/Career Rate

Overall Validity of the Linear Model

Validity of the Linear Model for Brooklyn

Validity of the Linear Model for Manhattan

Validity of the Linear Model for Queens

Validity of the Linear Model for Staten Island

Validity of the Linear Model for the Bronx

Variety of Student Activities and College/Career Rate

Overall Validity of the Linear Model

Validity of the Linear Model for Brooklyn

Validity of the Linear Model for Manhattan

Validity of the Linear Model for Queens

Validity of the Linear Model for Staten Island

Validity of the Linear Model for the Bronx

How Safe Students Feel and College/Career Rate

Overall Validity of the Linear Model

Validity of the Linear Model for Brooklyn

Validity of the Linear Model for Manhattan

Validity of the Linear Model for Queens

Validity of the Linear Model for Staten Island

Validity of the Linear Model for the Bronx

Conclusion