DATA 605, Week 11 Discussion: Linear Modeling

Kavya Beheraj

November 6, 2018



Background


Research Question

How well does the percentage of students in a school who are eligible for free or reduced-price lunch (FRPL, a common measure of poverty) explain the average critical reading SAT score of a school?


Findings

Schools with higher poverty (as measured by FRPL) have a lower average critical reading SAT score. However, poverty only explains about 50% of the variation in critical reading score. To make a better model, we need to account for other variables – such as school budget, racial diversity, etc.




Data Preparation

Read in and clean the SAT dataset:

url1 <- "C:/Users/Kavya/Desktop/Kavya/msds/DATA 605 Computational Mathematics/Discussion Week 10/sat2012.csv"

sat <- read.csv(url1, fill = TRUE, sep = ",") # Read in dataset
sat <- filter(sat, sat$Num.of.SAT.Test.Takers != "s") # Filter out missing values
# Coerce columns to character and numeric
sat[,1:2] <- lapply(sat[,1:2], as.character)
sat[,3:6] <- lapply(sat[,3:6], as.numeric)
names(sat) <- c("DBN", "school", "num_takers", "reading_avg", "math_avg", "writing_avg")

Read in and clean the Demographics dataset, filtering for the 2011-2012 school year (to match the SAT dataset).

url2 <- "C:/Users/Kavya/Desktop/Kavya/msds/DATA 605 Computational Mathematics/Discussion Week 10/demographics2012.csv"

demo <- read.csv(url2, fill = TRUE, sep = ",") # Read in dataset
demo <- filter(demo, demo$schoolyear=="20112012") # Filter for the 2011-2012 school year
# Coerce columns to character and numeric
demo[,1:2] <- lapply(demo[,1:2], as.character)
demo[,3:38] <- lapply(demo[,3:38], as.numeric)
demo <- demo[, c(1, 5)] # Choose relevant columns
names(demo)[1] <- "DBN"

Join the two datasets together using the DBN (district-borough number) of each school, and view the dataframe.

sat_demo <- inner_join(sat, demo, by="DBN")
DBN school num_takers reading_avg math_avg writing_avg frl_percent
01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 68 34 70 45 88.6
01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 166 62 87 48 71.8
01M450 EAST SIDE COMMUNITY SCHOOL 136 56 68 52 71.8
01M458 FORSYTH SATELLITE ACADEMY 135 93 67 41 72.8
01M509 MARTA VALLE HIGH SCHOOL 98 69 94 66 80.7
01M515 LOWER EAST SIDE PREPARATORY HIGH SCHOOL 11 17 153 11 77.0




Data Modeling and Analysis


Distributions

When we look at histograms of each variable, we see they are slightly skewed from the normal distribution, especially with FRPL.

This suggests bias in the data – that poverty is not evenly distributed across NYC schools. There are more schools with poverty levels between 60-75%. If the distribution was normal, we would expect to see a peak around 50%.

y <- sat_demo$reading_avg
x <- sat_demo$frl_percent

plot1 <- qplot(x, geom = "histogram", xlab="% Free/Reduced Price Lunch (Poverty)", binwidth=10)
plot2 <- qplot(y, geom = "histogram", xlab="Avg. Critical Reading SAT Score", binwidth=10)
grid.arrange(plot1, plot2, ncol=2, top="Distribution of School Poverty and Critical Reading SAT Score (2011-2012)")


Linear Model

I modeled the relationship between FRPL and critical reading SAT score using the lm function and summarized the results below.

  • Slope: \(-1.5929\). For every 1 percentage-point increase in the proportion of FRPL students in a school, there is a 1.6-point decrease in average critical reading SAT score.

  • Intercept: \(178.5304\). A school that has 0% of students eligible for FRPL would be estimated to have an average critical reading SAT score of about 179 points.

  • Standard error: \(5.3885\) (intercept) and \(0.0792\) (slope). These values are much smaller than the corresponding coefficients. Indicates that there is relatively little variability in the estimates of the slope and intercept.

  • P-value: \(<2e^{-16}\). This tiny value, and three significance stars, means that there is a high likelihood that FRPL is relevant in the model, and the model more accurately predicts it.

  • R-squared: \(0.4966\); adjusted \(R^2\): \(0.4954\). This means that FRPL explains about 50% of the variation in critical reading SAT score. This suggests that we may need more variables than just FRPL to explain SAT score.

  • Degrees of freedom: \(410\). There were \(412\) observations used to generate the model.

a <- lm(y ~ x)
summary(a)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.648 -16.781   0.844  18.789  64.050 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 178.5304     5.3885   33.13   <2e-16 ***
## x            -1.5929     0.0792  -20.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.79 on 410 degrees of freedom
## Multiple R-squared:  0.4966, Adjusted R-squared:  0.4954 
## F-statistic: 404.5 on 1 and 410 DF,  p-value: < 2.2e-16
qplot(x, y, ylab="School Avg. Critical Reading SAT Score", xlab="Percent of School Pop. Eligible for Free/Reduced Price Lunch (Poverty)", main="School Poverty vs. Critical Reading SAT Score (2011-2012)") +
  geom_abline(intercept = a$coefficients[1], slope = a$coefficients[2])


Residuals

When we plot the residuals of this model, we see a cluster of points between 40 and 80, and slight skew in the ends of the quantile-quantile plot. This reinforces what we saw in the histogram, that there may be bias influencing the data, or that using FRPL alone may not be enough to explain the data.

plot4 <- qplot(a$fitted.values, a$residuals, ylab="Fitted Values", xlab="Residuals")

plot5 <- ggplot() + geom_qq(aes(sample = a$residuals))

grid.arrange(plot4, plot5, ncol=1, nrow=2)