Data 605 Discussion 12

High School Math Proficiency in NYC

I was looking for data to use for this week’s discussion and stumbled on this site that has information about New York City residents.

We’ll be looking to see if median household income and the unemployment rate can predict the levels of high school math proficiency.

Found here

Loading Packages

library(tidyr)
library(dplyr)
library(ggplot2)

Loading the Data

## High School Math Proficiency
mp <- read.csv('https://github.com/dataconsumer101/data605/raw/main/High%20School%20Math%20Proficiency.csv', skip = 5)

head(mp)

##                         Location TimeFrame DataFormat       Data Fips
## 1                Lower East Side      2009     Number  666.00000  SD1
## 2                Lower East Side      2009    Percent    0.75169  SD1
## 3                Lower East Side      2010     Number  676.00000  SD1
## 4                Lower East Side      2010    Percent    0.74696  SD1
## 5 Stuyvesant Town, Upper E. Side      2009     Number 6168.00000  SD2
## 6 Stuyvesant Town, Upper E. Side      2009    Percent    0.67780  SD2

# only keep percents
mp <- mp[mp$DataFormat == 'Percent',]

# rename the data column name
colnames(mp)[4] <- 'hs_math_proficiency'

## Median Household Incomes
hi <- read.csv('https://github.com/dataconsumer101/data605/raw/main/Median%20Incomes.csv', skip = 8)

head(hi)

##         Location Household.Type TimeFrame DataFormat        Data Fips
## 1      Riverdale All Households      2005    Dollars 63435.46009  208
## 2 Williamsbridge All Households      2005    Dollars 52141.77239  212
## 3    Throgs Neck All Households      2005    Dollars 57712.50261  210
## 4 Pelham Parkway All Households      2005    Dollars 52782.17672  211
## 5     Morrisania All Households      2005    Dollars  24589.9512  203
## 6   East Tremont All Households      2005    Dollars  24589.9512  206

# keep only households with children
hi <- filter(hi, Household.Type == 'Families with Children')

# rename the data column name
colnames(hi)[5] <- 'med_household_income'


## Unemployment Rates
ur <- read.csv('https://github.com/dataconsumer101/data605/raw/main/Unemployment%20Rate.csv', skip = 5)

head(ur)

##             Location TimeFrame DataFormat  Data Fips
## 1         Mott Haven      2005    Percent 0.176  201
## 2        Hunts Point      2005    Percent 0.176  202
## 3         Morrisania      2005    Percent 0.165  203
## 4       East Tremont      2005    Percent 0.165  206
## 5 Bedford Stuyvesant      2005    Percent 0.139  303
## 6        Brownsville      2005    Percent 0.139  316

# rename the data column name
colnames(ur)[4] <- 'unemployment_rate'

Combining the Data

Since we’ll be using both the household income and unemployment rate to predict math proficiency, we can use inner joins to keep only data that will be used.

df <- inner_join(mp, hi, by = c('Location' = 'Location', 'TimeFrame' = 'TimeFrame')) %>%
  inner_join(ur, by = c('Location' = 'Location', 'TimeFrame' = 'TimeFrame')) %>%
  select(Location, TimeFrame, hs_math_proficiency, med_household_income, unemployment_rate)

head(df)

##          Location TimeFrame hs_math_proficiency med_household_income
## 1 Lower East Side      2009             0.75169          37648.80552
## 2 Lower East Side      2010             0.74696          48289.29775
## 3 Upper West Side      2009             0.67755         240555.07138
## 4 Upper West Side      2010             0.69310         205958.66105
## 5     East Harlem      2009             0.73480          29602.79029
## 6     East Harlem      2010             0.75441          33001.32959
##   unemployment_rate
## 1           0.09279
## 2           0.10102
## 3           0.06634
## 4           0.07463
## 5           0.16207
## 6           0.14768

str(df)

## 'data.frame':    176 obs. of  5 variables:
##  $ Location            : chr  "Lower East Side" "Lower East Side" "Upper West Side" "Upper West Side" ...
##  $ TimeFrame           : int  2009 2010 2009 2010 2009 2010 2009 2010 2009 2010 ...
##  $ hs_math_proficiency : num  0.752 0.747 0.678 0.693 0.735 ...
##  $ med_household_income: chr  "37648.80552" "48289.29775" "240555.07138" "205958.66105" ...
##  $ unemployment_rate   : num  0.0928 0.101 0.0663 0.0746 0.1621 ...

# convert income to numeric
df$med_household_income <- as.numeric(df$med_household_income)

Creating a linear model

lm <- lm(hs_math_proficiency ~ med_household_income + unemployment_rate, data = df)

summary(lm)

## 
## Call:
## lm(formula = hs_math_proficiency ~ med_household_income + unemployment_rate, 
##     data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.233241 -0.040949  0.007827  0.054029  0.175862 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.083e-01  2.227e-02  36.303  < 2e-16 ***
## med_household_income  1.674e-07  1.298e-07   1.290    0.199    
## unemployment_rate    -9.987e-01  1.709e-01  -5.844 2.48e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07354 on 173 degrees of freedom
## Multiple R-squared:  0.256,  Adjusted R-squared:  0.2474 
## F-statistic: 29.76 on 2 and 173 DF,  p-value: 7.777e-12

It looks like income isn’t a significant factor. Let’s take it out and see what happens.

lm <- update(lm, .~. - med_household_income)

summary(lm)

## 
## Call:
## lm(formula = hs_math_proficiency ~ unemployment_rate, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.236601 -0.044255  0.008637  0.051912  0.173521 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.82989    0.01471  56.400  < 2e-16 ***
## unemployment_rate -1.11264    0.14655  -7.592 1.84e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07368 on 174 degrees of freedom
## Multiple R-squared:  0.2489, Adjusted R-squared:  0.2445 
## F-statistic: 57.65 on 1 and 174 DF,  p-value: 1.837e-12

Chart It Out

# unemployment rate vs math proficiency
plot(df$unemployment_rate, df$hs_math_proficiency)
abline(lm)

# residual plot
plot(df$unemployment_rate, resid(lm))
abline(h = 0)

# qq plot
qqnorm(resid(lm))
qqline(resid(lm))

# t test
df$pred <- predict(lm, newdata = df)

delta <- df$pred - df$hs_math_proficiency

t.test(delta, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  delta
## t = -5.0683e-15, df = 175, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.01092908  0.01092908
## sample estimates:
##     mean of x 
## -2.806626e-17

Conclusion

It looks like this model tells us that unemployment and high school math proficiency have a negative relationship. The model doesn’t seem strong enough to use to predict accurately, but we can definitely see the relationship there. It’s not too hard to believe that employed families have the stability needed for kids to do well in school.