I was looking for data to use for this week’s discussion and stumbled on this site that has information about New York City residents.
We’ll be looking to see if median household income and the unemployment rate can predict the levels of high school math proficiency.
Found here
library(tidyr)
library(dplyr)
library(ggplot2)
## High School Math Proficiency
mp <- read.csv('https://github.com/dataconsumer101/data605/raw/main/High%20School%20Math%20Proficiency.csv', skip = 5)
head(mp)
## Location TimeFrame DataFormat Data Fips
## 1 Lower East Side 2009 Number 666.00000 SD1
## 2 Lower East Side 2009 Percent 0.75169 SD1
## 3 Lower East Side 2010 Number 676.00000 SD1
## 4 Lower East Side 2010 Percent 0.74696 SD1
## 5 Stuyvesant Town, Upper E. Side 2009 Number 6168.00000 SD2
## 6 Stuyvesant Town, Upper E. Side 2009 Percent 0.67780 SD2
# only keep percents
mp <- mp[mp$DataFormat == 'Percent',]
# rename the data column name
colnames(mp)[4] <- 'hs_math_proficiency'
## Median Household Incomes
hi <- read.csv('https://github.com/dataconsumer101/data605/raw/main/Median%20Incomes.csv', skip = 8)
head(hi)
## Location Household.Type TimeFrame DataFormat Data Fips
## 1 Riverdale All Households 2005 Dollars 63435.46009 208
## 2 Williamsbridge All Households 2005 Dollars 52141.77239 212
## 3 Throgs Neck All Households 2005 Dollars 57712.50261 210
## 4 Pelham Parkway All Households 2005 Dollars 52782.17672 211
## 5 Morrisania All Households 2005 Dollars 24589.9512 203
## 6 East Tremont All Households 2005 Dollars 24589.9512 206
# keep only households with children
hi <- filter(hi, Household.Type == 'Families with Children')
# rename the data column name
colnames(hi)[5] <- 'med_household_income'
## Unemployment Rates
ur <- read.csv('https://github.com/dataconsumer101/data605/raw/main/Unemployment%20Rate.csv', skip = 5)
head(ur)
## Location TimeFrame DataFormat Data Fips
## 1 Mott Haven 2005 Percent 0.176 201
## 2 Hunts Point 2005 Percent 0.176 202
## 3 Morrisania 2005 Percent 0.165 203
## 4 East Tremont 2005 Percent 0.165 206
## 5 Bedford Stuyvesant 2005 Percent 0.139 303
## 6 Brownsville 2005 Percent 0.139 316
# rename the data column name
colnames(ur)[4] <- 'unemployment_rate'
Since we’ll be using both the household income and unemployment rate to predict math proficiency, we can use inner joins to keep only data that will be used.
df <- inner_join(mp, hi, by = c('Location' = 'Location', 'TimeFrame' = 'TimeFrame')) %>%
inner_join(ur, by = c('Location' = 'Location', 'TimeFrame' = 'TimeFrame')) %>%
select(Location, TimeFrame, hs_math_proficiency, med_household_income, unemployment_rate)
head(df)
## Location TimeFrame hs_math_proficiency med_household_income
## 1 Lower East Side 2009 0.75169 37648.80552
## 2 Lower East Side 2010 0.74696 48289.29775
## 3 Upper West Side 2009 0.67755 240555.07138
## 4 Upper West Side 2010 0.69310 205958.66105
## 5 East Harlem 2009 0.73480 29602.79029
## 6 East Harlem 2010 0.75441 33001.32959
## unemployment_rate
## 1 0.09279
## 2 0.10102
## 3 0.06634
## 4 0.07463
## 5 0.16207
## 6 0.14768
str(df)
## 'data.frame': 176 obs. of 5 variables:
## $ Location : chr "Lower East Side" "Lower East Side" "Upper West Side" "Upper West Side" ...
## $ TimeFrame : int 2009 2010 2009 2010 2009 2010 2009 2010 2009 2010 ...
## $ hs_math_proficiency : num 0.752 0.747 0.678 0.693 0.735 ...
## $ med_household_income: chr "37648.80552" "48289.29775" "240555.07138" "205958.66105" ...
## $ unemployment_rate : num 0.0928 0.101 0.0663 0.0746 0.1621 ...
# convert income to numeric
df$med_household_income <- as.numeric(df$med_household_income)
lm <- lm(hs_math_proficiency ~ med_household_income + unemployment_rate, data = df)
summary(lm)
##
## Call:
## lm(formula = hs_math_proficiency ~ med_household_income + unemployment_rate,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.233241 -0.040949 0.007827 0.054029 0.175862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.083e-01 2.227e-02 36.303 < 2e-16 ***
## med_household_income 1.674e-07 1.298e-07 1.290 0.199
## unemployment_rate -9.987e-01 1.709e-01 -5.844 2.48e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07354 on 173 degrees of freedom
## Multiple R-squared: 0.256, Adjusted R-squared: 0.2474
## F-statistic: 29.76 on 2 and 173 DF, p-value: 7.777e-12
It looks like income isn’t a significant factor. Let’s take it out and see what happens.
lm <- update(lm, .~. - med_household_income)
summary(lm)
##
## Call:
## lm(formula = hs_math_proficiency ~ unemployment_rate, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.236601 -0.044255 0.008637 0.051912 0.173521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.82989 0.01471 56.400 < 2e-16 ***
## unemployment_rate -1.11264 0.14655 -7.592 1.84e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07368 on 174 degrees of freedom
## Multiple R-squared: 0.2489, Adjusted R-squared: 0.2445
## F-statistic: 57.65 on 1 and 174 DF, p-value: 1.837e-12
# unemployment rate vs math proficiency
plot(df$unemployment_rate, df$hs_math_proficiency)
abline(lm)
# residual plot
plot(df$unemployment_rate, resid(lm))
abline(h = 0)
# qq plot
qqnorm(resid(lm))
qqline(resid(lm))
# t test
df$pred <- predict(lm, newdata = df)
delta <- df$pred - df$hs_math_proficiency
t.test(delta, conf.level = 0.95)
##
## One Sample t-test
##
## data: delta
## t = -5.0683e-15, df = 175, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.01092908 0.01092908
## sample estimates:
## mean of x
## -2.806626e-17
It looks like this model tells us that unemployment and high school math proficiency have a negative relationship. The model doesn’t seem strong enough to use to predict accurately, but we can definitely see the relationship there. It’s not too hard to believe that employed families have the stability needed for kids to do well in school.