What I want to do today is explore the relationship between education level and Asian proportion of the population in a county.

Yes you guessed it…I want to prove (or disprove) once and for all whether Asian people are studious than the rest !

library(acs)
library(magrittr)
library(dplyr)
library(ggplot2)

county_geo <- geo.make(state = "*", county = "*")

race_acs <- acs.fetch(geography = county_geo, table.number = "B02001", 
                           col.names = "pretty", endyear = 2014, span = 5)

state_county <- race_acs@geography$NAME %>% str_split(",")

county <- state_county  %>%
  sapply(`[[`, 1) %>%
  str_replace(" County", "")

state <-  state_county %>%
  sapply(`[[`, 2)

columns <- 2:7
races <- c("White", "Black", "Indian", "Asian", "Islander", "Other")

df_race <- NULL

for(i in 1:length(columns)){
  df_race <- rbind(df_race, data.frame(county=county, 
                                       state=state,
                                       race=races[i], 
                                       value=race_acs@estimate[,columns[i]]))
}

Let us see what we have so far:

head(df_race)
##                          county    state  race  value
## Autauga County, Alabama Autauga  Alabama White  43011
## Baldwin County, Alabama Baldwin  Alabama White 165673
## Barbour County, Alabama Barbour  Alabama White  12806
## Bibb County, Alabama       Bibb  Alabama White  17379
## Blount County, Alabama   Blount  Alabama White  54927
## Bullock County, Alabama Bullock  Alabama White   2803

Now we turn these numbers into percentages of county totals:

by_county <-  group_by(df_race, state, county) %>%
              summarise(county_total = sum(value))

df_race %<>% left_join(by_county)
df_race %<>% mutate(perc = value/county_total) %>%
              select(state, county, race, perc)

head(df_race)
##      state  county  race      perc
## 1  Alabama Autauga White 0.7906725
## 2  Alabama Baldwin White 0.8827513
## 3  Alabama Barbour White 0.4818452
## 4  Alabama    Bibb White 0.7772014
## 5  Alabama  Blount White 0.9680643
## 6  Alabama Bullock White 0.2652598

Awesome ! I will reuse the data on the education level from previous post. I’m gonna give a simple education score to each county (giving 0 point for education level “< 9th grade” up to 6 points for education level “Graduate”):

points = data.frame(education=c("< 9th grade", "< 12th grade", "High school", "College",
                                "Associate", "Bachelor", "Graduate"),
                    points=c(0,1,2,3,4,5,6))
df_education %<>% left_join(points)
## Joining by: "education"

We can now do a weighted average of the education points by county:

df_education %<>% group_by(state, county) %>%
                  summarise(points = weighted.mean(points, w=count))

Finally, we merge this data back onto the race data:

df_race %<>% left_join(df_education)

And this is our final product:

head(df_race)
##      state  county  race      perc   points
## 1  Alabama Autauga White 0.7906725 2.874249
## 2  Alabama Baldwin White 0.8827513 3.139124
## 3  Alabama Barbour White 0.4818452 2.435715
## 4  Alabama    Bibb White 0.7772014 2.349634
## 5  Alabama  Blount White 0.9680643 2.488051
## 6  Alabama Bullock White 0.2652598 2.241498

Can we finally get a plot for f***’s sake ?! All right, all right, relax:

ggplot(data=subset(df_race, race=="Asian"), aes(x=log(perc), y=points)) + 
  geom_point(col="orange") + 
  geom_smooth(method="lm", col="red") + 
  xlab("Log-proportion of Asian population in county") + 
  ylab("County education score") + 
  ggtitle("Correlation between county education level and Asian proportion of the population")

I knew it ! So, do you believe me now when I say that asians are more studious ? Well you shouldn’t, because all I did was show that there is a positive correlation between the percentage of asians in a given county and the overall education level in that county. It doesn’t mean that the asian people are themselves more studious…