What I want to do today is explore the relationship between education level and Asian proportion of the population in a county.
Yes you guessed it…I want to prove (or disprove) once and for all whether Asian people are studious than the rest !
library(acs)
library(magrittr)
library(dplyr)
library(ggplot2)
county_geo <- geo.make(state = "*", county = "*")
race_acs <- acs.fetch(geography = county_geo, table.number = "B02001",
col.names = "pretty", endyear = 2014, span = 5)
state_county <- race_acs@geography$NAME %>% str_split(",")
county <- state_county %>%
sapply(`[[`, 1) %>%
str_replace(" County", "")
state <- state_county %>%
sapply(`[[`, 2)
columns <- 2:7
races <- c("White", "Black", "Indian", "Asian", "Islander", "Other")
df_race <- NULL
for(i in 1:length(columns)){
df_race <- rbind(df_race, data.frame(county=county,
state=state,
race=races[i],
value=race_acs@estimate[,columns[i]]))
}
Let us see what we have so far:
head(df_race)
## county state race value
## Autauga County, Alabama Autauga Alabama White 43011
## Baldwin County, Alabama Baldwin Alabama White 165673
## Barbour County, Alabama Barbour Alabama White 12806
## Bibb County, Alabama Bibb Alabama White 17379
## Blount County, Alabama Blount Alabama White 54927
## Bullock County, Alabama Bullock Alabama White 2803
Now we turn these numbers into percentages of county totals:
by_county <- group_by(df_race, state, county) %>%
summarise(county_total = sum(value))
df_race %<>% left_join(by_county)
df_race %<>% mutate(perc = value/county_total) %>%
select(state, county, race, perc)
head(df_race)
## state county race perc
## 1 Alabama Autauga White 0.7906725
## 2 Alabama Baldwin White 0.8827513
## 3 Alabama Barbour White 0.4818452
## 4 Alabama Bibb White 0.7772014
## 5 Alabama Blount White 0.9680643
## 6 Alabama Bullock White 0.2652598
Awesome ! I will reuse the data on the education level from previous post. I’m gonna give a simple education score to each county (giving 0 point for education level “< 9th grade” up to 6 points for education level “Graduate”):
points = data.frame(education=c("< 9th grade", "< 12th grade", "High school", "College",
"Associate", "Bachelor", "Graduate"),
points=c(0,1,2,3,4,5,6))
df_education %<>% left_join(points)
## Joining by: "education"
We can now do a weighted average of the education points by county:
df_education %<>% group_by(state, county) %>%
summarise(points = weighted.mean(points, w=count))
Finally, we merge this data back onto the race data:
df_race %<>% left_join(df_education)
And this is our final product:
head(df_race)
## state county race perc points
## 1 Alabama Autauga White 0.7906725 2.874249
## 2 Alabama Baldwin White 0.8827513 3.139124
## 3 Alabama Barbour White 0.4818452 2.435715
## 4 Alabama Bibb White 0.7772014 2.349634
## 5 Alabama Blount White 0.9680643 2.488051
## 6 Alabama Bullock White 0.2652598 2.241498
Can we finally get a plot for f***’s sake ?! All right, all right, relax:
ggplot(data=subset(df_race, race=="Asian"), aes(x=log(perc), y=points)) +
geom_point(col="orange") +
geom_smooth(method="lm", col="red") +
xlab("Log-proportion of Asian population in county") +
ylab("County education score") +
ggtitle("Correlation between county education level and Asian proportion of the population")
I knew it ! So, do you believe me now when I say that asians are more studious ? Well you shouldn’t, because all I did was show that there is a positive correlation between the percentage of asians in a given county and the overall education level in that county. It doesn’t mean that the asian people are themselves more studious…