Overview

Who voted for Obama in 2008? Just the liberals, or more?
I will use this R Markdown document to show both data analysis with R and statistical thinking. This is an assignment from Andrew Gelman’s Survey Statistics class at Columbia University.

Skills

By going through this exercise, you should learn how to

  • join datasets with dplyr
  • manipulate data with dplyr
  • create and refine plots with ggplot2
  • create maps with choroplethr

Did Obama get more votes in liberal states?

I will plot the estimated proportion of liberals in each state against Obama’s 2008 vote share (data available at http://www.stat.columbia.edu/~gelman/surveys.course/2008ElectionResult.csv).

Load and Clean Data

Using the Pew 2008 survey, I will compute the percentage of respondents in each state (excluding Alaska and Hawaii) who are liberal. To do that, I first need to recode the ideology variable from the Pew 2008 survey into a liberal indicator.

library(foreign)
library(arm)
library(ggplot2)
library(dplyr)
library(readr)
library(openintro)
library(choroplethr)
library(choroplethrMaps)

data = read.dta("pew_research_center_june_elect_wknd_data.dta")
names(data)

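# recode ideology to a numeric scale from -2 (very liberal) to 2 (very conservative)
# missing and "dk/refused" responses are treated as moderate (0)
data$ideo = as.character(data$ideo)
data$ideo[is.na(data$ideo)] = "0"
data$ideo[data$ideo=="very conservative"] = "2"
data$ideo[data$ideo=="conservative"] = "1"
data$ideo[data$ideo=="moderate"] = "0"
data$ideo[data$ideo=="dk/refused"] = "0"
data$ideo[data$ideo=="liberal"] = "-1"
data$ideo[data$ideo=="very liberal"] = "-2"
# convert to numeric so that the comparison below is numeric, not a string comparison
data$ideo = as.numeric(data$ideo)

# create a new liberal variable
data$liberal = data$ideo<0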
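
As an aside, the same recoding can be sketched with dplyr's `case_when()`. This is only an illustration on a made-up data frame (`toy` is hypothetical, and it assumes the ideology labels are exactly the ones used above); it is not how the survey data was actually recoded in this post.

```r
library(dplyr)

# toy data standing in for the survey; labels mirror the ones recoded above
toy = data.frame(
  ideo = c("very liberal", "liberal", "moderate", "conservative",
           "very conservative", "dk/refused", NA),
  stringsAsFactors = FALSE
)

toy = toy %>%
  mutate(
    ideo_num = case_when(
      ideo == "very conservative" ~ 2,
      ideo == "conservative"      ~ 1,
      ideo == "liberal"           ~ -1,
      ideo == "very liberal"      ~ -2,
      TRUE                        ~ 0   # moderate, dk/refused, and NA
    ),
    liberal = ideo_num < 0
  )

toy
```

Because every branch returns a number, `ideo_num` is numeric from the start and the `< 0` comparison never falls back to comparing strings.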

Calculate Liberals Share in Each State

I will aggregate the Pew 2008 survey to the state level and calculate the share of liberals among each state’s respondents.

table(data$liberal)
## 
## FALSE  TRUE 
## 25196  6005
# create state-level liberal share and sample size
liberalshare_state = data %>%
  group_by(state) %>%
  summarize(liberal.pct = sum(liberal)/n()*100, samplesize = n())

# clean the dataset
liberalshare_state$state = as.character(liberalshare_state$state)
liberalshare_state[liberalshare_state$state == "washington dc",]$state = "district of columbia"
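
To make the aggregation step concrete, here is a minimal sketch on a made-up data frame (`toy` and its two states "a" and "b" are hypothetical). It uses `mean(liberal) * 100`, which gives the same result as `sum(liberal)/n()*100` because the mean of a logical vector is the share of `TRUE` values.

```r
library(dplyr)

# six made-up respondents in two made-up states
toy = data.frame(
  state   = c("a", "a", "a", "a", "b", "b"),
  liberal = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# one row per state: percentage of liberals and number of respondents
toy_share = toy %>%
  group_by(state) %>%
  summarize(liberal.pct = mean(liberal) * 100, samplesize = n())

toy_share
```

Note that either version returns `NA` for a state if any respondent in it has a missing `liberal` value, which is one reason to recode missing responses before aggregating.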

Liberal-Obama Vote Gap By State

I want to plot Obama’s vote share against the proportion of liberals at the state level. To do that, I need to join the survey results and the election results into one dataset.

# read election result
election08 = read_csv("http://www.stat.columbia.edu/~gelman/surveys.course/2008ElectionResult.csv")


# merge the two datasets
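# keep the state name and Obama's vote share; lowercase state names to match the survey data
# (mutate_each() and funs() are deprecated in recent versions of dplyr)
obamashare_state = election08 %>% select(state, vote_Obama_pct) %>% mutate(state = tolower(state))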
merge = obamashare_state %>% inner_join(liberalshare_state)
## Joining by: "state"
merge$vote_Obama_pct = as.numeric(merge$vote_Obama_pct)
merge = merge[!merge$state=="hawaii",]

# create a plot
p = ggplot(merge, aes(x=liberal.pct, y=vote_Obama_pct))

p + geom_point()

# add abbreviation to make it more interpretable
merge$abbr = state2abbr(merge$state)

p + geom_text(aes(label=merge$abbr)) +
  labs(title="Who voted for Obama in 2008?", 
       x = "Percentage of people identified as liberals", 
       y = "Percentage of voters for Obama")
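
One thing to keep in mind with `inner_join()` is that rows without a match are dropped silently, which is why the state names had to be harmonized first. A quick way to check for this is `anti_join()`. The sketch below uses two made-up tables (`survey_toy` and `election_toy` are hypothetical, and the numbers are invented, not real results) to show what happens when one table says "washington dc" and the other says "district of columbia".

```r
library(dplyr)

# made-up tables with invented numbers, only to illustrate the mismatch
survey_toy = data.frame(
  state = c("ohio", "texas", "washington dc"),
  liberal.pct = c(20, 10, 40),
  stringsAsFactors = FALSE
)
election_toy = data.frame(
  state = c("ohio", "texas", "district of columbia"),
  vote_Obama_pct = c(50, 40, 90),
  stringsAsFactors = FALSE
)

# rows of the survey table that have no partner in the election table
unmatched = anti_join(survey_toy, election_toy, by = "state")
unmatched$state

# the inner join keeps only the states present in both tables
joined = inner_join(election_toy, survey_toy, by = "state")
nrow(joined)
```

If `unmatched` is not empty, the names need more cleaning before the join.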

How certain are we about the estimated liberal population at the state level?

I will plot the estimated proportion of liberals in each state against the sample size in each state (again as a scatterplot using the two-letter state abbreviations).

liberalshare_state$abbr = state2abbr(liberalshare_state$state)

ggplot(liberalshare_state, aes(x=samplesize, y=liberal.pct))+
  geom_text(aes(label=liberalshare_state$abbr))+
  scale_y_continuous(expand = c(0,0), limits = c(0,40)) + 
  scale_x_continuous(expand = c(0,0), limits = c(0,3000)) +
  labs(title = "How certain are we about the estimated liberal share?", 
       x = "Sample Size",
       y = "Percentage of liberals in 2008 Pew Survey") 

The plot has a funnel shape. This makes sense: in states with fewer respondents, the estimated proportion of liberals is more uncertain, so the estimates spread out more. We should be less confident in the estimates for states on the left side of the plot.
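
The funnel can be made more concrete with the usual standard error of a proportion, sqrt(p(1-p)/n). The sketch below is only a rough illustration: it assumes simple random sampling within each state (which the actual survey design may not satisfy), an assumed liberal share of 20%, and a few illustrative sample sizes. The helper `se_pct` is a name introduced here, not something from the original analysis.

```r
# standard error, in percentage points, of a proportion p estimated from n respondents
se_pct = function(p, n) 100 * sqrt(p * (1 - p) / n)

# assumed liberal share of 20% at a few illustrative sample sizes
n = c(50, 200, 1000, 3000)
se = se_pct(0.20, n)

# approximate 95% intervals around 20% for each sample size
intervals = data.frame(
  n     = n,
  se    = round(se, 1),
  lower = round(20 - 1.96 * se, 1),
  upper = round(20 + 1.96 * se, 1)
)
intervals
```

The standard error shrinks with the square root of the sample size, so the interval for a state with 50 respondents is several times wider than for a state with 3000.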

Put these liberals on a map!

Since it’s always cool to visualize things on a map, I will show the estimated proportion of liberals in each state using colors on a U.S. map.

# create a data frame that has region and value
liberalshare_cho = liberalshare_state[,c(1,2)]
names(liberalshare_cho) = c("region","value")

# recode dc for choropleth function (dc was already renamed in liberalshare_state above, so this is only a safeguard)
liberalshare_cho[liberalshare_cho$region == "washington dc",1] = "district of columbia"

state_choropleth(liberalshare_cho, title = "Proportion of people identified as liberals in 2008")
## Warning in self$bind(): The following regions were missing and are being
## set to NA: alaska