Preparation

# On first use packages must be installed.
# Example: install.packages("haven") 

# Load packages used in this session of R
library(knitr)
library(car)
library(ggplot2)
opts_chunk$set(echo = TRUE)
options(digits = 3)
  ## As needed, set path to folder where data is located.
  ## For example: opts_knit$set(root.dir ="C:/Data")

1) Load the data set Ch2_Lab_ExitPoll.dta

load("C:/PPOL560/PPOL560_data/Ch2_lab_ExitPoll.Rdata")

2) Create dummy variables for each state/DC. How many observations are in each state?

There are 768 observations for DC.

There are 369 observations for Maryland.

There are 547 observations for Ohio.

There are 664 observations for Virginia.

##I wrote this code before the actual lab session so I'm sorry if it's more complex than necessary. That said, I got the right answers, and everything runs as I want it to.
Lab2DF <- as.data.frame(dta) ##Work on a copy of the data frame
Lab2DF$DC_dummy <- ifelse(Lab2DF$state == 1, 1, 0) ##DC dummy variable
Lab2DF$MD_dummy <- ifelse(Lab2DF$state == 2, 1, 0) ##MD dummy variable
Lab2DF$OH_dummy <- ifelse(Lab2DF$state == 3, 1, 0) ##OH dummy variable
Lab2DF$VA_dummy <- ifelse(Lab2DF$state == 4, 1, 0) ##VA dummy variable
table(Lab2DF$DC_dummy) ##0 = 1580, 1 = 768
## 
##    0    1 
## 1580  768
table(Lab2DF$MD_dummy) ##0 = 1979, 1 = 369
## 
##    0    1 
## 1979  369
table(Lab2DF$OH_dummy) ##0 = 1801, 1 = 547
## 
##    0    1 
## 1801  547
table(Lab2DF$VA_dummy) ##0 = 1684, 1 = 664
## 
##    0    1 
## 1684  664
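As a cross-check on dummy counts like these, a single table() call on the original state codes should give the same totals. Below is a minimal sketch on made-up data; the `toy` data frame is hypothetical, and the 1-4 coding for DC, MD, OH, and VA is taken from the code above.

```r
# Hypothetical toy data; state codes follow the 1-4 coding used above
toy <- data.frame(state = c(1, 1, 2, 3, 4, 4, 4))

# Build two of the dummies the same way as in the lab code
toy$DC_dummy <- ifelse(toy$state == 1, 1, 0)
toy$VA_dummy <- ifelse(toy$state == 4, 1, 0)

# Counts taken straight from the state variable
state_counts <- table(toy$state)
state_counts

# Each dummy's sum should equal the matching cell of the table
sum(toy$DC_dummy) == state_counts[["1"]]
sum(toy$VA_dummy) == state_counts[["4"]]
```

If the two comparisons return TRUE, the dummies and the original variable agree.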

3) Convert the year_born variable into age. Check for and correct data errors. What is the average age of all observations in the data set? The minimum and maximum?

The average age is 43.1 years.

The minimum age is 17 years. While this might seem suspicious, there are states where 17-year-olds can register to vote (and in some cases can even vote in primaries), as long as they will be 18 on or before election day.

The maximum age is 95 years.

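age <- 2016 - Lab2DF$year_born ##Age at the time of the 2016 election; missing birth years stay NA
age[which(Lab2DF$year_born <= 1920)] <- NA ##Treat birth years of 1920 or earlier as data errors
Lab2DF$age <- age ##Store age in the data frame so it stays aligned with the other variables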
mean(age, na.rm = TRUE) ##Mean age is 43.06987
## [1] 43.1
min(age, na.rm = TRUE) ##Min age is 17. Probably someone who was 17 at the time of the survey but would have turned 18 on or before the 2016 general election
## [1] 17
max(age, na.rm = TRUE) ##Max age is 95
## [1] 95
summary(age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      17      30      41      43      55      95     486
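In R, mean(), min(), and max() return NA whenever the vector contains a missing value, unless na.rm = TRUE is set; summary() drops the NAs on its own and reports them separately. A small sketch with made-up birth years (the values are hypothetical, and the 1920 cutoff mirrors the one used above):

```r
# Hypothetical birth years, including one missing value and one implausible year
year_born <- c(1990, 1975, NA, 1899, 1960)

age_demo <- 2016 - year_born                  # age at the 2016 election
age_demo[which(year_born <= 1920)] <- NA      # flag implausibly early years as errors

mean(age_demo)                  # NA, because missing values are still present
mean(age_demo, na.rm = TRUE)    # mean of the three valid ages
range(age_demo, na.rm = TRUE)   # minimum and maximum of the valid ages
sum(is.na(age_demo))            # number of missing or flagged ages
```

Keeping the NAs in place, rather than dropping them with na.omit(), keeps the age vector the same length as the data frame it came from.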

4) What is the distribution of the gender variable? Create a male dummy variable and indicate the distribution of this variable. Compare distribution of the male variable to the distribution of the gender variable.

The male dummy variable has 886 respondents coded 1 (male) and 1067 coded 0 (not male).

The gender variable has 886 respondents who identify as male, 1062 who identify as female, and 5 who identify as another gender.

The male dummy variable collapses the categories “female” and “other” into one larger “not-male” category (1062 + 5 = 1067).

male <- ifelse(Lab2DF$gender == 1, 1, 0) ##For male, 1 = male, 0 = not male
table(male)
## male
##    0    1 
## 1067  886
table(Lab2DF$gender)
## 
##    1    2    3 
##  886 1062    5
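One way to compare the two variables directly is a cross-tabulation, which shows which gender codes end up in each value of the dummy. This is a sketch on made-up data; the vector `gender_demo` is hypothetical, and the coding (1 = male, 2 = female, 3 = other) is inferred from the counts reported above.

```r
# Hypothetical gender codes: 1 = male, 2 = female, 3 = other
gender_demo <- c(1, 2, 2, 3, 1, 2)

# Same dummy construction as in the lab: 1 = male, 0 = not male
male_demo <- ifelse(gender_demo == 1, 1, 0)

# Rows are gender codes, columns are values of the dummy
xtab <- table(gender = gender_demo, male = male_demo)
xtab
```

Every respondent with gender code 1 should fall in the male = 1 column, and codes 2 and 3 should both fall in the male = 0 column.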

5) Provide descriptive stats for Trump and Clinton feeling thermometer. Is there anything you need to adjust?

  • For the Trump feeling thermometer:

    + The minimum is 0.
    + The 1st quartile is 0.
    + The median is 0.
    + The 3rd quartile is 25.
    + The maximum is 100.
    + The mean is 17.8.
    + The variance is 836.

  • For the Clinton feeling thermometer:

    + The minimum is 0.
    + The 1st quartile is 20.
    + The median is 70.
    + The 3rd quartile is 90.
    + The maximum is 100.
    + The mean is 57.1.
    + The variance is 1277.

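To obtain these summary statistics, I first removed the observations with missing data, recoded any value above 100 as missing, and then used the summary and var functions. The summaries below report no NA's after that recode, so no thermometer values above 100 were present in this file.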

therm_trump <- na.omit(Lab2DF$therm_trump) ##Remove all observations for which there is no data
therm_trump[therm_trump > 100] <- NA ##Remove observations greater than 100
summary(therm_trump)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0    17.8    25.0   100.0
var(therm_trump) ##variance is 835.5979
## [1] 836
therm_clinton <- na.omit(Lab2DF$therm_clinton) ##Do the same, remove the NAs
therm_clinton[therm_clinton > 100] <- NA ##Also remove the weird data.
summary(therm_clinton)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    20.0    70.0    57.1    90.0   100.0
var(therm_clinton) ##variance is 1277.341
## [1] 1277

6) Produce histograms of the Clinton and Trump feeling thermometers

hist_trump <- ggplot(Lab2DF, aes(x = therm_trump))+ ##Trump feeling therm histogram
  geom_histogram(binwidth = 10)+ ##Set binwidth here; a separate stat_bin() layer would draw the bars twice
  xlab("Trump Feeling Thermometer")+
  ylab("Frequency")+
  ggtitle("Responses to Feeling Thermometer Prompt About Donald Trump, 2016")+
  theme_bw()
hist_trump
## Warning: Removed 292 rows containing non-finite values (stat_bin).

hist_clinton <- ggplot(Lab2DF, aes(x = therm_clinton))+ ##Clinton feeling therm histogram
  geom_histogram(binwidth = 10)+ ##Set binwidth here; a separate stat_bin() layer would draw the bars twice
  xlab("Clinton Feeling Thermometer")+
  ylab("Frequency")+
  ggtitle("Responses to Feeling Thermometer Prompt About Hillary Clinton, 2016")+
  theme_bw()
hist_clinton
## Warning: Removed 232 rows containing non-finite values (stat_bin).

7) Create a scatterplot of the Clinton and Trump feeling thermometers

scatter_therm <- ggplot(data = subset(Lab2DF, !is.na(therm_clinton) & !is.na(therm_trump)), aes(x = therm_trump, y = therm_clinton, color = therm_clinton > therm_trump))+ ##Keep only respondents who rated both candidates
  geom_jitter()+
  xlab("Feeling Thermometer Response Toward Donald Trump")+
  ylab("Feeling Thermometer Response Toward Hillary Clinton")+
  ggtitle("Scatterplot of Feeling Thermometer Responses", subtitle = "Comparing Responses for Donald Trump and Hillary Clinton, 2016")+
  theme_bw()+
  ##Display a point as blue if the respondent had a higher feeling thermometer score for Clinton than for Trump, and as red otherwise (ties are grouped with Trump).
  scale_color_manual(name = "Preferred Candidate", values = c("red", "blue"), labels = c("Trump", "Clinton"))
scatter_therm
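One thing worth checking when filtering on two thermometers at once: is.na(a & b) is not the same as testing each variable for missingness. The `&` operator coerces a score of 0 to FALSE, and FALSE & NA evaluates to FALSE rather than NA, so a row with one score of 0 and one missing score is not flagged. A small sketch with made-up scores (the vectors `a` and `b` are hypothetical):

```r
# Hypothetical thermometer scores with missing values in different positions
a <- c(0, 50, NA, NA)
b <- c(NA, 60, 0, 10)

# Flags only the last row: 0 & NA and NA & 0 both evaluate to FALSE, not NA
combined_check <- is.na(a & b)
combined_check

# Flags every row where either score is missing
separate_check <- is.na(a) | is.na(b)
separate_check
```

Rows that slip past the first check still contain an NA, which is what ggplot2 reports when it warns about removed rows.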

8) What is the distribution of the education variable? Is there any adjustment you would need to make if you will use this as a continuous variable in a regression model?

(NOTE: I think something is wrong with this data file.)

  • The distribution of the education variable (as best as I can determine based on the data available) is as follows:

    + 17 respondents’ highest level of education was some high school.
    + 125 respondents’ highest level of education was a high school diploma.
    + 245 respondents’ highest level of education was some college.
    + 134 respondents’ highest level of education was a two-year college degree.
    + 677 respondents’ highest level of education was a four-year college degree.
    + 746 respondents’ highest level of education was a post-graduate degree.

If I were going to use education as a continuous variable in a regression model, I would recode the variable so that the “Other” category is treated as missing, since it has no place on the ordered scale. In this data set, however, the expected “Other” responses were already absent when I loaded the file, so the education variable only takes the values 1 through 6 and is ready for use in a regression model with no further processing.

table(dta$education) ##I don't know why running this line of code only shows values for 1-6. Maybe it's because I'm using the .RData file instead of the .dta? They shouldn't be different in this way though.
## 
##   1   2   3   4   5   6 
##  17 125 245 134 677 746
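If a version of the file did include an “Other” category, the recode described above could look roughly like the sketch below. The code 7 for “Other” is an assumption made for illustration only, since that category does not appear in this data file, and the vector `education_demo` is made up.

```r
# Hypothetical education codes; 7 stands in for "Other" (assumed, not from the data)
education_demo <- c(1, 5, 6, 7, 3, 7)

# Keep the ordered codes 1-6 and treat anything else as missing
education_clean <- ifelse(education_demo %in% 1:6, education_demo, NA)

# useNA = "ifany" shows how many responses were recoded to missing
table(education_clean, useNA = "ifany")
```

Setting the category to NA, rather than leaving it as a seventh value, keeps it from being read as a level above a post-graduate degree.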

Additional Comments: It may be convenient to save the data, as we’ll use it in the lab for the next chapter. Here, we save the data in .Rdata format.
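A minimal sketch of the save step, assuming a stand-in data frame and a hypothetical file name; in the lab itself the object to save would be Lab2DF, and the path would point to your own data folder rather than a temporary directory.

```r
# Stand-in for the cleaned lab data frame (hypothetical values)
Lab2DF_demo <- data.frame(state = c(1, 2), age = c(30, 45))

# Hypothetical output file; tempdir() is used here only so the sketch runs anywhere
out_file <- file.path(tempdir(), "Ch2_lab_ExitPoll_clean.RData")

# save() writes the named object; load() restores it under the same name
save(Lab2DF_demo, file = out_file)
rm(Lab2DF_demo)
load(out_file)
str(Lab2DF_demo)
```

Because load() restores objects under the names they were saved with, it helps to note the object name alongside the file name.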