In this activity, you will explore a dataset that includes a large
number of general election polls that have been fielded since President
Biden dropped out of the race (i.e. after the race shifted to being
between Harris and Trump). The data, called PollsData.RData,
can be downloaded from Canvas.
You should start by downloading both this data file and the
.Rmd template file to your computer, saving them both in
the same folder (ideally one set up for data analysis for this course).
You should then start RStudio not by clicking the application icon, but
instead by double-clicking the .Rmd template which should open RStudio
with the working directory for R automatically set to the location of
the .Rmd file (which should also be the same location as the
dataset).
Below are brief descriptions of the variables (we won’t use all of these variables in this lab):
poll_id: a numeric identifier for each poll
pollster: the name of the pollster who conducted the poll
methodology: the method of conducting the poll
start_date: the date that the poll was started
end_date: the date that the poll was ended
sample_size: the sample size (number of people surveyed) for the poll
Harris: the percent of respondents who said they plan to vote for Kamala Harris
Trump: the percent of respondents who said they plan to vote for Donald Trump
Note that this is kind of a messy dump of a wide range of polls of different levels of quality. This, plus the fact that we’re nearly 3 months away from the election, means that the polls are not necessarily able to give a precise prediction of how the election will actually turn out.
Open RStudio and load the dataset by typing
load("PollsData.RData"). This will load a data frame called
POLLS into your workspace, which is the dataset. You can
now look at the first few rows of the data by typing
head(POLLS). You will probably also want to attach the
dataset by typing attach(POLLS) which will allow you to
reference variables in the dataset without telling R every time to look
in POLLS for the variable.
load("PollsData.RData")
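If it is unclear what load() actually put into the workspace, one way to check is to capture its return value, which is the names of the restored objects. The sketch below does not use the real file: it saves a small, hypothetical stand-in data frame to a temporary .RData file and loads it back, purely to illustrate the behavior.

```r
# Hypothetical stand-in for the real dataset: a tiny data frame
# saved to a temporary .RData file (not the actual PollsData.RData)
POLLS <- data.frame(Harris = c(48, 45), Trump = c(46, 47))
tmp <- tempfile(fileext = ".RData")
save(POLLS, file = tmp)
rm(POLLS)

# load() restores the saved objects and invisibly returns their names
loaded <- load(tmp)
loaded           # names of the objects that were restored
exists("POLLS")  # checks that the data frame is back in the workspace
head(POLLS)      # first few rows of the restored data frame
```

With the real file, the same pattern would be loaded <- load("PollsData.RData"), assuming the file sits in the working directory.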
Calculate the sample mean and sample median of the percent of
respondents supporting Trump and also the percent of respondents
supporting Harris (i.e. find the mean and the median separately for
Trump and also for Harris). Briefly describe
what you calculated and what it tells you.
median(POLLS$Trump)
## [1] 45.7
median(POLLS$Harris)
## [1] 46.2
The sample mean of the percent of respondents supporting Trump is 45.3 and the sample median is 45.7. The sample mean for Harris is 46.5 and the sample median is 46.2. For Trump, the mean is slightly below the median, which suggests a mildly left-skewed distribution where some polls with lower support percentages are pulling the mean down. For Harris, the mean is slightly above the median, which suggests a mildly right-skewed distribution where some polls with higher support percentages are pulling the mean up.
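The means quoted in the answer would come from mean(), used the same way as median(). As a minimal sketch of why a mean below the median hints at left skew, the vector below is hypothetical (it is not taken from POLLS): most values sit near 46 and one low value drags the mean down while leaving the median alone.

```r
# Hypothetical poll percentages (not from POLLS):
# mostly near 46, with one unusually low poll
support <- c(45, 46, 46, 47, 47, 38)

m <- mean(support)      # pulled down by the low poll
med <- median(support)  # unaffected by how low the outlier is

m
med
m < med  # mean below median, the pattern associated with left skew
```

On the real data, the corresponding calls would be mean(POLLS$Trump) and mean(POLLS$Harris).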
Calculate the variance and standard deviation of the percent of
respondents supporting Trump and of the percent supporting Harris. Briefly describe what you
calculated and what it tells you (focus your description here on the
standard deviation).
var(POLLS$Trump)
## [1] 11.41569
var(POLLS$Harris)
## [1] 14.50466
sd(POLLS$Trump)
## [1] 3.378711
sd(POLLS$Harris)
## [1] 3.808499
Trump has a variance of 11.42 and a standard deviation of 3.38 percentage points. Harris has a variance of 14.50 and a standard deviation of 3.81 percentage points. The higher standard deviation for Harris shows that her support varies somewhat more from poll to poll than Trump’s does: a typical poll is about 3.8 points away from her mean, compared with about 3.4 points for Trump.
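The standard deviation is the square root of the variance, which is why it is easier to interpret: it is back in percentage points rather than squared percentage points. A quick check using the variances printed in the output above (the variable names here are just for illustration):

```r
# Variances copied from the var() output above
var_trump <- 11.41569
var_harris <- 14.50466

# The standard deviation is the square root of the variance
sd_trump <- sqrt(var_trump)
sd_harris <- sqrt(var_harris)

round(sd_trump, 2)
round(sd_harris, 2)
```

These should agree with the sd() results shown earlier, up to rounding.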
Create a new variable called HarrisTwoParty that is
equal to the proportion of the two-party vote that Harris has in each
poll. In other words, we want to calculate what proportion of
respondents in a poll supported Harris out of those who supported either
Harris or Trump (i.e. ignoring any respondents who supported other
candidates or hadn’t made up their mind). Note that the variables
Trump and Harris add up to less than 100 for
each specific poll meaning that some percentage of voters did not say
they would vote for one of the major party candidates. Often, this share
of the two-party vote is used as a measure of how far ahead or behind
the Democratic (or Republican) candidate is in a given race.
(Hint: you want to write something like
newvar <- EXPRESSION where newvar is the
name you want to call your new variable and EXPRESSION is a
mathematical expression that calculates the values for your new variable
based on other variables. For example we could say
newvar <- var1 + var2 if we wanted our new variable to
equal one variable called var1 plus another variable called
var2. You can also use parentheses and other mathematical
expressions (e.g. * for multiplication, / for
division) to make these expressions. Think about how you could find the
proportion supporting Harris out of the total of voters supporting
Harris and those supporting Trump.)
Then, make a histogram of the newly created variable and briefly describe what you learn.
HarrisTwoParty <- POLLS$Harris / (POLLS$Trump + POLLS$Harris)
hist(HarrisTwoParty)
The histogram shows me that Harris’s share of the two-party vote is clustered close to 0.5 in most polls, meaning support is split pretty evenly between Trump and Harris once the people who don’t support either are factored out. The exception is a few outlying polls.
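To make the two-party share concrete, here is a small worked sketch with three hypothetical polls (the numbers are made up for illustration and are not from POLLS). A share above 0.5 means Harris leads among respondents who picked one of the two candidates, below 0.5 means Trump leads, and exactly 0.5 is a tie.

```r
# Hypothetical poll percentages (not from POLLS)
harris <- c(48, 45, 47)
trump <- c(46, 47, 47)

# Harris's proportion of the two-party vote in each poll
share <- harris / (harris + trump)

round(share, 3)
share > 0.5  # TRUE where Harris leads among two-party respondents
```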
Create another new variable called Other which is the
percent (not proportion) of respondents in each poll that did not say
they would vote for Harris or for Trump.
Calculate the mean of this variable, make a histogram of it, and briefly describe what you learn.
(Hint: this is just 100 minus the percent that say they’re voting for
Trump or for Harris. Note that the variables Trump and
Harris are each percentages, not proportions.)
Other <- 100 - POLLS$Trump - POLLS$Harris
mean(Other)
## [1] 8.19
hist(Other)
The mean percentage of respondents who don’t support either Trump or Harris is 8.19 percent. The histogram for this variable is skewed to the right. This shows that most polls had fewer respondents who did not pick Trump or Harris than the mean might suggest, with a few polls with a large share of such respondents pulling the mean up.
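A right-skewed variable typically has a mean above its median, because a few large values pull the mean up. The sketch below uses five hypothetical polls (not from POLLS), one of which has an unusually large share of respondents choosing neither candidate, to show the same calculation on a small scale.

```r
# Hypothetical poll percentages (not from POLLS);
# the last poll has many respondents choosing neither candidate
harris <- c(48, 47, 46, 47, 40)
trump <- c(47, 47, 48, 46, 40)

# Percent of respondents not choosing Harris or Trump
other <- 100 - trump - harris
other

mean(other)    # pulled up by the one poll with a large "other" share
median(other)  # closer to what a typical poll looks like
```

Comparing mean(Other) with median(Other) on the real data would be one way to back up what the histogram suggests.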
How many of these polls have Harris ahead of Trump and how many have Trump ahead of Harris?
The easiest way to do this is to make a table based on a logical
condition for one or more variables. For example you could say
table(var1 > var2) to make a table showing for how many
observations a variable var1 is greater than another
variable var2. You could also use a variable you created
earlier in this lab to make the table.
table(POLLS$Harris > POLLS$Trump)
##
## FALSE TRUE
## 69 66
table(POLLS$Trump > POLLS$Harris)
##
## FALSE TRUE
## 77 58
Harris is ahead in 66 of the 135 polls and Trump is ahead in 58 of them. In the remaining 11 polls the two candidates are tied. Being ahead in a poll does not mean the lead is conclusive, since every poll has sampling error.
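The two tables above can also be collapsed into one table with three categories, so ties show up directly instead of having to be inferred by subtraction. This is one possible approach, sketched on four hypothetical polls (not from POLLS); the variable name lead is just for illustration.

```r
# Hypothetical poll percentages (not from POLLS)
harris <- c(48, 45, 47, 46)
trump <- c(46, 47, 47, 44)

# Label each poll by who is ahead, with ties as their own category
lead <- ifelse(harris > trump, "Harris",
               ifelse(trump > harris, "Trump", "Tie"))

counts <- table(lead)
counts

# The tie count implied by the two tables in the answer above
135 - 66 - 58
```

On the real data, replacing the toy vectors with POLLS$Harris and POLLS$Trump should give the three counts in a single table.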