Lab 1: Describing Data Answer Key

Due by 11:00pm on 9/6, submitted through Canvas

In this activity, you will explore a dataset that includes a large number of general election polls that have been fielded since President Biden dropped out of the race (i.e., after the race shifted to being between Harris and Trump). The data file, called PollsData.RData, can be downloaded from Canvas.

You should start by downloading both this data file and the .Rmd template file to your computer, saving them in the same folder (ideally one set up for data analysis for this course). You should then start RStudio not by clicking the application icon but by double-clicking the .Rmd template, which should open RStudio with the working directory for R automatically set to the location of the .Rmd file (which should also be the location of the dataset).

Below are brief descriptions of the variables (we won’t use all of these variables in this lab):

poll_id: a numeric identifier for each poll
pollster: the name of the pollster who conducted the poll
methodology: the method of conducting the poll
start_date: the date that the poll was started
end_date: the date that the poll was ended
sample_size: the sample size (number of people surveyed) for the poll
Harris: the percent of respondents who said they plan to vote for Kamala Harris
Trump: the percent of respondents who said they plan to vote for Donald Trump

Note that this is kind of a messy dump of a wide range of polls of different levels of quality. This, plus the fact that we’re nearly 3 months away from the election, means that the polls are not necessarily able to give a precise prediction of how the election will actually turn out.

Question 1: Loading and Exploring the Dataset

Open RStudio and load the dataset by typing load("PollsData.RData"). This will load the dataset into your workspace as a data frame called POLLS. You can now look at the first few rows of the data by typing head(POLLS). You will probably also want to attach the dataset by typing attach(POLLS), which will allow you to reference variables in the dataset without telling R every time to look in POLLS for the variable.

load("PollsData.RData")
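A minimal sketch of the load-and-inspect step, written so that it runs without PollsData.RData. It saves a small made-up data frame to a temporary file and loads it back. The values and the file path are illustrative assumptions, not the real data; only the column names follow the variable list above.

```r
# Toy stand-in for the real dataset (values are made up for illustration).
POLLS <- data.frame(
  poll_id = 1:4,
  Harris  = c(47, 45, 48, 46),
  Trump   = c(45, 46, 44, 46)
)

# Save it as an .RData file, then remove it from the workspace.
toy_file <- tempfile(fileext = ".RData")
save(POLLS, file = toy_file)
rm(POLLS)

# load() restores objects under their saved names and
# (invisibly) returns those names as a character vector.
loaded_names <- load(toy_file)
print(loaded_names)

# Inspect the first rows and the structure of the data frame.
head(POLLS)
str(POLLS)
```

Capturing the return value of load() is a quick way to confirm which objects a file added to the workspace. Variables can then be referenced as POLLS$Trump and POLLS$Harris without attaching the data frame.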

Question 2: Measures of Central Tendency

Calculate the sample mean and sample median of the percent of respondents supporting Trump and also the percent of respondents supporting Harris (i.e. find the mean and the median separately for Trump and also for Harris). Briefly describe what you calculated and what it tells you.

median(POLLS$Trump)

## [1] 45.7

median(POLLS$Harris)

## [1] 46.2

The sample mean of the percent of respondents supporting Trump is 45.3 and the sample median is 45.7. For Harris, the sample mean is 46.5 and the sample median is 46.2. These are averages across polls, so they describe a typical poll rather than any individual respondent. For Trump, the mean is below the median, which suggests a left-skewed distribution in which a few polls with lower support percentages pull the mean down. For Harris, the mean is above the median, which suggests a right-skewed distribution in which a few polls with higher support percentages pull the mean up.
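On the real data, the sample means come from mean(POLLS$Trump) and mean(POLLS$Harris), in the same way as the median() calls above. The sketch below uses made-up percentages (not values from PollsData.RData) to illustrate how a single low poll pulls the mean below the median.

```r
# Made-up poll percentages: mostly mid-40s, with one low outlier.
toy_support <- c(44, 45, 46, 46, 47, 35)

toy_mean   <- mean(toy_support)    # sum of the values divided by their count
toy_median <- median(toy_support)  # middle value once the data are sorted

# The low outlier pulls the mean below the median,
# which is the pattern associated with left skew.
toy_mean < toy_median
```

Comparing the mean with the median is only a rough guide to skew; a histogram is the more direct check.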

Question 3: Measures of Variation

Calculate the variance and standard deviation of the percent of respondents supporting Trump and of the percent supporting Harris. Briefly describe what you calculated and what it tells you (focus your description here on the standard deviation).

var(POLLS$Trump)

## [1] 11.41569

var(POLLS$Harris)

## [1] 14.50466

sd(POLLS$Trump)

## [1] 3.378711

sd(POLLS$Harris)

## [1] 3.808499

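Trump has a variance of 11.42 and a standard deviation of 3.38. Harris has a variance of 14.50 and a standard deviation of 3.81. The standard deviation is in the same units as the data, so polls typically differ from the average by roughly 3.4 percentage points for Trump and 3.8 percentage points for Harris. The larger standard deviation for Harris shows that her support varies somewhat more from poll to poll than Trump’s does.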
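As a quick check on the relationship between the two measures, the standard deviation is the square root of the variance. The sketch below copies the variances from the output above and recovers the standard deviations; the variable names are illustrative.

```r
# Variances copied from the var() output above.
var_trump  <- 11.41569
var_harris <- 14.50466

# The standard deviation is the square root of the variance,
# so it is measured in percentage points like the data itself.
sd_trump  <- sqrt(var_trump)
sd_harris <- sqrt(var_harris)

round(c(Trump = sd_trump, Harris = sd_harris), 2)
```

The results should agree with the sd() output up to rounding, which is why the description focuses on the standard deviation: it can be read directly in percentage points.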

Question 4: Creating a New Variable

Create a new variable called HarrisTwoParty that is equal to the proportion of the two-party vote that Harris has in each poll. In other words, we want to calculate what proportion of respondents in a poll supported Harris out of those who supported either Harris or Trump (i.e., ignoring any respondents who supported other candidates or hadn’t made up their mind). Note that the variables Trump and Harris add up to less than 100 for each specific poll, meaning that some percentage of voters did not say they would vote for one of the major party candidates. Often, this share of the two-party vote is used as a measure of how far ahead or behind the Democratic (or Republican) candidate is in a given race.

(Hint: you want to write something like newvar <- EXPRESSION where newvar is the name you want to call your new variable and EXPRESSION is a mathematical expression that calculates the values for your new variable based on other variables. For example we could say newvar <- var1 + var2 if we wanted our new variable to equal one variable called var1 plus another variable called var2. You can also use parentheses and other mathematical expressions (e.g. * for multiplication, / for division) to make these expressions. Think about how you could find the proportion supporting Harris out of the total of voters supporting Harris and those supporting Trump.)

Then, make a histogram of the newly created variable and briefly describe what you learn.

HarrisTwoParty <- POLLS$Harris / (POLLS$Trump + POLLS$Harris) 
hist(HarrisTwoParty)

The histogram shows that Harris’s share of the two-party vote is concentrated near 0.5, meaning that support is split fairly evenly between Harris and Trump once the people who don’t support either are set aside. The exception is a few outlying polls.
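To make the two-party calculation concrete, here is a small sketch with one made-up poll. The helper name two_party_share is hypothetical and not part of the lab template; it applies the same formula as the HarrisTwoParty line above.

```r
# Hypothetical helper: Harris's share of the two-party vote.
two_party_share <- function(harris, trump) {
  harris / (harris + trump)
}

# Made-up example poll: 48% Harris, 46% Trump, 6% other or undecided.
share <- two_party_share(48, 46)
share         # 48 / 94, a little over 0.5
share > 0.5   # TRUE exactly when Harris is ahead of Trump in the poll
```

A value above 0.5 means Harris leads among respondents who chose one of the two major-party candidates, and a value below 0.5 means Trump leads.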

Question 5: How many “other” voters

Create another new variable called Other which is the percent (not proportion) of respondents in each poll that did not say they would vote for Harris or for Trump.

Calculate the mean of this variable, make a histogram of it, and briefly describe what you learn.

(Hint: this is just 100 minus the percent that say they’re voting for Trump or for Harris. Note that the variables Trump and Harris are each percentages, not proportions.)

Other <- 100 - POLLS$Trump - POLLS$Harris
mean(Other)

## [1] 8.19

hist(Other)

The mean percentage of respondents who don’t support either Trump or Harris is 8.19 percent. The histogram for this variable is skewed to the right. This shows that most polls had fewer people decline to pick one of the two candidates than the mean might suggest, with a few polls with a large “other” share pulling the mean up.
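Because Other is 100 minus the two percentages, its mean equals 100 minus the two means. With the rounded means reported earlier, 100 - 45.3 - 46.5 = 8.2, which is in line with the 8.19 above. The sketch below shows the same identity with made-up numbers (not the real data).

```r
# Made-up percentages for five polls.
toy_trump  <- c(45, 46, 44, 47, 43)
toy_harris <- c(47, 45, 48, 46, 44)

# Percent of respondents choosing neither major-party candidate.
toy_other <- 100 - toy_trump - toy_harris

# The mean of Other equals 100 minus the two means.
mean(toy_other)
100 - mean(toy_trump) - mean(toy_harris)
```

The two printed values should match, which makes this a cheap sanity check on a newly created variable.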

Question 6: How many polls have Harris ahead of Trump?

How many of these polls have Harris ahead of Trump and how many have Trump ahead of Harris?

The easiest way to do this is to make a table based on a logical condition for one or more variables. For example you could say table(var1 > var2) to make a table showing for how many observations a variable var1 is greater than another variable var2. You could also use a variable you created earlier in this lab to make the table.

table(POLLS$Harris > POLLS$Trump)

## 
## FALSE  TRUE 
##    69    66

table(POLLS$Trump > POLLS$Harris)

## 
## FALSE  TRUE 
##    77    58

Harris is ahead in 66 of the 135 polls and Trump is ahead in 58. In the remaining 11 polls (135 - 66 - 58) neither comparison is TRUE, which means the two candidates are tied.
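The two tables can also be combined into a single three-way count that reports ties directly. This is a sketch using made-up percentages (not the real data), and the label names are illustrative.

```r
# Made-up percentages for six polls.
toy_harris <- c(47, 45, 46, 48, 44, 46)
toy_trump  <- c(45, 46, 46, 44, 47, 46)

# sign() gives 1 when Harris leads, -1 when Trump leads, and 0 for a tie.
lead <- factor(sign(toy_harris - toy_trump),
               levels = c(1, -1, 0),
               labels = c("Harris ahead", "Trump ahead", "Tied"))

lead_counts <- table(lead)
lead_counts
```

On the real data, the same approach with POLLS$Harris and POLLS$Trump should reproduce the counts from the two tables above in one step.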