Load libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.3
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.3
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(readr)
## Warning: package 'readr' was built under R version 4.0.3
Load dataset
cases_county <- read_csv("C:\\Users\\cdgm9\\Desktop\\Academic\\FALL 2020\\STAT 20\\Projects\\Project 2\\covid_cases_county.csv")
##
## -- Column specification -----------------------------------------------------------------------------------------------------------------------------------------------------------
## cols(
## countyFIPS = col_double(),
## `County Name` = col_character(),
## State = col_character(),
## stateFIPS = col_double(),
## date = col_date(format = ""),
## cases = col_double()
## )
We will compare the rate of confirmed cases in California to that of Iowa. The two states approached shelter-in-place (SIP) orders differently: California implemented them early, while Iowa did not.
Below is the filtered table, which includes only the rows corresponding to the states of CA and IA.
covid_cases_by_state<- cases_county %>%
filter(State == "CA" | State == "IA")%>%
group_by(date, State) %>%
mutate(ConfirmedCases = sum(cases)) %>%
filter(!duplicated(date)) %>%
select(-cases, -`County Name`, -countyFIPS) %>% # remove these variables
ungroup()
#first few rows
head(covid_cases_by_state, n = 2)
## # A tibble: 2 x 4
## State stateFIPS date ConfirmedCases
## <chr> <dbl> <date> <dbl>
## 1 CA 6 2020-01-22 0
## 2 IA 19 2020-01-22 0
#last few rows
tail(covid_cases_by_state, n = 2)
## # A tibble: 2 x 4
## State stateFIPS date ConfirmedCases
## <chr> <dbl> <date> <dbl>
## 1 CA 6 2020-10-26 910881
## 2 IA 19 2020-10-26 116452
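For reference, an equivalent and somewhat more idiomatic way to build the same one-row-per-state-per-date totals is to aggregate with group_by() and summarise(); this is only a sketch and is not the pipeline used for the figures below.
#Equivalent aggregation (sketch): sum county cases within each State and date
covid_cases_by_state_alt <- cases_county %>%
filter(State %in% c("CA", "IA")) %>%
group_by(State, stateFIPS, date) %>%
summarise(ConfirmedCases = sum(cases)) %>%
ungroup()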
In this section we compare the number of cases in absolute terms and in per capita terms. We use scatter plots for this purpose since there are enough points that they visually define smooth curves, and we care about the trend of the data rather than specific values at a given time. Furthermore, this approach yields figures with a high data-to-ink ratio.
If we compare the number of confirmed cases of each state in absolute terms over the same period (January 22 - October 26), it is clear that CA has far more cases than IA throughout. The number of confirmed cases in both states increases over the entire period.
# Figure 1
ggplot(covid_cases_by_state, aes(x = date, y= ConfirmedCases, color = State)) +
geom_point() +
theme_classic() +
labs(title = "Confirmed Cases vs. Time (Jan 22 - Oct. 26)",
y = "Confirmed Cases",
x = "Time",
tag = "Figure 1")
However, these facts are not surprising, since CA's population dwarfs Iowa's. For a better comparison we need to compare the rate of confirmed cases per capita.
Below we create two data frames (ca_frame and ia_frame), each of which is specific to one state (CA or IA), covers the desired time interval, and replaces confirmed cases with confirmed cases per capita.
#CA population
ca_pop <- 39250017
#CA Confirmed Cases per capita = Confirmed Cases / Population
ca_cc_per_capita <- covid_cases_by_state %>%
filter(State == "CA") %>%
pull(ConfirmedCases) / ca_pop
#Date of Confirmed Cases for CA
ca_cc_date <- covid_cases_by_state %>%
filter(State == "CA") %>%
pull(date)
#CA frame with Date and Cases per capita
ca_frame <- data.frame(date = ca_cc_date, cc_per_capita = ca_cc_per_capita)
#IA population
ia_pop <- 3134693
#IA Confirmed Cases per capita
ia_cc_per_capita <- covid_cases_by_state %>%
filter(State == "IA") %>%
pull(ConfirmedCases) / ia_pop
#Date of Confirmed Cases for IA
ia_cc_date <- covid_cases_by_state %>%
filter(State == "IA") %>%
pull(date)
#IA frame with Date and Cases per capita
ia_frame <- data.frame(date = ia_cc_date, cc_per_capita = ia_cc_per_capita)
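An equivalent, more compact construction (shown here only as a sketch, using the same population figures) joins a small population lookup table and computes cases per capita for both states in one pipeline; with a combined frame like this, Figure 2 could be drawn with a single geom_point() call.
#Sketch: population lookup table
state_pops <- tibble(State = c("CA", "IA"), population = c(ca_pop, ia_pop))
#One frame with per-capita confirmed cases for both states
cc_per_capita_by_state <- covid_cases_by_state %>%
left_join(state_pops, by = "State") %>%
mutate(cc_per_capita = ConfirmedCases / population)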
In terms of confirmed cases per capita, Iowa's count rises faster and reaches a higher level than California's, as shown below. We suspect that CA's earlier implementation of SIP orders may explain the apparent difference.
# Figure 2
ggplot() +
geom_point(data = ia_frame, mapping = aes(x = date, y = cc_per_capita, color = "IA"), inherit.aes = FALSE) +
geom_point(data = ca_frame, mapping = aes(x = date, y = cc_per_capita, color = "CA"), inherit.aes = FALSE) +
theme_classic()+
labs(title = "Confirmed Cases Per Capita vs. Time (Jan. 22 - Oct. 26)",
y = "Confirmed Cases per Capita",
x = "Time",
color = "State",
tag = "Figure 2")
We use dplyr to extract the proportion of confirmed cases on October 26 from the newly created data frames (ca_frame and ia_frame).
#CA confirmed cases on Oct. 26
ca_cc_per_capita_final <- ca_frame %>% pull(cc_per_capita) %>% tail(n = 1)
cat("CA confirmed cases per capita on Oct. 26:",
format(ca_cc_per_capita_final, nsmall = 3, digits = 3))
## CA confirmed cases per capita on Oct. 26: 0.0232
#IA confirmed cases on Oct. 26
ia_cc_per_capita_final <- ia_frame %>% pull(cc_per_capita) %>% tail(n = 1)
cat("\nIA confirmed cases per capita on Oct. 26:",
format(ia_cc_per_capita_final, nsmall = 3, digits = 3))
##
## IA confirmed cases per capita on Oct. 26: 0.0371
We determine an approximate 95% confidence interval for the true difference in proportions between IA and CA, taking the observed difference plus or minus two standard errors.
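For reference, the code below implements the usual normal-approximation formulas for the standard error of a sample proportion and of a difference of two independent proportions (note that sqrt(n) * SD / n in the code is just another way of writing SD / sqrt(n)):
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \qquad SE(\hat{p}_{IA} - \hat{p}_{CA}) = \sqrt{SE(\hat{p}_{IA})^2 + SE(\hat{p}_{CA})^2}$$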
#CA SD
ca_sd <- (1-0)*sqrt(ca_cc_per_capita_final*(1 - ca_cc_per_capita_final))
#CA SE
se_ca <- sqrt(ca_pop)*ca_sd
se_ca_propor <- se_ca / ca_pop
#IA SD
ia_sd <- (1-0)*sqrt(ia_cc_per_capita_final*(1 - ia_cc_per_capita_final))
#IA SE
se_ia <- sqrt(ia_pop)*ia_sd
se_ia_propor <- se_ia / ia_pop
#SE Difference
se_diff <- sqrt(se_ca_propor^2+se_ia_propor^2)
# 95%-CI for Cases Per Capita Differences
cat("95%-Confidence Interval:",
format((ia_cc_per_capita_final - ca_cc_per_capita_final),
nsmall = 3, digits = 3),
"+/-",
format((2*se_diff), nsmall = 3, digits = 3)
)
## 95%-Confidence Interval: 0.0139 +/- 0.000219
Though the observed difference between the two sample proportions is 1.4%, once we account for variation due to chance, the calculated confidence interval lets us say we are 95% confident that the difference in confirmed cases per capita between the IA and CA populations is 1.4%, give or take 0.022%.
Null: The proportion of confirmed cases per capita is the same in both states; any apparent difference is due to chance.
Alternative: The proportion of confirmed cases per capita in IA is greater than that in CA. The difference is not due to chance and can be explained by IA's later adoption of SIP measures.
Since the two samples are independent, large enough, and assumed to be simple random samples, we use a two-sample z-test. Under the null hypothesis both states share a common proportion, so the counts and populations are pooled to estimate it, and the pooled SD is used for the standard errors below.
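As an optional cross-check (not part of the original analysis), base R's prop.test() produces a comparable interval for the difference in proportions directly from the raw counts and populations; correct = FALSE turns off the continuity correction so the result matches the normal-approximation approach used above.
#Cross-check (sketch): 95% CI for the IA - CA difference in proportions
#x = confirmed cases on Oct. 26 (IA, CA), n = state populations
prop.test(x = c(116452, 910881), n = c(3134693, 39250017), conf.level = 0.95, correct = FALSE)$conf.int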
ca <- 910881 / 39250017 #CA confirmed cases per capita
ia <- 116452 / 3134693 #IA confirmed cases per capita
cc <- 910881 + 116452 #pooled confirmed cases
pop <- 39250017+3134693 #pooled population
pool_sd <- sqrt(cc/pop*(1-cc/pop)) #pool SD
#SE CA
se_ca <- sqrt(ca_pop)*pool_sd
se_ca_propor <- se_ca / ca_pop
#SE IA
se_ia <- sqrt(ia_pop)*pool_sd
se_ia_propor <- se_ia / ia_pop
#SE Difference
se_diff <- sqrt(se_ca_propor^2+se_ia_propor^2)
#Confirmed Cases per capita Difference
diff <- (ia_cc_per_capita_final - ca_cc_per_capita_final)
#z-test
z_statistic <- (diff - 0) / se_diff
p_value <- 1 - pnorm(z_statistic)
significance <- p_value < .05
cat("\nz-statistic is", z_statistic)
##
## z-statistic is 154.4625
cat("\np-value is", p_value)
##
## p-value is 0
cat("\nIs it significant at .05 level?", significance)
##
## Is it significant at .05 level? TRUE
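A note on the printed p-value: for a z-statistic this large, 1 - pnorm(z_statistic) underflows to exactly 0 in double precision, so the 0 should be read as "smaller than the smallest representable number", not literally zero. If a usable magnitude is wanted, pnorm() can return the upper-tail probability on the log scale (a sketch):
#Upper-tail p-value on the log scale avoids the underflow to 0
log_p_value <- pnorm(z_statistic, lower.tail = FALSE, log.p = TRUE)
cat("\nlog of p-value is", log_p_value)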
Since the p-value is significant at the 0.05 level, we reject the null hypothesis and conclude that the apparent difference between the states is not due to chance. CA's earlier implementation of SIP orders could explain the difference.
The November 10th episode of The New York Times' podcast, The Daily, was dedicated to a postmortem look at the results of the presidential election. Host Michael Barbaro sat down with pollster Nate Cohn to discuss the polling errors of this election cycle, from which we can glean some insights.
Nate emphasizes the importance of understanding the limits of polling. The quality of polling results depends on the quality of the samples taken by the polling organization. When results are published and shared, they always include the desired statistic along with a margin of error. However, the level of precision and accuracy this communicates is misleading to the public (and the media). The reported margin of error reflects only sampling error; it says nothing about the error contributed by non-sampling error. Hence the true margin of error is larger than the one communicated or calculated. As Nate revealed, non-sampling errors are what caused the polling errors this cycle. Nate speculates about the different forms that non-sampling error, or bias, could have taken, including the following:
Republicans and Democrats unsatisfied with Trump more likely to respond to surveys.
Republicans less likely to respond because they distrust the media because of Trump.
Republicans less likely to respond because they distrust polls because of the 2016 election.
Selecting the wrong proportion of different types of Republicans, like rural vs. not rural.
Now, pollsters know they will encounter some kind of non-sampling error, so they weight the responses from certain groups. But underlying all of polling is the fact that pollsters have to screen for likely voters, which means guessing a plausible likely-voter profile; this is another source of error. Given these issues, we wonder whether the polling enterprise is worth the trouble. As Nate pointed out, in recent decades the “amount of error in the poll is basically equal to the range of possible results”, so polls don’t allow us to “rule out anything”.
And yet, the alternative to polling is “to talk to our neighbors and our like-minded friends”. Perhaps horse-race polls could be done away with, since polls are not predictors but snapshots of a race; the only way to see whether the race has changed is to take more snapshots. But polling in other areas, like survey research or races for seats where margins are not tight, still serves a purpose.