US Democrat presidents dataset
Here’s a dataset of US presidential elections. Each row holds the
county-level results for one county. The variables in the dataset
include the US state, the county within that state, and the percentage of
votes that went to the Democratic candidate in 2008, 2012, and
2016.
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(fst)
library(tibble)
library(readr)
dem_votes_potus_08_16 <- read_fst(uselect_path)
dem_votes_potus_08_16
colnames(dem_votes_potus_08_16)
[1] "state" "county" "FIPS" "dem_cand_votes_08" "dem_percent_08" "dem_cand_votes_12" "dem_percent_12"
[8] "dem_cand_votes_16" "dem_percent_16"
Hypotheses
One question is whether the percentage of votes given to the
Democratic candidate was lower in 2008 compared to
2012. To test this, we form the hypotheses. As before,
the null hypothesis is that our hunch is wrong, and
that the population parameters are the same in each year group.
The alternative hypothesis is that the parameter in 2008 was
lower than in 2012. I’m setting a significance level of
0.05.
One feature of this dataset is that the 2008 votes and the
2012 votes are paired, since they both refer to the same
county. That is, the 2008 and 2012 values aren’t independent from each
other. Some voting patterns may occur due to county-level demographics
and local politics. We want to capture this pairing in our model.
\[
H_{0}: \mu_{2008} \ - \mu_{2012} = 0
\\ H_{A}: \mu_{2008} \ - \mu_{2012} < 0
\]
\[
Set\ \alpha = 0.05: \ Significance \ level
\]
From two samples to one
For paired analyses, rather than considering the two variables
separately, we consider a single variable of the difference. In this
histogram of the difference most values are between minus ten and ten,
with a few outliers.
sample_data <- dem_votes_potus_08_16 %>%
mutate(diff = dem_percent_08 - dem_percent_12)
ggplot(sample_data, aes(x = diff)) +
geom_histogram(binwidth = 1)

head(sample_data$diff, 30)
[1] -2.23387788 1.20418692 -0.86343547 1.94779306 -1.56068097 2.19284462 0.46573839 -1.95303233 -0.15685772 1.09746821 -1.23219534 -0.65191061 1.07898478
[14] 2.90689306 -1.40145860 5.98220437 2.10897226 1.89697848 3.26170252 0.98114752 -0.35486217 3.46370218 3.06243598 1.50691738 4.39727218 2.72474816
[27] -0.09605252 2.30883794 2.32793299 3.61597762
Calculate sample statistics of the difference
The sample mean, x-bar, is calculated on this
difference. It is approximately \(3.06\).
xbar_diff <- sample_data %>%
summarise(xbar_diff = mean(diff)) %>%
pull(xbar_diff)
xbar_diff
[1] 3.059585
Revised hypotheses
We can restate the hypotheses in terms of the single population mean,
\(\mu_{diff}\), being equal to
or less than zero. The test statistic, \(t\), has a slightly simpler equation
compared to the two sample case. We have a single sample of differences, so the number of
degrees of freedom is the number of rows in the sample minus one.
Old hypotheses
\[
H_{0}: \mu_{2008} \ - \mu_{2012} = 0
\\ H_{A}: \mu_{2008} \ - \mu_{2012} < 0
\]
New hypothesis
\[
H_{0}: \mu_{diff} = 0 \\
H_{A}: \mu_{diff} < 0
\]
Calculating the p-value
To calculate the test statistic, we need the number of rows in the
dataset, 500, and the standard deviation of the differences. We
already know \(\bar{x}_{diff}\), the mean of
the differences. Assuming the null hypothesis is true means \(\mu_{diff}\) is zero.
\[
t = \frac{\bar{x}_{diff} - \mu_{diff}}{\sqrt{\frac{s^2_{diff}}{n_{diff}}}}
\]
n_diff <- nrow(sample_data)
n_diff
[1] 500
s_diff <- sample_data %>%
summarise(sd_diff = sd(diff)) %>%
pull(sd_diff)
s_diff
[1] 3.388767
t_stat <- (xbar_diff - 0) / sqrt(s_diff ^ 2 / n_diff)
t_stat
[1] 20.18859
degrees_of_freedom <- n_diff - 1
p_value <- pt(t_stat, df = degrees_of_freedom)
p_value
[1] 1
We now have everything we need to plug into the equation to calculate
\(t\). It’s \(20.19\). The degrees of freedom are one
less than \(n_{diff}\), at \(499\). Finally, we transform \(t\) with the
t-distribution CDF. The p-value is \(1\) (to the printed precision), far
above the significance level of \(0.05\). That means we fail to reject the
null hypothesis: there is no evidence that the Democratic
candidate got a smaller percentage of the vote in 2008 compared to
2012. In fact, the positive sample mean difference points the other way.
Testing differences between two means using
t.test()
That was a lot of calculating. Fortunately, there’s an easier way
using t.test(). It works with vectors, so the first
argument is the vector of differences. The alternative argument sets the
type of alternative hypothesis: "two.sided", "less", or "greater". Finally, mu
specifies the value of \(\mu_{diff}\)
from the null hypothesis. Zero is the default, so strictly speaking we
didn’t need to specify it. In the output, you should recognize the
value of the test statistic and the degrees of freedom, as well as \(\bar{x}\) on the last line. The
p-value is reported as 1, the same number we calculated before.
(When a p-value is extremely small, R instead prints it as
"< 2.2e-16", because smaller values are less
reliable due to computational accuracy constraints.)
t.test(
# Vector of differences
sample_data$diff,
# Choose between "two.sided", "less", "greater"
alternative = "less",
# Null hypothesis population parameter
mu = 0
)
One Sample t-test
data: sample_data$diff
t = 20.189, df = 499, p-value = 1
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
-Inf 3.309327
sample estimates:
mean of x
3.059585
t.test() with paired = TRUE
There’s a variation of t.test() for paired data that
requires even less work. Rather than calculating the difference between
the two paired variables, you can just pass them both directly to
t.test() and set paired to TRUE. Notice that all the
numbers are the same.
t.test(sample_data$dem_percent_08, sample_data$dem_percent_12,
alternative = "less", mu = 0, paired = TRUE)
Paired t-test
data: sample_data$dem_percent_08 and sample_data$dem_percent_12
t = 20.189, df = 499, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
-Inf 3.309327
sample estimates:
mean difference
3.059585
Unpaired t.test()
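If we don’t set paired = TRUE and instead perform an
unpaired t-test, the numbers change. The test statistic is much closer
to zero (3.36 instead of 20.19), there are more degrees of freedom (about
995 instead of 499), and the p-value becomes 0.9996. Because the unpaired
test ignores the county-level pairing, its standard error is larger, so
performing an unpaired t-test on paired data increases the chance of a
false negative error.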
t.test(x = sample_data$dem_percent_08, y = sample_data$dem_percent_12,
alternative = "less", mu = 0)
Welch Two Sample t-test
data: sample_data$dem_percent_08 and sample_data$dem_percent_12
t = 3.3584, df = 994.6, p-value = 0.9996
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf 4.559485
sample estimates:
mean of x mean of y
41.84253 38.78294
Visualizing the difference
Before you start running hypothesis tests, it’s a great idea to
perform some exploratory data analysis. That is, calculating summary
statistics and visualizing distributions.
Here, we’ll look at the percentage of county-level votes for the
Democratic candidate in 2012 and 2016, using the
dem_votes_potus_08_16 dataset loaded earlier. Since the counties are the same in
both years, these samples are paired. The columns containing the samples
are dem_percent_12 and dem_percent_16.
# Calculate the differences from 2012 to 2016
sample_dem_data <- dem_votes_potus_08_16 %>%
select(state, county, dem_percent_12, dem_percent_16) %>%
mutate(diff = dem_percent_12 - dem_percent_16)
sample_dem_data
# Find mean & standard deviations of differences
diff_stats <- sample_dem_data %>%
summarise(xbar_diff = mean(diff), s_diff = sd(diff))
# See the results
diff_stats
# Using sample_dem_data, plot diff as a histogram
ggplot(sample_dem_data, aes(diff)) +
geom_histogram(binwidth = 1)

Using t.test()
Manually calculating test statistics and transforming them with a CDF
to get a p-value is a lot of effort to do every time we
need to compare two sample means. The comparison of two sample means is
called a t-test, and R has a t.test() function to
accomplish it. This function provides some flexibility in how you
perform the test.
Now, we’ll explore the difference between the proportion of
county-level votes for the Democratic candidate in 2012 and 2016.
# Conduct a t-test on diff
test_results <- t.test(sample_dem_data$diff, alternative = "greater", mu = 0)
# See the results
test_results
One Sample t-test
data: sample_dem_data$diff
t = 30.298, df = 499, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
6.45787 Inf
sample estimates:
mean of x
6.829313
# Conduct a paired t-test on dem_percent_12 and dem_percent_16
test_results <- t.test(sample_dem_data$dem_percent_12, sample_dem_data$dem_percent_16, alternative = "greater", mu = 0, paired = TRUE)
# See the results
test_results
Paired t-test
data: sample_dem_data$dem_percent_12 and sample_dem_data$dem_percent_16
t = 30.298, df = 499, p-value < 2.2e-16
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
6.45787 Inf
sample estimates:
mean difference
6.829313
What is the correct decision from the t-test, assuming \(\alpha = 0.001\)?
Reject the null hypothesis
Compare the paired t-test to an (inappropriate) unpaired test on the
same data.
How does the p-value change?
# Conduct an unpaired (Welch) t-test on dem_percent_12 and dem_percent_16
test_results <- t.test(x=sample_dem_data$dem_percent_12, y=sample_dem_data$dem_percent_16, alternative = "greater", mu = 0)
test_results
Welch Two Sample t-test
data: sample_dem_data$dem_percent_12 and sample_dem_data$dem_percent_16
t = 7.1816, df = 997.19, p-value = 6.732e-13
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
5.263684 Inf
sample estimates:
mean of x mean of y
38.78294 31.95363
The p-value from the unpaired test is greater than the p-value
from the paired test.
END
---
title: "Paired t-tests"
output: html_notebook
date: 2024-03-18
author: Juan Fernando Mosquera Araujo
---

### **US Democrat presidents dataset**

Here's a dataset of US presidential elections. Each row holds the county-level results for one county. The variables in the dataset include the US state, the county within that state, and the percentage of votes that went to the Democratic candidate in 2008, 2012, and 2016.

```{r}
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(fst)
library(tibble)
library(readr)
```

```{r echo=FALSE}
uselect_path <- "C:/Users/JuanFer Mosquera/Documents/datasets/dem_county_pres_joined.fst"
```

```{r}
dem_votes_potus_08_16 <- read_fst(uselect_path)
dem_votes_potus_08_16
colnames(dem_votes_potus_08_16)
```

### **Hypotheses**

One question is whether the percentage of votes given to the Democratic candidate **was lower in 2008 compared to 2012**. To test this, we form the hypotheses. As before, **the null hypothesis is that our hunch is wrong**, and that the population parameters are the same in each year group. **The alternative hypothesis is that the parameter in 2008 was lower than in 2012**. I'm setting a significance level of **0.05**.

One feature of this dataset is that **the 2008 votes and the 2012 votes are paired**, since they both refer to the same county. That is, the 2008 and 2012 values aren't independent from each other. Some voting patterns may occur due to county-level demographics and local politics. We want to capture this pairing in our model.

$$
H_{0}: \mu_{2008} \ - \mu_{2012} = 0
\\ H_{A}: \mu_{2008} \ - \mu_{2012} < 0
$$

$$
Set\ \alpha = 0.05: \ Significance \ level
$$

### **From two samples to one**

For paired analyses, rather than considering the two variables separately, we consider a single variable of the difference. In this histogram of the difference most values are between minus ten and ten, with a few outliers.

```{r}
sample_data <- dem_votes_potus_08_16 %>%
  mutate(diff = dem_percent_08 - dem_percent_12)
```

```{r}
ggplot(sample_data, aes(x = diff)) +
  geom_histogram(binwidth = 1)
  
```

```{r}
head(sample_data$diff, 30)
```

### **Calculate sample statistics of the difference**

The sample mean, `x-bar`, is calculated on this difference. It is approximately $3.06$.

```{r}
xbar_diff <- sample_data %>%
  summarise(xbar_diff = mean(diff)) %>%
  pull(xbar_diff)

xbar_diff
```

### **Revised hypotheses**

We can restate the hypotheses in terms of the single population mean, $\mu_{diff}$, being **equal to or less than zero**. The test statistic, $t$, has a slightly simpler equation compared to the two sample case. We have a single sample of differences, so the number of degrees of freedom is the number of rows in the sample minus one.

#### **Old hypotheses**

$$
H_{0}: \mu_{2008} \ - \mu_{2012} = 0
\\ H_{A}: \mu_{2008} \ - \mu_{2012} < 0
$$

#### **New hypothesis**

$$
H_{0}: \mu_{diff} = 0 \\
H_{A}: \mu_{diff} < 0
$$

### **Calculating the p-value**

To calculate the test statistic, we need the number of rows in the dataset, 500, and the standard deviation of the differences. We already know $\bar{x}_{diff}$, the mean of the differences. Assuming the null hypothesis is true means $\mu_{diff}$ is zero.

$$
t = \frac{\bar{x}_{diff} - \mu_{diff}}{\sqrt{\frac{s^2_{diff}}{n_{diff}}}}
$$

```{r}
n_diff <- nrow(sample_data)
n_diff
```

```{r}
s_diff <- sample_data %>%
  summarise(sd_diff = sd(diff)) %>%
  pull(sd_diff)

s_diff
```

```{r}
t_stat <- (xbar_diff - 0) / sqrt(s_diff ^ 2 / n_diff)
t_stat
```

```{r}
degrees_of_freedom <- n_diff - 1

p_value <- pt(t_stat, df = degrees_of_freedom)
p_value
```

We now have everything we need to plug into the equation to calculate $t$. It's $20.19$. The degrees of freedom are one less than $n_{diff}$, at $499$. Finally, we transform $t$ with the t-distribution CDF. The p-value is $1$ (to the printed precision), far above the significance level of $0.05$. That means we fail to reject the null hypothesis: there is no evidence that the Democratic candidate got a smaller percentage of the vote in 2008 compared to 2012. In fact, the positive sample mean difference points the other way.
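The manual steps above can be collected into one small function. The sketch below is only an illustration: `one_sample_t_manual()` is a hypothetical helper name, and `diff_sim` is simulated data rather than the election dataset. It covers all three alternative hypotheses and compares the result with R's built-in `t.test()`.

```r
# Hypothetical helper (not part of the original analysis):
# one-sample t-test on a vector of differences, computed by hand.
one_sample_t_manual <- function(d, mu = 0,
                                alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  n <- length(d)
  # Test statistic: (sample mean - hypothesized mean) / standard error
  t_stat <- (mean(d) - mu) / sqrt(var(d) / n)
  df <- n - 1
  # Transform t with the t-distribution CDF; the tail depends on the alternative
  p_value <- switch(
    alternative,
    less      = pt(t_stat, df = df),
    greater   = pt(t_stat, df = df, lower.tail = FALSE),
    two.sided = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  )
  list(t_stat = t_stat, df = df, p_value = p_value)
}

# Simulated differences (assumption: stand-in for sample_data$diff)
set.seed(42)
diff_sim <- rnorm(50, mean = 3, sd = 3)

manual_result  <- one_sample_t_manual(diff_sim, alternative = "less")
builtin_result <- t.test(diff_sim, alternative = "less", mu = 0)

# The two approaches should agree up to floating-point error
all.equal(manual_result$t_stat, unname(builtin_result$statistic))
all.equal(manual_result$p_value, builtin_result$p.value)
```

With `alternative = "less"` the p-value is the lower tail of the t-distribution, which is why a large positive test statistic gives a p-value close to one.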

### **Testing differences between two means using t.test()**

That was a lot of calculating. Fortunately, there's an easier way using `t.test()`. It works with vectors, so the first argument is the vector of differences. The `alternative` argument sets the **type of alternative hypothesis**: `"two.sided"`, `"less"`, or `"greater"`. Finally, `mu` specifies the value of $\mu_{diff}$ from the null hypothesis. Zero is the default, so strictly speaking we didn't need to specify it. In the output, you should recognize the value of the test statistic and the degrees of freedom, as well as $\bar{x}$ on the last line. The **p-value** is reported as **1**, the same number we calculated before. (When a p-value is extremely small, R instead prints it as `< 2.2e-16`, because smaller values are less reliable due to computational accuracy constraints.)

```{r}
t.test(
  # Vector of differences
  sample_data$diff,
  # Choose between "two.sided", "less", "greater"
  alternative = "less",
  # Null hypothesis population parameter
  mu = 0
  )
```
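To see how the choice of `alternative` affects the p-value, here is a minimal sketch on simulated differences (`diff_sim` is made-up data, not the election dataset). For the same test statistic, the `"less"` and `"greater"` p-values are complementary tails, and the `"two.sided"` p-value is twice the smaller of the two.

```r
# Simulated differences with a positive mean (assumption: illustrative data only)
set.seed(1)
diff_sim <- rnorm(30, mean = 1, sd = 2)

p_less      <- t.test(diff_sim, alternative = "less", mu = 0)$p.value
p_greater   <- t.test(diff_sim, alternative = "greater", mu = 0)$p.value
p_two_sided <- t.test(diff_sim, alternative = "two.sided", mu = 0)$p.value

# The one-sided p-values are the two tails of the same t-distribution
p_less + p_greater

# The two-sided p-value doubles the smaller tail
c(p_two_sided, 2 * min(p_less, p_greater))
```

This is why testing in the direction opposite to the observed difference, as in the 2008 versus 2012 example, yields a p-value near one.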

### **`t.test()` with paired = TRUE**

There's a variation of `t.test()` for paired data that requires even less work. Rather than calculating the difference between the two paired variables, you can just pass them both directly to `t.test()` and set paired to TRUE. Notice that all the numbers are the same.

```{r}
t.test(sample_data$dem_percent_08, sample_data$dem_percent_12, 
       alternative = "less", mu = 0, paired = TRUE)
```
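The equivalence between the two calls can also be checked programmatically. The following is a small sketch on simulated percentages (`pct_a` and `pct_b` are hypothetical stand-ins for the two paired columns), comparing the one-sample test on the differences with the `paired = TRUE` form.

```r
# Simulated paired percentages (assumption: illustrative data only)
set.seed(7)
pct_a <- runif(40, min = 20, max = 60)
pct_b <- pct_a - rnorm(40, mean = 2, sd = 3)

# One-sample test on the differences
one_sample <- t.test(pct_a - pct_b, alternative = "less", mu = 0)

# Paired test on the two original vectors
paired <- t.test(pct_a, pct_b, alternative = "less", mu = 0, paired = TRUE)

# Statistic, degrees of freedom, p-value and estimate should all match
same_t  <- isTRUE(all.equal(unname(one_sample$statistic), unname(paired$statistic)))
same_df <- isTRUE(all.equal(unname(one_sample$parameter), unname(paired$parameter)))
same_p  <- isTRUE(all.equal(one_sample$p.value, paired$p.value))
same_xbar <- isTRUE(all.equal(unname(one_sample$estimate), unname(paired$estimate)))
c(same_t, same_df, same_p, same_xbar)
```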

### **Unpaired `t.test()`**

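If we don't set `paired = TRUE` and instead perform an unpaired t-test, the numbers change. The test statistic is much closer to zero (3.36 instead of 20.19), there are more degrees of freedom (about 995 instead of 499), and the p-value becomes 0.9996. Because the unpaired test ignores the county-level pairing, its standard error is larger, so **performing an unpaired t-test on paired data increases the chance of a false negative error**.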

```{r}
t.test(x = sample_data$dem_percent_08, y = sample_data$dem_percent_12,
       alternative = "less", mu = 0)
```
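The false negative risk is easier to see on data where a real but small shift exists. The sketch below uses simulated values, not the election data: each unit has its own baseline (large spread between units), and the first measurement is shifted up by about 0.5 relative to the second. All variable names are hypothetical.

```r
# Simulated paired measurements (assumption: illustrative data only)
set.seed(2024)
n_units  <- 200
baseline <- rnorm(n_units, mean = 40, sd = 10)      # large between-unit variation
x_sim <- baseline + rnorm(n_units, mean = 0.5, sd = 1)  # true shift of 0.5
y_sim <- baseline

# Paired test: between-unit variation cancels out in the differences
paired_test <- t.test(x_sim, y_sim, alternative = "greater", paired = TRUE)

# Unpaired (Welch) test: between-unit variation inflates the standard error
unpaired_test <- t.test(x_sim, y_sim, alternative = "greater")

c(paired = paired_test$p.value, unpaired = unpaired_test$p.value)
```

Under these assumptions the paired test detects the shift at the 0.05 level, while the unpaired test does not, even though both use exactly the same numbers.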

### **Visualizing the difference**

Before you start running hypothesis tests, it's a great idea to perform some exploratory data analysis. That is, calculating summary statistics and visualizing distributions.

Here, we'll look at the percentage of county-level votes for the Democratic candidate in 2012 and 2016, using the `dem_votes_potus_08_16` dataset loaded earlier. Since the counties are the same in both years, these samples are paired. The columns containing the samples are `dem_percent_12` and `dem_percent_16`.

```{r}
# Calculate the differences from 2012 to 2016
sample_dem_data <- dem_votes_potus_08_16 %>%
  select(state, county, dem_percent_12, dem_percent_16) %>%
  mutate(diff = dem_percent_12 - dem_percent_16)

sample_dem_data
```

```{r}
# Find mean & standard deviations of differences
diff_stats <- sample_dem_data %>%
  summarise(xbar_diff = mean(diff), s_diff = sd(diff))

# See the results
diff_stats
```

```{r}
# Using sample_dem_data, plot diff as a histogram
ggplot(sample_dem_data, aes(diff)) +
  geom_histogram(binwidth = 1)
```

### **Using t.test()**

Manually calculating test statistics and transforming them with a CDF to get a **p-value** is a lot of effort to do every time we need to compare two sample means. The comparison of two sample means is called a t-test, and R has a `t.test()` function to accomplish it. This function provides some flexibility in how you perform the test.

Now, we'll explore the difference between the proportion of county-level votes for the Democratic candidate in 2012 and 2016.

```{r}
# Conduct a t-test on diff
test_results <- t.test(sample_dem_data$diff, alternative = "greater", mu = 0)

# See the results
test_results
```

```{r}
# Conduct a paired t-test on dem_percent_12 and dem_percent_16
test_results <- t.test(sample_dem_data$dem_percent_12, sample_dem_data$dem_percent_16, alternative = "greater", mu = 0, paired = TRUE)

# See the results
test_results
```

### **What is the correct decision from the t-test assuming** $\alpha \ = 0.001$ ?

**Reject the null hypothesis**
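The decision rule behind this answer can be written as a short function. This is a sketch only: `decide()` is a hypothetical helper, and the p-values passed to it are the ones printed in the outputs of this notebook (with `2.2e-16` used as the upper bound R reports for the paired test).

```r
# Hypothetical helper: compare a p-value with the significance level
decide <- function(p_value, alpha) {
  if (p_value <= alpha) {
    "Reject the null hypothesis"
  } else {
    "Fail to reject the null hypothesis"
  }
}

# Paired 2012 vs 2016 test: p-value < 2.2e-16, alpha = 0.001
decision_12_16 <- decide(2.2e-16, alpha = 0.001)

# Paired 2008 vs 2012 test with alternative = "less": p-value = 1, alpha = 0.05
decision_08_12 <- decide(1, alpha = 0.05)

c(decision_12_16, decision_08_12)
```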

### Compare the paired t-test to an (inappropriate) unpaired test on the same data.

**How does the p-value change?**

```{r}
# Conduct an unpaired (Welch) t-test on dem_percent_12 and dem_percent_16
test_results <- t.test(x=sample_dem_data$dem_percent_12, y=sample_dem_data$dem_percent_16, alternative = "greater", mu = 0)
test_results
```

**The p-value from the unpaired test is greater than the p-value from the paired test**

### END
