US Democrat presidents dataset

Here’s a dataset of US presidential elections. Each row represents a presidential election at the county level. The variables in the dataset are the US state, the county within that state, and the percentage of votes that went to the Democrat candidate in 2008, 2012, and in 2016.

# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(fst)
library(tibble)
library(readr)
dem_votes_potus_08_16 <- read_fst(uselect_path)
dem_votes_potus_08_16
colnames(dem_votes_potus_08_16)
[1] "state"             "county"            "FIPS"              "dem_cand_votes_08" "dem_percent_08"    "dem_cand_votes_12" "dem_percent_12"   
[8] "dem_cand_votes_16" "dem_percent_16"   

Hypotheses

One question is whether the percentage of votes given to the Democratic candidate was lower in 2008 compared to 2012. To test this, we form the hypotheses. As before, the null hypothesis is that our hunch is wrong, and that the population parameters are the same in each year group. The alternative hypothesis is that the parameter in 2008 was lower than in 2012. I’m setting a significance level of 0.05.

One feature of this dataset is that the 2008 votes and the 2012 votes are paired, since they both refer to the same county. That is, the 2008 and 2012 values aren’t independent from each other. Some voting patterns may occur due to county-level demographics and local politics. We want to capture this pairing in our model.

\[ H_{0}: \mu_{2008} \ - \mu_{2012} = 0 \\ H_{A}: \mu_{2008} \ - \mu_{2012} < 0 \]

\[ Set\ \alpha = 0.05: \ Significance \ level \]

From two samples to one

For paired analyses, rather than considering the two variables separately, we consider a single variable of the difference. In this histogram of the difference most values are between minus ten and ten, with a few outliers.

sample_data <- dem_votes_potus_08_16 %>%
  mutate(diff = dem_percent_08 - dem_percent_12)
ggplot(sample_data, aes(x = diff)) +
  geom_histogram(binwidth = 1)

head(sample_data$diff, 30)
 [1] -2.23387788  1.20418692 -0.86343547  1.94779306 -1.56068097  2.19284462  0.46573839 -1.95303233 -0.15685772  1.09746821 -1.23219534 -0.65191061  1.07898478
[14]  2.90689306 -1.40145860  5.98220437  2.10897226  1.89697848  3.26170252  0.98114752 -0.35486217  3.46370218  3.06243598  1.50691738  4.39727218  2.72474816
[27] -0.09605252  2.30883794  2.32793299  3.61597762

Calculate sample statistics of the difference

The sample mean, x-bar, is calculated on this difference. It is approximately \(3.06\).

xbar_diff <- sample_data %>%
  summarise(xbar_diff = mean(diff)) %>%
  pull(xbar_diff)

xbar_diff
[1] 3.059585

Revised hypotheses

We can restate the hypotheses in terms of the single population mean, \(\mu_{diff}\) , being equal to or less than zero. The test statistic, \(t\), has a slightly simpler equation compared to the two sample case. We have one statistic, so the number of degrees of freedom is the number of rows in the sample minus one.

Old hypotheses

\[ H_{0}: \mu_{2008} \ - \mu_{2012} = 0 \\ H_{A}: \mu_{2008} \ - \mu_{2012} < 0 \]

New hypothesis

\[ H_{0}: \mu_{diff} = 0 \\ H_{A}: \mu_{diff} < 0 \]

Calculating the p-value

To calculate the test statistic, we need the number of rows in the dataset, 500. And we need the standard deviation of the differences. We already know \(\bar{x}\) , the mean of the differences. Assuming the null hypothesis is true means \(\mu_{diff}\) is zero.

\[ t = \frac{\bar{x}_{diff} \ - \mu_{diff}}{\sqrt{\frac{s^2_{diff}}{n_{diff}}}} \]

n_diff <- nrow(sample_data)
n_diff
[1] 500
s_diff <- sample_data %>%
  summarise(sd_diff = sd(diff)) %>%
  pull(sd_diff)

s_diff
[1] 3.388767
t_stat <- (xbar_diff - 0) / sqrt(s_diff ^ 2 / n_diff)
t_stat
[1] 20.18859
degrees_of_freedom <- n_diff - 1

p_value <- pt(t_stat, df = degrees_of_freedom)
p_value
[1] 1

We now have everything we need to plug into the equation to calculate \(t\). It’s \(20.19\). The degrees of freedom are one less than \(n_{diff}\) at \(499\). Finally, we transform t with the t-distribution CDF. The p-value is approximately \(1\), far above the significance level of \(0.05\). That means we fail to reject the null hypothesis: the data provide no evidence that the Democratic candidate got a smaller percentage of the vote in 2008 compared to 2012. In fact, the positive sample mean points in the opposite direction.

Testing differences between two means using t.test()

That was a lot of calculating. Fortunately, there’s an easier way using t.test(). It works with vectors, so the first argument is the vector of differences. The type of alternative hypothesis can be two-sided, less, or greater. Finally, you specify the value of \(\mu_{diff}\) from the null hypothesis. Zero is the default, so strictly speaking we didn’t need to specify it. Here’s the output. You should recognize the value of the test statistic and the degrees of freedom, as well as \(\bar{x}\) on the last line. The p-value is reported as 1, the same number we calculated before.

t.test(
  # Vector of differences
  sample_data$diff,
  # Choose between "two.sided", "less", "greater"
  alternative = "less",
  # Null hypothesis population parameter
  mu = 0
  )

    One Sample t-test

data:  sample_data$diff
t = 20.189, df = 499, p-value = 1
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
     -Inf 3.309327
sample estimates:
mean of x 
 3.059585 

t.test() with paired = TRUE

There’s a variation of t.test() for paired data that requires even less work. Rather than calculating the difference between the two paired variables, you can just pass them both directly to t.test() and set paired to TRUE. Notice that all the numbers are the same.

t.test(sample_data$dem_percent_08, sample_data$dem_percent_12, 
       alternative = "less", mu = 0, paired = TRUE)

    Paired t-test

data:  sample_data$dem_percent_08 and sample_data$dem_percent_12
t = 20.189, df = 499, p-value = 1
alternative hypothesis: true mean difference is less than 0
95 percent confidence interval:
     -Inf 3.309327
sample estimates:
mean difference 
       3.059585 

Unpaired t.test()

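If we don’t set paired = TRUE and instead perform an unpaired t-test, then the numbers change. The test statistic is much closer to zero (3.36 instead of 20.19), there are more degrees of freedom (994.6 instead of 499), and the p-value changes. Ignoring the pairing inflates the standard error, so when a real effect exists in the hypothesized direction, performing an unpaired t-test on paired data increases the chance of a false negative error.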

t.test(x = sample_data$dem_percent_08, y = sample_data$dem_percent_12,
       alternative = "less", mu = 0)

    Welch Two Sample t-test

data:  sample_data$dem_percent_08 and sample_data$dem_percent_12
t = 3.3584, df = 994.6, p-value = 0.9996
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 4.559485
sample estimates:
mean of x mean of y 
 41.84253  38.78294 

Visualizing the difference

Before you start running hypothesis tests, it’s a great idea to perform some exploratory data analysis. That is, calculating summary statistics and visualizing distributions.

Here, we’ll look at the percentage of county-level votes for the Democratic candidate in 2012 and 2016, taken from dem_votes_potus_08_16. Since the counties are the same in both years, these samples are paired. The columns containing the samples are dem_percent_12 and dem_percent_16.

# Calculate the differences from 2012 to 2016
sample_dem_data <- dem_votes_potus_08_16 %>%
  select(state, county, dem_percent_12, dem_percent_16) %>%
  mutate(diff = dem_percent_12 - dem_percent_16)

sample_dem_data
# Find mean & standard deviations of differences
diff_stats <- sample_dem_data %>%
  summarise(xbar_diff = mean(diff), s_diff = sd(diff))

# See the results
diff_stats
# Using sample_dem_data, plot diff as a histogram
  ggplot(sample_dem_data, aes(diff)) + 
    geom_histogram(binwidth = 1)

Using t.test()

Manually calculating test statistics and transforming them with a CDF to get a p-value is a lot of effort to do every time we need to compare two sample means. The comparison of two sample means is called a t-test, and R has a t.test() function to accomplish it. This function provides some flexibility in how you perform the test.

Now, we’ll explore the difference between the proportion of county-level votes for the Democratic candidate in 2012 and 2016.

# Conduct a t-test on diff
test_results <- t.test(sample_dem_data$diff, alternative = "greater", mu = 0)

# See the results
test_results

    One Sample t-test

data:  sample_dem_data$diff
t = 30.298, df = 499, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 6.45787     Inf
sample estimates:
mean of x 
 6.829313 
# Conduct a paired t-test on dem_percent_12 and dem_percent_16
test_results <- t.test(sample_dem_data$dem_percent_12, sample_dem_data$dem_percent_16, alternative = "greater", mu = 0, paired = TRUE)

# See the results
test_results

    Paired t-test

data:  sample_dem_data$dem_percent_12 and sample_dem_data$dem_percent_16
t = 30.298, df = 499, p-value < 2.2e-16
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 6.45787     Inf
sample estimates:
mean difference 
       6.829313 

What is the correct decision from the t-test assuming \(\alpha \ = 0.001\) ?

Reject the null hypothesis

Compare the paired t-test to an (inappropriate) unpaired test on the same data.

How does the p-value change?

# Conduct an unpaired (Welch) t-test on dem_percent_12 and dem_percent_16
test_results <- t.test(x=sample_dem_data$dem_percent_12, y=sample_dem_data$dem_percent_16, alternative = "greater", mu = 0)
test_results

    Welch Two Sample t-test

data:  sample_dem_data$dem_percent_12 and sample_dem_data$dem_percent_16
t = 7.1816, df = 997.19, p-value = 6.732e-13
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 5.263684      Inf
sample estimates:
mean of x mean of y 
 38.78294  31.95363 

The p-value from the unpaired test is greater than the p-value from the paired test.

END

---
title: "Paired t-tests"
output: html_notebook
date: 2024-03-18
author: Juan Fernando Mosquera Araujo
---

### **US Democrat presidents dataset**

Here's a dataset of US presidential elections. Each row represents a presidential election at the county level. The variables in the dataset are the US state, the county within that state, and the percentage of votes that went to the Democrat candidate in 2008, 2012, and in 2016.

```{r}
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(fst)
library(tibble)
library(readr)
```

```{r echo=FALSE}
uselect_path <- "C:/Users/JuanFer Mosquera/Documents/datasets/dem_county_pres_joined.fst"
```

```{r}
dem_votes_potus_08_16 <- read_fst(uselect_path)
dem_votes_potus_08_16
colnames(dem_votes_potus_08_16)
```

### **Hypotheses**

One question is whether the percentage of votes given to the Democratic candidate **was lower in 2008 compared to 2012**. To test this, we form the hypotheses. As before, **the null hypothesis is that our hunch is wrong**, and that the population parameters are the same in each year group. **The alternative hypothesis is that the parameter in 2008 was lower than in 2012**. I'm setting a significance level of **0.05**.

One feature of this dataset is that **the 2008 votes and the 2012 votes are paired**, since they both refer to the same county. That is, the 2008 and 2012 values aren't independent from each other. Some voting patterns may occur due to county-level demographics and local politics. We want to capture this pairing in our model.

$$
H_{0}: \mu_{2008} \ - \mu_{2012} = 0
\\ H_{A}: \mu_{2008} \ - \mu_{2012} < 0
$$

$$
Set\ \alpha = 0.05: \ Significance \ level
$$

### **From two samples to one**

For paired analyses, rather than considering the two variables separately, we consider a single variable of the difference. In this histogram of the difference most values are between minus ten and ten, with a few outliers.

```{r}
sample_data <- dem_votes_potus_08_16 %>%
  mutate(diff = dem_percent_08 - dem_percent_12)
```

```{r}
ggplot(sample_data, aes(x = diff)) +
  geom_histogram(binwidth = 1)
  
```

```{r}
head(sample_data$diff, 30)
```

### **Calculate sample statistics of the difference**

The sample mean, `x-bar`, is calculated on this difference. It is approximately $3.06$.

```{r}
xbar_diff <- sample_data %>%
  summarise(xbar_diff = mean(diff)) %>%
  pull(xbar_diff)

xbar_diff
```

### **Revised hypotheses**

We can restate the hypotheses in terms of the single population mean, $\mu_{diff}$ , being **equal to or less than zero**. The test statistic, $t$, has a slightly simpler equation compared to the two sample case. We have one statistic, so the number of degrees of freedom is the number of rows in the sample minus one.

#### **Old hypotheses**

$$
H_{0}: \mu_{2008} \ - \mu_{2012} = 0
\\ H_{A}: \mu_{2008} \ - \mu_{2012} < 0
$$

#### **New hypothesis**

$$
H_{0}: \mu_{diff} = 0 \\
H_{A}: \mu_{diff} < 0
$$

### **Calculating the p-value**

To calculate the test statistic, we need the number of rows in the dataset, 500. And we need the standard deviation of the differences. We already know $\bar{x}$ , the mean of the differences. Assuming the null hypothesis is true means $\mu_{diff}$ is zero.

$$
t = \frac{\bar{x}_{diff} \ - \mu_{diff}}{\sqrt{\frac{s^2_{diff}}{n_{diff}}}}
$$

```{r}
n_diff <- nrow(sample_data)
n_diff
```

```{r}
s_diff <- sample_data %>%
  summarise(sd_diff = sd(diff)) %>%
  pull(sd_diff)

s_diff
```

```{r}
t_stat <- (xbar_diff - 0) / sqrt(s_diff ^ 2 / n_diff)
t_stat
```

```{r}
degrees_of_freedom <- n_diff - 1

p_value <- pt(t_stat, df = degrees_of_freedom)
p_value
```

We now have everything we need to plug into the equation to calculate $t$. It's $20.19$. The degrees of freedom are one less than $n_{diff}$ at $499$. Finally, we transform t with the t-distribution CDF. The p-value is approximately $1$, far above the significance level of $0.05$. That means we fail to reject the null hypothesis: the data provide no evidence that the Democratic candidate got a smaller percentage of the vote in 2008 compared to 2012. In fact, the positive sample mean points in the opposite direction.
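
As a quick check of the arithmetic, the values printed by the chunks above can be plugged into the formula directly (rounded as shown; $\mu_{diff} = 0$ under the null hypothesis):

$$
t = \frac{3.059585 - 0}{\sqrt{\frac{3.388767^2}{500}}} \approx \frac{3.059585}{0.151550} \approx 20.19
$$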

### **Testing differences between two means using t.test()**

That was a lot of calculating. Fortunately, there's an easier way using `t.test()`. It works with vectors, so the first argument is the vector of differences. The **type of alternative hypothesis can be two-sided, less, or greater**. Finally, you specify the value of $\mu_{diff}$ from the null hypothesis. Zero is the default, so strictly speaking we didn't need to specify it. Here's the output. You should recognize the value of the test statistic and the degrees of freedom, as well as $\bar{x}$ on the last line. The **p-value** is reported as **1**, the same number we calculated before.

```{r}
t.test(
  # Vector of differences
  sample_data$diff,
  # Choose between "two.sided", "less", "greater"
  alternative = "less",
  # Null hypothesis population parameter
  mu = 0
  )
```
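
As a rough sketch of how the `alternative` argument relates to the tails of the t-distribution, the chunk below recomputes the p-value by hand for each option. The names `t_obs`, `df_obs`, and the `p_*` variables are illustrative assumptions; the two values are copied from the manual calculation above.

```{r}
# Illustrative values copied from the manual calculation above
t_obs <- 20.18859
df_obs <- 499

# alternative = "less": area in the lower tail
p_less <- pt(t_obs, df = df_obs)

# alternative = "greater": area in the upper tail
p_greater <- pt(t_obs, df = df_obs, lower.tail = FALSE)

# alternative = "two.sided": twice the area beyond |t|
p_two_sided <- 2 * pt(abs(t_obs), df = df_obs, lower.tail = FALSE)

c(less = p_less, greater = p_greater, two_sided = p_two_sided)
```

Because the observed statistic is large and positive, almost all of the probability mass lies below it, which is why the "less" alternative gives a p-value close to 1.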

### **`t.test()` with paired = TRUE**

There's a variation of `t.test()` for paired data that requires even less work. Rather than calculating the difference between the two paired variables, you can just pass them both directly to `t.test()` and set paired to TRUE. Notice that all the numbers are the same.

```{r}
t.test(sample_data$dem_percent_08, sample_data$dem_percent_12, 
       alternative = "less", mu = 0, paired = TRUE)
```
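
To see why the numbers match, here is a minimal sketch on simulated data (not the election dataset) checking that a paired t-test gives the same result as a one-sample t-test on the differences. The vectors `toy_before` and `toy_after`, the seed, and the distribution parameters are all made up for illustration.

```{r}
set.seed(42)  # arbitrary seed so the toy data are reproducible

# Hypothetical paired measurements on the same 50 units
toy_before <- rnorm(50, mean = 40, sd = 10)
toy_after <- toy_before + rnorm(50, mean = 1, sd = 2)

# One-sample test on the differences
one_sample <- t.test(toy_before - toy_after, alternative = "less", mu = 0)

# Paired test on the two original vectors
paired <- t.test(toy_before, toy_after, alternative = "less", mu = 0, paired = TRUE)

# Both comparisons should return TRUE
all.equal(unname(one_sample$statistic), unname(paired$statistic))
all.equal(one_sample$p.value, paired$p.value)
```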

### **Unpaired `t.test()`**

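If we don't set `paired = TRUE` and instead perform an unpaired t-test, then the numbers change. The test statistic is much closer to zero (3.36 instead of 20.19), there are more degrees of freedom (994.6 instead of 499), and the p-value changes. Ignoring the pairing inflates the standard error, so when a real effect exists in the hypothesized direction, **performing an unpaired t-test on paired data increases the chance of a false negative error**.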

```{r}
t.test(x = sample_data$dem_percent_08, y = sample_data$dem_percent_12,
       alternative = "less", mu = 0)
```
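
The false negative risk can be illustrated with a small simulation. This is a sketch on made-up data, not the election dataset: `county_level`, `sim_2008`, and `sim_2012` are hypothetical names, and the effect size and noise levels are assumptions chosen so that county-to-county variation is much larger than the year-to-year shift.

```{r}
set.seed(2024)  # arbitrary seed for reproducibility
n_counties <- 500

# Each simulated county has its own baseline, which both years share
county_level <- rnorm(n_counties, mean = 40, sd = 10)
sim_2008 <- county_level + rnorm(n_counties, sd = 2)
sim_2012 <- county_level + 1 + rnorm(n_counties, sd = 2)  # true shift of +1

# The paired test removes the shared baseline from the comparison
p_paired <- t.test(sim_2008, sim_2012, alternative = "less", paired = TRUE)$p.value

# The unpaired test keeps the baseline variation in the standard error
p_unpaired <- t.test(sim_2008, sim_2012, alternative = "less")$p.value

c(paired = p_paired, unpaired = p_unpaired)
```

With settings like these, the paired p-value should come out far smaller than the unpaired one, because the shared county baseline cancels in the differences but not in the two-sample standard error.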

### **Visualizing the difference**

Before you start running hypothesis tests, it's a great idea to perform some exploratory data analysis. That is, calculating summary statistics and visualizing distributions.

Here, we'll look at the percentage of county-level votes for the Democratic candidate in 2012 and 2016, taken from `dem_votes_potus_08_16`. Since the counties are the same in both years, these samples are paired. The columns containing the samples are `dem_percent_12` and `dem_percent_16`.

```{r}
# Calculate the differences from 2012 to 2016
sample_dem_data <- dem_votes_potus_08_16 %>%
  select(state, county, dem_percent_12, dem_percent_16) %>%
  mutate(diff = dem_percent_12 - dem_percent_16)

sample_dem_data
```

```{r}
# Find mean & standard deviations of differences
diff_stats <- sample_dem_data %>%
  summarise(xbar_diff = mean(diff), s_diff = sd(diff))

# See the results
diff_stats
```

```{r}
# Using sample_dem_data, plot diff as a histogram
  ggplot(sample_dem_data, aes(diff)) + 
    geom_histogram(binwidth = 1)
```

### **Using t.test()**

Manually calculating test statistics and transforming them with a CDF to get a **p-value** is a lot of effort to do every time we need to compare two sample means. The comparison of two sample means is called a t-test, and R has a `t.test()` function to accomplish it. This function provides some flexibility in how you perform the test.

Now, we'll explore the difference between the proportion of county-level votes for the Democratic candidate in 2012 and 2016.

```{r}
# Conduct a t-test on diff
test_results <- t.test(sample_dem_data$diff, alternative = "greater", mu = 0)

# See the results
test_results
```

```{r}
# Conduct a paired t-test on dem_percent_12 and dem_percent_16
test_results <- t.test(sample_dem_data$dem_percent_12, sample_dem_data$dem_percent_16, alternative = "greater", mu = 0, paired = TRUE)

# See the results
test_results
```

### **What is the correct decision from the t-test assuming** $\alpha \ = 0.001$ ?

**Reject the null hypothesis**
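
The decision can also be read off programmatically. A minimal sketch, assuming only that `t.test()` returns an object whose `p.value` element holds the p-value; the toy vector `toy_diff` and the names `alpha`, `toy_test`, and `decision` are illustrative and not part of the original analysis.

```{r}
set.seed(1)  # arbitrary seed for the toy data
toy_diff <- rnorm(500, mean = 7, sd = 5)  # hypothetical paired differences

alpha <- 0.001
toy_test <- t.test(toy_diff, alternative = "greater", mu = 0)

# Reject the null hypothesis when the p-value is at or below alpha
decision <- if (toy_test$p.value <= alpha) "Reject H0" else "Fail to reject H0"
decision
```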

### Compare the paired t-test to an (inappropriate) unpaired test on the same data.

**How does the p-value change?**

```{r}
# Conduct an unpaired (Welch) t-test on dem_percent_12 and dem_percent_16
test_results <- t.test(x=sample_dem_data$dem_percent_12, y=sample_dem_data$dem_percent_16, alternative = "greater", mu = 0)
test_results
```

**The p-value from the unpaired test is greater than the p-value from the paired test.**

### END
