Estimation and Statistical Inference

Getting Started

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

d <- read_csv("isbell_crowther_2023_gen_Z_data.csv")

Rows: 216 Columns: 296
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (27): ResponseId, wave, Generation, device, speaker_A_L1, speaker_B_L1,...
dbl (268): USA_Correct, USA_Friendly, USA_Pleasant, USA_Familiar, China_Corr...
lgl   (1): gender_4_TEXT

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Estimating a Sample Mean

We’ll start with the variable USA_Correct, which is a rating of how correct participants judged the English of people from the United States to be. Here’s a quick histogram to get a sense of the sample distribution.

hist(d$USA_Correct)

So first, we might want to question how useful a summary the mean is, based on this distribution. But it is a large-ish sample, so we can fairly safely assume that the sampling distribution of the mean (i.e., the distribution of means we would get from many samples of this size) is approximately normal. Let’s proceed.

As we’ve practiced, estimating means is easy in R.

mean(d$USA_Correct, na.rm = T) #how correct is US English?

[1] 8.546729

But we don’t know how precise that mean is without accounting for the sample size and variation in responses.

m <- mean(d$USA_Correct, na.rm = T)
sd <- sd(d$USA_Correct, na.rm = T)
n <- length(na.omit(d$USA_Correct))
se <- sd/sqrt(n)

m

[1] 8.546729

sd

[1] 1.685387

n

[1] 214

se

[1] 0.1152107

So there are 214 observations of USA_Correct, with a mean of 8.55, an SD of 1.69, and an SE of 0.12. In other words, across repeated samples of this size, we would expect about 68% of sample means to fall within +/- 0.12 points of the ‘true’ population mean. This appears fairly precise!
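To make the "repeated samples" idea concrete, here is a minimal simulation sketch. It does not use the survey data: the population mean and SD below are made-up values chosen to resemble USA_Correct, and the ratings are assumed to be normally distributed, which real ratings on a bounded scale are not.

```r
set.seed(42)                      # make the simulation reproducible

pop_mean <- 8.5                   # assumed 'true' population mean (made up)
pop_sd   <- 1.7                   # assumed population SD (made up)
n        <- 214                   # same sample size as USA_Correct

# Draw 5,000 samples of size n and keep each sample's mean
sample_means <- replicate(5000, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

theoretical_se <- pop_sd / sqrt(n)

sd(sample_means)                  # spread of the simulated sample means
theoretical_se                    # SD / sqrt(n), should be very close

# Share of sample means landing within one SE of the population mean
mean(abs(sample_means - pop_mean) < theoretical_se)
```

The last line should come out at roughly .68, which is where the "68%" in the interpretation of the SE comes from.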

Visualizing a Sample Mean with a CI

Using the very simple example above, we can construct some 95% confidence intervals and create a simple plot.

First, the CI:

ci <- se*1.96 #1.96 corresponds to a 95% CI

On a side note, you can get the 1.96 value from the qnorm() function. You can also use it to find values for other CIs, e.g., 90%. The trick is that the probability left outside the CI is split into two tails - above and below - so you enter 1 minus half of that leftover probability.

qnorm(.975) # 95% CI: 5% is left over, 2.5% per tail, so 1 - .025 = .975

[1] 1.959964

qnorm(.95) # 90% CI: 10% is left over, 5% per tail, so 1 - .05 = .95

[1] 1.644854
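If you would rather not do the tail arithmetic by hand each time, the same logic can be wrapped in a small function. This is just a sketch; z_multiplier is a made-up helper name, not something from base R or the tidyverse.

```r
# Hypothetical helper: z multiplier for a two-sided CI at a given confidence level
z_multiplier <- function(level) {
  alpha <- 1 - level          # total probability left outside the interval
  qnorm(1 - alpha / 2)        # cut-off that leaves alpha/2 in the upper tail
}

z_multiplier(0.95)            # 1.959964, same as qnorm(.975)
z_multiplier(0.90)            # 1.644854, same as qnorm(.95)
z_multiplier(0.99)            # a wider interval needs a larger multiplier
```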

So now we can just add and subtract that margin (the ci value) to/from the mean. Numerically:

m - ci

[1] 8.320916

m + ci

[1] 8.772542

So the 95% CI is [8.32, 8.77]. That spans less than half a point!

Wait a second - z or t?

What we’ve looked at so far involves using values from a z distribution to construct CIs. However, the Lakens chapter on CIs mentioned that we should be using values from a t distribution as our multiplier when calculating CIs. Lakens is not wrong, but when the sample is large enough, the difference between the resulting CIs is very, very small.

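This function gets the t value for USA_Correct. (Strictly speaking, the degrees of freedom should be n - 1 = 213 rather than 214, but with a sample this large the multiplier still rounds to 1.97.)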

qt(.975, 214)

[1] 1.971111

Note that the value, 1.97, is hardly different from the 1.96 from the z distribution. This is because 214 people is nothing to sneeze at in terms of sample size!
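To see how quickly the t multiplier approaches the z multiplier as the sample grows, here is a quick sketch. The degrees-of-freedom values are arbitrary illustrations, apart from 213, which is n - 1 for USA_Correct.

```r
# Arbitrary degrees of freedom, from tiny samples to very large ones
dfs <- c(5, 10, 30, 100, 213, 1000)

multipliers <- data.frame(
  df     = dfs,
  t_mult = qt(.975, dfs),     # t multiplier for a 95% CI at each df
  z_mult = qnorm(.975)        # z multiplier, the same for every row
)

multipliers
```

With only a handful of observations the t multiplier is noticeably larger than 1.96, so the choice matters much more for small samples than it does here.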

Now we’ll do this with a data-frame-based approach so we can make a nice plot.

usa_correct_summary <- d %>% 
  summarise(m = mean(USA_Correct, na.rm = T),
            sd = sd(USA_Correct, na.rm = T),
            n = length(na.omit(USA_Correct)),
            se = sd/sqrt(n),
            ci = se*1.96)

ggplot(usa_correct_summary, aes(x = "USA", y = m))+
  geom_pointrange(aes(ymin = m - ci, ymax = m + ci))+
  scale_y_continuous(limits = c(1, 10))+
  theme_bw()

With this CI, we can be fairly confident that the ‘true’ mean in the population is not 1, or 2, or even 5, 6, or 7. This is a useful estimation!

Comparing CIs

We can also use CIs for informal statistical comparisons. In this dataset, all participants provided ratings of the English spoken by people from several countries. Let’s grab this subset of data and pivot it longer so we can produce some summary statistics more efficiently.

(This is part of a common workflow - pivot, group_by, and summarise)

countries <- d %>% select(ResponseId:Vietnam_Familiar) %>%
  select(ResponseId, contains("_Correct")) %>%
  pivot_longer(USA_Correct:Vietnam_Correct, names_to = "country", 
               values_to = "correct")
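If the pivot step feels abstract, here is a toy sketch of the same pivot, group_by, and summarise pattern. The three participants and their ratings are made up for illustration; only the column naming mirrors the real data.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per participant, one column per country
toy <- tibble(
  ResponseId    = c("R1", "R2", "R3"),
  USA_Correct   = c(9, 8, 10),
  Spain_Correct = c(7, NA, 6)
)

# Pivot longer: each participant now appears once per country
toy_long <- toy %>%
  pivot_longer(USA_Correct:Spain_Correct, names_to = "country",
               values_to = "correct")

toy_long

# Summarise within each country, ignoring the missing rating
toy_summary <- toy_long %>% group_by(country) %>%
  summarise(m = mean(correct, na.rm = TRUE),
            n = sum(!is.na(correct)))

toy_summary
```

Once the data are in long format, a single summarise() call covers every country, however many columns the wide data had.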

countries_summary <- countries %>% group_by(country) %>%
  summarise(m = mean(correct, na.rm = T),
            sd = sd(correct, na.rm = T),
            n = length(na.omit(correct)),
            se = sd/sqrt(n),
            ci = se*1.96,
            lower = m - ci,
            upper = m + ci)

countries_summary

# A tibble: 8 × 8
  country                  m    sd     n    se    ci lower upper
  <chr>                <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 Canada_Correct        8.64  1.51   214 0.103 0.202  8.44  8.85
2 China_Correct         5.99  2.02   216 0.138 0.270  5.72  6.26
3 Japan_Correct         6.16  2.08   214 0.142 0.279  5.88  6.44
4 New Zealand_Correct   7.87  1.85   213 0.127 0.248  7.63  8.12
5 Saudi_Arabia_Correct  5.96  2.07   211 0.142 0.279  5.68  6.24
6 Spain_Correct         6.75  1.69   214 0.115 0.226  6.53  6.98
7 USA_Correct           8.55  1.69   214 0.115 0.226  8.32  8.77
8 Vietnam_Correct       5.91  2.05   211 0.141 0.276  5.63  6.18

We can see some fairly obvious differences (and similarities) in the means. The numbers are a lot to look at though - let’s plot.

Quickly cleaning up the country variable

countries_summary <- mutate(countries_summary,
                            country = gsub("_Correct", "", country))

Now plot

countries_summary %>% ggplot(aes(x = country, y = m, color = country))+
  geom_pointrange(aes(ymin = lower, ymax = upper))+
  scale_y_continuous(limits = c(5,10), breaks = 5:10)+
  theme_bw()

Based on the CIs, what can you tell about the mean correctness ratings for these countries?

More Formally Testing Differences

With confidence intervals, to more formally/confidently make inferences about differences, you need to actually compute differences. Let’s try it out by computing within-person differences in correctness ratings for Spanish English and Vietnamese English:

diffs <- d %>% select(ResponseId, Spain_Correct, Vietnam_Correct)

diffs <- diffs %>% mutate(difference = Spain_Correct - Vietnam_Correct) %>%
  summarise(m = mean(difference, na.rm = T),
            sd = sd(difference, na.rm = T),
            n = length(na.omit(difference)),
            se = sd/sqrt(n),
            ci = se*qt(.975, n - 1), # t multiplier with n - 1 degrees of freedom
            lower = m - ci,
            upper = m + ci)

diffs

# A tibble: 1 × 7
      m    sd     n    se    ci lower upper
  <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.839  1.49   211 0.103 0.203 0.636  1.04

Here we can see the mean within-person difference in correctness ratings between Spanish English and Vietnamese English is .84, SD = 1.49, SE = .10, with a 95% CI of [.64, 1.04]. So this is a mean difference greater than 0 (indicating higher ratings for Spanish English, in this case) with a CI that does not cross/include 0 (indicating a statistically significant difference). In other words, we can reject the null hypothesis (no difference, or a difference of 0) and proceed with the understanding that a difference in the population (i.e., all Gen-Z undergraduates in the US) is likely.

Computing 95% CIs is useful but it’s also nice to have a quicker way to compare means. We’ll preview a simple analysis called a t-test that is built for that purpose.

t.test(d$Spain_Correct, d$Vietnam_Correct, paired = T)


    Paired t-test

data:  d$Spain_Correct and d$Vietnam_Correct
t = 8.1579, df = 210, p-value = 3.09e-14
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 0.636154 1.041571
sample estimates:
mean difference 
      0.8388626

We are using the paired version of the t-test because these ratings came from the same people. Note that the 95% CI it reports matches the one we computed from the differences. The p-value is very, very small - that e-14 bit means move the decimal point 14 places to the left. The probability of observing a difference at least this large, if the null were really true, is extremely small, so we can safely reject the null.
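To show that a paired t-test is doing the same work as a CI built on within-person differences, here is a small sketch with made-up ratings from eight hypothetical participants (not the survey data). It uses a t multiplier with n - 1 degrees of freedom, which is what t.test() uses for a paired test.

```r
# Made-up paired ratings from eight hypothetical participants
spain   <- c(7, 8, 6, 9, 7, 5, 8, 7)
vietnam <- c(6, 6, 6, 8, 5, 5, 7, 4)

difference <- spain - vietnam
n  <- length(difference)
m  <- mean(difference)
se <- sd(difference) / sqrt(n)
ci <- se * qt(.975, n - 1)        # t multiplier with n - 1 degrees of freedom

manual_ci <- c(lower = m - ci, upper = m + ci)
manual_ci

# The paired t-test reports the same interval
tt <- t.test(spain, vietnam, paired = TRUE)
tt$conf.int
```

The two intervals agree, so the t-test is essentially a shortcut for the difference-based CI plus a p-value.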

If we want to compare means that are independent, coming from different people, we could try some of the other countries, which were part of randomly assigned blocks.

t.test(d$South_Korea_Correct, d$Philippines_Correct)


    Welch Two Sample t-test

data:  d$South_Korea_Correct and d$Philippines_Correct
t = -1.5054, df = 77.967, p-value = 0.1363
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.6036333  0.2226809
sample estimates:
mean of x mean of y 
 6.309524  7.000000

The t value is negative, indicating the mean of the first argument is smaller than the second. Here, we can see that the p-value is not so small (.14). This is greater than our typical threshold of .05, so we fail to reject the null hypothesis that there is no difference in how people view the correctness of South Korean and Filipino English. Note that the confidence interval here also crosses/includes 0 (i.e., one end is negative, and the other is positive).
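The link between the p-value and the CI is not a coincidence: for a two-sided test at the .05 level, p is above .05 exactly when the 95% CI includes 0. Here is a minimal sketch with two made-up independent groups (again, not the survey data) that pulls both pieces out of the t.test() result.

```r
# Made-up ratings from two independent (hypothetical) groups
group_a <- c(6, 7, 5, 8, 6, 7, 4, 7)
group_b <- c(7, 8, 6, 7, 9, 6, 8, 7)

res <- t.test(group_a, group_b)   # Welch test by default, as above

res$p.value                       # the p-value on its own
res$conf.int                      # the 95% CI for the difference in means

# Does the CI include 0? This always agrees with p > .05
ci_includes_zero <- res$conf.int[1] < 0 & res$conf.int[2] > 0
ci_includes_zero
res$p.value > .05
```

Storing the result in an object like res is handy when you want to report the numbers in text or reuse the CI in a plot.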