Week 2 - R Lab

Submission Instructions

Submit a rendered R Markdown/Quarto file that includes…
- All code used in the lab
- Written answers to all questions

Loading R Packages

Quarto and R Markdown documents are built from code chunks: code goes in the grey boxes, and written answers to the questions go in the regular text outside them.

library(tidyverse) #General Coding

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'lubridate' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rstatix) #Running T-tests

Warning: package 'rstatix' was built under R version 4.5.2


Attaching package: 'rstatix'

The following object is masked from 'package:stats':

    filter

library(haven) #Importing Data into R
library(ggplot2) #Plotting Package

Importing & Filtering Data

wvs <- readRDS(url("https://raw.githubusercontent.com/drCES/course_data/main/wvs.rds")) 

# Keep only Canadian (b_country == 124) and US (b_country == 840) respondents,
# then keep the three variables used in this lab
wvs_us_can <- wvs %>%
  filter(b_country == 124 | b_country == 840) %>%
  select(b_country, b_country_alpha, q50)

dim(wvs_us_can)

[1] 6614    3

Questions to Answer from Above Code Chunk

What is the unit of observation in this dataset (what does each row represent)?

Each row represents one individual survey respondent in either Canada or the US.

How many rows and columns does the dataset have?

There are 6614 rows and 3 columns.

Calculate the Variance and Standard Deviation of Financial Satisfaction by Country

wvs_us_can %>%
  group_by(b_country_alpha) %>%          # one summary row per country
  summarise(
    n = n(),                             # number of rows per country
    mean_sat = mean(q50, na.rm = TRUE),  # mean financial satisfaction
    var_sat = var(q50, na.rm = TRUE),    # sample variance
    sd_sat = sd(q50, na.rm = TRUE))      # sample standard deviation

# A tibble: 2 × 5
  b_country_alpha     n mean_sat var_sat sd_sat
  <chr>           <int>    <dbl>   <dbl>  <dbl>
1 CAN              4018     6.52    4.92   2.22
2 USA              2596     6.09    5.75   2.40
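
As a reminder of what `var()` and `sd()` compute, here is a minimal sketch on a made-up vector. The values in `x` are hypothetical and are not taken from the WVS data.

```r
# Hypothetical satisfaction scores (not from the WVS data)
x <- c(3, 5, 6, 8, 8)

n <- length(x)
var_by_hand <- sum((x - mean(x))^2) / (n - 1)  # sample variance: squared deviations over n - 1
sd_by_hand  <- sqrt(var_by_hand)               # standard deviation is the square root of the variance

all.equal(var_by_hand, var(x))  # the manual result matches var()
all.equal(sd_by_hand, sd(x))    # the manual result matches sd()
```

The same relationship holds in the table above: squaring sd_sat (2.22 and 2.40) gives roughly 4.93 and 5.76, which match var_sat up to rounding.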

Questions

Which country’s residents have higher financial satisfaction when looking at the means?

Canada has the higher mean financial satisfaction (mean_sat of 6.52 versus 6.09 for the US).

Which group shows more variability? What does that mean substantively?

The US group has higher variability (sd_sat of 2.40 versus 2.22), meaning that US respondents were less consistent with one another in how they rated their financial satisfaction.

Confidence Intervals for the Means

Construct a 95% confidence interval for financial satisfaction separately for the United States and Canada. Use a critical value of 1.96 to create the 95% confidence interval.

wvs_us_can %>%
  group_by(b_country_alpha) %>%
  summarise(
    mean_sat = mean(q50, na.rm = TRUE),  # mean financial satisfaction
    sd_sat = sd(q50, na.rm = TRUE),      # standard deviation
    n = n(),                             # rows per country (includes any rows with missing q50)
    se = sd_sat / sqrt(n),               # SE is sd / sqrt(n)
    cv = 1.96,                           # critical value for a 95% CI
    moe = se * cv,                       # margin of error
    ci_lower = mean_sat - moe,           # CI lower is mean - moe
    ci_upper = mean_sat + moe)           # CI upper is mean + moe

# A tibble: 2 × 9
  b_country_alpha mean_sat sd_sat     n     se    cv    moe ci_lower ci_upper
  <chr>              <dbl>  <dbl> <int>  <dbl> <dbl>  <dbl>    <dbl>    <dbl>
1 CAN                 6.52   2.22  4018 0.0350  1.96 0.0686     6.46     6.59
2 USA                 6.09   2.40  2596 0.0471  1.96 0.0922     6.00     6.19
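
One detail worth checking: `n()` counts every row in a group, while `mean()` and `sd()` with `na.rm = TRUE` use only the non-missing values. If q50 has missing responses, the standard error is slightly understated. The sketch below shows one way to count only valid responses. The data frame `toy` and its columns `group` and `score` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: group "A" has one missing response
toy <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B"),
  score = c(4, 6, 8, NA, 5, 7, 9)
)

ci_tbl <- toy %>%
  group_by(group) %>%
  summarise(
    mean_score = mean(score, na.rm = TRUE),
    sd_score   = sd(score, na.rm = TRUE),
    n_rows     = n(),                   # all rows, including the NA
    n_valid    = sum(!is.na(score)),    # rows that actually entered the mean and sd
    se         = sd_score / sqrt(n_valid),
    moe        = 1.96 * se,
    ci_lower   = mean_score - moe,
    ci_upper   = mean_score + moe
  )

ci_tbl
```

When a variable has no missing values, `n_rows` and `n_valid` are identical and the two approaches give the same interval.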

Questions

What does the confidence interval tell us about the population mean?

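A confidence interval gives a range of plausible values for the population mean. If we repeated the sampling many times, about 95% of intervals built this way would contain the true population mean.

Do the confidence intervals for the two countries overlap?

No. The Canadian interval (6.46 to 6.59) lies entirely above the US interval (6.00 to 6.19).

Does overlap (or lack of overlap) suggest a meaningful difference?

The lack of overlap suggests a statistically significant difference in means at the 5% level. Whether a gap of about 0.4 points is substantively meaningful is a separate judgment.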

Topic 2: Hypotheses, T-Tests and P-values

To start, create an empirically testable hypothesis about whether Canadian or American residents are more satisfied with their financial life.

Hypothesis to Test

Write Out the Null and Alternative Hypotheses

Null Hypothesis (Ho): Canadian and American residents do not differ in their satisfaction with their financial life.
Alternative Hypothesis (Ha): Canadian and American residents differ in their satisfaction with their financial life.

Two Independent Samples T-test

Estimate a two-sample t-test assuming the variance is not equal between the two groups.

# Welch two-sample t-test: DV is q50, grouping variable is b_country_alpha, data is wvs_us_can
t.test(q50 ~ b_country_alpha, data = wvs_us_can, var.equal = FALSE)


    Welch Two Sample t-test

data:  q50 by b_country_alpha
t = 7.3038, df = 5171.7, p-value = 3.224e-13
alternative hypothesis: true difference in means between group CAN and group USA is not equal to 0
95 percent confidence interval:
 0.3142825 0.5448975
sample estimates:
mean in group CAN mean in group USA 
         6.523644          6.094054
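
To see what `t.test()` does when `var.equal = FALSE`, the Welch statistic and its degrees of freedom can be computed by hand. This is a minimal sketch on made-up data: the vectors `g1` and `g2` are hypothetical, and the formulas are the standard Welch ones, not code taken from the lab.

```r
# Hypothetical scores for two independent groups
g1 <- c(7, 8, 6, 9, 7, 8)
g2 <- c(5, 6, 7, 4, 6)

# Squared standard error of each group mean
v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)

# Welch t statistic: difference in means over the combined standard error
t_by_hand <- (mean(g1) - mean(g2)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
df_by_hand <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

res <- t.test(g1, g2, var.equal = FALSE)
all.equal(unname(res$statistic), t_by_hand)   # same t as t.test()
all.equal(unname(res$parameter), df_by_hand)  # same df as t.test()
```

Because the degrees of freedom depend on both variances and both sample sizes, they are usually not a whole number, which is why the output above reports df = 5171.7.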

Questions

What is the p-value and what do we learn from it?

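The p-value is 3.224e-13, far below .05. If there were truly no difference between the two countries, a difference in means this large would almost never arise from sampling alone, so the difference between Canadian and US satisfaction is statistically significant.

At α=0.05, do we reject or fail to reject the null hypothesis?

Using an alpha of .05, we reject the null hypothesis.

How does this result relate to the confidence intervals you calculated earlier?

Consistent with the non-overlapping CIs found earlier, the t-test shows a significant difference between Canadian and US financial satisfaction. The test's own 95% CI for the difference in means (0.31 to 0.54) excludes 0, which leads to the same conclusion. The reverse inference is weaker: two 95% CIs can overlap while the difference is still statistically significant, so overlap alone does not show that the means are equal.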
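
A caution when reading overlap: non-overlapping 95% CIs imply a significant difference, but overlapping CIs do not imply the absence of one. The sketch below uses made-up summary statistics (the means and standard errors are hypothetical) and a large-sample z approximation to show a case where the intervals overlap and the difference is still significant at the 5% level.

```r
# Hypothetical summary statistics for two groups
mean_a <- 0; se_a <- 0.3
mean_b <- 1; se_b <- 0.3

# Separate 95% CIs for each mean
ci_a <- mean_a + c(-1, 1) * 1.96 * se_a   # -0.588 to 0.588
ci_b <- mean_b + c(-1, 1) * 1.96 * se_b   #  0.412 to 1.588
intervals_overlap <- ci_a[2] > ci_b[1]    # TRUE: the intervals overlap

# Test of the difference uses the SE of the difference, not the two SEs separately
se_diff <- sqrt(se_a^2 + se_b^2)          # about 0.424
z       <- (mean_b - mean_a) / se_diff    # about 2.36
significant <- abs(z) > 1.96              # TRUE: significant at the 5% level
```

The standard error of the difference is smaller than the sum of the two separate standard errors, which is why comparing intervals by eye is more conservative than the test itself.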

Optional: Create a Boxplot Comparing Financial Satisfaction Between Americans and Canadians

You need to first calculate and save the mean financial satisfaction level for both countries then use the aggregated data frame to produce the plot. Basically, just save the data you created in the confidence interval analysis above as a new data frame then plot those results.

# Save the summary data frame and use it to create the plot
new_df <- wvs_us_can %>%
  group_by(b_country_alpha) %>%
  summarise(
    mean_sat = mean(q50, na.rm = TRUE),  # mean financial satisfaction
    sd_sat = sd(q50, na.rm = TRUE),      # standard deviation
    n = n(),                             # rows per country
    se = sd_sat / sqrt(n),               # SE is sd / sqrt(n)
    cv = 1.96,                           # critical value for a 95% CI
    moe = se * cv,                       # margin of error
    ci_lower = mean_sat - moe,           # CI lower is mean - moe
    ci_upper = mean_sat + moe)           # CI upper is mean + moe

# new_df has one row per country, so each box collapses to a line at the mean
new_df %>%
  ggplot(aes(x = b_country_alpha, y = mean_sat)) +
  geom_boxplot() +
  labs(
    title = "Financial Satisfaction by Country",
    x = "Country",
    y = "Financial Satisfaction") +
  theme_minimal()
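
Because the aggregated data frame holds a single mean per country, a boxplot of it has no spread to display. One alternative, offered here only as a sketch, is to plot each mean as a point with its 95% confidence interval as an error bar. The data frame `ci_df` below is an assumption for illustration: its values are copied by hand (rounded) from the confidence interval table earlier in the lab so that the example runs on its own.

```r
library(ggplot2)

# Rounded values copied from the confidence interval table
ci_df <- data.frame(
  b_country_alpha = c("CAN", "USA"),
  mean_sat = c(6.52, 6.09),
  ci_lower = c(6.46, 6.00),
  ci_upper = c(6.59, 6.19)
)

p <- ggplot(ci_df, aes(x = b_country_alpha, y = mean_sat)) +
  geom_point(size = 3) +                                    # the group mean
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
                width = 0.1) +                              # the 95% CI
  labs(
    title = "Mean Financial Satisfaction by Country (95% CI)",
    x = "Country",
    y = "Financial Satisfaction") +
  theme_minimal()

p
```

To draw a true boxplot of the distribution, the unaggregated `wvs_us_can` data with `y = q50` would be the input instead of the summary table.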

Wrap Up Questions

Why do we use a t-test instead of simply looking at the means for the 2 groups?

T-tests take into account variability and sample size. A simple difference in group means can be the result of sampling variability rather than a true difference between the populations.

How does sample size affect standard error and confidence intervals?

A larger sample size decreases the standard error (SE = sd/sqrt(n)) and therefore narrows the confidence interval.

In your own words, what does a p-value represent?

A p-value is the probability of observing a result at least as extreme as the one in our sample if the null hypothesis were true. A small p-value means the data would be unlikely under the null hypothesis. It is not the probability that the null hypothesis is true.
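
The two-sided p-value reported by the t-test can be recovered from the t statistic and degrees of freedom in the output above. This sketch simply plugs in the rounded values printed by `t.test()`, so the result should agree with the reported p-value only up to rounding.

```r
# Rounded values taken from the Welch t-test output
t_stat <- 7.3038
df     <- 5171.7

# Two-sided p-value: probability of a t at least this far from 0, in either tail,
# if the null hypothesis of no difference were true
p_two_sided <- 2 * pt(-abs(t_stat), df = df)
p_two_sided
```

Doubling the lower-tail probability reflects the two-sided alternative hypothesis, which allows the difference to fall in either direction.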