Submit a rendered R Markdown/Quarto file that includes…
All code used in the lab
Written answers to all questions
Loading R Packages
Quarto and R Markdown organize work into code chunks: your code goes in the grey boxes, and you type your answers to the questions in the regular text around them, like here.
library(tidyverse) #General Coding
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rstatix) #Running T-tests
Warning: package 'rstatix' was built under R version 4.5.2
Attaching package: 'rstatix'
The following object is masked from 'package:stats':
filter
library(haven) #Importing Data into R
library(ggplot2) #Plotting Package
Importing & Filtering Data
wvs <- readRDS(url("https://raw.githubusercontent.com/drCES/course_data/main/wvs.rds"))

#Create the filter using the below codes inside the () to extract only US and Canadian Data
wvs_us_can <- wvs %>%
  filter(b_country == 124 | b_country == 840) %>% #Canada's code is b_country==124 and the USA code is b_country==840
  select(b_country, b_country_alpha, q50)

dim(wvs_us_can)
[1] 6614 3
Questions to Answer from Above Code Chunk
What is the unit of observation in this dataset (what does each row represent)?
Each row represents one individual survey respondent from either Canada or the US.
How many rows and columns does the dataset have?
There are 6614 rows and 3 columns.
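A minimal sketch of the same filter-then-inspect pattern on a small made-up data frame. The `toy` object and all of its values are invented for illustration and are not WVS data; only the country codes 124 and 840 come from the lab.

```r
library(dplyr)

# Hypothetical toy data: one row per respondent (values are made up)
toy <- data.frame(
  b_country = c(124, 840, 484, 124, 840),
  b_country_alpha = c("CAN", "USA", "MEX", "CAN", "USA"),
  q50 = c(7, 5, 6, NA, 8)
)

# Keep only Canada (124) and the USA (840), then keep three columns
toy_us_can <- toy %>%
  filter(b_country %in% c(124, 840)) %>%
  select(b_country, b_country_alpha, q50)

dim(toy_us_can)  # number of rows, then number of columns
```

Here `%in%` is used as a compact equivalent of the two `==` conditions joined by `|`.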
Calculate the Variance and Standard Deviation of Financial Satisfaction by Country
wvs_us_can %>%
  group_by(b_country_alpha) %>% #Use the b_country_alpha variable here
  summarise(n = n(),
            mean_sat = mean(q50, na.rm=T), #Update var_name to q50
            var_sat = var(q50, na.rm=T), #Update var_name to q50
            sd_sat = sd(q50, na.rm=T)) #Update var_name to q50
# A tibble: 2 × 5
b_country_alpha n mean_sat var_sat sd_sat
<chr> <int> <dbl> <dbl> <dbl>
1 CAN 4018 6.52 4.92 2.22
2 USA 2596 6.09 5.75 2.40
Questions
Which country’s residents have higher financial satisfaction when looking at the means?
Canada has the higher mean financial satisfaction (mean_sat of 6.52, versus 6.09 for the US).
Which group shows more variability? What does that mean substantively?
The US group shows more variability (sd_sat of 2.40 versus 2.22 for Canada). Substantively, US responses are more spread out around their mean, so there is less consistency in how US respondents rated their financial satisfaction.
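As a quick check on how the two spread measures relate, the standard deviation is the square root of the variance. The sketch below uses the rounded variances printed in the summary table, so it is only accurate to about two decimals.

```r
# Variances as printed in the summary table above (rounded to two decimals)
var_sat <- c(CAN = 4.92, USA = 5.75)

# The standard deviation is the square root of the variance
sd_sat <- sqrt(var_sat)
round(sd_sat, 2)
```

The rounded results should agree with the sd_sat column (2.22 for CAN and 2.40 for USA).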
Confidence Intervals for the Means
Construct a 95% confidence interval for financial satisfaction separately for the United States and Canada. Use a critical value of 1.96 to create the 95% confidence interval.
wvs_us_can %>%
  group_by(b_country_alpha) %>%
  summarise(mean_sat = mean(q50, na.rm=T), #Update var_name to q50
            sd_sat = sd(q50, na.rm=T), #Update var_name to q50
            n = n(),
            se = sd_sat/sqrt(n), #Use formula for SE (sd_sat/sqrt(n))
            cv = 1.96, #Use 1.96
            moe = se*cv,
            ci_lower = mean_sat - moe, #CI Lower is mean - moe
            ci_upper = mean_sat + moe) #CI Upper is mean + moe
# A tibble: 2 × 9
b_country_alpha mean_sat sd_sat n se cv moe ci_lower ci_upper
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CAN 6.52 2.22 4018 0.0350 1.96 0.0686 6.46 6.59
2 USA 6.09 2.40 2596 0.0471 1.96 0.0922 6.00 6.19
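The same calculation can be sketched as a small helper on a made-up vector. `mean_ci` and `toy_scores` are hypothetical names introduced here for illustration. One difference from the lab code is an assumption on my part: the helper counts only non-missing values, whereas `n()` counts every row in the group. If q50 contains missing values (the lab does not say either way), `n()` would make the standard error slightly too small.

```r
# Hypothetical helper: mean and 95% CI for a numeric vector
mean_ci <- function(x, cv = 1.96) {
  n_valid <- sum(!is.na(x))          # count only non-missing responses
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  se <- s / sqrt(n_valid)            # standard error of the mean
  moe <- cv * se                     # margin of error
  c(mean = m, se = se, ci_lower = m - moe, ci_upper = m + moe, n = n_valid)
}

# Made-up responses with one missing value
toy_scores <- c(7, 5, 8, NA, 6, 9, 4, 7)
res <- mean_ci(toy_scores)
res
```

Inside `summarise()`, the equivalent count would be `sum(!is.na(q50))` in place of `n()`.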
Questions
What does the confidence interval tell us about the population mean?
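The CI gives a range of plausible values for the true population mean. The 95% refers to the method: if we drew many samples and built an interval from each one the same way, about 95% of those intervals would contain the true mean.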
Do the confidence intervals for the two countries overlap?
No, the CIs do not overlap: Canada's lower bound (6.46) is above the US upper bound (6.19).
Does overlap (or lack of overlap) suggest a meaningful difference?
The lack of overlap suggests that the difference in means (about 0.43 points) is statistically significant.
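The overlap question can also be checked in code. This is a minimal sketch using the rounded bounds printed in the table above; the `ci` data frame is built by hand for illustration.

```r
# Interval bounds as printed in the table above (rounded)
ci <- data.frame(
  country = c("CAN", "USA"),
  lower = c(6.46, 6.00),
  upper = c(6.59, 6.19)
)

# Two intervals overlap when each one starts before the other ends
overlap <- ci$lower[1] <= ci$upper[2] && ci$lower[2] <= ci$upper[1]
overlap
```

This returns FALSE because Canada's lower bound is above the US upper bound.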
Topic 2: Hypotheses, T-Tests and P-values
To start, create an empirically testable hypothesis about whether Canadian or American residents are more satisfied with their financial life.
Hypothesis to Test
Write Out the Null and Alternative Hypotheses
Null Hypothesis (Ho): Canadian and American residents do not differ in their satisfaction with their financial life.
Alternative Hypothesis (Ha): Canadian and American residents differ in their satisfaction with their financial life.
Two Independent Samples T-test
Estimate a two-sample t-test assuming the variance is not equal between the two groups.
#Update the DV to q50, IV to b_country_alpha and Data to wvs_us_can in this code
t.test(q50 ~ b_country_alpha, data = wvs_us_can, var.equal = FALSE)
Welch Two Sample t-test
data: q50 by b_country_alpha
t = 7.3038, df = 5171.7, p-value = 3.224e-13
alternative hypothesis: true difference in means between group CAN and group USA is not equal to 0
95 percent confidence interval:
0.3142825 0.5448975
sample estimates:
mean in group CAN mean in group USA
6.523644 6.094054
Questions
What is the p-value and what do we learn from it?
The p-value is 3.224e-13. As this number is below .05, we learn that the difference between Canadian and US satisfaction is statistically significant.
At α=0.05, do we reject or fail to reject the null hypothesis?
Using an alpha of .05, we reject the null hypothesis.
How does this result relate to the confidence intervals you calculated earlier?
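As the non-overlapping CIs suggested, the t-test shows a significant difference between Canadian and US financial satisfaction, and the test's 95% CI for the difference in means (0.31 to 0.54) excludes 0. The logic only runs one way, though: non-overlapping 95% CIs imply a significant difference, but overlapping CIs do not by themselves rule one out, so the t-test is the more direct check.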
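To see where the t statistic comes from, it can be approximated by hand from the summary values printed earlier. This is a sketch under stated assumptions: the variances are rounded to two decimals, and the group sizes come from `n()`, which may include respondents with missing q50. The result should therefore land close to, but not exactly on, the `t.test()` output (t = 7.3038, df = 5171.7).

```r
# Summary values as printed earlier (rounded; n may include missing responses)
m <- c(CAN = 6.523644, USA = 6.094054)
v <- c(CAN = 4.92, USA = 5.75)
n <- c(CAN = 4018, USA = 2596)

se_diff <- sqrt(sum(v / n))                        # SE of the difference in means
t_stat <- unname((m["CAN"] - m["USA"]) / se_diff)  # Welch t statistic

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- sum(v / n)^2 / sum((v / n)^2 / (n - 1))

c(t = t_stat, df = df_welch)
```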
Optional: Create a Boxplot Comparing Financial Satisfaction Between Americans and Canadians
You need to first calculate and save the mean financial satisfaction level for both countries then use the aggregated data frame to produce the plot. Basically, just save the data you created in the confidence interval analysis above as a new data frame then plot those results.
#Save this DF and use it to create the boxplot
new_df <- wvs_us_can %>%
  group_by(b_country_alpha) %>%
  summarise(mean_sat = mean(q50, na.rm=T), #Update var_name to q50
            sd_sat = sd(q50, na.rm=T), #Update var_name to q50
            n = n(),
            se = sd_sat/sqrt(n), #Use formula for SE (sd_sat/sqrt(n))
            cv = 1.96, #Use 1.96
            moe = se*cv,
            ci_lower = mean_sat - moe, #CI Lower is mean - moe
            ci_upper = mean_sat + moe) #CI Upper is mean + moe

new_df %>%
  ggplot(aes(x = b_country_alpha, y = mean_sat)) +
  geom_boxplot() +
  labs(title = "Financial Satisfaction by Country",
       x = "Country",
       y = "Financial Satisfaction") +
  theme_minimal()
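Because the aggregated data frame has only one row per country, `geom_boxplot()` has a single value per group and the box collapses to a flat line. One possible alternative, which is a swapped-in technique and not what the lab asks for, is to plot each mean as a point with its 95% confidence interval as an error bar. The `ci_df` data frame below is typed in by hand from the rounded values in the confidence interval table.

```r
library(ggplot2)

# Summary values as printed in the confidence interval table (rounded)
ci_df <- data.frame(
  b_country_alpha = c("CAN", "USA"),
  mean_sat = c(6.52, 6.09),
  ci_lower = c(6.46, 6.00),
  ci_upper = c(6.59, 6.19)
)

# Point for each mean, error bar for each 95% CI
p <- ggplot(ci_df, aes(x = b_country_alpha, y = mean_sat)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.1) +
  labs(title = "Financial Satisfaction by Country",
       x = "Country",
       y = "Mean Financial Satisfaction (95% CI)") +
  theme_minimal()

p
```

A true boxplot of the distributions would instead need the respondent-level data, with q50 mapped to y.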
Wrap Up Questions
Why do we use a t-test instead of simply looking at the means for the 2 groups?
T-tests take into account variability and sample size. A simple difference in group means can result from random sampling variability rather than a true difference between the populations.
How does sample size affect standard error and confidence intervals?
A larger sample size decreases the standard error (SE = sd/sqrt(n)), which in turn narrows the confidence interval.
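A short illustration of that relationship, holding the standard deviation fixed and varying the sample size. The sample sizes are hypothetical and chosen so that each is four times the previous one.

```r
sd_sat <- 2.22                      # held fixed for illustration
n <- c(100, 400, 1600)              # hypothetical sample sizes

se <- sd_sat / sqrt(n)              # standard error shrinks with sqrt(n)
ci_width <- 2 * 1.96 * se           # full width of a 95% CI

data.frame(n, se, ci_width)
```

Quadrupling the sample size halves the standard error, and with it the width of the interval.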
In your own words, what does a p-value represent?
A p-value is the probability of observing a result at least as extreme as the one in our sample if the null hypothesis were true. It is not the probability that the finding itself is due to chance.
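To connect this definition to the earlier output, a two-sided p-value can be sketched from the t statistic and degrees of freedom reported by `t.test()`. The inputs below are the rounded values from that printout, so the result should match the reported p-value only approximately.

```r
# Test statistic and degrees of freedom as printed in the t-test output
t_stat <- 7.3038
df <- 5171.7

# Two-sided p-value: probability, under the null, of a |t| at least this large
p_value <- 2 * pt(-abs(t_stat), df)
p_value
```

The result should be close to the 3.224e-13 shown in the t-test output.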