My Goals This Week

  1. Attempt the final 2 exploratory analyses
  2. Read Jenny’s Roaches examples
  3. Try out new code we learnt from Q&A in Week 9

How I Made Progress

Exploratory Q2: Is nutritional confusion correlated with backlash?

Confusing headlines can result in backlash (i.e. negative beliefs about recommendations and research). This analysis therefore looks at whether these two variables are correlated. We expect that greater nutritional confusion will be correlated with stronger negative beliefs about research (i.e. more backlash).

Load Packages

library(tidyverse) 
library(ggeasy) # needed to format plots
library(gt) # needed for tables
library(rstatix) # needed for calculating effect size
library(corrplot) # needed for correlation matrix

theme_set(theme_bw()) # sets the default theme for all plots that follow

Data Visualisation

I am using corrplot() from the corrplot package to draw a correlation matrix, which shows the correlations between variables visually. From the correlation plot, it seems that there is a positive correlation between nutrition confusion and backlash.

corr_data <- exptwofinaldata %>% 
  select(
    confusion,
    backlash,
    mistrust,
    confidence,
    certainty,
    development
  ) %>% 
  cor() 

corrplot(
  corr_data,
  method = "pie", # shows each correlation as a pie chart
  type = "lower" # displays only the lower triangle of the matrix to remove duplicates
) 
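As a quick illustration of what cor() hands over to corrplot(), here is a minimal sketch on a small made-up data frame (toy_data and its values are hypothetical, not the study data). It also shows the use argument, which matters if a column has missing values, since cor() otherwise returns NA for every pair involving that column.

```r
# Hypothetical toy data, for illustration only (not the study data)
toy_data <- data.frame(
  confusion = c(1, 2, 3, 4, 5, 6),
  backlash  = c(2, 1, 4, 3, 6, 5),
  mistrust  = c(6, 5, NA, 3, 2, 1) # one missing value
)

# Default: any pair involving a column with an NA comes back as NA
cor_default <- cor(toy_data)

# Drop rows with any missing value before computing the correlations
cor_complete <- cor(toy_data, use = "complete.obs")

print(round(cor_complete, 2))
```

The result is a square, symmetric matrix, which is why showing only the lower triangle loses no information.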

Next, I am going to plot nutrition confusion against backlash. Using geom_jitter() lets me see all the data points and check visually whether there is a pattern in the relationship between the two variables. Then, using geom_smooth() with the argument method = "lm" plots the line of best fit from the linear model y = mx + c.

ggplot(exptwofinaldata, aes(confusion, backlash)) +
  geom_jitter() +
  geom_smooth(method = "lm")

As expected, there is a positive linear correlation between confusion and backlash. Next, let’s look at statistical analysis.
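The line drawn by geom_smooth(method = "lm") can also be inspected numerically. This is a minimal sketch on made-up numbers (x and y are hypothetical, not the study data) that fits the same kind of model with lm() and pulls out m and c. It also checks the link to correlation: with a single predictor, the slope is the correlation rescaled by the two standard deviations.

```r
# Hypothetical toy data, for illustration only (not the study data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.0, 2.5, 3.9, 4.1, 5.6, 5.8)

fit <- lm(y ~ x) # the same kind of model geom_smooth(method = "lm") fits

intercept <- coef(fit)[["(Intercept)"]] # c in y = mx + c
slope     <- coef(fit)[["x"]]           # m in y = mx + c

# With one predictor, the slope equals r rescaled by the spread of y and x
slope_from_r <- cor(x, y) * sd(y) / sd(x)

print(c(intercept = intercept, slope = slope, slope_from_r = slope_from_r))
```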

Statistics: Correlation Test

I am using cor.test() to look at the Pearson’s correlation between confusion and backlash.

cor.test(exptwofinaldata$confusion, exptwofinaldata$backlash)
## 
##  Pearson's product-moment correlation
## 
## data:  exptwofinaldata$confusion and exptwofinaldata$backlash
## t = 10.52, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3860562 0.5398137
## sample estimates:
##       cor 
## 0.4664511

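The t-statistic is 10.52 with df = 398, and the p-value is < 2.2e-16 (well below 0.05). Thus, we can conclude there is a significant positive relationship between nutrition confusion and backlash. The correlation is r = 0.466 - a moderate-to-strong positive relationship (by Cohen's rule of thumb, 0.3 is medium and 0.5 is large).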
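To see how the numbers in the cor.test() output fit together, here is a minimal sketch that recomputes the t-statistic from r and df using the standard formula t = r * sqrt(df) / sqrt(1 - r^2). The values are copied from the output above; the variable names are my own. It also gives r squared, i.e. the share of variance the two variables have in common (about 22% here).

```r
# Values copied from the cor.test() output above
r  <- 0.4664511
df <- 398

# t statistic for a Pearson correlation: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Proportion of variance the two variables share
r_squared <- r^2

print(round(c(t = t_stat, r_squared = r_squared), 3))
```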

Exploratory Q3: Does age influence participants’ ratings of scientific advancement?

A 3-point scale was used (know less, know the same, know more) to assess how participants perceive scientific advancement. I am interested to see whether age influences participants’ ratings of scientific advancement (i.e. do older adults rate scientific advancement differently to younger adults in this study?). If so, age may be a confounding variable that could influence the results, and further studies should account for it when interpreting results.

First, I am going to group the age variable into categories to allow easier interpretation. From the verification report, Experiment 1’s age range is 18-69. This allows us to group as below:

  • Gen Z (1997 – 2012): 9 – 24
  • Millennials (1981 – 1996): 25 – 40
  • Gen X (1965 – 1980): 41 – 56
  • Boomers II (1955 – 1964): 57 – 66

I am thus going to use mutate() and case_when() to create a new variable called generations, using %in% to match each age range to a generation name. Because participants are aged 18-69, the code starts Gen Z at 18 and extends Boomers to 69 (rather than 66). Then, I am converting generations to a factor variable and ordering its levels chronologically (R defaults to alphabetical).

exponefinaldata <- exponefinaldata %>% 
  mutate(
      generations = case_when(
        Age %in% 18:24 ~ "GenZ",
        Age %in% 25:40 ~ "Millennials",
        Age %in% 41:56 ~ "GenX",
        Age %in% 57:69 ~ "Boomers"
    )
  ) 

exponefinaldata$generations <- factor(
  exponefinaldata$generations,
  levels = c("GenZ", "Millennials", "GenX", "Boomers") # change to factor variable and specify levels chronologically
)
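For comparison, the same grouping can be sketched with base R's cut(), which bins the ages and sets the factor levels in one step. This is an alternative to the case_when() approach, not what the analysis uses, and the ages below are hypothetical. One difference worth knowing: %in% 18:24 only matches whole numbers, whereas cut() would also bin a non-integer age such as 24.5.

```r
# Hypothetical ages, for illustration only (not the study data)
ages <- c(18, 24, 25, 40, 41, 56, 57, 69)

# Intervals are (17,24], (24,40], (40,56], (56,69]; right-closed by default
generations <- cut(
  ages,
  breaks = c(17, 24, 40, 56, 69),
  labels = c("GenZ", "Millennials", "GenX", "Boomers")
)

levels(generations) # levels follow the order of labels, not the alphabet
table(generations)
```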

Descriptive Statistics

Next, I am going to create a frequency table of advancement by generations.

generations_table <- table(exponefinaldata$advancement, exponefinaldata$generations)

print(generations_table)
##     
##      GenZ Millennials GenX Boomers
##   -1   28          33   19       5
##   0    41          69   32      17
##   1    14          25    9       2

Displaying frequency table as % of each generation might be more insightful…

prop.table(
  generations_table,
  margin = 2 #indicates to calculate proportion across columns of the table
)*100
##     
##           GenZ Millennials      GenX   Boomers
##   -1 33.734940   25.984252 31.666667 20.833333
##   0  49.397590   54.330709 53.333333 70.833333
##   1  16.867470   19.685039 15.000000  8.333333

From the table above, it looks like a noticeably higher share of Boomers say they know the same before/after reading the headlines (about 71%), compared with roughly 50% in each of the other generations. Note that the Boomers group is small (24 participants), so its percentages are less stable.
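To make the margin = 2 step concrete, here is a minimal sketch that rebuilds the table from the counts printed above (the object names are my own) and checks that each column of percentages adds up to 100. It also prints the number of participants per generation, which shows how small the Boomers group is.

```r
# Counts copied from the printed frequency table above
generations_counts <- matrix(
  c(28, 33, 19,  5,
    41, 69, 32, 17,
    14, 25,  9,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    advancement = c("-1", "0", "1"),
    generation  = c("GenZ", "Millennials", "GenX", "Boomers")
  )
)

colSums(generations_counts) # participants per generation

# margin = 2 divides each cell by its column total
col_percent <- prop.table(generations_counts, margin = 2) * 100
colSums(col_percent) # each column adds up to 100
```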

Data Visualisation

Plotting a bar plot of the proportion of each response within each generation, using ggplot() and geom_bar().

ggplot(exponefinaldata, aes(advancement)) +
  geom_bar(aes(y = after_stat(prop), fill = generations), position = "dodge") + # plot y as proportion; after_stat() replaces the deprecated ..prop.. notation
  scale_y_continuous(labels = scales::percent) + # displays the y axis as percentages
  facet_wrap(vars(generations), strip.position = "bottom")

This is in line with what we observed from the frequency table - Boomers show a higher percentage for “know the same” compared to other age groups. To determine whether this observed pattern is statistically significant, we need to run a chi-square test.

Statistics: Chi-square test

A chi-square test is selected as both variables are categorical.

chisq.test(generations_table) 
## 
##  Pearson's Chi-squared test
## 
## data:  generations_table
## X-squared = 5.0736, df = 6, p-value = 0.5344

The p-value is 0.5344, which is above 0.05. Therefore, we fail to reject the null hypothesis that generation and perceived scientific advancement are independent (i.e. that no association exists between the two categorical variables). In other words, we have insufficient evidence to conclude that a participant’s generation and their perceived scientific advancement are related, so the higher “know the same” percentage among Boomers may simply reflect chance.
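Because the Boomers group is small, one expected count falls just below 5 (about 4.1), which is the situation where R warns that the chi-squared approximation may be unreliable. Here is a minimal sketch of some follow-up checks, using the counts from the printed frequency table: inspecting the expected counts, re-running the test with a simulated p-value as a cross-check, and computing Cramér's V as an effect size in base R (rstatix also offers cramer_v()). The object names and the seed are my own choices.

```r
# Counts copied from the printed frequency table above
generations_counts <- matrix(
  c(28, 33, 19,  5,
    41, 69, 32, 17,
    14, 25,  9,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    advancement = c("-1", "0", "1"),
    generation  = c("GenZ", "Millennials", "GenX", "Boomers")
  )
)

# R warns here because one expected count is below 5
chi_result <- suppressWarnings(chisq.test(generations_counts))
round(chi_result$expected, 2)

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(123) # arbitrary seed so the simulation is repeatable
chi_sim <- chisq.test(generations_counts, simulate.p.value = TRUE, B = 10000)
chi_sim$p.value

# Cramer's V effect size: sqrt(X^2 / (n * (min(rows, cols) - 1)))
n <- sum(generations_counts)
cramers_v <- sqrt(
  unname(chi_result$statistic) / (n * (min(dim(generations_counts)) - 1))
)
cramers_v
```

Cramér's V comes out at roughly 0.09, a small effect, which is consistent with the non-significant test result.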

Challenges

  • Q3 was a challenge for me - I found the whole process for the chi-square test a bit confusing and was often unsure whether I was plotting correctly. I followed Jenny’s process from Week 8 and Googled to read more about this statistical test… I also feel like my Q3 conclusion could be clearer?

Successes

  • corrplot() was really cool and I found the process for Q2 much easier and faster! This could also be due to my familiarity with the correlation coefficient from stats courses. :)

Next Steps

  • Working on putting all the parts together in the final report! I think there might be some redundant commentary, as it is currently scattered across my learning logs - so I will start working on polishing up the format this week.