Part 1 - Summary and Reaction

Summary

There is a lot of conflicting information in today’s media, especially in the news, about the risks and benefits of certain activities. The examples the paper by Haigh and Birch gives are:

  1. “A fasting diet could regenerate pancreas and reverse diabetes, researchers say” – ABC news
  2. “Fasting diets may raise risk of diabetes, researchers warn” – The Guardian

They believe that the use of “researchers” as a catch-all term for any scientific claim is problematic. Leslie et al. (2011) showed that ‘researchers say’ is read as ‘all researchers say’, so the public’s default assumption is a high degree of scientific consensus on any given topic. Clearly this is an issue, as conflicting headlines then imply that the scientific consensus is ever-changing and unreliable. While many studies have documented this problem, almost none have tried to provide a solution. Hence, the authors of the paper aimed to see if adding a qualifier like “some” before “researchers”, as in “A fasting diet could reverse diabetes, SOME researchers say”, would reduce confusion. They hypothesized that such qualifiers would reduce participants’ confusion about conflicting headlines.

Across two experiments, they tested whether conflicting and qualified headlines had an effect by splitting the participants into four groups across the two variables: generic/conflicting, generic/non-conflicting, qualified/conflicting, qualified/non-conflicting. In Experiment 1, the experimenters looked at more immediate outcomes such as participants’ perceived contradiction of, and confusion about, the headlines, whereas in Experiment 2 they asked about participants’ global beliefs towards science and nutrition in general.

Unfortunately for the experimenters, they found no significant effect of the qualifiers in either experiment. However, in Experiment 1 the non-conflicting headlines were considered less confusing than the conflicting headlines. To summarize their results: the use of a qualified headline does not reduce the audience’s confusion or sense of contradiction, which does not support their hypothesis. What this novel study does highlight, however, is that future studies should look into other ways to address the problem of conflicting news headlines.

Reaction

Reaction 1

I can see this paper relates to how the scientific consensus should be perceived, and to the inherent clash between the methodical ethos of the scientific method and the fast-paced world of the 24/7 media cycle. As the authors poignantly note, by its very nature “conflict is a normal and healthy part of the scientific process”; however, because the media is driven by a need for clicks and views, this process is presented in a way that makes it seem unreliable. They don’t touch on this point, but since the paper emphasizes the effect of headlines, I believe it reflects how society as a whole consumes media. For me personally as well, most of what I do is just read the headlines, because there is so much information out there (“infobesity”) that it is often hard to go further. If anything, this thought process led me to reflect that when I’m personally consuming news, I should make a conscious effort to go beyond the headline and try to truly understand what’s going on!

Reaction 2

I was confused by the authors’ justification for using qualifying information as a possible method for reducing headline confusion. The step-by-step argument they make for a qualifying solution logically makes sense; it was not the logical steps that seemed lacking, but the surrounding context used to justify them. The lack of sources in this department leads me to believe the study is novel in what it attempts to do with headlines, yet they make no mention of this uniqueness, whereas other studies preface their pilot results and thereby justify the lack of sources behind them. The other option is that they had sources available but failed to include them: one related study by Jensen, J. D. on the effects of hedging in the news reports successful results, but no mention of this study is made in the original Haigh and Birch article. Thus, I think a better effort could have been made to justify why they chose qualifiers as a possible solution to conflicting headlines.

Reaction 3

When I was reading, I was excited to learn that they attempted to test whether exposure to the conflicting headlines had any positive effects. The reason I enjoyed seeing this is that the tone of not only this paper, but of discussions of prevalent public scientific issues generally (anti-vax, flat earth, fad diets), carries an increasing amount of negativity and mistrust of good scientific practice. So I really liked that the article looked not only at the expected variables such as mistrust of expertise, confusion and contradiction, but also at whether epistemic knowledge of science could be increased. In a society where the public’s trust in science seemingly drops day by day, as seen here, an attempt to see whether positive effects could be induced was heart-warming. If I had to quibble, the inclusion of these positive effects felt like a component tacked on at the end rather than a point that was well researched. So it may well be a point of interest for future research to emphasize more positive effects when analyzing conflicting news in science!

Part 2 - Verification

Reproducibility Goal

After opening the paper, we established as a group that there were two main goals:

1. Recreate the demographic descriptives (reported in Participants)

2. Recreate all the figures, which consisted of three plots and a histogram for Experiment 1, and six plots for Experiment 2.

Experiment 1

Loading in libraries

  • library(qualtRics) allows us to read in the data using the read_survey() function, the further benefits of which are listed above the second chunk.
  • library(tidyverse) loads in the tidyverse packages, including dplyr and ggplot2.
  • library(ggbeeswarm) provides geom_beeswarm() for the points overlaid on the violin plots.
  • library(ggeasy) allows for simpler ggplot functions.
  • library(patchwork) is to put all outputted graphs together in one combined output, as seen in the plots from the paper.
#Loading relevant packages
library(qualtRics)
library(tidyverse)
library(ggbeeswarm) 
library(ggeasy) 
library(patchwork)

Reading in Data

mydata=read_survey("Study 8 data.csv") as previously mentioned was to read in the data set which in the OSF.R read as “Study 8 data.csv” for the first experiment. And utilizing the read_survey()function has a couple of benefits:

  • It not only reads in the data; it also removes the unnecessary text in rows 2 and 3, which is not relevant to the data set.
  • It also sets all relevant columns to ‘numeric’ rather than ‘character’, which allows us to calculate the descriptives, mutate and create new variables, and plot the data.
  • mydata<- as.data.frame(mydata) is converting the data set from a tibble into a data frame to solve for rounding issues in numerical calculations in following chunks.

To get rid of the two unnecessary rows, what we used earlier was the slice() function, which lets you remove rows with a minus, as in slice(-1:-2). Using read_survey() not only saves us from doing this manually but also saved me from issues down the line where slice would fail to reduce the participant numbers. So overall, the qualtRics function was a lot simpler and less problematic to use.
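For comparison, a minimal sketch of that earlier manual route (assuming the same file; both steps become unnecessary with read_survey()):

#Earlier manual approach (sketch only, not used in the final code)
mydata_manual <- read_csv("Study 8 data.csv") %>% #columns import as character
  slice(-1:-2) %>%                                #drop the two Qualtrics text rows
  type_convert()                                  #re-guess the column types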

mydata=read_survey("Study 8 data.csv")
 
mydata<- as.data.frame(mydata)

Filtering Data

  • The first thing to do in emulating what the experimenters did is to check how many times the survey was opened, which includes anyone with a progress of 1% or greater, and I did this using count(). With this data set, n should equal 371.
  • Then I had to rename two variables with strange names, to help clarify the data set, using rename() with new name = old name. This rename function was sometimes confusing to use, since I initially assumed the new name would go on the right of the equals sign (‘= new name’), and so I frequently swapped the two around until I eventually got used to it!

Unplanned Exclusions

When participants completed the survey, some did it twice and so there are duplicate IDs which have to be filtered out:

  • This is done by first counting how many duplicates are present using count(duplicated(Prolific_PID)), with Prolific_PID referring to the ID variable. IDs with the value TRUE are duplicates.
  • Once I obtained the number of duplicates, it was time to remove the known double-ups using !duplicated(). After doing so we should have n = 312.
  • What I had done previously was use HaighData1 <- HaighData1 %>% distinct(Prolific_PID, .keep_all = TRUE), which does exactly the same thing but instead keeps all the rows that are not doubled up. While there was almost no difference in the efficiency of the code, using a function whose name reflects the duplicate-removal step made the code easier to read, which is why we stuck with !duplicated() instead.
#Initial Data Check  
mydata %>% count(Progress>=1) #371
##   Progress >= 1   n
## 1          TRUE 371
mydata <- mydata %>% rename(Recall_score = SC0, Condition = FL_10_DO)
#Unplanned Exclusions
mydata %>% count(duplicated(Prolific_PID)) 
##   duplicated(Prolific_PID)   n
## 1                    FALSE 312
## 2                     TRUE  59
mydata <- mydata[!duplicated(mydata$Prolific_PID), ] 
mydata %>% count(Progress>=1) 
##   Progress >= 1   n
## 1          TRUE 312

Planned Exclusions

The experimenters had established a set of parameters to exclude participants so that the data set only reflects valid responses. Participants who failed to adhere to the following criteria were removed. This part was relatively easy to follow and apply, as the exclusion criteria kept only those who:

  • Consented to take part (Consent == 1)
  • Finished the study (Finished == 1)
  • Declared that they answered seriously towards the end of the study (Serious_check == 1)
  • Were able to recall 4 or more headline topics (Recall_score >= 4).

The data was then filtered according to these criteria using the filter() function. Then I used select() to gather all the relevant variables together into a subset of the data, used count() to see how many participants were left after all the exclusions, and finally exported this tidied-up data using the write_csv() function.

The reason for doing this is that when attempting other functions on the data set, it was nice to just load in the tidied data rather than having to run all the code again. This was particularly useful when testing out the data set for my exploratory analyses in separate .Rmd files.

#Planned exclusions: 
mydata <- mydata %>% 
    filter(
      Consent == 1,
      Finished == 1,
      Serious_check == 1,
      Recall_score >= 4) %>% 
    select(
      Finished,
      `Duration (in seconds)`,
      Gender,
      Age,
      Serious_check, 
      Recall_score,
      Condition,
      contradiction_1:advancement)
  
count(mydata) #final n = 294
##     n
## 1 294
#Export the data to a .csv
write_csv(mydata, "MyDataSubset.csv")
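As mentioned above, in a fresh .Rmd this exported file can then be pulled straight back in (a minimal sketch):

#Loading the tidied data directly in a separate .Rmd (sketch)
mydata <- read_csv("MyDataSubset.csv")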


Descriptive statistics

Calculating the demographics was also quite simple, and was done using the summarise() function, which creates a new data frame with new group values depending on what you input. The paper looked at the mean, standard deviation (sd) and range of the participants’ ages, so those were the functions used. Splitting the participants by gender was also done using the count() function, but this time, instead of stating the subset of data, I specified it by the variable ‘Gender’.

Previously, when getting the descriptives, I had an issue where the summarise() function would return NA for the mean. This issue was resolved by applying the qualtRics package; I believe this is because it automatically converts all relevant columns to numeric, which allows the mean to be calculated.
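For anyone who hits the same NA issue without qualtRics, a hedged sketch of the manual fix is to coerce the column to numeric yourself:

#Manual fix for the NA mean (sketch; unnecessary once read_survey() is used)
mydata <- mydata %>% mutate(Age = as.numeric(Age)) #convert character to numeric
mean(mydata$Age, na.rm = TRUE)                     #na.rm = TRUE skips true missings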

#AGE: Calculating the mean, SD, and range
mydata %>% summarise(
  mean(Age), 
  sd(Age), 
  range(Age)
  ) 
##   mean(Age)  sd(Age) range(Age)
## 1  34.29252 12.96633         18
## 2  34.29252 12.96633         69
mydata %>% count(Gender) 
##   Gender   n
## 1      1 126
## 2      2 168

Setting up for the plots

  • I realized that in the data, the dependent variable (DV) contradiction was separated into six ratings, so I summed the scores of each contradiction rating and put them into one NEW variable, “contradiction”, using the mutate() function.
  • Then, for all the plots, we had to sort out the independent variables (IVs). This was problematic since the variable ‘Condition’ needed to be separated so that the two IVs, Format and Conflict, could be separate variables. The condition variable would normally show up written as “Block_1_Generic_Conflict”, which was quite hard to dissect.
  • However, by using separate() to split the variable into “block”, “number”, “Format” and “Conflict”, the ‘block’ and ‘number’ parts became obsolete and the two important variables ‘Format’ and ‘Conflict’ show up separately.
  • The next step was to change the new variables into factors using as.factor() so they could be used in the plots. One thing I realized is that the graph outputs were reading “Conflict” and “Consistent”, so I changed the names using recode_factor() to Conf./Non-Conf. to make them easier to understand and closer to the original plots.
##Creating new variable contradiction
mydata <- mydata %>%
  mutate(contradiction = contradiction_1 + contradiction_2 + contradiction_3 + contradiction_4 + contradiction_5 + contradiction_6)
##For all plots
mydata <- separate(mydata, Condition, c("block", "number", "Format", "Conflict"))
#Set "Format" and "Conflict" as factors 
mydata <- mydata %>%
  mutate(Format=as.factor(Format)) %>%
  mutate(Conflict=as.factor(Conflict))
#Rename from Conflict/Consistent to Conf./Non-Conf.
mydata$Conflict <- recode_factor(mydata$Conflict, Conflict = "Conf.",Consistent = "Non-Conf.")
#Export the FINAL data to a .csv
write_csv(mydata, "MyDataFinalSubset.csv")

Producing the plots!

In the earlier weeks, my emphasis was on just getting a plot produced, and so, utilizing Google and the ggplot basics from Dani’s video tutorials, I was able to produce this simple plot whereby:

  • ggplot() establishes the axes for x and y and adds the background
  • geom_violin() is to add the violin plots
  • aes(x = Conflict, y = contradiction) specifies what variables are represented in their respective x and y axes
  • facet_wrap() takes one plot and turns it into a multi-panel plot (like the two panels here); without it, only a single panel is produced
  • ggtitle() was simply for the title of the plot = “Contradiction”
  • scale_y_continuous(name = "Perceived Contradiction", limits = c(0,30)) handles the y axis, specifying its name and the rating limits of perceived contradiction
  • theme(plot.title = element_text(hjust = 0.5)) is used to centre the title text
#Contradiction Plot 

ggplot(mydata) +
  geom_violin(aes(x = Conflict, y = contradiction)) + 
  facet_wrap(vars(Format), strip.position = "bottom")+
  ggtitle(label = "Contradiction")+
  scale_y_continuous(
    name = "Perceived Contradiction",
    limits = c(0,30))+
  scale_x_discrete(name = NULL)+
  theme(plot.title = element_text(hjust = 0.5))

At this point I felt like I was reaching the limit of reproducing the plots with only ggplot(), and it seemed like I was at a roadblock. Looking back at the authors’ script on the OSF, they utilized pirateplot() instead of ggplot.

So I had to contemplate whether to keep searching for other functions within ggplot or to approach it using pirateplot. Another issue I noted was that while the current code could reproduce the other graphs to a similar standard, we weren’t able to get all three plots side by side as in the original study.

I decided to stick to my guns with ggplot and searched for other packages that enhanced the functions within ggplot itself; discovering library(ggeasy) and library(ggbeeswarm) did exactly that. Utilizing those and other discoveries, these were the improvements I made:

  • geom_beeswarm() is used to add the individual points along the violin plot, and (cex = 0.2) scales the spacing of the points.

  • stat_summary() allows for a lot of flexibility in specifying summary functions. The summary function can work on a data frame (with argument name fun.data), which is what I used to specify the mean and confidence intervals via "mean_cl_normal"; geom = "crossbar" is used to show these on the plot.

  • Everything from the ggeasy package does what’s in the name of the function, which is why it is so easy to use!

    • easy_center_title() centres the title.
    • easy_remove_legend() removes the legend.
  • scale_fill_manual(values = c("slategray2", "lightpink1")) was used to fill in the colours that the authors used.

  • theme_minimal() is used to make the background white.

This is now all the code put together to get one finished plot! All we had to do was copy and paste this code for the other two graphs of participants’ confusion and their perceived advancement of scientific knowledge. The only things that had to be changed were:

  • The name of the plot you set it as using <-
  • The Y variable in ggplot()
  • The label of the plot title in ggtitle()
  • The label of the Y axis and its range limits in scale_y_continuous(name = " ", limits = c(lower, upper))
#Single Contradiction Plot 
Contradiction_Plot2 <- ggplot(mydata, 
                         aes(x = Conflict,
                             y = contradiction, 
                             fill = Conflict)) +
  #geom_functions
  geom_violin() + 
  facet_wrap(vars(Format), strip.position = "bottom") +
  geom_beeswarm(cex = 0.2) +
  stat_summary(fun.data = "mean_cl_normal",
                geom = "crossbar", 
                fill = "white") +
  ggtitle(label = "Contradiction") + 
  scale_y_continuous(name = "Perceived Contradiction", limits = c(1,30)) + 
  scale_fill_manual(
    values = c("slategray2", "lightpink1")) +
  theme_minimal() +
  #ggeasy functions 
  easy_center_title() + 
  easy_remove_legend() +
  easy_remove_x_axis(what = c("title"))
print(Contradiction_Plot2)

Note: I haven’t included all 3 graphs for Experiment 1 here, since another discovery just below will produce exactly that! To preface: once we reached the point of having working code that could be applied to every graph, our emphasis shifted to making the code more efficient and working on the histogram.

A function revelation!

Now this was probably the most satisfying bit of coding to learn about, because although I had finished this copy-and-pasteable code, applying it to every single graph from both Experiment 1 and 2 still led to over 300 lines of code. One of the golden rules of coding we have learnt is that if code repeats itself, there’s an opportunity to tidy it up and make it more efficient! Prior successes were more about making the code understandable for myself rather than reducing redundancies, like going from theme(plot.title = element_text(hjust = 0.5)) to easy_center_title(). I hit a brick wall until Week 7 of the term, when I was taught how to apply function() by Jenny S!!

Once I learned that it essentially creates your own plot template, it was perfect for solving my efficiency issue! First, I assigned ‘figure.1.fun’ as the name of our plotting function using function(), and then specified all the arguments that need to change from plot to plot. In this case that includes: the plot title, the y variable, the y axis title and its lower and upper limits.

Now that the function is created, we pass each unique component of the specific plot into the function; so, for example, the contradiction plot has:

  • The unique plot title = “Contradiction”
  • The unique Y variable = y_var = mydata$contradiction
  • The unique Y axis title = “Perceived Contradiction”
  • The unique lower limit = 1 and upper limit = 30

Then I repeated this process for the other two graphs of participants’ confusion and their perceived advancement of scientific knowledge. Now this may seem exactly the same as the previous copy-and-paste method (as it did for me at first); however, the unique aspect of function() is that it RE-RUNS the same code and applies it to each unique plot without the need for another copied chunk. The result is that what previously took over 300 lines of code to produce all 9 violin plots was reduced to about 60 lines, which is such a satisfying revelation!!

##Violin ggplots

#Making our own function 
figure.1.fun <- function(y_var, plot_title, y_title, lim_1, lim_2)  {
  ggplot(mydata,aes(x = Conflict, y = y_var, fill = Conflict)) +
  geom_violin() +
  theme_minimal()+
  facet_wrap(vars(Format), strip.position = "bottom")+
  stat_summary(fun.data = "mean_cl_normal", geom = "crossbar", fill = "white",alpha = .7) +
  geom_beeswarm(cex = 0.2) + 
  ggtitle(label = plot_title) +
  easy_center_title() +
  easy_remove_legend() +
  scale_x_discrete(name = NULL) +
  scale_y_continuous(name = y_title, limits = c(lim_1, lim_2)) +
  scale_fill_manual(values = c("slategray2", "lightpink1")) }
#Plotting Contradiction, Advancement and Confusion plots using function 
contradiction.plot <- figure.1.fun(y_var = mydata$contradiction, plot_title = "Contradiction", y_title = "Perceived Contradiction", lim_1 = 1, lim_2= 30) 
advancement.plot <- figure.1.fun(y_var = mydata$advancement, plot_title = "Advancement", y_title = "Perceived Scientific Advancement", lim_1 = -1, lim_2 = 1)
confusion.plot <- figure.1.fun(y_var = mydata$confusion, plot_title = "Confusion", y_title = "Confusion", lim_1 = 1, lim_2 = 5)
print(contradiction.plot)

print(advancement.plot)

print(confusion.plot)

Now I used the patchwork package to get all the individual plots together; however, the graphs were a bit warped and distorted, so the size was adjusted in the R chunk header using {r, fig.width=10, fig.height=11}.

# Combine plots using the patchwork package
combinedplots1 <- contradiction.plot + advancement.plot + confusion.plot + plot_layout(ncol = 2)
print(combinedplots1)

Pre-registered analysis of group means

This was another descriptive analysis for conflicting vs non-conflicting group means across different DVs. All I had to do was specify the data frame, indicate the grouping using group_by(), specify the column using summarise_at(vars()), and finally list() the mean. I repeated this process for advancement and confusion as well.

##Conflicting vs non. conflicting group means (pre-registered analysis)
#Perceived Contradiction
contradiction_means <- mydata %>%       
  group_by(Conflict) %>%                
  summarise_at(vars(contradiction),     
               list(name = mean))       
print(contradiction_means) 
## # A tibble: 2 x 2
##   Conflict   name
##   <fct>     <dbl>
## 1 Conf.      25.3
## 2 Non-Conf.  13.4
#Advancement
advancement_means <- mydata %>%         
  group_by(Conflict) %>%                
  summarise_at(vars(advancement),       
               list(name = mean))       
print(advancement_means) 
## # A tibble: 2 x 2
##   Conflict      name
##   <fct>        <dbl>
## 1 Conf.     -0.245  
## 2 Non-Conf.  0.00680
#Confusion
confusion_means <- mydata %>%          
  group_by(Conflict) %>%                
  summarise_at(vars(confusion),         
               list(name = mean))       
print(confusion_means) 
## # A tibble: 2 x 2
##   Conflict   name
##   <fct>     <dbl>
## 1 Conf.      4.52
## 2 Non-Conf.  3.65
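As an aside, since the three chunks above differ only in the column passed to vars(), they could also be collapsed into a single call; a sketch of that alternative:

#Alternative (sketch): all three group means in one summarise_at() call
mydata %>%
  group_by(Conflict) %>%
  summarise_at(vars(contradiction, advancement, confusion), mean)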

Descriptive Histogram

Plotting the advancement histogram was a bit more challenging, since it was my first attempt at a complex graph that wasn’t a violin plot.

One of the first attempts was thinking facet_wrap() could work to produce a single histogram when applied to geom_bar(). Then I thought I could manipulate the values and create side-by-side labels of “less” and “more”, but this failed to work.

However, following similar logic, I later worked out that since the plot had levels to it, the first step was to use the function ordered(), which can specify the levels of advancement (-1, 0, 1). These levels reflect whether participants rated the advancement of their scientific knowledge as -1 (less), 0 (same) or 1 (more). We assign this new variable ‘advancementvalue’ for the x axis.

Next, to plot the histogram, we use geom_bar() and specify the x axis variable as ‘advancementvalue’. The group and fill aesthetics use both conditions, which I do by stating Conflict:Format. I then formatted the plot as usual by stating the names of the X variable, the Y variable and the legend’s labels. One small issue I ran into was that the bars were stacked on top of each other.

The way to solve this was to include position = "dodge", which places the bars side by side, thus completing the histogram!!

# Histogram
advancementvalue <- ordered(mydata$advancement,
        levels = c(-1, 0, 1),
        labels = c("Less", "Same", "More")) #labeling the levels of advancement
advancement_histogram <- ggplot(mydata) + geom_bar(
  aes(
    x=advancementvalue, 
    group = Conflict:Format,    
    fill = Conflict:Format),
  position = "dodge")+     #plotting side by side
  scale_x_discrete(name = "Advancement")+
  scale_y_continuous(name = "Number of Participants")+
  scale_fill_grey(name = "Condition", 
                  labels = c("Conflicting/Generic", 
                             "Conflicting/Qualified",
                          "Non-conflicting/Generic",
                        "Non-conflicting/Qualified"))+
  theme_minimal()
print(advancement_histogram)

Experiment 2

Reading in the Data

Once everything came together for reconstructing and verifying the code for Experiment 1, doing the same for Experiment 2 became a lot easier! I utilize the same process but just change the data we read in, which is now “Study 7 data.csv” from the OSF, and refer to this data set as “mydata2”.

mydata2=read_survey("Study 7 data.csv")
mydata2<- as.data.frame(mydata2)

Filtering Data

Filtering the data for both unplanned and planned exclusions abides by the same criteria and the same functions. However, I had one unique issue with the number of participants when using slice(), where the data was filtered down to 406 participants instead of 400. Using the qualtRics package instead of slice is what solved this issue for me.

mydata2 %>% count(Progress>=1) 
##   Progress >= 1   n
## 1          TRUE 412
mydata2 <- mydata2 %>% rename(Recall_score = SC0, Condition = FL_12_DO)
mydata2 <- mydata2 %>% 
  filter(
    Consent == 1,
    Finished == 1,
    Serious_check == 1,
    Recall_score >= 4) %>% 
  select(
    Finished,
    `Duration (in seconds)`,
    Gender,
    Age,
    Serious_check, 
    Recall_score,
    Condition,
    NC_1:Development_sci_know_6)

mydata2 %>% summarise(n())
##   n()
## 1 400
Total_n2 <- mydata2 %>% summarise(n())
#Export the subset of data to a .csv
write_csv(mydata2, "MyDataSubset2.csv")

Descriptive statistics

Calculating the descriptive statistics also works the same way as in Experiment 1. The only difference is the number of participants in the data set, which is now N = 400 after the data filtering.

mydata2 %>% summarise(mean(Age), sd(Age), range(Age))
##   mean(Age)  sd(Age) range(Age)
## 1    33.465 12.03415         18
## 2    33.465 12.03415         73
mydata2 %>% count(Gender) 
##   Gender   n
## 1      1 150
## 2      2 248
## 3      3   2

Set up for the plots

Once again, the coding process for tidying up the IVs for the graphs is the same; the only difference is that the code to make a NEW variable using the mutate() function is now applied to five new variables: nutritional confusion, nutritional backlash, mistrust of expertise, certainty of knowledge, and development of knowledge (confidence in the scientific community is already a single column, ‘GSS’, so it needs no averaging).

One thing to note when creating each new variable score: after adding up the individual ratings, you have to divide by the total number of ratings, which differs from variable to variable. For example, mistrust of expertise has only three ratings and is thus divided by three, whereas nutritional backlash has six.

mydata2 <- separate(mydata2, Condition, c("block", "number", "Format", "Conflict"))
mydata2 <- mydata2 %>%
  mutate(Format=as.factor(Format)) %>%
  mutate(Conflict=as.factor(Conflict)) 
#Renaming factors of ‘Conflict’
mydata2$Conflict <- recode_factor(mydata2$Conflict, Conflict = "Conf.",Consistent = "Non-Conf.")
#calculate the average for each scale -- dplyr
#nutritional confusion mean
mydata2 <- mydata2 %>%
  mutate(confusion = (NC_1 + NC_2 + NC_3 + NC_4 + NC_5 + NC_6)/6)
#nutritional backlash mean
mydata2 <- mydata2 %>%
  mutate(backlash = (NBS_1 + NBS_2 + NBS_3 + NBS_4 + NBS_5 + NBS_6)/6)
#Mistrust of expertise mean
mydata2 <- mydata2 %>%
  mutate(mistrust = (Mistrust_expertise_1 + Mistrust_expertise_2 +
    Mistrust_expertise_3)/3)
#Confidence in scientific community = single column ('GSS')
#Certainty of knowledge mean
mydata2 <- mydata2 %>%
  mutate(certainty = (Certainty_sci_know_1 + Certainty_sci_know_2 +
    Certainty_sci_know_3 + Certainty_sci_know_4 + Certainty_sci_know_5 + 
    Certainty_sci_know_6)/6)
#Development of knowledge mean
mydata2 <- mydata2 %>%
  mutate(development = (Development_sci_know_1 + Development_sci_know_2 +
    Development_sci_know_3 + Development_sci_know_4 + Development_sci_know_5 +
    Development_sci_know_6)/6)
#Export the FINAL data to a .csv
write_csv(mydata2, "MyDataFinalSubset2.csv")
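As an aside, the repeated add-up-and-divide pattern above could be compressed by letting R average the columns directly; a sketch, assuming a dplyr version recent enough for across():

#Alternative (sketch): average contiguous rating columns without typing each one
mydata2 <- mydata2 %>%
  mutate(confusion = rowMeans(across(NC_1:NC_6)),
         backlash  = rowMeans(across(NBS_1:NBS_6)))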

Producing the plots

Now I had to once again create a function() and apply it to create all six plots: “Nutritional Confusion”, “Nutritional Backlash”, “Mistrust of Expertise”, “Confidence in Scientific Community”, “Certainty of Knowledge” and “Development of Knowledge”. All of the plots had the same y variable limits except for “Confidence in Scientific Community”, which had an upper limit of three instead of five.

##Violin ggplots
#Make our own function
figure.2.fun <- function(y_var, plot_title, y_title, lim_2 = 5) {
  ggplot(mydata2, aes(x= Conflict, y = y_var, fill = Conflict)) +
  geom_violin() +
  theme_minimal()+
  facet_wrap(vars(Format), strip.position = "bottom") +
  stat_summary(fun.data = "mean_cl_normal",geom = "crossbar", fill = "white",
    alpha = .7) +
  geom_beeswarm(cex = 0.2) +
  ggtitle(label = plot_title) +
  easy_center_title() +
  easy_remove_legend() +
  scale_x_discrete(name = NULL) +
  scale_y_continuous(name = y_title, limits= c(1,lim_2)) +
  scale_fill_manual(values = c("slategray2", "lightpink1")) } 
#Use function to make plots
nutritionalconfusion.plot <- figure.2.fun(y_var = mydata2$confusion, plot_title = "Nutritional Confusion", y_title = "Nutritional Confusion")
nutritionalbacklash.plot <- figure.2.fun(y_var = mydata2$backlash, plot_title = "Nutritional Backlash", y_title = "Nutritional Backlash")
mistrust.plot <- figure.2.fun(y_var = mydata2$mistrust, plot_title = "Mistrust of Expertise", y_title = "Mistrust of Expertise")
confidence.plot <- figure.2.fun(y_var = mydata2$GSS, plot_title = "Confidence in Scientific Community", y_title = "Confidence in Scientific Community", lim_2 = 3)
certainty.plot <- figure.2.fun(y_var = mydata2$certainty, plot_title = "Certainty of Knowledge", y_title = "Certainty of Knowledge")
development.plot <- figure.2.fun(y_var = mydata2$development, plot_title = "Development of Knowledge", y_title = "Development of Knowledge")
print(nutritionalconfusion.plot)

print(nutritionalbacklash.plot)

print(mistrust.plot)

print(confidence.plot)

print(certainty.plot)

print(development.plot)

Then, for the final time, I applied the same patchwork approach to combine all the individual plots together. While the process of coding Experiment 2 was simpler this time around, the satisfaction of succeeding in re-creating all the plots still felt super rewarding!!

# Combined plots
combinedplots2 <- nutritionalconfusion.plot + nutritionalbacklash.plot + mistrust.plot + confidence.plot + certainty.plot + development.plot + plot_layout(ncol = 2)
print(combinedplots2)

Part 3 - Exploratory Analysis

Loading Libraries

Both gt and gtsummary are packages used to create tables from data.

library(gt)
library(gtsummary)

Exploratory question 1

The first question I wanted to answer was whether age correlates with mistrust in science. I hypothesized that a higher age would correlate with a higher rate of mistrust in science. While Haigh and Birch did not hint at any generational effects in their variables, the paper’s emphasis on variables such as a participant’s mistrust in science and their confidence and certainty in the scientific community implies a need for a certain flexibility of mindset. This is required because the scientific consensus is always changing, which remains a “healthy and normal” component of science, and is thus less compatible with rigid mindsets.

This necessity for a flexible and open mind is at odds with the notion that older generations tend to have a higher resistance and “stubbornness” towards developing new ideas, as expressed by Heid et al. That is why I was curious to explore whether there was any generational effect present between “young” and “old” people.

I wanted two distinct age ranges, so young people are those aged 18-45 and old people those aged 46-73, with the split placed roughly at the midpoint of the sample’s age range.

  • Reading in the data using read_csv().
  • The select(Age, mistrust) specifies the two variables I want to utilize for the exploratory analysis.
  • Changing the data name to Age_Mistrust makes it clearer that it is filtered and separated from the original data set.
  • mutate() is used to add a new variable.
  • ‘age_by_group’ is the name of the new variable, since I want to group the participants by age into two categories, split using case_when().
  • Age <= 45 ~ "Young" specifies that all participants aged 45 and under are categorized as Young.
  • Age >= 46 ~ "Old" specifies that all participants aged 46 and over are categorized as Old.
ExploratoryData1 <- read_csv("MyDataFinalSubset2.csv")  # loading in the data 
  
  
Age_Mistrust <- ExploratoryData1 %>% 
    select(Age, mistrust) %>% 

#Adding a new variable "age_by_group"
  mutate(
    age_by_group = case_when(
      Age <= 45 ~ "Young", 
      Age >= 46 ~ "Old")
    )

Descriptive

I then take the new variable and use summarise() to get the mean, standard deviation, number of participants and standard error. I used gt() to table these results, tab_header() to write the table’s title and fmt_number() to round each variable to 2 decimal places. In the end, I added tab_options() just to add a bit of colour to the table!

Age_Mistrust %>% 
  group_by(age_by_group) %>% 
  summarise(mean = mean(mistrust), 
            sd = sd(mistrust), 
            n = n(),
            se = sd/sqrt(n)) %>%  
gt() %>% 
  tab_header(title = md("Mean of Mistrust in Science by Age Group")) %>% 
  fmt_number(columns = vars(mean, sd, se), decimals = 2) %>%   
  tab_options(heading.background.color = "#EFFBFC")
Mean of Mistrust in Science by Age Group

  age_by_group    mean     sd      n     se
  Old             2.07   0.71     72   0.08
  Young           1.95   0.63    328   0.03

Also wanting a better reflection of these descriptives, I use the dataset and make a plot:

  • ‘mistrust_age_plot2’ is the name of the new plot, relating it to the variables ‘Age’ and ‘mistrust’.
  • geom_jitter() is used to include the scatterplot, with a bit of noise so the points do not stack on top of each other. A prior version used a uniform colour, which made it hard to distinguish the young points from the old; implementing color = age_by_group within geom_jitter() made the difference much clearer.
  • geom_smooth(method = "lm")is for the line of best fit with “lm” standing for linear model.
  • theme(legend.title = element_blank()) was just to remove the ggplot legend title, since it looks a little cleaner without “age_by_group”.

Data Visualization

#Mistrust and Age Plot 
mistrust_age_plot2 <- ggplot(
  data = Age_Mistrust, 
  aes(
    x = Age, 
    y = mistrust)) + 
  geom_jitter(aes(color = age_by_group)) + 
  theme_minimal()+
  geom_smooth(method = "lm")+
  scale_y_continuous(name = "Mistrust in Science")+
  ggtitle(label = "Mistrust in Science Grouped by Age")+
  theme(legend.title = element_blank())

print(mistrust_age_plot2)

Inferential statistics

The scatterplot shows some kind of trend between mistrust in science and age, but whether this trend is statistically significant needs to be checked with a correlation test. cor.test() states what the correlation value is and lets you know whether it is statistically significant.

cor.test(ExploratoryData1$Age, ExploratoryData1$mistrust)
## 
##  Pearson's product-moment correlation
## 
## data:  ExploratoryData1$Age and ExploratoryData1$mistrust
## t = 3.0808, df = 398, p-value = 0.002208
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05539569 0.24697431
## sample estimates:
##       cor 
## 0.1526184
# So the result is as follows: t = 3.0808, df = 398, p-value = 0.002208

Since the correlation has a p-value < 0.05, there is a statistically significant relationship between age and mistrust in science. The correlation is 0.152, which by the usual benchmarks for Pearson’s r (small ≈ .1, medium ≈ .3, large ≈ .5) equates to a small effect size. Thus, according to the results from this experiment, a higher age is weakly but significantly correlated with a higher rate of mistrust in the scientific community.

Exploratory Question 2

Another question that piqued my interest utilizes people’s recall scores. I wanted to compare people with a higher recall score (>=7) against people with a lower recall score (<=6) and see how this relates to their development of scientific knowledge. My hypothesis is that people in the experiment with a higher recall score have a slight edge in attention to detail, and thus pay more attention to the nuance within the scientific community. While Haigh and Birch intended the recall score simply as a filtering criterion, another study by Alloway & Alloway (2010) provides precedent for strong links between working memory and learning. As in Exploratory Question 1, I used mutate() and case_when() to specify the two groups of less and more recall.

Recall_Development <- ExploratoryData1 %>% 
    select(Recall_score, development) %>% 

#Adding a new variable "recall_by_group"
  mutate(
    recall_by_group = case_when(
      Recall_score <= 6 ~ "Less_Recall", 
      Recall_score >= 7 ~ "More_Recall"
    ))

Descriptives

Development refers to the “epistemic beliefs about the development of scientific knowledge”; on a five-point scale, it dealt with the belief that science is an evolving and changing subject. The higher the score, the more sophisticated the participant’s beliefs, reflecting greater awareness that science is uncertain and constantly evolving. This descriptive is grouped by recall and reports the mean of development.

Recall_Development %>% 
  group_by(recall_by_group) %>% 
  summarise(mean = mean(development), 
            sd = sd(development), 
            n = n(),
            se = sd/sqrt(n)) %>%  
gt() %>% 
  tab_header(title = md("Mean of Development in Scientific Knowledge by Recall Group")) %>% 
  fmt_number(columns = vars(mean, sd, se), decimals = 2) %>%   
  tab_options(heading.background.color = "orange")
Mean of Development in Scientific Knowledge by Recall Group

  recall_by_group    mean     sd      n     se
  Less_Recall        4.36   0.60     82   0.07
  More_Recall        4.59   0.42    318   0.02

According to the table, those with more recall (>=7) had a slightly higher mean (4.59) than those with less recall (4.36), indicating that those with more recall had slightly higher development beliefs overall. Whether this difference is meaningful is up to the inferential statistics to show.

Inferential statistics

I utilized a t-test to compare the means and see whether there is any statistically significant difference between the groups. I created two data frames, Less_Recall and More_Recall, which separate more and less recall into two data sets, allowing t.test() to look at the development scores separately and compare the means.

Less_Recall <- Recall_Development %>% 
  filter(recall_by_group == "Less_Recall")

More_Recall <- Recall_Development %>% 
  filter(recall_by_group == "More_Recall")

t.test(Less_Recall$development, More_Recall$development)
## 
##  Welch Two Sample t-test
## 
## data:  Less_Recall$development and More_Recall$development
## t = -3.3499, df = 102.39, p-value = 0.001133
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.37337458 -0.09566463
## sample estimates:
## mean of x mean of y 
##  4.357724  4.592243

As P<0.05, it indicates that the difference in the development of scientific knowledge between those with a high and low recall to a statistically significant degree.
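As a side note, the same Welch test can be run without splitting the data at all, using the formula interface; a sketch:

#Equivalent (sketch): formula interface avoids the two extra data frames
t.test(development ~ recall_by_group, data = Recall_Development)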

Data Visualization

#Recall and Development plot 
recall_development_plot <- ggplot(
  data = Recall_Development, 
  aes(
    x = recall_by_group, 
    y = development, fill = recall_by_group)) +
  geom_boxplot(aes(fill= recall_by_group),
                   alpha = 0.6, 
                   show.legend = FALSE) + 
  scale_y_continuous(name = "Development of Scientific Knowledge", limits = c(1,5))+
  scale_x_discrete(name = "Recall Groups") +
  ggtitle(label = "Effect of Recall on Development of Scientific Knowledge")+ 
  theme_minimal()

  
print(recall_development_plot)

The statistically significant t-test and the plot both showcase that a participant’s recall ability is related to their beliefs about the development of scientific knowledge. To be specific, those with more recall of the headlines scored higher on development of scientific knowledge than those with less recall.

Exploratory Question 3

I wanted to combine the variables of nutritional backlash and nutritional confusion into a new single measure, “nutritional knowledge”. But first, to see whether I could combine them, I needed to check that they are correlated with each other; otherwise it makes no sense to combine the two into a single new variable.

Correlation Test

cor.test(ExploratoryData1$confusion,ExploratoryData1$backlash)
## 
##  Pearson's product-moment correlation
## 
## data:  ExploratoryData1$confusion and ExploratoryData1$backlash
## t = 10.52, df = 398, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3860562 0.5398137
## sample estimates:
##       cor 
## 0.4664511

The correlation value of 0.4664511 demonstrates a moderately strong correlation, and thus gave me precedent to combine the two into a new variable, “Nutritional Knowledge” (NK). I then wanted to compare this new variable against gender to see whether there is any difference in nutritional knowledge between men and women. A study by Parmenter, Waller and Wardle (2000) on demographic variation in nutritional knowledge in England showed that men had significantly poorer knowledge than women, so I wanted to see whether this experiment’s results reflect that.

In terms of filtering the data, usually I only need to select two variables; however, for this exploratory analysis, since I wanted to combine two, I had to include both of them along with gender. I filtered the data set to keep only Gender values less than 3, as only two participants opted as neither male nor female, which is insufficient for any meaningful statistics. I then changed the gender values using mutate() from 1 and 2 to Male and Female respectively, for better clarity.

# Filtering Data 

NK_Gender <- ExploratoryData1 %>% 
    select(backlash, confusion, Gender) 

NK_Gender <- NK_Gender %>% 
  filter(Gender < 3) %>% 
  mutate(Gender = case_when(
    Gender == 1 ~ "Male",
    Gender == 2 ~ "Female"))

I first tried to combine the values by using merge() on the data sets, thinking it would combine the values; however, it failed to do so and just overlapped them together, essentially doing nothing.

Backlash <- NK_Gender

Confusion <- NK_Gender
nutritionalknowledge1 <- merge(Backlash, Confusion, by = "Gender")

I looked for other ways to create the new variable and knew I could either use mutate() or use newvariable <- oldvariable. I opted for the latter in this case and assigned the sum of the old variables; since they’re both on the same 1-5 scale, I simply needed to divide by 2 to get the average value per participant. In all honesty, I would not have been able to do this component without help from this coding blog, and I would just like to use this as an opportunity to express how far I’ve come in terms of googling, and my successful takeaways:

  • Make sure to know what key words to use when googling, in this case which was merging/combining variable values.
  • Blog posts and forums are some of the best places to look for solutions, over the help section of R as they use an exemplary step-by-step format. But of course consulting both is the optimal way to go.
NK_Gender$NutritionalKnowledge <- ((NK_Gender$backlash + NK_Gender$confusion)/2)

print(NK_Gender) 
## # A tibble: 398 x 4
##    backlash confusion Gender NutritionalKnowledge
##       <dbl>     <dbl> <chr>                 <dbl>
##  1     1.5       1.83 Female                 1.67
##  2     3         2.67 Male                   2.83
##  3     1.83      1.67 Female                 1.75
##  4     3.67      3.33 Female                 3.5 
##  5     2.33      4    Male                   3.17
##  6     2.83      3.17 Male                   3   
##  7     3.5       3.5  Female                 3.5 
##  8     2.83      2.5  Female                 2.67
##  9     2.67      3    Female                 2.83
## 10     2.67      3    Female                 2.83
## # ... with 388 more rows
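For completeness, the mutate() route mentioned above would have produced the same column; a sketch:

#The mutate() alternative (sketch) -- identical result inside a pipe
NK_Gender <- NK_Gender %>%
  mutate(NutritionalKnowledge = (backlash + confusion)/2)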

Descriptive

I use the same method to get the descriptives, but this time create a new data frame, ‘NK_Gender_Descriptives’, so I can use it for the descriptive plot. The fortify() function ensures the new descriptive set is a plain data frame so its variables can be used in the plot.

#Descriptive Table 
NK_Gender_Descriptives <- NK_Gender %>% 
  group_by(Gender) %>% 
  summarise(mean = mean(NutritionalKnowledge), 
            sd = sd(NutritionalKnowledge), 
            n = n(),
            se = sd/sqrt(n)) %>% 
  
fortify(NK_Gender_Descriptives)

gt(NK_Gender_Descriptives) %>% 
  tab_header(title = md("Mean of Nutritional Knowledge by Gender")) %>% 
  fmt_number(columns = vars(mean, sd, se), decimals = 2) %>%   
  tab_options(heading.background.color = "blue")
Mean of Nutritional Knowledge by Gender

  Gender    mean     sd      n     se
  Female    2.77   0.60    248   0.04
  Male      2.77   0.60    150   0.05

Inferential Statistics

The mean values for males (M = 2.766) and females (M = 2.770) are extremely close to each other; however, I will use inferential statistics to confirm this through a t-test.

#Creating new frames and t.test 
Female_Knowledge <- NK_Gender %>% 
  filter(Gender == "Female")

Male_Knowledge <- NK_Gender %>% 
  filter(Gender == "Male")

t.test(Male_Knowledge$NutritionalKnowledge, Female_Knowledge$NutritionalKnowledge)
## 
##  Welch Two Sample t-test
## 
## data:  Male_Knowledge$NutritionalKnowledge and Female_Knowledge$NutritionalKnowledge
## t = -0.073927, df = 312.07, p-value = 0.9411
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1271893  0.1179778
## sample estimates:
## mean of x mean of y 
##  2.765556  2.770161

The p-value = 0.9411, which clearly shows no significant difference in Nutritional Knowledge between genders, contrary to the results expressed by Parmenter, Waller and Wardle (2000).

Data Visualization

This is my first plot that utilizes the summary statistics rather than the entire data frame, so it took a bit longer to adjust to, as I had to create a separate data frame for NK_Gender under Descriptives. The important change from the other plots is that, since it uses summarized data, the y variable should equal the mean from the descriptive data frame created in the descriptive chunk.

NK_Gender_Plot <- ggplot(data = NK_Gender_Descriptives, 
                         aes(
                           x = Gender,
                           y = mean, 
                           fill = Gender)) +
    geom_col()+
    theme_minimal()+ 
    scale_y_continuous(name = "Nutritional Knowledge", limits = c(0,5))+
    scale_x_discrete(name = "Gender")+
    ggtitle(label = "Effect of Gender on Nutritional Knowledge")

print(NK_Gender_Plot)

To conclude, the very close means in the descriptives, their visual representation in the bar plot, and the analysis from the t-test all reinforce that gender had no effect on the amount of nutritional knowledge a participant had.

Part 4 - Recommendations

Naming items and files consistently

One of the first issues we spotted as a group is that the authors were a little lax when it came to naming items with purpose and consistency. Notable examples are the oddly named variables SC0 and FL_10_DO; it would have been easier to name them properly from the beginning, so that working with them does not require renaming them to what they were originally meant to represent. This issue was present in both data sets and could have been avoided if the values had been tabulated with more care.

Another example is the data file names, with “Study 8 data.csv” being used for the first experiment and “Study 7 data.csv” for the second, which seems logically backwards. Any initial confusion could be avoided by simply naming the files more consistently with what the paper showcases, such as “Exp_1_Data” and “Exp_2_Data”. A related issue is the use of spaces in the file names: some programs have trouble processing empty space, so underscores (_) or periods (.) should have been used instead.
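To make this concrete, a sketch of what the import step could look like with clearer names (the file name here is hypothetical):

#Sketch with a hypothetical, self-describing file name and immediate renames
exp1_data <- read_survey("Exp_1_Data.csv") %>%     #hypothetical file name
  rename(Recall_score = SC0, Condition = FL_10_DO) #fix opaque column names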

Now this may seem like a pedantic suggestion; however, the importance of consistent naming goes far beyond a few oddly named variables. The habit extends to all file and variable names for the sake of open-science reproducibility. The underlying principle is that everything, including the code, is open source, so emphasis must be placed on making it easy for other researchers to work with. The suggestion that has helped me is to name variables and files with the mindset that someone else has to read them and immediately understand what they represent! There is a great talk by Danielle Navarro on how to organise one’s workflow, with specific reference to how to properly name files.

Rubberducking and code lacking justification

A lot of their code explained WHAT to do, but failed to explain WHY they were using particular functions or packages. This made it frustrating when their functions would not apply, as we had to figure out for ourselves, via trial and error, what the purpose of each function was and then search for a substitute that could do the same. Their use of pirateplot() over ggplot is a great example, as no justification for it is given.


Functions such as formula and ylab did not apply to our code, and so, once again, the relevant functions on our side had to be searched for manually through trial and error. When you’re the author of the paper and the writer of the code, it all becomes increasingly familiar, and the need to break everything down line by line can seem pointless. However, I would argue that a meticulous rubber-ducking process is not only beneficial to others, but to yourself as well.

I had so many moments where I was stumped as to the purpose of a certain function or chunk, but rubber-duck debugging served as a consistent reminder of its purpose and made the process of coding a lot less stressful. So my biggest suggestion would be to explain every single line of code as if a complete novice had to attempt it. I think this is best encapsulated by Kenny Kelly on Learning R: Rubber ducking: “I mentioned that I hadn’t been taking rigorous notes, so some of the code went over my head. It’s useful for me to look at code line by line, argument by argument and break down what is being done.” I believe this to be a simple yet useful exemplar of how one should rubber duck.
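In practice this can be as simple as a comment on every line; a trivial sketch using a line from Part 2:

#Rubber-duck style commenting (sketch)
mydata %>%                        #start from the cleaned Experiment 1 data
  count(duplicated(Prolific_PID)) #count repeat survey submissions per ID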

The second, and arguably just as important, reason for rubber-ducking is that it future-proofs your code. This is an especially relevant point, as coding languages and functions constantly evolve and older ones become obsolete, so it is a very real experience that older code no longer runs in R. By documenting the intent behind each function, other researchers will be able to easily discern what current functions they should use in its place.

Tidying code better

If some time were taken to tidy the code and make it more efficient, it would be easier for other people to verify the data. With the push for more open science in the scientific community, verifying code needs to be efficient, and the stark difference between tidied and non-tidied code paints this picture clearly. I believe the clearest example is the difference between their lines of code for the plots and ours. Their plots required hundreds of lines of code, whereas we managed to produce exactly the same plots in under a hundred lines. The difference here is efficiency, and the fact that we took the time to make the code as streamlined as possible.

My suggestion would be to search for areas where code can be improved, shortened and tidied by using alternative functions. A very simple example is how we changed the code for centering the title from theme(plot.title = element_text(hjust = 0.5)) to easy_center_title(). The greatest revelation for efficiency, however, was of course the use of functions; one source that covers this, along with a whole array of other methods to make coding more efficient, can be found here! Beyond that, googling relevant functions and packages that produce the same output, as evident here where other related functions are listed, could tidy the code up further.