My goals for this week

  1. Finish recording for our slides and attend the group presentations!!
  2. Attend the Q&A and take notes on what Jenny is saying
  3. Try one exploratory analysis, it does not have to be fully refined - just trying to see what I can do on my own~

How I achieved that

I believe I was able to achieve my goals this week at a comfortable rate. We finished the recording with our group and celebrated when it was done, could not have asked for a better group!! Attending the Q&A was really useful. It was a bit long, but I believe that was necessary to give us all the tools to tackle the exploratory analysis section of the verification report, so a massive shout-out to Jenny for all the guidance! I think I was able to trial at least one piece of exploratory analysis on my own, but there’s a lot that I’m still unsure about, so I need to focus on a deep dive in Week 9.

Successes

Completed all my goals for the week!

Challenges

I think my ability with exploratory analysis is nowhere near as refined as it could be, and a lot of what I did was based on the help from the Q&A. I think one of my greatest issues is still finding relevant sources online when I google. I need to spend further time trying to understand how to table the descriptive statistics better, justify why I’m doing so, and clean up both the code and the graph, and then do the same for the statistical tests.

Steps for Next Week

  1. As embarrassing as it is, I still have not even started the verification report - I’ve been referring to it all this time without even a physical document, so an immediate step 1 is to create the document and include all the subheaders that are mentioned on the internship website!!
  2. Come up with 3 finalized exploratory analysis questions
  3. Refine all the components for question 1 and, if possible, do the same for all three questions

Loading in the packages from the Week 8 Q&A

Currently, I’m not sure which packages I will have to use. The only one from the Q&A that I did not include is the package for the penguins data set, since it only supplies that data set and I already have a working data set from the Haigh Journal article.

library(ggpubr)
## Loading required package: ggplot2
library(jmv)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks rstatix::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()
library(here)
## here() starts at D:/Sync/UNSW - Year 3 THIS IS THE YEAR BABY/Term 2/PSYC3361 - Research Internship/R Markdown Projects/Group_1_projects
library(janitor)
## 
## Attaching package: 'janitor'
## The following object is masked from 'package:rstatix':
## 
##     make_clean_names
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ggeasy)
library(readxl)
library(gt)
library(gtsummary)

Dot points from the Week 8 Q&A

  • Have to think about 1 of my exploratory analysis questions
  • What type of variables will you be looking at (continuous or categorical)?
  • Draw a plot with your x and y axes labeled and do your best to sketch what you think your results would look like. Example from the penguins dataset: are males heavier than females? (x = sex, y = body mass, type of plot = boxplot)
  • Note: don’t need to justify the type of plot, it will make sense just based on your variables. Do relate the question to other articles as a point of reference and give justifications for your prediction
  • Make an analysis plan: what the variables will be, what types of plot, what types of statistical tests. The overall aim is that there are different types of visualization

The first question that came to my mind was “Is age correlated with mistrust in science?” To be more specific, I would hypothesize that a higher age correlates with a higher level of mistrust in science. This idea isn’t based on research yet; it is just an initial starting point to test whether this angle would work for an exploratory analysis!

Descriptive statistics

So first things first is to get the descriptive statistics. To do that, I have to identify the type of each variable, and I believe age and mistrust in science are both continuous. I was also thinking about treating age as a categorical variable by splitting it into ‘high age’ and ‘low age’ along the average, but I will stick with the former for now.
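For reference, the ‘high age’ / ‘low age’ split could look something like the sketch below. It is only a rough illustration: toy_data and its values are made up, and age_group is a hypothetical column name, not something from the real data set.

```r
library(dplyr)

# Hypothetical toy data standing in for the real Age and mistrust columns
toy_data <- data.frame(
  Age      = c(18, 22, 25, 31, 40, 47, 55, 63),
  mistrust = c(1.7, 2.0, 1.3, 2.3, 2.0, 1.7, 2.7, 2.3)
)

# Split Age at its mean: above the mean is "high age", otherwise "low age"
toy_split <- toy_data %>%
  mutate(age_group = if_else(Age > mean(Age), "high age", "low age"))

# Count how many participants fall in each group
toy_split %>% count(age_group)
```

One thing to keep in mind is that splitting a continuous variable into two groups throws away information about how far apart the ages are, which is part of why keeping age continuous seems like the safer starting point.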

‘ExploratoryData1’ is the name of the new data set I am working with, and it is read in using the read_csv() function. select() is used to keep only the variables I want to work with, which are ‘Age’ and ‘mistrust’. The cor() function is used for the correlation between two quantitative variables.
  • It’s noted in one of the articles I was reading that the correlation between variables (X,Y) is equal to the correlation between variables (Y,X).
  • The correlation is computed as the Pearson correlation by default when using this function.
  • na.omit() is used to get rid of any NAs just in case. I’m not sure if it matters for this data though, since the data is already tidied.
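The points about cor() and removing NAs can be checked on a small example. This is a minimal sketch on made-up vectors (age, mistrust and toy_data are hypothetical stand-ins, not the real data), using only base R.

```r
# Hypothetical toy vectors; one age is missing on purpose
age      <- c(18, 22, 25, 31, 40, NA, 55, 63)
mistrust <- c(1.7, 2.0, 1.3, 2.3, 2.0, 1.7, 2.7, 2.3)

# With its default settings, cor() returns NA if any value is missing
is.na(cor(age, mistrust))

# na.omit() drops every row that contains an NA
toy_data <- na.omit(data.frame(Age = age, mistrust = mistrust))
nrow(toy_data)

# The order of the two variables does not change the correlation
r_xy <- cor(toy_data$Age, toy_data$mistrust)
r_yx <- cor(toy_data$mistrust, toy_data$Age)
all.equal(r_xy, r_yx)

# Asking for Pearson explicitly gives the same value as the default
r_pearson <- cor(toy_data$Age, toy_data$mistrust, method = "pearson")
all.equal(r_xy, r_pearson)
```

So na.omit() only matters if the selected columns actually contain missing values, but it is a cheap safety step either way.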

ExploratoryData1 <- read_csv("MyDataFinalSubset2.csv") %>% # loading in the data 
  select(Age, mistrust)
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   block = col_character(),
##   Format = col_character(),
##   Conflict = col_character()
## )
## i Use `spec()` for the full column specifications.
ExploratoryData1 <- ExploratoryData1 %>% 
  na.omit()

Descriptive statistics

In terms of getting the descriptives, I’m just using the template that was shown in the Week 8 Q&A, and I’m not sure this is even the right way to table it. I want to ask the right questions on Slack about how I would fix this, because the Q&A example for continuous variables used the contains("length") function, so it was able to include both bill and flipper length.
  • What I think it’s done is group together all the participants of the same age (for example, everyone aged 18), but I’m not sure I want them grouped like that and might instead want every data point individually.
  • I do need to put greater thought into the descriptives and as to WHY I’m doing this. The issue may stem from it not even being proper to table two continuous variables.
  • gt() is used to table the results

ExploratoryData1 %>% 
  group_by(Age) %>% 
  summarise(mean = mean(mistrust), 
            sd = sd(mistrust), 
            n = n(),
            se = sd/sqrt(n)) %>%  
  gt()
Age mean sd n se
18 1.805556 0.4371241 12 0.12618685
19 2.030303 0.4334499 11 0.13069005
20 2.070175 0.9201379 19 0.21109410
21 1.833333 0.6992059 16 0.17480147
22 1.473684 0.4488090 19 0.10296384
23 1.541667 0.3959116 8 0.13997590
24 1.825397 0.4548388 21 0.09925396
25 1.750000 0.6382847 16 0.15957118
26 1.809524 0.6877615 21 0.15008186
27 1.846154 0.5021322 13 0.13926642
28 1.958333 0.6025738 8 0.21304203
29 2.228070 0.6763430 19 0.15516373
30 2.176471 0.5542468 17 0.13442460
31 2.176471 0.6467617 17 0.15686275
32 1.784314 0.5644709 17 0.13690431
33 1.952381 0.7559289 7 0.28571429
34 2.454545 0.8065389 11 0.24318064
35 2.333333 0.6356417 12 0.18349396
36 1.933333 0.5837300 10 0.18459164
37 1.948718 0.4683729 13 0.12990328
38 1.916667 0.6309898 4 0.31549491
39 1.888889 0.3849002 3 0.22222222
40 2.000000 0.8819171 3 0.50917508
41 1.800000 0.4472136 5 0.20000000
42 2.121212 0.5430925 11 0.16374856
43 2.142857 0.5727498 7 0.21647907
44 2.083333 0.6871843 4 0.34359214
45 1.583333 0.7876359 4 0.39381797
46 1.714286 0.4879500 7 0.18442778
47 1.916667 0.8766519 4 0.43832594
48 1.777778 0.5018484 6 0.20487877
49 1.600000 0.3651484 5 0.16329932
50 2.277778 0.7428673 6 0.30327431
51 1.916667 0.1666667 4 0.08333333
52 1.833333 1.0000000 4 0.50000000
53 2.133333 0.6912147 5 0.30912062
54 2.000000 NA 1 NA
55 2.666667 NA 1 NA
56 2.083333 0.1666667 4 0.08333333
57 2.166667 0.2357023 2 0.16666667
58 2.888889 1.8358568 3 1.05993245
59 1.833333 1.1785113 2 0.83333333
60 2.666667 0.9428090 2 0.66666667
61 2.000000 0.0000000 2 0.00000000
62 2.444444 0.8388705 3 0.48432210
63 2.666667 1.1547005 3 0.66666667
64 2.000000 0.4714045 4 0.23570226
65 2.333333 0.9428090 2 0.66666667
72 2.000000 NA 1 NA
73 2.333333 NA 1 NA
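
One possible alternative, since both variables are continuous, is to summarise each variable across the whole sample instead of grouping by Age. The sketch below is only an assumption about what that could look like: toy_data and overall_descriptives are made-up names and the values are not the real data. It also shows why some rows in a grouped table come out as NA: sd() cannot be computed from a single value, so any age with n = 1 has no sd or se.

```r
library(dplyr)

# Hypothetical toy data standing in for the real Age and mistrust columns
toy_data <- data.frame(
  Age      = c(18, 22, 25, 31, 40, 47, 55, 63),
  mistrust = c(1.7, 2.0, 1.3, 2.3, 2.0, 1.7, 2.7, 2.3)
)

# One row of descriptives for the whole sample, no grouping by Age
overall_descriptives <- toy_data %>%
  summarise(
    n = n(),
    across(c(Age, mistrust), list(mean = mean, sd = sd))
  )

overall_descriptives

# sd() of a single observation is NA, which explains the NA rows for n = 1
is.na(sd(2.0))
```

The result has the columns n, Age_mean, Age_sd, mistrust_mean and mistrust_sd, and could then be piped into gt() in the same way as the grouped table.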

Data visualization

  • mistrust_age_plot is the name of the new plot, to relate it to the variables of 'Age' and 'Mistrust'
  • ggplot(data = ExploratoryData1) means the plot uses the exploratory data set
  • x = Age specifies that along the x axis is Age
  • y = mistrust specifies that along the y axis is the rate of mistrust in the scientific community
  • geom_point() is used to include the scatterplot
  • geom_smooth(method = "lm") is for the line of best fit, with "lm" standing for linear model, and was taught to us by Jenny in the Week 8 Q&A!
  • theme_minimal() just changes the background of the plot from grey to white

#Mistrust and Age Plot 
mistrust_age_plot <- ggplot(
  data = ExploratoryData1, 
  aes(
    x = Age, 
    y = mistrust
  )
) + 
  geom_point() + 
  geom_smooth(method = "lm") + #line of best fit
  theme_minimal()

print(mistrust_age_plot)
## `geom_smooth()` using formula 'y ~ x'
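
Since the Q&A notes mention labelling the x and y axes, here is a rough sketch of how labels could be added with labs(). The data is made up, and the label text (including "years" and "mean score") is my own assumption about the units rather than something confirmed from the data set.

```r
library(ggplot2)

# Hypothetical toy data standing in for the real Age and mistrust columns
toy_data <- data.frame(
  Age      = c(18, 22, 25, 31, 40, 47, 55, 63),
  mistrust = c(1.7, 2.0, 1.3, 2.3, 2.0, 1.7, 2.7, 2.3)
)

labelled_plot <- ggplot(toy_data, aes(x = Age, y = mistrust)) +
  geom_point() +
  # Stating the formula explicitly avoids the "using formula" message
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    x = "Age (years)",                        # assumed unit
    y = "Mistrust in science (mean score)",   # assumed scale description
    title = "Mistrust in science by age (toy data)"
  ) +
  theme_minimal()

print(labelled_plot)
```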

Correlation test

cor() gives the correlation coefficient. cor.test() reports the same correlation and also lets you know whether it is statistically significant, so you can just use this one as cor() is redundant. I’ll include both for now, just to show that they can both be utilised! As for the subsequent analysis and the IMPORTANCE of the statistics, I still have a lot I need to catch up on for the stats.

cor(ExploratoryData1$Age, ExploratoryData1$mistrust)
## [1] 0.1526184
cor.test(ExploratoryData1$Age, ExploratoryData1$mistrust)
## 
##  Pearson's product-moment correlation
## 
## data:  ExploratoryData1$Age and ExploratoryData1$mistrust
## t = 3.0808, df = 398, p-value = 0.002208
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05539569 0.24697431
## sample estimates:
##       cor 
## 0.1526184
# So the result is as follows: r = 0.15, t = 3.0808, df = 398, p-value = 0.002208
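
To help me understand how these numbers fit together, here is a small sketch in base R. The first part uses made-up vectors (age, mistrust and toy_test are hypothetical) to show that the object returned by cor.test() already holds the value cor() gives. The second part plugs in the r and df reported in the output, using the standard formula t = r * sqrt(df) / sqrt(1 - r^2), to see where the t value comes from.

```r
# Part 1: hypothetical toy data
age      <- c(18, 22, 25, 31, 40, 47, 55, 63)
mistrust <- c(1.7, 2.0, 1.3, 2.3, 2.0, 1.7, 2.7, 2.3)

toy_test <- cor.test(age, mistrust)

# The estimate stored in the test object is the same number cor() returns
r_from_test <- unname(toy_test$estimate)
all.equal(r_from_test, cor(age, mistrust))

# The p-value and confidence interval can be pulled out by name
toy_test$p.value
toy_test$conf.int

# Part 2: values copied from the reported cor.test() output
r  <- 0.1526184
df <- 398

# t statistic for a Pearson correlation; compare with the reported t = 3.0808
t_check <- r * sqrt(df) / sqrt(1 - r^2)
t_check

# Squaring r gives the proportion of shared variance (roughly 0.023 here)
r_squared <- r^2
r_squared
```

So although the p-value is small, an r of about 0.15 means age and mistrust share only around 2% of their variance, which is something I should keep in mind when writing up the IMPORTANCE of the result.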