Week 9

Goals

My goals this week were:

  1. Pick another question for exploratory analysis

  2. Obtain descriptive statistics for the question

  3. Plot the data as a boxplot

  4. Obtain inferential statistics for the question

Question 2 for exploratory analysis

For my second question, I chose to answer:

  • Does participants’ education level affect their dogmatism scores? More specifically, is there a difference in dogmatism scores between the highest and lowest levels of education?

Coding Steps

Step 1: Renaming the variables

As I did for male and female last week, I renamed all the numerical values in the education column within data_attn to their corresponding categorical name (character).

I then placed this into a tibble to check that the column type had changed to character and that the values themselves had been renamed correctly.

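# Packages used in this log
library(tidyverse)  # dplyr, tibble, ggplot2
library(gt)         # gt() tables
library(ggeasy)     # easy_all_text_size(), easy_remove_legend()

data_attn$education[data_attn$education == 1] <- "No_school"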
data_attn$education[data_attn$education == 2] <- "Eighth_grade_or_less"
data_attn$education[data_attn$education == 3] <- "More_than_eighth grade_less_than_highschool"
data_attn$education[data_attn$education == 4] <- "Highschool_or_equivalent"
data_attn$education[data_attn$education == 5] <- "Some_college"
data_attn$education[data_attn$education == 6] <- "Four_year_college_degree"
data_attn$education[data_attn$education == 7] <- "Graduate_or_professional_training"
data_attn$education[data_attn$education == 8] <- "Other"
tibble(data_attn$education)
## # A tibble: 707 x 1
##    `data_attn$education`            
##    <chr>                            
##  1 Eighth_grade_or_less             
##  2 Graduate_or_professional_training
##  3 Eighth_grade_or_less             
##  4 Four_year_college_degree         
##  5 Four_year_college_degree         
##  6 Four_year_college_degree         
##  7 Highschool_or_equivalent         
##  8 Four_year_college_degree         
##  9 Four_year_college_degree         
## 10 Highschool_or_equivalent         
## # … with 697 more rows
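The eight assignments above could also be collapsed into a single lookup. This is only a sketch on a made-up vector of codes (education_codes is hypothetical and stands in for data_attn$education before recoding); the labels are copied from the code above.

```r
# Hypothetical stand-in for the numeric education codes (1-8), with one missing value.
education_codes <- c(2, 7, 2, 6, 4, NA)

# Lookup vector: position i holds the label for code i.
education_labels <- c(
  "No_school",
  "Eighth_grade_or_less",
  "More_than_eighth grade_less_than_highschool",
  "Highschool_or_equivalent",
  "Some_college",
  "Four_year_college_degree",
  "Graduate_or_professional_training",
  "Other"
)

# Indexing the labels by the codes recodes every value in one step;
# a missing code stays NA.
education_chr <- education_labels[education_codes]
education_chr
```

Because the mapping lives in one vector, a mislabelled level only has to be fixed in one place.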

Step 2: Education descriptive statistics summary

Again, similar to obtaining the descriptive statistics from last week, I created a new data frame that showed the number, proportion and percentage of participants’ education levels using the summarise() function. This was then placed into a table using gt().

education_summary <- data_attn %>%
  group_by(education) %>%
  summarise(n = n(), proportion = n / 707, percentage = proportion * 100)
education_summary %>% gt()
education                                      n  proportion  percentage
Eighth_grade_or_less                         152  0.21499293   21.499293
Four_year_college_degree                     249  0.35219236   35.219236
Graduate_or_professional_training             93  0.13154173   13.154173
Highschool_or_equivalent                     134  0.18953324   18.953324
More_than_eighth grade_less_than_highschool   13  0.01838755    1.838755
No_school                                     11  0.01555870    1.555870
Other                                         10  0.01414427    1.414427
Some_college                                  45  0.06364922    6.364922
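The proportion above divides by a hard-coded 707. A small sketch of the same summary that divides by the total of the counts instead, so it still works if rows are added or removed. The toy_data tibble here is made up for illustration and is not data_attn.

```r
library(dplyr)

# Hypothetical data: one row per participant.
toy_data <- tibble(
  education = c("Some_college", "Some_college", "No_school", "Other")
)

toy_summary <- toy_data %>%
  count(education) %>%                     # adds a column n with the group sizes
  mutate(
    proportion = n / sum(n),               # sum(n) replaces the hard-coded total
    percentage = proportion * 100
  )

toy_summary
```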

Step 3: Mutating in dogscale

I added the education column from data_attn into dogscale by using the select() function. I then created a D column using mutate(), which adds up the 20 dogmatism items (Q37_1 to Q37_20), each divided by 20, to obtain a mean dogmatism score per participant. I then glimpsed this to check that the mutated variable was added.

dogscale <- dplyr::select(data_attn, starts_with('Q37'), education, gender)

# D = mean of the 20 dogmatism items (NA if any item is missing)
dogscale <- mutate(dogscale,
  D = Q37_1/20 + Q37_2/20 + Q37_3/20 + Q37_4/20 + Q37_5/20 +
      Q37_6/20 + Q37_7/20 + Q37_8/20 + Q37_9/20 + Q37_10/20 +
      Q37_11/20 + Q37_12/20 + Q37_13/20 + Q37_14/20 + Q37_15/20 +
      Q37_16/20 + Q37_17/20 + Q37_18/20 + Q37_19/20 + Q37_20/20)

glimpse(dogscale$D)
##  num [1:707] 3.1 5.05 4.05 3.1 4.7 4.9 4.55 6 5.45 3.4 ...
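The item-by-item sum can probably be shortened with rowMeans(). A minimal sketch, assuming every column that starts with Q37 is a dogmatism item; the toy_items tibble and its two items are made up for illustration.

```r
library(dplyr)

# Hypothetical data with two items instead of twenty.
toy_items <- tibble(
  Q37_1 = c(1, 5, NA),
  Q37_2 = c(3, 7, 4),
  education = c("No_school", "Other", "Some_college")
)

# rowMeans() averages across the selected columns for each participant.
# Without na.rm = TRUE a participant with any missing item gets NA,
# which matches what the written-out sum does.
toy_items <- toy_items %>%
  mutate(D = rowMeans(across(starts_with("Q37"))))

toy_items$D
```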

Step 4: Dogmatism descriptive statistics summary

I used the same method as last week for my beliefsup_summary summary table to create a table that summarises dogmatism scores among participants of varying education levels. To sort participants by education I used the group_by() function. I then used mean() and sd(), ignoring NA values with na.rm=TRUE, to find the means and standard deviations, n() to count the participants in each group, and sd/sqrt(n) to calculate the standard errors. I then used gt() to place this into a table.

# n() counts every row in the group, including rows where D is NA
dog_summary <- dogscale %>%
  group_by(education) %>%
  summarise(mean = mean(D, na.rm = TRUE), sd = sd(D, na.rm = TRUE), n = n(), se = sd / sqrt(n))

dog_summary %>% gt()
education                                        mean         sd    n          se
Eighth_grade_or_less                         4.378621  0.9623601  152  0.07805771
Four_year_college_degree                     4.468565  1.2053963  249  0.07638888
Graduate_or_professional_training            4.486364  1.0794509   93  0.11193384
Highschool_or_equivalent                     4.320400  1.0989932  134  0.09493856
More_than_eighth grade_less_than_highschool  4.220833  1.4921702   13  0.41385356
No_school                                    4.920000  0.3599383   11  0.10852547
Other                                        4.245000  1.7327643   10  0.54794819
Some_college                                 4.172222  1.0057536   45  0.14992890
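One thing to watch in this summary: n() counts every row in a group, while mean() and sd() drop the missing D values, so the standard error is divided by a slightly larger n than the one the sd is based on. A hedged sketch of counting only the non-missing scores (toy_scores and the n_valid column are made up for illustration):

```r
library(dplyr)

# Hypothetical scores; group "A" has one missing D value.
toy_scores <- tibble(
  education = c("A", "A", "A", "B", "B"),
  D = c(4, 5, NA, 3, 5)
)

toy_summary <- toy_scores %>%
  group_by(education) %>%
  summarise(
    mean    = mean(D, na.rm = TRUE),
    sd      = sd(D, na.rm = TRUE),
    n       = n(),               # all rows, including missing D
    n_valid = sum(!is.na(D)),    # rows that actually contribute to mean and sd
    se      = sd / sqrt(n_valid)
  )

toy_summary
```

In group "A" the two counts differ (3 rows, 2 valid scores), which is the situation the boxplot warnings about removed rows point to.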

Step 5: Boxplot 1

This boxplot is very pretty! However, I don’t want to compare all 8 education groups (I don’t know how to do that with ANOVA). I need to create a boxplot that shows only the no school and graduate or professional training groups.

dogscale %>% group_by(education) %>%
  ggplot(aes(education, D, fill = education)) +
  geom_boxplot(alpha = .4) +
  scale_x_discrete(labels = NULL) +
  labs(x = 'Education', y = 'Mean Dogmatism Score') +
  theme_minimal() +
  theme(axis.ticks.y = element_line(color = "black")) +
  theme(axis.line = element_line(color = "black")) +
  geom_jitter(alpha = 0.6, aes(color = education)) +
  easy_all_text_size(10) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
## Warning: Removed 35 rows containing non-finite values (stat_boxplot).
## Warning: Removed 35 rows containing missing values (geom_point).

Step 6: Boxplot 2

I won’t explain all this code as I have explained it before in my other learning logs. The new code I included in this graph is:

  • After many hours of googling, mutating columns and creating new data frames, I still had no idea how to select only certain groups to display in the plot. During the workshop, however, Jenny R suggested I use the filter() function with the %in% operator to keep only two education groups in the plot. This worked beautifully!
dogscale %>% filter(education %in% c("No_school", "Graduate_or_professional_training")) %>%
  ggplot(aes(education, D, fill = education)) +
  geom_boxplot(alpha = .4) +
  scale_x_discrete(labels = c("Graduate or Professional Training", "No School")) +
  labs(x = 'Education', y = 'Mean Dogmatism Score') +
  theme_minimal() +
  theme(axis.ticks.y = element_line(color = "black")) +
  theme(axis.line = element_line(color = "black")) +
  geom_jitter(alpha = 1, size = 1, aes(color = education)) +  # alpha must be between 0 and 1
  easy_remove_legend() +
  easy_all_text_size(10) +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
## Warning: Removed 6 rows containing missing values (geom_point).
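To see what %in% is doing inside filter(), here is a small base R sketch on a made-up vector (education and keep are hypothetical names, not columns from dogscale):

```r
# Hypothetical education values for four participants.
education <- c("No_school", "Some_college", "Graduate_or_professional_training", "Other")

# The groups to keep.
keep <- c("No_school", "Graduate_or_professional_training")

# %in% returns one TRUE/FALSE per element of the left-hand vector:
# TRUE when that value appears anywhere in `keep`.
is_kept <- education %in% keep
is_kept

# filter() keeps the rows where this is TRUE; the base R equivalent is:
education[is_kept]
```

This is why `education %in% c(...)` is used rather than `education == c(...)`: `==` compares element by element and recycles the shorter vector, which does not test membership.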

Step 7: Boxplot 3

As the no school group only has a small n, I thought it would be interesting to compare two groups with a more similar n and still a large difference in education level. So, I used the same code as above except I replaced No_school with Eighth_grade_or_less in the filter() call.

dogscale %>% filter(education %in% c("Eighth_grade_or_less", "Graduate_or_professional_training")) %>%
  ggplot(aes(education, D, fill = education)) +
  geom_boxplot(alpha = .4) +
  theme_minimal() +
  theme(axis.ticks.y = element_line(color = "black")) +
  theme(axis.line = element_line(color = "black")) +
  geom_jitter(alpha = 1, aes(color = education)) +
  scale_x_discrete(labels = c("Eighth Grade Or Less", "Graduate or Professional Training")) +
  labs(x = 'Education', y = 'Mean Dogmatism Score') +
  easy_all_text_size(10) +
  easy_remove_legend() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).
## Warning: Removed 12 rows containing missing values (geom_point).

Step 8: Inferential statistics Boxplot 2

Similar to the workflow for this section last week, I ran a t-test (R’s default Welch two-sample test) on the groups I am most interested in for this question. I did this by creating two new data frames, one containing only those with the minimum level of education (no school) and one containing only those with the maximum level of education (graduate or professional training).

No_School <- dogscale %>% filter(education=="No_school")
Graduate <- dogscale %>% filter(education=="Graduate_or_professional_training")
t.test(No_School$D, Graduate$D)
## 
##  Welch Two Sample t-test
## 
## data:  No_School$D and Graduate$D
## t = 2.6792, df = 33.209, p-value = 0.01139
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1044213 0.7628514
## sample estimates:
## mean of x mean of y 
##  4.920000  4.486364

Looks significant! However, the high mean for No School could be a function of its low n (i.e. 11 vs 93 participants before missing values are removed).
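The same Welch test can also be written with the formula interface of t.test(), which avoids building two separate data frames. A sketch on made-up scores (toy_scores and its values are hypothetical, not the study data):

```r
# Hypothetical scores for two education groups.
toy_scores <- data.frame(
  education = rep(c("No_school", "Graduate_or_professional_training"), times = c(5, 8)),
  D = c(4.9, 5.2, 4.6, 5.0, 4.8, 4.1, 5.3, 3.9, 4.7, 4.4, 5.0, 3.6, 4.5)
)

# Formula interface: outcome ~ grouping variable (exactly two groups).
formula_test <- t.test(D ~ education, data = toy_scores)

# Equivalent two-vector call, in the style used in this log.
vector_test <- t.test(
  toy_scores$D[toy_scores$education == "No_school"],
  toy_scores$D[toy_scores$education == "Graduate_or_professional_training"]
)

# Group sizes actually used (non-missing D), worth reporting next to the p-value.
group_n <- table(toy_scores$education[!is.na(toy_scores$D)])
group_n
```

The formula version orders the groups alphabetically, so the sign of t can flip relative to the two-vector call, but the p-value is the same.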

Step 9: Inferential statistics Boxplot 3

Same code as above, except I filtered the Eighth_grade_or_less group into a new data frame (Eight_Grade) and used it in place of No_School.

Graduate <- dogscale %>% filter(education=="Graduate_or_professional_training")
Eight_Grade <- dogscale %>% filter(education=="Eighth_grade_or_less")
t.test(Eight_Grade$D, Graduate$D)
## 
##  Welch Two Sample t-test
## 
## data:  Eight_Grade$D and Graduate$D
## t = -0.76904, df = 167.61, p-value = 0.443
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3843324  0.1688465
## sample estimates:
## mean of x mean of y 
##  4.378621  4.486364

Not significant!

Quick Question for Jenny! :)

  • Hi! I was just wondering, for my question, would it be worth conducting an ANOVA on all 8 groups, or should I keep it as a t-test comparing No School/Eighth Grade and Graduate? If it’s worth it, how would I go about conducting an ANOVA that large? Thank you!
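One possible starting point for the ANOVA, as a sketch only: base R's aov() takes the same outcome ~ group formula however many groups there are. The data below are simulated with three groups for brevity (toy_scores is hypothetical); whether an ANOVA suits the real data, with its very unequal group sizes, is still the open question above.

```r
set.seed(1)

# Simulated scores for three education groups (not the study data).
toy_scores <- data.frame(
  education = factor(rep(
    c("No_school", "Some_college", "Graduate_or_professional_training"),
    each = 10
  )),
  D = rnorm(30, mean = 4.4, sd = 1)
)

# One-way ANOVA: does mean D differ across the education groups?
anova_fit <- aov(D ~ education, data = toy_scores)

# The summary table holds the F statistic and p-value for education.
summary(anova_fit)
```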

Challenges

  • My main challenge for this week was understanding how to show only two groups in a boxplot. I attempted everything from creating a new data frame to mutating new columns within existing data frames, and I was still unable to figure out how to select only certain values within a column to plot. During the workshop, Jenny R suggested I use the filter() function from dplyr before ggplot(), together with %in%, which I had never seen before! %in% checks whether each value in the education column is one of the values I list, so filter() keeps only the rows for those two groups rather than all eight. This code worked!

  • I had a bit of a challenge with my graph’s width, as it looked very squished with all the different groups, but Jenny S suggested using the fig.width chunk option, which worked wonderfully!

Successes

  • My success of this week was completing a second question for my exploratory analysis! I am really happy with my progress for this section! I am also really happy with the workflow pattern I have developed!

  • A specific coding success I had this week was figuring out when to use a summary data frame versus the full data set when graphing in R. I specifically learned that geom_boxplot() does not work on a summary data frame, because it calculates the median and quartiles itself and so needs the full data frame with one row per participant.

  • Another success of this week was finding out about the filter() function and the %in% operator so that I can filter specific groups for my graphs!

Next Steps

  • My next step for coding in Week 10 is to finish my exploratory analysis by exploring a third question.

  • My next steps are to also add annotations to my graph, particularly for boxplot 1. This is because I want to note the difference in n between the groups, which might be a reason for the significant difference. Jenny R suggested I look up ggannotate so I will have a look at that next!

  • As I have two other assignments due on Friday next week, I hope I can continue progressing with this exploratory section. I’ll attempt to spend as much time as I can on it before the weekend.