My goals this week were:
Pick another question for exploratory analysis
Obtain descriptive statistics for the question
Plot the statistics in a boxplot
Obtain inferential statistics for the question
For my second question, I chose to answer:
As I did for male and female last week, I renamed all the numerical values in the education column within data_attn to their corresponding categorical names (character). I then placed the column into a tibble to check that it had changed to character type and that the values themselves had been recoded correctly.
data_attn$education[data_attn$education == 1] <- "No_school"
data_attn$education[data_attn$education == 2] <- "Eighth_grade_or_less"
data_attn$education[data_attn$education == 3] <- "More_than_eighth grade_less_than_highschool"
data_attn$education[data_attn$education == 4] <- "Highschool_or_equivalent"
data_attn$education[data_attn$education == 5] <- "Some_college"
data_attn$education[data_attn$education == 6] <- "Four_year_college_degree"
data_attn$education[data_attn$education == 7] <- "Graduate_or_professional_training"
data_attn$education[data_attn$education == 8] <- "Other"
tibble(data_attn$education)
## # A tibble: 707 x 1
## `data_attn$education`
## <chr>
## 1 Eighth_grade_or_less
## 2 Graduate_or_professional_training
## 3 Eighth_grade_or_less
## 4 Four_year_college_degree
## 5 Four_year_college_degree
## 6 Four_year_college_degree
## 7 Highschool_or_equivalent
## 8 Four_year_college_degree
## 9 Four_year_college_degree
## 10 Highschool_or_equivalent
## # … with 697 more rows
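The eight assignments above can also be written as a single lookup. This is only a minimal sketch, assuming the codes run from 1 to 8 as in the lines above; education_codes and education_labels are hypothetical names, and the toy vector stands in for data_attn$education.

```r
# Hypothetical toy codes standing in for data_attn$education (values 1-8)
education_codes <- c(2, 7, 2, 6, 4, 8, 1)

# Labels in code order, copied from the assignments above
education_labels <- c(
  "No_school",
  "Eighth_grade_or_less",
  "More_than_eighth grade_less_than_highschool",
  "Highschool_or_equivalent",
  "Some_college",
  "Four_year_college_degree",
  "Graduate_or_professional_training",
  "Other"
)

# Indexing the label vector by the numeric code recodes everything in one step
education_chr <- education_labels[education_codes]
education_chr
```

A missing code (NA) stays NA with this approach, so no extra handling is needed for unanswered items.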
Again, similar to obtaining the descriptive statistics last week, I created a new data frame showing the number, proportion and percentage of participants at each education level using the summarise() function. I then placed this into a table using gt().
education_summary<-data_attn %>% group_by(education) %>% summarise(n=n(), proportion=n/707, percentage=proportion*100)
education_summary %>% gt()
education | n | proportion | percentage |
---|---|---|---|
Eighth_grade_or_less | 152 | 0.21499293 | 21.499293 |
Four_year_college_degree | 249 | 0.35219236 | 35.219236 |
Graduate_or_professional_training | 93 | 0.13154173 | 13.154173 |
Highschool_or_equivalent | 134 | 0.18953324 | 18.953324 |
More_than_eighth grade_less_than_highschool | 13 | 0.01838755 | 1.838755 |
No_school | 11 | 0.01555870 | 1.555870 |
Other | 10 | 0.01414427 | 1.414427 |
Some_college | 45 | 0.06364922 | 6.364922 |
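The proportion above divides by a hard-coded 707. A possible alternative, sketched here on a hypothetical toy data frame (toy and toy_summary are made-up names), is to divide by sum(n) so the total updates itself if rows are ever added or removed.

```r
library(dplyr)

# Hypothetical toy data; the real data_attn has 707 rows
toy <- data.frame(
  education = c("Some_college", "Other", "Some_college", "No_school")
)

toy_summary <- toy %>%
  group_by(education) %>%
  summarise(n = n()) %>%
  # sum(n) is the total number of rows, so nothing is hard-coded
  mutate(proportion = n / sum(n), percentage = proportion * 100)

toy_summary
```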
I added the education column from data_attn into dogscale by using the select() function. I then created a D column using mutate(), which sums the 20 dogmatism items, each divided by 20 (the total number of items), to obtain a mean dogmatism score. I then glimpsed this to check that the mutated variable was added.
dogscale=dplyr::select(data_attn,starts_with('Q37'), education, gender)
dogscale<-mutate(dogscale, D= Q37_1/20+Q37_2/20+Q37_3/20+Q37_4/20+Q37_5/20+Q37_6/20+Q37_7/20+Q37_8/20+Q37_9/20+Q37_10/20+Q37_11/20+Q37_12/20+Q37_13/20+Q37_14/20+Q37_15/20+Q37_16/20+Q37_17/20+Q37_18/20+Q37_19/20+Q37_20/20)
glimpse(dogscale$D)
## num [1:707] 3.1 5.05 4.05 3.1 4.7 4.9 4.55 6 5.45 3.4 ...
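Typing out all twenty items works, but the same mean can be computed across every column whose name starts with Q37. This is a minimal base R sketch on a hypothetical three-item toy data frame, not the code used above.

```r
# Hypothetical toy data with three items instead of twenty
toy <- data.frame(
  Q37_1 = c(1, 4),
  Q37_2 = c(2, 5),
  Q37_3 = c(3, NA)
)

# Pick the Q37 columns by name, then average across each row
q37_cols <- startsWith(names(toy), "Q37")
toy$D <- rowMeans(toy[q37_cols])

toy$D
```

As with the original sum, a participant with any missing item gets NA for D, because rowMeans() does not drop missing values unless na.rm = TRUE is set.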
I used the same method as last week for my beliefsup_summary summary table to create a table that summarises dogmatism scores among participants of varying education levels. To sort participants by education I used the group_by() function. I then used the mean() and sd() functions, ignoring NA values with na.rm=TRUE, to find the means and standard deviations, used n() to count participants, and calculated the standard error as sd/sqrt(n). I then used gt() to place this into a table.
dog_summary<-dogscale %>% group_by(education) %>% summarise(mean = mean(D, na.rm=TRUE), sd = sd(D, na.rm=TRUE), n=n(), se=sd/sqrt(n))
dog_summary %>% gt()
education | mean | sd | n | se |
---|---|---|---|---|
Eighth_grade_or_less | 4.378621 | 0.9623601 | 152 | 0.07805771 |
Four_year_college_degree | 4.468565 | 1.2053963 | 249 | 0.07638888 |
Graduate_or_professional_training | 4.486364 | 1.0794509 | 93 | 0.11193384 |
Highschool_or_equivalent | 4.320400 | 1.0989932 | 134 | 0.09493856 |
More_than_eighth grade_less_than_highschool | 4.220833 | 1.4921702 | 13 | 0.41385356 |
No_school | 4.920000 | 0.3599383 | 11 | 0.10852547 |
Other | 4.245000 | 1.7327643 | 10 | 0.54794819 |
Some_college | 4.172222 | 1.0057536 | 45 | 0.14992890 |
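One thing worth checking in a summary like this: n() counts every row in a group, including rows where D is missing, while mean() and sd() with na.rm=TRUE drop those rows. A hedged sketch on hypothetical toy data shows how a separate count of non-missing scores (here called n_valid, a made-up name) could be used for the standard error.

```r
library(dplyr)

# Hypothetical toy data with one missing score in group A
toy <- data.frame(
  education = c("A", "A", "A", "B", "B", "B"),
  D = c(4, 5, NA, 3, 4, 5)
)

toy_summary <- toy %>%
  group_by(education) %>%
  summarise(
    mean = mean(D, na.rm = TRUE),
    sd = sd(D, na.rm = TRUE),
    n = n(),                   # all rows, including missing D
    n_valid = sum(!is.na(D)),  # only rows with a score
    se = sd / sqrt(n_valid)
  )

toy_summary
```

Whether this changes the standard errors in the table above depends on how many D values are missing in each education group.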
This boxplot is very pretty! However, I don’t want to compare all 8 education groups (I don’t know how to do that with ANOVA). I need to create a boxplot that filters to only the no school and graduate or professional training groups.
dogscale %>% group_by(education) %>%
ggplot(aes(education, D, fill=education)) +
geom_boxplot(alpha=.4) + scale_x_discrete(labels = NULL)+labs(x='Education', y='Mean Dogmatism Score') + theme_minimal() + theme(axis.ticks.y = element_line(color="black"))+ theme(axis.line= element_line(color="black"))+ geom_jitter(alpha = 0.6, aes(color=education)) + easy_all_text_size(10) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
## Warning: Removed 35 rows containing non-finite values (stat_boxplot).
## Warning: Removed 35 rows containing missing values (geom_point).
I won’t explain all this code as I have explained it before in my other learning logs. The new code I included in this graph is the filter() function with the %in% operator, which shows only two groups in the plot. This worked beautifully!
dogscale %>% filter(education %in% c("No_school", "Graduate_or_professional_training")) %>%
ggplot(aes(education, D, fill=education)) +
geom_boxplot(alpha=.4) + scale_x_discrete(labels = c("Graduate or Professional Training", "No School"))+labs(x='Education', y='Mean Dogmatism Score') + theme_minimal() + theme(axis.ticks.y = element_line(color="black"))+ theme(axis.line= element_line(color="black"))+ geom_jitter(alpha = 1, size= 1, aes(color=education)) + easy_remove_legend() + easy_all_text_size(10) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
## Warning: Removed 6 rows containing missing values (geom_point).
As the no school group has only a small n, I thought it would be interesting to compare two groups with a similar n and a large difference in education level. So, I used the same code as above, except I replaced No_school with Eighth_grade_or_less in the filter() call.
dogscale %>% filter(education %in% c("Eighth_grade_or_less", "Graduate_or_professional_training")) %>%
ggplot(aes(education, D, fill=education)) +
geom_boxplot(alpha=.4) + theme_minimal() + theme(axis.ticks.y = element_line(color="black"))+ theme(axis.line= element_line(color="black"))+ geom_jitter(alpha = 1, aes(color=education))+ scale_x_discrete(labels = c("Eighth Grade Or Less", "Graduate or Professional Training"))+labs(x='Education', y='Mean Dogmatism Score') + easy_all_text_size(10) + easy_remove_legend() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).
## Warning: Removed 12 rows containing missing values (geom_point).
Similar to the workflow for this section last week, I ran a t-test on the groups I am most interested in for this question. I did this by creating two new data frames: one containing only participants with the minimum level of education (no school) and one containing only those with the maximum level (graduate or professional training).
No_School <- dogscale %>% filter(education=="No_school")
Graduate <- dogscale %>% filter(education=="Graduate_or_professional_training")
t.test(No_School$D, Graduate$D)
##
## Welch Two Sample t-test
##
## data: No_School$D and Graduate$D
## t = 2.6792, df = 33.209, p-value = 0.01139
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1044213 0.7628514
## sample estimates:
## mean of x mean of y
## 4.920000 4.486364
Looks significant! However, the large mean for No School could be a function of its small n (11 vs 93 for Graduate).
Same code as above, except I swapped No_School for Eight_Grade, a new data frame filtered to Eighth_grade_or_less.
Graduate <- dogscale %>% filter(education=="Graduate_or_professional_training")
Eight_Grade <- dogscale %>% filter(education=="Eighth_grade_or_less")
t.test(Eight_Grade$D, Graduate$D)
##
## Welch Two Sample t-test
##
## data: Eight_Grade$D and Graduate$D
## t = -0.76904, df = 167.61, p-value = 0.443
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3843324 0.1688465
## sample estimates:
## mean of x mean of y
## 4.378621 4.486364
Not significant!
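The t-tests above compare two groups at a time. For comparing all of the education groups at once, one option is a one-way ANOVA with the base R aov() function. This is only a sketch on simulated toy data (toy, Group_A and so on are made up), not a result from the real dogscale data.

```r
# Simulated toy data: three hypothetical groups of 20 scores each
set.seed(1)
toy <- data.frame(
  education = rep(c("Group_A", "Group_B", "Group_C"), each = 20),
  D = rnorm(60, mean = 4.4, sd = 1)
)

# One-way ANOVA: does mean D differ across the education groups?
fit <- aov(D ~ education, data = toy)
summary(fit)
```

With the real data, the same formula would be D ~ education with data = dogscale. Given how unequal the group sizes are, it may also be worth looking at oneway.test(), which does not assume equal variances by default.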
My main challenge this week was understanding how to include only two groups in a boxplot. I attempted everything from creating a new data frame to mutating new columns within existing data frames, and I was still unable to figure out how to select only certain values within a column to plot. During the workshop, Jenny R suggested I use the filter() function before ggplot() together with %in%, which I had never seen before! I think it's used to keep only the rows whose education value matches one of the 2 groups I list, so that 2 groups are plotted on the x-axis rather than eight. This code worked!
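To see what %in% is doing on its own, here is a small sketch on a hypothetical toy data frame (toy, keep and two_groups are made-up names). It returns TRUE or FALSE for each row, and filter() keeps the TRUE rows.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  education = c("No_school", "Other", "Some_college", "No_school"),
  D = c(4.9, 4.2, 4.1, 5.0)
)

# The groups to keep
keep <- c("No_school", "Some_college")

# %in% checks each value against the keep vector
toy$education %in% keep   # TRUE FALSE TRUE TRUE

# filter() keeps only the rows where the check is TRUE
two_groups <- toy %>% filter(education %in% keep)
two_groups
```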
I had a bit of a challenge with my graph’s width, as it looked very squished from all the different groups, but Jenny S suggested using the fig.width chunk option, which worked wonderfully!
My success of this week was completing a second question for my exploratory analysis! I am really happy with my progress for this section! I am also really happy with the workflow pattern I have developed!
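A specific coding success I had this week was figuring out when to use a summary data frame versus the full data set when graphing in R. I specifically learned that boxplots do not work with summary data frames (one row per group) but do work with the full data frame, because geom_boxplot() by default computes the quartiles from the individual scores.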
Another success of this week was finding out about the filter() function and the %in% operator so that I can filter specific groups for my graphs!
My next step for coding in Week 10 is to finish my exploratory analysis by exploring a third question.
My next steps are to also add annotations to my graph, particularly for boxplot 1. This is because I want to note the difference in n between the groups, which might be a reason for the significant difference. Jenny R suggested I look up ggannotate so I will have a look at that next!
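As a starting point for the n annotations, here is a rough sketch that uses plain ggplot2 with geom_text() rather than ggannotate, which I have not tried yet. The toy data, the counts data frame and the y offset of 0.3 are all made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data with unequal group sizes
toy <- data.frame(
  education = c(rep("Group_A", 3), rep("Group_B", 6)),
  D = c(4.8, 5.0, 5.1, 3.9, 4.2, 4.4, 4.6, 4.9, 5.3)
)

# One row per group: the sample size and a y position just above the highest score
counts <- data.frame(
  education = c("Group_A", "Group_B"),
  n = as.vector(table(toy$education)),
  y = max(toy$D) + 0.3
)

# The text layer uses the counts data frame instead of the raw scores
p <- ggplot(toy, aes(education, D)) +
  geom_boxplot() +
  geom_text(data = counts, aes(x = education, y = y, label = paste0("n = ", n)))

p
```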
As I have two other assignments due on Friday next week, I hope I can continue progressing with this exploratory section. I’ll attempt to spend as much time as I can on it before the weekend.