For this chunk of code, I’m loading the dslabs package into my document and listing its datasets so that I can figure out which one I would like to work with.
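The chunk itself is not shown here, so the following is only a minimal sketch of one way to load the package and list its datasets. The variable name `dataset_names` is an assumption made for illustration.

```r
library(dslabs)

# data(package = ...) returns an object whose $results matrix has an
# "Item" column holding the dataset names.
# `dataset_names` is a hypothetical name, not from the original chunk.
dataset_names <- data(package = "dslabs")$results[, "Item"]
print(dataset_names)
```

Printing the names this way makes it easy to scan the list and pick a dataset before loading it.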
data("heights")
heights <- heights
I called the “heights” dataset into the Quarto document. I was running into an issue where I could not see the number of observations and variables, so I used the second line of code, which assigns the dataset to itself, to force R to load it fully.
head(heights)
sex height
1 Male 75
2 Male 70
3 Male 68
4 Male 74
5 Male 61
6 Female 65
I ran this code to get an idea of what the data looked like and whether I could meet the assignment requirements with it.
sum(is.na(heights))
[1] 0
I checked to see if there were NA values I would need to address.
Since the dataset I chose only had two variables, I ultimately decided to break the height variable into shorter than average and taller than average for both males and females. I originally planned to use the actual worldwide average heights for males and females and compare this data to those averages. As I was trying to figure out how to code this, I realized it made more sense to use the averages from the data itself instead of comparing it to numbers that may be far from relevant.
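The code that computed the per-sex averages is not shown, so here is a minimal sketch of how a table like `average_heights` could be built with dplyr. The column name `avg_height` is an assumption, and the sketch reads from `dslabs::heights` directly so that it runs on its own.

```r
library(dplyr)
library(dslabs)

# One row per sex, holding the mean height within the dataset itself.
# `avg_height` is a hypothetical column name chosen for this sketch.
average_heights <- dslabs::heights %>%
  group_by(sex) %>%
  summarise(avg_height = mean(height))

print(average_heights)
```

Giving the average its own column name keeps it from colliding with the existing `height` column when the two tables are joined.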
heights <- heights %>% left_join(average_heights, by = "sex")
I used the “left_join” function to merge my average heights with the original dataset. I kept accidentally overwriting my data at first, but I was able to figure out how to prevent that with the category names.
This code is what actually created the “Below Average” and “Above Average” categories. I used ifelse to label each height as falling above or below the average height for that sex.
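The ifelse chunk is not shown, so this is a sketch of what the labeling step might look like. It rebuilds the joined data from a fresh copy of the dataset so it can run by itself, and the column name `avg_height` is an assumption. Only `height_type` and its two labels come from the plot code and the text.

```r
library(dplyr)
library(dslabs)

# Rebuild the per-sex averages and the join from a fresh copy of the data.
# `avg_height` is a hypothetical column name chosen for this sketch.
average_heights <- dslabs::heights %>%
  group_by(sex) %>%
  summarise(avg_height = mean(height))

heights <- dslabs::heights %>%
  left_join(average_heights, by = "sex")

# Label each row by comparing its height to the average for its sex.
heights <- heights %>%
  mutate(height_type = ifelse(height > avg_height,
                              "Above Average", "Below Average"))

table(heights$sex, heights$height_type)
```

In this sketch a height exactly equal to the average would be labeled “Below Average”, which is one of several reasonable ways to handle ties.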
ggplot(heights, aes(x = sex, y = height, fill = height_type)) +
  geom_boxplot() +
  labs(title = "Self Reported Height Distribution by Sex", x = "Sex", y = "Height (Inches)", fill = "Height Distinction") +
  theme_minimal() +
  scale_fill_manual(values = c("maroon", "seagreen"))
Originally, my graph was going to show the difference in spread between male and female heights. Then I realized it did not have enough variables, so I reworked all of the code. I decided to compare not only self-reported average height, but also the spread of males and females who are above and below the average height. I thought it was interesting how similar the spreads look at first glance, but when you look closely you notice some interesting differences.
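One way to check the spreads numerically instead of only by eye is to summarise each box in the plot. This is a sketch, not part of the original analysis. It uses a grouped mutate in place of the join to keep the block short, and the names `spread_summary`, `median_height`, and `iqr_height` are assumptions.

```r
library(dplyr)
library(dslabs)

# Label heights relative to the mean of their own sex, then summarise
# each sex-by-category group. All summary names here are hypothetical.
spread_summary <- dslabs::heights %>%
  group_by(sex) %>%
  mutate(height_type = ifelse(height > mean(height),
                              "Above Average", "Below Average")) %>%
  group_by(sex, height_type) %>%
  summarise(n = n(),
            median_height = median(height),
            iqr_height = IQR(height),
            .groups = "drop")

print(spread_summary)
```

Each row corresponds to one box in the boxplot, so the interquartile ranges can be compared directly across the four groups.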