DS Labs Assignment

Author

Marie-Anne Kemajou

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("dslabs")
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))

 [1] "make-admissions.R"                   
 [2] "make-brca.R"                         
 [3] "make-brexit_polls.R"                 
 [4] "make-calificaciones.R"               
 [5] "make-death_prob.R"                   
 [6] "make-divorce_margarine.R"            
 [7] "make-gapminder-rdas.R"               
 [8] "make-greenhouse_gases.R"             
 [9] "make-historic_co2.R"                 
[10] "make-mice_weights.R"                 
[11] "make-mnist_127.R"                    
[12] "make-mnist_27.R"                     
[13] "make-movielens.R"                    
[14] "make-murders-rda.R"                  
[15] "make-na_example-rda.R"               
[16] "make-nyc_regents_scores.R"           
[17] "make-olive.R"                        
[18] "make-outlier_example.R"              
[19] "make-polls_2008.R"                   
[20] "make-polls_us_election_2016.R"       
[21] "make-pr_death_counts.R"              
[22] "make-reported_heights-rda.R"         
[23] "make-research_funding_rates.R"       
[24] "make-stars.R"                        
[25] "make-temp_carbon.R"                  
[26] "make-tissue-gene-expression.R"       
[27] "make-trump_tweets.R"                 
[28] "make-weekly_us_contagious_diseases.R"
[29] "save-gapminder-example-csv.R"

For this chunk of code, I loaded the dslabs package and listed the scripts that build its datasets so that I could figure out which dataset I would like to work with.
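The listing above shows the scripts that build the datasets rather than the datasets themselves. A minimal sketch of another way to get the dataset names directly, assuming the dslabs package is installed; the variable names `available` and `dataset_names` are only for illustration:

```r
library(dslabs)

# data() returns an object whose `results` matrix has one row per dataset
available <- data(package = "dslabs")

# The "Item" column holds the dataset names, "Title" a short description
dataset_names <- available$results[, "Item"]
print(head(dataset_names))
```

This keeps the names in a character vector, which is easier to search than the printed file list.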

data("heights")
heights <- heights

I loaded the “heights” dataset into the Quarto document. At first I could not see the number of observations and variables, so I added the second line, which reassigns the dataset to itself and forces R to load it fully.
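If the goal is only to see how many observations and variables there are, the dataset can also be inspected directly. This is a small sketch, assuming dslabs is installed; `heights_dim` is a name made up for this example:

```r
library(dslabs)

data("heights")

# dim() returns c(number of rows, number of columns)
heights_dim <- dim(heights)
print(heights_dim)

# str() lists each variable with its type and first few values
str(heights)
```

Calling any function on `heights` makes R load the data, so this likely has the same effect as the reassignment while also printing the size.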

head(heights)

     sex height
1   Male     75
2   Male     70
3   Male     68
4   Male     74
5   Male     61
6 Female     65

I ran this code to get an idea of what the data looked like and whether I could meet the assignment requirements with it.

sum(is.na(heights))

[1] 0

I checked to see if there were NA values I would need to address.
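`sum(is.na(...))` gives one total for the whole data frame. When the total is not zero, a per-column count shows where the missing values are. The sketch below uses a toy data frame with made-up values (not the real heights data) so that there is something to find:

```r
# Toy data with made-up values, including one missing height
toy <- data.frame(
  sex = c("Male", "Female", "Female"),
  height = c(70, NA, 64)
)

# Total number of missing values across the whole data frame
total_na <- sum(is.na(toy))

# Missing values broken down by column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```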

average_heights <- heights %>%
  group_by(sex) %>%
  summarize(average_height = mean(height))

Since the dataset I chose only had two variables, I ultimately decided to split height into shorter than average and taller than average for both males and females. I originally planned to compare this data to the actual worldwide average heights for males and females, but as I was trying to figure out how to code this, I realized it made more sense to use the averages from the data itself instead of comparing it to numbers that may not be relevant.

heights <- heights %>%
  left_join(average_heights, by = "sex")

I used “left_join” to merge my average heights with the original dataset. I kept accidentally overwriting my data at first, but I was able to figure out how to prevent that by using distinct column names.
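The summarize-then-join steps can also be written as one grouped `mutate()`, which adds the group average to every row without a separate table. This is an alternative sketch, not the approach used in this document, and the toy data frame holds made-up values:

```r
library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  sex = c("Male", "Male", "Female", "Female"),
  height = c(70, 74, 62, 66)
)

# mutate() on grouped data computes mean(height) within each sex
# and repeats it on every row of that group
toy_with_avg <- toy %>%
  group_by(sex) %>%
  mutate(average_height = mean(height)) %>%
  ungroup()

print(toy_with_avg)
```

Because nothing is joined back, there is no second table that could overwrite the original columns.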

heights$height_type <- ifelse(heights$height < heights$average_height, "Below Average", "Above Average")

This code is what actually created the “Below Average” and “Above Average” categories. I used ifelse to label each height as falling below or above the average height for that person’s sex.
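One detail of this rule is that a height exactly equal to the average is not below it, so it is labeled “Above Average”. The sketch below shows that on a toy data frame with made-up values and then counts how many rows land in each category:

```r
# Toy data with made-up values; the third row equals its group average
toy <- data.frame(
  sex = c("Male", "Male", "Male", "Female", "Female"),
  height = c(68, 72, 70, 62, 66),
  average_height = c(70, 70, 70, 64, 64)
)

# Same rule as above: strictly below the average, otherwise above
toy$height_type <- ifelse(toy$height < toy$average_height,
                          "Below Average", "Above Average")

# Cross-tabulate sex against category to check the group sizes
category_counts <- table(toy$sex, toy$height_type)
print(category_counts)
```

A table like this is a quick check that both categories exist for each sex before plotting.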

ggplot(heights, aes(x = sex, y = height, fill = height_type)) +
  geom_boxplot() + 
  labs(title = "Self-Reported Height Distribution by Sex",  
       x = "Sex",                             
       y = "Height (Inches)",
       fill = "Height Distinction") +
  theme_minimal() +                           
  scale_fill_manual(values = c("maroon", "seagreen"))

Originally, my graph was going to show only the difference in spread between male and female heights. Then I realized that did not use enough variables, so I reworked all of the code. I decided to compare not only self-reported height by sex, but also the spread of males and females who fall above and below their group’s average height. I thought it was interesting how similar the spreads look at first glance, while a closer look shows some differences between the groups.
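To put numbers on what the boxplot shows, the median and interquartile range can be computed for each combination of sex and category. This is a sketch on a toy data frame with made-up values; the same pipeline should work on any data frame that has `sex`, `height`, and `height_type` columns:

```r
library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  sex = rep(c("Male", "Female"), each = 4),
  height = c(66, 68, 72, 76, 60, 62, 66, 68),
  height_type = rep(c("Below Average", "Below Average",
                      "Above Average", "Above Average"), times = 2)
)

# One row per sex and category: group size, median, and IQR,
# which correspond to the center line and box height in a boxplot
spread_summary <- toy %>%
  group_by(sex, height_type) %>%
  summarize(
    n = n(),
    median_height = median(height),
    iqr_height = IQR(height),
    .groups = "drop"
  )

print(spread_summary)
```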