DS Labs Assignment

Author

Renato Chavez

Published

March 27, 2023

For this project I will create a graph based on a dataset included in DS Labs.

I will start by installing the dslabs package and then loading the dslabs and tidyverse libraries.

# install.packages("dslabs")
library("dslabs")
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

After that, I will list the datasets in dslabs so that I can choose one.

data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
 [1] "make-admissions.R"                   
 [2] "make-brca.R"                         
 [3] "make-brexit_polls.R"                 
 [4] "make-death_prob.R"                   
 [5] "make-divorce_margarine.R"            
 [6] "make-gapminder-rdas.R"               
 [7] "make-greenhouse_gases.R"             
 [8] "make-historic_co2.R"                 
 [9] "make-mnist_27.R"                     
[10] "make-movielens.R"                    
[11] "make-murders-rda.R"                  
[12] "make-na_example-rda.R"               
[13] "make-nyc_regents_scores.R"           
[14] "make-olive.R"                        
[15] "make-outlier_example.R"              
[16] "make-polls_2008.R"                   
[17] "make-polls_us_election_2016.R"       
[18] "make-reported_heights-rda.R"         
[19] "make-research_funding_rates.R"       
[20] "make-stars.R"                        
[21] "make-temp_carbon.R"                  
[22] "make-tissue-gene-expression.R"       
[23] "make-trump_tweets.R"                 
[24] "make-weekly_us_contagious_diseases.R"
[25] "save-gapminder-example-csv.R"        
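The listing above shows the scripts that build the datasets rather than the datasets themselves. As a small sketch, assuming the dslabs package is installed, the dataset names can also be pulled from the object that data() returns. The variable names dataset_info and dataset_names are only illustrative.

```r
library(dslabs)

# data() returns an object whose `results` matrix has one row per dataset;
# the "Item" column holds the dataset names.
dataset_info <- data(package = "dslabs")$results
dataset_names <- dataset_info[, "Item"]

# Show the first few names and check that death_prob is among them
head(dataset_names)
"death_prob" %in% dataset_names
```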

I will choose the death_prob dataset because I found it interesting to see whether there would be a meaningful difference between males and females.

data("death_prob")
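Before plotting, it can help to look at how the data is organized. This is a quick sketch, assuming death_prob has the columns age, sex and prob that the plots in this report rely on.

```r
library(dslabs)
library(dplyr)

data("death_prob")

# Column types and the first few rows
glimpse(death_prob)
head(death_prob)

# Number of rows and age range for each sex
death_prob %>%
  group_by(sex) %>%
  summarize(n = n(), min_age = min(age), max_age = max(age))
```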

To create my graph I used ggplot. The x axis is age and the y axis is death probability. For the third variable, sex, I chose the Set2 color palette and a legend.

deathprob <- death_prob
ggplot(deathprob, aes(x = age, y = prob, color = sex)) + 
  ggtitle("Death probability according to age and sex") + 
  xlab("age") + 
  ylab("death probability") + 
  theme_minimal(base_size = 12) + 
  geom_point() + 
  geom_line() + 
  scale_color_brewer(palette = 'Set2')

As expected, the curve rises sharply after age 75. To look at this part of the graph in more detail, I filtered the data to ages 75 and older.

deathprob1 <- filter(deathprob, age >= 75)
ggplot(deathprob1, aes(x = age, y = prob, color = sex)) + 
  ggtitle("Death probability according to age and sex (75 or older)") + 
  xlab("age") + 
  ylab("probability") + 
  theme_minimal(base_size = 14) + 
  geom_point() + 
  geom_line() + 
  scale_color_brewer(palette = 'Set2')

After taking a closer look at the death probabilities from age 75 onward, I can see that males have a higher probability of dying than females at these ages. One could also say that females tend to live longer than males. Even though the difference appears small, a gap of one percentage point in the probability of dying means more than one might think. The biggest difference I see is at age 95, with approximately five percentage points between males and females. This is a very interesting dataset, and I would like to do more research into the reasons behind this gap.
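To check the size of the gap numerically rather than by eye, one possible sketch is to compute the difference between the two sexes at each age. The helper name sex_gap_by_age is hypothetical, and the code assumes the data has exactly one row per age and sex.

```r
library(dslabs)
library(dplyr)

data("death_prob")

# Hypothetical helper: absolute difference in death probability
# between the two sexes at each age (assumes one row per age and sex).
sex_gap_by_age <- function(df) {
  df %>%
    group_by(age) %>%
    summarize(gap = max(prob) - min(prob), .groups = "drop")
}

gaps <- sex_gap_by_age(filter(death_prob, age >= 75))

# Gap at age 95
filter(gaps, age == 95)

# Age with the largest gap among ages 75 and older
slice_max(gaps, gap, n = 1)
```

The gap is in probability units, so a value of 0.05 corresponds to five percentage points.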