Homework 3:Visualization in Practice

The NHANES dataset is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960’s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. You can load it by installing the package

#install.packages(“NHANES”)
library(NHANES)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
data(NHANES)
#?NHANES
nhanes_df <- NHANES %>%
  select("Height", "BMI", "Age", "Gender")
head(nhanes_df)
## # A tibble: 6 × 4
##   Height   BMI   Age Gender
##    <dbl> <dbl> <int> <fct> 
## 1   165.  32.2    34 male  
## 2   165.  32.2    34 male  
## 3   165.  32.2    34 male  
## 4   105.  15.3     4 male  
## 5   168.  30.6    49 female
## 6   133.  16.8     9 male
head(nhanes_df)
## # A tibble: 6 × 4
##   Height   BMI   Age Gender
##    <dbl> <dbl> <int> <fct> 
## 1   165.  32.2    34 male  
## 2   165.  32.2    34 male  
## 3   165.  32.2    34 male  
## 4   105.  15.3     4 male  
## 5   168.  30.6    49 female
## 6   133.  16.8     9 male

Using the NHANES dataset (continued from last homework)

1.

Create a smooth density plot for male & female heights (use different fill colors so you can see both distributions and adjust alpha so you can see both distributions where they overlap)

nhanes_df %>%
  ggplot(aes(x = Height, fill=Gender)) +
  geom_density(alpha = 0.4)
## Warning: Removed 353 rows containing non-finite values (`stat_density()`).

2.

Create a histogram of female heights and overlay a standard normal curve. Are female heights representative of a standard normal curve in this dataset? Why or why not?

female_heights<- nhanes_df$Height[nhanes_df$Gender=="female"]
mean(female_heights, na.rm=TRUE)
## [1] 156.6159
sd(female_heights, na.rm=TRUE)
## [1] 16.79195
nhanes_df %>% 
  filter(Gender=="female") %>%
    ggplot(aes(Height)) +
    geom_density(fill = "red") +
    stat_function(fun = dnorm, args=list(mean= 156.6, sd=16.8)) +
    labs(x = "Female Heights (cm)")
## Warning: Removed 173 rows containing non-finite values (`stat_density()`).

This graph is not representative of a standard normal curve. The graph is Leptokurtic or “too pointy” and has a negative skew.

Using the gapminder dataset

library(dslabs)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(geomtextpath)
data(gapminder)
gapminder %>% as_tibble()
## # A tibble: 10,545 × 9
##    country          year infan…¹ life_…² ferti…³ popul…⁴      gdp conti…⁵ region
##    <fct>           <int>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl> <fct>   <fct> 
##  1 Albania          1960   115.     62.9    6.19  1.64e6 NA       Europe  South…
##  2 Algeria          1960   148.     47.5    7.65  1.11e7  1.38e10 Africa  North…
##  3 Angola           1960   208      36.0    7.32  5.27e6 NA       Africa  Middl…
##  4 Antigua and Ba…  1960    NA      63.0    4.43  5.47e4 NA       Americ… Carib…
##  5 Argentina        1960    59.9    65.4    3.11  2.06e7  1.08e11 Americ… South…
##  6 Armenia          1960    NA      66.9    4.55  1.87e6 NA       Asia    Weste…
##  7 Aruba            1960    NA      65.7    4.82  5.42e4 NA       Americ… Carib…
##  8 Australia        1960    20.3    70.9    3.45  1.03e7  9.67e10 Oceania Austr…
##  9 Austria          1960    37.3    68.8    2.7   7.07e6  5.24e10 Europe  Weste…
## 10 Azerbaijan       1960    NA      61.3    5.57  3.90e6 NA       Asia    Weste…
## # … with 10,535 more rows, and abbreviated variable names ¹​infant_mortality,
## #   ²​life_expectancy, ³​fertility, ⁴​population, ⁵​continent
unique(gapminder$region)
##  [1] Southern Europe           Northern Africa          
##  [3] Middle Africa             Caribbean                
##  [5] South America             Western Asia             
##  [7] Australia and New Zealand Western Europe           
##  [9] Southern Asia             Eastern Europe           
## [11] Central America           Western Africa           
## [13] Southern Africa           South-Eastern Asia       
## [15] Eastern Africa            Northern America         
## [17] Eastern Asia              Northern Europe          
## [19] Melanesia                 Polynesia                
## [21] Central Asia              Micronesia               
## 22 Levels: Australia and New Zealand Caribbean Central America ... Western Europe

3.

Create a plot comparing life expectancy for the United Kingdom and Chile over time (years included in the survey). Make these lines and include labels (while also excluding the legend).

countries <- c("United Kingdom", "Chile")
gapminder %>% 
  filter(country %in% countries) %>% 
  ggplot(aes(year, life_expectancy, col=country, label=country)) + 
    geom_textpath() +
    theme(legend.position = "none")

4.

Create the plot shown in 10.8.2 (represents data from year = 2010). You will have to use trans=“logit”. Don’t worry about getting country labels.If you want, you can include the arguments limits and breaks, but they are not required.

10.8.2

I start by adding the column dollars_per_day togapminder dataframe.

gapminder <- gapminder %>% 
  mutate(dollars_per_day = gdp/population/365)
head(gapminder)
##               country year infant_mortality life_expectancy fertility
## 1             Albania 1960           115.40           62.87      6.19
## 2             Algeria 1960           148.20           47.50      7.65
## 3              Angola 1960           208.00           35.98      7.32
## 4 Antigua and Barbuda 1960               NA           62.97      4.43
## 5           Argentina 1960            59.87           65.39      3.11
## 6             Armenia 1960               NA           66.86      4.55
##   population          gdp continent          region dollars_per_day
## 1    1636054           NA    Europe Southern Europe              NA
## 2   11124892  13828152297    Africa Northern Africa        3.405458
## 3    5270844           NA    Africa   Middle Africa              NA
## 4      54681           NA  Americas       Caribbean              NA
## 5   20619075 108322326649  Americas   South America       14.393153
## 6    1867396           NA      Asia    Western Asia              NA
gapminder <- gapminder |> 
  mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe","Southern Europe", "Northern America", "Australia and New Zealand") ~ "West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan",
    region %in% c("Southern Asia")~ "Southern Asia",
    region %in% c("Northern Africa") ~ "Northern Africa",
    region %in% c("Polynesia", "Melanesia", "Micronesia")~ "Pacific Islands",
    TRUE ~ "Others"))
unique(gapminder$group)
## [1] "West"            "Northern Africa" "Sub-Saharan"     "Latin America"  
## [5] "Others"          "Southern Asia"   "East Asia"       "Pacific Islands"
gapminder <- gapminder |> 
  mutate(group = factor(group, levels = c("East Asia", "Latin America", "Northern Africa", "Pacific Islands", "Southern Asia",
                                          "Sub-Saharan","West", "Others")))
ten.eight.two <- gapminder |>
  filter (year==2010, group!="Others") |>
  group_by(group) |>
  ggplot(aes(dollars_per_day, 1-infant_mortality/1000, col = group)) +
  geom_point(alpha = 0.6, size = 4) +
  scale_y_continuous(trans="logit", breaks=c(0.850, 0.900, 0.950, 0.990, 0.995, 0.998)) +
  scale_x_continuous(trans = "log2", limit = c(.25, 200)) 
ten.eight.two
## Warning: Removed 11 rows containing missing values (`geom_point()`).