The NHANES dataset is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960’s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. You can load it by installing the package
#install.packages(“NHANES”)
library(NHANES)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
data(NHANES)
#?NHANES
nhanes_df <- NHANES %>%
select("Height", "BMI", "Age", "Gender")
head(nhanes_df)
## # A tibble: 6 × 4
## Height BMI Age Gender
## <dbl> <dbl> <int> <fct>
## 1 165. 32.2 34 male
## 2 165. 32.2 34 male
## 3 165. 32.2 34 male
## 4 105. 15.3 4 male
## 5 168. 30.6 49 female
## 6 133. 16.8 9 male
head(nhanes_df)
## # A tibble: 6 × 4
## Height BMI Age Gender
## <dbl> <dbl> <int> <fct>
## 1 165. 32.2 34 male
## 2 165. 32.2 34 male
## 3 165. 32.2 34 male
## 4 105. 15.3 4 male
## 5 168. 30.6 49 female
## 6 133. 16.8 9 male
Create a smooth density plot for male & female heights (use different fill colors so you can see both distributions and adjust alpha so you can see both distributions where they overlap)
nhanes_df %>%
ggplot(aes(x = Height, fill=Gender)) +
geom_density(alpha = 0.4)
## Warning: Removed 353 rows containing non-finite values (`stat_density()`).
Create a histogram of female heights and overlay a standard normal curve. Are female heights representative of a standard normal curve in this dataset? Why or why not?
female_heights<- nhanes_df$Height[nhanes_df$Gender=="female"]
mean(female_heights, na.rm=TRUE)
## [1] 156.6159
sd(female_heights, na.rm=TRUE)
## [1] 16.79195
nhanes_df %>%
filter(Gender=="female") %>%
ggplot(aes(Height)) +
geom_density(fill = "red") +
stat_function(fun = dnorm, args=list(mean= 156.6, sd=16.8)) +
labs(x = "Female Heights (cm)")
## Warning: Removed 173 rows containing non-finite values (`stat_density()`).
This graph is not representative of a standard normal curve. The graph is Leptokurtic or “too pointy” and has a negative skew.
library(dslabs)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(geomtextpath)
data(gapminder)
gapminder %>% as_tibble()
## # A tibble: 10,545 × 9
## country year infan…¹ life_…² ferti…³ popul…⁴ gdp conti…⁵ region
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Albania 1960 115. 62.9 6.19 1.64e6 NA Europe South…
## 2 Algeria 1960 148. 47.5 7.65 1.11e7 1.38e10 Africa North…
## 3 Angola 1960 208 36.0 7.32 5.27e6 NA Africa Middl…
## 4 Antigua and Ba… 1960 NA 63.0 4.43 5.47e4 NA Americ… Carib…
## 5 Argentina 1960 59.9 65.4 3.11 2.06e7 1.08e11 Americ… South…
## 6 Armenia 1960 NA 66.9 4.55 1.87e6 NA Asia Weste…
## 7 Aruba 1960 NA 65.7 4.82 5.42e4 NA Americ… Carib…
## 8 Australia 1960 20.3 70.9 3.45 1.03e7 9.67e10 Oceania Austr…
## 9 Austria 1960 37.3 68.8 2.7 7.07e6 5.24e10 Europe Weste…
## 10 Azerbaijan 1960 NA 61.3 5.57 3.90e6 NA Asia Weste…
## # … with 10,535 more rows, and abbreviated variable names ¹infant_mortality,
## # ²life_expectancy, ³fertility, ⁴population, ⁵continent
unique(gapminder$region)
## [1] Southern Europe Northern Africa
## [3] Middle Africa Caribbean
## [5] South America Western Asia
## [7] Australia and New Zealand Western Europe
## [9] Southern Asia Eastern Europe
## [11] Central America Western Africa
## [13] Southern Africa South-Eastern Asia
## [15] Eastern Africa Northern America
## [17] Eastern Asia Northern Europe
## [19] Melanesia Polynesia
## [21] Central Asia Micronesia
## 22 Levels: Australia and New Zealand Caribbean Central America ... Western Europe
Create a plot comparing life expectancy for the United Kingdom and Chile over time (years included in the survey). Make these lines and include labels (while also excluding the legend).
countries <- c("United Kingdom", "Chile")
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, life_expectancy, col=country, label=country)) +
geom_textpath() +
theme(legend.position = "none")
Create the plot shown in 10.8.2 (represents data from
year = 2010). You will have to use
trans=“logit”. Don’t worry about getting country labels.If
you want, you can include the arguments limits and
breaks, but they are not required.
10.8.2
I start by adding the column dollars_per_day
togapminder dataframe.
gapminder <- gapminder %>%
mutate(dollars_per_day = gdp/population/365)
head(gapminder)
## country year infant_mortality life_expectancy fertility
## 1 Albania 1960 115.40 62.87 6.19
## 2 Algeria 1960 148.20 47.50 7.65
## 3 Angola 1960 208.00 35.98 7.32
## 4 Antigua and Barbuda 1960 NA 62.97 4.43
## 5 Argentina 1960 59.87 65.39 3.11
## 6 Armenia 1960 NA 66.86 4.55
## population gdp continent region dollars_per_day
## 1 1636054 NA Europe Southern Europe NA
## 2 11124892 13828152297 Africa Northern Africa 3.405458
## 3 5270844 NA Africa Middle Africa NA
## 4 54681 NA Americas Caribbean NA
## 5 20619075 108322326649 Americas South America 14.393153
## 6 1867396 NA Asia Western Asia NA
gapminder <- gapminder |>
mutate(group = case_when(
region %in% c("Western Europe", "Northern Europe","Southern Europe", "Northern America", "Australia and New Zealand") ~ "West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan",
region %in% c("Southern Asia")~ "Southern Asia",
region %in% c("Northern Africa") ~ "Northern Africa",
region %in% c("Polynesia", "Melanesia", "Micronesia")~ "Pacific Islands",
TRUE ~ "Others"))
unique(gapminder$group)
## [1] "West" "Northern Africa" "Sub-Saharan" "Latin America"
## [5] "Others" "Southern Asia" "East Asia" "Pacific Islands"
gapminder <- gapminder |>
mutate(group = factor(group, levels = c("East Asia", "Latin America", "Northern Africa", "Pacific Islands", "Southern Asia",
"Sub-Saharan","West", "Others")))
ten.eight.two <- gapminder |>
filter (year==2010, group!="Others") |>
group_by(group) |>
ggplot(aes(dollars_per_day, 1-infant_mortality/1000, col = group)) +
geom_point(alpha = 0.6, size = 4) +
scale_y_continuous(trans="logit", breaks=c(0.850, 0.900, 0.950, 0.990, 0.995, 0.998)) +
scale_x_continuous(trans = "log2", limit = c(.25, 200))
ten.eight.two
## Warning: Removed 11 rows containing missing values (`geom_point()`).