Harold Nelson
2026-06-16
Most of the data you get will have some bad observations. It is always a good idea to examine your data and consider dropping some. I will illustrate with the cdc dataset.
Load the dataset and make the tidyverse available.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Run summary() on the cdc dataframe. Look at the minimum and maximum values for each of the quantitative variables.
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.000
## very good:6972 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.000
## good :5675 Median :1.0000 Median :1.0000 Median :0.000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.472
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :1.000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68.0 Min. :18.00 m: 9569
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130.0 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150.0 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155.1 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175.0 3rd Qu.:57.00
## Max. :93.00 Max. :500.0 Max. :680.0 Max. :99.00
I see some problems.
Why would anybody want to weigh 680 pounds?
Is a height of 93 inches realistic.
Before cleaning, create a new variable BMI, the body mass index. This corrects for the influence of height on weight in looking for obesity. It will allow us to spot some strange people.
The body mass index (BMI) is a measure which incorprates both height and weight.
The standard interpetation of this measure is as follows:
cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
cdc$BMICat = cut(cdc$BMI,c(18,5,24.9,29.9,39.9,200),labels =
c("Underweight","Normal","Overweight",
"Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(18,5,24.9,29.9,39.9,200),labels =
c("Underweight","Normal","Overweight",
"Obese","Morbidly Obese"),include.lowest=T)
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))
table(cdc$BMICat,cdc$BMIDesCat)##
## Underweight Normal Overweight Obese Morbidly Obese
## Underweight 124 139 8 0 0
## Normal 79 8065 304 13 0
## Overweight 8 3392 3850 45 1
## Obese 9 826 2098 602 6
## Morbidly Obese 6 81 191 138 15
Run this code. I will hand it to you in the zoom chat.
I would classify anyone who wants to be Morbidly Obese as very odd.
I want to drop from the dataframe:
Anybody with a height greater than 84 inches.
Anyone who wants to be morbidly obese unless there is a desire to lose weight.
In the dataframe create a boolean variable drop to capture these conditions.
Use this to create a dataframe to_drop.
cdc = cdc %>%
mutate(drop = height > 84 | (BMIDes >= 40 & wtdesire > weight))
to_drop = cdc %>%
filter(drop)
to_drop## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 8600 very good 1 0 1 72 240 300 23 m
## 9111 very good 1 1 0 69 285 325 29 m
## 10034 very good 1 1 1 73 290 601 56 m
## 10467 good 1 1 0 66 200 250 37 m
## 13715 excellent 0 1 1 69 220 300 30 m
## 15465 very good 0 1 1 48 115 135 69 f
## 16153 very good 1 1 0 68 250 270 31 m
## 16874 good 0 1 0 69 180 680 24 m
## 17534 very good 1 0 0 93 179 100 31 m
## BMI BMIDes DesActRatio BMICat BMIDesCat ageCat drop
## 8600 32.54630 40.682870 1.2500000 Obese Morbidly Obese 18-31 TRUE
## 9111 42.08255 47.988868 1.1403509 Morbidly Obese Morbidly Obese 18-31 TRUE
## 10034 38.25671 79.283731 2.0724138 Obese Morbidly Obese 44-57 TRUE
## 10467 32.27732 40.346648 1.2500000 Obese Morbidly Obese 32-43 TRUE
## 13715 32.48477 44.297417 1.3636364 Obese Morbidly Obese 18-31 TRUE
## 15465 35.08898 41.191406 1.1739130 Obese Morbidly Obese 58-99 TRUE
## 16153 38.00822 41.048875 1.0800000 Obese Morbidly Obese 18-31 TRUE
## 16874 26.57845 100.407477 3.7777778 Overweight Morbidly Obese 18-31 TRUE
## 17534 14.54931 8.128107 0.5586592 Underweight Underweight 18-31 TRUE
Use the boolean drop to filter the cdc dataframe and run summary again.
## genhlth exerany hlthplan smoke100
## excellent:4656 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6966 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5673 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68 Min. :18.00 m: 9561
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130 1st Qu.:31.00 f:10430
## Median :67.00 Median :165.0 Median :150 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175 3rd Qu.:57.00
## Max. :84.00 Max. :500.0 Max. :350 Max. :99.00
## BMI BMIDes DesActRatio BMICat
## Min. :12.40 Min. :10.44 Min. :0.2667 Underweight : 270
## 1st Qu.:22.71 1st Qu.:21.73 1st Qu.:0.8710 Normal :8461
## Median :25.60 Median :23.73 Median :0.9444 Overweight :7295
## Mean :26.30 Mean :23.96 Mean :0.9265 Obese :3535
## 3rd Qu.:28.89 3rd Qu.:25.80 3rd Qu.:1.0000 Morbidly Obese: 430
## Max. :73.09 Max. :54.86 Max. :1.9681
## BMIDesCat ageCat drop
## Underweight : 225 18-31:5081 Mode :logical
## Normal :12503 32-43:5262 FALSE:19991
## Overweight : 6451 44-57:4786
## Obese : 798 58-99:4862
## Morbidly Obese: 14
##