Cleaning

Harold Nelson

2026-06-16

Intro

Most of the data you get will have some bad observations. It is always a good idea to examine your data and consider dropping some. I will illustrate with the cdc dataset.

Task

Load the dataset and make the tidyverse available.

Solution

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
load("cdc.Rdata")

Task

Run summary() on the cdc dataframe. Look at the minimum and maximum values for each of the quantitative variables.

Solution

summary(cdc)
##       genhlth        exerany          hlthplan         smoke100    
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.472  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

I see some problems.

  1. Why would anybody want to weigh 680 pounds?

  2. Is a height of 93 inches realistic.

Task

Before cleaning, create a new variable BMI, the body mass index. This corrects for the influence of height on weight in looking for obesity. It will allow us to spot some strange people.

The body mass index (BMI) is a measure which incorprates both height and weight.

The standard interpetation of this measure is as follows:

New Variables.

cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
cdc$BMICat = cut(cdc$BMI,c(18,5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$BMIDesCat = cut(cdc$BMIDes,c(18,5,24.9,29.9,39.9,200),labels = 
       c("Underweight","Normal","Overweight",
       "Obese","Morbidly Obese"),include.lowest=T)
cdc$ageCat = cut_number(cdc$age,n=4,labels=c("18-31","32-43","44-57","58-99"))


table(cdc$BMICat,cdc$BMIDesCat)
##                 
##                  Underweight Normal Overweight Obese Morbidly Obese
##   Underweight            124    139          8     0              0
##   Normal                  79   8065        304    13              0
##   Overweight               8   3392       3850    45              1
##   Obese                    9    826       2098   602              6
##   Morbidly Obese           6     81        191   138             15

Run this code. I will hand it to you in the zoom chat.

Task

I would classify anyone who wants to be Morbidly Obese as very odd.

I want to drop from the dataframe:

  1. Anybody with a height greater than 84 inches.

  2. Anyone who wants to be morbidly obese unless there is a desire to lose weight.

In the dataframe create a boolean variable drop to capture these conditions.

Use this to create a dataframe to_drop.

Solution

cdc = cdc %>% 
  mutate(drop = height > 84 | (BMIDes >= 40 & wtdesire > weight))

to_drop = cdc %>% 
  filter(drop)

to_drop
##         genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 8600  very good       1        0        1     72    240      300  23      m
## 9111  very good       1        1        0     69    285      325  29      m
## 10034 very good       1        1        1     73    290      601  56      m
## 10467      good       1        1        0     66    200      250  37      m
## 13715 excellent       0        1        1     69    220      300  30      m
## 15465 very good       0        1        1     48    115      135  69      f
## 16153 very good       1        1        0     68    250      270  31      m
## 16874      good       0        1        0     69    180      680  24      m
## 17534 very good       1        0        0     93    179      100  31      m
##            BMI     BMIDes DesActRatio         BMICat      BMIDesCat ageCat drop
## 8600  32.54630  40.682870   1.2500000          Obese Morbidly Obese  18-31 TRUE
## 9111  42.08255  47.988868   1.1403509 Morbidly Obese Morbidly Obese  18-31 TRUE
## 10034 38.25671  79.283731   2.0724138          Obese Morbidly Obese  44-57 TRUE
## 10467 32.27732  40.346648   1.2500000          Obese Morbidly Obese  32-43 TRUE
## 13715 32.48477  44.297417   1.3636364          Obese Morbidly Obese  18-31 TRUE
## 15465 35.08898  41.191406   1.1739130          Obese Morbidly Obese  58-99 TRUE
## 16153 38.00822  41.048875   1.0800000          Obese Morbidly Obese  18-31 TRUE
## 16874 26.57845 100.407477   3.7777778     Overweight Morbidly Obese  18-31 TRUE
## 17534 14.54931   8.128107   0.5586592    Underweight    Underweight  18-31 TRUE

Task

Use the boolean drop to filter the cdc dataframe and run summary again.

Solution

cdc = cdc %>% 
  filter(!drop)

summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4656   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6966   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5673   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire        age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68   Min.   :18.00   m: 9561  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130   1st Qu.:31.00   f:10430  
##  Median :67.00   Median :165.0   Median :150   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175   3rd Qu.:57.00            
##  Max.   :84.00   Max.   :500.0   Max.   :350   Max.   :99.00            
##       BMI            BMIDes       DesActRatio                BMICat    
##  Min.   :12.40   Min.   :10.44   Min.   :0.2667   Underweight   : 270  
##  1st Qu.:22.71   1st Qu.:21.73   1st Qu.:0.8710   Normal        :8461  
##  Median :25.60   Median :23.73   Median :0.9444   Overweight    :7295  
##  Mean   :26.30   Mean   :23.96   Mean   :0.9265   Obese         :3535  
##  3rd Qu.:28.89   3rd Qu.:25.80   3rd Qu.:1.0000   Morbidly Obese: 430  
##  Max.   :73.09   Max.   :54.86   Max.   :1.9681                        
##           BMIDesCat       ageCat        drop        
##  Underweight   :  225   18-31:5081   Mode :logical  
##  Normal        :12503   32-43:5262   FALSE:19991    
##  Overweight    : 6451   44-57:4786                  
##  Obese         :  798   58-99:4862                  
##  Morbidly Obese:   14                               
##