IT408 / IT408 SC:
Data Mining

Unit 3: Data cleansing in R

R Batzinger

2026-06-08

Course Textbook:

Hadley Wickham, Mine Çetinkaya-Rundel and Garrett Grolemund, R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd Edition, O’Reilly Media.

Schedule

  • June
Su   Mn   Tu   We   Th   Fr   Sa
     1    2    3    4    5    6
7    [8]  9    10   [11] 12   13
14   [15] 16   17   [18] 19   20
21   [22] 23   24   [25] 26   27
28   [29] 30
  • July
Su     Mn     Tu     We     Th     Fr     Sa
                     (1)*   [[2]]  3      4
5      [6]    7      (8)*   [9]    10     11
12     [13]   14     (15)*  [16]L  17     18
19     [20]   21     22     [23]   [[24]] 25
26     27     28     29     30     31
* IT408 Special Studies; L - Lab test

Data Anomalies

  • Omissions: missing data
  • Noise: random fluctuations introduced into the signal
  • Interference: some change in the baseline
  • Glitch: a secondary pattern added to the signal
  • Wrong recording: a mixup in the sensor input
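
The five anomaly types above can be sketched on a synthetic signal. This is only an illustration under stated assumptions: the sine-wave signal, the second "sensor", the seed, and all magnitudes are made up for the example, and each anomaly is one possible reading of the definition in the list.

```r
set.seed(408)  # arbitrary seed so the sketch is reproducible

n <- 200
x <- seq(0, 4 * pi, length.out = n)
clean <- sin(x)   # assumed "true" signal
other <- cos(x)   # assumed second sensor, used for the mix-up

# Omissions: 20 readings go missing
omitted <- clean
omitted[sample(n, 20)] <- NA

# Noise: random fluctuations added to every reading
noisy <- clean + rnorm(n, mean = 0, sd = 0.2)

# Interference: the baseline shifts halfway through the recording
shifted <- clean + ifelse(seq_len(n) > n / 2, 1.5, 0)

# Glitch: a secondary, higher-frequency pattern rides on the signal
glitched <- clean + 0.3 * sin(10 * x)

# Wrong recording: readings 81 to 120 come from the other sensor
mixed <- clean
mixed[81:120] <- other[81:120]

# Quick checks on two of the anomalies
sum(is.na(omitted))        # number of missing readings
range(shifted - clean)     # size of the baseline shift
```

Plotting each vector against `x` (for example `plot(x, shifted, type = "l")`) makes the difference between the anomaly types easier to see.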

Tidyverse

A collection of R packages that supports the data science workflow.
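
A minimal sketch of what a tidyverse-style step looks like, assuming the dplyr package is installed and R is version 4.1 or later (for the native `|>` pipe). The `readings` data frame and its column names are hypothetical toy values, not course data.

```r
library(dplyr)

# Hypothetical toy readings with two missing values
readings <- data.frame(
  sensor = c("a", "a", "b", "b", "b"),
  value  = c(1.2, NA, 3.4, 2.9, NA)
)

# Drop missing readings, then summarise per sensor
clean_summary <- readings |>
  filter(!is.na(value)) |>
  group_by(sensor) |>
  summarise(n = n(), mean_value = mean(value), .groups = "drop")

print(clean_summary)
```

Each verb takes a data frame and returns a data frame, which is what lets the steps be chained with the pipe.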

Penguins

  • Adélie penguins: (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5.

  • Gentoo penguins: (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5.

  • Chinstrap penguins: (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6.

Penguin Database

str(penguins)
'data.frame':   344 obs. of  8 variables:
 $ species    : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island     : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_len   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_dep   : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_len: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass  : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex        : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year       : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
print("--------------------")
[1] "--------------------"
summary(penguins)
      species          island       bill_len        bill_dep    
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NAs    :2       NAs    :2      
  flipper_len      body_mass        sex           year     
 Min.   :172.0   Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0   1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0   Median :4050   NAs   : 11   Median :2008  
 Mean   :200.9   Mean   :4202                Mean   :2008  
 3rd Qu.:213.0   3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0   Max.   :6300                Max.   :2009  
 NAs    :2       NAs    :2                                 
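
The NA counts in the summary can also be obtained directly. The sketch below is self-contained: it rebuilds only the first ten rows shown in the str() output above (the name `peng10` is made up for this example) and uses base R. With the full dataset loaded, the same calls should work on `penguins` itself.

```r
# First ten rows, copied from the str() output above
peng10 <- data.frame(
  bill_len    = c(39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42),
  bill_dep    = c(18.7, 17.4, 18, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2),
  flipper_len = c(181L, 186L, 195L, NA, 193L, 190L, 181L, 195L, 193L, 190L),
  body_mass   = c(3750L, 3800L, 3250L, NA, 3450L, 3650L, 3625L, 4675L, 3475L, 4250L),
  sex         = factor(c("male", "female", "female", NA, "female",
                         "male", "female", "male", NA, NA),
                       levels = c("female", "male"))
)

# Missing values per column
na_counts <- colSums(is.na(peng10))
print(na_counts)

# Keep only rows with no missing values
complete_rows <- peng10[complete.cases(peng10), ]
nrow(complete_rows)

# Or keep every row and skip the NAs inside the calculation
mean_bill_len <- mean(peng10$bill_len, na.rm = TRUE)
```

Dropping incomplete rows is the simplest choice, but it also discards rows 9 and 10, where only `sex` is missing; using `na.rm = TRUE` per column keeps those measurements.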

Tinyverse