IT408 / IT408 SC:
Data Mining

Unit 3: Data cleansing in R

R Batzinger

2026-06-08

Course Textbook:

Hadley Wickham,Mine Cetinkaya-Rundel and Garrett Grolemund, R for data science: import, tidy, transform, visualize, and model data. 2nd Edition, O’Reilly Press

Schedule

  • June
Su Mn Tu We Th Fr Sa
1 2 3 4 5 6
7 [8] 9 10 [11] 12 13
14 [15] 16 17 [18] 19 20
21 [22] 23 24 [25] 26 27
28 [29] 30
  • July
Su Mn Tu We Th Fr Sa
(1)* [[2]] 3 4
5 [6] 7 (8)* [9] 10 11
12 [13] 14 (15)* [16]L 17 18
21 [20] 21 22 [23] [[24]] 25
26 27 28 29 30 31
* IT408 Special Studies; L - Lab test

Tidyverse

A framework to support Data Science work flow.

Penguins

  • Adélie penguins: (Pygoscelis adeliae) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5.

  • Gentoo penguins: (Pygoscelis papua) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 5.

  • Chinstrap penguins: (Pygoscelis antarcticus) nesting along the Palmer Archipelago near Palmer Station, 2007-2009 ver 6.

Penguin Database

str(penguins)
'data.frame':   344 obs. of  8 variables:
 $ species    : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island     : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_len   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_dep   : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_len: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass  : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex        : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year       : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
print("--------------------")
[1] "--------------------"
summary(penguins)
      species          island       bill_len        bill_dep    
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NAs    :2       NAs    :2      
  flipper_len      body_mass        sex           year     
 Min.   :172.0   Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0   1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0   Median :4050   NAs   : 11   Median :2008  
 Mean   :200.9   Mean   :4202                Mean   :2008  
 3rd Qu.:213.0   3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0   Max.   :6300                Max.   :2009  
 NAs    :2       NAs    :2                                 

Tinyverse

<theme> List of 2
 $ plot.title: <ggplot2::element_text>
  ..@ family       : NULL
  ..@ face         : chr "bold"
  ..@ italic       : chr NA
  ..@ fontweight   : num NA
  ..@ fontwidth    : num NA
  ..@ colour       : NULL
  ..@ size         : num 14
  ..@ hjust        : NULL
  ..@ vjust        : NULL
  ..@ angle        : NULL
  ..@ lineheight   : NULL
  ..@ margin       : NULL
  ..@ debug        : NULL
  ..@ inherit.blank: logi FALSE
 $ strip.text: <ggplot2::element_text>
  ..@ family       : NULL
  ..@ face         : chr "bold"
  ..@ italic       : chr NA
  ..@ fontweight   : num NA
  ..@ fontwidth    : num NA
  ..@ colour       : NULL
  ..@ size         : num 11
  ..@ hjust        : NULL
  ..@ vjust        : NULL
  ..@ angle        : NULL
  ..@ lineheight   : NULL
  ..@ margin       : NULL
  ..@ debug        : NULL
  ..@ inherit.blank: logi FALSE
 @ complete: logi FALSE
 @ validate: logi TRUE

Data Anomalies

  • Omissions: missing data
  • Noise: random noise introduced into the signal
  • Interference: some change in the baseline
  • Glitch: a secondary pattern added to the signal
  • Wrong recording: a mixup in the sensor input

Handling Missing Values

  • Imputation: Filling in missing values using statistical metrics like the mean, median, or mode (for numerical data) or a placeholder like “Unknown” (for categorical data).

  • Deletion: Dropping rows or columns with missing data entirely, though this is usually done as a last resort if a significant chunk of data is missing.

  • Prediction: Using machine learning models to predict and fill in the missing values based on other available data.

Removing Duplicates

Duplicate entries skew analysis by artificially inflating the weight of certain data points.

  • Exact Duplicates: Identifying and removing rows that are 100% identical.

  • Partial Duplicates: Catching entries that represent the same entity but have slight variations (e.g., the same customer signing up twice with different emails, or two entries for John Smith at the same address).

Structural and Format Standardization

  • Data Type Conversion: Ensuring numbers are stored as floats/integers, dates as datetime objects, and text as strings. (e.g., converting “100” from text to the number 100).

  • String Uniformity: Standardizing text casing (e.g., converting all entries to lowercase), removing accidental leading or trailing whitespaces, and unifying date formats (e.g., converting everything to YYYY-MM-DD).

  • Categorical Unification: Mapping synonymous labels to a single standard value (e.g., collapsing N/A, Not Applicable, and n/a into a single NA label).

Detecting and Correcting Outliers

  • Investigation: Determining if the outlier is a legitimate, extreme real-world anomaly or simply a data entry error (e.g., a person’s age entered as 320 instead of 32).

  • Cap/Floor Adjustment: Resetting extreme values to a designated maximum or minimum threshold if they skew statistical models too aggressively.

Validating and Verifying Constraints

This operation checks that the data makes logical sense according to real-world rules.

  • Cross-field Validation: Making sure fields don’t contradict each other (e.g., a Delivery Date cannot occur before a Purchase Date).

  • Range Checks: Ensuring values fall within expected bounds (e.g., a monthly satisfaction score must be between 1 and 10).

Dropping Irrelevant Data

Not all collected data is useful for the specific problem you are trying to solve.

  • Feature Selection: Strip away columns that don’t add value to your specific analysis (e.g., dropping IP addresses if you are analyzing overall sales trends) to reduce processing time and noise.

Why It Matters

  • Skipping these steps leads to the classic “Garbage In, Garbage Out” dilemma.

  • Dirty data will produce flawed products even from the most sophisticated machine learning algorithms or statistical models.

A dirty data set

# Load the required library
library(tidyverse)

dirty_data <- tibble(
  Customer_ID = c(101, 102, 103, 104, 101, 105, 106),
  Name        = c("john smith", "Jane Doe ", "Alice ", "Bob Costas", "john smith", "Charlie", "Error_User"),
  Age         = c(34, NA, 28, 250, 34, 42, -5),  # Contains NA, an extreme outlier (250), and a negative age
  Join_Date   = c("2023-01-15", "15/01/2023", "01-02-2023", "2023-03-10", "2023-01-15", NA, "2023-05-12"),
  Status      = c("Active", "active", "PENDING", "Active", "Active", "N/A", "Active"),
  IP_Address  = c("192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.1.4", "192.168.1.1", "192.168.1.5", "192.168.1.6")
)

print(dirty_data)
# A tibble: 7 × 6
  Customer_ID Name           Age Join_Date  Status  IP_Address 
        <dbl> <chr>        <dbl> <chr>      <chr>   <chr>      
1         101 "john smith"    34 2023-01-15 Active  192.168.1.1
2         102 "Jane Doe "     NA 15/01/2023 active  192.168.1.2
3         103 "Alice "        28 01-02-2023 PENDING 192.168.1.3
4         104 "Bob Costas"   250 2023-03-10 Active  192.168.1.4
5         101 "john smith"    34 2023-01-15 Active  192.168.1.1
6         105 "Charlie"       42 <NA>       N/A     192.168.1.5
7         106 "Error_User"    -5 2023-05-12 Active  192.168.1.6

Processing Pipe Characters

Cleansing processes can be chained by adding pipe characters between function calls

  • |>
  • %>%

The chain will be executed sequentually from left to right.

AN DATA CLEANSING PIPELINE

library(lubridate)

clean_data <- dirty_data %>%
  
  distinct() |> # Remove Exact Duplicates
  
  filter(Name != "Error_User") |> # Remove false entry
  
  select(-IP_Address) |> # Drop Irrelevant Data 
  
  mutate(
 
    Name = str_to_title(str_trim(Name)), # trim off white space
    Status = str_to_title(Status), # 
    
    Status = if_else(Status == "N/A", NA_character_, Status) # ap 'N/A' to true R NA
    ) |> 
  
  mutate(
    Age = if_else(Age < 0 | Age > 110, NA_real_, Age) # Fix bad Age values
  ) |>
  
  mutate(
    Join_Date = parse_date_time(Join_Date,
      orders = c("Y-m-d", "d-m-Y", "d/m/Y","m-d-Y")) %>% as.Date()
    ) |>
  
  mutate(
    Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age)
  ) |>
  
  filter(!is.na(Status)) 

print(clean_data)
# A tibble: 4 × 5
  Customer_ID Name         Age Join_Date  Status 
        <dbl> <chr>      <dbl> <date>     <chr>  
1         101 John Smith    34 2023-01-15 Active 
2         102 Jane Doe      34 2023-01-15 Active 
3         103 Alice         28 2023-02-01 Pending
4         104 Bob Costas    34 2023-03-10 Active 

Another bad dataset

# A tibble: 5 × 15
     id category date      jan     feb   mar   apr   may   jun   jul   aug   sep
  <dbl> <chr>    <chr>     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 " A "    2024/01/… "10"     NA    12    11     9    10    11    10    12
2     1 "A"      2024-01-… "10"     NA    12    11     9    10    11    10    12
3     2 "b"      15-02-20… " 5"      6     7    NA     6     7     8     7     6
4     3 "B "     2024 Mar… "3"       4  -999     4     4     5     6     5     4
5     4 " c"     2024.04.… "8"       9    10     9     8     9    10     9     8
# ℹ 3 more variables: oct <chr>, nov <dbl>, dec <dbl>

Cleanup pipe

Goals:

  • Trim whitespace and standardize category values (e.g., ” A ” / “A” / “A” -> “A”, “b” / “B” -> “B”, ” c” -> “C”)
  • Fix messy date into a proper Date
  • Remove duplicates
  • Pivot the 12 month columns into long format (one row per month)
  • Ensure month values are numeric
  • Handle missing values (example: replace with NA-kept; or impute—here we just keep them)
  • Detect/remove/mark outliers (example: set to NA if outside reasonable bounds)

Tidycode

# A tibble: 48 × 7
      id category date       month value month_num month_date
   <dbl> <chr>    <date>     <chr> <dbl>     <int> <date>    
 1     1 A        2024-01-15 jan      10         1 2024-01-01
 2     1 A        2024-01-15 feb      NA         2 2024-02-01
 3     1 A        2024-01-15 mar      12         3 2024-03-01
 4     1 A        2024-01-15 apr      11         4 2024-04-01
 5     1 A        2024-01-15 may       9         5 2024-05-01
 6     1 A        2024-01-15 jun      10         6 2024-06-01
 7     1 A        2024-01-15 jul      11         7 2024-07-01
 8     1 A        2024-01-15 aug      10         8 2024-08-01
 9     1 A        2024-01-15 sep      12         9 2024-09-01
10     1 A        2024-01-15 oct      NA        10 2024-10-01
# ℹ 38 more rows

An exponential pulse

\[y(x, t) = A e^{-\frac{(x - vt)^2}{2\sigma^2}}\]

An exponential pulse with missing data points

Compare estimate vs actual points

day estimate actual
15 0.023113 0.021875
16 0.032527 0.030887
29 1.121285 1.102505
33 2.379680 2.357461
35 3.266742 3.246525
41 6.663398 6.669768
63 4.310317 4.295574
65 3.266742 3.246525
66 2.801993 2.780373
75 0.450911 0.439369
82 0.062533 0.059760
92 0.001602 0.001477