Introduction

This module integrates concepts from Module IV (Management), Module VI (Validation), and Module VII (Cleaning). In real-world health data science, these steps are inseparable. We will cover:

  1. Data Management: Importing/Exporting data (CSV, Excel, STATA).
  2. Data Validation: Checking integrity, consistency, and rules.
  3. Data Cleaning: Handling duplicates, outliers, string normalization, and missing values.

Part 1: Data Management (Importing & Exporting)

Before analysis, data must be loaded into R. Health data often comes in CSV, Excel, or STATA (.dta) formats.

1.1 Creating a “Dirty” Dataset for Demonstration

To practice, we will first generate a simulated “dirty” dataset containing common errors found in medical statistics (duplicates, typos, missing values, and outliers).

# Simulating a dirty dataset
raw_data <- data.frame(
  Patient_ID = c(101, 102, 103, 102, 104, 105, 106, 107), # Note duplicate 102
  Age = c(25, 340, 28, 340, 45, NA, 12, 65),              # 340 is an outlier/error
  Gender = c("Male", "female", "F", "female", "MALE", "Male", "F", "Male"), # Inconsistent casing
  HbA1c = c(5.4, 9.1, 5.8, 9.1, 15.0, 6.2, 5.5, NA),      # Missing values
  Visit_Date = c("2023-01-01", "2023-01-02", "2023/01/03", "2023-01-02", 
                 "2023-01-05", "2023-01-06", "2023-01-07", "2023-01-08")
)

# Save this to disk to simulate the import process
write.csv(raw_data, "patient_data_raw.csv", row.names = FALSE)

1.2 Importing Data

We use the tidyverse (specifically readr), readxl, and haven packages.

Importing CSV

library(readr)
df_csv <- read_csv("patient_data_raw.csv")
head(df_csv)
## # A tibble: 6 × 5
##   Patient_ID   Age Gender HbA1c Visit_Date
##        <dbl> <dbl> <chr>  <dbl> <date>    
## 1        101    25 Male     5.4 2023-01-01
## 2        102   340 female   9.1 2023-01-02
## 3        103    28 F        5.8 2023-01-03
## 4        102   340 female   9.1 2023-01-02
## 5        104    45 MALE    15   2023-01-05
## 6        105    NA Male     6.2 2023-01-06

Importing STATA Files (.dta)

In medical statistics, STATA is widely used. We use the haven package.

library(haven)
# Syntax example (file doesn't exist in this session, but this is the code)
df_stata <- read_dta("dataset.dta")

Importing Excel Files

library(readxl)
# Syntax example
df_excel <- read_excel("dataset.xlsx", sheet = 1)

1.3 Exporting Data

After cleaning, you often need to save data back to a specific format.

# Export to CSV
write_csv(df_csv, "cleaned_data.csv")

# Export to STATA (useful for sharing with colleagues using STATA)
write_dta(df_csv, "cleaned_data.dta")

Part 2: Data Validation

Data validation involves checking if the data meets the expected structure, constraints, and logic before we attempt to clean or analyze it.

2.1 Structural Inspection

First, check the structure and data types.

glimpse(df_csv)
## Rows: 8
## Columns: 5
## $ Patient_ID <dbl> 101, 102, 103, 102, 104, 105, 106, 107
## $ Age        <dbl> 25, 340, 28, 340, 45, NA, 12, 65
## $ Gender     <chr> "Male", "female", "F", "female", "MALE", "Male", "F", "Male"
## $ HbA1c      <dbl> 5.4, 9.1, 5.8, 9.1, 15.0, 6.2, 5.5, NA
## $ Visit_Date <date> 2023-01-01, 2023-01-02, 2023-01-03, 2023-01-02, 2023-01-05,…
summary(df_csv)
##    Patient_ID         Age           Gender              HbA1c       
##  Min.   :101.0   Min.   : 12.0   Length:8           Min.   : 5.400  
##  1st Qu.:102.0   1st Qu.: 26.5   Class :character   1st Qu.: 5.650  
##  Median :103.5   Median : 45.0   Mode  :character   Median : 6.200  
##  Mean   :103.8   Mean   :122.1                      Mean   : 8.014  
##  3rd Qu.:105.2   3rd Qu.:202.5                      3rd Qu.: 9.100  
##  Max.   :107.0   Max.   :340.0                      Max.   :15.000  
##                  NA's   :1                          NA's   :1       
##    Visit_Date        
##  Min.   :2023-01-01  
##  1st Qu.:2023-01-02  
##  Median :2023-01-04  
##  Mean   :2023-01-04  
##  3rd Qu.:2023-01-06  
##  Max.   :2023-01-08  
## 

Observations: * Age max is 340 (impossible). * Gender is character type with inconsistent entries. * HbA1c has NA values.

2.2 Rule-Based Validation using validate

The validate package allows us to define formal rules for our data.

library(validate)

# Define validation rules
rules <- validator(
  age_range = Age >= 0 & Age <= 120,    # Age must be realistic
  hba1c_range = HbA1c > 0 & HbA1c < 20, # Biological limits for HbA1c
  unique_id = is_unique(Patient_ID),    # IDs should be unique
  complete_gender = !is.na(Gender)      # Gender should not be missing
)

# Confront the data with the rules
check <- confront(df_csv, rules)

# Summarize results
summary(check)
##              name items passes fails nNA error warning
## 1       age_range     8      5     2   1 FALSE   FALSE
## 2     hba1c_range     8      7     0   1 FALSE   FALSE
## 3       unique_id     8      6     2   0 FALSE   FALSE
## 4 complete_gender     8      8     0   0 FALSE   FALSE
##                               expression
## 1 Age - 0 >= -1e-08 & Age - 120 <= 1e-08
## 2                 HbA1c > 0 & HbA1c < 20
## 3                  is_unique(Patient_ID)
## 4                         !is.na(Gender)

Interpretation: * age_range: We have failures (Age 340). * unique_id: We have failures (Duplicate IDs). * hba1c_range: Passes (ignoring NAs).


Part 3: Data Cleaning

Now that we have identified the issues, we apply cleaning techniques.

3.1 Handling Duplicates

Duplicate records can bias statistical results.

# Check for duplicates
sum(duplicated(df_csv))
## [1] 1
# Remove duplicates based on all columns
df_clean_step1 <- df_csv %>%
  distinct()

# Verify
nrow(df_csv) - nrow(df_clean_step1) # Number of rows removed
## [1] 1

3.2 String Normalization (Categorical Data)

In the Gender column, we have “Male”, “female”, “F”, “MALE”. These need to be standardized.

df_clean_step2 <- df_clean_step1 %>%
  mutate(
    # Convert to Title Case (e.g., "female" -> "Female")
    Gender = str_to_title(Gender),
    # Recode "F" to "Female" and "M" to "Male" if necessary
    Gender = case_when(
      Gender == "F" ~ "Female",
      Gender == "M" ~ "Male",
      TRUE ~ Gender
    ),
    # Convert to factor
    Gender = as.factor(Gender)
  )

table(df_clean_step2$Gender)
## 
## Female   Male 
##      3      4

3.3 Handling Outliers

We identified an Age of 340. We can handle this by removing it or treating it as missing.

# Visualizing outliers with a Boxplot
boxplot(df_clean_step2$Age, main = "Boxplot of Age (Before Cleaning)")

# Handling the outlier: Replace ages > 120 with NA
df_clean_step3 <- df_clean_step2 %>%
  mutate(Age = ifelse(Age > 120, NA, Age))

# Check summary again
summary(df_clean_step3$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      12      25      28      35      45      65       2

3.4 Handling Missing Values (Imputation)

We have missing values in Age and HbA1c.

Visualizing Missingness

library(naniar)
gg_miss_var(df_clean_step3)

Imputation Strategies

  1. Deletion: Remove rows with NA (only if data is Missing Completely At Random and N is large).
  2. Mean/Median Imputation: Replace NA with the average (common for continuous variables).
df_final <- df_clean_step3 %>%
  mutate(
    # Impute Age with Median (robust to outliers)
    Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
    
    # Impute HbA1c with Mean
    HbA1c = ifelse(is.na(HbA1c), mean(HbA1c, na.rm = TRUE), HbA1c)
  )

# Verify no NAs remain
colSums(is.na(df_final))
## Patient_ID        Age     Gender      HbA1c Visit_Date 
##          0          0          0          0          0

3.5 Date Formatting

Ensure dates are recognized as Date objects, not text.

library(lubridate)

df_final <- df_final %>%
  mutate(Visit_Date = ymd(Visit_Date)) # ymd handles 2023-01-01 and 2023/01/01

glimpse(df_final)
## Rows: 7
## Columns: 5
## $ Patient_ID <dbl> 101, 102, 103, 104, 105, 106, 107
## $ Age        <dbl> 25, 28, 28, 45, 28, 12, 65
## $ Gender     <fct> Male, Female, Female, Male, Male, Female, Male
## $ HbA1c      <dbl> 5.400000, 9.100000, 5.800000, 15.000000, 6.200000, 5.500000…
## $ Visit_Date <date> 2023-01-01, 2023-01-02, 2023-01-03, 2023-01-05, 2023-01-06,…

Summary: Before vs. After

Let’s compare our raw data to the processed data.

Raw Data:

kable(head(raw_data), caption = "Raw Dirty Data")
Raw Dirty Data
Patient_ID Age Gender HbA1c Visit_Date
101 25 Male 5.4 2023-01-01
102 340 female 9.1 2023-01-02
103 28 F 5.8 2023/01/03
102 340 female 9.1 2023-01-02
104 45 MALE 15.0 2023-01-05
105 NA Male 6.2 2023-01-06

Cleaned Data:

kable(head(df_final), caption = "Cleaned, Validated Data")
Cleaned, Validated Data
Patient_ID Age Gender HbA1c Visit_Date
101 25 Male 5.4 2023-01-01
102 28 Female 9.1 2023-01-02
103 28 Female 5.8 2023-01-03
104 45 Male 15.0 2023-01-05
105 28 Male 6.2 2023-01-06
106 12 Female 5.5 2023-01-07

Conclusion

In this module, we have successfully:

  1. Imported a messy dataset.
  2. Validated it against logic rules to find errors.
  3. Cleaned it by removing duplicates, fixing strings, handling outliers, and imputing missing values.

This dataset is now ready for the Descriptive Statistics (Module III) and Visualization (Module VIII) phases of the course. ```