This module integrates concepts from Module IV (Management), Module VI (Validation), and Module VII (Cleaning). In real-world health data science, these steps are inseparable. We will cover:
Before analysis, data must be loaded into R. Health data often comes
in CSV, Excel, or STATA (.dta) formats.
To practice, we will first generate a simulated “dirty” dataset containing common errors found in medical statistics (duplicates, typos, missing values, and outliers).
# Simulating a dirty dataset
raw_data <- data.frame(
Patient_ID = c(101, 102, 103, 102, 104, 105, 106, 107), # Note duplicate 102
Age = c(25, 340, 28, 340, 45, NA, 12, 65), # 340 is an outlier/error
Gender = c("Male", "female", "F", "female", "MALE", "Male", "F", "Male"), # Inconsistent casing
HbA1c = c(5.4, 9.1, 5.8, 9.1, 15.0, 6.2, 5.5, NA), # Missing values
Visit_Date = c("2023-01-01", "2023-01-02", "2023/01/03", "2023-01-02",
"2023-01-05", "2023-01-06", "2023-01-07", "2023-01-08")
)
# Save this to disk to simulate the import process
write.csv(raw_data, "patient_data_raw.csv", row.names = FALSE)We use the tidyverse (specifically readr),
readxl, and haven packages.
## # A tibble: 6 × 5
## Patient_ID Age Gender HbA1c Visit_Date
## <dbl> <dbl> <chr> <dbl> <date>
## 1 101 25 Male 5.4 2023-01-01
## 2 102 340 female 9.1 2023-01-02
## 3 103 28 F 5.8 2023-01-03
## 4 102 340 female 9.1 2023-01-02
## 5 104 45 MALE 15 2023-01-05
## 6 105 NA Male 6.2 2023-01-06
In medical statistics, STATA is widely used. We use the
haven package.
Data validation involves checking if the data meets the expected structure, constraints, and logic before we attempt to clean or analyze it.
First, check the structure and data types.
## Rows: 8
## Columns: 5
## $ Patient_ID <dbl> 101, 102, 103, 102, 104, 105, 106, 107
## $ Age <dbl> 25, 340, 28, 340, 45, NA, 12, 65
## $ Gender <chr> "Male", "female", "F", "female", "MALE", "Male", "F", "Male"
## $ HbA1c <dbl> 5.4, 9.1, 5.8, 9.1, 15.0, 6.2, 5.5, NA
## $ Visit_Date <date> 2023-01-01, 2023-01-02, 2023-01-03, 2023-01-02, 2023-01-05,…
## Patient_ID Age Gender HbA1c
## Min. :101.0 Min. : 12.0 Length:8 Min. : 5.400
## 1st Qu.:102.0 1st Qu.: 26.5 Class :character 1st Qu.: 5.650
## Median :103.5 Median : 45.0 Mode :character Median : 6.200
## Mean :103.8 Mean :122.1 Mean : 8.014
## 3rd Qu.:105.2 3rd Qu.:202.5 3rd Qu.: 9.100
## Max. :107.0 Max. :340.0 Max. :15.000
## NA's :1 NA's :1
## Visit_Date
## Min. :2023-01-01
## 1st Qu.:2023-01-02
## Median :2023-01-04
## Mean :2023-01-04
## 3rd Qu.:2023-01-06
## Max. :2023-01-08
##
Observations: * Age max is 340
(impossible). * Gender is character type with inconsistent
entries. * HbA1c has NA values.
validateThe validate package allows us to define formal rules
for our data.
library(validate)
# Define validation rules
rules <- validator(
age_range = Age >= 0 & Age <= 120, # Age must be realistic
hba1c_range = HbA1c > 0 & HbA1c < 20, # Biological limits for HbA1c
unique_id = is_unique(Patient_ID), # IDs should be unique
complete_gender = !is.na(Gender) # Gender should not be missing
)
# Confront the data with the rules
check <- confront(df_csv, rules)
# Summarize results
summary(check)## name items passes fails nNA error warning
## 1 age_range 8 5 2 1 FALSE FALSE
## 2 hba1c_range 8 7 0 1 FALSE FALSE
## 3 unique_id 8 6 2 0 FALSE FALSE
## 4 complete_gender 8 8 0 0 FALSE FALSE
## expression
## 1 Age - 0 >= -1e-08 & Age - 120 <= 1e-08
## 2 HbA1c > 0 & HbA1c < 20
## 3 is_unique(Patient_ID)
## 4 !is.na(Gender)
Interpretation: * age_range: We have
failures (Age 340). * unique_id: We have failures
(Duplicate IDs). * hba1c_range: Passes (ignoring NAs).
Now that we have identified the issues, we apply cleaning techniques.
Duplicate records can bias statistical results.
## [1] 1
# Remove duplicates based on all columns
df_clean_step1 <- df_csv %>%
distinct()
# Verify
nrow(df_csv) - nrow(df_clean_step1) # Number of rows removed## [1] 1
In the Gender column, we have “Male”, “female”, “F”,
“MALE”. These need to be standardized.
df_clean_step2 <- df_clean_step1 %>%
mutate(
# Convert to Title Case (e.g., "female" -> "Female")
Gender = str_to_title(Gender),
# Recode "F" to "Female" and "M" to "Male" if necessary
Gender = case_when(
Gender == "F" ~ "Female",
Gender == "M" ~ "Male",
TRUE ~ Gender
),
# Convert to factor
Gender = as.factor(Gender)
)
table(df_clean_step2$Gender)##
## Female Male
## 3 4
We identified an Age of 340. We can handle this by removing it or treating it as missing.
# Visualizing outliers with a Boxplot
boxplot(df_clean_step2$Age, main = "Boxplot of Age (Before Cleaning)")# Handling the outlier: Replace ages > 120 with NA
df_clean_step3 <- df_clean_step2 %>%
mutate(Age = ifelse(Age > 120, NA, Age))
# Check summary again
summary(df_clean_step3$Age)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12 25 28 35 45 65 2
We have missing values in Age and
HbA1c.
df_final <- df_clean_step3 %>%
mutate(
# Impute Age with Median (robust to outliers)
Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
# Impute HbA1c with Mean
HbA1c = ifelse(is.na(HbA1c), mean(HbA1c, na.rm = TRUE), HbA1c)
)
# Verify no NAs remain
colSums(is.na(df_final))## Patient_ID Age Gender HbA1c Visit_Date
## 0 0 0 0 0
Ensure dates are recognized as Date objects, not text.
library(lubridate)
df_final <- df_final %>%
mutate(Visit_Date = ymd(Visit_Date)) # ymd handles 2023-01-01 and 2023/01/01
glimpse(df_final)## Rows: 7
## Columns: 5
## $ Patient_ID <dbl> 101, 102, 103, 104, 105, 106, 107
## $ Age <dbl> 25, 28, 28, 45, 28, 12, 65
## $ Gender <fct> Male, Female, Female, Male, Male, Female, Male
## $ HbA1c <dbl> 5.400000, 9.100000, 5.800000, 15.000000, 6.200000, 5.500000…
## $ Visit_Date <date> 2023-01-01, 2023-01-02, 2023-01-03, 2023-01-05, 2023-01-06,…
Let’s compare our raw data to the processed data.
Raw Data:
| Patient_ID | Age | Gender | HbA1c | Visit_Date |
|---|---|---|---|---|
| 101 | 25 | Male | 5.4 | 2023-01-01 |
| 102 | 340 | female | 9.1 | 2023-01-02 |
| 103 | 28 | F | 5.8 | 2023/01/03 |
| 102 | 340 | female | 9.1 | 2023-01-02 |
| 104 | 45 | MALE | 15.0 | 2023-01-05 |
| 105 | NA | Male | 6.2 | 2023-01-06 |
Cleaned Data:
| Patient_ID | Age | Gender | HbA1c | Visit_Date |
|---|---|---|---|---|
| 101 | 25 | Male | 5.4 | 2023-01-01 |
| 102 | 28 | Female | 9.1 | 2023-01-02 |
| 103 | 28 | Female | 5.8 | 2023-01-03 |
| 104 | 45 | Male | 15.0 | 2023-01-05 |
| 105 | 28 | Male | 6.2 | 2023-01-06 |
| 106 | 12 | Female | 5.5 | 2023-01-07 |
In this module, we have successfully:
This dataset is now ready for the Descriptive Statistics (Module III) and Visualization (Module VIII) phases of the course. ```