Getting Started

The examples in this document make use of the tidyverse and janitor packages. If you haven’t installed them, do so using the Packages tab in RStudio.
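
Equivalently, you can install both packages from the console:

install.packages(c("tidyverse", "janitor"))

Once they are installed, we can load them.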

library(tidyverse)
library(janitor)

Missing Data

Missing data is a common challenge in real-world datasets. In the realm of social justice, missing data can have serious implications. Imagine you are examining a dataset on school disciplinary actions. If data on students’ race is missing for a significant portion of the records, your analysis of racial disparities in discipline could be skewed, potentially perpetuating harmful narratives or obscuring the true extent of the problem.

The first step in handling missing data is recognizing how it’s represented in your raw data. Different data sources may encode missing values in various ways. It could be a blank cell (represented as "" in R), a special code like 999, a textual description of missing data, or even a seemingly valid value that’s actually a placeholder for missing data.
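
If you know a source’s missing-value codes in advance, you can often handle them at import time. For example, readr’s read_csv() takes an na argument listing the strings to treat as missing; the file name here is hypothetical:

# Treat blank cells and the code 999 as missing while importing (hypothetical file)
suspensions <- read_csv("suspensions.csv", na = c("", "999"))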

Before doing any substantive analysis, it’s crucial to standardize how missing data is represented. The goal is to ensure all missing values are consistently coded as NA (short for Not Available), which R recognizes as a missing value.

Let’s consider a hypothetical dataset on student suspensions, where missing values for the ‘race’ column are represented as blank cells (""), and missing values for the ‘grade_level’ column are represented as 999. We’ll first create a sample dataframe to illustrate this:

# Create a sample dataframe with missing value representations
student_data <- data.frame(
  race = c("White", "", "Black", "Asian", "Hispanic", "999"),
  grade_level = c(10, 11, 999, 12, 11, "")
)

# Look at the data
student_data
##       race grade_level
## 1    White          10
## 2                   11
## 3    Black         999
## 4    Asian          12
## 5 Hispanic          11
## 6      999
# Clean up missing values
student_data <- student_data %>%
  mutate(race = ifelse(race %in% c("", "999"), NA, race),
         grade_level = as.numeric(ifelse(grade_level %in% c("", "999"), NA, grade_level)))

# Look at the data
student_data
##       race grade_level
## 1    White          10
## 2     <NA>          11
## 3    Black          NA
## 4    Asian          12
## 5 Hispanic          11
## 6     <NA>          NA

It’s important to be aware of how commands in R handle missing values. You can usually find out by reading the help documentation for a command and by experimenting. For example, suppose we try to calculate the mean grade level in our dataset.

student_data %>%
  pull(grade_level) %>%
  mean
## [1] NA

We get the answer NA because the data itself includes NA values, and R propagates missingness through arithmetic: the mean of a set of numbers that includes an unknown value is itself unknown. If we want R to calculate the mean based on the data we have, excluding the missing values, we can do that:

student_data %>%
  pull(grade_level) %>%
  mean(na.rm = TRUE)
## [1] 11

Always be mindful of how missing data is handled in your analysis. Blindly excluding NA values can lead to misleading results if too much data is missing or if the missingness is not random.
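
Before dropping missing values, it helps to know how much is missing. A quick way to count the NA values in each column:

# Count missing values in each column
student_data %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
# For our sample data: 2 missing in race, 2 in grade_level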

Handling Duplicate Data

Duplicate data can arise from data entry errors, merging datasets from multiple sources, or even intentional duplication. How we handle duplication depends on our context and goals, and mishandling it can produce misleading results.

Let’s imagine you are researching the different types of hate speech prevalent on a social media platform and you have data from one user. In this case, you are primarily interested in the unique hateful messages in the content column.

# Create a sample dataset
hate_speech_data <- data.frame(
  content = c("Hateful comment 1", "Hateful comment 2", "Hateful comment 3", "Hateful comment 2", "Hateful comment 4", "Hateful comment 2", "Hateful comment 2"), 
  timestamp = as.POSIXct(c("2023-08-15 10:30:00", "2023-08-16 14:15:00", "2023-08-17 09:45:00", "2023-08-16 18:30:00", "2023-08-18 12:00:00", "2023-08-19 20:20:00", "2023-08-16 14:15:00"))
)

# Look at the content column
hate_speech_data %>%
  pull(content)
## [1] "Hateful comment 1" "Hateful comment 2" "Hateful comment 3"
## [4] "Hateful comment 2" "Hateful comment 4" "Hateful comment 2"
## [7] "Hateful comment 2"
# Identify unique hateful messages (distinct on 'content')
hate_speech_data %>%
  distinct(content) %>%
  pull(content)
## [1] "Hateful comment 1" "Hateful comment 2" "Hateful comment 3"
## [4] "Hateful comment 4"

Now, let’s say you are interested in understanding the frequency of hate speech incidents. Here we care about all the rows, since different people might have posted the same comment at different times. However, there is an error in the data: rows 2 and 7 are identical, possibly due to a technical glitch. We can eliminate that duplication.

# Look at all rows, including duplicates
hate_speech_data
##             content           timestamp
## 1 Hateful comment 1 2023-08-15 10:30:00
## 2 Hateful comment 2 2023-08-16 14:15:00
## 3 Hateful comment 3 2023-08-17 09:45:00
## 4 Hateful comment 2 2023-08-16 18:30:00
## 5 Hateful comment 4 2023-08-18 12:00:00
## 6 Hateful comment 2 2023-08-19 20:20:00
## 7 Hateful comment 2 2023-08-16 14:15:00
# Remove rows that are identical across every column
hate_speech_data %>%
  distinct()
##             content           timestamp
## 1 Hateful comment 1 2023-08-15 10:30:00
## 2 Hateful comment 2 2023-08-16 14:15:00
## 3 Hateful comment 3 2023-08-17 09:45:00
## 4 Hateful comment 2 2023-08-16 18:30:00
## 5 Hateful comment 4 2023-08-18 12:00:00
## 6 Hateful comment 2 2023-08-19 20:20:00
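
The janitor package we loaded earlier also provides get_dupes(), which is handy for inspecting duplicated rows before deciding whether to remove them:

# Show every fully duplicated row, along with how often it occurs
hate_speech_data %>%
  get_dupes()
# Rows 2 and 7 appear here with dupe_count = 2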

Remember, data cleaning is not just about technical skills; it’s about making thoughtful decisions that uphold the integrity of your analysis.

Working with Dates and Times

Dates and times are crucial for understanding trends and patterns over time. In a dataset tracking environmental pollution levels, for example, accurate date and time information allows you to analyze how pollution varies throughout the day, week, or year, and identify potential links to human activity or policy changes.

When you import data into R, dates and times are often read as simple text strings. This means R won’t recognize them as dates or times, preventing you from performing calculations or analyses that involve temporal relationships. Therefore, it’s essential to convert these text strings into appropriate date/time formats that R understands.
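
You can check how R is treating a value with class(). A date stored as text and a converted date print similarly but behave very differently:

class("2023-05-15")            # "character": just text to R
class(as.Date("2023-05-15"))   # "Date": supports date arithmetic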

There are R commands to convert text strings to date objects, but you need to help R by telling it what format your text strings use to represent dates. Let’s see a few examples:

# Example 1: "YYYY-MM-DD" format
date_text1 <- "2023-05-15"
as.Date(date_text1, format = "%Y-%m-%d")
## [1] "2023-05-15"
# Example 2: "MM/DD/YYYY" format
date_text2 <- "08/20/2022"
as.Date(date_text2, format = "%m/%d/%Y")
## [1] "2022-08-20"
# Example 3: "DD-Mon-YYYY" format
date_text3 <- "12-Dec-2021"
as.Date(date_text3, format = "%d-%b-%Y")
## [1] "2021-12-12"
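
If the format codes are hard to remember, the lubridate package (loaded with recent versions of the tidyverse) offers helpers named for the order of the date components, which figure out the separators for you:

# lubridate guesses the separators; you supply the component order
ymd("2023-05-15")    # year, month, day
mdy("08/20/2022")    # month, day, year
dmy("12-Dec-2021")   # day, month, year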

In R, there isn’t a dedicated “time” data type that’s separate from date-times. The POSIXct format stores both date and time information. Let’s do an example where we convert text to this format, and then create some new variables (derived from the date-time variable) that contain just the month and just whether the time was morning or evening.

# Create sample data with date and time
pollution_data <- data.frame(
  date_time_text = c("2023-05-15 10:30:00", "2022-08-20 18:45:00", "2021-12-12 06:15:00", "2020-01-31 23:59:59")
)

# Parse date and time
pollution_data <- pollution_data %>%
  mutate(date_time = as.POSIXct(date_time_text, format = "%Y-%m-%d %H:%M:%S"))

# Extract month
pollution_data <- pollution_data %>%
  mutate(eventmonth = month(date_time, label = TRUE))

# Create a morning/evening column
pollution_data <- pollution_data %>%
  mutate(time_of_day = ifelse(hour(date_time) < 12, "Morning", "Evening"))

print(pollution_data)
##        date_time_text           date_time eventmonth time_of_day
## 1 2023-05-15 10:30:00 2023-05-15 10:30:00        May     Morning
## 2 2022-08-20 18:45:00 2022-08-20 18:45:00        Aug     Evening
## 3 2021-12-12 06:15:00 2021-12-12 06:15:00        Dec     Morning
## 4 2020-01-31 23:59:59 2020-01-31 23:59:59        Jan     Evening
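
Once date-times are stored as POSIXct, temporal arithmetic works as you would expect. For example, a quick sketch of the time elapsed between consecutive measurements:

# Hours between each measurement and the one before it (first row is NA)
pollution_data %>%
  arrange(date_time) %>%
  mutate(hours_since_last = as.numeric(difftime(date_time, lag(date_time), units = "hours")))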

Text Data Cleaning and Manipulation

Text data, while rich in information, often requires careful cleaning before it can be used for analysis. Inconsistent capitalization, extra spaces, and punctuation can create challenges when trying to identify and group similar entities. This is especially crucial in social justice contexts, where accurate identification of individuals or groups can be vital for understanding and addressing systemic inequities.

Consider a dataset on court rulings where the judge’s name is recorded. Inconsistencies like “Anita R. Jones”, “ anita r jones ”, and “ANITA R. JONES” would be treated as separate entities if left uncleaned. This can significantly impact your analysis, particularly if you’re interested in examining a judge’s decision-making patterns.

Common cleaning tasks include standardizing capitalization, trimming and collapsing extra whitespace, and removing punctuation.

Let’s create a sample dataset with inconsistent judge names:

# Create sample data
court_data <- data.frame(
  case_id = 1:6,
  judge_name = c("Anita  R.  Jones", "  anita r jones  ", "ANITA R. JONES", "john  d.  o'brian", "  John D. O'Brian  ", "  JOHN  D.  OBRIAN  ")
)
print(court_data)
##   case_id           judge_name
## 1       1     Anita  R.  Jones
## 2       2      anita r jones  
## 3       3       ANITA R. JONES
## 4       4    john  d.  o'brian
## 5       5    John D. O'Brian  
## 6       6   JOHN  D.  OBRIAN
court_data_clean <- court_data %>%
  mutate(judge_name_clean = str_to_lower(judge_name), # Convert to lowercase
         judge_name_clean = str_squish(judge_name_clean), # Trim whitespace
         judge_name_clean = str_remove_all(judge_name_clean, "[[:punct:]]")) # Remove punctuation

print(court_data_clean)
##   case_id           judge_name judge_name_clean
## 1       1     Anita  R.  Jones    anita r jones
## 2       2      anita r jones      anita r jones
## 3       3       ANITA R. JONES    anita r jones
## 4       4    john  d.  o'brian    john d obrian
## 5       5    John D. O'Brian      john d obrian
## 6       6   JOHN  D.  OBRIAN      john d obrian

In this example, all variations of “Anita R. Jones” and “John D. O’Brian” are now standardized, enabling accurate grouping and analysis. By investing time in text cleaning, you lay a strong foundation for subsequent analysis steps, ensuring that your findings are grounded in reliable and consistent data.
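
For example, counting cases per judge now gives the totals you would expect:

# Count cases per cleaned judge name
court_data_clean %>%
  count(judge_name_clean)
# Each judge now appears once, with n = 3 cases apiece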

Transforming Variable Types

In R, the way data appears can be deceiving. Just because a value looks like a number or a logical value doesn’t mean R will automatically treat it as such. The underlying data type (numeric, character, logical, or factor) dictates how R handles and interprets that value. This distinction is critical: incorrect or mismatched data types can lead to errors, inaccurate analyses, and misleading conclusions.

For instance, a dataset on incarceration rates might have a column labeled “recidivism” with values like “True” and “False”. While these seem like values of type logical, if imported as text, R will treat them as character strings, hindering logical operations and accurate summarization of recidivism rates. Similarly, an “age” column with values like “25” and “30” might appear numeric, but if stored as characters, calculations like average age become impossible.

To ensure accuracy, always check variable classes when you import data into R, then make careful choices about converting them. Arguably one of the more challenging choices is whether to treat data as character (textual) or factor (categorical). Character data is suitable for free-form text like comments or descriptions without predefined categories. Factor data is ideal for categorical data with a limited set of distinct values (levels), such as the set of incarceration facilities within a particular jurisdiction.

Let’s illustrate some work on data class with a hypothetical dataset on healthcare access:

# Create sample data
healthcare_data <- data.frame(
  patient_id = 1:6,
  patient_name = c("Alice", "Bob", "Charlie", "David", "Eva", "Frank"),
  has_insurance = c("True", "False", "True", "False", "True", "False"),
  age = c("25 years old", 30, " 35", " 40 years", '28', '32 years old'),
  race = c("white", "Black", "Asian", "Hispanic", "Black", "white")
)

# Inspect data types
str(healthcare_data)
## 'data.frame':    6 obs. of  5 variables:
##  $ patient_id   : int  1 2 3 4 5 6
##  $ patient_name : chr  "Alice" "Bob" "Charlie" "David" ...
##  $ has_insurance: chr  "True" "False" "True" "False" ...
##  $ age          : chr  "25 years old" "30" " 35" " 40 years" ...
##  $ race         : chr  "white" "Black" "Asian" "Hispanic" ...
# Transform variables to correct types
healthcare_data_clean <- healthcare_data %>%
  mutate(has_insurance = as.logical(has_insurance), 
         age = str_extract(age, "\\d+"), 
         age = as.numeric(age), 
         race = as.factor(race)) 

# Inspect data types again
str(healthcare_data_clean)
## 'data.frame':    6 obs. of  5 variables:
##  $ patient_id   : int  1 2 3 4 5 6
##  $ patient_name : chr  "Alice" "Bob" "Charlie" "David" ...
##  $ has_insurance: logi  TRUE FALSE TRUE FALSE TRUE FALSE
##  $ age          : num  25 30 35 40 28 32
##  $ race         : Factor w/ 4 levels "Asian","Black",..: 4 2 1 3 2 4

We directly converted has_insurance to logical since it already contains only “True” or “False” values (as text). For age, we extracted the numeric part and then converted the digits (as character data) to numeric type. We converted race to a factor, recognizing that our (fake) data uses Census race categories, which are limited to a fixed list. Finally, we left patient_name as character class since it represents free-form text data.
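
With the types corrected, operations that failed or misbehaved before now work as intended:

# Numeric age supports arithmetic
mean(healthcare_data_clean$age)            # about 31.7
# Logical has_insurance summarizes as a proportion
mean(healthcare_data_clean$has_insurance)  # 0.5, i.e., half the patients are insured
# Factors carry a fixed set of levels
levels(healthcare_data_clean$race)         # "Asian" "Black" "Hispanic" "white"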

Remember, ensuring data types are correct is fundamental for accurate and meaningful analysis.

Data Validation

Data validation, often overlooked, is an unsung hero of reliable data analysis. It’s the process of ensuring your data is accurate, consistent, and adheres to logical rules, safeguarding your analysis from inaccuracies that could perpetuate biases or lead to harmful conclusions, especially in the sensitive realm of social justice.

Consider a dataset on police stops. A negative stop duration, or one lasting days, raises red flags and could distort your analysis of police practices and their impact on communities. Similarly, an artist in a dataset about diversity in modern art with a birth year of 974 has probably been recorded incorrectly and could skew the analysis.

The “right” way to validate data will depend strongly on context. That said, a few general checks apply broadly: verify that numeric values fall within plausible ranges, confirm that related fields are logically consistent with one another, and investigate outliers before deciding how to handle them.

Let’s apply this to some (fabricated) police stop data:

# Sample data with potential issues
police_stop_data <- data.frame(
  stop_id = 1:4,
  stop_duration_minutes = c(15, -5, 3600, 45), 
  stop_date = c("2023-09-01", "2023-08-15", "2023-07-10", "2023-06-20"),
  arrest_made = c(FALSE, TRUE, FALSE, TRUE),
  arrest_date = c(NA, "2023-08-16", NA, "2023-05-15") 
)

# Convert to Date format
police_stop_data$stop_date <- as.Date(police_stop_data$stop_date)
police_stop_data$arrest_date <- as.Date(police_stop_data$arrest_date)

# Look at raw data
police_stop_data
##   stop_id stop_duration_minutes  stop_date arrest_made arrest_date
## 1       1                    15 2023-09-01       FALSE        <NA>
## 2       2                    -5 2023-08-15        TRUE  2023-08-16
## 3       3                  3600 2023-07-10       FALSE        <NA>
## 4       4                    45 2023-06-20        TRUE  2023-05-15
# Check for problematic stop durations (negative or implausibly long)
police_stop_data %>%
  filter(stop_duration_minutes < 0 | stop_duration_minutes > 120) 
##   stop_id stop_duration_minutes  stop_date arrest_made arrest_date
## 1       2                    -5 2023-08-15        TRUE  2023-08-16
## 2       3                  3600 2023-07-10       FALSE        <NA>
# Check for invalid arrest dates (before stop date)
police_stop_data %>%
  filter(!is.na(arrest_date) & arrest_made & arrest_date < stop_date)
##   stop_id stop_duration_minutes  stop_date arrest_made arrest_date
## 1       4                    45 2023-06-20        TRUE  2023-05-15
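
As your checks multiply, it can help to record them as flag columns, so each record carries the reasons it failed. A minimal sketch using the two rules above:

# Flag records that fail either validation rule
police_stop_data %>%
  mutate(bad_duration = stop_duration_minutes < 0 | stop_duration_minutes > 120,
         bad_arrest_date = !is.na(arrest_date) & arrest_made & arrest_date < stop_date)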

Data validation isn’t just a technicality; it’s a commitment to accuracy and trustworthiness.

Cleaning Up Variable Names

Variable (column) names might seem like a trivial detail, but they play a crucial role in data analysis. Inconsistent, cryptic, or overly complex column names can hinder understanding, create confusion, and even introduce errors during coding and analysis.

The goal of cleaning column names is to create names that are consistent, descriptive, and easy to type in code: for example, lowercase, with underscores instead of spaces, and free of special characters.

There are two primary scenarios where you might need to clean up column names. First, you might need to standardize inconsistent names. This involves addressing variations in capitalization, spacing, and punctuation to ensure consistency. Second, you might need to rename uninformative names. Sometimes, datasets come with generic or cryptic column names like “V1,” “V2,” or “Column1.” In such cases, it’s essential to rename these columns based on the data dictionary, metadata, or your understanding of the data.

Let’s create a sample dataset with messy column names and clean them up. We’ll use the janitor package.

library(janitor)

# Create sample data with messy column names
pollution_data <- data.frame(
  `PM 2.5 Concentration (µg/m³)` = c(10, 15, 20),
  `NO2 Level (ppb)` = c(30, 40, 50),
  "Measurement Date" = c("2023-09-01", "2023-09-02", "2023-09-03")
)

names(pollution_data)
## [1] "PM.2.5.Concentration..µg.m.." "NO2.Level..ppb."             
## [3] "Measurement.Date"
# Clean column names using janitor
pollution_data <- pollution_data %>%
  clean_names()

names(pollution_data)
## [1] "pm_2_5_concentration_mg_m" "no2_level_ppb"            
## [3] "measurement_date"

Now let’s do an example with uninformative names. Imagine you have a dataset from a survey on public attitudes towards climate change. The original column names might be something like “Q1,” “Q2,” and “Q3,” representing the responses to different survey questions.

# Create sample data with uninformative column names
survey_data <- data.frame(
  Q1 = c("Strongly Agree", "Agree", "Neutral"),
  Q2 = c("Often", "Sometimes", "Rarely"),
  Q3 = c("Yes", "No", "Yes")
)

names(survey_data)
## [1] "Q1" "Q2" "Q3"
# Rename columns based on survey questions (replace with actual question text)
survey_data <- survey_data %>%
  rename(climate_change_concern = Q1,
         frequency_of_climate_discussion = Q2,
         support_for_green_policies = Q3)

names(survey_data)
## [1] "climate_change_concern"          "frequency_of_climate_discussion"
## [3] "support_for_green_policies"
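
If many columns need renaming, a named lookup vector keeps the whole mapping in one place. Applied to the original survey_data (before the rename above), dplyr’s rename() accepts such a vector via all_of():

# new_name = "old_name" pairs collected in a single lookup vector
lookup <- c(climate_change_concern = "Q1",
            frequency_of_climate_discussion = "Q2",
            support_for_green_policies = "Q3")
survey_data %>%
  rename(all_of(lookup))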

Remember, taking the time to clean up your column names is a small but impactful step that can significantly enhance the quality, accessibility, and transparency of your data analysis.