In this code-through, I will show a simple and practical workflow for cleaning a dataset in R using dplyr. The goal is to make the logic clear and easy to follow, even for someone new to R.
I will use the starwars dataset that ships with dplyr because it is small, easy to read, and has enough missing values to demonstrate the most common data-cleaning steps.
We will walk through five things:
1. Looking at the raw dataset
2. Selecting and renaming columns
3. Filtering rows, including rows with missing values
4. Creating new variables
5. Visualizing the results
Each step includes a short explanation and the actual R code needed.
Before cleaning anything, we should always check what the raw dataset looks like.
# Load dplyr, which provides the starwars dataset and the %>% pipe
library(dplyr)
# Load the dataset
data <- starwars
# Print the first rows
head(data)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Org… 150 49 brown light brown 19 fema… femin…
## 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
#Print column names
colnames(data)
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "homeworld"
## [11] "species" "films" "vehicles" "starships"
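Before deciding what to keep, it can also help to see how many values are missing in each column. The following is a minimal sketch rather than part of the original workflow; the name `na_counts` is my own choice for illustration.

```r
library(dplyr)

data <- starwars

# Compact overview: one line per column with its type and first values
glimpse(data)

# Count missing values in every column (na_counts is an illustrative name)
na_counts <- vapply(data, function(col) sum(is.na(col)), integer(1))

# Show only the columns that actually contain missing values
na_counts[na_counts > 0]
```

Columns with many missing values are the ones to watch in the filtering step, because dropping their incomplete rows removes the most data.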
The raw dataset has many columns, but we usually don’t need all of
them.
A big part of data cleaning is deciding what to keep.
Below, I select a smaller set of useful variables and also rename them to cleaner names.
# Select only a few useful columns
data_small <- data %>%
  select(name, height, mass, species, homeworld)
# Rename columns to simpler names
data_small <- data_small %>%
  rename(
    weight = mass,
    planet = homeworld
  )
# Print the cleaned columns
head(data_small)
## # A tibble: 6 × 5
## name height weight species planet
## <chr> <int> <dbl> <chr> <chr>
## 1 Luke Skywalker 172 77 Human Tatooine
## 2 C-3PO 167 75 Droid Tatooine
## 3 R2-D2 96 32 Droid Naboo
## 4 Darth Vader 202 136 Human Tatooine
## 5 Leia Organa 150 49 Human Alderaan
## 6 Owen Lars 178 120 Human Tatooine
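As a side note, `select()` also accepts `new_name = old_name` pairs, so the two steps above can be combined. This is a sketch of an equivalent alternative, not a replacement for the step-by-step version:

```r
library(dplyr)

# Select and rename in a single call: weight = mass, planet = homeworld
data_small <- starwars %>%
  select(name, height, weight = mass, species, planet = homeworld)

names(data_small)
```

Keeping `select()` and `rename()` separate, as in the main text, can be easier to read when many columns are involved.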
After selecting the columns we want, the next step is cleaning the
rows.
Filtering lets us remove bad data and keep only the observations that
make sense.
Below are a few common examples of filtering.
# Remove rows with missing height or weight
data_clean <- data_small %>%
  filter(!is.na(height), !is.na(weight))
# Keep only characters taller than 150 cm
data_clean <- data_clean %>%
  filter(height > 150)
# Keep only human characters
data_humans <- data_clean %>%
  filter(species == "Human")
# Preview the cleaned dataset
head(data_humans)
## # A tibble: 6 × 5
## name height weight species planet
## <chr> <int> <dbl> <chr> <chr>
## 1 Luke Skywalker 172 77 Human Tatooine
## 2 Darth Vader 202 136 Human Tatooine
## 3 Owen Lars 178 120 Human Tatooine
## 4 Beru Whitesun Lars 165 75 Human Tatooine
## 5 Biggs Darklighter 183 84 Human Tatooine
## 6 Obi-Wan Kenobi 182 77 Human Stewjon
In this step, we cleaned the dataset to make sure we only work with reliable, meaningful data. We removed rows that were missing height or weight, kept characters taller than 150 cm, and filtered the data so we only analyze humans.
This gives us a clean starting point that avoids errors later and keeps the analysis focused.
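One detail worth knowing: `filter()` keeps only the rows where a condition is `TRUE`, so rows where the comparison evaluates to `NA` are dropped as well. The sketch below checks this on the same columns; the object names `implicit` and `explicit` are my own and only serve the illustration.

```r
library(dplyr)

data_small <- starwars %>%
  select(name, height, weight = mass, species, planet = homeworld)

# Comparing a missing height with 150 gives NA, and filter() drops that row
implicit <- data_small %>% filter(height > 150)
explicit <- data_small %>% filter(!is.na(height), height > 150)

# Both versions keep the same number of rows
nrow(implicit) == nrow(explicit)

# The same applies to species == "Human": characters with an unknown
# species are removed. Inspect them before deciding whether that is acceptable.
data_small %>% filter(is.na(species))
```

Writing the `!is.na()` check explicitly, as in the main text, is still a reasonable choice because it makes the intent visible to the reader.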
Now that we have a clean dataset of humans, we can create new variables that help us analyze the data more easily. Creating new columns is one of the most common and useful things you do in data cleaning.
In this step, we’ll create:
- BMI: weight in kilograms divided by the square of height in meters
- Height category: “Tall” for heights above 170 cm, otherwise “Short”
- Weight in pounds: kilograms multiplied by 2.20462
These new columns make the dataset more informative and easier to use in later analysis.
# Create new variables
data_final <- data_humans %>%
  mutate(
    bmi = weight / (height / 100)^2, # Body Mass Index
    height_group = ifelse(height > 170, "Tall", "Short"), # Height category
    weight_lbs = weight * 2.20462 # Convert kg → lbs
  )
# View the updated dataset
head(data_final)
## # A tibble: 6 × 8
## name height weight species planet bmi height_group weight_lbs
## <chr> <int> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 Luke Skywalker 172 77 Human Tatooi… 26.0 Tall 170.
## 2 Darth Vader 202 136 Human Tatooi… 33.3 Tall 300.
## 3 Owen Lars 178 120 Human Tatooi… 37.9 Tall 265.
## 4 Beru Whitesun Lars 165 75 Human Tatooi… 27.5 Short 165.
## 5 Biggs Darklighter 183 84 Human Tatooi… 25.1 Tall 185.
## 6 Obi-Wan Kenobi 182 77 Human Stewjon 23.2 Tall 170.
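If two categories are not enough, `case_when()` is one way to extend the same idea. The sketch below is only an illustration: the column name `height_group3` and the 165 cm and 185 cm cut-offs are assumptions of mine, not values from the original analysis.

```r
library(dplyr)

# Rebuild the cleaned human subset so the example runs on its own
data_humans <- starwars %>%
  select(name, height, weight = mass, species, planet = homeworld) %>%
  filter(!is.na(height), !is.na(weight), height > 150, species == "Human")

# Hypothetical three-level grouping; conditions are checked top to bottom
data_groups <- data_humans %>%
  mutate(
    height_group3 = case_when(
      is.na(height) ~ NA_character_, # keep missing heights missing
      height < 165  ~ "Short",
      height < 185  ~ "Medium",
      TRUE          ~ "Tall"         # everything that is left
    )
  )

# Number of characters in each group
count(data_groups, height_group3)
```

The explicit `is.na()` branch matters in general: without it, a missing height would fall through to the final `TRUE` branch and be labelled "Tall".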
Now that the data is fully cleaned and we’ve created new variables, the next step is to show the results visually. A good code-through always includes at least one simple plot so the reader can see the outcome of the cleaning process.
We’ll keep it very beginner-friendly.
In this step, we will use our cleaned dataset (data_final) to:
- Plot a histogram of heights
- Plot a scatterplot of height vs. weight
Everything is simple, readable, and easy to understand.
# Histogram: distribution of heights
hist(
  data_final$height,
  main = "Height Distribution of Humans",
  xlab = "Height (cm)",
  col = "skyblue",
  border = "white"
)
# Scatterplot: height vs weight
plot(
  data_final$height,
  data_final$weight,
  main = "Height vs. Weight",
  xlab = "Height (cm)",
  ylab = "Weight (kg)",
  pch = 19,
  col = "darkgreen"
)
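Plots are easier to trust when a few numbers back them up. The following sketch, which is my own addition, rebuilds the cleaned data in one pipeline and summarises it by height group; `height_summary` is an illustrative name.

```r
library(dplyr)

# Rebuild the cleaned dataset so the example runs on its own
data_final <- starwars %>%
  select(name, height, weight = mass, species, planet = homeworld) %>%
  filter(!is.na(height), !is.na(weight), height > 150, species == "Human") %>%
  mutate(
    bmi = weight / (height / 100)^2,
    height_group = ifelse(height > 170, "Tall", "Short")
  )

# Group sizes and typical values, to compare against the plots
height_summary <- data_final %>%
  group_by(height_group) %>%
  summarise(
    n = n(),
    min_height = min(height),
    max_height = max(height),
    mean_weight = mean(weight)
  )
height_summary

# The filters should leave no missing values and nobody at or below 150 cm
stopifnot(
  !anyNA(data_final$height),
  !anyNA(data_final$weight),
  all(data_final$height > 150)
)
```

If one of the `stopifnot()` conditions fails, the script stops with an error, which makes a broken cleaning step hard to miss.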
# Conclusion
In this short tutorial, we walked through how to clean and prepare a dataset using dplyr. We covered the essential steps: selecting only the columns we need, filtering out rows with missing or unwanted values, and creating new variables to make our analysis more meaningful. These are the same steps you would use in almost any real-world data project.
After cleaning the data, we also visualized the results to make sure everything looks reasonable. Good visual checks, such as histograms and scatterplots, help confirm that our cleaning steps worked the way we expected.
Overall, this code-through showed how a simple and clean workflow can turn a messy dataset into something ready for analysis. Anyone, even a beginner, can follow these steps and apply them to their own data.