Introduction

In this code-through, I will show a simple and practical workflow for cleaning a dataset in R using dplyr. The goal is to make the logic clear and easy to follow, even for someone new to R.

I will use the starwars dataset that ships with dplyr because it is small, easy to read, and contains enough missing values to demonstrate the most common data-cleaning steps.

We will walk through five things:

  1. Looking at the raw dataset

  2. Selecting and renaming columns

  3. Filtering rows and handling missing values

  4. Creating new variables

  5. Visualizing the cleaned data

Each step includes a short explanation and the actual R code needed.

Step 1: Load and Inspect the Data

Before cleaning anything, we should always check what the raw dataset looks like.

#Load dplyr (provides the starwars dataset, the %>% pipe, and the verbs used below)

library(dplyr)

#Load the dataset

data <- starwars

#Print first rows

head(data)
## # A tibble: 6 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
#Print column names

colnames(data)
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
## [11] "species"    "films"      "vehicles"   "starships"
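
Since missing values drive several of the later steps, it can help to count them up front. The following is a minimal sketch, not part of the original workflow; it assumes dplyr 1.0 or later (for across()), and the name na_counts is only illustrative.

```r
library(dplyr)

# Count missing values in the columns used later in this code-through.
# across() applies the same summary function to each listed column.
na_counts <- starwars %>%
  summarise(across(c(height, mass, species, homeworld), ~ sum(is.na(.x))))

na_counts
```

The result is a one-row tibble with one count per column, which shows at a glance which columns will lose rows when missing values are filtered out.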

Step 2: Select and Rename Columns

The raw dataset has many columns, but we usually don’t need all of them.
A big part of data cleaning is deciding what to keep.

Below, I select a smaller set of useful variables and also rename them to cleaner names.

# Select only a few useful columns
data_small <- data %>%
  select(name, height, mass, species, homeworld)

# Rename columns to simpler names
data_small <- data_small %>%
  rename(
    weight = mass,
    planet = homeworld
  )

# Print the cleaned columns
head(data_small)
## # A tibble: 6 × 5
##   name           height weight species planet  
##   <chr>           <int>  <dbl> <chr>   <chr>   
## 1 Luke Skywalker    172     77 Human   Tatooine
## 2 C-3PO             167     75 Droid   Tatooine
## 3 R2-D2              96     32 Droid   Naboo   
## 4 Darth Vader       202    136 Human   Tatooine
## 5 Leia Organa       150     49 Human   Alderaan
## 6 Owen Lars         178    120 Human   Tatooine
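
As a side note, select() also accepts new_name = old_name pairs, so the two calls above can be written as one. This is a small sketch of that shorthand; data_small_alt is an illustrative name, not something used elsewhere in this code-through.

```r
library(dplyr)

# select() can rename while selecting: new_name = old_name.
data_small_alt <- starwars %>%
  select(name, height, weight = mass, species, planet = homeworld)

colnames(data_small_alt)
```

Keeping select() and rename() separate, as above, is arguably easier to read for beginners; the combined form is simply shorter.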

Step 3: Filter Rows

After selecting the columns we want, the next step is cleaning the rows.
Filtering lets us remove bad data and keep only the observations that make sense.

Below are a few common examples of filtering.

#Remove rows with missing height or weight

data_clean <- data_small %>%
  filter(!is.na(height), !is.na(weight))

#Keep only characters taller than 150 cm

data_clean <- data_clean %>%
  filter(height > 150)

#Keep only human characters

data_humans <- data_clean %>%
  filter(species == "Human")

#Preview the cleaned dataset

head(data_humans)
## # A tibble: 6 × 5
##   name               height weight species planet  
##   <chr>               <int>  <dbl> <chr>   <chr>   
## 1 Luke Skywalker        172     77 Human   Tatooine
## 2 Darth Vader           202    136 Human   Tatooine
## 3 Owen Lars             178    120 Human   Tatooine
## 4 Beru Whitesun Lars    165     75 Human   Tatooine
## 5 Biggs Darklighter     183     84 Human   Tatooine
## 6 Obi-Wan Kenobi        182     77 Human   Stewjon

In this step, we narrowed the dataset to rows we can actually analyze. We removed rows that were missing height or weight, kept only characters taller than 150 cm, and kept only humans. Note that filter() keeps a row only when its condition is TRUE, so characters whose species is missing are dropped by the last filter as well.

This gives us a clean starting point that avoids errors later and keeps the analysis focused.
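
The three filters can also be written as a single call, because conditions separated by commas are combined with AND. The sketch below rebuilds the selected columns so that it runs on its own; data_humans_alt and row_counts are illustrative names.

```r
library(dplyr)

# Rebuild the selected and renamed columns from Step 2.
data_small <- starwars %>%
  select(name, height, mass, species, homeworld) %>%
  rename(weight = mass, planet = homeworld)

# One filter() call: every condition must be TRUE for a row to be kept.
data_humans_alt <- data_small %>%
  filter(!is.na(height), !is.na(weight), height > 150, species == "Human")

# Compare row counts before and after filtering.
row_counts <- c(
  raw = nrow(data_small),
  humans = nrow(data_humans_alt)
)
row_counts
```

Comparing row counts before and after filtering is a quick way to see how much data each cleaning decision costs.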

Step 4: Create New Variables

Now that we have a clean dataset of humans, we can create new variables that help us analyze the data more easily. Creating new columns is one of the most common and useful things you do in data cleaning.

In this step, we’ll create:

  1. BMI: weight in kilograms divided by the square of height in meters

  2. Height category: a label that is “Tall” for heights above 170 cm and “Short” otherwise

  3. Weight in pounds: weight converted from kilograms to pounds

These new columns make the dataset more informative and easier to use in later analysis.

# Create new variables

data_final <- data_humans %>%
  mutate(
    bmi = weight / (height / 100)^2,                       # Body Mass Index (kg / m^2)
    height_group = ifelse(height > 170, "Tall", "Short"),  # Height category
    weight_lbs = weight * 2.20462                          # Convert kg to lbs
  )

# View the updated dataset

head(data_final)
## # A tibble: 6 × 8
##   name               height weight species planet    bmi height_group weight_lbs
##   <chr>               <int>  <dbl> <chr>   <chr>   <dbl> <chr>             <dbl>
## 1 Luke Skywalker        172     77 Human   Tatooi…  26.0 Tall               170.
## 2 Darth Vader           202    136 Human   Tatooi…  33.3 Tall               300.
## 3 Owen Lars             178    120 Human   Tatooi…  37.9 Tall               265.
## 4 Beru Whitesun Lars    165     75 Human   Tatooi…  27.5 Short              165.
## 5 Biggs Darklighter     183     84 Human   Tatooi…  25.1 Tall               185.
## 6 Obi-Wan Kenobi        182     77 Human   Stewjon  23.2 Tall               170.
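
To check the BMI formula by hand, and to show one possible alternative to ifelse(), here is a short sketch. It swaps in dplyr's case_when(), which is not what the code above uses but scales more gracefully if more than two categories are ever needed. The names luke_bmi, data_groups, and the two height_group_* columns are illustrative.

```r
library(dplyr)

# Worked BMI check for Luke Skywalker: 77 kg at 172 cm.
luke_bmi <- 77 / (172 / 100)^2   # about 26.0, matching the first row above

# Rebuild the filtered human data so this block runs on its own.
data_humans <- starwars %>%
  select(name, height, mass, species, homeworld) %>%
  rename(weight = mass, planet = homeworld) %>%
  filter(!is.na(height), !is.na(weight), height > 150, species == "Human")

# case_when() checks conditions in order; TRUE acts as the fallback.
data_groups <- data_humans %>%
  mutate(
    height_group_ifelse = ifelse(height > 170, "Tall", "Short"),
    height_group_cw = case_when(
      height > 170 ~ "Tall",
      TRUE ~ "Short"
    )
  )
```

Both columns contain the same labels here, because height has no missing values after Step 3.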

Step 5: Visualize the Cleaned Data

Now that the data is fully cleaned and we’ve created new variables, the next step is to show the results visually. A good code-through always includes at least one simple plot so the reader can see the outcome of the cleaning process.

We’ll keep it very beginner-friendly.

In this step, we will:

  1. Plot a histogram of heights

  2. Plot a scatterplot of height vs. weight

  3. Use our cleaned dataset (data_final)

Everything is simple, readable, and easy to understand.

# Histogram: distribution of heights

hist(
  data_final$height,
  main = "Height Distribution of Humans",
  xlab = "Height (cm)",
  col = "skyblue",
  border = "white"
)

# Scatterplot: height vs weight

plot(
  data_final$height,
  data_final$weight,
  main = "Height vs. Weight",
  xlab = "Height (cm)",
  ylab = "Weight (kg)",
  pch = 19,
  col = "darkgreen"
)
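
Plots are easier to trust when paired with a few numeric checks. The sketch below is one possible way to do that, not part of the original workflow: it rebuilds the cleaned data so it runs on its own, prints summaries, and asserts the conditions the Step 3 filters were meant to guarantee. The name height_weight_cor is illustrative.

```r
library(dplyr)

# Rebuild the cleaned dataset from Steps 2 to 4 (BMI only, for brevity).
data_final <- starwars %>%
  select(name, height, mass, species, homeworld) %>%
  rename(weight = mass, planet = homeworld) %>%
  filter(!is.na(height), !is.na(weight), height > 150, species == "Human") %>%
  mutate(bmi = weight / (height / 100)^2)

# Numeric summaries to read alongside the histogram.
summary(data_final$height)
summary(data_final$weight)

# These should all hold if the filters worked as intended;
# stopifnot() raises an error if any condition is FALSE.
stopifnot(
  !anyNA(data_final$height),
  !anyNA(data_final$weight),
  all(data_final$height > 150),
  all(data_final$species == "Human")
)

# A single number summarizing the trend visible in the scatterplot.
height_weight_cor <- cor(data_final$height, data_final$weight)
height_weight_cor
```

If a later change to the cleaning code breaks one of these conditions, stopifnot() fails immediately instead of letting the problem surface in a plot.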

Conclusion

In this short tutorial, we walked through how to clean and prepare a dataset using dplyr. We covered the essential steps: selecting only the columns we need, filtering out bad or missing data, keeping the rows that matter, and creating new variables to make our analysis more meaningful. These are the same steps you would use in almost any real-world data project.

After cleaning the data, we also visualized the results to make sure everything looks reasonable. Good visual checks, such as histograms and scatterplots, help confirm that our cleaning steps worked the way we expected.

Overall, this code-through showed how a simple and clean workflow can turn a messy dataset into something ready for analysis. Anyone, even a beginner, can follow these steps and apply them to their own data.