Systemic Data Re-coding and Feature Engineering

Author

Abdullah Al Shamim

Data manipulation isn’t just about cleaning; it’s about re-coding information so that R (and humans) can understand it better. This process is a core part of what is often called the “Tidy Data” workflow.

Phase 1: Dealing with the “Unknown” (Missing Data)

Before doing any math, you must decide what to do with missing values (NA). If you ignore them, functions such as mean() return NA instead of a number.

Using na.rm = TRUE inside a function. This is a “quick fix” for a single calculation.

Code
library(tidyverse)

# Returns NA because some heights are missing
mean(starwars$height)
[1] NA
Code
# Returns the actual average by ignoring NAs
mean(starwars$height, na.rm = TRUE)
[1] 174.6049
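
Before choosing a strategy, it can help to count how many values are actually missing. The following is a minimal sketch using a small made-up data frame (the name toy and its values are hypothetical, not taken from starwars):

```r
# Hypothetical toy data with a few missing values
toy <- data.frame(
  height = c(172, NA, 96, 202),
  mass   = c(77, 75, NA, NA)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Share of rows that have no missing values at all
complete_share <- mean(complete.cases(toy))
complete_share
```

If only a small share of rows is complete, dropping every incomplete row may cost more data than a per-calculation na.rm = TRUE.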

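Using na.omit() to clean the entire dataset. This removes every row with a missing value in any column, so select the columns you need first; otherwise NAs in unused columns will discard rows you wanted to keep. The result is a complete dataset that is easier to pass through a longer pipeline.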

Code
# This removes any row that has even one missing value
sw_clean <- starwars %>% 
  select(name, height, mass, sex) %>%
  na.omit()
Code
# Inspection: Check the new dimensions
dim(sw_clean)
[1] 56  4
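
As a possible alternative, tidyr's drop_na() lets you name the columns that must be complete, instead of relying on a prior select(). This sketch assumes a small hypothetical tibble (toy, with an extra notes column that is mostly empty):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 'notes' is mostly missing but not needed for analysis
toy <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(172, NA, 96, 202),
  mass   = c(77, 75, NA, 136),
  notes  = c(NA, "x", NA, NA)
)

# na.omit() checks every column, so the NAs in 'notes' also remove rows
all_cols <- na.omit(toy)
nrow(all_cols)

# drop_na() with column names only checks height and mass
key_cols <- toy %>% drop_na(height, mass)
nrow(key_cols)
```

Here na.omit() leaves no rows at all, while drop_na(height, mass) keeps the two rows whose measurements are complete.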

Phase 2: Structural Tidying (Select & Rename)

A pro-learner always simplifies. Don’t carry 30 columns if you only need 4.

Code
# Chaining Select and Rename
sw <- starwars %>% 
  select(name, height, mass, sex) %>%
  rename(weight = mass) %>%           # Making it more intuitive
  na.omit()
Code
# see first 6 records
head(sw)
# A tibble: 6 × 4
  name           height weight sex   
  <chr>           <int>  <dbl> <chr> 
1 Luke Skywalker    172     77 male  
2 C-3PO             167     75 none  
3 R2-D2              96     32 none  
4 Darth Vader       202    136 male  
5 Leia Organa       150     49 female
6 Owen Lars         178    120 male  

Phase 3: Normalization and Filtering

Data often arrives in units that are inconvenient for your analysis (like centimeters instead of meters) or contains categories you don’t need for a specific study.

Code
# Convert height from centimeters to meters
sw <- sw %>% 
  mutate(height = height / 100)

Using %in% to keep only the rows whose value appears in a given set.

Code
# Keeping only male and female categories
sw <- sw %>% 
  filter(sex %in% c("male", "female"))

unique(sw$sex)
[1] "male"   "female"
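
It can be worth tabulating the categories before filtering, so you know what the filter will discard. A minimal sketch on a hypothetical toy tibble (the values are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data with extra categories and one missing value
toy <- tibble(sex = c("male", "female", "none", "hermaphroditic", "male", NA))

# Tabulate the categories first
toy %>% count(sex)

# %in% treats NA as "not in the set", so the NA row is dropped as well
kept <- toy %>% filter(sex %in% c("male", "female"))
nrow(kept)
```

Only the three male and female rows remain; the other categories and the NA row are removed.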

Phase 4: Advanced Recoding (Value Transformation)

Sometimes values are too long or inconsistently named. We use recode() to map old values to new ones.

Code
# Changing 'male' to 'm' and 'female' to 'f'
sw <- sw %>% 
  mutate(gsex = recode(sex, 
                       "male"   = "m",
                       "female" = "f"))

head(sw %>% select(name, sex, gsex))
# A tibble: 6 × 3
  name               sex    gsex 
  <chr>              <chr>  <chr>
1 Luke Skywalker     male   m    
2 Darth Vader        male   m    
3 Leia Organa        female f    
4 Owen Lars          male   m    
5 Beru Whitesun Lars female f    
6 Biggs Darklighter  male   m    
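
Recent dplyr releases mark recode() as superseded, so you may prefer an alternative in new code. One option is case_when(), sketched here on a hypothetical toy tibble; the "other" fallback label is an assumption added for illustration:

```r
library(dplyr)

# Hypothetical toy data including a category we do not explicitly map
toy <- tibble(sex = c("male", "female", "none"))

toy <- toy %>%
  mutate(gsex = case_when(
    sex == "male"   ~ "m",
    sex == "female" ~ "f",
    TRUE            ~ "other"   # fallback for anything not listed above
  ))

toy$gsex
```

Unlike recode(), which passes unmatched values through unchanged by default, this version makes the fallback explicit.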

Phase 5: Feature Engineering (Creating New Variables)

This is where the “Pro” level begins. We create new variables based on logical conditions.

Creating a TRUE/FALSE variable based on physical attributes.

Code
# Is the character tall (>1m) AND heavy (>75kg)?
sw <- sw %>% 
  mutate(size_logic = height > 1 & weight > 75)

head(sw %>% select(name, size_logic))
# A tibble: 6 × 2
  name               size_logic
  <chr>              <lgl>     
1 Luke Skywalker     TRUE      
2 Darth Vader        TRUE      
3 Leia Organa        FALSE     
4 Owen Lars          TRUE      
5 Beru Whitesun Lars FALSE     
6 Biggs Darklighter  TRUE      

Turning logic into human-readable labels using if_else().

Code
# If size_logic is TRUE, label as "Big", otherwise "Small"
# (a logical column can be used directly as the condition)
sw <- sw %>% 
  mutate(size = if_else(size_logic, "Big", "Small"))

# Final result check
sw %>% select(name, height, weight, size) %>% head()
# A tibble: 6 × 4
  name               height weight size 
  <chr>               <dbl>  <dbl> <chr>
1 Luke Skywalker       1.72     77 Big  
2 Darth Vader          2.02    136 Big  
3 Leia Organa          1.5      49 Small
4 Owen Lars            1.78    120 Big  
5 Beru Whitesun Lars   1.65     75 Small
6 Biggs Darklighter    1.83     84 Big  
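
In the walkthrough above, rows with NAs were removed earlier, so the condition is never missing. If you keep incomplete rows instead, if_else() returns NA for them unless you set its missing argument. A minimal sketch on hypothetical toy data (the "Unknown" label is an assumption):

```r
library(dplyr)

# Hypothetical toy data: the third row has no height
toy <- tibble(
  height = c(1.72, 0.96, NA),
  weight = c(77, 32, 80)
)

toy <- toy %>%
  mutate(
    is_big = height > 1 & weight > 75,                       # NA when height is missing
    size   = if_else(is_big, "Big", "Small", missing = "Unknown")
  )

toy$size
```

This keeps "could not be determined" separate from "Small" rather than silently merging the two.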

The Master Pipeline: The “Pro” Way

In professional R coding, we write this entire process as one continuous, logical “story.”

Code
final_sw <- starwars %>% 
  # 1. Selection & Renaming
  select(name, height, mass, sex) %>% 
  rename(weight = mass) %>% 
  
  # 2. Cleaning
  na.omit() %>% 
  filter(sex %in% c("male", "female")) %>% 
  
  # 3. Recoding & Unit Conversion
  mutate(height = height / 100,
         gsex = recode(sex, "male" = "m", "female" = "f")) %>% 
  
  # 4. Feature Engineering
  mutate(is_big = height > 1 & weight > 75,
         size = if_else(is_big, "Big", "Small"))

# view(final_sw) # To see the final masterpiece
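
After a long pipeline, a few quick assertions can confirm that each step did what you expected. The sketch below runs on a hypothetical stand-in for the pipeline's output (result is made-up data, and the 3 m height bound is an assumed plausibility limit):

```r
# Hypothetical stand-in for the output of a cleaning pipeline
result <- data.frame(
  name   = c("A", "B"),
  height = c(1.72, 1.50),
  weight = c(77, 49),
  sex    = c("male", "female"),
  size   = c("Big", "Small")
)

checks <- c(
  no_missing   = !anyNA(result),
  height_in_m  = all(result$height < 3),   # a value like 172 would mean cm slipped through
  sex_filtered = all(result$sex %in% c("male", "female")),
  size_labels  = all(result$size %in% c("Big", "Small"))
)
checks

# Stop with an error if any check fails
stopifnot(all(checks))
```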

🎓 Summary for Learners

Technique     Function   Purpose
Handling NAs  na.omit()  Cleans the data “skeleton” before analysis.
Normalizing   mutate()   Ensures units (m, kg) are standard.
Recoding      recode()   Standardizes labels for better grouping.
Logic         if_else()  Creates new categorical insights from numbers.

Pro Tip: Always check your data types with str() or glimpse() after recoding to ensure your new columns are factors or characters as intended!
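
As an illustration of that check, the sketch below inspects a hypothetical toy tibble and converts a character label to a factor with an explicit level order (the order Small, Big is an assumption for the example):

```r
library(dplyr)

# Hypothetical toy data with a character label column
toy <- tibble(
  name = c("A", "B", "C"),
  size = c("Big", "Small", "Big")
)

# Inspect column types: size is <chr> at this point
glimpse(toy)

# Convert to a factor so the categories have a defined order
toy <- toy %>%
  mutate(size = factor(size, levels = c("Small", "Big")))

class(toy$size)
levels(toy$size)
```

Setting the levels explicitly controls how the categories are ordered in tables and plots, instead of leaving them alphabetical.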

Courses with short, easy-to-digest video content are available at premieranalytics.com.bd. Each lesson uses data that is built into R or comes with installed packages, so you can replicate the work at home. premieranalytics.com.bd also includes teaching on statistics and research methods.