Systemic Data Manipulation with dplyr

Author

Abdullah Al Shamim

The dplyr package (part of the tidyverse) uses a “grammar of data manipulation” where each function acts as a verb. We use the Pipe Operator (%>%) to chain these verbs together in a clean, readable sequence.

Phase 1: Data Inspection (The Inquiry)

Once you load a dataset, use these functions to check its “health”—dimensions, column types, and sample rows.

These functions let you see the actual data points.

Code
library(tidyverse)

# Open the data in a spreadsheet-like viewer (Only works in RStudio)
View(starwars) 
Code
# See the first 6 rows
head(starwars)
# A tibble: 6 Ă— 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
Code
# See the last 6 rows
tail(starwars)
# A tibble: 6 Ă— 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Tion Med…    206    80 none       grey       black             NA male  mascu…
2 Finn          NA    NA black      dark       dark              NA male  mascu…
3 Rey           NA    NA brown      light      hazel             NA fema… femin…
4 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
5 BB8           NA    NA none       none       black             NA none  mascu…
6 Captain …     NA    NA none       none       unknown           NA fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

A systemic look at the distribution, means, and missing values of the dataset.

Code
# Statistical snapshot of the entire dataset
#summary(starwars)

Phase 2: Structural Manipulation (select & rename)

Now that you know the data, you can narrow your focus. The first step in any analysis is focusing on the variables you need. Instead of working with 30 columns, you select the relevant ones and give them meaningful names.

Picking specific variables from the dataset.

Code
# Keeping only what matters
sw_subset <- starwars %>% 
  select(name, height, mass, sex)

head(sw_subset)
# A tibble: 6 Ă— 4
  name           height  mass sex   
  <chr>           <int> <dbl> <chr> 
1 Luke Skywalker    172    77 male  
2 C-3PO             167    75 none  
3 R2-D2              96    32 none  
4 Darth Vader       202   136 male  
5 Leia Organa       150    49 female
6 Owen Lars         178   120 male  

Changing column names to make them more descriptive or easier to type. `rename’ syntax is: new_name = old_name

Code
# Making column names more intuitive
sw <- sw_subset %>% 
  rename(weight = mass)

head(sw)
# A tibble: 6 Ă— 4
  name           height weight sex   
  <chr>           <int>  <dbl> <chr> 
1 Luke Skywalker    172     77 male  
2 C-3PO             167     75 none  
3 R2-D2              96     32 none  
4 Darth Vader       202    136 male  
5 Leia Organa       150     49 female
6 Owen Lars         178    120 male  

Phase 3: Row Filtering & Cleaning (filter & na.omit)

Once the columns are organized, the next step is to remove the rows (observations) that don’t meet your criteria or contain missing data.

Extracting rows based on specific conditions (e.g., only Humans).

Code
# Keeping only the Human species
humans_only <- starwars %>% 
  select(name, mass, height, sex, species) %>% 
  filter(species == "Human")

head(humans_only)
# A tibble: 6 Ă— 5
  name                mass height sex    species
  <chr>              <dbl>  <int> <chr>  <chr>  
1 Luke Skywalker        77    172 male   Human  
2 Darth Vader          136    202 male   Human  
3 Leia Organa           49    150 female Human  
4 Owen Lars            120    178 male   Human  
5 Beru Whitesun Lars    75    165 female Human  
6 Biggs Darklighter     84    183 male   Human  

Using na.omit() to remove rows with incomplete information to ensure mathematical functions (like mean) work correctly.

Code
# Removing rows where any value is NA
clean_data <- humans_only %>% 
  na.omit()

nrow(clean_data) # Check how many rows remain
[1] 20
Code
# Check how many rows were removed
dim(humans_only)
[1] 35  5
Code
dim(clean_data)
[1] 20  5

Phase 4: Data Transformation (mutate)

Now that the data is clean, we often need to create new information from existing columns—such as calculating a ratio or converting units.

Converting units (e.g., centimeters to meters).

Code
# Changing height from cm to m
transformed_data <- clean_data %>% 
  mutate(height = height / 100)

head(transformed_data)
# A tibble: 6 Ă— 5
  name                mass height sex    species
  <chr>              <dbl>  <dbl> <chr>  <chr>  
1 Luke Skywalker        77   1.72 male   Human  
2 Darth Vader          136   2.02 male   Human  
3 Leia Organa           49   1.5  female Human  
4 Owen Lars            120   1.78 male   Human  
5 Beru Whitesun Lars    75   1.65 female Human  
6 Biggs Darklighter     84   1.83 male   Human  

Creating a new variable (like BMI) based on other columns.

Code
# Calculating BMI: Weight (mass) / Height squared
bmi_data <- transformed_data %>% 
  mutate(BMI = mass / height^2)

select(bmi_data, name, mass, height, BMI) %>% head()
# A tibble: 6 Ă— 4
  name                mass height   BMI
  <chr>              <dbl>  <dbl> <dbl>
1 Luke Skywalker        77   1.72  26.0
2 Darth Vader          136   2.02  33.3
3 Leia Organa           49   1.5   21.8
4 Owen Lars            120   1.78  37.9
5 Beru Whitesun Lars    75   1.65  27.5
6 Biggs Darklighter     84   1.83  25.1

Aggregation and Insights (group_by & summarise)

The final phase of manipulation is turning hundreds of rows into a few key insights by grouping the data and calculating statistics.

Collapsing the data to find averages per category.

Code
# Finding the average BMI for each sex
analysis_result <- bmi_data %>% 
  group_by(sex) %>% 
  summarise(Average_BMI = mean(BMI))

analysis_result
# A tibble: 2 Ă— 2
  sex    Average_BMI
  <chr>        <dbl>
1 female        20.8
2 male          25.7

The Master Pipeline: The Full Systemic View (The Power of Chaining)

In real-world R programming, we combine all these phases into a single “Pipe Chain.” This makes your entire analysis workflow repeatable and easy for others to read.

Extracting rows based on specific conditions (e.g., only Humans).

Code
# The Complete Pipeline
final_report <- starwars %>% 
  select(mass, height, sex, species) %>%     # 1. Select
  filter(species == "Human") %>%             # 2. Filter
  na.omit() %>%                              # 3. Clean
  mutate(height = height / 100,              # 4. Transform (Meters)
         BMI = mass / height^2) %>%          # 5. Transform (BMI)
  group_by(sex) %>%                          # 6. Group
  summarise(Average_BMI = mean(BMI),         # 7. Summarize (Mean)
            Total_Count = n())               # 8. Summarize (Count)

final_report
# A tibble: 2 Ă— 3
  sex    Average_BMI Total_Count
  <chr>        <dbl>       <int>
1 female        20.8           3
2 male          25.7          17

🎓 Summary of Verbs for Learners

Function Action Systemic Role
select() Pick columns Column Management
rename() Change names Column Management
filter() Pick rows Row Management
na.omit() Remove missing data Row Management
mutate() Create/Update columns Transformation
group_by() Cluster data by category Aggregation Prep
summarise() Calculate summary stats Final Insight

Pro-Tip: Always use View() or head() after each step to verify that your data manipulation is doing exactly what you expect!

Courses that contain short and easy to digest video content are available at premieranalytics.com.bd Each lessons uses data that is built into R or comes with installed packages so you can replicated the work at home. premieranalytics.com.bd also includes teaching on statistics and research methods.