The dplyr package (part of the tidyverse) uses a “grammar of data manipulation” where each function acts as a verb. We use the Pipe Operator (%>%) to chain these verbs together in a clean, readable sequence.
Phase 1: Data Inspection (The Inquiry)
Once you load a dataset, use these functions to check its “health”—dimensions, column types, and sample rows.
These functions let you see the actual data points.
# A tibble: 6 Ă— 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sky… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth Va… 202 136 none white yellow 41.9 male mascu…
5 Leia Org… 150 49 brown light brown 19 fema… femin…
6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# A tibble: 6 Ă— 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Tion Med… 206 80 none grey black NA male mascu…
2 Finn NA NA black dark dark NA male mascu…
3 Rey NA NA brown light hazel NA fema… femin…
4 Poe Dame… NA NA brown light brown NA male mascu…
5 BB8 NA NA none none black NA none mascu…
6 Captain … NA NA none none unknown NA fema… femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Phase 2: Structural Manipulation (select & rename)
Now that you know the data, you can narrow your focus. The first step in any analysis is focusing on the variables you need. Instead of working with 30 columns, you select the relevant ones and give them meaningful names.
Picking specific variables from the dataset.
Code
# A tibble: 6 Ă— 4
name height mass sex
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 male
2 C-3PO 167 75 none
3 R2-D2 96 32 none
4 Darth Vader 202 136 male
5 Leia Organa 150 49 female
6 Owen Lars 178 120 male
Changing column names to make them more descriptive or easier to type. `rename’ syntax is: new_name = old_name
# A tibble: 6 Ă— 4
name height weight sex
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 male
2 C-3PO 167 75 none
3 R2-D2 96 32 none
4 Darth Vader 202 136 male
5 Leia Organa 150 49 female
6 Owen Lars 178 120 male
Phase 3: Row Filtering & Cleaning (filter & na.omit)
Once the columns are organized, the next step is to remove the rows (observations) that don’t meet your criteria or contain missing data.
Extracting rows based on specific conditions (e.g., only Humans).
Code
# A tibble: 6 Ă— 5
name mass height sex species
<chr> <dbl> <int> <chr> <chr>
1 Luke Skywalker 77 172 male Human
2 Darth Vader 136 202 male Human
3 Leia Organa 49 150 female Human
4 Owen Lars 120 178 male Human
5 Beru Whitesun Lars 75 165 female Human
6 Biggs Darklighter 84 183 male Human
Using na.omit() to remove rows with incomplete information to ensure mathematical functions (like mean) work correctly.
Code
[1] 20
Phase 4: Data Transformation (mutate)
Now that the data is clean, we often need to create new information from existing columns—such as calculating a ratio or converting units.
Converting units (e.g., centimeters to meters).
Code
# A tibble: 6 Ă— 5
name mass height sex species
<chr> <dbl> <dbl> <chr> <chr>
1 Luke Skywalker 77 1.72 male Human
2 Darth Vader 136 2.02 male Human
3 Leia Organa 49 1.5 female Human
4 Owen Lars 120 1.78 male Human
5 Beru Whitesun Lars 75 1.65 female Human
6 Biggs Darklighter 84 1.83 male Human
Creating a new variable (like BMI) based on other columns.
Code
# A tibble: 6 Ă— 4
name mass height BMI
<chr> <dbl> <dbl> <dbl>
1 Luke Skywalker 77 1.72 26.0
2 Darth Vader 136 2.02 33.3
3 Leia Organa 49 1.5 21.8
4 Owen Lars 120 1.78 37.9
5 Beru Whitesun Lars 75 1.65 27.5
6 Biggs Darklighter 84 1.83 25.1
Aggregation and Insights (group_by & summarise)
The final phase of manipulation is turning hundreds of rows into a few key insights by grouping the data and calculating statistics.
The Master Pipeline: The Full Systemic View (The Power of Chaining)
In real-world R programming, we combine all these phases into a single “Pipe Chain.” This makes your entire analysis workflow repeatable and easy for others to read.
Extracting rows based on specific conditions (e.g., only Humans).
Code
# The Complete Pipeline
final_report <- starwars %>%
select(mass, height, sex, species) %>% # 1. Select
filter(species == "Human") %>% # 2. Filter
na.omit() %>% # 3. Clean
mutate(height = height / 100, # 4. Transform (Meters)
BMI = mass / height^2) %>% # 5. Transform (BMI)
group_by(sex) %>% # 6. Group
summarise(Average_BMI = mean(BMI), # 7. Summarize (Mean)
Total_Count = n()) # 8. Summarize (Count)
final_report# A tibble: 2 Ă— 3
sex Average_BMI Total_Count
<chr> <dbl> <int>
1 female 20.8 3
2 male 25.7 17
🎓 Summary of Verbs for Learners
| Function | Action | Systemic Role |
select() |
Pick columns | Column Management |
rename() |
Change names | Column Management |
filter() |
Pick rows | Row Management |
na.omit() |
Remove missing data | Row Management |
mutate() |
Create/Update columns | Transformation |
group_by() |
Cluster data by category | Aggregation Prep |
summarise() |
Calculate summary stats | Final Insight |
Pro-Tip: Always use View() or head() after each step to verify that your data manipulation is doing exactly what you expect!
Courses that contain short and easy to digest video content are available at premieranalytics.com.bd Each lessons uses data that is built into R or comes with installed packages so you can replicated the work at home. premieranalytics.com.bd also includes teaching on statistics and research methods.