Data Manipulation and Data Description using R

dplyr Package

The dplyr package provides a set of functions that make it easier to work with data frames and perform data manipulation tasks in R.

Key functions in the dplyr package are as follows:

  • filter(): Subset rows based on conditions.
  • select(): Choose specific columns.
  • mutate(): Add new variables or modify existing ones.
  • arrange(): Order rows based on one or more variables.
  • group_by(): Group data by one or more variables.
  • summarize(): Create summary statistics for both ungrouped and grouped data.

First we need to install and load the dplyr package from the tidyverse packages (Link). The tidyverse is a collection of R packages that includes: ggplot2, dplyr, etc.

Here is how to install and load the dplyr package via tidyverse in R.

# Install the tidyverse packages
install.packages("tidyverse") # Do this only once
# Load the tidyverse packages
library(tidyverse)

Selecting Columns

Selecting Columns in dplyr

We are going to learn how to select columns with the select() function from dplyr. Here too, we are going to use the Mid-Atlantic Wage Data from the ISLR2() package.

First, let’s install and load the tidyverse packages so that we can use the dplyr package.

# Install the tidyverse packages
# install.packages("tidyverse") # Do this only once
# Load the tidyverse packages
library(tidyverse)

Second, let’s load the Mid-Atlantic Wage Data from the ISLR2() package.

# Load the ISLR2 package
library(ISLR2)
# The Mid-Atlantic Wage Data
data(Wage)

Third, use the select() function to select age, race, maritl, and wage from the Mid-Atlantic Wage Data. Name the new data wage_select. Apply the head(), tail(), dim(), str(), and the summary() functions on wage_select.

wage_select <- select(Wage, age, race, maritl, wage)
head(wage_select)
##        age     race           maritl      wage
## 231655  18 1. White 1. Never Married  75.04315
## 86582   24 1. White 1. Never Married  70.47602
## 161300  45 1. White       2. Married 130.98218
## 155159  43 3. Asian       2. Married 154.68529
## 11443   50 1. White      4. Divorced  75.04315
## 376662  54 1. White       2. Married 127.11574
tail(wage_select)
##        age     race           maritl      wage
## 449482  31 1. White       2. Married 133.38061
## 376816  44 1. White       2. Married 154.68529
## 302281  30 1. White       2. Married  99.68946
## 10033   27 2. Black       2. Married  66.22941
## 14375   27 1. White 1. Never Married  87.98103
## 453557  55 1. White     5. Separated  90.48191
dim(wage_select)
## [1] 3000    4
str(wage_select)
## 'data.frame':    3000 obs. of  4 variables:
##  $ age   : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ race  : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ maritl: Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ wage  : num  75 70.5 131 154.7 75 ...
summary(wage_select)
##       age              race                   maritl          wage       
##  Min.   :18.00   1. White:2480   1. Never Married: 648   Min.   : 20.09  
##  1st Qu.:33.75   2. Black: 293   2. Married      :2074   1st Qu.: 85.38  
##  Median :42.00   3. Asian: 190   3. Widowed      :  19   Median :104.92  
##  Mean   :42.41   4. Other:  37   4. Divorced     : 204   Mean   :111.70  
##  3rd Qu.:51.00                   5. Separated    :  55   3rd Qu.:128.68  
##  Max.   :80.00                                           Max.   :318.34

Filtering Rows

Filtering Rows in dplyr

Use the filter function and the data set wage_select to get individuals who are 25 years old or younger.

wage_under_25 <- filter(wage_select, age <= 25)
head(wage_under_25)
##   age     race           maritl      wage
## 1  18 1. White 1. Never Married  75.04315
## 2  24 1. White 1. Never Married  70.47602
## 3  18 2. Black 1. Never Married  70.47602
## 4  25 1. White 1. Never Married 103.90247
## 5  25 1. White 1. Never Married  66.22941
## 6  23 1. White 1. Never Married  35.03342
tail(wage_under_25)
##     age     race           maritl     wage
## 226  25 2. Black 1. Never Married 77.49768
## 227  22 3. Asian 1. Never Married 63.18886
## 228  25 1. White       2. Married 65.11085
## 229  25 1. White 1. Never Married 87.98103
## 230  20 1. White 1. Never Married 82.67964
## 231  22 1. White 1. Never Married 87.98103

Use the filter function and the data set wage_select to get individuals greater than 25 years and less than or equal to 70 years.

wage_under_25_70 <- filter(wage_select, age > 25 & age <= 70)
head(wage_under_25_70)
##   age     race           maritl      wage
## 1  45 1. White       2. Married 130.98218
## 2  43 3. Asian       2. Married 154.68529
## 3  50 1. White      4. Divorced  75.04315
## 4  54 1. White       2. Married 127.11574
## 5  44 4. Other       2. Married 169.52854
## 6  30 3. Asian 1. Never Married 111.72085
tail(wage_under_25_70)
##      age     race           maritl      wage
## 2735  31 1. White       2. Married 133.38061
## 2736  44 1. White       2. Married 154.68529
## 2737  30 1. White       2. Married  99.68946
## 2738  27 2. Black       2. Married  66.22941
## 2739  27 1. White 1. Never Married  87.98103
## 2740  55 1. White     5. Separated  90.48191

Use the filter function and the data set wage_select to get individuals who are married.

wage_married <- filter(wage_select, maritl == "2. Married")
head(wage_married)
##   age     race     maritl      wage
## 1  45 1. White 2. Married 130.98218
## 2  43 3. Asian 2. Married 154.68529
## 3  54 1. White 2. Married 127.11574
## 4  44 4. Other 2. Married 169.52854
## 5  52 1. White 2. Married 128.68049
## 6  34 1. White 2. Married  81.28325
tail(wage_married)
##      age     race     maritl      wage
## 2069  35 1. White 2. Married 109.83399
## 2070  31 1. White 2. Married 102.87025
## 2071  31 1. White 2. Married 133.38061
## 2072  44 1. White 2. Married 154.68529
## 2073  30 1. White 2. Married  99.68946
## 2074  27 2. Black 2. Married  66.22941

Use the filter function and the data set wage_select to get individuals with 100 < wage <= 300 and are married.

wage_white_wage_more_than_250 <- filter(wage_select, wage > 100 & wage <= 300 & maritl == "2. Married")
head(wage_white_wage_more_than_250)
##   age     race     maritl     wage
## 1  45 1. White 2. Married 130.9822
## 2  43 3. Asian 2. Married 154.6853
## 3  54 1. White 2. Married 127.1157
## 4  44 4. Other 2. Married 169.5285
## 5  52 1. White 2. Married 128.6805
## 6  39 1. White 2. Married 134.7054
tail(wage_white_wage_more_than_250)
##      age     race     maritl     wage
## 1317  51 1. White 2. Married 126.0851
## 1318  50 2. Black 2. Married 132.4878
## 1319  35 1. White 2. Married 109.8340
## 1320  31 1. White 2. Married 102.8702
## 1321  31 1. White 2. Married 133.3806
## 1322  44 1. White 2. Married 154.6853

Mutating Columns

Adding New Variables in dplyr

Use the mutate function and the data set wage_select to create a new column named age_squared that calculates the square of the age variable.

wage_select <- mutate(wage_select, age_squared = age^2)
head(wage_select)
##        age     race           maritl      wage age_squared
## 231655  18 1. White 1. Never Married  75.04315         324
## 86582   24 1. White 1. Never Married  70.47602         576
## 161300  45 1. White       2. Married 130.98218        2025
## 155159  43 3. Asian       2. Married 154.68529        1849
## 11443   50 1. White      4. Divorced  75.04315        2500
## 376662  54 1. White       2. Married 127.11574        2916
tail(wage_select)
##        age     race           maritl      wage age_squared
## 449482  31 1. White       2. Married 133.38061         961
## 376816  44 1. White       2. Married 154.68529        1936
## 302281  30 1. White       2. Married  99.68946         900
## 10033   27 2. Black       2. Married  66.22941         729
## 14375   27 1. White 1. Never Married  87.98103         729
## 453557  55 1. White     5. Separated  90.48191        3025

Use the mutate function and the data set wage_select to create a new column named wage_log that calculates the log of the wage variable.

wage_select <- mutate(wage_select, wage_log = log(wage))
head(wage_select)
##        age     race           maritl      wage age_squared wage_log
## 231655  18 1. White 1. Never Married  75.04315         324 4.318063
## 86582   24 1. White 1. Never Married  70.47602         576 4.255273
## 161300  45 1. White       2. Married 130.98218        2025 4.875061
## 155159  43 3. Asian       2. Married 154.68529        1849 5.041393
## 11443   50 1. White      4. Divorced  75.04315        2500 4.318063
## 376662  54 1. White       2. Married 127.11574        2916 4.845098
tail(wage_select)
##        age     race           maritl      wage age_squared wage_log
## 449482  31 1. White       2. Married 133.38061         961 4.893207
## 376816  44 1. White       2. Married 154.68529        1936 5.041393
## 302281  30 1. White       2. Married  99.68946         900 4.602060
## 10033   27 2. Black       2. Married  66.22941         729 4.193125
## 14375   27 1. White 1. Never Married  87.98103         729 4.477121
## 453557  55 1. White     5. Separated  90.48191        3025 4.505150

Summary Statistics

Numerical Summaries in dplyr

If we are working with ungrouped data, we can calculate the summary statistics for the entire dataset, using the summarize() function from the dplyr package.

Use the data set wage_select to calculate the number of observations, mean, median (Q2), minimum, maximum, first quartile (Q1), third quartile (Q3), variance, and standard deviation for the age variable. Store your results in another variable named stats_ungrouped_age.

stats_ungrouped_age <- wage_select |>
  summarize(
    count = n(), 
    mean_age = mean(age),
    median_age = median(age),
    min_age = min(age),
    max_age = max(age),
    Q1_age = quantile(age, 0.25),  
    Q3_age = quantile(age, 0.75),   
    var_age = var(age),
    sd_age = sd(age)
  )
stats_ungrouped_age
##   count mean_age median_age min_age max_age Q1_age Q3_age  var_age   sd_age
## 1  3000 42.41467         42      18      80  33.75     51 133.2271 11.54241

Note: In R, |> is the native pipe operator, introduced in R version 4.1.0. It allows for the chaining of operations, where the output of one function becomes the input of the next.

If we are working with grouped data, we can calculate the summary statistics for the grouped data, using the grouped_by and summarize() functions.

Use the data set wage_select to calculate the number of observations, mean, median (Q2), minimum, maximum, first quartile (Q1), third quartile (Q3), variance, and standard deviation for the age variable by race. Store your results in another variable named stats_grouped_age.

stats_grouped_age <- wage_select |>
  group_by(race) |> # group by race
  summarize(
    count = n(), 
    mean_age = mean(age),
    median_age = median(age),
    min_age = min(age),
    max_age = max(age),
    Q1_age = quantile(age, 0.25),  
    Q3_age = quantile(age, 0.75),   
    var_age = var(age),
    sd_age = sd(age)
  )
stats_grouped_age
## # A tibble: 4 × 10
##   race    count mean_age median_age min_age max_age Q1_age Q3_age var_age sd_age
##   <fct>   <int>    <dbl>      <dbl>   <int>   <int>  <dbl>  <dbl>   <dbl>  <dbl>
## 1 1. Whi…  2480     42.4         42      18      80   34       51    129.   11.4
## 2 2. Bla…   293     43.6         44      18      75   33       52    169.   13.0
## 3 3. Asi…   190     41.8         40      22      76   32.2     50    126.   11.2
## 4 4. Oth…    37     37.7         39      21      65   28       47    133.   11.6

Exploratory Data Analysis

Refer to Unit 2 tutorials here!


  1. Southeast Missouri State University, ↩︎