Data Manipulation and Data Description using R

dplyr Package

The dplyr package provides a set of functions that make it easier to work with data frames and perform data manipulation tasks in R.

Key functions in the dplyr package are as follows:

  • filter(): Subset rows based on conditions.
  • select(): Choose specific columns.
  • mutate(): Add new variables or modify existing ones.
  • arrange(): Order rows based on one or more variables.
  • group_by(): Group data by one or more variables.
  • summarize(): Create summary statistics for groups of data.

First we need to install and load the dplyr package from the tidyverse packages (Link). The tidyverse is a collection of R packages that includes: ggplot2, dplyr, etc.

Here is how to install and load the dplyr package via tidyverse in R.


# Install the tidyverse packages
install.packages("tidyverse") # Do this only once
# Load the tidyverse packages
library(tidyverse)


Selecting Columns

We are going to learn how to select columns with the select() function from dplyr. Here too, we are going to use the Mid-Atlantic Wage Data from the ISLR2() package.

First, let’s install and load the tidyverse packages so that we can use the dplyr package.


# Install the tidyverse packages
# install.packages("tidyverse") # Do this only once
# Load the tidyverse packages
library(tidyverse)


Second, let’s load the Mid-Atlantic Wage Data from the ISLR2() package.


# Load the ISLR2 package
library(ISLR2)
# The Mid-Atlantic Wage Data
data(Wage)


Third, use the select() function to select age, race, maritl, and wage from the Mid-Atlantic Wage Data. Name the new data wage_select. Apply the head(), tail(), dim(), str(), and the summary() functions on wage_select.


wage_select <- select(Wage, age, race, maritl, wage)
head(wage_select)
##        age     race           maritl      wage
## 231655  18 1. White 1. Never Married  75.04315
## 86582   24 1. White 1. Never Married  70.47602
## 161300  45 1. White       2. Married 130.98218
## 155159  43 3. Asian       2. Married 154.68529
## 11443   50 1. White      4. Divorced  75.04315
## 376662  54 1. White       2. Married 127.11574
tail(wage_select)
##        age     race           maritl      wage
## 449482  31 1. White       2. Married 133.38061
## 376816  44 1. White       2. Married 154.68529
## 302281  30 1. White       2. Married  99.68946
## 10033   27 2. Black       2. Married  66.22941
## 14375   27 1. White 1. Never Married  87.98103
## 453557  55 1. White     5. Separated  90.48191
dim(wage_select)
## [1] 3000    4
str(wage_select)
## 'data.frame':    3000 obs. of  4 variables:
##  $ age   : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ race  : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ maritl: Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ wage  : num  75 70.5 131 154.7 75 ...
summary(wage_select)
##       age              race                   maritl          wage       
##  Min.   :18.00   1. White:2480   1. Never Married: 648   Min.   : 20.09  
##  1st Qu.:33.75   2. Black: 293   2. Married      :2074   1st Qu.: 85.38  
##  Median :42.00   3. Asian: 190   3. Widowed      :  19   Median :104.92  
##  Mean   :42.41   4. Other:  37   4. Divorced     : 204   Mean   :111.70  
##  3rd Qu.:51.00                   5. Separated    :  55   3rd Qu.:128.68  
##  Max.   :80.00                                           Max.   :318.34


Summary Statistics

Numerical Summaries

If we are working with ungrouped data, we can calculate the summary statistics for the entire dataset, using the summarize() function from the dplyr package.

Use the data set wage_select to calculate the number of observations, mean, median, minimum, maximum, Q1, Q3, variance, and standard deviation for the age variable. Store your results in another variable named stats_ungrouped_age.


stats_ungrouped_age <- wage_select %>%
  summarize(
    count = n(), 
    mean_age = mean(age),
    median_age = median(age),
    min_age = min(age),
    max_age = max(age),
    Q1_age = quantile(age, 0.25),  
    Q3_age = quantile(age, 0.75),   
    var_age = var(age),
    sd_age = sd(age)
  )
stats_ungrouped_age
##   count mean_age median_age min_age max_age Q1_age Q3_age  var_age   sd_age
## 1  3000 42.41467         42      18      80  33.75     51 133.2271 11.54241


Note: The %>% symbol is known as the pipe operator. It is part of the magrittr package in R. The operator helps to create a sequence of operations, improving the readability and clarity of code when working with complex data manipulation or analysis tasks.

If we are working with grouped data, we can calculate the summary statistics for the grouped data, using the grouped_by and summarize() functions.

Use the data set wage_select to calculate the number of observations, mean, median, minimum, maximum, Q1, Q3, variance, and standard deviation for the age variable by race. Store your results in another variable named stats_grouped_age.


stats_grouped_age <- wage_select %>% 
  group_by(race) %>% # group by race
  summarize(
    count = n(), 
    mean_age = mean(age),
    median_age = median(age),
    min_age = min(age),
    max_age = max(age),
    Q1_age = quantile(age, 0.25),  
    Q3_age = quantile(age, 0.75),   
    var_age = var(age),
    sd_age = sd(age)
  )
stats_grouped_age
## # A tibble: 4 × 10
##   race     count mean_age median_…¹ min_age max_age Q1_age Q3_age var_age sd_age
##   <fct>    <int>    <dbl>     <dbl>   <int>   <int>  <dbl>  <dbl>   <dbl>  <dbl>
## 1 1. White  2480     42.4        42      18      80   34       51    129.   11.4
## 2 2. Black   293     43.6        44      18      75   33       52    169.   13.0
## 3 3. Asian   190     41.8        40      22      76   32.2     50    126.   11.2
## 4 4. Other    37     37.7        39      21      65   28       47    133.   11.6
## # … with abbreviated variable name ¹​median_age


Exploratory Data Analysis

Refer to Unit 2 tutorials here!


  1. Southeast Missouri State University, ↩︎