The dplyr package provides a set of functions that make
it easier to work with data frames and perform
data manipulation tasks in R.
Key functions in the dplyr package are as follows:
filter(): Subset rows based on conditions.
select(): Choose specific columns.
mutate(): Add new variables or modify existing ones.
arrange(): Order rows based on one or more variables.
group_by(): Group data by one or more variables.
summarize(): Create summary statistics for groups of data.
First, we need to install and load the dplyr package, which is part of
the tidyverse. The tidyverse is a collection of R packages that
includes ggplot2, dplyr, and others.
Here is how to install and load the dplyr package via
tidyverse in R.
# Install the tidyverse packages
install.packages("tidyverse") # Do this only once
# Load the tidyverse packages
library(tidyverse)
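Of the key functions listed above, filter(), mutate(), and arrange() are not demonstrated later in this tutorial. The following is a minimal sketch of how they might be chained together. The data frame toy, its columns, and the new column wage_per_age are hypothetical and are not part of the Wage data. Loading dplyr alone is enough here, although library(tidyverse) works as well.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  age  = c(25, 40, 31, 58),
  wage = c(50, 120, 80, 95)
)

toy_result <- toy %>%
  filter(age > 30) %>%                 # keep rows where age is above 30
  mutate(wage_per_age = wage / age) %>% # add a new column
  arrange(desc(wage))                  # sort by wage, highest first

toy_result
```

Each verb takes a data frame as its first argument and returns a data frame, which is what allows the steps to be chained.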
We are going to learn how to select columns with the
select() function from dplyr. Here too, we are
going to use the Mid-Atlantic Wage Data from the
ISLR2 package.
First, let’s install and load the tidyverse packages so
that we can use the dplyr package.
# Install the tidyverse packages
# install.packages("tidyverse") # Do this only once
# Load the tidyverse packages
library(tidyverse)
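Because the installation only needs to happen once, one possible alternative to commenting the line out is to install the package only when it is missing. This is a sketch, assuming an interactive session with a CRAN mirror already configured. It uses base R's requireNamespace() to check whether the package is available.

```r
# Install the tidyverse packages only if they are not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the tidyverse packages
library(tidyverse)
```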
Second, let’s load the Mid-Atlantic Wage Data from the
ISLR2 package.
# Load the ISLR2 package
library(ISLR2)
# The Mid-Atlantic Wage Data
data(Wage)
Third, use the select() function to select
age, race, maritl, and
wage from the Mid-Atlantic Wage Data. Name the
new data frame wage_select. Apply the head(),
tail(), dim(), str(), and
summary() functions to wage_select.
wage_select <- select(Wage, age, race, maritl, wage)
head(wage_select)
##        age     race           maritl      wage
## 231655  18 1. White 1. Never Married  75.04315
## 86582   24 1. White 1. Never Married  70.47602
## 161300  45 1. White       2. Married 130.98218
## 155159  43 3. Asian       2. Married 154.68529
## 11443   50 1. White      4. Divorced  75.04315
## 376662  54 1. White       2. Married 127.11574
tail(wage_select)
##        age     race           maritl      wage
## 449482  31 1. White       2. Married 133.38061
## 376816  44 1. White       2. Married 154.68529
## 302281  30 1. White       2. Married  99.68946
## 10033   27 2. Black       2. Married  66.22941
## 14375   27 1. White 1. Never Married  87.98103
## 453557  55 1. White     5. Separated  90.48191
dim(wage_select)
## [1] 3000 4
str(wage_select)
## 'data.frame': 3000 obs. of 4 variables:
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ maritl: Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ wage : num 75 70.5 131 154.7 75 ...
summary(wage_select)
##       age              race                   maritl          wage
##  Min.   :18.00   1. White:2480   1. Never Married: 648   Min.   : 20.09
##  1st Qu.:33.75   2. Black: 293   2. Married      :2074   1st Qu.: 85.38
##  Median :42.00   3. Asian: 190   3. Widowed      :  19   Median :104.92
##  Mean   :42.41   4. Other:  37   4. Divorced     : 204   Mean   :111.70
##  3rd Qu.:51.00                   5. Separated    :  55   3rd Qu.:128.68
##  Max.   :80.00                                           Max.   :318.34
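Besides listing column names, select() accepts other ways of specifying columns. The sketch below shows a few of them on a small made-up data frame that reuses the column names from above. The data frame toy and the object names by_name, by_drop, by_type, and by_prefix are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with the same column names as wage_select
toy <- data.frame(
  age    = c(18, 24, 45),
  race   = factor(c("1. White", "1. White", "3. Asian")),
  maritl = factor(c("1. Never Married", "1. Never Married", "2. Married")),
  wage   = c(75.0, 70.5, 131.0)
)

by_name   <- select(toy, age, wage)          # pick columns by name
by_drop   <- select(toy, -race, -maritl)     # drop columns with a minus sign
by_type   <- select(toy, where(is.numeric))  # pick columns by type
by_prefix <- select(toy, starts_with("ma"))  # pick columns by name prefix

names(by_name)
names(by_drop)
names(by_type)
names(by_prefix)
```

In this toy example the first three calls keep the same two columns, so the choice between them is mostly a matter of which reads most clearly for the data at hand.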
If we are working with ungrouped data, we can calculate the summary
statistics for the entire dataset, using the summarize()
function from the dplyr package.
Use the data set wage_select to calculate the number of
observations, mean, median, minimum, maximum, Q1, Q3, variance, and
standard deviation for the age variable. Store your results
in another variable named stats_ungrouped_age.
stats_ungrouped_age <- wage_select %>%
  summarize(
    count = n(),
    mean_age = mean(age),
    median_age = median(age),
    min_age = min(age),
    max_age = max(age),
    Q1_age = quantile(age, 0.25),
    Q3_age = quantile(age, 0.75),
    var_age = var(age),
    sd_age = sd(age)
  )
stats_ungrouped_age
##   count mean_age median_age min_age max_age Q1_age Q3_age  var_age   sd_age
## 1  3000 42.41467         42      18      80  33.75     51 133.2271 11.54241
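To see what summarize() does on ungrouped data at a smaller scale, here is a minimal sketch on a made-up data frame. The data frame toy and the object toy_stats are hypothetical. The six ages are simply the ones printed by head() above. The sketch also checks one relationship between the statistics: the standard deviation is the square root of the variance.

```r
library(dplyr)

# Hypothetical toy data: the six ages shown by head(wage_select)
toy <- data.frame(age = c(18, 24, 45, 43, 50, 54))

toy_stats <- toy %>%
  summarize(
    count   = n(),                 # number of rows
    Q1_age  = quantile(age, 0.25), # first quartile
    var_age = var(age),            # sample variance
    sd_age  = sd(age)              # sample standard deviation
  )

toy_stats

# The standard deviation is the square root of the variance
all.equal(toy_stats$sd_age, sqrt(toy_stats$var_age))
```

However many rows go in, summarize() on ungrouped data returns a single row with one column per statistic.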
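Note: The %>% symbol is known as the
pipe operator. It comes from the magrittr
package and is made available when dplyr (or the tidyverse) is loaded.
The operator passes the result on its left as the first argument to the
function on its right, so a sequence of operations reads from top to
bottom. This improves the readability of code for complex
data manipulation or analysis tasks.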
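As a small illustration of the pipe, the sketch below writes the same computation twice: once with nested function calls and once with %>%. The vector ages and the object names nested and piped are hypothetical.

```r
library(dplyr)  # makes %>% available

ages <- c(18, 24, 45, 43, 50, 54)  # hypothetical values

# Nested calls: read from the inside out
nested <- round(mean(sqrt(ages)), 2)

# Piped calls: the same steps, read from left to right
piped <- ages %>% sqrt() %>% mean() %>% round(2)

identical(nested, piped)
```

Both versions run the same functions in the same order. Only the way the code is written differs.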
If we are working with grouped data, we can calculate the summary
statistics for each group, using the group_by() and
summarize() functions.
Use the data set wage_select to calculate the number of
observations, mean, median, minimum, maximum, Q1, Q3, variance, and
standard deviation for the age variable by
race. Store your results in another variable named
stats_grouped_age.
stats_grouped_age <- wage_select %>%
  group_by(race) %>% # group by race
  summarize(
    count = n(),
    mean_age = mean(age),
    median_age = median(age),
    min_age = min(age),
    max_age = max(age),
    Q1_age = quantile(age, 0.25),
    Q3_age = quantile(age, 0.75),
    var_age = var(age),
    sd_age = sd(age)
  )
stats_grouped_age
## # A tibble: 4 × 10
##   race     count mean_age median_…¹ min_age max_age Q1_age Q3_age var_age sd_age
##   <fct>    <int>    <dbl>     <dbl>   <int>   <int>  <dbl>  <dbl>   <dbl>  <dbl>
## 1 1. White  2480     42.4        42      18      80   34       51    129.   11.4
## 2 2. Black   293     43.6        44      18      75   33       52    169.   13.0
## 3 3. Asian   190     41.8        40      22      76   32.2     50    126.   11.2
## 4 4. Other    37     37.7        39      21      65   28       47    133.   11.6
## # … with abbreviated variable name ¹median_age
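The effect of group_by() is easier to follow on a very small example. The sketch below uses a made-up data frame and shows that summarize() returns one row per group and that the group counts add up to the number of rows in the original data. The data frame toy and the object toy_grouped are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with two groups
toy <- data.frame(
  race = c("1. White", "1. White", "2. Black", "2. Black", "2. Black"),
  age  = c(30, 40, 20, 30, 40)
)

toy_grouped <- toy %>%
  group_by(race) %>%       # one group per distinct value of race
  summarize(
    count    = n(),        # rows in each group
    mean_age = mean(age)   # mean age within each group
  )

toy_grouped

# The group counts add up to the total number of rows
sum(toy_grouped$count) == nrow(toy)
```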
Southeast Missouri State University, ethompson@semo.edu