The dplyr package provides a set of functions that make
it easier to work with data frames and perform
data manipulation tasks in R.
Key functions in the dplyr package are as follows:
filter(): Subset rows based on conditions.select(): Choose specific columns.mutate(): Add new variables or modify existing
ones.arrange(): Order rows based on one or more
variables.group_by(): Group data by one or more variables.summarize(): Create summary statistics for both
ungrouped and grouped data.First we need to install and load the dplyr package from
the tidyverse packages (Link). The
tidyverse is a collection of R packages that includes:
ggplot2, dplyr, etc.
Here is how to install and load the dplyr package via
tidyverse in R.
# Install the tidyverse packages
install.packages("tidyverse") # Do this only once
# Load the tidyverse packages
library(tidyverse)
We are going to learn how to select columns with the
select() function from dplyr. Here too, we are
going to use the Mid-Atlantic Wage Data from the
ISLR2() package.
First, let’s install and load the tidyverse packages so
that we can use the dplyr package.
# Install the tidyverse packages
# install.packages("tidyverse") # Do this only once
# Load the tidyverse packages
library(tidyverse)
Second, let’s load the Mid-Atlantic Wage Data from the
ISLR2() package.
# Load the ISLR2 package
library(ISLR2)
# The Mid-Atlantic Wage Data
data(Wage)
Third, use the select() function to select
age, race, maritl, and
wage from the Mid-Atlantic Wage Data. Name the
new data wage_select. Apply the head(),
tail(), dim(), str(), and the
summary() functions on wage_select.
wage_select <- select(Wage, age, race, maritl, wage)
head(wage_select)
## age race maritl wage
## 231655 18 1. White 1. Never Married 75.04315
## 86582 24 1. White 1. Never Married 70.47602
## 161300 45 1. White 2. Married 130.98218
## 155159 43 3. Asian 2. Married 154.68529
## 11443 50 1. White 4. Divorced 75.04315
## 376662 54 1. White 2. Married 127.11574
tail(wage_select)
## age race maritl wage
## 449482 31 1. White 2. Married 133.38061
## 376816 44 1. White 2. Married 154.68529
## 302281 30 1. White 2. Married 99.68946
## 10033 27 2. Black 2. Married 66.22941
## 14375 27 1. White 1. Never Married 87.98103
## 453557 55 1. White 5. Separated 90.48191
dim(wage_select)
## [1] 3000 4
str(wage_select)
## 'data.frame': 3000 obs. of 4 variables:
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ maritl: Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ wage : num 75 70.5 131 154.7 75 ...
summary(wage_select)
## age race maritl wage
## Min. :18.00 1. White:2480 1. Never Married: 648 Min. : 20.09
## 1st Qu.:33.75 2. Black: 293 2. Married :2074 1st Qu.: 85.38
## Median :42.00 3. Asian: 190 3. Widowed : 19 Median :104.92
## Mean :42.41 4. Other: 37 4. Divorced : 204 Mean :111.70
## 3rd Qu.:51.00 5. Separated : 55 3rd Qu.:128.68
## Max. :80.00 Max. :318.34
Use the filter function and the data set
wage_select to get individuals who are
25 years old or younger.
wage_under_25 <- filter(wage_select, age <= 25)
head(wage_under_25)
## age race maritl wage
## 1 18 1. White 1. Never Married 75.04315
## 2 24 1. White 1. Never Married 70.47602
## 3 18 2. Black 1. Never Married 70.47602
## 4 25 1. White 1. Never Married 103.90247
## 5 25 1. White 1. Never Married 66.22941
## 6 23 1. White 1. Never Married 35.03342
tail(wage_under_25)
## age race maritl wage
## 226 25 2. Black 1. Never Married 77.49768
## 227 22 3. Asian 1. Never Married 63.18886
## 228 25 1. White 2. Married 65.11085
## 229 25 1. White 1. Never Married 87.98103
## 230 20 1. White 1. Never Married 82.67964
## 231 22 1. White 1. Never Married 87.98103
Use the filter function and the data set
wage_select to get individuals
greater than 25 years and less than or equal to 70 years.
wage_under_25_70 <- filter(wage_select, age > 25 & age <= 70)
head(wage_under_25_70)
## age race maritl wage
## 1 45 1. White 2. Married 130.98218
## 2 43 3. Asian 2. Married 154.68529
## 3 50 1. White 4. Divorced 75.04315
## 4 54 1. White 2. Married 127.11574
## 5 44 4. Other 2. Married 169.52854
## 6 30 3. Asian 1. Never Married 111.72085
tail(wage_under_25_70)
## age race maritl wage
## 2735 31 1. White 2. Married 133.38061
## 2736 44 1. White 2. Married 154.68529
## 2737 30 1. White 2. Married 99.68946
## 2738 27 2. Black 2. Married 66.22941
## 2739 27 1. White 1. Never Married 87.98103
## 2740 55 1. White 5. Separated 90.48191
Use the filter function and the data set
wage_select to get individuals who are
married.
wage_married <- filter(wage_select, maritl == "2. Married")
head(wage_married)
## age race maritl wage
## 1 45 1. White 2. Married 130.98218
## 2 43 3. Asian 2. Married 154.68529
## 3 54 1. White 2. Married 127.11574
## 4 44 4. Other 2. Married 169.52854
## 5 52 1. White 2. Married 128.68049
## 6 34 1. White 2. Married 81.28325
tail(wage_married)
## age race maritl wage
## 2069 35 1. White 2. Married 109.83399
## 2070 31 1. White 2. Married 102.87025
## 2071 31 1. White 2. Married 133.38061
## 2072 44 1. White 2. Married 154.68529
## 2073 30 1. White 2. Married 99.68946
## 2074 27 2. Black 2. Married 66.22941
Use the filter function and the data set
wage_select to get individuals with 100 < wage <= 300
and are married.
wage_white_wage_more_than_250 <- filter(wage_select, wage > 100 & wage <= 300 & maritl == "2. Married")
head(wage_white_wage_more_than_250)
## age race maritl wage
## 1 45 1. White 2. Married 130.9822
## 2 43 3. Asian 2. Married 154.6853
## 3 54 1. White 2. Married 127.1157
## 4 44 4. Other 2. Married 169.5285
## 5 52 1. White 2. Married 128.6805
## 6 39 1. White 2. Married 134.7054
tail(wage_white_wage_more_than_250)
## age race maritl wage
## 1317 51 1. White 2. Married 126.0851
## 1318 50 2. Black 2. Married 132.4878
## 1319 35 1. White 2. Married 109.8340
## 1320 31 1. White 2. Married 102.8702
## 1321 31 1. White 2. Married 133.3806
## 1322 44 1. White 2. Married 154.6853
Use the mutate function and the data set
wage_select to create a new column named
age_squared that calculates the square of the
age variable.
wage_select <- mutate(wage_select, age_squared = age^2)
head(wage_select)
## age race maritl wage age_squared
## 231655 18 1. White 1. Never Married 75.04315 324
## 86582 24 1. White 1. Never Married 70.47602 576
## 161300 45 1. White 2. Married 130.98218 2025
## 155159 43 3. Asian 2. Married 154.68529 1849
## 11443 50 1. White 4. Divorced 75.04315 2500
## 376662 54 1. White 2. Married 127.11574 2916
tail(wage_select)
## age race maritl wage age_squared
## 449482 31 1. White 2. Married 133.38061 961
## 376816 44 1. White 2. Married 154.68529 1936
## 302281 30 1. White 2. Married 99.68946 900
## 10033 27 2. Black 2. Married 66.22941 729
## 14375 27 1. White 1. Never Married 87.98103 729
## 453557 55 1. White 5. Separated 90.48191 3025
Use the mutate function and the data set
wage_select to create a new column named
wage_log that calculates the log of the
wage variable.
wage_select <- mutate(wage_select, wage_log = log(wage))
head(wage_select)
## age race maritl wage age_squared wage_log
## 231655 18 1. White 1. Never Married 75.04315 324 4.318063
## 86582 24 1. White 1. Never Married 70.47602 576 4.255273
## 161300 45 1. White 2. Married 130.98218 2025 4.875061
## 155159 43 3. Asian 2. Married 154.68529 1849 5.041393
## 11443 50 1. White 4. Divorced 75.04315 2500 4.318063
## 376662 54 1. White 2. Married 127.11574 2916 4.845098
tail(wage_select)
## age race maritl wage age_squared wage_log
## 449482 31 1. White 2. Married 133.38061 961 4.893207
## 376816 44 1. White 2. Married 154.68529 1936 5.041393
## 302281 30 1. White 2. Married 99.68946 900 4.602060
## 10033 27 2. Black 2. Married 66.22941 729 4.193125
## 14375 27 1. White 1. Never Married 87.98103 729 4.477121
## 453557 55 1. White 5. Separated 90.48191 3025 4.505150
If we are working with ungrouped data, we can calculate the
summary statistics for the entire dataset, using the
summarize() function from the dplyr
package.
Use the data set wage_select to calculate the
number of observations, mean,
median (Q2), minimum, maximum,
first quartile (Q1), third quartile (Q3),
variance, and standard deviation for the
age variable. Store your results in another variable named
stats_ungrouped_age.
stats_ungrouped_age <- wage_select |>
summarize(
count = n(),
mean_age = mean(age),
median_age = median(age),
min_age = min(age),
max_age = max(age),
Q1_age = quantile(age, 0.25),
Q3_age = quantile(age, 0.75),
var_age = var(age),
sd_age = sd(age)
)
stats_ungrouped_age
## count mean_age median_age min_age max_age Q1_age Q3_age var_age sd_age
## 1 3000 42.41467 42 18 80 33.75 51 133.2271 11.54241
Note: In R, |> is the native pipe
operator, introduced in R version 4.1.0. It allows for the
chaining of operations, where the output of one function
becomes the input of the next.
If we are working with grouped data, we can calculate the summary
statistics for the grouped data, using the grouped_by and
summarize() functions.
Use the data set wage_select to calculate the
number of observations, mean,
median (Q2), minimum, maximum,
first quartile (Q1), third quartile (Q3),
variance, and standard deviation for the
age variable by race. Store your results in
another variable named stats_grouped_age.
stats_grouped_age <- wage_select |>
group_by(race) |> # group by race
summarize(
count = n(),
mean_age = mean(age),
median_age = median(age),
min_age = min(age),
max_age = max(age),
Q1_age = quantile(age, 0.25),
Q3_age = quantile(age, 0.75),
var_age = var(age),
sd_age = sd(age)
)
stats_grouped_age
## # A tibble: 4 × 10
## race count mean_age median_age min_age max_age Q1_age Q3_age var_age sd_age
## <fct> <int> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1. Whi… 2480 42.4 42 18 80 34 51 129. 11.4
## 2 2. Bla… 293 43.6 44 18 75 33 52 169. 13.0
## 3 3. Asi… 190 41.8 40 22 76 32.2 50 126. 11.2
## 4 4. Oth… 37 37.7 39 21 65 28 47 133. 11.6
Unit 2 tutorials here!Southeast Missouri State University, ethompson@semo.edu↩︎