In this guide, we introduce the
dplyr package, a powerful tool for data
manipulation and analysis in R. With
dplyr, you can filter, arrange, select, mutate,
and summarize data efficiently. The package is complemented well by
ggplot2: you can first carry out the
data manipulation and then visualize the outcome.
Central to working with dplyr
is the pipe operator (%>%), which dplyr re-exports from the magrittr
package. It allows you to chain together multiple data manipulation operations in a
clear and readable way: each operation takes the result of the previous
one as its first argument, so a pipeline reads from top to bottom.
# Install (if needed) and load the required packages
# install.packages('dplyr')
library(dplyr)
# install.packages('ggplot2')
library(ggplot2)
# install.packages('tibble')
library(tibble)
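To make the pipe concrete, the short sketch below writes the same two-step operation twice, once as nested calls and once with %>%. It reads the untouched built-in data via datasets::USArrests, and the names nested and piped are illustrative only.

```r
library(dplyr)

# Nested form: read inside-out (filter runs first, then arrange)
nested <- arrange(filter(datasets::USArrests, Murder > 10), Assault)

# Piped form: read top to bottom; each result becomes the first
# argument of the next call
piped <- datasets::USArrests %>%
  filter(Murder > 10) %>%
  arrange(Assault)

# Both forms describe the same computation
identical(nested, piped)
```

The piped form tends to be easier to extend, since adding a step means adding a line rather than wrapping another call around the expression.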
To start with data manipulation, let’s load the
USArrests dataset.
# Load the USArrests dataset
data(USArrests)
# Convert the row names (states) to a column using the tibble package
USArrests <- USArrests %>%
  rownames_to_column(var = "State")
The original USArrests data frame has 50 observations (one per state) on 4 variables:
| Variable | Description |
|---|---|
| Murder | Murder arrests (per 100,000) |
| Assault | Assault arrests (per 100,000) |
| UrbanPop | Percent urban population |
| Rape | Rape arrests (per 100,000) |
The rownames_to_column() function comes from
the tibble package, which is often used in
combination with dplyr for data
manipulation in R. After this step, State is also available as a regular
column in the data set, giving 5 columns in total.
One of the most important tasks in data manipulation is filtering
data based on specific conditions. The
filter() function is used for this
purpose. We can filter the USArrests data
set to select states with a murder rate greater than 5.
# Filter states with a murder rate > 5
filtered_data <- USArrests %>%
  filter(Murder > 5)
head(filtered_data)
## State Murder Assault UrbanPop Rape
## 1 Alabama 13.2 236 58 21.2
## 2 Alaska 10.0 263 48 44.5
## 3 Arizona 8.1 294 80 31.0
## 4 Arkansas 8.8 190 50 19.5
## 5 California 9.0 276 91 40.6
## 6 Colorado 7.9 204 78 38.7
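Filters often need more than one condition. The following sketch, which is an illustration rather than part of the original walkthrough, shows how conditions can be combined. It rebuilds the data under the hypothetical name arrests so that it runs on its own; the thresholds and state names are arbitrary choices.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column (`arrests` is an illustrative name)
arrests <- rownames_to_column(datasets::USArrests, var = "State")

# Comma-separated conditions are combined with AND
high_murder_urban <- arrests %>%
  filter(Murder > 5, UrbanPop >= 70)

# Use | for OR, and %in% to match against a set of values
selected_states <- arrests %>%
  filter(State %in% c("Texas", "Florida") | Rape > 40)

head(high_murder_urban)
```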
The arrange() function sorts
your data by one or more variables. We can use it
to sort the filtered data set in ascending order of assault
rate.
# Arrange data by assault rate in ascending order
arranged_data <- filtered_data %>%
  arrange(Assault)
head(arranged_data)
## State Murder Assault UrbanPop Rape
## 1 Hawaii 5.3 46 83 20.2
## 2 West Virginia 5.7 81 39 9.3
## 3 Pennsylvania 6.3 106 72 14.9
## 4 Kentucky 9.7 109 52 16.3
## 5 Montana 6.0 109 53 16.4
## 6 Indiana 7.2 113 65 21.0
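Sorting in descending order, or by more than one variable, follows the same pattern. The sketch below uses desc(), which dplyr provides for this purpose; the data frame name arrests and the choice of sort columns are illustrative assumptions.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column (`arrests` is an illustrative name)
arrests <- rownames_to_column(datasets::USArrests, var = "State")

# Wrap a variable in desc() to sort from highest to lowest
by_assault_desc <- arrests %>%
  arrange(desc(Assault))

# Several variables: sort by UrbanPop first, then break ties
# by Murder in descending order
by_urban_then_murder <- arrests %>%
  arrange(UrbanPop, desc(Murder))

head(by_assault_desc)
```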
The select() function chooses
specific columns from a data set. You can select columns by name or
with helper functions that match name patterns, such as starts_with().
We’ll select the columns
Assault and
Murder below.
# Select specific columns
selected_data <- USArrests %>%
  select(Assault, Murder)
head(selected_data)
## Assault Murder
## 1 236 13.2
## 2 263 10.0
## 3 294 8.1
## 4 190 8.8
## 5 276 9.0
## 6 204 7.9
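Selecting by pattern, rather than by exact name, might look like the sketch below. It uses the starts_with() helper, negative selection, and a column range; the object names are illustrative and the data is rebuilt so the block runs on its own.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column (`arrests` is an illustrative name)
arrests <- rownames_to_column(datasets::USArrests, var = "State")

# Keep State plus every column whose name starts with "M"
m_cols <- arrests %>%
  select(State, starts_with("M"))

# Drop a single column with a minus sign
no_urban <- arrests %>%
  select(-UrbanPop)

# Select a contiguous range of columns by position in the data set
range_cols <- arrests %>%
  select(State:Assault)

names(m_cols)
```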
mutate() is a versatile function that
allows you to create new variables or modify existing ones. We can
calculate the total crime rate as a new variable (Murder + Assault +
Rape).
# Calculate total crime rate
mutated_data <- USArrests %>%
  mutate(TotalCrimeRate = Murder + Assault + Rape)
head(mutated_data)
## State Murder Assault UrbanPop Rape TotalCrimeRate
## 1 Alabama 13.2 236 58 21.2 270.4
## 2 Alaska 10.0 263 48 44.5 317.5
## 3 Arizona 8.1 294 80 31.0 333.1
## 4 Arkansas 8.8 190 50 19.5 218.3
## 5 California 9.0 276 91 40.6 325.6
## 6 Colorado 7.9 204 78 38.7 250.6
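The example above creates a new variable; mutate() can equally overwrite an existing one, and later expressions in the same call can refer to columns created earlier in it. The sketch below illustrates both. The column names ViolentRate and MurderShare are hypothetical and chosen only for this example.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column (`arrests` is an illustrative name)
arrests <- rownames_to_column(datasets::USArrests, var = "State")

rescaled <- arrests %>%
  mutate(
    UrbanPop = UrbanPop / 100,          # overwrite: percent becomes a proportion
    ViolentRate = Murder + Rape,        # hypothetical new column
    MurderShare = Murder / ViolentRate  # refers to the column created just above
  )

head(rescaled)
```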
The summarize() function collapses a data set
into summary statistics, returning a single row for ungrouped data. We
can use this function to calculate the mean and maximum assault
rate.
# Calculate summary statistics for Assault rate
summary_stats <- USArrests %>%
  summarize(Mean_Assault = mean(Assault),
            Max_Assault = max(Assault))
summary_stats
## Mean_Assault Max_Assault
## 1 170.76 337
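Summaries are often more informative when computed per group. As a possible extension of the example above, the sketch below combines summarize() with dplyr's group_by() and n(). The grouping variable Urban and its 70% cut-off are assumptions made purely for illustration.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column (`arrests` is an illustrative name)
arrests <- rownames_to_column(datasets::USArrests, var = "State")

grouped_stats <- arrests %>%
  # Hypothetical grouping: split states at an arbitrary 70% urban threshold
  mutate(Urban = if_else(UrbanPop >= 70, "70% or more", "under 70%")) %>%
  group_by(Urban) %>%
  summarize(
    States = n(),                  # number of states in each group
    Mean_Assault = mean(Assault),
    Max_Assault = max(Assault)
  )

grouped_stats
```

With a grouped input, summarize() returns one row per group instead of a single row.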
As mentioned earlier, the dplyr package
works seamlessly with data visualization packages like
ggplot2. We can create visualizations
based on our data manipulations.
Below is a simple example of creating a bar chart using
ggplot2 to visualize the TotalCrimeRate in
each State.
# Create a bar chart to visualize TotalCrimeRate by State (using mutated_data)
ggplot(mutated_data, aes(x = State, y = TotalCrimeRate)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Total Crime Rate by State",
       x = "State",
       y = "Total Crime Rate") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
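With 50 states on one axis, the chart can become crowded. One possible variation, sketched below, chains the dplyr steps directly into the plot: it keeps the ten states with the highest TotalCrimeRate and orders the bars by value using base R's reorder(). The cut-off of ten and the names top_states and p are illustrative choices.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Rebuild the data with State as a column (`arrests` is an illustrative name)
arrests <- rownames_to_column(datasets::USArrests, var = "State")

# Keep the ten states with the highest total crime rate
top_states <- arrests %>%
  mutate(TotalCrimeRate = Murder + Assault + Rape) %>%
  arrange(desc(TotalCrimeRate)) %>%
  head(10)

# reorder() sorts the bars by value; coord_flip() makes the labels horizontal
p <- ggplot(top_states,
            aes(x = reorder(State, TotalCrimeRate), y = TotalCrimeRate)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(title = "Ten Highest Total Crime Rates by State",
       x = "State",
       y = "Total Crime Rate")

p
```

Here geom_col() is used as a shorthand for geom_bar(stat = "identity").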
This guide introduced the
dplyr package and its core functions for data
manipulation. You have seen how to filter, arrange, select, mutate, and
summarize data, and how to visualize the results with ggplot2. With these
skills, you can efficiently manipulate and analyze data in R.
You’re now all set to explore and practice data manipulation with
dplyr on your own datasets to gain
hands-on experience!