Introduction to Data Manipulation with dplyr in R

Overview

In this guide, we introduce the dplyr package, a powerful tool for data manipulation and analysis in R. With dplyr, you can filter, arrange, select, mutate, and summarize data efficiently. The package is complemented well by ggplot2: you first manipulate the data with dplyr and then visualize the result with ggplot2.

Central to dplyr is the pipe operator (%>%), which dplyr re-exports from the magrittr package. It lets you chain multiple data manipulation operations in a clear and readable way: each operation takes the result of the previous one as its first argument.
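To make the chaining idea concrete, here is a minimal sketch that writes the same three steps twice, once as nested calls and once with the pipe. It uses filter() and arrange(), which are covered in later sections. The object names nested and piped and the cutoff of 10 are illustrative choices, not part of the guide's own workflow.

```r
library(dplyr)

# Nested form: the calls have to be read from the inside out
nested <- head(arrange(filter(datasets::USArrests, Murder > 10), desc(Murder)), 3)

# Piped form: the same steps, read from top to bottom
piped <- datasets::USArrests %>%
  filter(Murder > 10) %>%
  arrange(desc(Murder)) %>%
  head(3)

# Both forms run the same function calls, so the results should match
identical(nested, piped)
```

The piped version lists the steps in the order they are carried out, which is why the rest of this guide uses it.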

Set up Environment

# Install (if needed) and load the required packages

# install.packages('dplyr')
library(dplyr)

# install.packages('ggplot2')
library(ggplot2)

# install.packages('tibble')
library(tibble)

Data Preparation

To start with data manipulation, let’s load the USArrests dataset.

# Load the USArrests dataset
data(USArrests)

# Convert the row names (States) to a column using the tibble package
USArrests <- USArrests %>% 
  rownames_to_column(var = "State")

USArrests is a data frame with 50 observations (one per US state) on 4 variables:

Murder	Murder arrests (per 100,000)
Assault	Assault arrests (per 100,000)
UrbanPop	Percent urban population
Rape	Rape arrests (per 100,000)

The rownames_to_column function is from the tibble package, which is often used in combination with dplyr for data manipulation in R. Now, we also have ‘State’ as a column in the data set.

Filtering Data

One of the most important tasks in data manipulation is filtering data based on specific conditions. The filter() function is used for this purpose. We can filter the USArrests data set to select states with a murder rate greater than 5.

# Filter states with a murder rate > 5
filtered_data <- USArrests %>% 
  filter(Murder > 5)

head(filtered_data)

##        State Murder Assault UrbanPop Rape
## 1    Alabama   13.2     236       58 21.2
## 2     Alaska   10.0     263       48 44.5
## 3    Arizona    8.1     294       80 31.0
## 4   Arkansas    8.8     190       50 19.5
## 5 California    9.0     276       91 40.6
## 6   Colorado    7.9     204       78 38.7
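filter() also accepts several conditions at once. The sketch below is illustrative only: the data frame name arrests, the result names, and the thresholds are assumptions chosen for the example, and the data is rebuilt from datasets::USArrests so the block runs on its own.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column so this block is self-contained
arrests <- datasets::USArrests %>%
  rownames_to_column(var = "State")

# Conditions separated by commas are combined with AND
high_murder_urban <- arrests %>%
  filter(Murder > 5, UrbanPop >= 80)

# Use | for OR
murder_or_assault <- arrests %>%
  filter(Murder > 15 | Assault > 300)

# Use %in% to keep rows whose value matches any element of a vector
west_coast <- arrests %>%
  filter(State %in% c("California", "Oregon", "Washington"))

head(high_murder_urban)
```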

Arranging Data

The arrange() function sorts the rows of your data by one or more variables, in ascending order by default. We can use this function to arrange the filtered data set in ascending order of assault rate.

# Arrange data by assault rate in ascending order
arranged_data <- filtered_data %>% 
  arrange(Assault)

head(arranged_data)

##           State Murder Assault UrbanPop Rape
## 1        Hawaii    5.3      46       83 20.2
## 2 West Virginia    5.7      81       39  9.3
## 3  Pennsylvania    6.3     106       72 14.9
## 4      Kentucky    9.7     109       52 16.3
## 5       Montana    6.0     109       53 16.4
## 6       Indiana    7.2     113       65 21.0
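To sort in descending order, wrap the variable in desc(). Any further variables passed to arrange() are used to break ties. The following is a small sketch under the same assumptions as the example above (Murder > 5 filter); the names arrests and top_assault are illustrative.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column so this block is self-contained
arrests <- datasets::USArrests %>%
  rownames_to_column(var = "State")

# Highest assault rate first; ties are ordered alphabetically by State
top_assault <- arrests %>%
  filter(Murder > 5) %>%
  arrange(desc(Assault), State)

head(top_assault)
```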

Selecting Columns

The select() function is used to choose specific columns from a data set. You can select columns by name or with helper functions such as starts_with() that match column names against a pattern. We’ll select the columns Assault and Murder below.

# Select specific columns
selected_data <- USArrests %>% 
  select(Assault, Murder)

head(selected_data)

##   Assault Murder
## 1     236   13.2
## 2     263   10.0
## 3     294    8.1
## 4     190    8.8
## 5     276    9.0
## 6     204    7.9
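Beyond listing names one by one, select() supports a few other selection styles. The sketch below shows three of them; the object names (arrests, m_cols, no_urban, range_cols) are hypothetical and only serve the illustration.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column so this block is self-contained
arrests <- datasets::USArrests %>%
  rownames_to_column(var = "State")

# Helper function: keep State plus every column whose name starts with "M"
m_cols <- arrests %>%
  select(State, starts_with("M"))

# Negative selection: keep everything except UrbanPop
no_urban <- arrests %>%
  select(-UrbanPop)

# Range selection: keep State plus all columns from Murder through Assault
range_cols <- arrests %>%
  select(State, Murder:Assault)

names(m_cols)
names(no_urban)
names(range_cols)
```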

Mutating Data

mutate() is a versatile function that allows you to create new variables or modify existing ones. We can calculate the total crime rate as a new variable (Murder + Assault + Rape).

# Calculate total crime rate
mutated_data <- USArrests %>% 
  mutate(TotalCrimeRate = Murder + Assault + Rape)

head(mutated_data)

##        State Murder Assault UrbanPop Rape TotalCrimeRate
## 1    Alabama   13.2     236       58 21.2          270.4
## 2     Alaska   10.0     263       48 44.5          317.5
## 3    Arizona    8.1     294       80 31.0          333.1
## 4   Arkansas    8.8     190       50 19.5          218.3
## 5 California    9.0     276       91 40.6          325.6
## 6   Colorado    7.9     204       78 38.7          250.6
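A single mutate() call can create several variables, refer to a variable created earlier in the same call, and overwrite an existing column. The sketch below illustrates all three; MurderShare and the rescaling of UrbanPop are assumptions made up for this example rather than steps used elsewhere in the guide.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column so this block is self-contained
arrests <- datasets::USArrests %>%
  rownames_to_column(var = "State")

mutated_more <- arrests %>%
  mutate(
    TotalCrimeRate = Murder + Assault + Rape,  # new variable
    MurderShare = Murder / TotalCrimeRate,     # uses the variable created just above
    UrbanPop = UrbanPop / 100                  # overwrites UrbanPop: percent to proportion
  )

head(mutated_more)
```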

Summarizing Data

To summarize data, use the summarize() function. Applied to an ungrouped data frame, it collapses all rows into a single row of summary statistics. We can use this function to calculate the mean and maximum assault rate.

# Calculate summary statistics for Assault rate
summary_stats <- USArrests %>% 
  summarize(Mean_Assault = mean(Assault),
            Max_Assault = max(Assault))

summary_stats

##   Mean_Assault Max_Assault
## 1       170.76         337
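summarize() is not limited to mean() and max(): any function that returns a single value can be used. The sketch below adds a row count with n(), the median, the standard deviation, and the name of the state with the highest assault rate. The column names are illustrative assumptions.

```r
library(dplyr)
library(tibble)

# Rebuild the data with State as a column so this block is self-contained
arrests <- datasets::USArrests %>%
  rownames_to_column(var = "State")

more_stats <- arrests %>%
  summarize(
    N_States = n(),                                # number of rows
    Median_Assault = median(Assault),
    SD_Assault = sd(Assault),
    Max_Assault_State = State[which.max(Assault)]  # state with the highest assault rate
  )

more_stats
```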

Data Visualization

As mentioned earlier, the dplyr package works seamlessly with data visualization packages like ggplot2. We can create visualizations based on our data manipulations.

Below is a simple example of creating a bar chart using ggplot2 to visualize the TotalCrimeRate in each State.

# Create a bar chart to visualize TotalCrimeRate by State (using mutated_data)
ggplot(mutated_data, aes(x = State, y = TotalCrimeRate)) +
  geom_col(fill = "skyblue") +
  labs(title = "Total Crime Rate by State",
       x = "State",
       y = "Total Crime Rate") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
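With 50 states on the x-axis the chart gets crowded. One option is to let dplyr reduce the data before plotting. The sketch below keeps the ten states with the highest TotalCrimeRate and orders the bars by value. The cutoff of ten, the horizontal layout, and the names top10 and p are choices made for this example, not part of the original analysis.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Prepare the data with dplyr: add the total, sort, keep the top ten rows
top10 <- datasets::USArrests %>%
  rownames_to_column(var = "State") %>%
  mutate(TotalCrimeRate = Murder + Assault + Rape) %>%
  arrange(desc(TotalCrimeRate)) %>%
  head(10)

# Plot with ggplot2: reorder() sorts the bars by value instead of alphabetically
p <- ggplot(top10, aes(x = reorder(State, TotalCrimeRate), y = TotalCrimeRate)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  labs(title = "Top 10 States by Total Crime Rate",
       x = "State",
       y = "Total Crime Rate")

p
```

Flipping the coordinates puts the state names on the vertical axis, so they stay readable without rotating the text.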

Conclusion

In this guide, you have been introduced to the dplyr package and its functions for data manipulation. You’ve been shown how to filter, arrange, select, mutate, summarize and visualize data. With these skills, you can efficiently manipulate and analyze data in R.

You’re now all set to explore and practice data manipulation with dplyr on your own datasets to gain hands-on experience!