The examples in this document make use of the tidyverse
library. If you haven’t installed it, you can do so using the following
command in your R console:
install.packages("tidyverse")
Once installed, load the tidyverse
library:
library(tidyverse)
Having your data in the right format is crucial for efficient analysis. Tidy data is a standardized way of organizing data that makes it easier to manipulate and analyze using R’s functions.
According to the principles of tidy data:
For example, suppose we have data on the participation of different communities in a local environmental initiative over several years. Here’s how the data might look initially:
# Untidy participation data with years as columns
participation_data <- data.frame(
community = c("Community A", "Community B"),
participation_2019 = c(100, 150),
participation_2020 = c(120, 130),
participation_2021 = c(110, 160)
)
print(participation_data)
## community participation_2019 participation_2020 participation_2021
## 1 Community A 100 120 110
## 2 Community B 150 130 160
In this dataset:
participation_2019
, participation_2020
,
etc.).Why is this a problem?
To tidy this data, we need to restructure it so that:
We can achieve this by reshaping the data using the
pivot_longer()
function from the tidyr
package:
# Tidying data by reshaping so each variable has its own column
participation_tidy <- participation_data %>%
pivot_longer(
cols = starts_with("participation_"),
names_to = "year",
names_prefix = "participation_",
values_to = "participants"
)
print(participation_tidy)
## # A tibble: 6 × 3
## community year participants
## <chr> <chr> <dbl>
## 1 Community A 2019 100
## 2 Community A 2020 120
## 3 Community A 2021 110
## 4 Community B 2019 150
## 5 Community B 2020 130
## 6 Community B 2021 160
Explanation:
pivot_longer()
: Transforms the data
from a wide format to a longer format, making it tidy.
cols = starts_with("participation_")
:
Selects all columns that contain participation data for different
years.names_to = "year"
: Creates a new
column called “year” that contains the years.names_prefix = "participation_"
:
Removes the prefix “participation_” from the year values.values_to = "participants"
: Creates a
new column called “participants” that contains the participation
numbers.Now, our data is tidy:
community
: The community name.year
: The year of participation.participants
: The number of
participants.Each row represents a single observation: the participation of one community in one year.
In data wrangling, reshaping data between different formats is a common task. Two important data formats are wide format and long format.
It can be confusing how the idea of tidy/untidy data relates to the idea of long/wide data. Let’s clarify:
For example, suppose we have data on attendance at community meetings, recorded for each month:
# Long-format attendance data
attendance_long <- data.frame(
community = rep(c("Community A", "Community B"), each = 3),
month = rep(c("January", "February", "March"), times = 2),
attendance = c(30, 35, 40, 25, 30, 45)
)
print(attendance_long)
## community month attendance
## 1 Community A January 30
## 2 Community A February 35
## 3 Community A March 40
## 4 Community B January 25
## 5 Community B February 30
## 6 Community B March 45
In this dataset:
community
,
month
, and
attendance
.If we want to create a report where each community has a single row, and each month’s attendance is in a separate column, we can reshape the data to wide format:
# Pivoting from long to wide format
attendance_wide <- attendance_long %>%
pivot_wider(
names_from = month,
values_from = attendance
)
print(attendance_wide)
## # A tibble: 2 × 4
## community January February March
## <chr> <dbl> <dbl> <dbl>
## 1 Community A 30 35 40
## 2 Community B 25 30 45
Explanation:
pivot_wider()
: Transforms the data
from long format to wide format.
names_from = month
: The unique values
in the “month” column become new column names.values_from = attendance
: The values
in the “attendance” column fill the new columns.Now, the data is in wide format:
Is the Wide Data Tidy?
Summary of the Relationship Between Tidy Data and Long/Wide Formats:
Practice Problem:
Consider the following untidy dataset on the number of community workshops held in different neighborhoods over several years:
# Untidy workshop data with years as columns
workshop_data <- data.frame(
neighborhood = c("Northside", "Southside"),
workshops_2018 = c(5, 7),
workshops_2019 = c(6, 8),
workshops_2020 = c(7, 9)
)
print(workshop_data)
## neighborhood workshops_2018 workshops_2019 workshops_2020
## 1 Northside 5 6 7
## 2 Southside 7 8 9
Exercise:
Solution:
# Tidying workshop data
workshop_tidy <- workshop_data %>%
pivot_longer(
cols = starts_with("workshops_"),
names_to = "year",
names_prefix = "workshops_",
values_to = "workshops"
)
print(workshop_tidy)
## # A tibble: 6 × 3
## neighborhood year workshops
## <chr> <chr> <dbl>
## 1 Northside 2018 5
## 2 Northside 2019 6
## 3 Northside 2020 7
## 4 Southside 2018 7
## 5 Southside 2019 8
## 6 Southside 2020 9
Filtering allows you to select a subset of your data based on specific conditions. This is useful when you want to focus on particular observations.
Suppose we have data on incidents reported in different neighborhoods, categorized by type:
# Incident data
incident_data <- data.frame(
incident_id = 1:20,
neighborhood = c(
"Northside", "Northside", "Northside", "Northside", "Northside",
"Southside", "Southside", "Southside", "Southside", "Southside",
"Eastside", "Eastside", "Eastside", "Eastside", "Eastside",
"Westside", "Westside", "Westside", "Westside", "Westside"
),
incident_type = c(
"Graffiti", "Littering", "Noise", "Vandalism", "Graffiti",
"Noise", "Littering", "Graffiti", "Vandalism", "Noise",
"Graffiti", "Graffiti", "Noise", "Littering", "Vandalism",
"Vandalism", "Littering", "Noise", "Graffiti", "Littering"
)
)
print(incident_data)
## incident_id neighborhood incident_type
## 1 1 Northside Graffiti
## 2 2 Northside Littering
## 3 3 Northside Noise
## 4 4 Northside Vandalism
## 5 5 Northside Graffiti
## 6 6 Southside Noise
## 7 7 Southside Littering
## 8 8 Southside Graffiti
## 9 9 Southside Vandalism
## 10 10 Southside Noise
## 11 11 Eastside Graffiti
## 12 12 Eastside Graffiti
## 13 13 Eastside Noise
## 14 14 Eastside Littering
## 15 15 Eastside Vandalism
## 16 16 Westside Vandalism
## 17 17 Westside Littering
## 18 18 Westside Noise
## 19 19 Westside Graffiti
## 20 20 Westside Littering
If we want to focus on incidents of graffiti, we can filter the data:
# Filtering for graffiti incidents
graffiti_incidents <- incident_data %>%
filter(incident_type == "Graffiti")
print(graffiti_incidents)
## incident_id neighborhood incident_type
## 1 1 Northside Graffiti
## 2 5 Northside Graffiti
## 3 8 Southside Graffiti
## 4 11 Eastside Graffiti
## 5 12 Eastside Graffiti
## 6 19 Westside Graffiti
We can also filter data based on multiple conditions using logical operators. For instance, let’s filter for graffiti incidents in the “Northside” neighborhood:
# Filtering for graffiti incidents in Northside
graffiti_northside <- incident_data %>%
filter(incident_type == "Graffiti" & neighborhood == "Northside")
print(graffiti_northside)
## incident_id neighborhood incident_type
## 1 1 Northside Graffiti
## 2 5 Northside Graffiti
Practice Problem:
Given the following data on community service hours by volunteers, filter the data to find volunteers who have contributed more than 20 hours.
# Volunteer data
volunteer_data <- data.frame(
volunteer_id = 1:6,
name = c("Alice", "Bob", "Charlie", "Diana", "Ethan", "Fiona"),
hours = c(15, 25, 20, 30, 10, 22)
)
print(volunteer_data)
## volunteer_id name hours
## 1 1 Alice 15
## 2 2 Bob 25
## 3 3 Charlie 20
## 4 4 Diana 30
## 5 5 Ethan 10
## 6 6 Fiona 22
Solution:
# Filtering volunteers with more than 20 hours
active_volunteers <- volunteer_data %>%
filter(hours > 20)
print(active_volunteers)
## volunteer_id name hours
## 1 2 Bob 25
## 2 4 Diana 30
## 3 6 Fiona 22
Grouping and summarizing data allows you to compute aggregate statistics for different groups within your data.
Using the incident_data
from the previous section, we
can calculate the number of incidents of each type in each
neighborhood.
# Grouping and summarizing incident counts
incident_summary <- incident_data %>%
group_by(neighborhood, incident_type) %>%
summarize(
incident_count = n()
)
print(incident_summary)
## # A tibble: 16 × 3
## # Groups: neighborhood [4]
## neighborhood incident_type incident_count
## <chr> <chr> <int>
## 1 Eastside Graffiti 2
## 2 Eastside Littering 1
## 3 Eastside Noise 1
## 4 Eastside Vandalism 1
## 5 Northside Graffiti 2
## 6 Northside Littering 1
## 7 Northside Noise 1
## 8 Northside Vandalism 1
## 9 Southside Graffiti 1
## 10 Southside Littering 1
## 11 Southside Noise 2
## 12 Southside Vandalism 1
## 13 Westside Graffiti 1
## 14 Westside Littering 2
## 15 Westside Noise 1
## 16 Westside Vandalism 1
Explanation:
group_by(neighborhood, incident_type)
:
Groups the data by both “neighborhood” and “incident_type”.summarize(incident_count = n())
:
Counts the number of incidents in each group.Practice Problem:
Using the volunteer_data
, calculate the average hours
contributed by volunteers in each department.
First, let’s extend the volunteer data to include departments:
# Extended volunteer data with departments
volunteer_data <- data.frame(
volunteer_id = 1:6,
name = c("Alice", "Bob", "Charlie", "Diana", "Ethan", "Fiona"),
department = c("Education", "Health", "Education", "Health", "Environment", "Environment"),
hours = c(15, 25, 20, 30, 10, 22)
)
print(volunteer_data)
## volunteer_id name department hours
## 1 1 Alice Education 15
## 2 2 Bob Health 25
## 3 3 Charlie Education 20
## 4 4 Diana Health 30
## 5 5 Ethan Environment 10
## 6 6 Fiona Environment 22
Solution:
# Grouping by department and summarizing average hours
avg_hours_by_department <- volunteer_data %>%
group_by(department) %>%
summarize(
average_hours = mean(hours),
total_hours = sum(hours),
volunteer_count = n()
)
print(avg_hours_by_department)
## # A tibble: 3 × 4
## department average_hours total_hours volunteer_count
## <chr> <dbl> <dbl> <int>
## 1 Education 17.5 35 2
## 2 Environment 16 32 2
## 3 Health 27.5 55 2
Joining datasets involves merging data from different sources based on shared columns, enhancing your analysis by linking complementary information.
Suppose we have two datasets:
# Volunteer information
volunteer_info <- data.frame(
volunteer_id = c(1, 2, 3, 5),
name = c("Alice", "Bob", "Charlie", "Eve"),
age = c(28, 34, 22, 29)
)
print(volunteer_info)
## volunteer_id name age
## 1 1 Alice 28
## 2 2 Bob 34
## 3 3 Charlie 22
## 4 5 Eve 29
# Volunteer hours
volunteer_hours <- data.frame(
volunteer_id = c(2, 3, 4, 5),
department = c("Health", "Education", "Environment", "Health"),
hours = c(25, 20, 15, 10)
)
print(volunteer_hours)
## volunteer_id department hours
## 1 2 Health 25
## 2 3 Education 20
## 3 4 Environment 15
## 4 5 Health 10
Note the following:
volunteer_info
.volunteer_hours
.An inner join returns only the rows where there is a match in both datasets.
# Inner join
volunteer_inner <- volunteer_info %>%
inner_join(volunteer_hours, by = "volunteer_id")
print(volunteer_inner)
## volunteer_id name age department hours
## 1 2 Bob 34 Health 25
## 2 3 Charlie 22 Education 20
## 3 5 Eve 29 Health 10
Result:
A left join returns all rows from the left dataset
(meaning the first dataset input to the command, which for us will be
volunteer_info
) and the matched rows from the right dataset
(volunteer_hours
). Unmatched rows will have NA
in the columns from the right dataset.
# Left join
volunteer_left <- volunteer_info %>%
left_join(volunteer_hours, by = "volunteer_id")
print(volunteer_left)
## volunteer_id name age department hours
## 1 1 Alice 28 <NA> NA
## 2 2 Bob 34 Health 25
## 3 3 Charlie 22 Education 20
## 4 5 Eve 29 Health 10
Result:
volunteer_info
(IDs
1, 2, 3,
5).volunteer_hours
, department
and
hours
are NA
.A right join returns all rows from the right dataset
(volunteer_hours
) and the matched rows from the left
dataset (volunteer_info
). Unmatched rows will have
NA
in the columns from the left dataset.
# Right join
volunteer_right <- volunteer_info %>%
right_join(volunteer_hours, by = "volunteer_id")
print(volunteer_right)
## volunteer_id name age department hours
## 1 2 Bob 34 Health 25
## 2 3 Charlie 22 Education 20
## 3 5 Eve 29 Health 10
## 4 4 <NA> NA Environment 15
Result:
volunteer_hours
(IDs
2, 3, 4,
5).volunteer_info
, name
and age
are
NA
.A full join returns all rows from both datasets,
with NA
in places where there is no match.
# Full join
volunteer_full <- volunteer_info %>%
full_join(volunteer_hours, by = "volunteer_id")
print(volunteer_full)
## volunteer_id name age department hours
## 1 1 Alice 28 <NA> NA
## 2 2 Bob 34 Health 25
## 3 3 Charlie 22 Education 20
## 4 5 Eve 29 Health 10
## 5 4 <NA> NA Environment 15
Result:
NA
for
department
and hours
.NA
for
name
and age
.Summary of Joins:
Volunteer ID | Inner Join | Left Join | Right Join | Full Join |
---|---|---|---|---|
1 | No | Yes | No | Yes |
2 | Yes | Yes | Yes | Yes |
3 | Yes | Yes | Yes | Yes |
4 | No | No | Yes | Yes |
5 | Yes | Yes | Yes | Yes |
In this guide, we’ve covered essential data wrangling techniques in R
using the tidyverse
. Understanding how to tidy, reshape,
filter, group, summarize, and join data is fundamental for effective
data analysis.
By practicing these skills with synthetic datasets related to social justice topics, you can prepare yourself to handle real-world data and uncover meaningful insights.
Feel free to revisit these examples and experiment with your own data. Data wrangling is a critical part of your data analysis toolkit, enabling you to transform raw data into actionable insights.
```