Getting Started

The examples in this document make use of the tidyverse library. If you haven’t installed it, you can do so using the following command in your R console:

install.packages("tidyverse")

Once installed, load the tidyverse library:

library(tidyverse)

Tidy Data

Having your data in the right format is crucial for efficient analysis. Tidy data is a standardized way of organizing data that makes it easier to manipulate and analyze using R’s functions.

According to the principles of tidy data:

  1. Each variable forms its own column.
    • Every piece of information you record (like age, race, or income) should have its own column in your dataset.
  2. Each observation forms its own row.
    • Each row should represent a single record or case in your dataset. For example, one person’s data, one event, or one measurement.
  3. Each type of observational unit forms its own table.
    • If you have different kinds of data that relate to different things, they should be stored in separate tables. For instance, if you’re collecting data on both individuals and the events they attend, you should have one table for individuals and another for events.
    • Why this matters: Mixing different types of observational units in one table can make data messy and harder to analyze. Keeping them separate ensures clarity and makes it easier to join them later if needed.

For example, suppose we have data on the participation of different communities in a local environmental initiative over several years. Here’s how the data might look initially:

# Untidy participation data with years as columns
participation_data <- data.frame(
  community = c("Community A", "Community B"),
  participation_2019 = c(100, 150),
  participation_2020 = c(120, 130),
  participation_2021 = c(110, 160)
)

print(participation_data)
##     community participation_2019 participation_2020 participation_2021
## 1 Community A                100                120                110
## 2 Community B                150                130                160

In this dataset:

Why is this a problem?

To tidy this data, we need to restructure it so that:

We can achieve this by reshaping the data using the pivot_longer() function from the tidyr package:

# Tidying data by reshaping so each variable has its own column
participation_tidy <- participation_data %>%
  pivot_longer(
    cols = starts_with("participation_"),
    names_to = "year",
    names_prefix = "participation_",
    values_to = "participants"
  )

print(participation_tidy)
## # A tibble: 6 × 3
##   community   year  participants
##   <chr>       <chr>        <dbl>
## 1 Community A 2019           100
## 2 Community A 2020           120
## 3 Community A 2021           110
## 4 Community B 2019           150
## 5 Community B 2020           130
## 6 Community B 2021           160

Explanation:

Now, our data is tidy:

Each row represents a single observation: the participation of one community in one year.

Wide and Long Data

In data wrangling, reshaping data between different formats is a common task. Two important data formats are wide format and long format.

It can be confusing how the idea of tidy/untidy data relates to the idea of long/wide data. Let’s clarify:

For example, suppose we have data on attendance at community meetings, recorded for each month:

# Long-format attendance data
attendance_long <- data.frame(
  community = rep(c("Community A", "Community B"), each = 3),
  month = rep(c("January", "February", "March"), times = 2),
  attendance = c(30, 35, 40, 25, 30, 45)
)

print(attendance_long)
##     community    month attendance
## 1 Community A  January         30
## 2 Community A February         35
## 3 Community A    March         40
## 4 Community B  January         25
## 5 Community B February         30
## 6 Community B    March         45

In this dataset:

If we want to create a report where each community has a single row, and each month’s attendance is in a separate column, we can reshape the data to wide format:

# Pivoting from long to wide format
attendance_wide <- attendance_long %>%
  pivot_wider(
    names_from = month,
    values_from = attendance
  )

print(attendance_wide)
## # A tibble: 2 × 4
##   community   January February March
##   <chr>         <dbl>    <dbl> <dbl>
## 1 Community A      30       35    40
## 2 Community B      25       30    45

Explanation:

Now, the data is in wide format:

Is the Wide Data Tidy?

Summary of the Relationship Between Tidy Data and Long/Wide Formats:

Practice Problem:

Consider the following untidy dataset on the number of community workshops held in different neighborhoods over several years:

# Untidy workshop data with years as columns
workshop_data <- data.frame(
  neighborhood = c("Northside", "Southside"),
  workshops_2018 = c(5, 7),
  workshops_2019 = c(6, 8),
  workshops_2020 = c(7, 9)
)

print(workshop_data)
##   neighborhood workshops_2018 workshops_2019 workshops_2020
## 1    Northside              5              6              7
## 2    Southside              7              8              9

Exercise:

Solution:

# Tidying workshop data
workshop_tidy <- workshop_data %>%
  pivot_longer(
    cols = starts_with("workshops_"),
    names_to = "year",
    names_prefix = "workshops_",
    values_to = "workshops"
  )

print(workshop_tidy)
## # A tibble: 6 × 3
##   neighborhood year  workshops
##   <chr>        <chr>     <dbl>
## 1 Northside    2018          5
## 2 Northside    2019          6
## 3 Northside    2020          7
## 4 Southside    2018          7
## 5 Southside    2019          8
## 6 Southside    2020          9

Filtering Data

Filtering allows you to select a subset of your data based on specific conditions. This is useful when you want to focus on particular observations.

Suppose we have data on incidents reported in different neighborhoods, categorized by type:

# Incident data
incident_data <- data.frame(
  incident_id = 1:20,
  neighborhood = c(
    "Northside", "Northside", "Northside", "Northside", "Northside",
    "Southside", "Southside", "Southside", "Southside", "Southside",
    "Eastside", "Eastside", "Eastside", "Eastside", "Eastside",
    "Westside", "Westside", "Westside", "Westside", "Westside"
  ),
  incident_type = c(
    "Graffiti", "Littering", "Noise", "Vandalism", "Graffiti",
    "Noise", "Littering", "Graffiti", "Vandalism", "Noise",
    "Graffiti", "Graffiti", "Noise", "Littering", "Vandalism",
    "Vandalism", "Littering", "Noise", "Graffiti", "Littering"
  )
)

print(incident_data)
##    incident_id neighborhood incident_type
## 1            1    Northside      Graffiti
## 2            2    Northside     Littering
## 3            3    Northside         Noise
## 4            4    Northside     Vandalism
## 5            5    Northside      Graffiti
## 6            6    Southside         Noise
## 7            7    Southside     Littering
## 8            8    Southside      Graffiti
## 9            9    Southside     Vandalism
## 10          10    Southside         Noise
## 11          11     Eastside      Graffiti
## 12          12     Eastside      Graffiti
## 13          13     Eastside         Noise
## 14          14     Eastside     Littering
## 15          15     Eastside     Vandalism
## 16          16     Westside     Vandalism
## 17          17     Westside     Littering
## 18          18     Westside         Noise
## 19          19     Westside      Graffiti
## 20          20     Westside     Littering

If we want to focus on incidents of graffiti, we can filter the data:

# Filtering for graffiti incidents
graffiti_incidents <- incident_data %>%
  filter(incident_type == "Graffiti")

print(graffiti_incidents)
##   incident_id neighborhood incident_type
## 1           1    Northside      Graffiti
## 2           5    Northside      Graffiti
## 3           8    Southside      Graffiti
## 4          11     Eastside      Graffiti
## 5          12     Eastside      Graffiti
## 6          19     Westside      Graffiti

We can also filter data based on multiple conditions using logical operators. For instance, let’s filter for graffiti incidents in the “Northside” neighborhood:

# Filtering for graffiti incidents in Northside
graffiti_northside <- incident_data %>%
  filter(incident_type == "Graffiti" & neighborhood == "Northside")

print(graffiti_northside)
##   incident_id neighborhood incident_type
## 1           1    Northside      Graffiti
## 2           5    Northside      Graffiti

Practice Problem:

Given the following data on community service hours by volunteers, filter the data to find volunteers who have contributed more than 20 hours.

# Volunteer data
volunteer_data <- data.frame(
  volunteer_id = 1:6,
  name = c("Alice", "Bob", "Charlie", "Diana", "Ethan", "Fiona"),
  hours = c(15, 25, 20, 30, 10, 22)
)

print(volunteer_data)
##   volunteer_id    name hours
## 1            1   Alice    15
## 2            2     Bob    25
## 3            3 Charlie    20
## 4            4   Diana    30
## 5            5   Ethan    10
## 6            6   Fiona    22

Solution:

# Filtering volunteers with more than 20 hours
active_volunteers <- volunteer_data %>%
  filter(hours > 20)

print(active_volunteers)
##   volunteer_id  name hours
## 1            2   Bob    25
## 2            4 Diana    30
## 3            6 Fiona    22

Grouping and Summarizing Data

Grouping and summarizing data allows you to compute aggregate statistics for different groups within your data.

Using the incident_data from the previous section, we can calculate the number of incidents of each type in each neighborhood.

# Grouping and summarizing incident counts
incident_summary <- incident_data %>%
  group_by(neighborhood, incident_type) %>%
  summarize(
    incident_count = n()
  )

print(incident_summary)
## # A tibble: 16 × 3
## # Groups:   neighborhood [4]
##    neighborhood incident_type incident_count
##    <chr>        <chr>                  <int>
##  1 Eastside     Graffiti                   2
##  2 Eastside     Littering                  1
##  3 Eastside     Noise                      1
##  4 Eastside     Vandalism                  1
##  5 Northside    Graffiti                   2
##  6 Northside    Littering                  1
##  7 Northside    Noise                      1
##  8 Northside    Vandalism                  1
##  9 Southside    Graffiti                   1
## 10 Southside    Littering                  1
## 11 Southside    Noise                      2
## 12 Southside    Vandalism                  1
## 13 Westside     Graffiti                   1
## 14 Westside     Littering                  2
## 15 Westside     Noise                      1
## 16 Westside     Vandalism                  1

Explanation:

Practice Problem:

Using the volunteer_data, calculate the average hours contributed by volunteers in each department.

First, let’s extend the volunteer data to include departments:

# Extended volunteer data with departments
volunteer_data <- data.frame(
  volunteer_id = 1:6,
  name = c("Alice", "Bob", "Charlie", "Diana", "Ethan", "Fiona"),
  department = c("Education", "Health", "Education", "Health", "Environment", "Environment"),
  hours = c(15, 25, 20, 30, 10, 22)
)

print(volunteer_data)
##   volunteer_id    name  department hours
## 1            1   Alice   Education    15
## 2            2     Bob      Health    25
## 3            3 Charlie   Education    20
## 4            4   Diana      Health    30
## 5            5   Ethan Environment    10
## 6            6   Fiona Environment    22

Solution:

# Grouping by department and summarizing average hours
avg_hours_by_department <- volunteer_data %>%
  group_by(department) %>%
  summarize(
    average_hours = mean(hours),
    total_hours = sum(hours),
    volunteer_count = n()
  )

print(avg_hours_by_department)
## # A tibble: 3 × 4
##   department  average_hours total_hours volunteer_count
##   <chr>               <dbl>       <dbl>           <int>
## 1 Education            17.5          35               2
## 2 Environment          16            32               2
## 3 Health               27.5          55               2

Joining Datasets

Joining datasets involves merging data from different sources based on shared columns, enhancing your analysis by linking complementary information.

Suppose we have two datasets:

  1. Volunteer Information
# Volunteer information
volunteer_info <- data.frame(
  volunteer_id = c(1, 2, 3, 5),
  name = c("Alice", "Bob", "Charlie", "Eve"),
  age = c(28, 34, 22, 29)
)

print(volunteer_info)
##   volunteer_id    name age
## 1            1   Alice  28
## 2            2     Bob  34
## 3            3 Charlie  22
## 4            5     Eve  29
  1. Volunteer Hours
# Volunteer hours
volunteer_hours <- data.frame(
  volunteer_id = c(2, 3, 4, 5),
  department = c("Health", "Education", "Environment", "Health"),
  hours = c(25, 20, 15, 10)
)

print(volunteer_hours)
##   volunteer_id  department hours
## 1            2      Health    25
## 2            3   Education    20
## 3            4 Environment    15
## 4            5      Health    10

Note the following:

Inner Join

An inner join returns only the rows where there is a match in both datasets.

# Inner join
volunteer_inner <- volunteer_info %>%
  inner_join(volunteer_hours, by = "volunteer_id")

print(volunteer_inner)
##   volunteer_id    name age department hours
## 1            2     Bob  34     Health    25
## 2            3 Charlie  22  Education    20
## 3            5     Eve  29     Health    10

Result:

  • Includes volunteers with IDs 2, 3, and 5.
  • Excludes IDs 1 and 4 because they are not present in both datasets.

Left Join

A left join returns all rows from the left dataset (meaning the first dataset input to the command, which for us will be volunteer_info) and the matched rows from the right dataset (volunteer_hours). Unmatched rows will have NA in the columns from the right dataset.

# Left join
volunteer_left <- volunteer_info %>%
  left_join(volunteer_hours, by = "volunteer_id")

print(volunteer_left)
##   volunteer_id    name age department hours
## 1            1   Alice  28       <NA>    NA
## 2            2     Bob  34     Health    25
## 3            3 Charlie  22  Education    20
## 4            5     Eve  29     Health    10

Result:

  • Includes all volunteers from volunteer_info (IDs 1, 2, 3, 5).
  • For volunteer ID 1, which has no matching entry in volunteer_hours, department and hours are NA.

Right Join

A right join returns all rows from the right dataset (volunteer_hours) and the matched rows from the left dataset (volunteer_info). Unmatched rows will have NA in the columns from the left dataset.

# Right join
volunteer_right <- volunteer_info %>%
  right_join(volunteer_hours, by = "volunteer_id")

print(volunteer_right)
##   volunteer_id    name age  department hours
## 1            2     Bob  34      Health    25
## 2            3 Charlie  22   Education    20
## 3            5     Eve  29      Health    10
## 4            4    <NA>  NA Environment    15

Result:

  • Includes all volunteers from volunteer_hours (IDs 2, 3, 4, 5).
  • For volunteer ID 4, which has no matching entry in volunteer_info, name and age are NA.

Full Join

A full join returns all rows from both datasets, with NA in places where there is no match.

# Full join
volunteer_full <- volunteer_info %>%
  full_join(volunteer_hours, by = "volunteer_id")

print(volunteer_full)
##   volunteer_id    name age  department hours
## 1            1   Alice  28        <NA>    NA
## 2            2     Bob  34      Health    25
## 3            3 Charlie  22   Education    20
## 4            5     Eve  29      Health    10
## 5            4    <NA>  NA Environment    15

Result:

  • Includes all volunteers (IDs 1, 2, 3, 4, 5).
  • Volunteer ID 1 has NA for department and hours.
  • Volunteer ID 4 has NA for name and age.

Summary of Joins:

Volunteer ID Inner Join Left Join Right Join Full Join
1 No Yes No Yes
2 Yes Yes Yes Yes
3 Yes Yes Yes Yes
4 No No Yes Yes
5 Yes Yes Yes Yes

Conclusion

In this guide, we’ve covered essential data wrangling techniques in R using the tidyverse. Understanding how to tidy, reshape, filter, group, summarize, and join data is fundamental for effective data analysis.

By practicing these skills with synthetic datasets related to social justice topics, you can prepare yourself to handle real-world data and uncover meaningful insights.

Feel free to revisit these examples and experiment with your own data. Data wrangling is a critical part of your data analysis toolkit, enabling you to transform raw data into actionable insights.

```