Getting Started

The examples in this document make use of the tidyverse library. If you haven’t installed it, you can do so using the following command in your R console:

install.packages("tidyverse")

Once installed, load the tidyverse library:

library(tidyverse)

Tidy Data

Having your data in the right format is crucial for efficient analysis. Tidy data is a standardized way of organizing data that makes it easier to manipulate and analyze using R’s functions.

According to the principles of tidy data:

Each variable forms its own column.
- Every piece of information you record (like age, race, or income) should have its own column in your dataset.
Each observation forms its own row.
- Each row should represent a single record or case in your dataset. For example, one person’s data, one event, or one measurement.
Each type of observational unit forms its own table.
- If you have different kinds of data that relate to different things, they should be stored in separate tables. For instance, if you’re collecting data on both individuals and the events they attend, you should have one table for individuals and another for events.
- Why this matters: Mixing different types of observational units in one table can make data messy and harder to analyze. Keeping them separate ensures clarity and makes it easier to join them later if needed.

For example, suppose we have data on the participation of different communities in a local environmental initiative over several years. Here’s how the data might look initially:

# Untidy participation data with years as columns
participation_data <- data.frame(
  community = c("Community A", "Community B"),
  participation_2019 = c(100, 150),
  participation_2020 = c(120, 130),
  participation_2021 = c(110, 160)
)

print(participation_data)

##     community participation_2019 participation_2020 participation_2021
## 1 Community A                100                120                110
## 2 Community B                150                130                160

In this dataset:

The participation numbers for each year are stored in separate columns.
The years are embedded in the column names (participation_2019, participation_2020, etc.).
This setup violates the tidy data principle that each variable should have its own column because the “year” variable is spread across multiple columns.

Why is this a problem?

It’s more difficult to perform analyses like calculating trends over time or comparing participation between communities.
Functions in R may not work as intended because they expect data in a tidy format.

To tidy this data, we need to restructure it so that:

There is a single column for the “year” variable.
There is a single column for the “participants” variable.

We can achieve this by reshaping the data using the pivot_longer() function from the tidyr package:

# Tidying data by reshaping so each variable has its own column
participation_tidy <- participation_data %>%
  pivot_longer(
    cols = starts_with("participation_"),
    names_to = "year",
    names_prefix = "participation_",
    values_to = "participants"
  )

print(participation_tidy)

## # A tibble: 6 × 3
##   community   year  participants
##   <chr>       <chr>        <dbl>
## 1 Community A 2019           100
## 2 Community A 2020           120
## 3 Community A 2021           110
## 4 Community B 2019           150
## 5 Community B 2020           130
## 6 Community B 2021           160

Explanation:

pivot_longer(): Transforms the data from a wide format to a longer format, making it tidy.
- cols = starts_with("participation_"): Selects all columns that contain participation data for different years.
- names_to = "year": Creates a new column called “year” that contains the years.
- names_prefix = "participation_": Removes the prefix “participation_” from the year values.
- values_to = "participants": Creates a new column called “participants” that contains the participation numbers.

Now, our data is tidy:

community: The community name.
year: The year of participation.
participants: The number of participants.

Each row represents a single observation: the participation of one community in one year.

Wide and Long Data

In data wrangling, reshaping data between different formats is a common task. Two important data formats are wide format and long format.

Wide Format:
- Each subject or observational unit has a single row.
- Repeated measurements or variables are stored in separate columns.
- Example: A dataset where each student’s test scores for different subjects are in separate columns.
Long Format:
- Each observation is stored in its own row.
- Variables that vary (like time or category) are stored in a single column.
- Example: A dataset where each row represents a student’s score in one subject.

It can be confusing how the idea of tidy/untidy data relates to the idea of long/wide data. Let’s clarify:

Tidy data can be in either wide or long format, depending on the context.
The key is that in tidy data, each variable is a column, and each observation is a row.
Some analyses or visualizations require data to be in a specific format (wide or long).

For example, suppose we have data on attendance at community meetings, recorded for each month:

# Long-format attendance data
attendance_long <- data.frame(
  community = rep(c("Community A", "Community B"), each = 3),
  month = rep(c("January", "February", "March"), times = 2),
  attendance = c(30, 35, 40, 25, 30, 45)
)

print(attendance_long)

##     community    month attendance
## 1 Community A  January         30
## 2 Community A February         35
## 3 Community A    March         40
## 4 Community B  January         25
## 5 Community B February         30
## 6 Community B    March         45

In this dataset:

Each row represents an observation: the attendance for one community in one month.
The variables are: community, month, and attendance.
This data is in long format and is tidy because each variable has its own column, and each observation is a row.

If we want to create a report where each community has a single row, and each month’s attendance is in a separate column, we can reshape the data to wide format:

# Pivoting from long to wide format
attendance_wide <- attendance_long %>%
  pivot_wider(
    names_from = month,
    values_from = attendance
  )

print(attendance_wide)

## # A tibble: 2 × 4
##   community   January February March
##   <chr>         <dbl>    <dbl> <dbl>
## 1 Community A      30       35    40
## 2 Community B      25       30    45

Explanation:

pivot_wider(): Transforms the data from long format to wide format.
- names_from = month: The unique values in the “month” column become new column names.
- values_from = attendance: The values in the “attendance” column fill the new columns.

Now, the data is in wide format:

Each community occupies a single row.
The months are now column names.
The attendance numbers are the values in these columns.

Is the Wide Data Tidy?

Yes, it can still be considered tidy because:
- Each variable is a column.
- Each observation is a row (each community).
However, depending on the analysis, the long format may be more suitable.

Summary of the Relationship Between Tidy Data and Long/Wide Formats:

Tidy data is about the principles of data organization (variables in columns, observations in rows).
Long and wide formats describe the shape of the data.
Key Points:
- Data in long format often aligns with tidy data principles, especially for time series or repeated measures.
- Data in wide format can also be tidy if it follows the tidy data principles.
- Sometimes, wide data can become untidy if variables are stored in column names (like years in our initial example).

Practice Problem:

Consider the following untidy dataset on the number of community workshops held in different neighborhoods over several years:

# Untidy workshop data with years as columns
workshop_data <- data.frame(
  neighborhood = c("Northside", "Southside"),
  workshops_2018 = c(5, 7),
  workshops_2019 = c(6, 8),
  workshops_2020 = c(7, 9)
)

print(workshop_data)

##   neighborhood workshops_2018 workshops_2019 workshops_2020
## 1    Northside              5              6              7
## 2    Southside              7              8              9

Exercise:

Reshape this data into tidy format.
Ensure that each variable has its own column and each observation is a row.

Solution:

# Tidying workshop data
workshop_tidy <- workshop_data %>%
  pivot_longer(
    cols = starts_with("workshops_"),
    names_to = "year",
    names_prefix = "workshops_",
    values_to = "workshops"
  )

print(workshop_tidy)

## # A tibble: 6 × 3
##   neighborhood year  workshops
##   <chr>        <chr>     <dbl>
## 1 Northside    2018          5
## 2 Northside    2019          6
## 3 Northside    2020          7
## 4 Southside    2018          7
## 5 Southside    2019          8
## 6 Southside    2020          9

Filtering Data

Filtering allows you to select a subset of your data based on specific conditions. This is useful when you want to focus on particular observations.

Suppose we have data on incidents reported in different neighborhoods, categorized by type:

# Incident data
incident_data <- data.frame(
  incident_id = 1:20,
  neighborhood = c(
    "Northside", "Northside", "Northside", "Northside", "Northside",
    "Southside", "Southside", "Southside", "Southside", "Southside",
    "Eastside", "Eastside", "Eastside", "Eastside", "Eastside",
    "Westside", "Westside", "Westside", "Westside", "Westside"
  ),
  incident_type = c(
    "Graffiti", "Littering", "Noise", "Vandalism", "Graffiti",
    "Noise", "Littering", "Graffiti", "Vandalism", "Noise",
    "Graffiti", "Graffiti", "Noise", "Littering", "Vandalism",
    "Vandalism", "Littering", "Noise", "Graffiti", "Littering"
  )
)

print(incident_data)

##    incident_id neighborhood incident_type
## 1            1    Northside      Graffiti
## 2            2    Northside     Littering
## 3            3    Northside         Noise
## 4            4    Northside     Vandalism
## 5            5    Northside      Graffiti
## 6            6    Southside         Noise
## 7            7    Southside     Littering
## 8            8    Southside      Graffiti
## 9            9    Southside     Vandalism
## 10          10    Southside         Noise
## 11          11     Eastside      Graffiti
## 12          12     Eastside      Graffiti
## 13          13     Eastside         Noise
## 14          14     Eastside     Littering
## 15          15     Eastside     Vandalism
## 16          16     Westside     Vandalism
## 17          17     Westside     Littering
## 18          18     Westside         Noise
## 19          19     Westside      Graffiti
## 20          20     Westside     Littering

If we want to focus on incidents of graffiti, we can filter the data:

# Filtering for graffiti incidents
graffiti_incidents <- incident_data %>%
  filter(incident_type == "Graffiti")

print(graffiti_incidents)

##   incident_id neighborhood incident_type
## 1           1    Northside      Graffiti
## 2           5    Northside      Graffiti
## 3           8    Southside      Graffiti
## 4          11     Eastside      Graffiti
## 5          12     Eastside      Graffiti
## 6          19     Westside      Graffiti

We can also filter data based on multiple conditions using logical operators. For instance, let’s filter for graffiti incidents in the “Northside” neighborhood:

# Filtering for graffiti incidents in Northside
graffiti_northside <- incident_data %>%
  filter(incident_type == "Graffiti" & neighborhood == "Northside")

print(graffiti_northside)

##   incident_id neighborhood incident_type
## 1           1    Northside      Graffiti
## 2           5    Northside      Graffiti

Practice Problem:

Given the following data on community service hours by volunteers, filter the data to find volunteers who have contributed more than 20 hours.

# Volunteer data
volunteer_data <- data.frame(
  volunteer_id = 1:6,
  name = c("Alice", "Bob", "Charlie", "Diana", "Ethan", "Fiona"),
  hours = c(15, 25, 20, 30, 10, 22)
)

print(volunteer_data)

##   volunteer_id    name hours
## 1            1   Alice    15
## 2            2     Bob    25
## 3            3 Charlie    20
## 4            4   Diana    30
## 5            5   Ethan    10
## 6            6   Fiona    22

Solution:

# Filtering volunteers with more than 20 hours
active_volunteers <- volunteer_data %>%
  filter(hours > 20)

print(active_volunteers)

##   volunteer_id  name hours
## 1            2   Bob    25
## 2            4 Diana    30
## 3            6 Fiona    22

Grouping and Summarizing Data

Grouping and summarizing data allows you to compute aggregate statistics for different groups within your data.

Using the incident_data from the previous section, we can calculate the number of incidents of each type in each neighborhood.

# Grouping and summarizing incident counts
incident_summary <- incident_data %>%
  group_by(neighborhood, incident_type) %>%
  summarize(
    incident_count = n()
  )

print(incident_summary)

## # A tibble: 16 × 3
## # Groups:   neighborhood [4]
##    neighborhood incident_type incident_count
##    <chr>        <chr>                  <int>
##  1 Eastside     Graffiti                   2
##  2 Eastside     Littering                  1
##  3 Eastside     Noise                      1
##  4 Eastside     Vandalism                  1
##  5 Northside    Graffiti                   2
##  6 Northside    Littering                  1
##  7 Northside    Noise                      1
##  8 Northside    Vandalism                  1
##  9 Southside    Graffiti                   1
## 10 Southside    Littering                  1
## 11 Southside    Noise                      2
## 12 Southside    Vandalism                  1
## 13 Westside     Graffiti                   1
## 14 Westside     Littering                  2
## 15 Westside     Noise                      1
## 16 Westside     Vandalism                  1

Explanation:

group_by(neighborhood, incident_type): Groups the data by both “neighborhood” and “incident_type”.
summarize(incident_count = n()): Counts the number of incidents in each group.

Practice Problem:

Using the volunteer_data, calculate the average hours contributed by volunteers in each department.

First, let’s extend the volunteer data to include departments:

# Extended volunteer data with departments
volunteer_data <- data.frame(
  volunteer_id = 1:6,
  name = c("Alice", "Bob", "Charlie", "Diana", "Ethan", "Fiona"),
  department = c("Education", "Health", "Education", "Health", "Environment", "Environment"),
  hours = c(15, 25, 20, 30, 10, 22)
)

print(volunteer_data)

##   volunteer_id    name  department hours
## 1            1   Alice   Education    15
## 2            2     Bob      Health    25
## 3            3 Charlie   Education    20
## 4            4   Diana      Health    30
## 5            5   Ethan Environment    10
## 6            6   Fiona Environment    22

Solution:

# Grouping by department and summarizing average hours
avg_hours_by_department <- volunteer_data %>%
  group_by(department) %>%
  summarize(
    average_hours = mean(hours),
    total_hours = sum(hours),
    volunteer_count = n()
  )

print(avg_hours_by_department)

## # A tibble: 3 × 4
##   department  average_hours total_hours volunteer_count
##   <chr>               <dbl>       <dbl>           <int>
## 1 Education            17.5          35               2
## 2 Environment          16            32               2
## 3 Health               27.5          55               2

Joining Datasets

Joining datasets involves merging data from different sources based on shared columns, enhancing your analysis by linking complementary information.

Suppose we have two datasets:

Volunteer Information

# Volunteer information
volunteer_info <- data.frame(
  volunteer_id = c(1, 2, 3, 5),
  name = c("Alice", "Bob", "Charlie", "Eve"),
  age = c(28, 34, 22, 29)
)

print(volunteer_info)

##   volunteer_id    name age
## 1            1   Alice  28
## 2            2     Bob  34
## 3            3 Charlie  22
## 4            5     Eve  29

Volunteer Hours

# Volunteer hours
volunteer_hours <- data.frame(
  volunteer_id = c(2, 3, 4, 5),
  department = c("Health", "Education", "Environment", "Health"),
  hours = c(25, 20, 15, 10)
)

print(volunteer_hours)

##   volunteer_id  department hours
## 1            2      Health    25
## 2            3   Education    20
## 3            4 Environment    15
## 4            5      Health    10

Note the following:

Volunteer ID 1 is only in volunteer_info.
Volunteer ID 4 is only in volunteer_hours.
Volunteer IDs 2, 3, and 5 are in both datasets.

Inner Join

An inner join returns only the rows where there is a match in both datasets.

# Inner join
volunteer_inner <- volunteer_info %>%
  inner_join(volunteer_hours, by = "volunteer_id")

print(volunteer_inner)

##   volunteer_id    name age department hours
## 1            2     Bob  34     Health    25
## 2            3 Charlie  22  Education    20
## 3            5     Eve  29     Health    10

Result:

Includes volunteers with IDs 2, 3, and 5.
Excludes IDs 1 and 4 because they are not present in both datasets.

Left Join

A left join returns all rows from the left dataset (meaning the first dataset input to the command, which for us will be volunteer_info) and the matched rows from the right dataset (volunteer_hours). Unmatched rows will have NA in the columns from the right dataset.

# Left join
volunteer_left <- volunteer_info %>%
  left_join(volunteer_hours, by = "volunteer_id")

print(volunteer_left)

##   volunteer_id    name age department hours
## 1            1   Alice  28       <NA>    NA
## 2            2     Bob  34     Health    25
## 3            3 Charlie  22  Education    20
## 4            5     Eve  29     Health    10

Result:

Includes all volunteers from volunteer_info (IDs 1, 2, 3, 5).
For volunteer ID 1, which has no matching entry in volunteer_hours, department and hours are NA.

Right Join

A right join returns all rows from the right dataset (volunteer_hours) and the matched rows from the left dataset (volunteer_info). Unmatched rows will have NA in the columns from the left dataset.

# Right join
volunteer_right <- volunteer_info %>%
  right_join(volunteer_hours, by = "volunteer_id")

print(volunteer_right)

##   volunteer_id    name age  department hours
## 1            2     Bob  34      Health    25
## 2            3 Charlie  22   Education    20
## 3            5     Eve  29      Health    10
## 4            4    <NA>  NA Environment    15

Result:

Includes all volunteers from volunteer_hours (IDs 2, 3, 4, 5).
For volunteer ID 4, which has no matching entry in volunteer_info, name and age are NA.

Full Join

A full join returns all rows from both datasets, with NA in places where there is no match.

# Full join
volunteer_full <- volunteer_info %>%
  full_join(volunteer_hours, by = "volunteer_id")

print(volunteer_full)

##   volunteer_id    name age  department hours
## 1            1   Alice  28        <NA>    NA
## 2            2     Bob  34      Health    25
## 3            3 Charlie  22   Education    20
## 4            5     Eve  29      Health    10
## 5            4    <NA>  NA Environment    15

Result:

Includes all volunteers (IDs 1, 2, 3, 4, 5).
Volunteer ID 1 has NA for department and hours.
Volunteer ID 4 has NA for name and age.

Summary of Joins:

Volunteer ID	Inner Join	Left Join	Right Join	Full Join
1	No	Yes	No	Yes
2	Yes	Yes	Yes	Yes
3	Yes	Yes	Yes	Yes
4	No	No	Yes	Yes
5	Yes	Yes	Yes	Yes

Conclusion

In this guide, we’ve covered essential data wrangling techniques in R using the tidyverse. Understanding how to tidy, reshape, filter, group, summarize, and join data is fundamental for effective data analysis.

By practicing these skills with synthetic datasets related to social justice topics, you can prepare yourself to handle real-world data and uncover meaningful insights.

Feel free to revisit these examples and experiment with your own data. Data wrangling is a critical part of your data analysis toolkit, enabling you to transform raw data into actionable insights.

```

Wrangling Data in R

Chad M. Topaz

2024-09-08

Getting Started

Tidy Data

Wide and Long Data

Filtering Data

Grouping and Summarizing Data

Joining Datasets

Inner Join

Left Join

Right Join

Full Join

Conclusion

Volunteer ID	Inner Join	Left Join	Right Join	Full Join
1	No	Yes	No	Yes
2	Yes	Yes	Yes	Yes
3	Yes	Yes	Yes	Yes
4	No	No	Yes	Yes
5	Yes	Yes	Yes	Yes

Volunteer ID	Inner Join	Left Join	Right Join	Full Join
1	No	Yes	No	Yes
2	Yes	Yes	Yes	Yes
3	Yes	Yes	Yes	Yes
4	No	No	Yes	Yes
5	Yes	Yes	Yes	Yes

Volunteer ID	Inner Join	Left Join	Right Join	Full Join
1	No	Yes	No	Yes
2	Yes	Yes	Yes	Yes
3	Yes	Yes	Yes	Yes
4	No	No	Yes	Yes
5	Yes	Yes	Yes	Yes