Introduction

This project showcases the transformation of three different wide-format datasets — all initially messy, inconsistent, or incomplete — into clean, structured, analysis-ready formats. Using R tools like pivot_longer(), filter(), and clean_names(), I walk through how each dataset was tidied and what insights were made possible after the transformation.

Dataset 1: PRAMS (Pregnancy Risk Assessment Monitoring System)

Summary: To make sense of wide-format PRAMS data, I used pivot_longer() to transform multiple measurement columns (prevalence and confidence intervals) into a long, tidy structure. This allowed for a dynamic faceted line plot showing changes over time across different maternal health questions. I focused on the six most prevalent questions to avoid clutter and maximize insight. The facet layout separates prevalence from confidence intervals, which helps the viewer focus on patterns in each metric without distraction. This approach highlights not just what’s common — but how it has changed year to year.

Fix: I used pivot_longer() to reshape the data from wide to long format and stacked the percentage columns to make them easier to analyze.

Step 1: Load and Clean Column Names

prams <- read_csv("Pregnancy_Risk_Assessment_Monitoring_System__PRAMS__20250323.csv") %>%
  clean_names()

Step 2: Tidy the Dataset

prams_long <- prams %>%
  pivot_longer(cols = c(prevalence_percent,
                        lower_95_percent_confidence_interval,
                        upper_95_percent_confidence_interval),
               names_to = "measure",
               values_to = "value") %>%
  filter(!is.na(value))

Step 3: Summarize by Mean Value

prams_summary <- prams_long %>%
  group_by(year, question, measure) %>%
  summarize(mean_value = mean(value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_value))

kable(head(prams_summary, 10))
year question measure mean_value
2022 breastfeeding initiation (among mothers whose baby is alive and residing with her) upper_95_percent_confidence_interval 95.6
2021 breastfeeding initiation (among mothers whose baby is alive and residing with her) upper_95_percent_confidence_interval 94.6
2022 breastfeeding initiation (among mothers whose baby is alive and residing with her) prevalence_percent 94.3
2018 breastfeeding initiation (among mothers whose baby is alive and residing with her) upper_95_percent_confidence_interval 93.7
2020 breastfeeding initiation (among mothers whose baby is alive and residing with her) upper_95_percent_confidence_interval 93.6
2017 breastfeeding initiation (among mothers whose baby is alive and residing with her) upper_95_percent_confidence_interval 93.3
2021 breastfeeding initiation (among mothers whose baby is alive and residing with her) prevalence_percent 93.1
2022 breastfeeding initiation (among mothers whose baby is alive and residing with her) lower_95_percent_confidence_interval 92.7
2019 breastfeeding initiation (among mothers whose baby is alive and residing with her) upper_95_percent_confidence_interval 92.3
2018 breastfeeding initiation (among mothers whose baby is alive and residing with her) prevalence_percent 92.2

PRAMS Visualizations

# Identify top 6 questions based on average prevalence
top_questions <- prams_summary %>%
  filter(measure == "prevalence_percent") %>%
  top_n(6, mean_value) %>%
  pull(question)

# Filter the long-format data to just those top questions
prams_plot_data <- prams_long %>%
  filter(question %in% top_questions)

# Create the plot
ggplot(prams_plot_data, aes(x = year, y = value, color = question)) +
  geom_line(size = 1.1) +
  geom_point(size = 2) +
  facet_wrap(~ measure, scales = "free_y") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(
    title = "Trends in Maternal Health Indicators (Top 6 Questions)",
    subtitle = "Data from PRAMS Survey (Multiple Measures)",
    x = "Year",
    y = "Value (%)",
    color = "Survey Question",
    caption = "Source: CDC PRAMS, 2023"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 12, color = "gray40"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    strip.text = element_text(face = "bold")
  )

Dataset 2: DHS Referral Outcomes

Summary: Originally, I visualized the overall distribution of assistance types, but this didn’t show how different populations interacted with the system. So I extended the analysis by breaking referrals down by gender. Using a grouped bar chart, I visualized the proportion of referrals for each assistance type across genders. This helps identify disparities in service access or need, such as whether one gender is more likely to receive specific types of support. Adding labels and a clean horizontal layout made the insights immediately digestible

Fix: I used clean_names() to standardize columns and filtered out rows with missing client_key. Then I summarized by assistance type.

Step 1: Load, Clean, and Filter

homeless <- read_csv("homeless_prep.csv") %>%
  clean_names() %>%
  filter(!is.na(client_key))

kable(head(homeless, 10))
client_key age gender veteran income nights substanceabuse completed probation assistancetype required
256172 20 Female No 0 152 0 0 1 tempassistance 1
221079 20 Female No 740 229 0 1 0 tempassistance 1
243074 21 Female No 0 105 1 1 0 tempassistance 1
129974 21 Female No 0 371 1 1 0 tempassistance 1
226596 22 Female No 0 169 0 0 0 tempassistance 1
87998 22 Female No 1200 297 0 0 0 tempassistance 1
233477 22 Female No 0 477 1 1 1 tempassistance 1
233477 22 Female No 0 477 1 1 1 tempassistance 1
203539 23 Female No 688 120 1 0 0 tempassistance 1
269015 23 Female No 0 148 1 1 1 tempassistance 1

Step 2: Summarize Assistance Type

homeless_summary <- homeless %>%
  group_by(assistancetype) %>%
  count() %>%
  arrange(desc(n))

kable(head(homeless_summary, 10))
assistancetype n
permassistance 121
tempassistance 121

Step 3: Transform to Proportions

homeless_prop <- homeless_summary %>%
  mutate(percent = round(100 * n / sum(n), 1))

kable(homeless_prop)
assistancetype n percent
permassistance 121 100
tempassistance 121 100

Step 4:

Visualize Assistance Distribution

ggplot(homeless_prop, aes(x = reorder(assistancetype, percent), y = percent)) +
  geom_col(fill = "#1f77b4") +
  geom_text(aes(label = paste0(percent, "%")), hjust = -0.1, size = 3.5) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  labs(
    title = "DHS Assistance Types by Referral Proportion",
    subtitle = "Most Common Referral Types in Homeless Dataset",
    x = "Assistance Type",
    y = "Percentage of Total",
    caption = "Source: NYC DHS Referral Data"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(size = 10),
    plot.caption = element_text(size = 9, face = "italic"),
    panel.grid.major.y = element_blank()
  ) +
  coord_flip()

## Referral Gender Breakdown

# Group by gender and assistance type to calculate percentages
homeless_gender_prop <- homeless %>%
  filter(!is.na(gender), !is.na(assistancetype)) %>%
  group_by(gender, assistancetype) %>%
  summarize(count = n(), .groups = "drop") %>%
  group_by(gender) %>%
  mutate(percent = round(100 * count / sum(count), 1))

# Plot with gender comparison
ggplot(homeless_gender_prop, aes(x = reorder(assistancetype, percent), y = percent, fill = gender)) +
  geom_col(position = "dodge") +
  geom_text(
    aes(label = paste0(percent, "%")),
    position = position_dodge(width = 0.9),
    vjust = -0.25,
    size = 3
  ) +
  labs(
    title = "DHS Referrals by Gender and Assistance Type",
    subtitle = "Comparison of Referral Distribution by Gender",
    x = "Assistance Type",
    y = "Percentage",
    fill = "Gender",
    caption = "Source: NYC DHS Referral Dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    panel.grid.major.x = element_blank()
  )

Dataset 3: Breast Cancer Cases (2020–2024)

Summary:The breast cancer dataset required a full restructuring due to messy year columns and missing headers. After cleaning and reshaping with pivot_longer(), I used faceted bar plots to show case trends by age group across five years. This allowed us to spotlight which age groups had consistently high or fluctuating case counts. I also created a separate line chart of total cases per year to give a big-picture view of overall incidence. These visuals complement each other — one breaks it down by age, the other tells the story across time Fix: I renamed the columns manually, reshaped the dataset using pivot_longer(), and visualized cases by year and age group.

Step 1: Load, Clean, and Rename Columns

untidy <- read_excel("Untidydataset _624.xlsx", sheet = 1) %>%
  clean_names()

colnames(untidy) <- c("patient_id", "month", "age_group", "race",  
                      "cases_2020", "cases_2021", "cases_2022", "cases_2023", "cases_2024",
                      "total_patients", "type")

kable(head(untidy, 10))
patient_id month age_group race cases_2020 cases_2021 cases_2022 cases_2023 cases_2024 total_patients type
Patient_ID Month Age_Group Race Cases_2020 Cases_2021 Cases_2022 Cases_2023 Cases_2024 Total_Patients Type
P001 January 30-39 White 50 55 60 65 70 300 Invasive Ductal
P002 January 40-49 Black 90 95 100 105 120 510 Lobular
P003 January 50-59 Hispanic 80 85 88 90 110 453 Triple-Negative
P004 February 30-39 Asian 70 75 80 85 90 400 HER2+
P005 February 40-49 White 110 115 120 125 140 610 Inflammatory
P006 March 50-59 Black 130 135 140 145 160 710 Invasive Ductal
P007 March 60-69 Hispanic 100 105 110 115 130 560 Lobular
P008 March 70+ Asian 70 75 78 80 100 403 Triple-Negative
P009 April 30-39 White 60 65 70 75 85 355 HER2+

Step 2: Reshape Year Columns to Long Format

untidy_long <- untidy %>%
  pivot_longer(cols = starts_with("cases_"),
               names_to = "year",
               values_to = "cases") %>%
  mutate(year = str_remove(year, "cases_")) %>%
  drop_na(cases)

Visualizations Breast Cancer Cases

Step 3: Visualize with ggplot2

ggplot(untidy_long, aes(x = year, y = cases)) +
  geom_col(fill = "#FF6F61", width = 0.7) +
  facet_wrap(~ age_group, scales = "free_y") +
  geom_text(aes(label = cases), vjust = -0.2, size = 3, color = "black") +
  labs(
    title = "Breast Cancer Cases by Age Group (2020–2024)",
    subtitle = "Case Counts Grouped by Age Across Years",
    x = "Year",
    y = "Number of Cases",
    caption = "Source: NYC Health Breast Cancer Dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    axis.text.x = element_text(angle = 0),
    strip.text = element_text(face = "bold", size = 11),
    panel.grid.major.x = element_blank()
  )

Step 4: Yearly Totals Transformation

yearly_totals <- untidy_long %>%
  mutate(cases = as.numeric(cases)) %>%
  group_by(year) %>%
  summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
  arrange(desc(total_cases))
ggplot(yearly_totals, aes(x = year, y = total_cases, group = 1)) +
  geom_line(color = "#D72638", size = 1.2) +
  geom_point(size = 2.5, color = "#D72638") +
  geom_text(aes(label = total_cases), vjust = -0.5, size = 3.5) +
  labs(
    title = "Total Breast Cancer Cases by Year",
    subtitle = "Aggregated Across All Age Groups",
    x = "Year",
    y = "Total Cases",
    caption = "Source: NYC Health Breast Cancer Dataset"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40"),
    panel.grid.minor = element_blank()
  )

Reflection

Throughout this project, I gained a much deeper appreciation for the role data wrangling plays in actual analysis. Each dataset had its own set of issues — and none of them were ready to use out-of-the-box.

I learned how important it is to approach messy data with both strategy and patience. From inconsistent headers to multi-year wide formats, every issue required its own combination of janitor, tidyr, and dplyr tools to resolve.

Across all three datasets, visualization wasn’t just the final step — it was a diagnostic tool. The plots revealed patterns, gaps, and outliers that would be missed in raw tables alone. By reshaping the data into long format and visualizing it with ggplot2, I transformed messy data into clear narratives. Whether it was maternal health trends, gender-based service usage, or age-driven cancer incidence, each plot was chosen to maximize interpretability and real-world relevance.