This project showcases the transformation of three different wide-format datasets — all initially messy, inconsistent, or incomplete — into clean, structured, analysis-ready formats. Using R tools like pivot_longer(), filter(), and clean_names(), I walk through how each dataset was tidied and what insights were made possible after the transformation.
Summary: To make sense of wide-format PRAMS data, I used pivot_longer() to transform multiple measurement columns (prevalence and confidence intervals) into a long, tidy structure. This allowed for a dynamic faceted line plot showing changes over time across different maternal health questions. I focused on the six most prevalent questions to avoid clutter and maximize insight. The facet layout separates prevalence from confidence intervals, which helps the viewer focus on patterns in each metric without distraction. This approach highlights not just what’s common — but how it has changed year to year.
Fix: I used pivot_longer() to reshape the data from wide to long format and stacked the percentage columns to make them easier to analyze.
Step 1: Load and Clean Column Names
prams <- read_csv("Pregnancy_Risk_Assessment_Monitoring_System__PRAMS__20250323.csv") %>%
clean_names()
Step 2: Tidy the Dataset
prams_long <- prams %>%
pivot_longer(cols = c(prevalence_percent,
lower_95_percent_confidence_interval,
upper_95_percent_confidence_interval),
names_to = "measure",
values_to = "value") %>%
filter(!is.na(value))
Step 3: Summarize by Mean Value
prams_summary <- prams_long %>%
group_by(year, question, measure) %>%
summarize(mean_value = mean(value, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(mean_value))
kable(head(prams_summary, 10))
year | question | measure | mean_value |
---|---|---|---|
2022 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | upper_95_percent_confidence_interval | 95.6 |
2021 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | upper_95_percent_confidence_interval | 94.6 |
2022 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | prevalence_percent | 94.3 |
2018 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | upper_95_percent_confidence_interval | 93.7 |
2020 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | upper_95_percent_confidence_interval | 93.6 |
2017 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | upper_95_percent_confidence_interval | 93.3 |
2021 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | prevalence_percent | 93.1 |
2022 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | lower_95_percent_confidence_interval | 92.7 |
2019 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | upper_95_percent_confidence_interval | 92.3 |
2018 | breastfeeding initiation (among mothers whose baby is alive and residing with her) | prevalence_percent | 92.2 |
# Identify top 6 questions based on average prevalence
top_questions <- prams_summary %>%
filter(measure == "prevalence_percent") %>%
top_n(6, mean_value) %>%
pull(question)
# Filter the long-format data to just those top questions
prams_plot_data <- prams_long %>%
filter(question %in% top_questions)
# Create the plot
ggplot(prams_plot_data, aes(x = year, y = value, color = question)) +
geom_line(size = 1.1) +
geom_point(size = 2) +
facet_wrap(~ measure, scales = "free_y") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) +
labs(
title = "Trends in Maternal Health Indicators (Top 6 Questions)",
subtitle = "Data from PRAMS Survey (Multiple Measures)",
x = "Year",
y = "Value (%)",
color = "Survey Question",
caption = "Source: CDC PRAMS, 2023"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 12, color = "gray40"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
strip.text = element_text(face = "bold")
)
Summary: Originally, I visualized the overall distribution of assistance types, but this didn’t show how different populations interacted with the system. So I extended the analysis by breaking referrals down by gender. Using a grouped bar chart, I visualized the proportion of referrals for each assistance type across genders. This helps identify disparities in service access or need, such as whether one gender is more likely to receive specific types of support. Adding labels and a clean horizontal layout made the insights immediately digestible
Fix: I used clean_names() to standardize columns and filtered out rows with missing client_key. Then I summarized by assistance type.
Step 1: Load, Clean, and Filter
homeless <- read_csv("homeless_prep.csv") %>%
clean_names() %>%
filter(!is.na(client_key))
kable(head(homeless, 10))
client_key | age | gender | veteran | income | nights | substanceabuse | completed | probation | assistancetype | required |
---|---|---|---|---|---|---|---|---|---|---|
256172 | 20 | Female | No | 0 | 152 | 0 | 0 | 1 | tempassistance | 1 |
221079 | 20 | Female | No | 740 | 229 | 0 | 1 | 0 | tempassistance | 1 |
243074 | 21 | Female | No | 0 | 105 | 1 | 1 | 0 | tempassistance | 1 |
129974 | 21 | Female | No | 0 | 371 | 1 | 1 | 0 | tempassistance | 1 |
226596 | 22 | Female | No | 0 | 169 | 0 | 0 | 0 | tempassistance | 1 |
87998 | 22 | Female | No | 1200 | 297 | 0 | 0 | 0 | tempassistance | 1 |
233477 | 22 | Female | No | 0 | 477 | 1 | 1 | 1 | tempassistance | 1 |
233477 | 22 | Female | No | 0 | 477 | 1 | 1 | 1 | tempassistance | 1 |
203539 | 23 | Female | No | 688 | 120 | 1 | 0 | 0 | tempassistance | 1 |
269015 | 23 | Female | No | 0 | 148 | 1 | 1 | 1 | tempassistance | 1 |
Step 2: Summarize Assistance Type
homeless_summary <- homeless %>%
group_by(assistancetype) %>%
count() %>%
arrange(desc(n))
kable(head(homeless_summary, 10))
assistancetype | n |
---|---|
permassistance | 121 |
tempassistance | 121 |
Step 3: Transform to Proportions
homeless_prop <- homeless_summary %>%
mutate(percent = round(100 * n / sum(n), 1))
kable(homeless_prop)
assistancetype | n | percent |
---|---|---|
permassistance | 121 | 100 |
tempassistance | 121 | 100 |
Step 4:
ggplot(homeless_prop, aes(x = reorder(assistancetype, percent), y = percent)) +
geom_col(fill = "#1f77b4") +
geom_text(aes(label = paste0(percent, "%")), hjust = -0.1, size = 3.5) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(
title = "DHS Assistance Types by Referral Proportion",
subtitle = "Most Common Referral Types in Homeless Dataset",
x = "Assistance Type",
y = "Percentage of Total",
caption = "Source: NYC DHS Referral Data"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray40"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 10),
plot.caption = element_text(size = 9, face = "italic"),
panel.grid.major.y = element_blank()
) +
coord_flip()
## Referral Gender Breakdown
# Group by gender and assistance type to calculate percentages
homeless_gender_prop <- homeless %>%
filter(!is.na(gender), !is.na(assistancetype)) %>%
group_by(gender, assistancetype) %>%
summarize(count = n(), .groups = "drop") %>%
group_by(gender) %>%
mutate(percent = round(100 * count / sum(count), 1))
# Plot with gender comparison
ggplot(homeless_gender_prop, aes(x = reorder(assistancetype, percent), y = percent, fill = gender)) +
geom_col(position = "dodge") +
geom_text(
aes(label = paste0(percent, "%")),
position = position_dodge(width = 0.9),
vjust = -0.25,
size = 3
) +
labs(
title = "DHS Referrals by Gender and Assistance Type",
subtitle = "Comparison of Referral Distribution by Gender",
x = "Assistance Type",
y = "Percentage",
fill = "Gender",
caption = "Source: NYC DHS Referral Dataset"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray40"),
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
panel.grid.major.x = element_blank()
)
Summary:The breast cancer dataset required a full restructuring due to messy year columns and missing headers. After cleaning and reshaping with pivot_longer(), I used faceted bar plots to show case trends by age group across five years. This allowed us to spotlight which age groups had consistently high or fluctuating case counts. I also created a separate line chart of total cases per year to give a big-picture view of overall incidence. These visuals complement each other — one breaks it down by age, the other tells the story across time Fix: I renamed the columns manually, reshaped the dataset using pivot_longer(), and visualized cases by year and age group.
Step 1: Load, Clean, and Rename Columns
untidy <- read_excel("Untidydataset _624.xlsx", sheet = 1) %>%
clean_names()
colnames(untidy) <- c("patient_id", "month", "age_group", "race",
"cases_2020", "cases_2021", "cases_2022", "cases_2023", "cases_2024",
"total_patients", "type")
kable(head(untidy, 10))
patient_id | month | age_group | race | cases_2020 | cases_2021 | cases_2022 | cases_2023 | cases_2024 | total_patients | type |
---|---|---|---|---|---|---|---|---|---|---|
Patient_ID | Month | Age_Group | Race | Cases_2020 | Cases_2021 | Cases_2022 | Cases_2023 | Cases_2024 | Total_Patients | Type |
P001 | January | 30-39 | White | 50 | 55 | 60 | 65 | 70 | 300 | Invasive Ductal |
P002 | January | 40-49 | Black | 90 | 95 | 100 | 105 | 120 | 510 | Lobular |
P003 | January | 50-59 | Hispanic | 80 | 85 | 88 | 90 | 110 | 453 | Triple-Negative |
P004 | February | 30-39 | Asian | 70 | 75 | 80 | 85 | 90 | 400 | HER2+ |
P005 | February | 40-49 | White | 110 | 115 | 120 | 125 | 140 | 610 | Inflammatory |
P006 | March | 50-59 | Black | 130 | 135 | 140 | 145 | 160 | 710 | Invasive Ductal |
P007 | March | 60-69 | Hispanic | 100 | 105 | 110 | 115 | 130 | 560 | Lobular |
P008 | March | 70+ | Asian | 70 | 75 | 78 | 80 | 100 | 403 | Triple-Negative |
P009 | April | 30-39 | White | 60 | 65 | 70 | 75 | 85 | 355 | HER2+ |
Step 2: Reshape Year Columns to Long Format
untidy_long <- untidy %>%
pivot_longer(cols = starts_with("cases_"),
names_to = "year",
values_to = "cases") %>%
mutate(year = str_remove(year, "cases_")) %>%
drop_na(cases)
Step 3: Visualize with ggplot2
ggplot(untidy_long, aes(x = year, y = cases)) +
geom_col(fill = "#FF6F61", width = 0.7) +
facet_wrap(~ age_group, scales = "free_y") +
geom_text(aes(label = cases), vjust = -0.2, size = 3, color = "black") +
labs(
title = "Breast Cancer Cases by Age Group (2020–2024)",
subtitle = "Case Counts Grouped by Age Across Years",
x = "Year",
y = "Number of Cases",
caption = "Source: NYC Health Breast Cancer Dataset"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray40"),
axis.text.x = element_text(angle = 0),
strip.text = element_text(face = "bold", size = 11),
panel.grid.major.x = element_blank()
)
Step 4: Yearly Totals Transformation
yearly_totals <- untidy_long %>%
mutate(cases = as.numeric(cases)) %>%
group_by(year) %>%
summarize(total_cases = sum(cases, na.rm = TRUE)) %>%
arrange(desc(total_cases))
ggplot(yearly_totals, aes(x = year, y = total_cases, group = 1)) +
geom_line(color = "#D72638", size = 1.2) +
geom_point(size = 2.5, color = "#D72638") +
geom_text(aes(label = total_cases), vjust = -0.5, size = 3.5) +
labs(
title = "Total Breast Cancer Cases by Year",
subtitle = "Aggregated Across All Age Groups",
x = "Year",
y = "Total Cases",
caption = "Source: NYC Health Breast Cancer Dataset"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray40"),
panel.grid.minor = element_blank()
)
Throughout this project, I gained a much deeper appreciation for the role data wrangling plays in actual analysis. Each dataset had its own set of issues — and none of them were ready to use out-of-the-box.
I learned how important it is to approach messy data with both strategy and patience. From inconsistent headers to multi-year wide formats, every issue required its own combination of janitor, tidyr, and dplyr tools to resolve.
Across all three datasets, visualization wasn’t just the final step — it was a diagnostic tool. The plots revealed patterns, gaps, and outliers that would be missed in raw tables alone. By reshaping the data into long format and visualizing it with ggplot2, I transformed messy data into clear narratives. Whether it was maternal health trends, gender-based service usage, or age-driven cancer incidence, each plot was chosen to maximize interpretability and real-world relevance.