The purpose of this project is to practice preparing and tidying
wide-format datasets for analysis.
I selected three wide datasets inspired by Week 6
discussions:
1. NYC Public Schools Test Scores
2. Prenatal Food Insecurity Survey
3. Minority College Enrollment Rates
For each dataset, I created a CSV file, read it into R, tidied and
transformed it using tidyr and dplyr,
and performed the requested analysis.
All code, outputs, and explanations are documented below.
This dataset contains English Language Arts (ELA) and Math test scores for three NYC schools from 2019–2021. The wide format has each year in separate columns.
library(dplyr)
library(tidyr)
nyc <- read.csv("nyc_schools.csv")
head(nyc)
## School ELA_2019 ELA_2020 ELA_2021 Math_2019 Math_2020 Math_2021
## 1 PS 101 55 57 60 59 61 63
## 2 PS 202 62 64 66 65 67 69
## 3 PS 303 47 50 52 49 53 55
The dataset is wide (separate columns for each year/subject). I used
pivot_longer to reshape it into long format.
nyc_long <- nyc %>%
pivot_longer(cols = -School,
names_to = c("Subject", "Year"),
names_sep = "_",
values_to = "Score")
head(nyc_long)
## # A tibble: 6 × 4
## School Subject Year Score
## <chr> <chr> <chr> <int>
## 1 PS 101 ELA 2019 55
## 2 PS 101 ELA 2020 57
## 3 PS 101 ELA 2021 60
## 4 PS 101 Math 2019 59
## 5 PS 101 Math 2020 61
## 6 PS 101 Math 2021 63
We can now compute average performance by subject across years.
nyc_summary <- nyc_long %>%
group_by(Year, Subject) %>%
summarise(Avg_Score = mean(Score), .groups="drop")
nyc_summary
## # A tibble: 6 × 3
## Year Subject Avg_Score
## <chr> <chr> <dbl>
## 1 2019 ELA 54.7
## 2 2019 Math 57.7
## 3 2020 ELA 57
## 4 2020 Math 60.3
## 5 2021 ELA 59.3
## 6 2021 Math 62.3
Narrative:
After tidying, we can clearly compare ELA and Math scores over time. We
see steady improvement across years for both subjects.
This dataset simulates survey responses from five patients regarding prenatal visits, transportation issues, food insecurity, and access to fresh food.
prenatal <- read.csv("prenatal_food_insecurity.csv")
head(prenatal)
## PatientID AttendedVisits TransportIssues FoodInsecurity FreshFoodAccess
## 1 1 8 No Yes Low
## 2 2 6 No No Low
## 3 3 7 Yes Yes Medium
## 4 4 5 No Yes Low
## 5 5 9 No No High
We will tidy the categorical survey responses into long format while
keeping the numeric AttendedVisits separate.
prenatal_long <- prenatal %>%
pivot_longer(cols = c(TransportIssues, FoodInsecurity, FreshFoodAccess),
names_to = "Question",
values_to = "Response")
head(prenatal_long)
## # A tibble: 6 × 4
## PatientID AttendedVisits Question Response
## <int> <int> <chr> <chr>
## 1 1 8 TransportIssues No
## 2 1 8 FoodInsecurity Yes
## 3 1 8 FreshFoodAccess Low
## 4 2 6 TransportIssues No
## 5 2 6 FoodInsecurity No
## 6 2 6 FreshFoodAccess Low
We analyze patterns of food insecurity and fresh food access.
food_summary <- prenatal %>%
group_by(FoodInsecurity, FreshFoodAccess) %>%
summarise(Count = n(), Avg_Visits = mean(AttendedVisits), .groups="drop")
food_summary
## # A tibble: 4 × 4
## FoodInsecurity FreshFoodAccess Count Avg_Visits
## <chr> <chr> <int> <dbl>
## 1 No High 1 9
## 2 No Low 1 6
## 3 Yes Low 2 6.5
## 4 Yes Medium 1 7
Narrative:
The tidied data makes it easy to analyze response patterns. We observe
that most patients reporting food insecurity also reported low access to
fresh food. This supports initiatives linking food access programs with
prenatal care.
This dataset shows enrollment counts of minority students from 2018–2022 at three universities.
college <- read.csv("minority_college_enrollment.csv")
head(college)
## College X2018 X2019 X2020 X2021 X2022
## 1 Univ A 320 340 360 380 400
## 2 Univ B 210 220 230 250 270
## 3 Univ C 180 190 200 215 225
The dataset is wide across years, so we convert it to long format and clean the year labels.
college_long <- college %>%
pivot_longer(cols = -College,
names_to = "Year",
values_to = "Enrollment") %>%
mutate(Year = gsub("X", "", Year))
head(college_long)
## # A tibble: 6 × 3
## College Year Enrollment
## <chr> <chr> <int>
## 1 Univ A 2018 320
## 2 Univ A 2019 340
## 3 Univ A 2020 360
## 4 Univ A 2021 380
## 5 Univ A 2022 400
## 6 Univ B 2018 210
We analyze enrollment trends over years.
enrollment_trend <- college_long %>%
group_by(Year) %>%
summarise(Total_Enrollment = sum(Enrollment), .groups="drop")
enrollment_trend
## # A tibble: 5 × 2
## Year Total_Enrollment
## <chr> <int>
## 1 2018 710
## 2 2019 750
## 3 2020 790
## 4 2021 845
## 5 2022 895
Narrative:
Enrollment increased steadily from 2018 to 2022 across all universities.
Identifying such bright spots in growth highlights effective strategies
that can be replicated to further promote diversity in higher
education.
tidyr::pivot_longer.dplyr, I summarized and analyzed the data.The project illustrates that preparing data carefully is as important as the analysis itself.