Introduction

The purpose of this project is to practice preparing and tidying wide-format datasets for analysis.
I selected three wide datasets inspired by Week 6 discussions:
1. NYC Public Schools Test Scores
2. Prenatal Food Insecurity Survey
3. Minority College Enrollment Rates

For each dataset, I created a CSV file, read it into R, tidied and transformed it using tidyr and dplyr, and performed the requested analysis.
All code, outputs, and explanations are documented below.


Dataset 1: NYC Public Schools Test Scores

This dataset contains English Language Arts (ELA) and Math test scores for three NYC schools from 2019–2021. The wide format has each year in separate columns.

Load Data

library(dplyr)
library(tidyr)

nyc <- read.csv("nyc_schools.csv")
head(nyc)
##   School ELA_2019 ELA_2020 ELA_2021 Math_2019 Math_2020 Math_2021
## 1 PS 101       55       57       60        59        61        63
## 2 PS 202       62       64       66        65        67        69
## 3 PS 303       47       50       52        49        53        55

Tidying the Data

The dataset is wide (separate columns for each year/subject). I used pivot_longer to reshape it into long format.

nyc_long <- nyc %>% 
  pivot_longer(cols = -School,
               names_to = c("Subject", "Year"),
               names_sep = "_",
               values_to = "Score")

head(nyc_long)
## # A tibble: 6 × 4
##   School Subject Year  Score
##   <chr>  <chr>   <chr> <int>
## 1 PS 101 ELA     2019     55
## 2 PS 101 ELA     2020     57
## 3 PS 101 ELA     2021     60
## 4 PS 101 Math    2019     59
## 5 PS 101 Math    2020     61
## 6 PS 101 Math    2021     63

Analysis

We can now compute average performance by subject across years.

nyc_summary <- nyc_long %>% 
  group_by(Year, Subject) %>% 
  summarise(Avg_Score = mean(Score), .groups="drop")

nyc_summary
## # A tibble: 6 × 3
##   Year  Subject Avg_Score
##   <chr> <chr>       <dbl>
## 1 2019  ELA          54.7
## 2 2019  Math         57.7
## 3 2020  ELA          57  
## 4 2020  Math         60.3
## 5 2021  ELA          59.3
## 6 2021  Math         62.3

Narrative:
After tidying, we can clearly compare ELA and Math scores over time. We see steady improvement across years for both subjects.


Dataset 2: Prenatal Food Insecurity Survey

This dataset simulates survey responses from five patients regarding prenatal visits, transportation issues, food insecurity, and access to fresh food.

Load Data

prenatal <- read.csv("prenatal_food_insecurity.csv")
head(prenatal)
##   PatientID AttendedVisits TransportIssues FoodInsecurity FreshFoodAccess
## 1         1              8              No            Yes             Low
## 2         2              6              No             No             Low
## 3         3              7             Yes            Yes          Medium
## 4         4              5              No            Yes             Low
## 5         5              9              No             No            High

Tidying the Data

We will tidy the categorical survey responses into long format while keeping the numeric AttendedVisits separate.

prenatal_long <- prenatal %>% 
  pivot_longer(cols = c(TransportIssues, FoodInsecurity, FreshFoodAccess),
               names_to = "Question",
               values_to = "Response")

head(prenatal_long)
## # A tibble: 6 × 4
##   PatientID AttendedVisits Question        Response
##       <int>          <int> <chr>           <chr>   
## 1         1              8 TransportIssues No      
## 2         1              8 FoodInsecurity  Yes     
## 3         1              8 FreshFoodAccess Low     
## 4         2              6 TransportIssues No      
## 5         2              6 FoodInsecurity  No      
## 6         2              6 FreshFoodAccess Low

Analysis

We analyze patterns of food insecurity and fresh food access.

food_summary <- prenatal %>% 
  group_by(FoodInsecurity, FreshFoodAccess) %>% 
  summarise(Count = n(), Avg_Visits = mean(AttendedVisits), .groups="drop")

food_summary
## # A tibble: 4 × 4
##   FoodInsecurity FreshFoodAccess Count Avg_Visits
##   <chr>          <chr>           <int>      <dbl>
## 1 No             High                1        9  
## 2 No             Low                 1        6  
## 3 Yes            Low                 2        6.5
## 4 Yes            Medium              1        7

Narrative:
The tidied data makes it easy to analyze response patterns. We observe that most patients reporting food insecurity also reported low access to fresh food. This supports initiatives linking food access programs with prenatal care.


Dataset 3: Minority College Enrollment

This dataset shows enrollment counts of minority students from 2018–2022 at three universities.

Load Data

college <- read.csv("minority_college_enrollment.csv")
head(college)
##   College X2018 X2019 X2020 X2021 X2022
## 1  Univ A   320   340   360   380   400
## 2  Univ B   210   220   230   250   270
## 3  Univ C   180   190   200   215   225

Tidying the Data

The dataset is wide across years, so we convert it to long format and clean the year labels.

college_long <- college %>% 
  pivot_longer(cols = -College,
               names_to = "Year",
               values_to = "Enrollment") %>% 
  mutate(Year = gsub("X", "", Year))

head(college_long)
## # A tibble: 6 × 3
##   College Year  Enrollment
##   <chr>   <chr>      <int>
## 1 Univ A  2018         320
## 2 Univ A  2019         340
## 3 Univ A  2020         360
## 4 Univ A  2021         380
## 5 Univ A  2022         400
## 6 Univ B  2018         210

Analysis

We analyze enrollment trends over years.

enrollment_trend <- college_long %>% 
  group_by(Year) %>% 
  summarise(Total_Enrollment = sum(Enrollment), .groups="drop")

enrollment_trend
## # A tibble: 5 × 2
##   Year  Total_Enrollment
##   <chr>            <int>
## 1 2018               710
## 2 2019               750
## 3 2020               790
## 4 2021               845
## 5 2022               895

Narrative:
Enrollment increased steadily from 2018 to 2022 across all universities. Identifying such bright spots in growth highlights effective strategies that can be replicated to further promote diversity in higher education.


Conclusion

The project illustrates that preparing data carefully is as important as the analysis itself.