Overview / Introduction

For this assignment we’ve been given the task of choosing any three of the “wide” data sets identified in the Week 6 Discussion items by our classmates. For each of the three chosen data sets we will create a .CSV file or MySQL database, read the information from our files into R using tidyr and dplyr as needed to tidy and transform our data, and perform the analysis requested in the discussion item.

The first data set is from DATA.NY.GOV and it records the daily ridership numbers for the MTA since 2020 compared to pre-pandemic levels. The suggested analysis revolved around how ridership has fluctuated throughout the years following the pandemic and which means of transport have been affected more than others. The second data set contains world population data spanning several decades providing population counts for various countries from 1970 through 2022. The suggested analyses revolve around population growth across continents and the correlation between a country’s area and its population density over time. The third data set records HIV/AIDS diagnosis data in NYC and the suggested analyses revolve around diagnosis rates by gender and race, changes in diagnoses over the years across neighborhoods, and a comparison between new HIV and HIV/AIDS diagnoses.

Load Packages and First Data Set

Let’s start by loading the necessary packages and the first data set.

library(tidyverse)
library(dplyr)
library(zoo)
library(RSocrata)

url1 = "https://data.ny.gov/api/odata/v4/vxuj-8kew"
ridership <- read.socrata(url1)
glimpse(ridership)
## Rows: 1,754
## Columns: 15
## $ date                                                 <dttm> 2021-01-01, 2022…
## $ subways_total_estimated_ridership                    <int> 613692, 1027918, …
## $ subways_of_comparable_pre_pandemic_day               <dbl> 0.29, 0.38, 0.80,…
## $ buses_total_estimated_ridersip                       <int> 378288, 350845, 4…
## $ buses_of_comparable_pre_pandemic_day                 <dbl> 0.41, 0.29, 0.52,…
## $ lirr_total_estimated_ridership                       <int> 28977, 33980, 677…
## $ lirr_of_comparable_pre_pandemic_day                  <dbl> 0.35, 0.35, 0.82,…
## $ metro_north_total_estimated_ridership                <int> 14988, 30341, 663…
## $ metro_north_of_comparable_pre_pandemic_day           <dbl> 0.17, 0.23, 0.73,…
## $ access_a_ride_total_scheduled_trips                  <int> 5960, 4904, 11476…
## $ access_a_ride_of_comparable_pre_pandemic_day         <dbl> 0.44, 0.34, 0.85,…
## $ bridges_and_tunnels_total_traffic                    <int> 445950, 498515, 7…
## $ bridges_and_tunnels_of_comparable_pre_pandemic_day   <dbl> 0.65, 0.65, 1.08,…
## $ staten_island_railway_total_estimated_ridership      <int> 805, 1262, 1771, …
## $ staten_island_railway_of_comparable_pre_pandemic_day <dbl> 0.29, 0.31, 0.65,…

Tidy First Data Set

Seeing as how the suggested analysis was with regards to ridership fluctuation throughout the years and its affects on different means of transportation, our data set should then be tidied to better compare ridership between the different modes of transportation with regards to their initial pre pandemic levels. For this reason, we create a subset pulling the date field along with the columns comparing each respective mode of transportation with their pre pandemic levels. We then rename each column in order to pivot and create a long table containing a field for means of transportation and a field for ridership proportion compared to pre pandemic levels.

comparable_ridership <- subset(ridership, select = c(1, 3, 5, 7, 9, 11, 13, 15)) %>% 
  rename(subway = subways_of_comparable_pre_pandemic_day,
         buses = buses_of_comparable_pre_pandemic_day,
         lirr = lirr_of_comparable_pre_pandemic_day,
         "metro north" = metro_north_of_comparable_pre_pandemic_day,
         "access a ride" = access_a_ride_of_comparable_pre_pandemic_day,
         "bridges and tunnels" = bridges_and_tunnels_of_comparable_pre_pandemic_day,
         "staten island railway" = staten_island_railway_of_comparable_pre_pandemic_day) %>% 
  pivot_longer(cols = c(2:8),
               names_to = "means_of_transportation",
               values_to = "prop_compared_to_pre_pandemic")
glimpse(comparable_ridership)
## Rows: 12,278
## Columns: 3
## $ date                          <dttm> 2021-01-01, 2021-01-01, 2021-01-01, 202…
## $ means_of_transportation       <chr> "subway", "buses", "lirr", "metro north"…
## $ prop_compared_to_pre_pandemic <dbl> 0.29, 0.41, 0.35, 0.17, 0.44, 0.65, 0.29…

Analyze First Data Set

Once the long table has been created, we go ahead and group the data by means_of_transportation and calculate the mean prop_compared_to_pre_pandemic for each one in order to get a glimpse of which ones have been most affected by the pandemic and have yet to return to pre pandemic levels. We then create a box plot to get a better understanding of these averages and their respective distributions. Given these visuals, it seems like bridges and tunnels have had greater success in returning to pre pandemic levels while the staten island railway continues to struggle.

comparable_ridership %>%
  group_by(means_of_transportation) %>%
  summarise(
    "avg_prop_compared_to_pre_pandemic" = mean(prop_compared_to_pre_pandemic)
  ) %>% 
  arrange(means_of_transportation)
## # A tibble: 7 × 2
##   means_of_transportation avg_prop_compared_to_pre_pandemic
##   <chr>                                               <dbl>
## 1 access a ride                                       0.874
## 2 bridges and tunnels                                 0.936
## 3 buses                                               0.551
## 4 lirr                                                0.602
## 5 metro north                                         0.521
## 6 staten island railway                               0.385
## 7 subway                                              0.561
ggplot(data = comparable_ridership, aes(x = means_of_transportation,
                                        y = prop_compared_to_pre_pandemic,
                                        color = means_of_transportation)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Load Second Data Set

Now, we move on to the second data set.

url2 = "https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/world_population.csv"
world_population <- read.csv(url2)
glimpse(world_population)
## Rows: 234
## Columns: 17
## $ Rank                        <int> 36, 138, 34, 213, 203, 42, 224, 201, 33, 1…
## $ CCA3                        <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO", …
## $ Country.Territory           <chr> "Afghanistan", "Albania", "Algeria", "Amer…
## $ Capital                     <chr> "Kabul", "Tirana", "Algiers", "Pago Pago",…
## $ Continent                   <chr> "Asia", "Europe", "Africa", "Oceania", "Eu…
## $ X2022.Population            <int> 41128771, 2842321, 44903225, 44273, 79824,…
## $ X2020.Population            <int> 38972230, 2866849, 43451666, 46189, 77700,…
## $ X2015.Population            <int> 33753499, 2882481, 39543154, 51368, 71746,…
## $ X2010.Population            <int> 28189672, 2913399, 35856344, 54849, 71519,…
## $ X2000.Population            <int> 19542982, 3182021, 30774621, 58230, 66097,…
## $ X1990.Population            <int> 10694796, 3295066, 25518074, 47818, 53569,…
## $ X1980.Population            <int> 12486631, 2941651, 18739378, 32886, 35611,…
## $ X1970.Population            <int> 10752971, 2324731, 13795915, 27075, 19860,…
## $ Area..km..                  <int> 652230, 28748, 2381741, 199, 468, 1246700,…
## $ Density..per.km..           <dbl> 63.0587, 98.8702, 18.8531, 222.4774, 170.5…
## $ Growth.Rate                 <dbl> 1.0257, 0.9957, 1.0164, 0.9831, 1.0100, 1.…
## $ World.Population.Percentage <dbl> 0.52, 0.04, 0.56, 0.00, 0.00, 0.45, 0.00, …

Tidy Second Data Set

The suggested analysis for this data set revolves around population growth across continents and the correlation between a country’s area and its population density over time. All that the first analysis needs is a view of population growth through the decades for each continent. The second analysis is a little trickier, however. The initial discussion post asked for an analysis taking a country’s area (a numerical variable), a country’s population density (another numerical variable), and time (a numerical or categorical variable depending on context) into account. In order to simplify the analysis, we go ahead and create a variable for the change in density, that way we only have to take that variable and country area into consideration.

For the first subset, we go ahead and select the columns we need, rename the columns that refer to decades, and pivot everything into a long table where year and population are their own columns. We then group the subset by Continent and year, sum the populations, and ensure year is a numerical variable for plotting purposes. For the second subset, have a similar approach, except when renaming we have to use column indices since the area and density columns tend to import with slight variations in their titles, making them difficult to call by name when the code is rerun. Once renaming is done, we create the change in density column called density_prop_change and select just the area and density change columns for analysis.

pop_by_continent <- subset(world_population, select = c(5:13)) %>%
  rename("2022" = X2022.Population, "2020" = X2020.Population,
         "2015" = X2015.Population, "2010" = X2010.Population,
         "2000" = X2000.Population, "1990" = X1990.Population,
         "1980" = X1980.Population, "1970" = X1970.Population) %>% 
  pivot_longer(cols = c(2:9),
               names_to = "year",
               values_to = "population") %>% 
  group_by(Continent, year) %>% 
  summarise(
    "total_population" = sum(population)
  )
pop_by_continent$year <- as.numeric(pop_by_continent$year)
glimpse(pop_by_continent)
## Rows: 48
## Columns: 3
## Groups: Continent [6]
## $ Continent        <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Af…
## $ year             <dbl> 1970, 1980, 1990, 2000, 2010, 2015, 2020, 2022, 1970,…
## $ total_population <dbl> 365444348, 481536377, 638150629, 818946032, 105522807…
density_by_area <- subset(world_population, select = c(13:15)) %>% 
  rename(pop_1970 = c(1), area_km = c(2), density_2022 = c(3)) %>% 
  mutate(density_prop_change = (density_2022 - (pop_1970/area_km))/(pop_1970/area_km)) %>% 
  select(2, 4) %>%  
  arrange(area_km)
glimpse(density_by_area)
## Rows: 234
## Columns: 2
## $ area_km             <int> 1, 2, 6, 12, 21, 21, 26, 30, 34, 53, 54, 61, 78, 9…
## $ density_prop_change <dbl> -0.32180851, 0.50263700, 0.22349635, 0.09159883, 0…

Analyze Second Data Set

For the first analysis, we create a line graph comparing year to total population for each continent. As can be clearly seen, Asia and Africa continue to be the continents with the most population growth over the last few decades while Europe and Oceania have basically stagnated. For the second analysis, we create a scatter plot comparing area to change in density and see that there is little to no correlation between a country’s area and its population density change through the decades. In order further investigate the lack of correlation, we run a test on the regression line and note an \(R^2\) value of 0.002, F-statistic of 0.443, and p-value of 0.5, showing absolutely no relationship or dependency between these two variables.

ggplot(data = pop_by_continent, aes(x = year,
                                    y = total_population,
                                    color = Continent)) +
  geom_line(linewidth = 2)

ggplot(data = density_by_area, aes(x = area_km,
                                   y = density_prop_change)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE)

summary(lm(density_prop_change ~ area_km, data = density_by_area))
## 
## Call:
## lm(formula = density_prop_change ~ area_km, data = density_by_area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5439 -1.5162 -0.4621  0.7561 28.7617 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.917e+00  1.938e-01   9.893   <2e-16 ***
## area_km     -6.966e-08  1.046e-07  -0.666    0.506    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.814 on 232 degrees of freedom
## Multiple R-squared:  0.001906,   Adjusted R-squared:  -0.002396 
## F-statistic: 0.4431 on 1 and 232 DF,  p-value: 0.5063

Load Third Data Set

Finally, let us load the third data set.

url3 = "https://data.cityofnewyork.us/api/odata/v4/ykvb-493p"
hiv_aids <- read.socrata(url3)
glimpse(hiv_aids)
## Rows: 8,976
## Columns: 11
## $ year                          <int> 2010, 2011, 2010, 2012, 2013, 2013, 2013…
## $ borough                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ neighborhood                  <chr> "Greenpoint", "Stapleton - St. George", …
## $ sex                           <chr> "Male", "Female", "Male", "Female", "Mal…
## $ race                          <chr> "Black", "Native American", "All", "Unkn…
## $ hiv_diagnoses_num             <chr> "6", "0", "23", "0", "0", "54", "0", "0"…
## $ hiv_diagnoses_num_per_100k    <chr> "330.4", "0", "25.4", "0", "0", "56.5", …
## $ hiv_aids_concurrent_num       <chr> "0", "0", "5", "0", "0", "8", "0", "0", …
## $ concurrent_hiv_aids_among_all <chr> "0", "0", "21.7", "0", "0", "14.8", "0",…
## $ aids_diagnoses_num            <chr> "5", "0", "14", "0", "0", "33", "0", "0"…
## $ aids_diagnoses_num_per_100k   <chr> "275.3", "0", "15.4", "0", "0", "34.5", …

Tidy Third Data Set

The suggested analyses for this data set revolve around diagnosis rates by gender and race, changes in diagnoses over the years across neighborhoods, and a comparison between new HIV and HIV/AIDS diagnoses. Out of the three suggested analyses, the first and third one seem more plausible since the borough and neighborhood fields have a lot of empty observations, hindering the second suggested analysis. Thus, we tailor our tidying towards the first and third suggested analyses for this data set.

We create the subset using the year field along with sex, race, and the different diagnosis counts. Right away we rename the diagnosis count columns to prepare them for pivoting. We pivot these columns in order to create one for diagnosis and another for count. Some observations for count have an invalid “*” value and have to be filtered out. Once these invalid values are removed we can make the count field numeric and optimal for analysis. After all is said and done, we group the subset by year, sex, race, diagnosis and sum the counts in order to avoid unnecessary repetition in the diagnosis column.

hiv_aids_sex_race <- subset(hiv_aids, select = c(1, 4:6, 8, 10)) %>%
  rename(hiv = hiv_diagnoses_num, "hiv and aids" = hiv_aids_concurrent_num,
         aids = aids_diagnoses_num) %>% 
  pivot_longer(cols = c(4:6),
               names_to = "diagnosis",
               values_to = "count") %>% 
  filter(count != "*")
hiv_aids_sex_race$count <- as.numeric(hiv_aids_sex_race$count)

hiv_aids_sex_race <- hiv_aids_sex_race %>% 
  group_by(year, sex, race, diagnosis) %>% 
  summarise(
    count = sum(count)
  ) %>% 
  arrange(year, sex, race, diagnosis)

Analyze Third Data Set

Next, we create a line graph for each each relationship that was mentioned in the first and third suggestions. The first is a plot displaying total count of diagnoses through the years by sex, the second shows it by race, and the third shows HIV vs. HIV/AIDS diagnoses. As can be seen in every plot, there seemed to be an uptick in diagnoses in the years 2016 and 2020. The first plot shows males to be those primarily affected by the disease, the second plot shows that black New Yorkers hold a higher proportion of diagnoses, and that concurrent HIV/AID diagnoses still hold a noticeable portion of first time diagnoses. One thing to note about the data as a whole is that the years 2014 and 2015 are not accounted for, thus explaining why in every plot there is a perfectly straight line connecting 2013 to 2016 or at times no line at all.

ggplot(data = hiv_aids_sex_race %>% 
         group_by(year, sex) %>% 
         summarise(
           count = sum(count)
           ), aes(x = year, y = count, color = sex)) +
  geom_line(linewidth = 2)

ggplot(data = hiv_aids_sex_race %>% 
         group_by(year, race) %>% 
         summarise(
           count = sum(count)
           ), aes(x = year, y = count, color = race)) +
  geom_line(linewidth = 2)

ggplot(data = hiv_aids_sex_race %>% 
         filter(diagnosis != "aids") %>% 
         group_by(year, diagnosis) %>% 
         summarise(
           count = sum(count)
           ), aes(x = year, y = count, color = diagnosis)) +
  geom_line(linewidth = 2)

Conclusions / Findings and Recommendations

The different data sets and suggested data analyses required different approaches towards tidying and preparation. The first data set relied mostly on pivoting the different modes of transportation and helped us see just how badly the pandemic has affected the MTA. The second data set needed more of a keen eye on short comings from the table itself and showed us just how population has grown and how area has no effect on population density change at all. The third data set had issues with data quality but we still had enough to analyze what we needed to, helping us take note of HIV/AIDS trends in NYC. All in all, the data sets were successfully tidied and the suggested analyses were successfully performed.