For this assignment we’ve been given the task of choosing any three of the “wide” data sets identified in the Week 6 Discussion items by our classmates. For each of the three chosen data sets we will create a .CSV file or MySQL database, read the information from our files into R using tidyr and dplyr as needed to tidy and transform our data, and perform the analysis requested in the discussion item.
The first data set is from DATA.NY.GOV and it records the daily ridership numbers for the MTA since 2020 compared to pre-pandemic levels. The suggested analysis revolved around how ridership has fluctuated throughout the years following the pandemic and which means of transport have been affected more than others. The second data set contains world population data spanning several decades providing population counts for various countries from 1970 through 2022. The suggested analyses revolve around population growth across continents and the correlation between a country’s area and its population density over time. The third data set records HIV/AIDS diagnosis data in NYC and the suggested analyses revolve around diagnosis rates by gender and race, changes in diagnoses over the years across neighborhoods, and a comparison between new HIV and HIV/AIDS diagnoses.
Let’s start by loading the necessary packages and the first data set.
library(tidyverse)
library(dplyr)
library(zoo)
library(RSocrata)
url1 = "https://data.ny.gov/api/odata/v4/vxuj-8kew"
ridership <- read.socrata(url1)
glimpse(ridership)
## Rows: 1,754
## Columns: 15
## $ date <dttm> 2021-01-01, 2022…
## $ subways_total_estimated_ridership <int> 613692, 1027918, …
## $ subways_of_comparable_pre_pandemic_day <dbl> 0.29, 0.38, 0.80,…
## $ buses_total_estimated_ridersip <int> 378288, 350845, 4…
## $ buses_of_comparable_pre_pandemic_day <dbl> 0.41, 0.29, 0.52,…
## $ lirr_total_estimated_ridership <int> 28977, 33980, 677…
## $ lirr_of_comparable_pre_pandemic_day <dbl> 0.35, 0.35, 0.82,…
## $ metro_north_total_estimated_ridership <int> 14988, 30341, 663…
## $ metro_north_of_comparable_pre_pandemic_day <dbl> 0.17, 0.23, 0.73,…
## $ access_a_ride_total_scheduled_trips <int> 5960, 4904, 11476…
## $ access_a_ride_of_comparable_pre_pandemic_day <dbl> 0.44, 0.34, 0.85,…
## $ bridges_and_tunnels_total_traffic <int> 445950, 498515, 7…
## $ bridges_and_tunnels_of_comparable_pre_pandemic_day <dbl> 0.65, 0.65, 1.08,…
## $ staten_island_railway_total_estimated_ridership <int> 805, 1262, 1771, …
## $ staten_island_railway_of_comparable_pre_pandemic_day <dbl> 0.29, 0.31, 0.65,…
Seeing as how the suggested analysis was with regards to ridership fluctuation throughout the years and its affects on different means of transportation, our data set should then be tidied to better compare ridership between the different modes of transportation with regards to their initial pre pandemic levels. For this reason, we create a subset pulling the date field along with the columns comparing each respective mode of transportation with their pre pandemic levels. We then rename each column in order to pivot and create a long table containing a field for means of transportation and a field for ridership proportion compared to pre pandemic levels.
comparable_ridership <- subset(ridership, select = c(1, 3, 5, 7, 9, 11, 13, 15)) %>%
rename(subway = subways_of_comparable_pre_pandemic_day,
buses = buses_of_comparable_pre_pandemic_day,
lirr = lirr_of_comparable_pre_pandemic_day,
"metro north" = metro_north_of_comparable_pre_pandemic_day,
"access a ride" = access_a_ride_of_comparable_pre_pandemic_day,
"bridges and tunnels" = bridges_and_tunnels_of_comparable_pre_pandemic_day,
"staten island railway" = staten_island_railway_of_comparable_pre_pandemic_day) %>%
pivot_longer(cols = c(2:8),
names_to = "means_of_transportation",
values_to = "prop_compared_to_pre_pandemic")
glimpse(comparable_ridership)
## Rows: 12,278
## Columns: 3
## $ date <dttm> 2021-01-01, 2021-01-01, 2021-01-01, 202…
## $ means_of_transportation <chr> "subway", "buses", "lirr", "metro north"…
## $ prop_compared_to_pre_pandemic <dbl> 0.29, 0.41, 0.35, 0.17, 0.44, 0.65, 0.29…
Once the long table has been created, we go ahead and group the data
by means_of_transportation and calculate the mean
prop_compared_to_pre_pandemic for each one in order to get
a glimpse of which ones have been most affected by the pandemic and have
yet to return to pre pandemic levels. We then create a box plot to get a
better understanding of these averages and their respective
distributions. Given these visuals, it seems like bridges and tunnels
have had greater success in returning to pre pandemic levels while the
staten island railway continues to struggle.
comparable_ridership %>%
group_by(means_of_transportation) %>%
summarise(
"avg_prop_compared_to_pre_pandemic" = mean(prop_compared_to_pre_pandemic)
) %>%
arrange(means_of_transportation)
## # A tibble: 7 × 2
## means_of_transportation avg_prop_compared_to_pre_pandemic
## <chr> <dbl>
## 1 access a ride 0.874
## 2 bridges and tunnels 0.936
## 3 buses 0.551
## 4 lirr 0.602
## 5 metro north 0.521
## 6 staten island railway 0.385
## 7 subway 0.561
ggplot(data = comparable_ridership, aes(x = means_of_transportation,
y = prop_compared_to_pre_pandemic,
color = means_of_transportation)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Now, we move on to the second data set.
url2 = "https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/world_population.csv"
world_population <- read.csv(url2)
glimpse(world_population)
## Rows: 234
## Columns: 17
## $ Rank <int> 36, 138, 34, 213, 203, 42, 224, 201, 33, 1…
## $ CCA3 <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO", …
## $ Country.Territory <chr> "Afghanistan", "Albania", "Algeria", "Amer…
## $ Capital <chr> "Kabul", "Tirana", "Algiers", "Pago Pago",…
## $ Continent <chr> "Asia", "Europe", "Africa", "Oceania", "Eu…
## $ X2022.Population <int> 41128771, 2842321, 44903225, 44273, 79824,…
## $ X2020.Population <int> 38972230, 2866849, 43451666, 46189, 77700,…
## $ X2015.Population <int> 33753499, 2882481, 39543154, 51368, 71746,…
## $ X2010.Population <int> 28189672, 2913399, 35856344, 54849, 71519,…
## $ X2000.Population <int> 19542982, 3182021, 30774621, 58230, 66097,…
## $ X1990.Population <int> 10694796, 3295066, 25518074, 47818, 53569,…
## $ X1980.Population <int> 12486631, 2941651, 18739378, 32886, 35611,…
## $ X1970.Population <int> 10752971, 2324731, 13795915, 27075, 19860,…
## $ Area..km.. <int> 652230, 28748, 2381741, 199, 468, 1246700,…
## $ Density..per.km.. <dbl> 63.0587, 98.8702, 18.8531, 222.4774, 170.5…
## $ Growth.Rate <dbl> 1.0257, 0.9957, 1.0164, 0.9831, 1.0100, 1.…
## $ World.Population.Percentage <dbl> 0.52, 0.04, 0.56, 0.00, 0.00, 0.45, 0.00, …
The suggested analysis for this data set revolves around population growth across continents and the correlation between a country’s area and its population density over time. All that the first analysis needs is a view of population growth through the decades for each continent. The second analysis is a little trickier, however. The initial discussion post asked for an analysis taking a country’s area (a numerical variable), a country’s population density (another numerical variable), and time (a numerical or categorical variable depending on context) into account. In order to simplify the analysis, we go ahead and create a variable for the change in density, that way we only have to take that variable and country area into consideration.
For the first subset, we go ahead and select the columns we need,
rename the columns that refer to decades, and pivot everything into a
long table where year and population are their
own columns. We then group the subset by Continent and
year, sum the populations, and ensure year is
a numerical variable for plotting purposes. For the second subset, have
a similar approach, except when renaming we have to use column indices
since the area and density columns tend to import with slight variations
in their titles, making them difficult to call by name when the code is
rerun. Once renaming is done, we create the change in density column
called density_prop_change and select just the area and
density change columns for analysis.
pop_by_continent <- subset(world_population, select = c(5:13)) %>%
rename("2022" = X2022.Population, "2020" = X2020.Population,
"2015" = X2015.Population, "2010" = X2010.Population,
"2000" = X2000.Population, "1990" = X1990.Population,
"1980" = X1980.Population, "1970" = X1970.Population) %>%
pivot_longer(cols = c(2:9),
names_to = "year",
values_to = "population") %>%
group_by(Continent, year) %>%
summarise(
"total_population" = sum(population)
)
pop_by_continent$year <- as.numeric(pop_by_continent$year)
glimpse(pop_by_continent)
## Rows: 48
## Columns: 3
## Groups: Continent [6]
## $ Continent <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Af…
## $ year <dbl> 1970, 1980, 1990, 2000, 2010, 2015, 2020, 2022, 1970,…
## $ total_population <dbl> 365444348, 481536377, 638150629, 818946032, 105522807…
density_by_area <- subset(world_population, select = c(13:15)) %>%
rename(pop_1970 = c(1), area_km = c(2), density_2022 = c(3)) %>%
mutate(density_prop_change = (density_2022 - (pop_1970/area_km))/(pop_1970/area_km)) %>%
select(2, 4) %>%
arrange(area_km)
glimpse(density_by_area)
## Rows: 234
## Columns: 2
## $ area_km <int> 1, 2, 6, 12, 21, 21, 26, 30, 34, 53, 54, 61, 78, 9…
## $ density_prop_change <dbl> -0.32180851, 0.50263700, 0.22349635, 0.09159883, 0…
For the first analysis, we create a line graph comparing year to total population for each continent. As can be clearly seen, Asia and Africa continue to be the continents with the most population growth over the last few decades while Europe and Oceania have basically stagnated. For the second analysis, we create a scatter plot comparing area to change in density and see that there is little to no correlation between a country’s area and its population density change through the decades. In order further investigate the lack of correlation, we run a test on the regression line and note an \(R^2\) value of 0.002, F-statistic of 0.443, and p-value of 0.5, showing absolutely no relationship or dependency between these two variables.
ggplot(data = pop_by_continent, aes(x = year,
y = total_population,
color = Continent)) +
geom_line(linewidth = 2)
ggplot(data = density_by_area, aes(x = area_km,
y = density_prop_change)) +
geom_point() +
stat_smooth(method = lm, se = FALSE)
summary(lm(density_prop_change ~ area_km, data = density_by_area))
##
## Call:
## lm(formula = density_prop_change ~ area_km, data = density_by_area)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5439 -1.5162 -0.4621 0.7561 28.7617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.917e+00 1.938e-01 9.893 <2e-16 ***
## area_km -6.966e-08 1.046e-07 -0.666 0.506
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.814 on 232 degrees of freedom
## Multiple R-squared: 0.001906, Adjusted R-squared: -0.002396
## F-statistic: 0.4431 on 1 and 232 DF, p-value: 0.5063
Finally, let us load the third data set.
url3 = "https://data.cityofnewyork.us/api/odata/v4/ykvb-493p"
hiv_aids <- read.socrata(url3)
glimpse(hiv_aids)
## Rows: 8,976
## Columns: 11
## $ year <int> 2010, 2011, 2010, 2012, 2013, 2013, 2013…
## $ borough <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ neighborhood <chr> "Greenpoint", "Stapleton - St. George", …
## $ sex <chr> "Male", "Female", "Male", "Female", "Mal…
## $ race <chr> "Black", "Native American", "All", "Unkn…
## $ hiv_diagnoses_num <chr> "6", "0", "23", "0", "0", "54", "0", "0"…
## $ hiv_diagnoses_num_per_100k <chr> "330.4", "0", "25.4", "0", "0", "56.5", …
## $ hiv_aids_concurrent_num <chr> "0", "0", "5", "0", "0", "8", "0", "0", …
## $ concurrent_hiv_aids_among_all <chr> "0", "0", "21.7", "0", "0", "14.8", "0",…
## $ aids_diagnoses_num <chr> "5", "0", "14", "0", "0", "33", "0", "0"…
## $ aids_diagnoses_num_per_100k <chr> "275.3", "0", "15.4", "0", "0", "34.5", …
The suggested analyses for this data set revolve around diagnosis
rates by gender and race, changes in diagnoses over the years across
neighborhoods, and a comparison between new HIV and HIV/AIDS diagnoses.
Out of the three suggested analyses, the first and third one seem more
plausible since the borough and neighborhood
fields have a lot of empty observations, hindering the second suggested
analysis. Thus, we tailor our tidying towards the first and third
suggested analyses for this data set.
We create the subset using the year field along with
sex, race, and the different diagnosis counts.
Right away we rename the diagnosis count columns to prepare them for
pivoting. We pivot these columns in order to create one for
diagnosis and another for count. Some
observations for count have an invalid “*” value and have
to be filtered out. Once these invalid values are removed we can make
the count field numeric and optimal for analysis. After all
is said and done, we group the subset by year,
sex, race, diagnosis and sum the
counts in order to avoid unnecessary repetition in the
diagnosis column.
hiv_aids_sex_race <- subset(hiv_aids, select = c(1, 4:6, 8, 10)) %>%
rename(hiv = hiv_diagnoses_num, "hiv and aids" = hiv_aids_concurrent_num,
aids = aids_diagnoses_num) %>%
pivot_longer(cols = c(4:6),
names_to = "diagnosis",
values_to = "count") %>%
filter(count != "*")
hiv_aids_sex_race$count <- as.numeric(hiv_aids_sex_race$count)
hiv_aids_sex_race <- hiv_aids_sex_race %>%
group_by(year, sex, race, diagnosis) %>%
summarise(
count = sum(count)
) %>%
arrange(year, sex, race, diagnosis)
Next, we create a line graph for each each relationship that was
mentioned in the first and third suggestions. The first is a plot
displaying total count of diagnoses through the years by
sex, the second shows it by race, and the
third shows HIV vs. HIV/AIDS diagnoses. As can be seen in every plot,
there seemed to be an uptick in diagnoses in the years 2016 and 2020.
The first plot shows males to be those primarily affected by the
disease, the second plot shows that black New Yorkers hold a higher
proportion of diagnoses, and that concurrent HIV/AID diagnoses still
hold a noticeable portion of first time diagnoses. One thing to note
about the data as a whole is that the years 2014 and 2015 are not
accounted for, thus explaining why in every plot there is a perfectly
straight line connecting 2013 to 2016 or at times no line at all.
ggplot(data = hiv_aids_sex_race %>%
group_by(year, sex) %>%
summarise(
count = sum(count)
), aes(x = year, y = count, color = sex)) +
geom_line(linewidth = 2)
ggplot(data = hiv_aids_sex_race %>%
group_by(year, race) %>%
summarise(
count = sum(count)
), aes(x = year, y = count, color = race)) +
geom_line(linewidth = 2)
ggplot(data = hiv_aids_sex_race %>%
filter(diagnosis != "aids") %>%
group_by(year, diagnosis) %>%
summarise(
count = sum(count)
), aes(x = year, y = count, color = diagnosis)) +
geom_line(linewidth = 2)
The different data sets and suggested data analyses required different approaches towards tidying and preparation. The first data set relied mostly on pivoting the different modes of transportation and helped us see just how badly the pandemic has affected the MTA. The second data set needed more of a keen eye on short comings from the table itself and showed us just how population has grown and how area has no effect on population density change at all. The third data set had issues with data quality but we still had enough to analyze what we needed to, helping us take note of HIV/AIDS trends in NYC. All in all, the data sets were successfully tidied and the suggested analyses were successfully performed.