DATA 607 - Project 2

DATA 607 Project 2 Description

For three datasets, create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Definitions

Wide vs. Long data: It’s better to use the long format for storing data and wide format for data analysis at the end of a workflow (after data dimensionality is reduced).
Wide dataset: Classic Excel way, with lots of columns. Subject’s repeated responses in a single row, and each response in a separate column. Value do not repeat in the first column.
Long dataset: Values in the first column repeat. Advantage is you don’t waste space if some data are missing and do not have to modify the table structure if a new feature is added.

Chosen Datasets (linked to source URL)

Reading Necessary Packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)
library(dplyr)
library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

Data Set 1 Reading and Tidying

Downloading raw data CSV from NY government website

nygas <- read.csv("https://data.ny.gov/api/views/nqur-w4p7/rows.csv?accessType=DOWNLOAD&sorting=true")

Description of data from source: “Gasoline retail prices weekly average by region dataset provides the weekly average retail gasoline prices for New York State and sixteen New York metropolitan regions in U.S. dollars per gallon. Data is a weekly average from October 2007 through current. Some metropolitan regions begin in 2017.”

Understanding initial data format. There are 18 columns and 802 rows. Data are in a wide format. The first column is Date, which is the Monday of each week dating from October 2007 to present (801 observations). The second column is the average gas price for New York state combined. The other 16 columns are the average gas price per week for 16 New York state metropolitan regions.

Date column is in character format, which I’ll need to eventually change to date format.

head(nygas) #df preview

##         Date New.York.State.Average....gal. Albany.Average....gal.
## 1 02/27/2023                           3.47                   3.51
## 2 02/20/2023                           3.51                   3.54
## 3 02/13/2023                           3.54                   3.85
## 4 02/06/2023                           3.57                   3.61
## 5 01/30/2023                           3.56                   3.59
## 6 01/23/2023                           3.48                   3.48
##   Batavia.Average....gal. Binghamton.Average....gal. Buffalo.Average....gal.
## 1                    3.38                       3.41                    3.39
## 2                    3.38                       3.45                    3.41
## 3                    3.72                       3.77                    3.75
## 4                    3.44                       3.52                    3.45
## 5                    3.43                       3.50                    3.45
## 6                    3.37                       3.45                    3.41
##   Dutchess.Average....gal. Elmira.Average....gal. Glens.Falls.Average....gal.
## 1                     3.56                   3.40                        3.62
## 2                     3.59                   3.43                        3.63
## 3                     3.90                   3.77                        3.88
## 4                     3.63                   3.51                        3.69
## 5                     3.62                   3.47                        3.67
## 6                     3.54                   3.37                        3.58
##   Ithaca.Average....gal. Kingston.Average....gal. Nassau.Average....gal.
## 1                   3.47                     3.38                   3.36
## 2                   3.50                     3.40                   3.42
## 3                   3.82                     3.74                   3.73
## 4                   3.53                     3.49                   3.58
## 5                   3.51                     3.47                   3.49
## 6                   3.49                     3.38                   3.37
##   New.York.City.Average....gal. Rochester.Average....gal.
## 1                          3.48                      3.45
## 2                          3.54                      3.47
## 3                          3.82                      3.78
## 4                          3.64                      3.52
## 5                          3.62                      3.51
## 6                          3.53                      3.46
##   Syracuse.Average....gal. Utica.Average....gal. Watertown.Average....gal.
## 1                     3.45                  3.50                      3.48
## 2                     3.47                  3.53                      3.50
## 3                     3.81                  3.87                      3.86
## 4                     3.52                  3.62                      3.56
## 5                     3.50                  3.62                      3.54
## 6                     3.44                  3.59                      3.47
##   White.Plains.Average....gal.
## 1                         3.56
## 2                         3.61
## 3                         3.94
## 4                         3.66
## 5                         3.66
## 6                         3.58

nrow(nygas) #number of rows = 801

## [1] 801

ncol(nygas) #number of columns = 18

## [1] 18

names(nygas) #column names

##  [1] "Date"                           "New.York.State.Average....gal."
##  [3] "Albany.Average....gal."         "Batavia.Average....gal."       
##  [5] "Binghamton.Average....gal."     "Buffalo.Average....gal."       
##  [7] "Dutchess.Average....gal."       "Elmira.Average....gal."        
##  [9] "Glens.Falls.Average....gal."    "Ithaca.Average....gal."        
## [11] "Kingston.Average....gal."       "Nassau.Average....gal."        
## [13] "New.York.City.Average....gal."  "Rochester.Average....gal."     
## [15] "Syracuse.Average....gal."       "Utica.Average....gal."         
## [17] "Watertown.Average....gal."      "White.Plains.Average....gal."

str(nygas) #data types

## 'data.frame':    801 obs. of  18 variables:
##  $ Date                          : chr  "02/27/2023" "02/20/2023" "02/13/2023" "02/06/2023" ...
##  $ New.York.State.Average....gal.: num  3.47 3.51 3.54 3.57 3.56 3.48 3.44 3.45 3.41 3.45 ...
##  $ Albany.Average....gal.        : num  3.51 3.54 3.85 3.61 3.59 3.48 3.41 3.44 3.4 3.44 ...
##  $ Batavia.Average....gal.       : num  3.38 3.38 3.72 3.44 3.43 3.37 3.32 3.35 3.32 3.33 ...
##  $ Binghamton.Average....gal.    : num  3.41 3.45 3.77 3.52 3.5 3.45 3.44 3.44 3.44 3.47 ...
##  $ Buffalo.Average....gal.       : num  3.39 3.41 3.75 3.45 3.45 3.41 3.4 3.44 3.46 3.49 ...
##  $ Dutchess.Average....gal.      : num  3.56 3.59 3.9 3.63 3.62 3.54 3.51 3.51 3.48 3.51 ...
##  $ Elmira.Average....gal.        : num  3.4 3.43 3.77 3.51 3.47 3.37 3.29 3.3 3.33 3.38 ...
##  $ Glens.Falls.Average....gal.   : num  3.62 3.63 3.88 3.69 3.67 3.58 3.51 3.53 3.47 3.49 ...
##  $ Ithaca.Average....gal.        : num  3.47 3.5 3.82 3.53 3.51 3.49 3.53 3.57 3.57 3.59 ...
##  $ Kingston.Average....gal.      : num  3.38 3.4 3.74 3.49 3.47 3.38 3.31 3.3 3.27 3.34 ...
##  $ Nassau.Average....gal.        : num  3.36 3.42 3.73 3.58 3.49 3.37 3.3 3.29 3.2 3.25 ...
##  $ New.York.City.Average....gal. : num  3.48 3.54 3.82 3.64 3.62 3.53 3.5 3.53 3.45 3.5 ...
##  $ Rochester.Average....gal.     : num  3.45 3.47 3.78 3.52 3.51 3.46 3.46 3.49 3.49 3.53 ...
##  $ Syracuse.Average....gal.      : num  3.45 3.47 3.81 3.52 3.5 3.44 3.41 3.43 3.4 3.45 ...
##  $ Utica.Average....gal.         : num  3.5 3.53 3.87 3.62 3.62 3.59 3.57 3.59 3.5 3.53 ...
##  $ Watertown.Average....gal.     : num  3.48 3.5 3.86 3.56 3.54 3.47 3.44 3.46 3.44 3.47 ...
##  $ White.Plains.Average....gal.  : num  3.56 3.61 3.94 3.66 3.66 3.58 3.54 3.54 3.49 3.53 ...

nygas %>% #show min and max of Date range
  mutate(Date = lubridate::mdy(Date)) %>% 
  summarise(min = min(Date),
            max = max(Date))

##          min        max
## 1 2007-10-29 2023-02-27

Approach: 1. Clean Column Names 2. Pivot data to long format with columns “date | region | price_per_gal” (data dictionary would define columns; not necessary to clarify “usd_gas_prices_per_gallon” for example in the column name)

Clean column names to include only metro title

names(nygas) <- str_replace_all(string = names(nygas), pattern = "\\.Average.*", replacement = "") 
names(nygas) <- str_replace_all(string = names(nygas), pattern = "\\.", replacement = " ")

Pivoting table to long format; making all column names lower case; converting date column to date format to prep for data analysis

longgas <- nygas %>% 
  pivot_longer(
    cols = colnames(nygas)[2:ncol(nygas)],
    names_to = "metro",
    values_to = "price_per_gal"
)

names(longgas) <- tolower(names(longgas))

longgas$date <- lubridate::mdy(longgas$date)

Showing new format of df

head(longgas)

## # A tibble: 6 × 3
##   date       metro          price_per_gal
##   <date>     <chr>                  <dbl>
## 1 2023-02-27 New York State          3.47
## 2 2023-02-27 Albany                  3.51
## 3 2023-02-27 Batavia                 3.38
## 4 2023-02-27 Binghamton              3.41
## 5 2023-02-27 Buffalo                 3.39
## 6 2023-02-27 Dutchess                3.56

Data Set 1 Analysis

Analysis instructions from data set poster Susanna Wong (cannot embed due to login permission required): - Create a line graph to view the trend or compare the prices of the gasoline for each region. We can view if the prices increases or decreases overall.

This initial line graph is quite messy due to the long list of metros.

ggplot(longgas) + 
  geom_line(aes(date, price_per_gal, color = metro))

## Warning: Removed 3792 rows containing missing values (`geom_line()`).

I’ll group the metros so trends are more obvious

#Create groups for line graph color visualization
west_ny <- c('Rochester','Elmira','Buffalo','Batavia')
east_ny <- c('Albany','Dutchess','Kingston')
central_ny <- c('Binghamton','Ithaca','Syracuse','Utica')
north_ny <- c('Glens Falls','Watertown')
nyc_metro <- c('Nassau','White Plains')
nyc <- c('New York City')
nys <- c('New York State')

longgas <- longgas %>% 
  mutate(ny_region = case_when((metro %in% west_ny) ~ 'West/Central',
                            (metro %in% east_ny) ~ 'East/North',
                            (metro %in% central_ny) ~ 'West/Central',
                            (metro %in% north_ny) ~ 'East/North',
                            (metro %in% nyc_metro) ~ 'NYC Metro',
                            (metro %in% nys) ~ 'New York State Overall',
                            (metro %in% nyc) ~ 'NYC',
                            TRUE ~ 'Default'))

Now I’ll group my dataframe by those regions for visualization. Please note that this makes the data less accurate because it’s averaging the average gas price per metro to form an imperfect proxy for “Average Price Per Gallon of Gas by New York State Region.” A better measure of the average gas price per region would include observations from every gas station in that region.

gas_region <- longgas %>%
  group_by(date,ny_region) %>%
  summarise(price_per_gal=mean(price_per_gal))

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.

ggplot(gas_region) + 
  geom_line(aes(date, price_per_gal, color = ny_region)) +
  ggtitle('Average Price Per Gallon of Gas by New York State Region - October 2007 to February 2023') + #giving plot title
  xlab('Date') + ylab('Avg Price of Gallon of Gas') #giving axes titles

## Warning: Removed 1427 rows containing missing values (`geom_line()`).

Two issues remain with the above chart. 1. It’s still difficult to see each individual line and make out the colors 2. It appears that some regions only had their data collected after a certain date, approximately around 2017.

To fix these issues, I’ll change the time horizon of the chart to only include rows where there are no blank values. And in my next chart, I’ll group by month so there’s less noise.

#Identifying max "date" value where columns have blanks, then adding a week to it for new min_date value
min_date <- max(longgas[is.na(longgas$price_per_gal),]$date) + 7 #Result is 2017-01-09

clean_gas_region <- longgas %>% filter(date >= min_date)
min(clean_gas_region$date) #checking to make sure min date is 2017-01-16

## [1] "2017-01-16"

Graphing again, now with a shorter time horizon and no NA values for price_per_gal. It is imperfect to take an average of an average when grouping by month but the goal here is to analyze general trends.

month_gas_region <- clean_gas_region %>%
  group_by(month=lubridate::floor_date(date, unit = 'month'), ny_region) %>%
  summarise(price_per_gal=mean(price_per_gal))

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

ggplot(month_gas_region) + 
  geom_line(aes(month, price_per_gal, color = ny_region)) +
  ggtitle('Average Price Per Gallon of Gas by New York State Region - Oct 2007-Feb 2023') + #giving plot title
  xlab('Date') + ylab('Avg Price of Gallon of Gas (USD)') #giving axes titles +

  scale_y_continuous(labels = scales::dollar_format(prefix="$"))

## <ScaleContinuousPosition>
##  Range:  
##  Limits:    0 --    1

Answering Susanna’s original analysis instructions: “Create a line graph to view the trend or compare the prices of the gasoline for each region. We can view if the prices increases or decreases overall.” - Overall gas prices tended to increase over time between Jan 2017 and Feb 2023. - The highest gas prices appeared in the middle of 2022 while the lowest appeared in the middle of 2021. - By region, New York State gas prices tended to move with each other. Gas prices were more divergent by region earlier in the time horizon (2017-2020) before becoming more similar around 2021.

Data Set 2 Reading and Tidying

mta <- read.csv("https://data.ny.gov/api/views/vxuj-8kew/rows.csv?accessType=DOWNLOAD&sorting=true")

Description of data from source: “The daily ridership dataset provides systemwide ridership and traffic estimates for subways, buses, Long Island Rail Road, Metro-North Railroad, Access-A-Ride, and Bridges and Tunnels, beginning 3/1/20 (4/1/20 for LIRR and Metro-North), and provides a percentage comparison against a comparable pre-pandemic date.”

Understanding initial data format. There are 15 columns and 1097 rows. Data are in a wide format. The first column is Date, which is every day from March 2020 to present (1097 observations). Then there are two columns each for 7 types of transportation options (14 columns total): one columns shows the transportation option’s ridership on that date, the other column compares ridership to pre-pandemic norms.

Date column is in character format, which I’ll need to eventually change to date format.

head(mta) #df preview

##         Date Subways..Total.Estimated.Ridership
## 1 03/02/2023                            3760245
## 2 03/01/2023                            3773706
## 3 02/28/2023                            3408751
## 4 02/27/2023                            3335204
## 5 02/26/2023                            1623164
## 6 02/25/2023                            2041901
##   Subways....of.Comparable.Pre.Pandemic.Day Buses..Total.Estimated.Ridership
## 1                                      0.68                          1390337
## 2                                      0.68                          1477880
## 3                                      0.63                          1334393
## 4                                      0.61                          1415210
## 5                                      0.74                           677706
## 6                                      0.71                           802244
##   Buses....of.Comparable.Pre.Pandemic.Day LIRR..Total.Estimated.Ridership
## 1                                    0.62                          195243
## 2                                    0.66                          192142
## 3                                    0.62                          177532
## 4                                    0.66                          180277
## 5                                    0.69                           74226
## 6                                    0.63                           86403
##   LIRR....of.Comparable.Pre.Pandemic.Day Metro.North..Total.Estimated.Ridership
## 1                                   0.62                                 176678
## 2                                   0.61                                 176661
## 3                                   0.59                                 139326
## 4                                   0.60                                 160346
## 5                                   0.95                                  67702
## 6                                   0.92                                  81742
##   Metro.North....of.Comparable.Pre.Pandemic.Day
## 1                                          0.64
## 2                                          0.64
## 3                                          0.52
## 4                                          0.60
## 5                                          0.74
## 6                                          0.62
##   Access.A.Ride..Total.Scheduled.Trips
## 1                                28978
## 2                                29449
## 3                                26116
## 4                                26730
## 5                                15891
## 6                                15461
##   Access.A.Ride....of.Comparable.Pre.Pandemic.Day
## 1                                            0.97
## 2                                            0.99
## 3                                            0.89
## 4                                            0.91
## 5                                            0.94
## 6                                            0.95
##   Bridges.and.Tunnels..Total.Traffic
## 1                             934427
## 2                             901530
## 3                             751643
## 4                             854625
## 5                             807475
## 6                             831613
##   Bridges.and.Tunnels....of.Comparable.Pre.Pandemic.Day
## 1                                                  1.01
## 2                                                  0.98
## 3                                                  0.85
## 4                                                  0.97
## 5                                                  1.07
## 6                                                  1.00
##   Staten.Island.Railway..Total.Estimated.Ridership
## 1                                             7428
## 2                                             7401
## 3                                             6689
## 4                                             6751
## 5                                             1095
## 6                                             1396
##   Staten.Island.Railway....of.Comparable.Pre.Pandemic.Day
## 1                                                    0.47
## 2                                                    0.46
## 3                                                    0.41
## 4                                                    0.42
## 5                                                    0.39
## 6                                                    0.33

nrow(mta) #1097 rows

## [1] 1097

ncol(mta) #15 columns

## [1] 15

names(mta) #column names

##  [1] "Date"                                                   
##  [2] "Subways..Total.Estimated.Ridership"                     
##  [3] "Subways....of.Comparable.Pre.Pandemic.Day"              
##  [4] "Buses..Total.Estimated.Ridership"                       
##  [5] "Buses....of.Comparable.Pre.Pandemic.Day"                
##  [6] "LIRR..Total.Estimated.Ridership"                        
##  [7] "LIRR....of.Comparable.Pre.Pandemic.Day"                 
##  [8] "Metro.North..Total.Estimated.Ridership"                 
##  [9] "Metro.North....of.Comparable.Pre.Pandemic.Day"          
## [10] "Access.A.Ride..Total.Scheduled.Trips"                   
## [11] "Access.A.Ride....of.Comparable.Pre.Pandemic.Day"        
## [12] "Bridges.and.Tunnels..Total.Traffic"                     
## [13] "Bridges.and.Tunnels....of.Comparable.Pre.Pandemic.Day"  
## [14] "Staten.Island.Railway..Total.Estimated.Ridership"       
## [15] "Staten.Island.Railway....of.Comparable.Pre.Pandemic.Day"

str(mta) #data types

## 'data.frame':    1097 obs. of  15 variables:
##  $ Date                                                   : chr  "03/02/2023" "03/01/2023" "02/28/2023" "02/27/2023" ...
##  $ Subways..Total.Estimated.Ridership                     : int  3760245 3773706 3408751 3335204 1623164 2041901 3244250 3498242 3454275 3325483 ...
##  $ Subways....of.Comparable.Pre.Pandemic.Day              : num  0.68 0.68 0.63 0.61 0.74 0.71 0.6 0.64 0.64 0.61 ...
##  $ Buses..Total.Estimated.Ridership                       : int  1390337 1477880 1334393 1415210 677706 802244 1244173 1326674 1304613 1272859 ...
##  $ Buses....of.Comparable.Pre.Pandemic.Day                : num  0.62 0.66 0.62 0.66 0.69 0.63 0.58 0.62 0.61 0.59 ...
##  $ LIRR..Total.Estimated.Ridership                        : int  195243 192142 177532 180277 74226 86403 179793 188438 193753 194967 ...
##  $ LIRR....of.Comparable.Pre.Pandemic.Day                 : num  0.62 0.61 0.59 0.6 0.95 0.92 0.59 0.62 0.64 0.64 ...
##  $ Metro.North..Total.Estimated.Ridership                 : int  176678 176661 139326 160346 67702 81742 153058 167193 171187 174283 ...
##  $ Metro.North....of.Comparable.Pre.Pandemic.Day          : num  0.64 0.64 0.52 0.6 0.74 0.62 0.57 0.62 0.64 0.65 ...
##  $ Access.A.Ride..Total.Scheduled.Trips                   : int  28978 29449 26116 26730 15891 15461 26077 27336 27895 27224 ...
##  $ Access.A.Ride....of.Comparable.Pre.Pandemic.Day        : num  0.97 0.99 0.89 0.91 0.94 0.95 0.89 0.93 0.95 0.93 ...
##  $ Bridges.and.Tunnels..Total.Traffic                     : int  934427 901530 751643 854625 807475 831613 928057 906322 869960 864047 ...
##  $ Bridges.and.Tunnels....of.Comparable.Pre.Pandemic.Day  : num  1.01 0.98 0.85 0.97 1.07 1 1.05 1.03 0.98 0.98 ...
##  $ Staten.Island.Railway..Total.Estimated.Ridership       : int  7428 7401 6689 6751 1095 1396 5550 6695 6538 6525 ...
##  $ Staten.Island.Railway....of.Comparable.Pre.Pandemic.Day: num  0.47 0.46 0.41 0.42 0.39 0.33 0.34 0.41 0.4 0.4 ...

mta %>% #show min and max of Date range
  mutate(Date = lubridate::mdy(Date)) %>% 
  summarise(min = min(Date),
            max = max(Date))

##          min        max
## 1 2020-03-01 2023-03-02

Approach: 1. Clean column names to make calculations / graphing easier (I can always label the outputs of the calculations/graphs with greater specificity) 2. Pivot data to long format with columns “date | transport | users | users_v_prepandemic” (data dictionary would define columns; not necessary to clarify “estimated_riders” instead of simply “users” in the column name). I’m using “users” as opposed to “riders” because some transporation options are quantified by “traffic.” 3. Label dates as weekdays or weekends 4. Ensure date is in date format

Clean column names to only transit method and “Ridership/Trips/Traffic” or “Day”, then transforming those columns to be “Transit Method | Riders” or “Transit Method | Users Ratio to Prepandemic.” Making column titles lowercase. Adding column for total_riders on date.

names(mta) <- str_replace_all(string = names(mta), pattern = "\\.{2}(.*?\\w)*\\.", replacement = ".") 
names(mta) <- str_replace_all(string = names(mta), pattern = "\\.", replacement = "_") 

names(mta) <- str_replace_all(string = names(mta), pattern = "(Ridership|Traffic|Trips)$", replacement = "Users") 
names(mta) <- str_replace_all(string = names(mta), pattern = "Day$", replacement = "Users_Ratio_to_Prepandemic") 

names(mta) <- tolower(names(mta))

mta <- mta %>% #sum total riders
  rowwise() %>%
  mutate(total_users = sum(across(ends_with("users")), na.rm = TRUE))

mta <- mta %>% #avg pandemic ratio
  mutate(daily_avg_user_ratio = mean(c_across(ends_with("prepandemic")), na.rm = TRUE))

Pivoting table to long format; changing NAs in users column to 0s; checking to make sure long format totals align with wide format totals; converting date column to date format to prep for data analysis

mta_long <- mta %>% 
  pivot_longer(cols = ends_with("users"), names_to = "transit_type", names_pattern = "(.*)users$", values_to = "users_value") %>%
  pivot_longer(cols = ends_with("prepandemic"), names_to = "prepandemic_ratio", names_pattern = "(.*)prepandemic$", values_to = "user_ratio") %>%
  mutate(transit_type = gsub("_$", "", transit_type),
  prepandemic_ratio = gsub("_$", "", prepandemic_ratio)) 

mta_long <- mta_long[mapply(grepl, mta_long$transit_type, mta_long$prepandemic_ratio),]

mta_long <- mta_long %>%
  mutate(users_value = coalesce(users_value, 0)) %>%
  select(-c(daily_avg_user_ratio,prepandemic_ratio))

#checking to make sure my "users" value after going from wide to long format is the same
rider_check <- mta_long %>%
  group_by(transit_type) %>%
  summarise(total_riders=sum(users_value))

sum(mta$total_users) == sum(mta_long$users_value) #Yields True

## [1] TRUE

round(mean(mta$daily_avg_user_ratio),2) == round(mean(mta_long$user_ratio, na.rm=TRUE),2) #Yields True. Need to round due to small differences in rounded calculations.

## [1] TRUE

mta$date <- lubridate::mdy(mta$date)
mta_long$date <- lubridate::mdy(mta_long$date)

Adding weekday and weekend column status “day_type”. Finalizing column names/order. Finalizing transit_type names (capitalizing them for easier pretty visualizations).

mta_long$day_type <- ifelse(weekdays(mta_long$date) %in% c("Saturday", "Sunday"), "Weekend", "Weekday")

colnames(mta_long) <- c('date','transit_type','users_value','users_ratio_to_prepandemic','day_type')

mta_long$transit_type <- gsub('buses', 'Buses', mta_long$transit_type)
mta_long$transit_type <- gsub('lirr', 'LIRR', mta_long$transit_type)
mta_long$transit_type <- gsub('access_a_ride', 'Access-A-Ride', mta_long$transit_type)
mta_long$transit_type <- gsub('bridges_and_tunnels', 'Bridges and Tunnels', mta_long$transit_type)
mta_long$transit_type <- gsub('staten_island_railway', 'Staten Island Railway', mta_long$transit_type)
mta_long$transit_type <- gsub('metro_north', 'Metro North', mta_long$transit_type)
mta_long$transit_type <- gsub('subways', 'Subways', mta_long$transit_type)

Displaying finalized updated long format df

head(mta_long)

## # A tibble: 6 × 5
##   date       transit_type        users_value users_ratio_to_prepandemic day_type
##   <date>     <chr>                     <dbl>                      <dbl> <chr>   
## 1 2023-03-02 Subways                 3760245                       0.68 Weekday 
## 2 2023-03-02 Buses                   1390337                       0.62 Weekday 
## 3 2023-03-02 LIRR                     195243                       0.62 Weekday 
## 4 2023-03-02 Metro North              176678                       0.64 Weekday 
## 5 2023-03-02 Access-A-Ride             28978                       0.97 Weekday 
## 6 2023-03-02 Bridges and Tunnels      934427                       1.01 Weekday

Data Set 2 Analysis

Analysis instructions from data set post respondent Mohamed Hassan (cannot embed due to login permission required): - During the pandemic, ridership was significantly low, and with some exceptions, hasn’t reverted back to pre-pandemic levels. I’m almost positive that the MTA conducts regular analyses comparing ridership prepandemic and during the pandemic. This is an important dataset that is often used to justify expanding or cutting service.

As with the gas prices graph, this initial line graph is quite messy due to the long list of transit_types, the frequent observation cadence (daily), and the different y scales. This is even before I add useful parameter: Weekday/Weekend status. For this analysis, I’ll instead choose to group all transit_types into two categories: Driving v. Public Transit, to highlight how habits may have changed over the course of the pandemic. If needed, I’ll use a log scale to compare users_value between Driving v. Public Transit. I’ll also highlight 2020 weekdays in particular so the effect of social distancing is focused on.

ggplot(mta_long) + 
  geom_line(aes(date, users_value, color = transit_type))

I’ll group the transit_types so trends are more obvious

#Create groups for line graph color visualization
public_transit <- c('Access-A-Ride','Buses','LIRR','Metro North','Staten Island Railway','Subways')
driving <- c('Bridges and Tunnels')

mta_long <- mta_long %>% 
  mutate(transit_group = case_when((transit_type %in% public_transit) ~ 'Public Transit',
                            (transit_type %in% driving) ~ 'Driving',
                            TRUE ~ 'Default'))

Again, please note that grouping at a higher aggregate level for “averages” data like “average travelers ratio to prepandemic” makes it less accurate. Taking the average of an average is not mathematically sound but it can be directionally useful.

weekday_commute_2020 <- mta_long %>%
  group_by(week=lubridate::floor_date(date, unit = 'week'), transit_type, transit_group, day_type) %>%
  summarise(avg_travelers_ratio_to_prepandemic=mean(users_ratio_to_prepandemic),
            weekly_travelers=sum(users_value)) %>%
  filter(week < '2021-01-01' & day_type == 'Weekday')

## `summarise()` has grouped output by 'week', 'transit_type', 'transit_group'.
## You can override using the `.groups` argument.

Graphing again, only looking at weekday commute ratio in 2020 by Driving v. Public Transit. I recognize bar charts are not as good as line charts for timeseries data, however I’ll make a bar chart here for variety and because it does make volumes easy to compare via bar heights.

ggplot(weekday_commute_2020,aes(x = week,y = avg_travelers_ratio_to_prepandemic)) + 
    geom_bar(aes(fill = transit_group),stat = "identity",position = "dodge") +
    scale_y_continuous(labels = scales::percent) +
    ggtitle('2020 Average Travel Ratio % Per Transit Group vs. Prepandemic Volume') +
    xlab('Date') + ylab('Avg Travel Ratio % vs. Prepandemic')

Repeating analysis instructions from Mohamed Cruz: “During the pandemic, ridership was significantly low, and with some exceptions, hasn’t reverted back to pre-pandemic levels. I’m almost positive that the MTA conducts regular analyses comparing ridership prepandemic and during the pandemic. This is an important dataset that is often used to justify expanding or cutting service.” - This 2020 weekday subset of the dataset shouldn’t be used to justify expanding or cutting service as it doesn’t account for long term trends in terms of WFH, population growth, and safety considerations. - However this subset might be slightly helpful in estimating commuter volumes should another pandemic occur which requires social distancing. Public transit and driving use rates dipped at similarly steep rates immediately following the lockdowns, however driving was relatively quick to recover. Driving was back to ~80% of prepandemic levels on weekdays in 4.5 months (mid-July) while public transit’s peak was less than 80%, achieved in December (8.5 months after lockdown). - Based on the above, it would make sense to invest in driving infrastructure immediately following a pandemic, perhaps delaying improvements for public transit until after.

Data Set 3 Reading and Tidying:

url_athletics <- "ope.ed.gov/athletics/api/dataFiles/file?fileName=EADA_2020-2021.zip"
download.file(url_athletics,"Schools.xlsx.zip")
unzip("Schools.xlsx.zip")
athletics <- readxl::read_excel("Schools.xlsx")

Description of data: Data from universities and colleges related to gender equity on their sports programs. Data include spending per institution on each team, listing coaches’ salaries, recruiting expenses, participation volume, and scholarship amounts. The goal of this information is to hold collegiate athletic departments accountable for supporting men’s and women’s sports equally: In salary, scholarships, expenditures, and so forth.

Understanding initial data format. There are 128 columns and 14382 rows. Data are generally in a wide format.

The first 15 columns are information about the college institution (e.g. name, location). The rest are sports-specific information about participation rates, expenditures, revenue, and so forth, with one column for each field.

head(athletics) #df preview

## # A tibble: 6 × 128
##   unitid insti…¹ addr1…² addr2…³ city_…⁴ state…⁵ zip_t…⁶ Class…⁷ class…⁸ Class…⁹
##    <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr>   <chr>  
## 1 100654 Alabam… 4900 M… <NA>    Normal  AL      35762         2 NCAA D… <NA>   
## 2 100654 Alabam… 4900 M… <NA>    Normal  AL      35762         2 NCAA D… <NA>   
## 3 100654 Alabam… 4900 M… <NA>    Normal  AL      35762         2 NCAA D… <NA>   
## 4 100654 Alabam… 4900 M… <NA>    Normal  AL      35762         2 NCAA D… <NA>   
## 5 100654 Alabam… 4900 M… <NA>    Normal  AL      35762         2 NCAA D… <NA>   
## 6 100654 Alabam… 4900 M… <NA>    Normal  AL      35762         2 NCAA D… <NA>   
## # … with 118 more variables: EFMaleCount <dbl>, EFFemaleCount <dbl>,
## #   EFTotalCount <dbl>, sector_cd <dbl>, sector_name <chr>, SPORTSCODE <dbl>,
## #   PARTIC_MEN <dbl>, PARTIC_WOMEN <dbl>, PARTIC_COED_MEN <dbl>,
## #   PARTIC_COED_WOMEN <dbl>, SUM_PARTIC_MEN <dbl>, SUM_PARTIC_WOMEN <dbl>,
## #   OPEXPPERPART_MEN <dbl>, OPEXPPERTEAM_MEN <dbl>, OPEXPPERPART_WOMEN <dbl>,
## #   OPEXPPERTEAM_WOMEN <dbl>, TOTAL_OPEXP_MENWOMEN <dbl>,
## #   OPEXPPERPART_COED_MEN <dbl>, OPEXPPERTEAM_COED_MEN <dbl>, …

nrow(athletics) #number of rows = 2074

## [1] 14381

ncol(athletics) #number of columns = 167

## [1] 128

names(athletics) #column names

##   [1] "unitid"                 "institution_name"       "addr1_txt"             
##   [4] "addr2_txt"              "city_txt"               "state_cd"              
##   [7] "zip_text"               "ClassificationCode"     "classification_name"   
##  [10] "ClassificationOther"    "EFMaleCount"            "EFFemaleCount"         
##  [13] "EFTotalCount"           "sector_cd"              "sector_name"           
##  [16] "SPORTSCODE"             "PARTIC_MEN"             "PARTIC_WOMEN"          
##  [19] "PARTIC_COED_MEN"        "PARTIC_COED_WOMEN"      "SUM_PARTIC_MEN"        
##  [22] "SUM_PARTIC_WOMEN"       "OPEXPPERPART_MEN"       "OPEXPPERTEAM_MEN"      
##  [25] "OPEXPPERPART_WOMEN"     "OPEXPPERTEAM_WOMEN"     "TOTAL_OPEXP_MENWOMEN"  
##  [28] "OPEXPPERPART_COED_MEN"  "OPEXPPERTEAM_COED_MEN"  "OPEXP_PART_COED_WOMEN" 
##  [31] "OPEXP_TEAM_COED_WOMEN"  "TOTAL_OPEXP_COEDTEAM"   "SUM_OPEXPPERPART_MEN"  
##  [34] "SUM_OPEXPPERTEAM_MEN"   "SUM_OPEXPPERPART_WOMEN" "SUM_OPEXPPERTEAM_WOMEN"
##  [37] "TOTAL_OPEXP_INCLCOED"   "MEN_FTHEADCOACH_MALE"   "MEN_PTHEADCOACH_MALE"  
##  [40] "MEN_FTUNIVEMPLOY_MALE"  "MEN_PTUNIVEMPLOY_MALE"  "MEN_FTHEADCOACH_FEM"   
##  [43] "MEN_PTHEADCOACH_FEM"    "MEN_FTUNIVEMPLOY_FEM"   "MEN_PTUNIVEMPLOY_FEM"  
##  [46] "MEN_TOTAL_HEADCOACH"    "WOMEN_FTHDCOACH_MALE"   "WOMEN_PTHDCOACH_MALE"  
##  [49] "WOMEN_FTUNIVEMP_MALE"   "WOMEN_PTUNIVEMP_MALE"   "WOMEN_FTHDCOACH_FEM"   
##  [52] "WOMEN_PTHDCOACH_FEM"    "WOMEN_FTUNIVEMP_FEM"    "WOMEN_PTUNIVEMP_FEM"   
##  [55] "WOMEN_TOTAL_HDCOACH"    "COED_FTHDCOACH_MALE"    "COED_PTHDCOACH_MALE"   
##  [58] "COED_FTUNIVEMP_MALE"    "COED_PTUNIVEMP_MALE"    "COED_FTHDCOACH_FEM"    
##  [61] "COED_PTHDCOACH_FEM"     "COED_FTUNIVEMP_FEM"     "COED_PTUNIVEMP_FEM"    
##  [64] "COED_TOTAL_HDCOACH"     "SUM_FTHDCOACH_MALE"     "SUM_PTHDCOACH_MALE"    
##  [67] "SUM_FTUNIVEMP_MALE"     "SUM_PTUNIVEMP_MALE"     "SUM_FTHDCOACH_FEM"     
##  [70] "SUM_PTHDCOACH_FEM"      "SUM_FTUNIVEMP_FEM"      "SUM_PTUNIVEMP_FEM"     
##  [73] "SUM_TOTAL_HDCOACH"      "MEN_FTASCOACH_MALE"     "MEN_PTASCOACH_MALE"    
##  [76] "MEN_FTACUNIVEMP_MALE"   "MEN_PTACUNIVEMP_MALE"   "MEN_FTASSTCOACH_FEM"   
##  [79] "MEN_PTASSTCOACH_FEM"    "MEN_FTACUNIVEMP_FEM"    "MEN_PTACUNIVEMP_FEM"   
##  [82] "MEN_TOTAL_ASSTCOACH"    "WOMEN_FTASCOACH_MALE"   "WOMEN_PTASCOACH_MALE"  
##  [85] "WOMEN_FTACUNEMP_MALE"   "WOMEN_PTACUNEMP_MALE"   "WOMEN_FTASTCOACH_FEM"  
##  [88] "WOMEN_PTASTCOACH_FEM"   "WOMN_FTACUNIVEMP_FEM"   "WOMN_PTACUNIVEMP_FEM"  
##  [91] "WOMEN_TOTAL_ASTCOACH"   "COED_FTASCOACH_MALE"    "COED_PTASCOACH_MALE"   
##  [94] "COED_FTACUNEMP_MALE"    "COED_PTACUNEMP_MALE"    "COED_FTASTCOACH_FEM"   
##  [97] "COED_PTASTCOACH_FEM"    "COED_FTACUNIVEMP_FEM"   "COED_PTACUNIVEMP_FEM"  
## [100] "COED_TOTAL_ASTCOACH"    "SUM_FTASCOACH_MALE"     "SUM_PTASCOACH_MALE"    
## [103] "SUM_FTACUNIVEMP_MALE"   "SUM_PTACUNIVEMP_MALE"   "SUM_FTASCOACH_FEM"     
## [106] "SUM_PTASCOACH_FEM"      "SUM_FTACUNIVEMP_FEM"    "SUM_PTACUNIVEMP_FEM"   
## [109] "SUM_TOTAL_ASSTCOACH"    "REV_MEN"                "REV_WOMEN"             
## [112] "TOTAL_REV_MENWOMEN"     "REV_COED_MEN"           "REV_COED_WOMEN"        
## [115] "TOTAL_REV_COED"         "REVENUE_MENALL"         "REVENUE_WOMENALL"      
## [118] "TOTAL_REVENUE_ALL"      "EXP_MEN"                "EXP_WOMEN"             
## [121] "TOTAL_EXP_MENWOMEN"     "EXP_COED_MEN"           "EXP_COED_WOMEN"        
## [124] "TOTAL_EXP_COED"         "EXPENSE_MENALL"         "EXPENSE_WOMENALL"      
## [127] "TOTAL_EXPENSE_ALL"      "Sports"

str(athletics) #data types

## tibble [14,381 × 128] (S3: tbl_df/tbl/data.frame)
##  $ unitid                : num [1:14381] 100654 100654 100654 100654 100654 ...
##  $ institution_name      : chr [1:14381] "Alabama A & M University" "Alabama A & M University" "Alabama A & M University" "Alabama A & M University" ...
##  $ addr1_txt             : chr [1:14381] "4900 Meridian Street" "4900 Meridian Street" "4900 Meridian Street" "4900 Meridian Street" ...
##  $ addr2_txt             : chr [1:14381] NA NA NA NA ...
##  $ city_txt              : chr [1:14381] "Normal" "Normal" "Normal" "Normal" ...
##  $ state_cd              : chr [1:14381] "AL" "AL" "AL" "AL" ...
##  $ zip_text              : chr [1:14381] "35762" "35762" "35762" "35762" ...
##  $ ClassificationCode    : num [1:14381] 2 2 2 2 2 2 2 2 2 2 ...
##  $ classification_name   : chr [1:14381] "NCAA Division I-FCS" "NCAA Division I-FCS" "NCAA Division I-FCS" "NCAA Division I-FCS" ...
##  $ ClassificationOther   : chr [1:14381] NA NA NA NA ...
##  $ EFMaleCount           : num [1:14381] 1849 1849 1849 1849 1849 ...
##  $ EFFemaleCount         : num [1:14381] 2814 2814 2814 2814 2814 ...
##  $ EFTotalCount          : num [1:14381] 4663 4663 4663 4663 4663 ...
##  $ sector_cd             : num [1:14381] 1 1 1 1 1 1 1 1 1 1 ...
##  $ sector_name           : chr [1:14381] "Public, 4-year or above" "Public, 4-year or above" "Public, 4-year or above" "Public, 4-year or above" ...
##  $ SPORTSCODE            : num [1:14381] 1 2 7 8 15 16 22 23 24 33 ...
##  $ PARTIC_MEN            : num [1:14381] 43 15 97 7 NA NA 8 11 11 NA ...
##  $ PARTIC_WOMEN          : num [1:14381] NA 16 NA NA 32 23 7 14 13 9 ...
##  $ PARTIC_COED_MEN       : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ PARTIC_COED_WOMEN     : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ SUM_PARTIC_MEN        : num [1:14381] 43 15 97 7 0 0 8 11 11 0 ...
##  $ SUM_PARTIC_WOMEN      : num [1:14381] 0 16 0 0 32 23 7 14 13 9 ...
##  $ OPEXPPERPART_MEN      : num [1:14381] 2925 27325 11665 2531 0 ...
##  $ OPEXPPERTEAM_MEN      : num [1:14381] 125769 409876 1131460 17720 0 ...
##  $ OPEXPPERPART_WOMEN    : num [1:14381] 0 23448 0 0 1681 ...
##  $ OPEXPPERTEAM_WOMEN    : num [1:14381] 0 375175 0 0 53783 ...
##  $ TOTAL_OPEXP_MENWOMEN  : num [1:14381] 125769 785051 1131460 17720 53783 ...
##  $ OPEXPPERPART_COED_MEN : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ OPEXPPERTEAM_COED_MEN : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ OPEXP_PART_COED_WOMEN : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ OPEXP_TEAM_COED_WOMEN : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ TOTAL_OPEXP_COEDTEAM  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ SUM_OPEXPPERPART_MEN  : num [1:14381] 2925 27325 11665 2531 0 ...
##  $ SUM_OPEXPPERTEAM_MEN  : num [1:14381] 125769 409876 1131460 17720 0 ...
##  $ SUM_OPEXPPERPART_WOMEN: num [1:14381] 0 23448 0 0 1681 ...
##  $ SUM_OPEXPPERTEAM_WOMEN: num [1:14381] 0 375175 0 0 53783 ...
##  $ TOTAL_OPEXP_INCLCOED  : num [1:14381] 125769 785051 1131460 17720 53783 ...
##  $ MEN_FTHEADCOACH_MALE  : num [1:14381] 1 1 1 0 NA NA NA NA NA NA ...
##  $ MEN_PTHEADCOACH_MALE  : num [1:14381] NA NA NA 1 NA NA 1 NA NA NA ...
##  $ MEN_FTUNIVEMPLOY_MALE : num [1:14381] 1 1 1 1 NA NA 1 NA NA NA ...
##  $ MEN_PTUNIVEMPLOY_MALE : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ MEN_FTHEADCOACH_FEM   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ MEN_PTHEADCOACH_FEM   : num [1:14381] NA NA NA NA NA NA NA 1 1 NA ...
##  $ MEN_FTUNIVEMPLOY_FEM  : num [1:14381] NA NA NA NA NA NA NA 1 1 NA ...
##  $ MEN_PTUNIVEMPLOY_FEM  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ MEN_TOTAL_HEADCOACH   : num [1:14381] 1 1 1 1 NA NA 1 1 1 NA ...
##  $ WOMEN_FTHDCOACH_MALE  : num [1:14381] NA NA NA NA 1 NA NA NA NA 1 ...
##  $ WOMEN_PTHDCOACH_MALE  : num [1:14381] NA NA NA NA NA NA 1 NA NA NA ...
##  $ WOMEN_FTUNIVEMP_MALE  : num [1:14381] NA NA NA NA 1 NA 1 NA NA 1 ...
##  $ WOMEN_PTUNIVEMP_MALE  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ WOMEN_FTHDCOACH_FEM   : num [1:14381] NA 1 NA NA NA 1 NA NA NA NA ...
##  $ WOMEN_PTHDCOACH_FEM   : num [1:14381] NA NA NA NA NA NA NA 1 1 NA ...
##  $ WOMEN_FTUNIVEMP_FEM   : num [1:14381] NA 1 NA NA NA 1 NA 1 1 NA ...
##  $ WOMEN_PTUNIVEMP_FEM   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ WOMEN_TOTAL_HDCOACH   : num [1:14381] NA 1 NA NA 1 1 1 1 1 1 ...
##  $ COED_FTHDCOACH_MALE   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_PTHDCOACH_MALE   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_FTUNIVEMP_MALE   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_PTUNIVEMP_MALE   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_FTHDCOACH_FEM    : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_PTHDCOACH_FEM    : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_FTUNIVEMP_FEM    : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_PTUNIVEMP_FEM    : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_TOTAL_HDCOACH    : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ SUM_FTHDCOACH_MALE    : num [1:14381] 1 1 1 0 1 0 0 0 0 1 ...
##  $ SUM_PTHDCOACH_MALE    : num [1:14381] 0 0 0 1 0 0 2 0 0 0 ...
##  $ SUM_FTUNIVEMP_MALE    : num [1:14381] 1 1 1 1 1 0 2 0 0 1 ...
##  $ SUM_PTUNIVEMP_MALE    : num [1:14381] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SUM_FTHDCOACH_FEM     : num [1:14381] 0 1 0 0 0 1 0 0 0 0 ...
##  $ SUM_PTHDCOACH_FEM     : num [1:14381] 0 0 0 0 0 0 0 2 2 0 ...
##  $ SUM_FTUNIVEMP_FEM     : num [1:14381] 0 1 0 0 0 1 0 2 2 0 ...
##  $ SUM_PTUNIVEMP_FEM     : num [1:14381] 0 0 0 0 0 0 0 0 0 0 ...
##  $ SUM_TOTAL_HDCOACH     : num [1:14381] 1 2 1 1 1 1 2 2 2 1 ...
##  $ MEN_FTASCOACH_MALE    : num [1:14381] 1 2 10 0 NA NA 0 NA NA NA ...
##  $ MEN_PTASCOACH_MALE    : num [1:14381] NA NA NA 0 NA NA NA NA NA NA ...
##  $ MEN_FTACUNIVEMP_MALE  : num [1:14381] 1 2 10 NA NA NA 0 NA NA NA ...
##  $ MEN_PTACUNIVEMP_MALE  : num [1:14381] NA NA NA NA NA NA 0 NA NA NA ...
##  $ MEN_FTASSTCOACH_FEM   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ MEN_PTASSTCOACH_FEM   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ MEN_FTACUNIVEMP_FEM   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ MEN_PTACUNIVEMP_FEM   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ MEN_TOTAL_ASSTCOACH   : num [1:14381] 1 2 10 0 NA NA 0 0 0 NA ...
##  $ WOMEN_FTASCOACH_MALE  : num [1:14381] NA 2 NA NA NA 1 NA 1 NA NA ...
##  $ WOMEN_PTASCOACH_MALE  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ WOMEN_FTACUNEMP_MALE  : num [1:14381] NA 2 NA NA NA 1 NA 1 NA NA ...
##  $ WOMEN_PTACUNEMP_MALE  : num [1:14381] NA NA NA NA NA 0 NA NA NA NA ...
##  $ WOMEN_FTASTCOACH_FEM  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ WOMEN_PTASTCOACH_FEM  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ WOMN_FTACUNIVEMP_FEM  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ WOMN_PTACUNIVEMP_FEM  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ WOMEN_TOTAL_ASTCOACH  : num [1:14381] NA 2 NA NA 0 1 0 1 0 0 ...
##  $ COED_FTASCOACH_MALE   : logi [1:14381] NA NA NA NA NA NA ...
##  $ COED_PTASCOACH_MALE   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_FTACUNEMP_MALE   : logi [1:14381] NA NA NA NA NA NA ...
##  $ COED_PTACUNEMP_MALE   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_FTASTCOACH_FEM   : logi [1:14381] NA NA NA NA NA NA ...
##  $ COED_PTASTCOACH_FEM   : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##  $ COED_FTACUNIVEMP_FEM  : logi [1:14381] NA NA NA NA NA NA ...
##  $ COED_PTACUNIVEMP_FEM  : num [1:14381] NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

Approach: 1. Because there are so many values in this data set, I’ll begin with feature selection: selecting only columns useful to my analysis. 2. Clean column names and adjust data to make calculations/graphing easier (I can always label the outputs of the calculations/graphs with greater specificity)

Column Approach - Narrowing to columns which will help me answer the question: Do colleges spend equally on men’s and women’s basketball programs, controlling for as many factors as possible (school size, revenue, location, etc.)? To answer this question, I’m especially interested in comparing intra-school expenditures between men’s and women’s programs (for maximum control), but I’ll also look at expenditures at a macro-level for overall trends. - I may not be able to answer this question perfectly, but this preliminary data analysis should let me know if there are data worth diving into further. - I’m curious about this question because Title IX requires proportional expenditure on men’s and women’s athletics programs based on school enrollment. But that doesn’t mean that a given sport (e.g. basketball) will need to have equal expenditures between the men’s and women’s teams. Do schools spend more on women’s teams to make up for a Football-created spending deficit on women? Or is there favoritism towards men’s teams?

For this analysis, I want to control as many factors as possible. Therefore I’ll look at: - Where school enrollments are over 1,000 (reduce outlier small schools) and with gender ratios where women and men make up no more/no less than 55%/45% of the school (control for title IX). - Private, non-profit 4 year institutions (control for similar quality of schools) - In particular, comparing expenditure on men’s and women’s basketball at the same school will be helpful, thus controlling for cost of living and school culture

Therefore my useful columns are: 1. institution_name - School Name 2. classification_name - NCAA Classification Name 3. sector_cd - Coded value for sector name – would typically get rid of immediately but will enable me to filter schools faster than writing regex for sector_name 4. sector_name - Public or private sector school, 2 or 4 year enrollment, public or non-profit 5. EFMaleCount- Enrollment of men at institution 6. EFFemaleCount - Enrollment of women at institution 7. EFTotalCount - Total enrollment at institution – this is a calculated column I’d typically get rid of, but it’s useful for creating my an initial school gender ratio control 8. EXPENSE_MENALL - Total expenses on men’s sport 9. EXPENSE_WOMENALL - total expenses on women’s sport 10. REVENUE_MENALL - Total revenue from on men’s sport 11. REVENUE_WOMENALL - Total revenue from women’s sport 12. Sports - Sport names (e.g. filter for Basketball)

Selecting columns, narrowing data per “Column Approach” criteria above.

#I'll get rid of some of these if they're not necessary for my analysis, but for now including them to help my filtering
wide_athlete <- athletics[,c('institution_name','classification_name','sector_cd','sector_name','EFMaleCount','EFFemaleCount','EFTotalCount','EXPENSE_MENALL','EXPENSE_WOMENALL','REVENUE_MENALL','REVENUE_WOMENALL','Sports')]

wide_athlete <- wide_athlete %>%
  filter(grepl("NCAA Division III",classification_name),#Only NCAA Division III
         sector_cd == 2, #Only Private nonprofit, 4 year or above schools
         Sports == 'Basketball',
         EFTotalCount > 1000) #Only schools with 1,000+ enrollment

#Creating gender ratio columns to narrow both genders to minimum of 45% of school enrollment
#Not permanent columns since I won't need them later
wide_athlete <- wide_athlete %>%
  mutate(men_pct_enrollment = EFMaleCount/EFTotalCount,
         women_pct_enrollment = EFFemaleCount/EFTotalCount) %>%
  filter(women_pct_enrollment > 0.45 & men_pct_enrollment > 0.45)

#Removing gender ratio cols now that I've used them
wide_athlete <- wide_athlete[1:(length(wide_athlete)-2)]

The next question is whether to focus on NCAA Division III schools with football or without football. Football status is an important input. Because it’s an expensive sport to operate (many players, a lot of equipment, investment for school spirit), football highly skews athletic spending per gender. At schools with football, women’s teams can spend more on their teams to make up for the deficit. Therefore If there’s an adequate sample, I will focus only on NCAA Division III sports without football.

table(wide_athlete$classification_name)

## 
##    NCAA Division III with football NCAA Division III without football 
##                                 52                                  3

Unfortunately, per the above R chunk’s output, there are only 3 schools which meet all my criteria. If I look at schools that meet every criteria, but do have football, I’ll have a sample of 53. Even though this data analysis is only preliminary, 3 is not an adequate sample size. I’ll instead look at schools with football. My goal is to figure out whether there is a story worth investigating about basketball expenditure per gender; I can create a more perfect answer and set of criteria later.

wide_athlete <- wide_athlete %>%
  filter(grepl("with football",classification_name)) #Only schools with football

Next, I’ll remove all columns which don’t contain directly necessary information for my analysis. There is an alternative approach which I could have explored: Keeping as many factors as possible, and limiting my math functions to filter for only the columns I need. This “more data is better” approach is appropriate sometimes. I leveraged this approach in the prior two analyses (NY gas and MTA, above), therefore, in this instance I will reduce my features to the necessities, which makes my data set easier to visualize.

#Reducing features to necessities. Graph titles should therefore include all the criteria for this data set: NCAA Division III institutions with football; Private nonprofit, etc. 
skinny_athlete <- wide_athlete[,c('institution_name','EXPENSE_MENALL','EXPENSE_WOMENALL','REVENUE_MENALL','REVENUE_WOMENALL')]

Before visualizing my data, note that currently mens and women’s data are on the same rows. I’ll split them out to make the data a “long” format.

long_athlete <- skinny_athlete %>% 
  pivot_longer(
    cols = starts_with("EXPENSE"),
    names_to = "gender_expense",
    values_to = "expense"
) %>%
  pivot_longer(
    cols = starts_with("REVENUE"),
    names_to = "gender_revenue",
    values_to = "revenue"
)

#I want to keep only rows where the gender-expense and gender-revenue are MENALL or both are WOMENALL. Otherwise I'll have duplicates
long_athlete <- long_athlete %>% 
  filter((gender_expense == "EXPENSE_MENALL" & gender_revenue == "REVENUE_MENALL") | (gender_expense == "EXPENSE_WOMENALL" & gender_revenue == "REVENUE_WOMENALL"))

#Having only one gender column, renaming to be more specific in terms of "basketball"

long_athlete <- long_athlete %>% 
  mutate(across('gender_expense', str_replace, 'EXPENSE_MENALL', 'MEN'),
         across('gender_expense', str_replace, 'EXPENSE_WOMENALL', 'WOMEN'))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `across("gender_expense", str_replace, "EXPENSE_MENALL",
##   "MEN")`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

long_athlete <- long_athlete[,c('institution_name','gender_expense','expense','revenue')]

names(long_athlete) <- c('institution_name','gender','basketball_expense','basketball_revenue')

Previewing dataframe

head(long_athlete)

## # A tibble: 6 × 4
##   institution_name            gender basketball_expense basketball_revenue
##   <chr>                       <chr>               <dbl>              <dbl>
## 1 Birmingham-Southern College MEN                182968             182968
## 2 Birmingham-Southern College WOMEN              146016             146016
## 3 Hendrix College             MEN                 86243              86243
## 4 Hendrix College             WOMEN               92230              92230
## 5 Benedictine University      MEN                384848             444458
## 6 Benedictine University      WOMEN              435789             442344

Data Set 3 Analysis

Analysis instructions from data set poster Ross Boehme (cannot embed due to login permission required): Do colleges spend equally on men’s and women’s basketball programs, controlling for as many factors as possible (school size, revenue, location, etc.)? - To create a controlled comparison, as a reminder, my current dataframe looks at only non-profit 4 year colleges, who participate in basketball on an NCAA Division III level, with 1,000+ enrollment, only schools with football programs, and with a minimum of 45% enrollment for each gender.

It appears that the basketball_expense and basketball_revenue values are highly similar for both genders. The online dictionary available for these data provided no further clarity about their definitions; Whether “expenses” were included in “revenues” for example, which to me would seem to be the case. Before further analysis, I’d contact the institution that collect these data to learn more.

long_athlete %>%
  group_by(gender) %>%
  summarise(sum_expense = sum(basketball_expense),
            sum_revenue = sum(basketball_revenue))

## # A tibble: 2 × 3
##   gender sum_expense sum_revenue
##   <chr>        <dbl>       <dbl>
## 1 MEN        8316045     8610938
## 2 WOMEN      7151293     7270592

My findings: expenses are higher for men’s than for women’s programs, a factor that’s not attributable to revenue. Revenues for these basketball programs per gender are similarly proportional to expenses: 104% revenue to expense ratio for men and 102% for women. However, men’s expenses are 116% that of women’s.

Given the limitations of my current dataset, I’m not prepared to make broad declarations about expense parity. However, it appears that men’s programs receive further expenditure than women’s. Further, charting each institutions expense ratios by gender could highlight certain institutions to investigate.

ggplot(long_athlete, aes(x=basketball_expense, y=basketball_revenue, color = gender)) +
  geom_point(alpha=0.7) +
  ggtitle('NCAA D3 Basketball Program Expenditure vs. Revenue Per Gender -- Selected Similar Schools') + 
  xlab('Basketball Program Expenses (USD)') + ylab('Basketball Program Revenues (USD)') +
  scale_y_continuous(labels = scales::dollar_format(prefix="$")) +
  scale_x_continuous(labels = scales::dollar_format(prefix="$"))

Generally you can see that as expenses increase (x variable), revenues (y variable) increase.

DATA 607 - Project 2

Ross Boehme

2023-03-04