Rank Schools’s Routes Difficulty

Using API to load data

NYC data allows you to collect data make a url request and loads it as a csv

bus <- read.csv("https://data.cityofnewyork.us/resource/ez4e-fazm.csv?$offset=150&$limit=653000")

Limit Dataset

We want to limit data exclusively to traffic issues as we are only concerned about delays that occured because of traffic and it’s also important to note that about 62% of bus delays are due to traffic as it’s shown below.

full_table_row = nrow(bus)
bus <- bus |> filter(reason == "Heavy Traffic")
traffic_row = nrow(bus)

traffic_delay_percent = traffic_row / full_table_row * 100

print(traffic_delay_percent)
## [1] 61.91593

Preview of Initial Data

Here we can see that we have “mins”, “minutes” capitals, missing values, incorrect values and ranges of numbers and on some ocasions even hours are listed so obviously this data is not clean enough to run an analysis

bus |> 
  select(how_long_delayed) |>
  head(20)
##    how_long_delayed
## 1                  
## 2                20
## 3                20
## 4            40MIIN
## 5                20
## 6                20
## 7              1425
## 8            15 min
## 9                  
## 10        46-60 Min
## 11            10MIN
## 12          30 MINS
## 13       25 MINUTES
## 14                 
## 15          30 MINS
## 16                 
## 17           30MINS
## 18           15 min
## 19          30 MINS
## 20           30MINS

Clean “how_long_delayed” column

We must clean in a way were we keep the integrity of the data and save as much rows as possible without compromising values that can be truly extracted. For ranges of minutes we will average them out since the middle of those ranges wont change the integrity of how long a delay was since it was between two times making any value in between true. Hours will be converted into minutes and empty, exagerated values or values that we can’t denote a logical sense of it will be removed.

bus<- bus |> 
    mutate(how_long_delayed = str_remove_all(how_long_delayed, " ")) |>
    mutate(how_long_delayed = str_remove_all(how_long_delayed, "\\.")) |>
    mutate(time_stuff = tolower(str_extract(how_long_delayed, "[A-Za-z]+" ))) |>
    mutate(how_long_delayed = str_remove(how_long_delayed, "[A-Za-z]+")) |>
    mutate(is_range = str_detect(how_long_delayed, "-|/"))|> 
    mutate(end_range = ifelse(is_range,str_extract(how_long_delayed, "\\d+$"), 0)) |> 
    mutate(how_long_delayed = ifelse(is_range,str_remove(how_long_delayed, "-\\d+$|/\\d+$"), how_long_delayed))

bus <- transform(bus, how_long_delayed = as.integer(how_long_delayed))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
bus <- transform(bus, end_range = as.integer(end_range))

bus <- bus |> 
    mutate(how_long_delayed = ifelse(str_detect(time_stuff, "hr|hrs|hours|hour|h"), how_long_delayed * 60, how_long_delayed * 1))

bus <- bus |> 
    drop_na(how_long_delayed) |>
    filter( how_long_delayed < 300) |>
    mutate(average_hours = ifelse(is_range, (how_long_delayed + end_range) / 2 , how_long_delayed))

filtered_row <- nrow(bus)

bus |> 
  select(how_long_delayed) |>
  head(20)
##    how_long_delayed
## 1                40
## 2                15
## 3                46
## 4                10
## 5                30
## 6                25
## 7                30
## 8                30
## 9                15
## 10               30
## 11               30
## 12               60
## 13               15
## 14               30
## 15               16
## 16               16
## 17               16
## 18               30
## 19               40
## 20               30
#bus <- bus |> mutate(how_long_delayed = str_extract(how_long_delayed, "\\d+"))

Get loss data percent

We managed to save about 97.5 percent of the heavy traffic dataset

percent_loss = (1 - (filtered_row / traffic_row) ) * 100
print(percent_loss)
## [1] 2.426103

Serviced Schools

We run into issues were serviced schools on one row are seperated by commas, dots or forward slash. Some of the id numbers start with hashtags.

bus |> 
  select(schools_serviced) |>
  head(20)
##                 schools_serviced
## 1                          29033
## 2                          24058
## 3                    16081,16360
## 4                           C579
## 5                          20030
## 6                           C069
## 7                          32075
## 8                          25450
## 9                          24068
## 10                         18066
## 11                         26179
## 12                   31905,31435
## 13                         28082
## 14                         22286
## 15                   21414,21907
## 16 15879,20446,20685,21893,22760
## 17                          C329
## 18                         13113
## 19                   20858,20859
## 20                         30443

Expand rows and Clean Serviced Schools

Schools have a 5 digit identification number and anything over it or under it will not be used for analyisis since a human error can damage the integrity of the data and what it means. It does not occur much often but it’s better to neglect data we can understand rather than input false data to a specific school

bus <- bus |> mutate(schools_serviced = str_remove_all(schools_serviced, "#|\\(|\\)"))|> mutate(schools_serviced = str_remove(schools_serviced, "^ | $"))
sep <- bus |> separate_longer_delim(schools_serviced, delim=',') |> separate_longer_delim(schools_serviced, delim='.') |> separate_longer_delim(schools_serviced, delim=' ') |> filter(str_detect(schools_serviced, "^.....$"))

sep |> 
  select(schools_serviced) |>
  head(20)
##    schools_serviced
## 1             29033
## 2             24058
## 3             16081
## 4             16360
## 5             20030
## 6             32075
## 7             25450
## 8             24068
## 9             18066
## 10            26179
## 11            31905
## 12            31435
## 13            28082
## 14            22286
## 15            21414
## 16            21907
## 17            15879
## 18            20446
## 19            20685
## 20            21893

Analysis on school with highest delay

we found that busses delay averages and count of occurences tend to normally distributed(Q-Q plots have a line that closely matches a diagnal line) so we will create and index in which 85 % of it’s strength depends on occurences and 15% of its strengh depends on average seeing where each z score lies and they’re comparable to each other accordingly based on their ranges

delay_rank <- sep |> group_by(schools_serviced) |> summarise(average = mean(how_long_delayed), n = n())

delay_mean = mean(delay_rank$average)
delay_sd = sd(delay_rank$average)
count_mean = mean(delay_rank$n)
count_sd = sd(delay_rank$n)
ggplot(delay_rank, aes(x = average)) + geom_histogram() + labs(title = "Delay averages")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sim_norm <- rnorm(n = nrow(delay_rank), mean = delay_mean, sd = delay_sd)
ggplot(data = delay_rank, aes(sample = sim_norm)) + geom_line(stat = "qq")

sim_norm_c <- rnorm(n = nrow(delay_rank), mean = count_mean, sd = count_sd)
ggplot(data = delay_rank, aes(sample = sim_norm_c)) + geom_line(stat = "qq")

Evaluate Ranks and Generate Difficulty Score

average is the average minutes delayed per occurence(n). We use the probability percentage of each observation is “average” and “n” values to determine how much of the distribution it occupies. The more of the distribution it occupies, the higher the score. X amounts of standard deveation means higher score and x amount of standard deveation to the negative side means lower standard deveation.

delay_rank <- delay_rank |> mutate(count_score = pnorm(q = n, mean = count_mean, sd = count_sd), delay_score = pnorm(q = average, mean = delay_mean, sd = delay_sd)) |> mutate(difficulty_score = (count_score * 0.85) + (delay_score * 0.15))


 
 delay_rank|> 
  head(20)
## # A tibble: 20 × 6
##    schools_serviced average     n count_score delay_score difficulty_score
##    <chr>              <dbl> <int>       <dbl>       <dbl>            <dbl>
##  1 00000               31       1       0.249       0.735            0.322
##  2 01001               40.3    19       0.267       0.949            0.370
##  3 01003               23.3    80       0.334       0.417            0.347
##  4 01004               27.1   571       0.868       0.579            0.825
##  5 01005               23.4    12       0.260       0.422            0.284
##  6 01006               17.1    30       0.279       0.188            0.265
##  7 01015               31.2   549       0.853       0.742            0.836
##  8 01019               31.2   967       0.991       0.741            0.953
##  9 01020               29.8   574       0.870       0.692            0.843
## 10 01022               22.0    20       0.269       0.360            0.282
## 11 01034               34.0   649       0.913       0.830            0.901
## 12 01037               46       2       0.250       0.988            0.361
## 13 01038               43.5   618       0.897       0.977            0.909
## 14 01063               28.1  1088       0.997       0.624            0.941
## 15 01064               28.1   592       0.882       0.622            0.843
## 16 01080               42.3   502       0.816       0.969            0.839
## 17 01110               25.4   320       0.628       0.506            0.610
## 18 01134               30.1   697       0.935       0.700            0.900
## 19 01137               27       5       0.253       0.577            0.302
## 20 01140               26.2   609       0.892       0.540            0.839

Plot top 50 most difficult schools

Since we managed to succesfully provide a diffulty score of schools routes, we will now plot the top 50 schools with the most difficult routes.

delay_rank |> 
    arrange(desc(difficulty_score)) |>
    slice(1:50)|>
    ggplot( aes(x=reorder(schools_serviced, +difficulty_score), y = difficulty_score, fill = average))  + geom_bar(stat="identity", width = 0.8) + coord_flip() + theme_ipsum() + geom_text(
     aes(label = format(round(difficulty_score * 100, 2), nsmall = 1)),
     colour = "white", size = 4.4,
     vjust =0.45,hjust = 1.2, position = position_dodge(0.1)
 ) + labs(title = "Top 50 Schools With Most Difficult Route", y= "Bus Routes Difficulty Percentage", x = "Schools", fill = "Average Delay(minutes)") + theme(axis.title = element_text(family = "Helvetica", size = (20), colour = "steelblue4", face = "bold.italic"), plot.title =  element_text(family = "Helvetica", size = (20), colour = "steelblue4", face = "bold.italic"), plot.subtitle =  element_text(family = "Helvetica", size = (12), colour = "lightblue3", face = "bold.italic"))

## Conclusion And Where Do We Go From Here

In conclusion, we now have a way to determine which schools has tougher routes than others, furthermore we can inspect through districts and boroughs using the school numbers to match them, and by knowing this information, the district and borough leaders and decide where to allocate funding in order