NYC Open Data lets us collect the data with a single URL request, which loads directly as a CSV:
bus <- read.csv("https://data.cityofnewyork.us/resource/ez4e-fazm.csv?$offset=150&$limit=653000")
We want to limit the data exclusively to traffic issues, since we are only concerned with delays that occurred because of traffic. It is also worth noting that about 62% of bus delays are due to heavy traffic, as shown below.
full_table_row = nrow(bus)
bus <- bus |> filter(reason == "Heavy Traffic")
traffic_row = nrow(bus)
traffic_delay_percent = traffic_row / full_table_row * 100
print(traffic_delay_percent)
## [1] 61.91593
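A side note on the request itself: `$offset` and `$limit` are Socrata query parameters, and the same API can also filter server-side. Here is a minimal sketch, assuming the standard SoQL `$where` parameter (the clause reason='Heavy Traffic' is URL-encoded below), that would download only the rows we keep anyway:
# Sketch only: server-side SoQL filter via $where; the object name
# "traffic" is just illustrative
traffic <- read.csv(paste0(
  "https://data.cityofnewyork.us/resource/ez4e-fazm.csv",
  "?$where=reason%3D%27Heavy%20Traffic%27&$limit=653000"
))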
Here we can see "mins", "minutes", inconsistent capitalization, missing values, incorrect values, ranges of numbers, and on some occasions even hours. Clearly this column is not clean enough to run an analysis on.
bus |>
select(how_long_delayed) |>
head(20)
## how_long_delayed
## 1
## 2 20
## 3 20
## 4 40MIIN
## 5 20
## 6 20
## 7 1425
## 8 15 min
## 9
## 10 46-60 Min
## 11 10MIN
## 12 30 MINS
## 13 25 MINUTES
## 14
## 15 30 MINS
## 16
## 17 30MINS
## 18 15 min
## 19 30 MINS
## 20 30MINS
We must clean in a way where we keep the integrity of the data and save as many rows as possible without discarding values that can genuinely be extracted. For ranges of minutes we will take the midpoint, since any value between the two endpoints is consistent with the reported delay, so the midpoint does not compromise the integrity of how long the delay was. Hours will be converted into minutes, and empty values, exaggerated values, or values we cannot make logical sense of will be removed.
bus <- bus |>
  # strip spaces and periods so entries like "15 min." become "15min"
  mutate(how_long_delayed = str_remove_all(how_long_delayed, " ")) |>
  mutate(how_long_delayed = str_remove_all(how_long_delayed, "\\.")) |>
  # pull the unit text ("min", "mins", "hr", ...) into its own column, then drop it
  mutate(time_stuff = tolower(str_extract(how_long_delayed, "[A-Za-z]+"))) |>
  mutate(how_long_delayed = str_remove(how_long_delayed, "[A-Za-z]+")) |>
  # flag ranges like "46-60" and keep the range endpoint separately
  mutate(is_range = str_detect(how_long_delayed, "-|/")) |>
  mutate(end_range = ifelse(is_range, str_extract(how_long_delayed, "\\d+$"), 0)) |>
  mutate(how_long_delayed = ifelse(is_range, str_remove(how_long_delayed, "-\\d+$|/\\d+$"), how_long_delayed))
# coerce to integer; anything that is still not a clean number becomes NA
bus <- transform(bus, how_long_delayed = as.integer(how_long_delayed))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
bus <- transform(bus, end_range = as.integer(end_range))
bus <- bus |>
  # convert hour units to minutes; note that entries with no unit text at all
  # have time_stuff == NA, so ifelse() yields NA for them and drop_na() below
  # removes those rows
  mutate(how_long_delayed = ifelse(str_detect(time_stuff, "hr|hrs|hours|hour|h"), how_long_delayed * 60, how_long_delayed))
bus <- bus |>
  # drop unparseable entries and implausible delays of five hours or more
  drop_na(how_long_delayed) |>
  filter(how_long_delayed < 300) |>
  # for ranges, take the midpoint of the two endpoints (in minutes, despite the column name)
  mutate(average_hours = ifelse(is_range, (how_long_delayed + end_range) / 2, how_long_delayed))
filtered_row <- nrow(bus)
bus |>
select(how_long_delayed) |>
head(20)
## how_long_delayed
## 1 40
## 2 15
## 3 46
## 4 10
## 5 30
## 6 25
## 7 30
## 8 30
## 9 15
## 10 30
## 11 30
## 12 60
## 13 15
## 14 30
## 15 16
## 16 16
## 17 16
## 18 30
## 19 40
## 20 30
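As a quick sanity check, here is a minimal sketch applying the same parsing rules to a few toy strings (hypothetical examples, not rows from the dataset):
# toy inputs covering the main formats seen above
tibble(how_long_delayed = c("15 min", "46-60 Min", "2 HRS", "40MIIN")) |>
  mutate(how_long_delayed = str_remove_all(how_long_delayed, " |\\."),
         time_stuff = tolower(str_extract(how_long_delayed, "[A-Za-z]+")),
         how_long_delayed = str_remove(how_long_delayed, "[A-Za-z]+"),
         is_range = str_detect(how_long_delayed, "-|/"),
         end_range = as.integer(ifelse(is_range, str_extract(how_long_delayed, "\\d+$"), 0)),
         how_long_delayed = as.integer(ifelse(is_range, str_remove(how_long_delayed, "-\\d+$|/\\d+$"), how_long_delayed)),
         # hours become minutes; ranges get their midpoint
         how_long_delayed = ifelse(str_detect(time_stuff, "hr|hrs|hours|hour|h"), how_long_delayed * 60, how_long_delayed),
         midpoint = ifelse(is_range, (how_long_delayed + end_range) / 2, how_long_delayed))
# expected parses: 15, 46 (midpoint 53), 120, 40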
We managed to save about 97.5 percent of the heavy-traffic rows:
percent_loss = (1 - (filtered_row / traffic_row) ) * 100
print(percent_loss)
## [1] 2.426103
We run into issues where the serviced schools in one row are separated by commas, dots, spaces, or forward slashes, and some of the ID numbers start with hashtags or are wrapped in parentheses.
bus |>
select(schools_serviced) |>
head(20)
## schools_serviced
## 1 29033
## 2 24058
## 3 16081,16360
## 4 C579
## 5 20030
## 6 C069
## 7 32075
## 8 25450
## 9 24068
## 10 18066
## 11 26179
## 12 31905,31435
## 13 28082
## 14 22286
## 15 21414,21907
## 16 15879,20446,20685,21893,22760
## 17 C329
## 18 13113
## 19 20858,20859
## 20 30443
Schools have a five-digit identification number, and anything longer or shorter will not be used for the analysis, since a human entry error can damage the integrity of the data and what it means. This does not happen very often, and it is better to discard data we cannot interpret than to attribute false data to a specific school.
bus <- bus |>
  # strip stray characters around the codes and trim whitespace
  mutate(schools_serviced = str_remove_all(schools_serviced, "#|\\(|\\)")) |>
  mutate(schools_serviced = str_trim(schools_serviced))
sep <- bus |>
  # one row per school code, whichever delimiter was used
  separate_longer_delim(schools_serviced, delim = ",") |>
  separate_longer_delim(schools_serviced, delim = ".") |>
  separate_longer_delim(schools_serviced, delim = "/") |>
  separate_longer_delim(schools_serviced, delim = " ") |>
  # keep only valid five-digit codes
  filter(str_detect(schools_serviced, "^\\d{5}$"))
sep |>
select(schools_serviced) |>
head(20)
## schools_serviced
## 1 29033
## 2 24058
## 3 16081
## 4 16360
## 5 20030
## 6 32075
## 7 25450
## 8 24068
## 9 18066
## 10 26179
## 11 31905
## 12 31435
## 13 28082
## 14 22286
## 15 21414
## 16 21907
## 17 15879
## 18 20446
## 19 20685
## 20 21893
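To make the reshaping concrete, here is a toy illustration on hypothetical strings (not dataset rows), showing one output row per code with non-five-digit entries dropped:
# toy strings only; the real rows go through the full pipeline above
tibble(schools_serviced = c("15879,20446", "20858/20859", "#C579")) |>
  mutate(schools_serviced = str_remove_all(schools_serviced, "#")) |>
  separate_longer_delim(schools_serviced, delim = ",") |>
  separate_longer_delim(schools_serviced, delim = "/") |>
  filter(str_detect(schools_serviced, "^\\d{5}$"))
# keeps 15879, 20446, 20858, 20859; drops C579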
We found that schools' delay averages and occurrence counts tend to be roughly normally distributed (their Q-Q plots closely follow the diagonal reference line), so we will create an index in which 85% of the score depends on the number of occurrences and 15% depends on the average delay. Converting each value to where it falls within its own distribution (its z-score) makes the two measures comparable despite their different ranges.
delay_rank <- sep |>
  group_by(schools_serviced) |>
  summarise(average = mean(how_long_delayed), n = n())
delay_mean = mean(delay_rank$average)
delay_sd = sd(delay_rank$average)
count_mean = mean(delay_rank$n)
count_sd = sd(delay_rank$n)
ggplot(delay_rank, aes(x = average)) + geom_histogram() + labs(title = "Delay averages")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Q-Q plots of the observed values; plotting the data itself (rather than a
# simulated normal sample, which is straight by construction) is what
# actually assesses normality
ggplot(delay_rank, aes(sample = average)) + geom_qq() + geom_qq_line()
ggplot(delay_rank, aes(sample = n)) + geom_qq() + geom_qq_line()
Here `average` is the mean minutes delayed per school and `n` is its number of occurrences. For each school, pnorm() gives the fraction of the fitted normal distribution lying below its `average` and `n` values; the more of the distribution a value sits above, the higher its score. A value several standard deviations above the mean scores close to 1, and a value several standard deviations below the mean scores close to 0.
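Written out, the index computed below is the weighted blend of the two pnorm() values, where $\Phi$ is the standard normal CDF, $\mu_n, \sigma_n$ are the mean and standard deviation of the counts, and $\mu_d, \sigma_d$ those of the averages:

$$\text{difficulty} = 0.85\,\Phi\!\left(\frac{n - \mu_n}{\sigma_n}\right) + 0.15\,\Phi\!\left(\frac{\bar{x} - \mu_d}{\sigma_d}\right)$$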
delay_rank <- delay_rank |>
  mutate(count_score = pnorm(q = n, mean = count_mean, sd = count_sd),
         delay_score = pnorm(q = average, mean = delay_mean, sd = delay_sd)) |>
  # weighted blend: occurrence count dominates, average delay fine-tunes
  mutate(difficulty_score = (count_score * 0.85) + (delay_score * 0.15))
delay_rank|>
head(20)
## # A tibble: 20 × 6
## schools_serviced average n count_score delay_score difficulty_score
## <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 00000 31 1 0.249 0.735 0.322
## 2 01001 40.3 19 0.267 0.949 0.370
## 3 01003 23.3 80 0.334 0.417 0.347
## 4 01004 27.1 571 0.868 0.579 0.825
## 5 01005 23.4 12 0.260 0.422 0.284
## 6 01006 17.1 30 0.279 0.188 0.265
## 7 01015 31.2 549 0.853 0.742 0.836
## 8 01019 31.2 967 0.991 0.741 0.953
## 9 01020 29.8 574 0.870 0.692 0.843
## 10 01022 22.0 20 0.269 0.360 0.282
## 11 01034 34.0 649 0.913 0.830 0.901
## 12 01037 46 2 0.250 0.988 0.361
## 13 01038 43.5 618 0.897 0.977 0.909
## 14 01063 28.1 1088 0.997 0.624 0.941
## 15 01064 28.1 592 0.882 0.622 0.843
## 16 01080 42.3 502 0.816 0.969 0.839
## 17 01110 25.4 320 0.628 0.506 0.610
## 18 01134 30.1 697 0.935 0.700 0.900
## 19 01137 27 5 0.253 0.577 0.302
## 20 01140 26.2 609 0.892 0.540 0.839
Since we have successfully assigned a difficulty score to school routes, we will now plot the top 50 schools with the most difficult routes.
delay_rank |>
  arrange(desc(difficulty_score)) |>
  slice(1:50) |>
  ggplot(aes(x = reorder(schools_serviced, +difficulty_score), y = difficulty_score, fill = average)) +
  geom_bar(stat = "identity", width = 0.8) +
  coord_flip() +
  theme_ipsum() +
  geom_text(aes(label = format(round(difficulty_score * 100, 2), nsmall = 1)),
            colour = "white", size = 4.4,
            vjust = 0.45, hjust = 1.2, position = position_dodge(0.1)) +
  labs(title = "Top 50 Schools With Most Difficult Route",
       y = "Bus Route Difficulty Percentage", x = "Schools",
       fill = "Average Delay (minutes)") +
  theme(axis.title = element_text(family = "Helvetica", size = 20, colour = "steelblue4", face = "bold.italic"),
        plot.title = element_text(family = "Helvetica", size = 20, colour = "steelblue4", face = "bold.italic"),
        plot.subtitle = element_text(family = "Helvetica", size = 12, colour = "lightblue3", face = "bold.italic"))
## Conclusion And Where Do We Go From Here
In conclusion, we now have a way to determine which schools have tougher routes than others. Furthermore, we can drill down into districts and boroughs by matching them through the school numbers, and with this information district and borough leaders can decide where to allocate funding in order to relieve the most difficult routes.
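As a sketch of that next step, and assuming (hypothetically; this would need to be verified against OPT documentation) that the first two digits of a school code identify its district, the scores could be rolled up like this:
# hypothetical sketch: aggregate scores to districts, assuming the first two
# digits of schools_serviced encode the district (an unverified assumption)
delay_rank |>
  mutate(district = str_sub(schools_serviced, 1, 2)) |>
  group_by(district) |>
  summarise(mean_difficulty = mean(difficulty_score), schools = n()) |>
  arrange(desc(mean_difficulty))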