Useful resources:
Missing data, an introduction with descriptive anaysis link
Little’s MCAR test link
Intepreting missing data patern link
Imputations, a comparison of methods link
The dataframe bus.df is created after first stage of
data cleaning, i.e., combining the repetitive columns. We will do an NA
analysis first before proceeding to further operation.
A glimpse of the dataframe:
library(tidyverse)
glimpse(bus.df)
## Rows: 361,115
## Columns: 25
## $ id <chr> "51330302751-20230418152005_v104.60", "513301…
## $ trip_id <chr> "1328-87802-82800-2-8a15e6ba", "1328-87802-86…
## $ timestamp <int> 1682075324, 1682075323, 1682080225, 168208022…
## $ delay <int> 0, 0, 0, 0, 0, 0, 128, 0, 9, 0, 145, 0, 0, -6…
## $ stop_sequence <int> NA, NA, 1, 1, 1, 1, 15, 1, 12, 1, 9, 1, 1, 8,…
## $ arrival_delay <int> NA, NA, 0, 0, 0, 0, NA, 0, 9, 0, 145, 0, 0, N…
## $ arrival_time <int> NA, NA, 1682193600, 1682191800, 1682195400, 1…
## $ arrival_uncertainty <int> NA, NA, NA, NA, NA, NA, NA, NA, 0, 300, 0, 30…
## $ depature_delay <int> NA, NA, 0, 0, 0, 0, 161, 0, NA, 0, NA, 0, 0, …
## $ depature_time <int> NA, NA, 1682193600, 1682191800, 1682195400, 1…
## $ depature_uncertainty <int> NA, NA, NA, NA, NA, NA, 0, NA, NA, 300, NA, 3…
## $ stop_id <chr> NA, NA, "3362-20230420111822_v104.61", "4223-…
## $ schedule_relationship <int> 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ vehicle_latitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_longitude <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_bearing <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_speed <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_odometer <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_id <chr> NA, NA, NA, NA, NA, NA, "59973", "59973", "59…
## $ vehicle_label <chr> NA, NA, NA, NA, NA, NA, "AMP 973", "AM…
## $ vehicle_occupancy_status <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ start_time <chr> "23:00:00", "24:00:00", "08:00:00", "07:30:00…
## $ start_date <chr> "20230422", "20230422", "20230423", "20230423…
## $ route_id <chr> "878-203", "878-203", "90111-20230420111822_v…
## $ direction_id <int> 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, …
bus_test<-cbind.data.frame(bus.df$delay,bus.df$stop_sequence,
bus.df$stop_id,bus.df$schedule_relationship,
bus.df$start_time,bus.df$route_id)
par(mfrow=c(3,1))
hist(bus.df$delay, main='Delay')
hist(bus.df$stop_sequence, main='Stop_sq')
hist(bus.df$schedule_relationship, main='Schedule_relationship')
The delay is not normally distrubuted in the data.
library(naniar)
bus.df%>%summarise_all(~ sum(is.na(.)))
## id trip_id timestamp delay stop_sequence arrival_delay arrival_time
## 1 0 56435 0 135542 183860 233841 233841
## arrival_uncertainty depature_delay depature_time depature_uncertainty stop_id
## 1 236744 285325 285325 288228 183860
## schedule_relationship vehicle_latitude vehicle_longitude vehicle_bearing
## 1 56435 225573 225573 246889
## vehicle_speed vehicle_odometer vehicle_id vehicle_label
## 1 225573 347900 50579 50580
## vehicle_occupancy_status start_time start_date route_id direction_id
## 1 285122 56435 56435 56435 57649
library(ggplot2)
gg_miss_var(bus.df)
The main missing pattern of the interest is stop_id, direction_id, delay and all the variables below it in the above plot. Here the project tried Little’s Missing Completely at Random (MCAR) Test. The assumption of the test is that 1. missing data shares the same covariance matrix 2. the data has multivariate normality. For the variables of the interest we assume this to be true. The null is that the missing is completely random.
mcar_test(bus_test)
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 308773. 7 0 4
We fail to conclude the missing is completely random due to low p-values.
library(finalfit)
explanatory = c("stop_sequence","schedule_relationship","direction_id")
dependent = "delay"
bus.df %>%
missing_pattern(dependent, explanatory)
## schedule_relationship direction_id delay stop_sequence
## 177255 1 1 1 1 0
## 48318 1 1 1 0 1
## 77893 1 1 0 0 2
## 1214 1 0 0 0 3
## 56435 0 0 0 0 4
## 56435 57649 135542 183860 433486
We can see that the most frequent pattern is for all of them to be present, and the third most is none of them present. The second most frequent type is for delay to be missing with stop_sequence. In addition, since stop_sequence has a pattern not linked to other variables, if missing delay were to be removed, the overall missing will disappear, but will impact stop_sequence the least. It seems like this is a case of missing conditional on random (MAR), i.e., if delay is not recorded, other variables tend to not be recorded, but the actual missing values are random.
nbus.df%>%summarise_all(~ sum(is.na(.)))
## trip_id timestamp delay stop_sequence arrival_delay arrival_time
## 1 0 0 0 48318 98299 98299
## arrival_uncertainty depature_delay depature_time depature_uncertainty stop_id
## 1 101202 149783 149783 152686 48318
## schedule_relationship vehicle_id vehicle_label start_time start_hour
## 1 0 50579 50580 0 0
## start_minute start_date route_id direction_id stop_lat stop_lon stop_name
## 1 0 0 0 0 48532 48532 0
## weekday delay_min temp feelslike humidity precip windspeedmean severerisk
## 1 0 0 2104 2104 2104 2104 2104 2104
gg_miss_var(nbus.df)
nbus<-cbind.data.frame(nbus.df$stop_lat,nbus.df$precip,
nbus.df$stop_sequence)
mcar_test(nbus)
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 911. 6 0 6
The main variables of interests have 6 missing pattern.
explanatory = c("stop_sequence","stop_lat","precip")
dependent = "start_hour"
nbus.df %>%
missing_pattern(dependent, explanatory)
## start_hour precip stop_sequence stop_lat
## 175595 1 1 1 1 0
## 143 1 1 1 0 1
## 47731 1 1 0 0 2
## 1446 1 0 1 1 1
## 71 1 0 1 0 2
## 587 1 0 0 0 3
## 0 2104 48318 48532 98954
It seems like precip is missing monotonically. This is also a similar MAR case like before.
Since we are convinced that data is MAR, and there is no missing in
the dependent variable in nbus.df, imputations can reduce
the bias.