Useful resources:

  1. Missing data, an introduction with descriptive anaysis link

  2. Little’s MCAR test link

  3. Intepreting missing data patern link

  4. Imputations, a comparison of methods link

The dataframe bus.df is created after first stage of data cleaning, i.e., combining the repetitive columns. We will do an NA analysis first before proceeding to further operation.

A glimpse of the dataframe:

library(tidyverse)
glimpse(bus.df)
## Rows: 361,115
## Columns: 25
## $ id                       <chr> "51330302751-20230418152005_v104.60", "513301…
## $ trip_id                  <chr> "1328-87802-82800-2-8a15e6ba", "1328-87802-86…
## $ timestamp                <int> 1682075324, 1682075323, 1682080225, 168208022…
## $ delay                    <int> 0, 0, 0, 0, 0, 0, 128, 0, 9, 0, 145, 0, 0, -6…
## $ stop_sequence            <int> NA, NA, 1, 1, 1, 1, 15, 1, 12, 1, 9, 1, 1, 8,…
## $ arrival_delay            <int> NA, NA, 0, 0, 0, 0, NA, 0, 9, 0, 145, 0, 0, N…
## $ arrival_time             <int> NA, NA, 1682193600, 1682191800, 1682195400, 1…
## $ arrival_uncertainty      <int> NA, NA, NA, NA, NA, NA, NA, NA, 0, 300, 0, 30…
## $ depature_delay           <int> NA, NA, 0, 0, 0, 0, 161, 0, NA, 0, NA, 0, 0, …
## $ depature_time            <int> NA, NA, 1682193600, 1682191800, 1682195400, 1…
## $ depature_uncertainty     <int> NA, NA, NA, NA, NA, NA, 0, NA, NA, 300, NA, 3…
## $ stop_id                  <chr> NA, NA, "3362-20230420111822_v104.61", "4223-…
## $ schedule_relationship    <int> 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ vehicle_latitude         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_longitude        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_bearing          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_speed            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_odometer         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vehicle_id               <chr> NA, NA, NA, NA, NA, NA, "59973", "59973", "59…
## $ vehicle_label            <chr> NA, NA, NA, NA, NA, NA, "AMP        973", "AM…
## $ vehicle_occupancy_status <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ start_time               <chr> "23:00:00", "24:00:00", "08:00:00", "07:30:00…
## $ start_date               <chr> "20230422", "20230422", "20230423", "20230423…
## $ route_id                 <chr> "878-203", "878-203", "90111-20230420111822_v…
## $ direction_id             <int> 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, …

MVN in the data

bus_test<-cbind.data.frame(bus.df$delay,bus.df$stop_sequence,
                  bus.df$stop_id,bus.df$schedule_relationship,
                  bus.df$start_time,bus.df$route_id)
par(mfrow=c(3,1))
hist(bus.df$delay, main='Delay')
hist(bus.df$stop_sequence, main='Stop_sq')
hist(bus.df$schedule_relationship, main='Schedule_relationship')

The delay is not normally distrubuted in the data.

Missing plot

library(naniar)
bus.df%>%summarise_all(~ sum(is.na(.)))
##   id trip_id timestamp  delay stop_sequence arrival_delay arrival_time
## 1  0   56435         0 135542        183860        233841       233841
##   arrival_uncertainty depature_delay depature_time depature_uncertainty stop_id
## 1              236744         285325        285325               288228  183860
##   schedule_relationship vehicle_latitude vehicle_longitude vehicle_bearing
## 1                 56435           225573            225573          246889
##   vehicle_speed vehicle_odometer vehicle_id vehicle_label
## 1        225573           347900      50579         50580
##   vehicle_occupancy_status start_time start_date route_id direction_id
## 1                   285122      56435      56435    56435        57649
library(ggplot2)
gg_miss_var(bus.df)

The main missing pattern of the interest is stop_id, direction_id, delay and all the variables below it in the above plot. Here the project tried Little’s Missing Completely at Random (MCAR) Test. The assumption of the test is that 1. missing data shares the same covariance matrix 2. the data has multivariate normality. For the variables of the interest we assume this to be true. The null is that the missing is completely random.

mcar_test(bus_test)
## # A tibble: 1 × 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1   308773.     7       0                4

We fail to conclude the missing is completely random due to low p-values.

Missing pattern

library(finalfit) 

explanatory = c("stop_sequence","schedule_relationship","direction_id")
dependent = "delay"
bus.df %>% 
  missing_pattern(dependent, explanatory)

##        schedule_relationship direction_id  delay stop_sequence       
## 177255                     1            1      1             1      0
## 48318                      1            1      1             0      1
## 77893                      1            1      0             0      2
## 1214                       1            0      0             0      3
## 56435                      0            0      0             0      4
##                        56435        57649 135542        183860 433486

We can see that the most frequent pattern is for all of them to be present, and the third most is none of them present. The second most frequent type is for delay to be missing with stop_sequence. In addition, since stop_sequence has a pattern not linked to other variables, if missing delay were to be removed, the overall missing will disappear, but will impact stop_sequence the least. It seems like this is a case of missing conditional on random (MAR), i.e., if delay is not recorded, other variables tend to not be recorded, but the actual missing values are random.

Removing missing delays

nbus.df%>%summarise_all(~ sum(is.na(.)))
##   trip_id timestamp delay stop_sequence arrival_delay arrival_time
## 1       0         0     0         48318         98299        98299
##   arrival_uncertainty depature_delay depature_time depature_uncertainty stop_id
## 1              101202         149783        149783               152686   48318
##   schedule_relationship vehicle_id vehicle_label start_time start_hour
## 1                     0      50579         50580          0          0
##   start_minute start_date route_id direction_id stop_lat stop_lon stop_name
## 1            0          0        0            0    48532    48532         0
##   weekday delay_min temp feelslike humidity precip windspeedmean severerisk
## 1       0         0 2104      2104     2104   2104          2104       2104
gg_miss_var(nbus.df)

nbus<-cbind.data.frame(nbus.df$stop_lat,nbus.df$precip,
                       nbus.df$stop_sequence)
mcar_test(nbus)
## # A tibble: 1 × 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1      911.     6       0                6

The main variables of interests have 6 missing pattern.

explanatory = c("stop_sequence","stop_lat","precip")
dependent = "start_hour"
nbus.df %>% 
  missing_pattern(dependent, explanatory)

##        start_hour precip stop_sequence stop_lat      
## 175595          1      1             1        1     0
## 143             1      1             1        0     1
## 47731           1      1             0        0     2
## 1446            1      0             1        1     1
## 71              1      0             1        0     2
## 587             1      0             0        0     3
##                 0   2104         48318    48532 98954

It seems like precip is missing monotonically. This is also a similar MAR case like before.

Since we are convinced that data is MAR, and there is no missing in the dependent variable in nbus.df, imputations can reduce the bias.