Correcting for non-trips when calculating trip rates

Author

Malithi Fernando

1 Correcting for “non trips” when calculating trip rates

I noticed that when calculating trip rates, a correction to include “trips not taken” was needed. This memo demonstrates the need to create ‘zero’ trips when calculating trip rates (where \(triprate = totaltrips/totalpeople\) in a certain category), so that trips that were possible but not taken are included in the calculation. More specifically, this correction ensures that the denominator is correct, i.e. everyone in the category is counted, regardless of whether they took a trip of a given purpose.

Let’s start off with a test dataset (Table 1). This is a sample of 4 people: 2 men and 2 women, each taking a combination of shopping and non-shopping (“non”) trips. The weights are the same for both genders (0.5 for one person and 1.2 for the other). Both women took shopping trips, while only one of the men did.

Our group of interest for this demo is therefore men x shopping.

Code

# Packages used throughout this memo
library(tidyverse)  # dplyr, tidyr (expand_grid, pivot_wider), ggplot2
library(srvyr)      # tidyverse-style wrapper around {survey}
library(knitr)      # kable()
library(kableExtra) # kable_styling(), row_spec()

# This section demonstrates the need to create 'zero' trips when calculating trip rates, so that trips that were possible but not taken are included in the analysis (more specifically, so that the denominator is correct: people should be included regardless of whether they took a trip)

# To show that skipping this adjustment gives a different (artificially high) answer, the rate is calculated using two different methods

# Test data: a sample of 4 people (2 men, 2 women), each taking a combination of shopping and non-shopping ('non') trips. The weights are the same for both genders (0.5 for one person and 1.2 for the other)
test_weights <- read.table(text = "hhnr_p   p_gender    purpose weight
1   f   shopping    0.5
1   f   shopping    0.5
1   f   non 0.5
1   f   non 0.5
2   f   shopping    1.2
2   f   shopping    1.2
2   f   non 1.2
2   f   non 1.2
2   f   non 1.2
3   m   shopping    0.5
3   m   non 0.5
4   m   non 1.2
4   m   non 1.2
4   m   non 1.2"
, header = TRUE)

test_weights %>% kable()

Table 1: Test trip dataset with 4 people for a given year
hhnr_p	p_gender	purpose	weight
1	f	shopping	0.5
1	f	shopping	0.5
1	f	non	0.5
1	f	non	0.5
2	f	shopping	1.2
2	f	shopping	1.2
2	f	non	1.2
2	f	non	1.2
2	f	non	1.2
3	m	shopping	0.5
3	m	non	0.5
4	m	non	1.2
4	m	non	1.2
4	m	non	1.2

To calculate a weighted mean, we use the formula:

\[ W = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

where \(w_i\) is the weight of the \(i\)th person and \(x_i\) is the number of trips of the \(i\)th person.
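As a minimal sketch of this formula in base R, the snippet below applies it to the two men in the test dataset (weights 0.5 and 1.2, with 1 and 0 shopping trips). The helper name `weighted_rate` is made up for illustration and is not part of the original analysis.

```r
# Hypothetical helper (not from the memo): weighted mean of trips per person
weighted_rate <- function(weights, trips) {
  sum(weights * trips) / sum(weights)
}

# The two men in the test data: weights 0.5 and 1.2, with 1 and 0 shopping trips
w <- c(0.5, 1.2)
x <- c(1, 0)

rate <- weighted_rate(w, x)
rate                                  # 0.5 / 1.7, about 0.294
all.equal(rate, weighted.mean(x, w))  # agrees with base R's weighted.mean()
```

Note that the person with zero trips still contributes their weight to the denominator, which is the point of the correction discussed in this memo.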

I work in R and use the widely used {survey} package via {srvyr}, which provides a tidyverse wrapper for many of its functions. The book Exploring Complex Survey Data Analysis Using R: A Tidy Introduction with {srvyr} and {survey} provides an excellent overview of how to use these packages. You could use other weighted mean functions, but these packages also calculate standard errors and confidence intervals, which are useful for interpreting the effect of the survey design/sample.

1.1 Option 1: Without creating zero trips

The first way we might calculate the means is by simply counting the number of trips undertaken by each person, grouped by gender and purpose (Table 2), and then using survey_mean() on the survey object, which takes the weights into account. Note that I also tested this with weighted mean functions that do not require a survey object and obtained the same answer (without standard errors/confidence intervals).

This gives a mean of 1 shopping trip per person for the group men x shopping (Table 3), because only one person shows up in this group (rather than the two possible men, since only one took a shopping trip). The calculation is then \[triprate = \frac{0.5 \times 1}{0.5} = 1\] rather than \[triprate = \frac{0.5 \times 1 + 1.2 \times 0}{0.5 + 1.2} \approx 0.29\].

Code

#OPTION 1: without creating zero trips
test_weights_des <- test_weights %>%
  group_by(p_gender, purpose, hhnr_p) %>%
  summarise(n_trips = n()) %>% # create a table with the number of trips per person, per purpose
  left_join(test_weights %>% select(weight, hhnr_p) %>% distinct(), by = "hhnr_p") %>% # add back the weights, which are needed for the weighted calculations of means etc.
  ungroup()

test_weights_des %>% kable() %>% kable_styling() %>% 
 row_spec(row = c(7), background = "yellow")

Table 2: Number of trips undertaken per person, with the relevant person weights (the row for men x shopping is highlighted in yellow)
p_gender	purpose	hhnr_p	n_trips	weight
f	non	1	2	0.5
f	non	2	3	1.2
f	shopping	1	2	0.5
f	shopping	2	2	1.2
m	non	3	1	0.5
m	non	4	3	1.2
m	shopping	3	1	0.5

Code

test_weights_des %>% 
  as_survey_design( weights = weight) %>% # assign to a 'survey object' so that {survey}/{srvyr} functions can be used and standard errors/confidence intervals can be calculated.
  group_by( p_gender, purpose) %>%
  summarise (mean_triprate = survey_mean(n_trips)) %>% # here, the mean number of shopping trips per person is calculated as 1 in the group men x shopping, because there is one person (rather than the two possible men, since only one took a shopping trip). The formula: Sum across all people(weight*n_trips per person)/sum of all weights across all people = 0.5*1/0.5 rather than (0.5*1+1.2*0)/(0.5+1.2)
  kable() %>% 
  kable_styling() %>% 
  row_spec(row = 4, background = "yellow")

Table 3: Summary table of means calculated using Option 1 (the row for men x shopping is highlighted in yellow)
p_gender	purpose	mean_triprate	mean_triprate_se
f	non	2.705882	0.3171333
f	shopping	2.000000	0.0000000
m	non	2.411765	0.6342665
m	shopping	1.000000	0.0000000
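The Option 1 point estimates can also be reproduced without a survey object. The sketch below is an illustration rather than the memo's own code: it rebuilds the per-person counts of Table 2 by hand and uses base R's `weighted.mean()` inside `summarise()`. The extra column `n_people` is added here only to show how many people enter each denominator.

```r
library(dplyr)

# Per-person trip counts as in Table 2: a person only appears for a purpose
# if they took at least one trip of it, so person 4 has no shopping row
per_person <- data.frame(
  p_gender = c("f", "f", "f", "f", "m", "m", "m"),
  purpose  = c("non", "non", "shopping", "shopping", "non", "non", "shopping"),
  hhnr_p   = c(1, 2, 1, 2, 3, 4, 3),
  n_trips  = c(2, 3, 2, 2, 1, 3, 1),
  weight   = c(0.5, 1.2, 0.5, 1.2, 0.5, 1.2, 0.5)
)

option1 <- per_person %>%
  group_by(p_gender, purpose) %>%
  summarise(
    mean_triprate = weighted.mean(n_trips, weight),
    n_people      = n(), # number of people contributing to the denominator
    .groups       = "drop"
  )

option1
```

For men x shopping, `n_people` is 1 instead of 2, which is why the rate comes out as 1.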

1.2 Option 2: With creating zero trips

The second way we might calculate the means is by creating a dataframe with all possible combinations of people and trip purposes, joining the summary of the number of trips per person, and then filling in the empty cells with n_trips = 0 (Table 4).

Code

#OPTION 2: with creating zero trips

# to create zero trips where there are empty categories, I first create a dataframe with all possible combinations
unique_persons <- test_weights %>% 
  select(p_gender,hhnr_p, weight) %>%
  distinct()

trip_purposes <- test_weights %>%
  select(purpose) %>%
  distinct()

expanded_df <- expand_grid(unique_persons, trip_purposes)

# to this dataframe, I join the summary of the number of trips per person and then mutate to fill the NA (empty) cells with 0
test_weights_fixed_des <- expanded_df %>% 
  left_join(test_weights %>%
              group_by(p_gender, purpose, hhnr_p) %>%
              summarise(n_trips = n()), 
            by = c("hhnr_p", "p_gender", "purpose")) %>% 
   mutate(n_trips = ifelse(is.na(n_trips), 0, n_trips)) %>% # fill the NA (empty) cells with 0
  ungroup() 

test_weights_fixed_des %>% kable() %>% kable_styling() %>% 
 row_spec(row = c(5,7), background = "yellow")

Table 4: Number of trips undertaken per person, with the relevant person weights and zero trips included (rows for men x shopping are highlighted in yellow)
p_gender	hhnr_p	weight	purpose	n_trips
f	1	0.5	shopping	2
f	1	0.5	non	2
f	2	1.2	shopping	2
f	2	1.2	non	3
m	3	0.5	shopping	1
m	3	0.5	non	1
m	4	1.2	shopping	0
m	4	1.2	non	3
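A possible shortcut for building a table like Table 4 is `tidyr::complete()`, which fills in missing combinations in one step. This is an alternative sketch, not the approach used in this memo; it starts from hand-typed per-person counts (people only appear for purposes they actually travelled for).

```r
library(dplyr)
library(tidyr)

# Per-person trip counts without zero rows (person 4 has no shopping row)
per_person <- data.frame(
  p_gender = c("f", "f", "f", "f", "m", "m", "m"),
  purpose  = c("non", "non", "shopping", "shopping", "non", "non", "shopping"),
  hhnr_p   = c(1, 2, 1, 2, 3, 4, 3),
  n_trips  = c(2, 3, 2, 2, 1, 3, 1),
  weight   = c(0.5, 1.2, 0.5, 1.2, 0.5, 1.2, 0.5)
)

with_zeros <- per_person %>%
  complete(
    nesting(hhnr_p, p_gender, weight), # keep each person's attributes together
    purpose,                           # cross with every observed purpose
    fill = list(n_trips = 0)           # missing combinations become zero trips
  )

rates <- with_zeros %>%
  group_by(p_gender, purpose) %>%
  summarise(mean_triprate = weighted.mean(n_trips, weight), .groups = "drop")

rates
```

Using `nesting()` matters here: it ensures that the added zero-trip row for person 4 carries that person's gender and weight, rather than crossing every gender and weight with every person.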

This way, the denominator is correct (2 people) and the mean number of shopping trips per person for men is calculated as 0.29, since \[triprate = \frac{0.5 \times 1 + 1.2 \times 0}{0.5 + 1.2} \approx 0.29\] (Table 5).

Code

test_weights_fixed_des %>% 
  group_by( p_gender, purpose) %>%
  as_survey_design(weights = weight) %>% # assign to a survey object so that survey functions can be used and standard errors/confidence intervals can be calculated.
  summarise(mean_triprate = survey_mean(n_trips)) %>% # here, the mean number of shopping trips per person is calculated as 0.29 in the group men x shopping, using the formula: Sum across all people(weight*n_trips per person)/sum of all weights across all people = (0.5*1+1.2*0)/(0.5+1.2)
  kable() %>% 
  kable_styling() %>% 
  row_spec(row = 4, background = "yellow")

Table 5: Summary table of means calculated using Option 2, with the mean number of shopping trips made by men highlighted in yellow
p_gender	purpose	mean_triprate	mean_triprate_se
f	non	2.7058824	0.3138805
f	shopping	2.0000000	0.0000000
m	non	2.4117647	0.6277611
m	shopping	0.2941176	0.3138805
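One way to see how the two options relate (a short derivation added for illustration, using the notation of the weighted mean formula) is to write \(T\) for the set of people in a group who took at least one trip of the given purpose. Because \(x_i = 0\) for everyone outside \(T\), the Option 2 rate is the Option 1 rate multiplied by the weighted share of people in \(T\):

```latex
\[
\underbrace{\frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}}_{\text{Option 2}}
=
\underbrace{\frac{\sum_{i \in T} w_i x_i}{\sum_{i \in T} w_i}}_{\text{Option 1}}
\times
\frac{\sum_{i \in T} w_i}{\sum_{i=1}^{n} w_i}
\]
```

For men x shopping in the test data, this is \(1 \times 0.5/(0.5+1.2) \approx 0.29\). Option 1 is therefore not wrong as such: it answers a different question, namely the mean number of trips among people who made at least one such trip.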

1.3 Conclusion (looking at real Swiss data)

In the Swiss data, this correction substantially changes the trip rates:

When comparing men and women, more non-trips are recorded for men x shopping and for women x non-shopping than for their counterparts. Without this fix, we therefore overestimate men’s shopping trips and women’s non-shopping trips, and the trip rates of men and women appear closer to each other than they really are. See Table 6 for the number of non-trips added in each category.
Shopping trip rates for both men and women decrease, because there are many shopping non-trips (shopping trips not taken). We get shopping trip rates below 1, which makes sense. See Table 7 for the comparison of mean trip rates with and without non-trips being added.

Code

# each year probably needs to be handled separately, so filter to a single year
trips_yr <- trips_filtered_cat %>% 
  filter(year == 2021) 

# create empty dataframe with all possible combinations of people and trip purposes
unique_p <- trips_yr %>% 
  select(year, p_gender,hhnr_p, p_wgt) %>%
  distinct()

trip_purp <- trips_yr %>%
  select(trip_purpose2) %>%
  distinct()

expanded_df <- expand_grid(unique_p, trip_purp)

# join the summary of the number of trips per person and then mutate to fill the NA (empty) cells with 0
trips_yr_fixed <- expanded_df %>% 
  left_join(trips_yr %>%
              group_by(year, p_gender, trip_purpose2, hhnr_p) %>%
              summarise(n_trips = n()), 
            by = c("year", "hhnr_p", "p_gender", "trip_purpose2")) %>% 
   mutate(n_trips = ifelse(is.na(n_trips), 0, n_trips)) %>% # fill the NA (empty) cells with 0
    ungroup() 

# calculate descriptive statistics related to trip rates
trips_yr_des <- trips_yr_fixed %>% 
  as_survey_design(weights = p_wgt) %>% # assign to a survey object so that survey functions can be used and standard errors/confidence intervals can be calculated.
  group_by( year, p_gender, trip_purpose2) %>%
  summarise(mean_triprate = survey_mean(n_trips, vartype = "ci", level= 0.9),
            median_triprate = survey_median(n_trips,vartype = "ci", level= 0.9),
            q25_triprate = survey_quantile(n_trips, 0.25, vartype = "ci", level= 0.9),
            q75_triprate = survey_quantile(n_trips, 0.75, vartype = "ci", level= 0.9),
            min_triprate = survey_quantile(n_trips,0),
            max_triprate = survey_quantile(n_trips,1))  

## CHECKS ##---------------------------

trips_yr_fixed %>% 
  filter(n_trips==0) %>% 
  group_by(p_gender, trip_purpose2) %>%
  count() %>% # check how much this change affects trip rates -> more non-trips are recorded for men x shopping and women x non-shopping, so without this fix we overestimate men's shopping trips and women's non-shopping trips
  kable()

Table 6: Table of the number of non-trips added in each category.
p_gender	trip_purpose2	n
Men	Non-shopping	1508
Men	Shopping	12289
Women	Non-shopping	1972
Women	Shopping	11510

Code

# for comparison, repeat the full calculation without accounting for non-trips
trips_yr_2 <- trips_yr %>%
  group_by(year, p_gender, trip_purpose2, hhnr_p) %>%
  summarise(n_trips = n()) %>% # create a table with the number of trips per person, per purpose
  left_join(trips_yr %>% select(p_wgt, hhnr_p) %>% distinct(), by = "hhnr_p") %>% # add back the weights, which are needed for the weighted calculations of means etc.
  ungroup()

trips_yr_des2 <- trips_yr_2 %>% 
  as_survey_design(weights = p_wgt) %>% # assign to a survey object so that survey functions can be used and standard errors/confidence intervals can be calculated.
  group_by(year, p_gender, trip_purpose2) %>%
  summarise (mean_triprate=survey_mean(n_trips)) 

results <- bind_rows(trips_yr_des %>% select(year, p_gender, trip_purpose2, mean_triprate) %>% mutate(method = "with non trips added"), 
          trips_yr_des2 %>% select(year, p_gender, trip_purpose2, mean_triprate) %>% mutate(method = "without non trips")) 

results %>% 
  pivot_wider(names_from = p_gender, values_from = mean_triprate) %>% 
  mutate(diff = Women-Men) %>%
  kable()

Table 7: Comparison of mean trip rates with and without non-trips being added
year	trip_purpose2	method	Men	Women	diff
2021	Non-shopping	with non trips added	2.6342220	2.5390444	-0.0951776
2021	Shopping	with non trips added	0.6337777	0.7480748	0.1142971
2021	Non-shopping	without non trips	2.8737426	2.8177842	-0.0559585
2021	Shopping	without non trips	1.8107840	1.8228295	0.0120455
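As a rough consistency check on Table 7 (a sketch added for illustration, with the values typed in by hand from the table): if both methods use the same person weights, dividing the corrected rate by the uncorrected rate should give the weighted share of people in the 2021 trip data who made at least one trip of that purpose, since people with zero trips add nothing to the numerator.

```r
# Mean trip rates copied from Table 7 (2021)
table7 <- data.frame(
  p_gender          = c("Men", "Women", "Men", "Women"),
  trip_purpose2     = c("Non-shopping", "Non-shopping", "Shopping", "Shopping"),
  with_non_trips    = c(2.6342220, 2.5390444, 0.6337777, 0.7480748),
  without_non_trips = c(2.8737426, 2.8177842, 1.8107840, 1.8228295)
)

# Ratio of corrected to uncorrected rate: implied weighted share of people
# with at least one trip of that purpose (assuming identical weights)
table7$share_with_trip <- table7$with_non_trips / table7$without_non_trips

table7
```

The implied shares are roughly 0.35 (men) and 0.41 (women) for shopping, against about 0.92 and 0.90 for non-shopping, which is in line with the large number of shopping non-trips in Table 6.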

Code

results %>% 
  ggplot(aes(x = p_gender, y = mean_triprate, fill = method)) +
  geom_col(position = "dodge") + # geom_col() plots the pre-computed means directly
  facet_wrap(~trip_purpose2)+
  labs(title = "Comparison of mean trip rates with and without non-trips being added",
      y = "Mean trip rate")