Correcting for non-trips when calculating trip rates
Author
Malithi Fernando
1 Correcting for “non trips” when calculating trip rates
When calculating trip rates, I noticed that a correction to include “trips not taken” was needed. This memo demonstrates the need to create ‘zero’ trips when calculating trip rates (where \(triprate = totaltrips/totalpeople\) in a given category), so that trips that were possible but not taken are included in the calculation. More specifically, the correction ensures that the denominator is correct: everyone in the category should be counted, regardless of whether they took a trip of that purpose.
Let’s start with a test dataset (Table 1): a sample of 4 people, 2 women and 2 men, each taking a combination of shopping and non-shopping trips (coded “non”). The weights are the same for both genders (within each gender, one person has a weight of 0.5 and the other 1.2). Both women took shopping trips, while only one of the men did.
Our group of interest for this demo is therefore men x shopping.
Code
# This section demonstrates the need to create 'zero' trips when calculating trip rates,
# so that trips that were possible but not taken are included in the analysis
# (more specifically, so that the denominator is correct: people should be included
# regardless of whether they took a trip).
# To show that skipping this adjustment gives a different (artificially high) answer,
# I demonstrate it using two different methods.

library(tidyverse)   # dplyr, tidyr, ggplot2
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), row_spec()
library(srvyr)       # as_survey_design(), survey_mean()

# Test data: a sample of 4 people (2 men, 2 women), each taking a combination of
# shopping and non-shopping ("non") trips. The weights are the same for both genders
# (1.2 for one person and 0.5 for the other).
test_weights <- read.table(text = "
hhnr_p p_gender purpose weight
1 f shopping 0.5
1 f shopping 0.5
1 f non 0.5
1 f non 0.5
2 f shopping 1.2
2 f shopping 1.2
2 f non 1.2
2 f non 1.2
2 f non 1.2
3 m shopping 0.5
3 m non 0.5
4 m non 1.2
4 m non 1.2
4 m non 1.2
", header = TRUE)

test_weights %>%
  kable()
Table 1: Test trip dataset with 4 people for a given year
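| hhnr_p | p_gender | purpose  | weight |
|--------|----------|----------|--------|
| 1      | f        | shopping | 0.5    |
| 1      | f        | shopping | 0.5    |
| 1      | f        | non      | 0.5    |
| 1      | f        | non      | 0.5    |
| 2      | f        | shopping | 1.2    |
| 2      | f        | shopping | 1.2    |
| 2      | f        | non      | 1.2    |
| 2      | f        | non      | 1.2    |
| 2      | f        | non      | 1.2    |
| 3      | m        | shopping | 0.5    |
| 3      | m        | non      | 0.5    |
| 4      | m        | non      | 1.2    |
| 4      | m        | non      | 1.2    |
| 4      | m        | non      | 1.2    |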
To calculate a weighted mean, we use the formula:
\[
W = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
\]
where \(w_i\) is the weight of the \(i\)th person and \(x_i\) is the number of trips of the \(i\)th person.
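As a minimal sketch of this formula in base R, the snippet below computes the weighted mean for the non-shopping trips of the two women in the test data (2 and 3 trips, weights 0.5 and 1.2, read off Table 1). The helper name `weighted_rate` is only illustrative and is not part of the memo's code.

```r
# Weighted mean as defined above: sum(w_i * x_i) / sum(w_i)
weighted_rate <- function(x, w) sum(w * x) / sum(w)

# Non-shopping trips of the two women in the test data
n_trips <- c(2, 3)      # x_i: trips per person
weights <- c(0.5, 1.2)  # w_i: person weights

rate <- weighted_rate(n_trips, weights)
rate  # (0.5 * 2 + 1.2 * 3) / (0.5 + 1.2), roughly 2.71

# Base R's weighted.mean() implements the same formula
all.equal(rate, weighted.mean(n_trips, weights))
```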
I work in R and use the widely used {survey} package via {srvyr}, which provides a tidyverse wrapper for many of its functions. The book Exploring Complex Survey Data Analysis Using R: A Tidy Introduction with {srvyr} and {survey} gives an excellent overview of how to use these packages. Other weighted-mean functions would also work, but this package additionally calculates standard errors and confidence intervals, which are useful for interpreting the effect of the survey design and sample.
1.1 Option 1: Without creating zero trips
The first way we might calculate the means is to count the trips undertaken by each person, grouped by gender and purpose (Table 2), and then call survey_mean() on the survey object, which takes the weights into account. Note that I also tested this with weighted mean functions that do not require a survey object and obtained the same answer (without standard errors or confidence intervals).
This gives a mean of 1 shopping trip per person for the group men x shopping (Table 3), because only one person shows up in this group (rather than the two possible men, since only one took a shopping trip). The formula is then \[ triprate = \frac{0.5 \times 1}{0.5} = 1 \] rather than \[ triprate = \frac{0.5 \times 1 + 1.2 \times 0}{0.5 + 1.2} \approx 0.29 \].
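The two calculations can be checked directly with base R's weighted.mean(). This is only an arithmetic sketch using the men x shopping values from the test data; the variable names are illustrative.

```r
# Men x shopping without the zero row: only person 3 (weight 0.5, 1 trip) is counted
rate_without_zero <- weighted.mean(x = 1, w = 0.5)

# Men x shopping with the zero row: person 4 (weight 1.2, 0 trips) is counted too
rate_with_zero <- weighted.mean(x = c(1, 0), w = c(0.5, 1.2))

rate_without_zero         # 1
round(rate_with_zero, 2)  # 0.29
```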
Code
# OPTION 1: without creating zero trips
test_weights_des <- test_weights %>%
  group_by(p_gender, purpose, hhnr_p) %>%
  summarise(n_trips = n()) %>% # create a table with the number of trips per person, per purpose
  left_join(
    test_weights %>% select(weight, hhnr_p) %>% distinct(),
    by = "hhnr_p"
  ) %>% # add back in the weights, which are needed for the weighted calculations of means etc.
  ungroup()

test_weights_des %>%
  kable() %>%
  kable_styling() %>%
  row_spec(row = c(7), background = "yellow")
Table 2: Table with the number of trips undertaken per person, with relevant person weights (rows related to men x shopping highlighted in yellow)

| p_gender | purpose  | hhnr_p | n_trips | weight |
|----------|----------|--------|---------|--------|
| f        | non      | 1      | 2       | 0.5    |
| f        | non      | 2      | 3       | 1.2    |
| f        | shopping | 1      | 2       | 0.5    |
| f        | shopping | 2      | 2       | 1.2    |
| m        | non      | 3      | 1       | 0.5    |
| m        | non      | 4      | 3       | 1.2    |
| m        | shopping | 3      | 1       | 0.5    |
Code
test_weights_des %>%
  # assign to a 'survey object' so that {survey}/{srvyr} functions can be used
  # and standard errors/confidence intervals can be calculated
  as_survey_design(weights = weight) %>%
  group_by(p_gender, purpose) %>%
  # here, the mean number of shopping trips per person is calculated as 1 in the group
  # men x shopping, because there is one person (rather than the two possible men,
  # since only one took a shopping trip). The formula:
  # sum across all people (weight * n_trips per person) / sum of all weights across all people
  # = 0.5*1/0.5 rather than (0.5*1 + 1.2*0)/(0.5 + 1.2)
  summarise(mean_triprate = survey_mean(n_trips)) %>%
  kable() %>%
  kable_styling() %>%
  row_spec(row = 4, background = "yellow")
Table 3: Summary table of means calculated using Option 1 (rows related to men x shopping highlighted in yellow)

| p_gender | purpose  | mean_triprate | mean_triprate_se |
|----------|----------|---------------|------------------|
| f        | non      | 2.705882      | 0.3171333        |
| f        | shopping | 2.000000      | 0.0000000        |
| m        | non      | 2.411765      | 0.6342665        |
| m        | shopping | 1.000000      | 0.0000000        |
1.2 Option 2: With creating zero trips
The second way we might calculate the means is by creating a dataframe with all possible combinations of people and trip purposes, joining the summary of the number of trips per person, and then filling in the empty cells with n_trips = 0 (Table 4).
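The expansion step described above can also be sketched with tidyr::complete(), which fills in missing combinations in one call. This is an alternative formulation under the assumption that {dplyr} and {tidyr} are installed; it is not the memo's own code, which follows. The object names `trips`, `counts` and `filled` are illustrative.

```r
library(dplyr)
library(tidyr)

# Test data from Table 1, one row per trip
trips <- data.frame(
  hhnr_p   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4),
  p_gender = c(rep("f", 9), rep("m", 5)),
  purpose  = c("shopping", "shopping", "non", "non",
               "shopping", "shopping", "non", "non", "non",
               "shopping", "non",
               "non", "non", "non"),
  weight   = c(rep(0.5, 4), rep(1.2, 5), rep(0.5, 2), rep(1.2, 3))
)

# Number of trips per person and purpose (only combinations that occur)
counts <- trips %>%
  count(p_gender, hhnr_p, weight, purpose, name = "n_trips")

# Add the missing person x purpose combinations and set their trip count to 0.
# nesting() keeps each person's gender and weight together, so only
# person x purpose is expanded.
filled <- counts %>%
  complete(nesting(p_gender, hhnr_p, weight), purpose,
           fill = list(n_trips = 0L))

filled
```

Here person 4 gains a shopping row with n_trips = 0, so both men enter the denominator for men x shopping.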
Code
# OPTION 2: with creating zero trips
# to create zero trips where there are empty categories, I first create a dataframe
# with all possible combinations
unique_persons <- test_weights %>% select(p_gender, hhnr_p, weight) %>% distinct()
trip_purposes <- test_weights %>% select(purpose) %>% distinct()
expanded_df <- expand_grid(unique_persons, trip_purposes)

# to this dataframe, I join the summary of the number of trips per person
# and then mutate to fill the NA (empty) cells with 0
test_weights_fixed_des <- expanded_df %>%
  left_join(
    test_weights %>%
      group_by(p_gender, purpose, hhnr_p) %>%
      summarise(n_trips = n()),
    by = c("hhnr_p", "p_gender", "purpose")
  ) %>%
  mutate(n_trips = ifelse(is.na(n_trips), 0, n_trips)) %>% # fill the NA (empty) cells with 0
  ungroup()

test_weights_fixed_des %>%
  kable() %>%
  kable_styling() %>%
  row_spec(row = c(5, 7), background = "yellow")
Table 4: Table with the number of trips undertaken per person, with relevant person weights (rows related to men x shopping highlighted in yellow, with zero trips included)

| p_gender | hhnr_p | weight | purpose  | n_trips |
|----------|--------|--------|----------|---------|
| f        | 1      | 0.5    | shopping | 2       |
| f        | 1      | 0.5    | non      | 2       |
| f        | 2      | 1.2    | shopping | 2       |
| f        | 2      | 1.2    | non      | 3       |
| m        | 3      | 0.5    | shopping | 1       |
| m        | 3      | 0.5    | non      | 1       |
| m        | 4      | 1.2    | shopping | 0       |
| m        | 4      | 1.2    | non      | 3       |
This way, the denominator is correct (2 people), and the mean number of shopping trips per person in the men x shopping group is calculated as 0.29 (Table 5): \[ triprate = \frac{0.5 \times 1 + 1.2 \times 0}{0.5 + 1.2} \approx 0.29 \]
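One way to confirm that the denominator is now right is to check that every gender x purpose group carries the full weight of both people of that gender (0.5 + 1.2 = 1.7). The base R sketch below does this using the values from Table 4; the object names are illustrative.

```r
# Zero-filled person x purpose table (values copied from Table 4)
filled <- data.frame(
  p_gender = rep(c("f", "m"), each = 4),
  hhnr_p   = rep(1:4, each = 2),
  weight   = rep(c(0.5, 1.2, 0.5, 1.2), each = 2),
  purpose  = rep(c("shopping", "non"), times = 4),
  n_trips  = c(2, 2, 2, 3, 1, 1, 0, 3)
)

groups <- list(filled$p_gender, filled$purpose)

# Total weight (the denominator) in each gender x purpose group
denominators <- tapply(filled$weight, groups, sum)

# Every group should now carry the weight of both people of that gender
stopifnot(all(abs(denominators - 1.7) < 1e-9))

# Weighted trip rate per group: sum(w * x) / sum(w)
numerators <- tapply(filled$weight * filled$n_trips, groups, sum)
rates <- numerators / denominators

round(rates["m", "shopping"], 2)  # 0.29
```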
Code
test_weights_fixed_des %>%
  group_by(p_gender, purpose) %>%
  # assign to a survey object so that survey functions can be used
  # and standard errors/confidence intervals can be calculated
  as_survey_design(weights = weight) %>%
  # here, the mean number of shopping trips per person is calculated as 0.29 in the group
  # men x shopping, using the formula:
  # sum across all people (weight * n_trips per person) / sum of all weights across all people
  # = (0.5*1 + 1.2*0)/(0.5 + 1.2)
  summarise(mean_triprate = survey_mean(n_trips)) %>%
  kable() %>%
  kable_styling() %>%
  row_spec(row = 4, background = "yellow")
Table 5: Summary table of means calculated using Option 2, with the mean number of shopping trips done by men highlighted in yellow

| p_gender | purpose  | mean_triprate | mean_triprate_se |
|----------|----------|---------------|------------------|
| f        | non      | 2.7058824     | 0.3138805        |
| f        | shopping | 2.0000000     | 0.0000000        |
| m        | non      | 2.4117647     | 0.6277611        |
| m        | shopping | 0.2941176     | 0.3138805        |
1.3 Conclusion (looking at real Swiss data)
In the Swiss data, this correction substantially affects trip rates:
When comparing men and women, more non-trips are recorded for men x shopping and for women x non-shopping than for their counterparts. Without this fix, we therefore overestimate men’s shopping trips and women’s non-shopping trips, and falsely conclude that these trip rates are closer to each other than they really are. See Table 6 for the number of non-trips added in each category.
Shopping trip rates decrease for both men and women, because there are many shopping non-trips (shopping trips not taken). We get shopping trip rates below 1, which makes sense. See Table 7 for the comparison of mean trip rates with and without non-trips being added.
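If several survey years need the same treatment, the expansion could be done within each year so that people are only combined with purposes from their own survey year. The base R helper below is a hypothetical sketch, not the memo's code: `add_zero_trips()` and the toy data are invented for illustration, and only the column names `year`, `hhnr_p` and `trip_purpose2` are taken from the Swiss data code. Person attributes such as gender and weight would still need to be joined back afterwards.

```r
# Hypothetical helper: add zero-trip rows for every person x purpose
# combination within each survey year. Input has one row per trip.
add_zero_trips <- function(trips, id_col, purpose_col, year_col) {
  by_cols <- c(year_col, id_col, purpose_col)

  # Observed number of trips per year x person x purpose
  trips$n_trips <- 1
  counts <- aggregate(trips["n_trips"], by = trips[by_cols], FUN = sum)

  # All person x purpose combinations, built separately for each year
  grids <- lapply(split(trips, trips[[year_col]]), function(d) {
    persons  <- unique(d[c(year_col, id_col)])
    purposes <- unique(d[purpose_col])
    merge(persons, purposes)  # no shared columns, so this is a cross join
  })
  grid <- do.call(rbind, grids)

  # Attach observed counts; combinations without trips become 0
  out <- merge(grid, counts, by = by_cols, all.x = TRUE)
  out$n_trips[is.na(out$n_trips)] <- 0
  out
}

# Toy data (invented for illustration): two years, two people per year
toy_trips <- data.frame(
  year          = c(2015, 2015, 2015, 2021, 2021),
  hhnr_p        = c(1, 1, 2, 3, 4),
  trip_purpose2 = c("Shopping", "Non-shopping", "Non-shopping",
                    "Shopping", "Non-shopping")
)

zero_filled <- add_zero_trips(toy_trips, "hhnr_p", "trip_purpose2", "year")
zero_filled
```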
Code
# need to work with each year individually, I think
trips_yr <- trips_filtered_cat %>%
  filter(year == 2021)

# create empty dataframe with all possible combinations of people and trip purposes
unique_p <- trips_yr %>% select(year, p_gender, hhnr_p, p_wgt) %>% distinct()
trip_purp <- trips_yr %>% select(trip_purpose2) %>% distinct()
expanded_df <- expand_grid(unique_p, trip_purp)

# join the summary of the number of trips per person and then mutate to fill
# the NA (empty) cells with 0
trips_yr_fixed <- expanded_df %>%
  left_join(
    trips_yr %>%
      group_by(year, p_gender, trip_purpose2, hhnr_p) %>%
      summarise(n_trips = n()),
    by = c("year", "hhnr_p", "p_gender", "trip_purpose2")
  ) %>%
  mutate(n_trips = ifelse(is.na(n_trips), 0, n_trips)) %>% # fill the NA (empty) cells with 0
  ungroup()

# calculate descriptive statistics related to trip rates
trips_yr_des <- trips_yr_fixed %>%
  # assign to a survey object so that survey functions can be used
  # and standard errors/confidence intervals can be calculated
  as_survey_design(weights = p_wgt) %>%
  group_by(year, p_gender, trip_purpose2) %>%
  summarise(
    mean_triprate   = survey_mean(n_trips, vartype = "ci", level = 0.9),
    median_triprate = survey_median(n_trips, vartype = "ci", level = 0.9),
    q25_triprate    = survey_quantile(n_trips, 0.25, vartype = "ci", level = 0.9),
    q75_triprate    = survey_quantile(n_trips, 0.75, vartype = "ci", level = 0.9),
    min_triprate    = survey_quantile(n_trips, 0),
    max_triprate    = survey_quantile(n_trips, 1)
  )

## CHECKS ## ---------------------------
# check how much this change affects trip rates -> more non trips recorded for
# men x shopping and women x nonshopping, therefore without this fix we overestimate
# men's shopping trips and women's non shopping trips
trips_yr_fixed %>%
  filter(n_trips == 0) %>%
  group_by(p_gender, trip_purpose2) %>%
  count() %>%
  kable()
Table 6: Table of the number of non-trips added in each category.
| p_gender | trip_purpose2 | n     |
|----------|---------------|-------|
| Men      | Non-shopping  | 1508  |
| Men      | Shopping      | 12289 |
| Women    | Non-shopping  | 1972  |
| Women    | Shopping      | 11510 |
Code
# let's try the full calculation to prove it without accounting for non trips
trips_yr_2 <- trips_yr %>%
  group_by(year, p_gender, trip_purpose2, hhnr_p) %>%
  summarise(n_trips = n()) %>% # create a table with the number of trips per person, per purpose
  left_join(
    trips_yr %>% select(p_wgt, hhnr_p) %>% distinct(),
    by = "hhnr_p"
  ) %>% # add back in the weights, which are needed for the weighted calculations of means etc.
  ungroup()

trips_yr_des2 <- trips_yr_2 %>%
  # assign to a survey object so that survey functions can be used
  # and standard errors/confidence intervals can be calculated
  as_survey_design(weights = p_wgt) %>%
  group_by(year, p_gender, trip_purpose2) %>%
  summarise(mean_triprate = survey_mean(n_trips))

results <- bind_rows(
  trips_yr_des %>%
    select(year, p_gender, trip_purpose2, mean_triprate) %>%
    mutate(method = "with non trips added"),
  trips_yr_des2 %>%
    select(year, p_gender, trip_purpose2, mean_triprate) %>%
    mutate(method = "without non trips")
)

results %>%
  pivot_wider(names_from = p_gender, values_from = mean_triprate) %>%
  mutate(diff = Women - Men) %>%
  kable()
Table 7: Comparison of mean trip rates with and without non-trips being added
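| year | trip_purpose2 | method               | Men       | Women     | diff       |
|------|---------------|----------------------|-----------|-----------|------------|
| 2021 | Non-shopping  | with non trips added | 2.6342220 | 2.5390444 | -0.0951776 |
| 2021 | Shopping      | with non trips added | 0.6337777 | 0.7480748 | 0.1142971  |
| 2021 | Non-shopping  | without non trips    | 2.8737426 | 2.8177842 | -0.0559585 |
| 2021 | Shopping      | without non trips    | 1.8107840 | 1.8228295 | 0.0120455  |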
Code
results %>%
  ggplot(aes(x = p_gender, weight = mean_triprate, fill = method)) +
  geom_bar(position = "dodge") +
  facet_wrap(~trip_purpose2) +
  labs(
    title = "Comparison of mean trip rates with and without non-trips being added",
    y = "Mean trip rate"
  )