Who rides longer? Men or women? An analysis of NYC & outer NYC (New Jersey) CitiBike riders

Using CitiBike trip data from October 2018 (trip duration, bike station, gender, etc.), I will attempt to draw conclusions about how the duration of a trip in NYC differs by gender. Duration is recorded in seconds and converted to minutes below. The data can be found here: https://www.citibikenyc.com/system-data

Each observation is an individual trip, but the trips fall into a hierarchical grouping: the station each trip starts from. Taking this into account, we will look at trip duration by gender ignoring stations (complete pooling), then station by station (no pooling), and finally use a linear mixed-effects model, which sits between complete pooling and no pooling.
I believe that trip duration will vary across stations (different locations across NYC), as some neighborhood stations may be more active than others, or one gender may be more active than the other at a given station.
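
To make the first two approaches concrete, they can be written as below. The notation is an assumption introduced here for illustration and does not come from the CitiBike data: y_ij is the duration in minutes of trip i starting at station j, and female_ij is 1 for a female rider and 0 for a male rider.

```latex
% Complete pooling: one intercept and one gender gap shared by all stations
\[ y_{ij} = \beta_0 + \beta_1\,\mathrm{female}_{ij} + \varepsilon_{ij} \]

% No pooling: a separate intercept and gender gap for each station j = 1, ..., 50
\[ y_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{female}_{ij} + \varepsilon_{ij} \]
```

In the first form every station shares the same two coefficients; in the second, each station gets its own pair and nothing is shared between stations.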

library(tidyverse)
library(nlme)
citiBike=read_csv("/Users/gregmaghakian/Documents/Soc 712/Week 8 Random Effect Model/Homework/Citibike.csv")
head(citiBike)

## # A tibble: 6 x 15
##   tripduration starttime           stoptime            `start station id`
##          <int> <dttm>              <dttm>                           <int>
## 1          152 2018-10-01 07:44:41 2018-10-01 07:47:14               3183
## 2          122 2018-10-01 08:50:05 2018-10-01 08:52:08               3183
## 3          211 2018-10-01 09:03:17 2018-10-01 09:06:48               3183
## 4          342 2018-10-01 10:13:07 2018-10-01 10:18:49               3183
## 5         2976 2018-10-01 10:45:14 2018-10-01 11:34:51               3183
## 6         2973 2018-10-01 10:45:17 2018-10-01 11:34:50               3183
## # ... with 11 more variables: `start station name` <chr>, `start station
## #   latitude` <dbl>, `start station longitude` <dbl>, `end station
## #   id` <int>, `end station name` <chr>, `end station latitude` <dbl>,
## #   `end station longitude` <dbl>, bikeid <int>, usertype <chr>, `birth
## #   year` <int>, gender <int>

#Treating IDs and categories as character rather than numeric
citiBike$`start station id`=as.character(citiBike$`start station id`)
citiBike$gender=as.character(citiBike$gender)
citiBike$`birth year`=as.character(citiBike$`birth year`)
#Removing unknown gender (coded 0)
citiBike=filter(citiBike,gender!="0")
#Converting seconds to minutes
citiBike$tripduration=citiBike$tripduration/60
citiBike=rename(citiBike,"stationName"=`start station name`)
#Dropping station 3426
citiBike=filter(citiBike,`start station id`!="3426")

A look at the groups, i.e. the CitiBike stations that male and female riders start from, and the number of stations

head(unique(citiBike$stationName))

## [1] "Exchange Place" "Paulus Hook"    "City Hall"      "Grove St PATH" 
## [5] "Warren St"      "Union St"

tail(unique(citiBike$stationName))

## [1] "Journal Square" "Glenwood Ave"   "Fairmount Ave"  "Bergen Ave"    
## [5] "Grand St"       "Jackson Square"

cat("The number of stations/groupings is",length(unique(citiBike$stationName)))

## The number of stations/groupings is 50

First, let’s run a linear regression of trip duration on gender: Complete Pooling

Note: Gender=1 is Male; Gender=2 is Female

pooling=lm(tripduration~gender,data=citiBike)
summary(pooling)

## 
## Call:
## lm(formula = tripduration ~ gender, data = citiBike)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -8.9   -5.0   -3.4   -0.2 4073.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.3963     0.2404  34.930  < 2e-16 ***
## gender2       1.6189     0.5008   3.233  0.00123 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.85 on 37534 degrees of freedom
## Multiple R-squared:  0.0002784,  Adjusted R-squared:  0.0002517 
## F-statistic: 10.45 on 1 and 37534 DF,  p-value: 0.001227

Interpreting the complete pooling linear model:

Under complete pooling we fit a single linear model at the individual level, with no grouping by station. The coefficient on gender is statistically different from 0 at the 1% level (p = 0.0012). Compared to males, females on average ride about 1.6 minutes longer: males ride for about 8.4 minutes and females for about 10 minutes. Note, however, that gender explains very little of the variation in trip duration (R-squared of about 0.0003).
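
The male and female averages quoted here come straight from the two coefficients: with a single two-level predictor, the intercept is the mean of the reference group and the intercept plus the slope is the mean of the other group. A minimal sketch of that relationship, using a made-up toy data frame (not the CitiBike sample):

```r
# Toy data, invented for illustration only
toy <- data.frame(
  tripduration = c(6, 8, 10, 9, 11, 13),
  gender = factor(c("1", "1", "1", "2", "2", "2"))
)

fit <- lm(tripduration ~ gender, data = toy)

# Group means computed directly
group_means <- as.numeric(tapply(toy$tripduration, toy$gender, mean))

# Group means recovered from the regression coefficients
male_mean <- unname(coef(fit)["(Intercept)"])
female_mean <- unname(coef(fit)["(Intercept)"] + coef(fit)["gender2"])

print(c(male = male_mean, female = female_mean))
print(group_means)

# The same arithmetic applied to the estimates reported above
reported_female_mean <- 8.3963 + 1.6189
print(reported_female_mean)
```

The last line reproduces the "about 10 minutes" figure for females from the printed output.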

Based on this, I would conclude that females seem to be more active around NYC than males, at least as measured by trip duration.

No-Pooling model

Here I will run 50 regressions, one for each station of origin.

# Fit a separate regression for each start station
dcoef <- citiBike %>% 
    group_by(`start station id`) %>% 
    do(mod = lm(tripduration ~ gender, data =.))
# Extract each station's intercept (average male trip duration)
intercepts <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(intercepts, aes(x = intc)) + geom_density()+xlab("Male Trip Duration over 50 Stations")

ggplot(intercepts, aes(x = intc)) + geom_histogram()+xlab("Male Trip Duration over 50 Stations")

Here are plots of the intercept from each station's regression. The intercept represents the average trip duration for males at that station. The most common station average for men is about 6 minutes, but there is wide variation across stations, ranging up to almost 20 minutes.

Now let’s look at how the difference between women and men varies across stations

dcoef=citiBike %>% 
    group_by(`start station id`) %>% 
    do(mod = lm(tripduration ~ gender, data = .))
# Extract each station's gender coefficient (female minus male duration)
genderCoefs <- dcoef %>% do(data.frame(genderc = coef(.$mod)[2]))
ggplot(genderCoefs, aes(x = genderc)) + geom_histogram()+xlab("Difference in Female and Male Trip Duration over 50 Stations")

ggplot(genderCoefs, aes(x = genderc)) + geom_density()+xlab("Difference in Female and Male Trip Duration over 50 Stations")

Here we can see that the female-male difference in ride duration also varies across stations. At some stations, female riders ride on average up to about 15 minutes longer than the men at that station. At some stations men even ride longer on average than women, something that is lost when we do complete pooling! However, at most stations female riders ride about 1 minute or so longer than the men.

Based on these visualizations and the analysis above, we can see that the gender difference in ride duration varies across stations. When we pool, we lose this variation by treating every rider as if they came from a single station. This is a poor assumption, as certain stations could be located in neighborhoods with more active adults, healthier lifestyles, or perhaps a lack of subway or other mass transit. It could also be that some stations have characteristics that attract a gender that is more prone to riding longer distances.
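
To illustrate what pooling hides, here is a small simulation sketch. Everything in it is hypothetical (five stations named S1 to S5, made-up gender gaps from -1 to 3 minutes, normal noise); it is not the CitiBike data, only a demonstration that one pooled coefficient can sit on top of very different station-level gaps.

```r
set.seed(712)

n_stations <- 5
n_per <- 200
n_total <- n_stations * n_per

# Hypothetical stations and riders
station <- rep(paste0("S", seq_len(n_stations)), each = n_per)
gender <- factor(sample(c("1", "2"), n_total, replace = TRUE))

# Assumed female-minus-male gap (minutes) for each station
true_gap <- rep(c(-1, 0, 1, 2, 3), each = n_per)
tripduration <- 8 + true_gap * (gender == "2") + rnorm(n_total, sd = 2)
sim <- data.frame(station, gender, tripduration)

# Complete pooling: a single gender coefficient for all stations
pooled_gap <- unname(coef(lm(tripduration ~ gender, data = sim))["gender2"])

# No pooling: one gender coefficient per station
station_gaps <- sapply(split(sim, sim$station), function(d) {
  unname(coef(lm(tripduration ~ gender, data = d))["gender2"])
})

print(pooled_gap)
print(round(station_gaps, 2))
```

The pooled coefficient is a single number, while the station-level coefficients spread out around it, including a station where the simulated men ride longer.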

Linear Mixed-Effects Model: Random Intercept

Let us now combine the station-level variation we get from no pooling with the concise, interpretable summary we get from complete pooling. To do so, we will use a random-effects model that allows for variation between groups within a single regression. Note: we are using a random intercept specification, which lets the baseline trip duration vary by station but holds the gender difference fixed across stations.
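
Written out, the random intercept specification looks roughly like this. The symbols are assumptions introduced for illustration: y_ij is the duration in minutes of trip i from station j, female_ij is 1 for a female rider and 0 for a male rider, and u_0j is station j's deviation from the overall intercept.

```latex
\[ y_{ij} = (\beta_0 + u_{0j}) + \beta_1\,\mathrm{female}_{ij} + \varepsilon_{ij} \]
\[ u_{0j} \sim N(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2) \]
```

Only the intercept carries a station subscript; the gender coefficient is a single value shared by all stations.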

randomInt=lme(tripduration ~ gender, data = citiBike, random = ~1|stationName, method = "ML")
summary(randomInt)

## Linear mixed-effects model fit by maximum likelihood
##  Data: citiBike 
##        AIC    BIC  logLik
##   384989.9 385024 -192491
## 
## Random effects:
##  Formula: ~1 | stationName
##         (Intercept) Residual
## StdDev:     2.43886 40.79113
## 
## Fixed effects: tripduration ~ gender 
##                Value Std.Error    DF   t-value p-value
## (Intercept) 8.851970 0.4457235 37485 19.859778   0e+00
## gender2     1.744573 0.5029339 37485  3.468791   5e-04
##  Correlation: 
##         (Intr)
## gender2 -0.261
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -0.33851484 -0.11171780 -0.07294973 -0.01229429 99.69955304 
## 
## Number of Observations: 37536
## Number of Groups: 50

Here we can see that we have 50 different stations (groups), which do vary from one another in ride duration: the standard deviation of the station intercepts (the male baseline) is about 2.4 minutes. This is small relative to the residual standard deviation of about 40.8 minutes.
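
One way to put the station-level standard deviation in context is the intraclass correlation: the share of total variance that sits between stations rather than within them. A short sketch using the two standard deviations printed in the output above (the variable names are my own):

```r
# Standard deviations copied from the random intercept output
sd_station <- 2.43886   # between-station (random intercept) SD
sd_resid <- 40.79113    # within-station (residual) SD

# Intraclass correlation: between-station variance over total variance
icc <- sd_station^2 / (sd_station^2 + sd_resid^2)
print(round(icc, 4))
```

On these numbers, well under 1% of the variance in trip duration is between stations, so most of the spread is among trips within the same station.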

Interpreting the results:

Females on average ride a CitiBike about 1.74 minutes longer than males do. Males on average ride for about 8.85 minutes, while females ride for about 10.6 minutes. This interpretation takes into account variation between stations, and both coefficients increase slightly compared to complete pooling. This suggests that there was some station-level variation that the complete pooling model did not capture.

Linear Mixed-Effects Model: Random Slope

Now we will use a random-effects model with a random slope, which allows the gender difference to vary across the different stations.
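
In equation form, a random slope specification can be sketched as follows. The symbols are assumptions introduced for illustration: y_ij is the duration in minutes of trip i from station j, female_ij is 1 for a female rider and 0 for a male rider, and u_0j and u_1j are station j's deviations from the overall intercept and the overall gender gap.

```latex
\[ y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\,\mathrm{female}_{ij} + \varepsilon_{ij} \]
\[ \begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_0^2 & \rho\,\sigma_0\sigma_1 \\ \rho\,\sigma_0\sigma_1 & \sigma_1^2 \end{pmatrix} \right), \qquad \varepsilon_{ij} \sim N(0, \sigma^2) \]
```

The model estimates two standard deviations and one correlation for the station effects, in addition to the residual standard deviation.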

randomSlope=lme(tripduration ~ gender, data = citiBike, random = ~gender|stationName, method = "ML")
summary(randomSlope)

## Linear mixed-effects model fit by maximum likelihood
##  Data: citiBike 
##        AIC    BIC    logLik
##   384992.8 385044 -192490.4
## 
## Random effects:
##  Formula: ~gender | stationName
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept)  2.3053841 (Intr)
## gender2      0.8497058 0.74  
## Residual    40.7894901       
## 
## Fixed effects: tripduration ~ gender 
##                Value Std.Error    DF   t-value p-value
## (Intercept) 8.823635 0.4285387 37485 20.590055   0e+00
## gender2     1.869533 0.5242401 37485  3.566178   4e-04
##  Correlation: 
##         (Intr)
## gender2 -0.112
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -0.37982201 -0.11114346 -0.07310560 -0.01241592 99.71840094 
## 
## Number of Observations: 37536
## Number of Groups: 50

Once we account for gender variation across stations, the standard deviation of male trip duration across the 50 stations falls slightly (from about 2.4 to about 2.3). The gender difference itself has a standard deviation of about 0.85 minutes across stations, something that neither the random intercept nor the complete pooling model can capture.
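
To get a feel for what a standard deviation of about 0.85 means, the sketch below combines it with the fixed gender coefficient from the output above. It assumes the station-level gaps are normally distributed around the fixed effect, which is the model's own assumption rather than something checked here; the variable names are my own.

```r
# Values copied from the random slope output
gap_mean <- 1.869533   # fixed effect for gender2 (minutes)
gap_sd <- 0.8497058    # SD of the station-level gender slopes

# Approximate range covering 95% of station-level gender gaps
lower <- gap_mean - 1.96 * gap_sd
upper <- gap_mean + 1.96 * gap_sd
print(round(c(lower = lower, upper = upper), 2))

# Implied share of stations where men ride longer than women
share_negative <- pnorm(0, mean = gap_mean, sd = gap_sd)
print(round(share_negative, 3))
```

Under that assumption, the model implies a gap of roughly 0.2 to 3.5 minutes at most stations, with only a small share of stations where men ride longer.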

Interpretation:

Females on average ride about 1.9 minutes longer than males do. Males on average ride for about 8.8 minutes while females ride for about 10.7 minutes. By allowing the gender difference to vary across stations, we have slightly increased the estimated difference between genders compared to the random intercept model (1.87 vs. 1.74 minutes), and the increase relative to the complete pooling model (1.62 minutes) is larger still. This lends some support to the random slope model, as there appears to be both station variation and gender variation across stations. The intercept (average trip duration for men) also decreased slightly compared to the random intercept model, perhaps hinting at bias in the random intercept model from not allowing gender differences to vary across stations.

With this said, it is interesting to note that the random intercept model performs slightly better on both AIC (384989.9 vs. 384992.8) and BIC (385024 vs. 385044), where lower values are better.
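
The AIC and BIC values printed above can also be turned into a rough likelihood-ratio comparison of the two mixed models. The sketch below rebuilds the deviances from the printed AICs, assuming 4 estimated parameters for the random intercept model and 6 for the random slope model (counts I inferred, checked against the printed BICs in the code). The printed values are rounded, and tests on variance components are known to be conservative, so treat the p-value as approximate.

```r
n_obs <- 37536

# Printed AIC values and assumed parameter counts
aic_int <- 384989.9;   k_int <- 4    # 2 fixed effects + intercept SD + residual SD
aic_slope <- 384992.8; k_slope <- 6  # adds a slope SD and one correlation

# Deviance (-2 log-likelihood) recovered from AIC = deviance + 2k
dev_int <- aic_int - 2 * k_int
dev_slope <- aic_slope - 2 * k_slope

# Sanity check: the same deviances should reproduce the printed BICs
bic_int <- dev_int + k_int * log(n_obs)
bic_slope <- dev_slope + k_slope * log(n_obs)
print(round(c(bic_int, bic_slope)))

# Likelihood-ratio statistic and approximate p-value
lr_stat <- dev_int - dev_slope
p_value <- pchisq(lr_stat, df = k_slope - k_int, lower.tail = FALSE)
print(c(lr_stat = lr_stat, p_value = p_value))
```

With both fitted models in memory, `anova(randomInt, randomSlope)` from nlme reports the same comparison without the rounding.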

To Summarize:

The intent of this project was to analyze CitiBike data to better understand gender differences in biking patterns and exercise. With our complete pooling model, we found a statistically significant effect: females ride longer than males. However, our data can be grouped at the bike station level, so we ran 50 separate regressions and graphed the results to see the variability between stations and in the gender difference across stations. Some stations have males that ride longer than females, and a complete pooling method loses this granularity. Therefore, we fit random intercept and random slope mixed-effects models to alleviate the problem of complete pooling while retaining interpretability. In these models females still ride longer than males, and the difference widens, showing stronger support for females being more active and prone to riding longer.

CitiBike Data

Greg Maghakian

11/10/2018