Introduction

This code-through explores how to perform a very basic flow for a linear regression model using the Citi Bike dataset. A linear regression is a way to find out how one thing changes when you change something else. It is a tool that helps us understand the relationship between two things: one we have control over or can measure and one we want to predict or understand. For simplicity, in this code-through, we will be looking at Trip Duration as our dependent variable, and User Type as our policy variable. Keep in mind, this is a 10,000 foot view, on the steps that would go into running a basic regression model.

Content Overview

Setting Up Your Environment
Importing Data
Data Cleaning and Preparation
Exploratory Data Analysis
Preparing for Linear Regression
Conducting Linear Regression
Interpreting Results
Reporting Findings

Why You Should Care

The PEDAMS program contains a program evaluation component that we must become very familiar with. This choice of topic is moreso for myself, in order to become more familiar with the concepts learned in my PAF 510 course while pulling on the intro fundamentals of this course.

Learning Objectives

Packages needed for linear regression
How to import data
How to clean and prepare data
How to explore your data
Setting up a linear regression
Running Linear Regression
Understanding Results
Articulating Findings

1. Setting Up Your Environment:

The following packages will be needed for completing the linear regression model. Below is a description of each package you’ll need and why it is applicable to our objective. If not already installed, use install.packages() to download them.

dplyr: Used for data manipulation. This package provides a set of intuite functions for filtering, selecting, reordering, and summarizing data. It is a simplified wa to clean, transform, and analyze data in R.

tidyverse: This package streamlines data import, tidying, manipulation, visualization, and analysis, promiting a coherent and consistent workflow within R.

broom: Helpful for statistical modeling. This package tidies the output of statistical models into easily handled data frames, making it easier to work with model results within tidyverse.

lubridate: Simplifies the handling and manipulation of certain data, such as dates and time, thereby making it easier to work with.

fastDummies: Quickly converts categorical variables into dummy variables. We are using this to convert our categorical variable “usertype” into new columns using 1’s and 0’s to define true/false for usertype (Subscriber/Customer). This makes it possible to do statistical modeling. Technically, we do not need this for the code-through since we are using a binary variable (subscribership = Yes/No), but it is useful to know about for when you do need it.

ggplot2: Very common, and extremely helpful visualization package.

#Load Packages
library(dplyr)
library(tidyverse)
library(broom)
library(lubridate)
library(fastDummies)
library(ggplot2)

2. Importing Data

In this code-through, we are using the Citi Bike data set, which is available to the public and accessible here. In this example, you will download the .csv file from the website, and import it into R using the following code.

#Import Data
dat <- read_csv("C:/Users/swest/Desktop/Grad School/Spring 2024/PAF 513/Labs/Code-Through-Assignment/CitiBike_TripData.csv") #Your path to the file will be different than this one here. You can use help(read.csv) for additional assistance.

3. Data Cleaning and Preparation

Critical to this process is cleaning and preparing data. In order to do so, you must familiarize yourself with the data, including the data document.

Utilizing the dplyr package and base R functions, we can become familiar with the data. There are various ways to accomplish this, but below we will follow a specific steps to clean and prepare our data.

#Inspect Data:
glimpse(dat)

## Rows: 577,703
## Columns: 15
## $ tripduration              <dbl> 695, 693, 2059, 123, 1521, 2028, 2057, 369, …
## $ starttime                 <dttm> 2013-06-01 00:00:01, 2013-06-01 00:00:08, 2…
## $ stoptime                  <dttm> 2013-06-01 00:11:36, 2013-06-01 00:11:41, 2…
## $ `start station id`        <dbl> 444, 444, 406, 475, 2008, 485, 285, 509, 265…
## $ `start station name`      <chr> "Broadway & W 24 St", "Broadway & W 24 St", …
## $ `start station latitude`  <dbl> 40.74235, 40.74235, 40.69513, 40.73524, 40.7…
## $ `start station longitude` <dbl> -73.98915, -73.98915, -73.99595, -73.98759, …
## $ `end station id`          <chr> "434", "434", "406", "262", "310", "406", "5…
## $ `end station name`        <chr> "9 Ave & W 18 St", "9 Ave & W 18 St", "Hicks…
## $ `end station latitude`    <chr> "40.74317449", "40.74317449", "40.69512845",…
## $ `end station longitude`   <chr> "-74.00366443", "-74.00366443", "-73.9959506…
## $ bikeid                    <dbl> 19678, 16649, 19599, 16352, 15567, 18445, 15…
## $ usertype                  <chr> "Subscriber", "Subscriber", "Customer", "Sub…
## $ `birth year`              <chr> "1983", "1984", "NULL", "1960", "1983", "NUL…
## $ gender                    <dbl> 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,…

str(dat)

## spc_tbl_ [577,703 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ tripduration           : num [1:577703] 695 693 2059 123 1521 ...
##  $ starttime              : POSIXct[1:577703], format: "2013-06-01 00:00:01" "2013-06-01 00:00:08" ...
##  $ stoptime               : POSIXct[1:577703], format: "2013-06-01 00:11:36" "2013-06-01 00:11:41" ...
##  $ start station id       : num [1:577703] 444 444 406 475 2008 ...
##  $ start station name     : chr [1:577703] "Broadway & W 24 St" "Broadway & W 24 St" "Hicks St & Montague St" "E 15 St & Irving Pl" ...
##  $ start station latitude : num [1:577703] 40.7 40.7 40.7 40.7 40.7 ...
##  $ start station longitude: num [1:577703] -74 -74 -74 -74 -74 ...
##  $ end station id         : chr [1:577703] "434" "434" "406" "262" ...
##  $ end station name       : chr [1:577703] "9 Ave & W 18 St" "9 Ave & W 18 St" "Hicks St & Montague St" "Washington Park" ...
##  $ end station latitude   : chr [1:577703] "40.74317449" "40.74317449" "40.69512845" "40.6917823" ...
##  $ end station longitude  : chr [1:577703] "-74.00366443" "-74.00366443" "-73.99595065" "-73.9737299" ...
##  $ bikeid                 : num [1:577703] 19678 16649 19599 16352 15567 ...
##  $ usertype               : chr [1:577703] "Subscriber" "Subscriber" "Customer" "Subscriber" ...
##  $ birth year             : chr [1:577703] "1983" "1984" "NULL" "1960" ...
##  $ gender                 : num [1:577703] 1 1 0 1 1 0 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   tripduration = col_double(),
##   ..   starttime = col_datetime(format = ""),
##   ..   stoptime = col_datetime(format = ""),
##   ..   `start station id` = col_double(),
##   ..   `start station name` = col_character(),
##   ..   `start station latitude` = col_double(),
##   ..   `start station longitude` = col_double(),
##   ..   `end station id` = col_character(),
##   ..   `end station name` = col_character(),
##   ..   `end station latitude` = col_character(),
##   ..   `end station longitude` = col_character(),
##   ..   bikeid = col_double(),
##   ..   usertype = col_character(),
##   ..   `birth year` = col_character(),
##   ..   gender = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(dat)

A quick look with head() allows us to see all the columns and first 5 rows of the dataset. Make note, the only noticeable NULL’s in this preview, are located in “birth year”. Also, make note that the column “gender” only shows “0” and “1”; however, when you reference the data document, you see that gender is defined as “0” is unknown, “1” is male, “2” is female. For our purposes, we will not be relying on either of these variables for our regression, so we do not need to remove them or mutate them. If they were, we would need to further dig in and choose how to clean it. We do, however, need to modify our tripduration from minutes to hours for simplicity. Additionally, we need to modify our usertype to make it more usable for our analysis. We will demonstrate how to use the fastDummies package in order to do this. Now that we have taken a look at our data, lets move on to handling missing errors.

#Clean and Prepare:
sum(is.na(dat$tripduration))  #Check  missing values in our tripduration column

## [1] 0

sum(is.na(dat$usertype)) #Check for missing values in our usertype column

## [1] 0

x <- mean(dat$usertype == "Customer") #Assignment, we will use later.
x #Print proportion of users that use 24-hour or 3-day passes

## [1] 0.4159906

y <- mean(dat$usertype == "Subscriber") #Assignment, we will use later.
y #Print proportion of users that are annual subscribers

## [1] 0.5840094

##Correct Data Types:
dat <- dat %>% 
          mutate(tripduration_hour = tripduration/60, #Convert minutes to hours
                 start_time_only = format(ymd_hms(starttime), "%H:%M:%S"), #Create new column with starttime in the HOUR:MINUTE:SECOND format
                 stop_time_only = format(ymd_hms(stoptime), "%H:%M:%S"), #Create new column with stoptime in the HOUR:MINUTE:SECOND format. 
                day_of_week = wday(starttime, label = TRUE)) #Convert date into weekday and place it in its own column. 
# Encode categorical variables
dat <- dummy_cols(dat, select_columns = "usertype") #Creates new columns titled "Subscriber" and "Customer" and uses a "1" = TRUE and "0" = FALSE argument in those rows to identify the usertype. Technically, we don't need this for our purposes because our categories are already binary (Subscriber/Non-Subscriber). 

head(dat)

Now that we’ve done a basic inspection and cleaning of our data, we will move into our exploratory analysis.

4. Exploratory Data Analysis

This step involves continuing to get to know your data, and even some residual cleaning up. As you’ll see in the following code chunks, there are some ‘outliers’ that need to be addressed before we can move on to Step 5.

#Introductory Exploration:

max(dat$tripduration_hour) #Get an idea of the range of values in tripduration

## [1] 64607.98

min(dat$tripduration_hour)

## [1] 1.016667

mean(dat$tripduration_hour)

## [1] 22.87617

ggplot(dat, aes(x = tripduration_hour)) + 
  geom_boxplot() +
  labs(title = "Boxplot of Trip Duration (Hours)") #Visually explore trip duration

ggplot(dat, aes(x = tripduration_hour)) + 
  geom_histogram(bins = 50) + 
  labs(title = "Histogram of Trip Duration (Hours)") +
  xlim(c(0, 100)) #Visually explore trip duration

When observing tripduration_hour we see that the highest value in that column is 64,607 hours. That is about 7.38 years, which would indicate an anomaly and will skew our regression results. We need to find and address these outliers before we perform our regression. We also identify that the shortest trip is just over an hour. There are several ways to handle outliers, for our purposes we will simply include a cutoff of 24 hours to eliminate the upper outlier variables. For our suspiciously high shortest-trip, we cross-referenced the min in the original .csv in Excel and it confirms the shortest trip is just over an hour. This is not an outlier, it is simply what the data shows. It could be accounted for by the time of the year the data is representing. We will include this clarification and the choice to include a cutoff max in our final report discussing limitations.

#Filter Data to remove 'max' outliers: 
dat_filtered <- dat %>% 
                  filter(tripduration_hour <= 24)

max(dat_filtered$tripduration_hour) #Check new filtered dataset values

## [1] 24

min(dat_filtered$tripduration_hour)

## [1] 1.016667

mean(dat_filtered$tripduration_hour)

## [1] 12.15619

ggplot(dat_filtered, aes(x = tripduration_hour)) + 
  geom_boxplot() +
  labs(title = "Boxplot of Trip Duration (Hours)") #Visually explore trip duration

ggplot(dat_filtered, aes(x = tripduration_hour)) + 
  geom_histogram(bins = 50) + 
  labs(title = "Histogram of Trip Duration (Hours)") +
  xlim(c(0, 24)) #Visually explore trip duration

5. Preparing for Linear Regression

Now that our data has been cleaned and explored, lets set up our regression model. This is where you will determine your dependent and independent variables. For simplicity, we will be using ‘tripduration_hour’ as our dependent variable, and ‘usertype_Subscriber’ as our explanatory variable.

6. Conducting Linear Regression

Using the lm() function, we assign it to “model”. lm() is used to fit linear regression models. As mentioned before, tripduration_hour is our dependent variable that we want to understand how factors influence it. Our independent variable is usertype_Subscriber to determine the impact of subscribership on trip duration. There is an underlying assumption of a linear relationship between subscribership and trip duration.

model <- lm(tripduration_hour ~ usertype_Subscriber, data = dat_filtered)
summary(model)

## 
## Call:
## lm(formula = tripduration_hour ~ usertype_Subscriber, data = dat_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.0975  -4.5308  -0.5641   4.3948  12.9448 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         14.11415    0.01424   991.0   <2e-16 ***
## usertype_Subscriber -3.05898    0.01780  -171.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.684 on 442536 degrees of freedom
## Multiple R-squared:  0.06255,    Adjusted R-squared:  0.06254 
## F-statistic: 2.953e+04 on 1 and 442536 DF,  p-value: < 2.2e-16

7. Interpreting Results

Observing the model summary, we can see critical information that tells us about usertype on trip duration.

Dependent Variable coefficient (14.11415) shows the average trip duration for non-subscribers, ~14 hours.

Subscribership coefficient (-.3.05898) reveals there is a decrease in trip duration among subscribers by about 3 hours.

Residual standard error (5.684) tells us that the average deviation will from the true line will be about 5.68* hours off.

Multiple R-squared (0.062550) shows us that only approximately 6.255% of the variance in trip duration is explained by the user type.

Adjusted R-squared (0.06254) confirms the minimal impact of using only one predictor in our model.

ggplot(dat_filtered, aes(x = usertype_Subscriber, y = tripduration_hour)) +
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(x = "User Type Subscriber (0 = Customer, 1 = Subscriber)", y = "Trip Duration (hours)", title = "Trip Duration by User Type with Regression Line") +
  theme_minimal()

# Calculate residuals from the model
dat_filtered$residuals <- residuals(model)

ggplot(dat_filtered, aes(x = usertype_Subscriber, y = residuals)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(x = "User Type Subscriber (0 = Customer, 1 = Subscriber)", y = "Residuals", title = "Residuals of the Linear Model by User Type") +
  theme_minimal()

8. Reporting Findings

Our finding suggest that usertype actually results in a decrease in total trip duration. There are several limitations within this example though that should be mentioned. The data, although matched with the source, still contained seemingly large trip durations. The results could be skwewed because of the time of year the data is from, or many other factors. For now, we know that, based on this data, there is a descrease is trip duration for subscribers. There are many other variables within this data set that could be used to run a regression. It would be wise to do additional control variables that both correlate with our policy variable as well as thoughs that are not as to reduce standard errors and help remove bias associated with our policy variable.

Works Cited

This code through references and cites the following sources:

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.
Robinson D, Hayes A, Couch S (2023). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.5, https://CRAN.R-project.org/package=broom.
Kaplan J (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. R package version 1.7.3, https://CRAN.R-project.org/package=fastDummies.
Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL https://www.jstatsoft.org/v40/i03/.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.

Running a Linear Regression A-Z

Sean West

30 December 2024