1. Merging Datasets from multiple CSVs

# Load libraries (tidyverse already attaches dplyr)
library(tidyverse)
# Set the working directory
setwd("~/STAA57_Project")
# List every file whose name ends in .csv; the pattern is a regular expression,
# so the unescaped ".csv" would also match names such as "mycsv_notes.txt"
csv_files <- list.files(pattern = "\\.csv$")
# Read each CSV file, then bind all of them into one data frame in a single step
# (this avoids growing the data frame inside a loop, which copies it every time)
bikeshare_data <- do.call(rbind, lapply(csv_files, read.csv, header = TRUE))

2. Description of the Dataset

names(bikeshare_data)
## [1] "trip_id"               "trip_duration_seconds" "from_station_id"      
## [4] "trip_start_time"       "from_station_name"     "trip_stop_time"       
## [7] "to_station_id"         "to_station_name"       "user_type"
  1. Trip ID: A unique identifier for each trip
  2. Trip Duration (Seconds): The length of the trip in seconds
  3. From Station ID: The unique identifier for the starting station of the trip
  4. Trip Start Time: The date and time the trip began
  5. From Station Name: The name of the starting station for the trip
  6. Trip Stop Time: The date and time the trip ended
  7. To Station ID: The unique identifier for the ending station of the trip
  8. To Station Name: The name of the ending station for the trip
  9. User Type: A categorical variable indicating whether the user is an “Annual Member” (annual pass holder of the bikeshare program) or a “Casual Member” (24- or 72-hour pass holder of the bikeshare program)

3. The Background of the Data

The bikeshare data for 2018 was collected by the City of Toronto, specifically by the Transportation Services division. The data was collected in the context of Bike Share Toronto, a public bicycle sharing system in Toronto, Canada. Bike Share Toronto provides access to bicycles at stations throughout the city, allowing users to rent a bike for short trips and return it to any station within the system. The data includes information about each bike trip, such as trip duration, start and end stations, and user type. It is publicly available through the City of Toronto’s Open Data portal for research and analysis purposes.

The dataset can be used for analysis and insights into the usage patterns and trends of the bikeshare program in Toronto during the year 2018. It is typically used by researchers, policymakers, and urban planners to analyze bike usage patterns, evaluate the effectiveness of the bikeshare program, and inform transportation planning and policy decisions. Overall, the bikeshare data for 2018 is a valuable resource for investigating bike usage patterns, user behavior, and system performance in Toronto, and can contribute to evidence-based decision making in transportation planning and policy.

4. What is the overall research question?

As researchers, our overall aim is to investigate the usage patterns and trends in the Toronto bikeshare data for the year 2018. We analyze variables such as trip duration, trip start time, user type, and station information to understand how different user groups used the bikeshare system and to identify any patterns or trends in the data. Specifically, our research questions are:

  1. What are the summary statistics for trip duration by user type, and how do they compare?
  2. How does the total number of trips taken by each user type vary by month?
  3. What are the top 10 most popular starting stations for bike rides in the dataset?
  4. How many trips lasted 30 minutes or more, and how many lasted 45 minutes or more, for each user type?
  5. What is the trend of total trips by hour of day, and how does it relate to peak usage times?
  6. How does monthly trip volume vary by user type, and what is the overall trend over the year?

5. Tables

5.1 Trip Duration (in Seconds) Summary By User Type

Summary of Trip Duration (in seconds) by User Type

user_type      Number_Trips  Average_duration  Median_duration  Min_duration  Max_duration
Annual Member       1572980          725.0167              600            60         55077
Casual Member        349975         2032.4958             1220            60         54971

From the table, we can see that the “Annual Member” user type has a far larger number of trips (1,572,980) than the “Casual Member” user type (349,975). However, the “Casual Member” user type has a much higher average trip duration (2,032.5 seconds, about 34 minutes) than the “Annual Member” user type (725.0 seconds, about 12 minutes). The median duration for both user types (1,220 and 600 seconds, respectively) is lower than the average, which indicates that a small share of very long trips skews the average upwards.
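A summary like the one above can be produced with a grouped dplyr summary. The following is a minimal sketch on a made-up data frame; `toy_trips` and its values are assumptions for illustration only, while the column names follow the dataset description in Section 2.

```r
library(dplyr)

# Toy stand-in for bikeshare_data (values are made up for illustration)
toy_trips <- data.frame(
  user_type = c("Annual Member", "Annual Member", "Annual Member",
                "Casual Member", "Casual Member", "Casual Member"),
  trip_duration_seconds = c(300, 600, 1500, 900, 1200, 6000)
)

# One row per user type with the count and basic duration statistics
duration_summary <- toy_trips %>%
  group_by(user_type) %>%
  summarise(
    Number_Trips     = n(),
    Average_duration = mean(trip_duration_seconds),
    Median_duration  = median(trip_duration_seconds),
    Min_duration     = min(trip_duration_seconds),
    Max_duration     = max(trip_duration_seconds)
  )

print(duration_summary)
```

In the toy data a single 6,000-second casual trip pulls the casual average well above its median, which is the same skew effect seen in the real table.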

5.2 Table of Total Trips by Month for each user type

Total Trips by Month for each user type

Month  Annual Member  Casual Member
Jan            42469           1390
Feb            47276           2455
Mar            78564           6405
Apr            82194          12589
May           160989          51761
Jun           186463          64374
Jul           215835          70481
Aug           209770          71449
Sep           207789          47212
Oct           160326          15553
Nov           100675           3612
Dec            80630           2694

The Table of Total Trips by Month for each user type displays the total number of trips taken by annual members and casual members for each month of the year. For example, in January, there were 42,469 trips taken by annual members and 1,390 trips taken by casual members. In February, there were 47,276 trips taken by annual members and 2,455 trips taken by casual members. This continues for each month of the year, showing the total number of trips taken by each user type.

This table can provide insights into how usage of the bike share service varies by month and by user type. For example, it appears that usage by both annual and casual members increases during the warmer months, as the number of trips taken by both groups is much higher in the summer months (June through August) compared to the winter months (December through February). It also appears that annual members take many more trips overall than casual members.

For Annual Members, the month with the highest number of trips is July, with 215,835 trips. The month with the lowest number of trips is January, with 42,469 trips. For Casual Members, the month with the highest number of trips is August, with 71,449 trips. The month with the lowest number of trips is January, with 1,390 trips.
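The monthly counts and the busiest month per user type can be derived as sketched below. The timestamp format `"%m/%d/%Y %H:%M"` is an assumption about how `trip_start_time` is stored and should be checked against the actual files; `toy_trips` is made-up data.

```r
library(dplyr)

# Toy stand-in for bikeshare_data (made-up trips; timestamp format is assumed)
toy_trips <- data.frame(
  trip_start_time = c("1/5/2018 8:15", "1/20/2018 17:40", "7/4/2018 9:05",
                      "7/9/2018 8:50", "7/4/2018 14:30", "7/15/2018 13:10",
                      "7/21/2018 11:00"),
  user_type = c("Annual Member", "Casual Member", "Annual Member",
                "Annual Member", "Casual Member", "Casual Member",
                "Casual Member")
)

# Parse the start time, extract the month number, and count trips per month and user type
monthly_trips <- toy_trips %>%
  mutate(
    start = as.POSIXct(trip_start_time, format = "%m/%d/%Y %H:%M", tz = "UTC"),
    month = as.integer(format(start, "%m"))
  ) %>%
  count(month, user_type, name = "trips")

# Keep the busiest month for each user type
peak_months <- monthly_trips %>%
  group_by(user_type) %>%
  filter(trips == max(trips)) %>%
  ungroup()

print(monthly_trips)
print(peak_months)
```

Reading the maximum and minimum off a computed table like `peak_months` is less error-prone than scanning the twelve rows by eye.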

5.4 Trips Lasting 30 Minutes or More and 45 Minutes or More

Trip Duration more than 30 minutes and 45 minutes.

user_type      >= 30 mins  >= 45 mins
Annual Member       32647       10364
Casual Member       87792       51791

The table “Trip Duration more than 30 minutes and 45 minutes” shows the number of trips taken by each user type (Annual Member and Casual Member) that lasted 30 minutes or more and 45 minutes or more, respectively.

For Annual Members, there were 32,647 trips that lasted 30 minutes or more, and 10,364 trips that lasted 45 minutes or more. For Casual Members, there were 87,792 trips that lasted 30 minutes or more, and 51,791 trips that lasted 45 minutes or more.

This information could be useful for analyzing the usage patterns of different user types and identifying potential areas for improvement in the bike sharing system. For example, if a significant number of casual users are taking longer trips, the bike sharing company could consider offering different pricing plans or incentives to encourage shorter trips and more frequent use. Additionally, the data could help inform decisions about bike station placement and bike fleet size based on usage patterns.
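Counting trips at or above a duration threshold amounts to summing a logical condition within each group. A minimal sketch, assuming durations are stored in seconds as in `trip_duration_seconds`; the data frame `toy_trips` is made up.

```r
library(dplyr)

# Toy stand-in for bikeshare_data (made-up durations in seconds)
toy_trips <- data.frame(
  user_type = rep(c("Annual Member", "Casual Member"), each = 4),
  trip_duration_seconds = c(600, 1800, 900, 2700,
                            3000, 1200, 2699, 5400)
)

# sum() of a logical vector counts the TRUE values; mean() gives the share
long_trips <- toy_trips %>%
  group_by(user_type) %>%
  summarise(
    at_least_30_min = sum(trip_duration_seconds >= 30 * 60),
    at_least_45_min = sum(trip_duration_seconds >= 45 * 60),
    share_30_min    = mean(trip_duration_seconds >= 30 * 60)
  )

print(long_trips)
```

Reporting the share alongside the raw count makes the two user types comparable even though their total trip counts differ greatly.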

6. Graphs

6.1 Line Chart of Total Trips by Hour

The Line Chart of Total Trips by Hour displays the total number of trips taken by hour of day for the bikeshare data in 2018. The x-axis represents the hour of the day in a 24-hour format, while the y-axis represents the total number of trips taken during that hour.

The line shows the trend of the total number of trips throughout the day. It appears that the number of trips starts to increase from around 5 AM, peaks around 8 AM, drops gradually until noon, picks up again in the afternoon until around 5 PM, and then starts to decrease steadily until midnight.

This pattern suggests that the bikeshare service is used largely for commuting to work or school, with higher demand during the morning and evening rush hours. Because the chart pools all days of the week, the weekday and weekend patterns are examined separately in the heatmap in Section 6.3. The chart could help the bikeshare company plan bike availability and schedule maintenance around peak usage times.
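The hourly totals behind a chart like this can be computed by extracting the hour from each start time. This is a sketch under assumptions: the timestamp format `"%m/%d/%Y %H:%M"` is assumed, the times are made up, and base graphics are used here rather than whatever plotting code produced the original figure.

```r
# Made-up start times (the timestamp format is an assumption)
toy_times <- c("6/4/2018 8:05", "6/4/2018 8:40", "6/4/2018 12:15",
               "6/4/2018 17:10", "6/4/2018 17:55", "6/5/2018 8:20")
start <- as.POSIXct(toy_times, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Hour of day as an integer from 0 to 23
hour_of_day <- as.integer(format(start, "%H"))

# tabulate() counts positive integers, so shift hours by 1;
# nbins = 24 keeps hours with zero trips in the result
trips_by_hour <- data.frame(
  trip_start_hour = 0:23,
  num_trips = tabulate(hour_of_day + 1, nbins = 24)
)

# Hour with the most trips
peak_hour <- trips_by_hour$trip_start_hour[which.max(trips_by_hour$num_trips)]

# Line chart of trips by hour
plot(trips_by_hour$trip_start_hour, trips_by_hour$num_trips, type = "l",
     xlab = "Hour of day", ylab = "Total trips")
```

Keeping all 24 hours in the table, including those with no trips, prevents the line from skipping over quiet hours.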

6.2 Stacked bar plot of trip volume by user type and month

The stacked bar plot shows the trend of monthly trip volume by user type for the Bikeshare data in 2018. The X-axis represents the months, and the Y-axis represents the total number of trips. The bars are stacked based on the user type, with casual users in red and annual members in blue.

From this plot, we can see that the number of trips taken by Annual Members is consistently higher than the number of trips taken by Casual Members throughout the year. We can also observe that the number of trips taken by both user types peaks in the summer months (June to August) and declines during the winter months (December to February). This suggests that the bikeshare program is more popular during the warmer months of the year.

Overall, this plot provides an overview of the distribution of trips by user type over the months, which can help stakeholders make informed decisions about their marketing strategies and resource allocation.

6.3 Heatmap of trips by day of the week

This heatmap represents the trend of the number of bike trips taken on each day of the week and at each hour of the day. The x-axis of the heatmap represents the hour of the day, while the y-axis represents the day of the week. The cells of the heatmap are colored according to the number of bike trips that occurred at each hour of the day on each day of the week. The color scale ranges from yellow (indicating low trip volumes) to red (indicating high trip volumes).

From the heatmap, we can observe that bike usage is highest on weekdays, particularly Tuesday, Wednesday, and Thursday. On these days, usage peaks during the morning commute (7-9 AM) and the evening commute (4-6 PM). On weekends, usage is lower overall and follows a different pattern: it is highest in the afternoon, from about 12 PM to 4 PM on both Saturday and Sunday.

Overall, this heatmap provides an insight into the bike usage pattern during different times of the week, which can help the bikeshare company to plan its resources and services accordingly.
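The cell values of such a heatmap are a two-way count of trips by day of week and hour. A minimal sketch with made-up start times; the timestamp format is an assumption, and the day labels are built from the numeric weekday so that the result does not depend on the system locale.

```r
# Made-up start times (the timestamp format is an assumption)
toy_times <- c("6/4/2018 8:05", "6/4/2018 8:30", "6/5/2018 17:20",
               "6/9/2018 13:45", "6/10/2018 14:10")
start <- as.POSIXlt(as.POSIXct(toy_times, format = "%m/%d/%Y %H:%M", tz = "UTC"))

# POSIXlt stores the weekday as 0 (Sunday) to 6 (Saturday)
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
day  <- factor(day_names[start$wday + 1], levels = day_names)
hour <- factor(start$hour, levels = 0:23)

# 7 x 24 table of trip counts: rows are days, columns are hours
trips_day_hour <- table(day, hour)

print(trips_day_hour[, c("8", "13", "14", "17")])
```

The resulting table can be converted with `as.data.frame()` and passed to a tile-based plot such as ggplot2's `geom_tile()` to draw the heatmap.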

6.4 Line plot of trip volume over time

This line chart shows the total number of trips taken each month over the course of the year 2018, based on the data in the “bikeshare_data” dataset. The x-axis represents the months of the year, and the y-axis represents the total number of trips taken. Each point on the line represents the total number of trips taken in a given month. The blue line connects these points to show the trend in trip volume over the year.

From the chart, we can see that the number of trips generally increased from January to July, with the highest number of trips in July, followed by a gradual decrease in the number of trips from August to December. The chart also shows a dip in the number of trips in November and December, which may be due to seasonal factors such as colder weather. Overall, the chart provides an overview of the trend in trip volume over the course of the year 2018.

7. Hypothesis Testing

Trip duration for trips taken on weekdays versus weekends

We would like to determine if there is a significant difference in trip duration between trips taken on weekdays versus weekends.

Hypothesis:

  • \(H_0\): The average trip duration on weekdays (\(\mu_d\)) is equal to the average trip duration on weekends (\(\mu_e\)), i.e., \(\mu_d=\mu_e\)
  • \(H_1\): The average trip duration on weekdays (\(\mu_d\)) is not equal to the average trip duration on weekends (\(\mu_e\)), i.e., \(\mu_d\neq\mu_e\)

To test this hypothesis, we can use a two-sample t-test to compare the mean trip duration of weekday trips with the mean trip duration of weekend trips. We can split the data into two groups: weekday trips (Monday to Friday) and weekend trips (Saturday and Sunday).

We can then calculate the p-value associated with the t-statistic, which is the probability of observing a t-value as extreme or more extreme than the one calculated, assuming the null hypothesis is true. If the p-value is less than our significance level (typically 0.05), we can reject the null hypothesis and conclude that there is a significant difference in trip duration between weekday and weekend trips.

We can also calculate a confidence interval for the difference between the means to estimate the range of values in which the true difference between the means is likely to fall.

Overall, this hypothesis testing approach allows us to determine whether there is a statistically significant difference in trip duration between trips taken on weekdays versus weekends, and provides us with an estimate of the magnitude of this difference.

## 
##  Welch Two Sample t-test
## 
## data:  weekdays$trip_duration_seconds and weekends$trip_duration_seconds
## t = -112.86, df = 579131, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -383.8696 -370.7647
## sample estimates:
## mean of x mean of y 
##  873.7795 1251.0967

The Welch two-sample t-test shows a statistically significant difference between the mean trip durations on weekdays and weekends. The p-value (< 2.2e-16) is far below the significance level of 0.05, so we reject the null hypothesis of no difference. The 95% confidence interval for the weekday-minus-weekend difference (-383.8696, -370.7647) does not include 0, which supports the same conclusion. Specifically, the mean trip duration on weekends (1251.0967 seconds) is about 377 seconds, or roughly 6 minutes, longer than on weekdays (873.7795 seconds).
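The test above can be reproduced in outline as follows. This is a sketch on simulated data, not the report's actual code: the durations are drawn from made-up exponential distributions, and the weekday/weekend split is based on the numeric day of week of a simulated start time.

```r
set.seed(57)

# Simulated trips over four weeks starting Monday 2018-06-04 (made-up data)
n <- 400
start <- as.POSIXct("2018-06-04 00:00:00", tz = "UTC") +
  floor(runif(n, 0, 28 * 24 * 3600))

# wday is 0 for Sunday and 6 for Saturday
is_weekend <- as.POSIXlt(start)$wday %in% c(0, 6)

# Made-up durations in seconds, longer on average on weekends
trip_duration_seconds <- rexp(n, rate = 1 / ifelse(is_weekend, 1250, 875))

weekday_durations <- trip_duration_seconds[!is_weekend]
weekend_durations <- trip_duration_seconds[is_weekend]

# t.test() defaults to var.equal = FALSE, which is the Welch test
result <- t.test(weekday_durations, weekend_durations)

result$p.value    # two-sided p-value
result$conf.int   # 95% CI for mean(weekday) - mean(weekend)
result$estimate   # the two sample means
```

The confidence interval is for the first mean minus the second, so a negative interval means the second group (weekends here) has the longer average duration.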

8. Bootstrapping

Estimating the average trip duration for casual members

We could use bootstrapping to estimate the sampling distribution of the mean and compute a confidence interval.

## Mean trip duration for casual members: 2032.496
## 95% confidence interval: (  2022.838 - 2042.171  )

Based on the result, the mean trip duration for casual members is 2032.496 seconds, which matches the casual-member average in Section 5.1. The 95% bootstrap confidence interval is (2022.838, 2042.171) seconds, so we are 95% confident that the true mean trip duration for casual members falls within this range.
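A bootstrap interval of this kind can be sketched as follows. The report does not state which bootstrap interval was used, so the percentile method below is an assumption, as are the number of resamples and the simulated durations.

```r
set.seed(57)

# Made-up casual-member durations in seconds (stand-in for the real column)
casual_durations <- rexp(500, rate = 1 / 2000)

# Resample the data with replacement and record the mean of each resample
n_boot <- 2000
boot_means <- replicate(
  n_boot,
  mean(sample(casual_durations, size = length(casual_durations), replace = TRUE))
)

# Percentile interval: the middle 95% of the bootstrap means
ci <- quantile(boot_means, probs = c(0.025, 0.975))

cat("Mean trip duration:", mean(casual_durations), "\n")
cat("95% confidence interval: (", ci[1], ",", ci[2], ")\n")
```

Each resample has the same size as the original sample, so the spread of `boot_means` approximates the sampling variability of the mean.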

9. Non-Linear Regression Analysis

Analyzing the relationship between the number of trips and hour of the day.

In this case, we treat the hour of the day as a continuous variable and fit a cubic polynomial regression with the number of trips as the dependent variable and the hour of the day as the independent variable. The model is non-linear in the hour of the day but linear in its coefficients, which is why it can be fitted with lm(). The resulting model describes the shape and strength of the relationship between the hour of the day and the number of trips.

## 
## Call:
## lm(formula = num_trips ~ poly(trip_start_hour, 3, raw = TRUE), 
##     data = trips_by_hour)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -38269 -21133  -9337   9561 103476 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                           11345.41   25688.88   0.442  0.66348   
## poly(trip_start_hour, 3, raw = TRUE)1 -5051.71    9882.68  -0.511  0.61483   
## poly(trip_start_hour, 3, raw = TRUE)2  2225.45    1011.04   2.201  0.03964 * 
## poly(trip_start_hour, 3, raw = TRUE)3   -86.35      28.87  -2.991  0.00721 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36500 on 20 degrees of freedom
## Multiple R-squared:  0.6909, Adjusted R-squared:  0.6445 
## F-statistic:  14.9 on 3 and 20 DF,  p-value: 2.501e-05

Explanation

The overall F-test (p-value 2.501e-05) indicates a significant relationship between the number of bike trips and the hour of the day under a third-order polynomial model. The coefficient of determination (R-squared) of 0.69 indicates that the model explains 69% of the variation in the hourly trip counts. The p-values for the quadratic and cubic terms are both below 0.05, suggesting that the non-linear terms are needed to describe the relationship. The linear term is not significant (p = 0.615), but in a polynomial model this coefficient only describes the slope at hour 0, so its p-value on its own says little about the overall relationship.

Interpreting the Regression Parameters

The regression parameters in this case refer to the coefficients of the independent variables in the multiple regression equation.

For this non-linear regression model, we have the following coefficients:

  • The intercept represents the expected value of the dependent variable (num_trips) when trip_start_hour is zero, that is, for the hour starting at midnight. In this case, the intercept is 11345.41, although it is not significantly different from zero (p = 0.663).

  • The coefficient of the first-order term (trip_start_hour) is -5051.71. Because the model also contains squared and cubed terms, this is not the change in num_trips per additional hour across the whole day; it is only the slope of the fitted curve at hour 0.

  • The coefficient of the second-order term (trip_start_hour squared) is positive (2225.45), which makes the fitted curve bend upward in the early part of the day, so predicted trips rise at an increasing rate through the morning.

  • The coefficient of the third-order term (trip_start_hour cubed) is negative (-86.35). This term dominates at later hours and bends the curve back down: the fitted curve switches from convex to concave at about hour 8.6, reaches its maximum near hour 16, and declines afterwards.

Overall, this regression suggests that the relationship between start hour and the number of trips is not linear but curvilinear: predicted trips rise through the morning and early afternoon, peak around hour 16, and fall thereafter. With an R-squared of 0.69, the polynomial model captures this broad shape. However, a cubic curve has at most one local maximum, so it cannot reproduce two separate rush-hour peaks such as those described in Section 6.1, and the fit should be read as a rough summary of the daily pattern.
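The shape of the fitted cubic can be checked directly from the coefficients printed in the summary output. The sketch below hard-codes those four estimates; the helper name `fitted_trips` is made up for illustration.

```r
# Coefficients as reported in the lm() summary above:
# intercept, linear, quadratic, cubic
b <- c(11345.41, -5051.71, 2225.45, -86.35)

# Fitted number of trips at hour h
fitted_trips <- function(h) b[1] + b[2] * h + b[3] * h^2 + b[4] * h^3

# Turning points: roots of the derivative b[2] + 2*b[3]*h + 3*b[4]*h^2
# (polyroot() expects coefficients in increasing order of power)
turning_points <- sort(Re(polyroot(c(b[2], 2 * b[3], 3 * b[4]))))

# Inflection point: where the second derivative 2*b[3] + 6*b[4]*h is zero
inflection <- -b[3] / (3 * b[4])

# Sign of the second derivative tells a minimum (+) from a maximum (-)
second_derivative <- 2 * b[3] + 6 * b[4] * turning_points

round(turning_points, 2)
round(inflection, 2)
sign(second_derivative)
```

With these coefficients the derivative has roots near hours 1.2 and 16.0; the second derivative is positive at the first and negative at the second, so the fitted curve has a minimum shortly after midnight and a single maximum in the late afternoon.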

Plotting the Result

10. Cross Validation

Analyzing the relationship between trip duration (in hours) and hour of the day

In this case, we can use cross-validation to analyze the relationship between trip duration and hour of the day in the bikeshare data.

The idea behind this approach is to split the dataset into two parts, with one part used to train the model and the other part used to validate it. We can then measure the accuracy of the model on the validation set and adjust the model as needed to improve its performance.

To use cross-validation to analyze the relationship between trip duration (in hours) and hour of the day, we start by splitting the dataset into a training set and a validation set. We then fit the polynomial regression model on the training set to predict trip duration from the hour of the day, and use the validation set to evaluate its performance.

Our approach is k-fold cross-validation, where the dataset is divided into k equally-sized subsets. We then train the model on k-1 of the subsets and use the remaining subset for validation. We repeat this process k times, so that each subset is used for validation once.

## The Average MSE is: 0.002516312

Conclusion

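The average MSE of 0.002516312 corresponds to a root mean squared error of about 0.05 hours (roughly 3 minutes), so predictions on held-out data are typically off by a few minutes. This suggests that the regression model predicts the average trip duration (in hours) from the hour of the day reasonably well. MSE depends on the scale of the response, however, so how good the fit is depends on how much trip duration varies across the day, and a small MSE on its own does not establish that trip duration depends on the hour of the day.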
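The k-fold procedure described above can be sketched as follows. Several details are assumptions, since the report does not show them: the data are simulated hourly averages, k is set to 6, and the model is taken to be a cubic polynomial as in Section 9.

```r
set.seed(57)

# Made-up average trip duration (in hours) for each hour of the day
toy_hourly <- data.frame(hour = 0:23)
toy_hourly$avg_duration_hr <- 0.25 + 0.05 * sin(pi * toy_hourly$hour / 24) +
  rnorm(24, sd = 0.02)

# Randomly assign each row to one of k equally sized folds
k <- 6
fold <- sample(rep(1:k, length.out = nrow(toy_hourly)))

# For each fold: fit on the other k - 1 folds, compute the MSE on the held-out fold
fold_mse <- sapply(1:k, function(i) {
  train <- toy_hourly[fold != i, ]
  test  <- toy_hourly[fold == i, ]
  fit   <- lm(avg_duration_hr ~ poly(hour, 3, raw = TRUE), data = train)
  mean((test$avg_duration_hr - predict(fit, newdata = test))^2)
})

avg_mse <- mean(fold_mse)
cat("The Average MSE is:", avg_mse, "\n")
cat("RMSE (same units as the response):", sqrt(avg_mse), "\n")
```

Reporting the RMSE next to the MSE puts the error back in hours, which is easier to compare with typical trip durations.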

Plotting the Result

11. Summary of Research

Based on the information presented in the report, the following are the key findings:

  1. The bikeshare service is mostly used by annual members, who take significantly more trips than casual members.
  2. The average trip duration for casual members (about 2,032 seconds) is much higher than that of annual members (about 725 seconds), and the median duration is also about twice as long (1,220 versus 600 seconds).
  3. Usage of the bikeshare service increases during the warmer months (June through August) compared to the winter months (December through February).
  4. The most popular starting station is “York St / Queens Quay W” with 24,017 trips.
  5. The hourly and weekday patterns suggest that the bikeshare service is used largely for commuting to work or school, with higher demand during rush hours in the morning and evening.
  6. Casual members tend to take longer trips than annual members: 87,792 of 349,975 casual trips (about 25%) lasted 30 minutes or more, compared with 32,647 of 1,572,980 annual-member trips (about 2%).
  7. The bikeshare company could consider offering different pricing plans or incentives to encourage shorter trips and more frequent use by casual members.

Overall, the report provides insights into the usage patterns and behavior of users of the bikeshare service, which could help inform decisions about bike station placement, bike fleet size, pricing plans, and incentives to encourage more usage.