Introduction

Concerns about the detrimental impact of vehicle exhaust emissions on health and the environment are leading an increasing number of people to forgo cars in favor of alternative forms of transportation such as bikes. Not only do bikes have a much lower carbon footprint, but renting one has never been easier: bike sharing systems have made the entire process, from rental to return, as automatic and hassle-free as possible. This dataset contains information about bike rentals with related weather and seasonal information for 2011 and 2012. It was obtained from https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

This exploratory analysis seeks to uncover patterns in bike rentals to better understand how people in Washington, DC use bikes.

Univariate Plots Section

## [1] 17379    17
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "hr"         "holiday"    "weekday"    "workingday" "weathersit"
## [11] "temp"       "atemp"      "hum"        "windspeed"  "casual"    
## [16] "registered" "cnt"
## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
##     instant             dteday          season            yr        
##  Min.   :    1   2011-01-01:   24   Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   2011-01-08:   24   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   2011-01-09:   24   Median :3.000   Median :1.0000  
##  Mean   : 8690   2011-01-10:   24   Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034   2011-01-13:   24   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379   2011-01-15:   24   Max.   :4.000   Max.   :1.0000  
##                  (Other)   :17235                                   
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##                                                                    
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##                                                                   
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##                                                                    
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0  
## 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The distribution of bike count (cnt) is skewed to the right, which motivates a log transformation. Each row is one hour rather than one day, and the summary above shows a median of 142 rentals per hour and a mean of 189.5.
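To illustrate why a log transformation helps with a right-skewed count, here is a minimal sketch. It uses simulated counts rather than the real cnt column, and the `skewness` helper and the simulation parameters are assumptions made only for this example.

```r
# Simulated right-skewed hourly counts; NOT the real bike data.
# Adding 1 keeps every value positive so that log() is defined.
set.seed(42)
cnt_sim <- round(rlnorm(1000, meanlog = 4.5, sdlog = 1)) + 1

# Sample skewness as the mean cubed z-score (hypothetical helper).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(cnt_sim)
skew_log <- skewness(log(cnt_sim))

# A large positive value indicates a long right tail.
print(c(raw = skew_raw, log = skew_log))
```

On data like this, the skewness of the log-transformed values should sit much closer to zero than that of the raw counts.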

I am curious to see the difference between registered ridership and casual ridership.

It appears that there are far more rentals by registered users than by casual users (a mean of 153.8 versus 35.68 per hour). Before moving on to bivariate analysis, I’d like to see how temperature and humidity are distributed.

The temp variable is normalized to the range 0 to 1, so the distribution is not surprising. Humidity appears slightly skewed, with a median of 0.63, which suggests that Washington, DC tends toward high humidity.

Univariate Analysis

What is the structure of your dataset?

There are 17379 observations and 17 variables. Each observation covers one hour of one day.

What is/are the main feature(s) of interest in your dataset?

Total bike rentals (cnt)

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect bike rentals will vary depending on features such as the time of the day, temperature, day of the week, and month of the year.

Did you create any new variables from existing variables in the dataset?

I converted several existing variables into factor variables with more descriptive labels. Based on my analysis in the subsequent sections, I realized that there are certain peak hours in the morning and evening during which bike rentals are much higher. To reflect this in more informative visualizations, I created new categorical variables for hour of the day (e.g., Peak Morning/Peak Evening) as well as temperature (e.g., Very Cold/Cool/Warm).
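A sketch of how such categorical variables could be built with `cut()`. The bin edges and most of the labels below are hypothetical, chosen only for illustration, and the input vectors are toy values rather than the real hr and temp columns.

```r
# Toy hours and normalized temperatures; NOT the real data.
hr   <- c(0, 8, 12, 17, 22)
temp <- c(0.10, 0.30, 0.55, 0.80, 0.95)

# Hypothetical bin edges and labels.
# right = FALSE makes each interval [lower, upper).
hr_cat <- cut(hr,
              breaks = c(0, 7, 10, 16, 20, 24),
              labels = c("Early Morning", "Peak Morning", "Midday",
                         "Peak Evening", "Night"),
              right = FALSE)

# Default intervals are (lower, upper]; include.lowest keeps temp == 0.
temp_cat <- cut(temp,
                breaks = c(0, 0.25, 0.5, 0.75, 1),
                labels = c("Very Cold", "Cool", "Warm", "Hot"),
                include.lowest = TRUE)

print(data.frame(hr, hr_cat, temp, temp_cat))
```

Because `cut()` returns a factor whose levels follow the order of the breaks, the categories stay in a sensible order on plot axes.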

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Bike count has a right skewed distribution, so a log transformation is needed to normalize the data. Later on, a log transformation will be needed when building a linear model as well.

Bivariate and Multivariate Plots Section

## Warning: position_dodge requires constant width: output may be incorrect

Unsurprisingly, rentals increase as the weather gets warmer.

## Warning: position_dodge requires constant width: output may be incorrect

Rentals appear to peak in the morning at 8-9 am and in the evening at 5-7 pm. This seems to coincide with the start and end of the workday. It is reasonable to expect that a major proportion of rentals at these times come from registered users.

As expected, registered rentals match the overall peaks. I wonder if there is such a pattern for casual users?

Casual ridership peaks during the day at non-peak hours. This is not too surprising. People who ride bikes to work every day would likely be registered users.

I’m curious as to how ridership varies throughout the year. Are there more rentals in certain months? I expect rentals to increase from spring onward and to decrease as winter approaches and the weather gets colder.

##       Jan       Feb       Mar       Apr       May      June       Jul 
##  94.42477 112.86503 155.41073 187.26096 222.90726 240.51528 231.81989 
##       Aug       Sep       Oct       Nov       Dec 
## 238.09763 240.77314 222.15851 177.33542 142.30344

The trend is as I expected, but there are a few surprises. Why is there a drop in rentals in July? Does the weather get unbearably hot for riding bikes? July is also the beginning of the holiday season, so perhaps people take vacations and this causes a drop in registered rentals. I am curious to see whether casual and registered users differ in July.

##      Jan      Feb      Mar      Apr      May     June      Jul      Aug 
##  85.9979 101.7069 125.2383 144.9492 172.3125 189.1917 179.2950 189.2576 
##      Sep      Oct      Nov      Dec 
## 191.8358 180.9731 151.8636 127.6757

##       Jan       Feb       Mar       Apr       May      June       Jul 
##  8.426872 11.158091 30.172437 42.311761 50.594758 51.323611 52.524866 
##       Aug       Sep       Oct       Nov       Dec 
## 48.840000 48.937370 41.185389 25.471816 14.627782

I am intrigued by the results. Registered ridership drops in July, while casual ridership actually increases. This is consistent with my hypothesis: perhaps registered users, who ride their bikes to work every day, go on vacation, and because registered users outnumber casual users, overall rentals drop. On the other hand, tourists visiting the city during July might make use of the bike sharing platform, which could explain the rise in casual ridership.
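Per-month averages like the ones above can be produced with a grouped mean. The following is a minimal sketch using `tapply()` on a toy data frame. The column names mirror the dataset, but the values are made up and this is not necessarily the exact code behind the tables above.

```r
# Toy data frame with the dataset's column names; values are made up.
bikes <- data.frame(
  mnth       = c(6, 6, 7, 7, 8, 8),
  casual     = c(50, 52, 55, 51, 47, 49),
  registered = c(190, 188, 180, 178, 189, 191)
)

# Mean hourly rentals per month for each rider type.
casual_by_month     <- tapply(bikes$casual, bikes$mnth, mean)
registered_by_month <- tapply(bikes$registered, bikes$mnth, mean)

# Stack the two named vectors so the months line up column by column.
print(rbind(casual = casual_by_month, registered = registered_by_month))
```

Putting the two rows side by side makes it easy to spot a month where one group rises while the other falls.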

Now I want to examine ridership through the week. Are there more rentals on certain days?

##    Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
##  177.4688  183.7447  191.2389  191.1305  196.4367  196.1359  190.2098

Average rentals are lowest on Sundays and Mondays and, interestingly, peak on Thursdays and Fridays.

I want to check whether registered ridership is higher on working days. At the same time, I want to look into whether casual rentals are higher on weekends.

##    Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
##  56.16347  28.55345  23.58051  23.15919  24.87252  31.45879  61.24682
##    FALSE     TRUE 
## 35.40838 44.71800

##    Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
##  121.3054  155.1912  167.6584  167.9713  171.5641  164.6771  128.9630
##    FALSE     TRUE 
## 155.0202 112.1520

The results are as I expected. There is an overall drop in rentals on weekends, likely because registered users do not ride bikes to and from work. Now I want to compare ridership between years.

##     2011     2012 
## 143.7944 234.6664

There is a dramatic increase in rentals in 2012 compared to 2011. Perhaps the bike sharing platform became more popular. I wonder if this trend continued to 2013 and onward?
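To put a number on the increase, the relative change can be computed directly from the two yearly figures printed above, assuming they are mean hourly rentals. The variable names are introduced here for illustration.

```r
# Yearly figures copied from the output above (assumed to be hourly means).
mean_2011 <- 143.7944
mean_2012 <- 234.6664

# Relative change from 2011 to 2012, in percent.
pct_change <- (mean_2012 - mean_2011) / mean_2011 * 100
print(round(pct_change, 1))
```

This works out to an increase of roughly 63% from 2011 to 2012.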

I want to get further insight into how bike rentals vary with the time of the day as well as with temperature and year.

Rentals are highest during the peak hours in warm weather. I want to see if this trend holds for both 2011 and 2012, as well as for all days of the week.

It appears to be the case. An interesting result is that there appears to be an increase in ridership on Sunday evenings and Friday mornings.

Bivariate and Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is an overall increase in bike rentals in warm weather (and warm months). On weekdays, there are certain hours in the morning and evening when bike rentals peak. In building any predictive model, time of the day and weather information would be essential for predicting bike count. There are very few rentals late at night and early in the morning.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Casual ridership is scattered throughout the day, while registered ridership (by far the majority) peaks at the start and end of the working day. Interestingly, casual ridership increases in July, while registered ridership drops.

What was the strongest relationship you found?

Bike rentals vary most strongly with temperature and with hour of the day.


Final Plots and Summary

Plot One

Description One

Bike ridership strongly correlates with temperature. In the very cold winter months, rentals are few, while they reach their peaks during cool and warm weather, before dropping once again as the weather starts to become unbearably hot.

Plot Two

Description Two

Bike rentals rise steadily from January, peak in the summer months as the weather gets warmer, and fall as winter approaches. Interestingly, there is a dip in rentals in July and August before they pick up again in September.

Plot Three

Description Three

Bike count fluctuates throughout the day, peaking in the morning and evening. This coincides with the start and end of the workday. Rentals late at night and in the early hours of the morning are very low.


Reflection

This was an interesting dataset to work with. There was no tidying to do and the preprocessing was minimal: I made certain variables categorical and gave them more descriptive labels and ranges. Exploring the data revealed many intriguing insights, some obvious and others less so. Riding bikes is becoming an increasingly popular form of transportation across the nation. Datasets like this give bike sharing services the chance to optimize their resources by looking into questions such as: What are the best times to perform maintenance work? How many bikes should be stocked on a particular day at a particular hour? For example, knowing how bike rentals peak during the start and end of the work day will allow the company to stock bikes appropriately and ensure registered users (as well as potential customers) are never short of a bike. This data can be used to improve transportation systems across the city.

Future exploration of this dataset would include building a linear model to quantify the variation in bike rentals based on predictors such as hour of the day and temperature. Beyond that, with access to location information of different bike rental stations, analysis can be done into how rental patterns vary across the city. Also, a more detailed analysis can be done on individual days and months. For example, do rentals increase on holidays like the 4th of July? What about New Year’s Eve?
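As a rough sketch of what such a linear model might look like, the example below fits `lm()` to the log of the count on simulated data. Everything here is an assumption made for illustration: the data-generating process, the `peak` indicator used in place of the full hour variable, and the coefficients. It is not fitted to the real dataset.

```r
# Simulated hourly data; NOT the real dataset.
set.seed(1)
n    <- 500
hr   <- sample(0:23, n, replace = TRUE)
temp <- runif(n)                      # normalized temperature in [0, 1]

# Assumed process: more rentals at commute hours and in warmer weather.
peak <- hr %in% c(8, 17, 18)
cnt  <- round(exp(3 + 1.2 * peak + 1.5 * temp + rnorm(n, sd = 0.3))) + 1

sim <- data.frame(cnt, temp, peak)

# Linear model on the log scale, motivated by the right-skewed counts.
fit <- lm(log(cnt) ~ peak + temp, data = sim)
print(round(coef(fit), 2))
```

On the log scale each coefficient describes an approximate multiplicative effect on rentals, so positive estimates for `peakTRUE` and `temp` would correspond to higher rentals at commute hours and in warmer weather.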