Concerns about the detrimental impact of vehicle exhaust emissions on health and the environment are causing an increasing number of people to forgo cars in favor of alternative forms of transportation such as bikes. Not only do bikes have a much lower carbon footprint, but renting one has never been easier: bike sharing systems have made the entire process, from rental to return, as automatic and hassle-free as possible. This dataset contains hourly information about bike rentals, with related weather and seasonal information, for 2011 and 2012. It was obtained from https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
This exploratory analysis seeks to uncover patterns in bike rentals to get a better understanding of how people in Washington, DC use bikes.
## [1] 17379 17
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "casual"
## [16] "registered" "cnt"
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
## instant dteday season yr
## Min. : 1 2011-01-01: 24 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 2011-01-08: 24 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 2011-01-09: 24 Median :3.000 Median :1.0000
## Mean : 8690 2011-01-10: 24 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 2011-01-13: 24 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 2011-01-15: 24 Max. :4.000 Max. :1.0000
## (Other) :17235
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
##
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
##
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
##
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
##
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
The distribution of hourly bike counts is skewed to the right, which suggests applying a log transformation. From the summary above, the median is 142 and the mean is 189.5 rentals per hour.
I am curious to see the difference between registered ridership and casual ridership.
It appears that there are far more registered rentals than casual rentals (a mean of 153.8 versus 35.68 per hour in the summary above). Before moving on to bivariate analysis, I’d like to see how temperature and humidity are distributed.
The temp variable is normalized to a 0 to 1 scale, so the distribution is not surprising. Humidity appears to be a little skewed. It appears Washington, DC tends towards high humidity.
There are 17379 observations and 17 variables.
The main feature of interest is the total number of bike rentals (cnt).
I expect bike rentals will vary depending on features such as the time of the day, temperature, day of the week, and month of the year.
I converted several existing variables into factor variables with more descriptive labels. Based on my analysis in the subsequent sections, I realized that there are certain peak hours of the day in the morning and evening during which bike rentals are much higher. To reflect this using more informative visualizations, I created new categorical variables for hour of the day (e.g., Peak Morning/Peak Evening) as well as temperature (e.g., Very Cold/Cool/Warm).
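The code behind these conversions is not shown, so the following is only a minimal sketch of how they could be done in base R with `factor()` and `cut()`. The toy rows, the bin boundaries, and the labels other than Peak Morning, Peak Evening, Very Cold, Cool and Warm are assumptions made for illustration.

```r
# Toy rows shaped like the dataset; the values are illustrative only.
bikes <- data.frame(
  weekday = c(6, 0, 1, 2, 5),
  hr      = c(3, 8, 13, 18, 22),
  temp    = c(0.10, 0.34, 0.50, 0.66, 0.90)
)

# Integer code -> labelled factor (0 = Sunday, as in the tables in this report).
bikes$weekday_f <- factor(
  bikes$weekday, levels = 0:6,
  labels = c("Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday")
)

# Hour of day -> time-of-day category. Intervals are right-closed, so
# (6, 9] covers hours 7 to 9. The break points are assumed, not the author's.
bikes$hr_bin <- cut(
  bikes$hr, breaks = c(-1, 6, 9, 16, 19, 23),
  labels = c("Night", "Peak Morning", "Midday", "Peak Evening", "Late Evening")
)

# Normalized temperature -> category. The break points and the "Hot" label
# are assumed; include.lowest = TRUE keeps a temp of exactly 0 in the first bin.
bikes$temp_bin <- cut(
  bikes$temp, breaks = c(0, 0.25, 0.5, 0.75, 1),
  labels = c("Very Cold", "Cool", "Warm", "Hot"),
  include.lowest = TRUE
)

print(bikes)
```

Because `cut()` closes intervals on the right by default, a temperature of exactly 0.50 lands in the Cool bin rather than the Warm one; that choice is worth checking against whatever boundaries are actually intended.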
Bike count has a right skewed distribution, so a log transformation is needed to normalize the data. Later on, a log transformation will be needed when building a linear model as well.
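To illustrate the effect of the transformation, here is a small sketch in base R. The sample of counts is hypothetical (a handful of values in the range reported by the summary, not the full data), and the `skewness` helper is defined here because base R does not provide one.

```r
# Hypothetical sample of hourly counts, not the full dataset.
cnt <- c(16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 142, 281, 977)

# Moment-based skewness: positive values indicate a long right tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# log() is safe here because the minimum of cnt in the summary is 1, never 0.
log_cnt <- log(cnt)

skew_raw <- skewness(cnt)
skew_log <- skewness(log_cnt)

print(c(raw = skew_raw, log = skew_log))
```

On this sample the skewness of the logged counts is much closer to zero than that of the raw counts, which is the behavior the transformation is meant to achieve.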
## Warning: position_dodge requires constant width: output may be incorrect
Unsurprisingly, rentals increase as the weather gets warmer.
## Warning: position_dodge requires constant width: output may be incorrect
Rentals appear to peak in the morning at 8-9am and in the evening at 5-7pm. This seems to coincide with the start and end of the workday. It is reasonable to expect that a major proportion of rentals at these times come from registered users.
As expected, registered rentals match the overall peaks. I wonder if there is such a pattern for casual users?
Casual ridership peaks during the day, outside the commuting peaks. This is not too surprising: people who ride bikes to work every day would likely be registered users.
I’m curious as to how ridership varies throughout the year. Are there more rentals in certain months? I expect rentals to increase from spring onwards and to decrease as winter approaches and the weather gets colder.
## Jan Feb Mar Apr May June Jul
## 94.42477 112.86503 155.41073 187.26096 222.90726 240.51528 231.81989
## Aug Sep Oct Nov Dec
## 238.09763 240.77314 222.15851 177.33542 142.30344
The trend is as I expected, but there are a few surprises. Why is there a drop in rentals in July? Does the weather get unbearably hot for riding bikes? But this is also the beginning of the holiday season. Perhaps people take vacations and this causes a drop in registered rentals? I am curious to see whether there is a difference between casual and registered users for July.
## Jan Feb Mar Apr May June Jul Aug
## 85.9979 101.7069 125.2383 144.9492 172.3125 189.1917 179.2950 189.2576
## Sep Oct Nov Dec
## 191.8358 180.9731 151.8636 127.6757
## Jan Feb Mar Apr May June Jul
## 8.426872 11.158091 30.172437 42.311761 50.594758 51.323611 52.524866
## Aug Sep Oct Nov Dec
## 48.840000 48.937370 41.185389 25.471816 14.627782
I am intrigued by the results. Registered ridership drops in July, while casual ridership actually increases. Maybe my hypothesis is true. Perhaps registered users, who ride their bikes to work every day, go on vacation, and this causes a drop in overall rentals because registered rentals far outnumber casual ones. On the other hand, there may be tourists visiting the city during July who make use of the bike sharing platform. Perhaps this explains the rise in casual ridership.
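The monthly tables above are per-group means. The original code is not shown, so the following is only a sketch of how such a comparison could be computed with `tapply()`, using a few made-up rows rather than the real data.

```r
# Made-up rows for three months; the numbers are illustrative only.
bikes <- data.frame(
  mnth       = c(6, 6, 7, 7, 8, 8),
  casual     = c(50, 52, 54, 52, 48, 50),
  registered = c(190, 188, 180, 178, 189, 191)
)

# Label only the months present in the toy data, so that tapply()
# does not return NA for empty factor levels.
bikes$month <- factor(bikes$mnth, levels = 6:8, labels = c("June", "Jul", "Aug"))

# Mean hourly rentals per month, separately for each rider type.
mean_casual     <- tapply(bikes$casual, bikes$month, mean)
mean_registered <- tapply(bikes$registered, bikes$month, mean)

print(rbind(casual = mean_casual, registered = mean_registered))
```

Stacking the two results with `rbind()` puts both rider types in one table, which makes opposite movements in the same month easier to spot.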
Now I want to examine ridership through the week. Are there more rentals on certain days?
## Sunday Monday Tuesday Wednesday Thursday Friday Saturday
## 177.4688 183.7447 191.2389 191.1305 196.4367 196.1359 190.2098
Mean rentals are lowest on Sundays and Mondays and, interestingly, peak on Thursdays and Fridays.
I want to check whether registered ridership is higher on working days. At the same time, I want to look into whether casual rentals are higher on weekends.
## Sunday Monday Tuesday Wednesday Thursday Friday Saturday
## 56.16347 28.55345 23.58051 23.15919 24.87252 31.45879 61.24682
## FALSE TRUE
## 35.40838 44.71800
## Sunday Monday Tuesday Wednesday Thursday Friday Saturday
## 121.3054 155.1912 167.6584 167.9713 171.5641 164.6771 128.9630
## FALSE TRUE
## 155.0202 112.1520
The results are as I expected. There is an overall drop in rentals on weekends because registered users do not ride bikes to and from work. Now I want to compare ridership between years.
## 2011 2012
## 143.7944 234.6664
There is a dramatic increase in rentals in 2012 compared to 2011: mean hourly rentals rise from about 144 to about 235, an increase of roughly 63%. Perhaps the bike sharing platform became more popular. I wonder if this trend continued to 2013 and onward?
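As a quick check of the size of that increase, the relative change can be worked out directly from the two yearly means printed above. This is a small sketch; the variable names are my own.

```r
# Mean hourly rentals per year, copied from the output above.
mean_cnt <- c("2011" = 143.7944, "2012" = 234.6664)

# Relative change from 2011 to 2012, in percent.
pct_change <- (mean_cnt[["2012"]] - mean_cnt[["2011"]]) / mean_cnt[["2011"]] * 100

print(round(pct_change, 1))
```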
I want to get further insight into how bike rentals vary with the time of the day as well as with temperature and year.
Rentals are highest during the peak hours in warm weather. I want to see if this trend holds for both 2011 and 2012, as well as for all days of the week.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
It appears to be the case. An interesting result is that there appears to be an increase in ridership on Sunday evenings and Friday mornings.
There is an overall increase in bike rentals in warm weather (and warm months). On weekdays, there are certain hours in the morning and evening when bike rentals peak. In building any predictive model, time of the day and weather information would be essential for predicting bike count. There are very few rentals late at night and early in the morning.
Casual ridership is scattered throughout the day, while registered ridership (by far the majority) peaks at the start and end of the working day. Interestingly, casual ridership increases in July, while registered ridership drops.
Bike rentals strongly correlate with temperature as well as hour of the day.
Bike ridership strongly correlates with temperature. In the very cold winter months, rentals are few, while they reach their peaks during cool and warm weather, before dropping once again as the weather starts to become unbearably hot.
Bike rentals increase steadily from January, peaking in the summer months as the weather gets warmer before dropping as winter approaches. Interestingly, there is a dip in rentals in July and August before they pick up again in September.
Bike count fluctuates throughout the day, peaking in the morning and evening. This coincides with the start and end of the workday. Rentals late at night and in the early hours of the morning are very low.
This was an interesting dataset to work with. There was no tidying to do and the preprocessing was minimal: I made certain variables categorical and gave them more descriptive labels and ranges. Exploring the data revealed many intriguing insights, some obvious and others less so. Riding bikes is becoming an increasingly popular form of transportation across the nation. Datasets like this give bike sharing services the chance to optimize their resources by looking into questions such as: What are the best times to perform maintenance work? How many bikes should be stocked on a particular day at a particular hour? For example, knowing how bike rentals peak at the start and end of the workday will allow the company to stock bikes appropriately and ensure registered users (as well as potential customers) are never short of a bike. This data can be used to improve transportation systems across the city.
Future exploration of this dataset would include building a linear model to quantify the variation in bike rentals based on predictors such as hour of the day and temperature. Beyond that, with access to location information for the different bike rental stations, it would be possible to analyze how rental patterns vary across the city. A more detailed analysis could also be done on individual days and months. For example, do rentals increase on holidays like the 4th of July? What about New Year’s Eve?
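As a starting point for the linear model mentioned above, here is a minimal sketch using `lm()` on log-transformed counts. It runs on simulated data, not the real dataset: the predictors `temp` and `peak`, the simulated effect sizes, and the model formula are all assumptions chosen for illustration.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated predictors: normalized temperature and a peak-hour indicator.
n    <- 200
temp <- runif(n)
peak <- rep(c(TRUE, FALSE), length.out = n)

# Simulated hourly counts with a multiplicative effect of each predictor.
# pmax() guards against log(0) in case a count rounds down to zero.
cnt <- pmax(round(exp(3 + 2 * temp + 1 * peak + rnorm(n, sd = 0.3))), 1)

# Linear model on the log scale, matching the log transformation used earlier.
fit <- lm(log(cnt) ~ temp + peak)

print(coef(fit))
```

On the log scale each coefficient is read as a multiplicative effect on the count, so a model of this shape fits the observation that rentals scale up in warm weather and at peak hours rather than shifting by a fixed amount.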