These are the dimensions of the bicycle_rentals dataset
## [1] 17379 15
The features in this dataset are:
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "cnt"
## instant dteday season yr mnth hr holiday weekday workingday
## 1 1 2011-01-01 spring 0 1 0 0 Saturday 0
## 2 2 2011-01-01 spring 0 1 1 0 Saturday 0
## 3 3 2011-01-01 spring 0 1 2 0 Saturday 0
## 4 4 2011-01-01 spring 0 1 3 0 Saturday 0
## 5 5 2011-01-01 spring 0 1 4 0 Saturday 0
## 6 6 2011-01-01 spring 0 1 5 0 Saturday 0
## weathersit temp atemp hum windspeed cnt DoW
## 1 clear 0.24 0.2879 0.81 0.0000 16 Saturday
## 2 clear 0.22 0.2727 0.80 0.0000 40 Saturday
## 3 clear 0.22 0.2727 0.80 0.0000 32 Saturday
## 4 clear 0.24 0.2879 0.75 0.0000 13 Saturday
## 5 clear 0.24 0.2879 0.75 0.0000 1 Saturday
## 6 mild 0.24 0.2576 0.75 0.0896 1 Saturday
## [1] "clear" "mild" "light precip" "heavy precip"
## [1] "spring" "summer" "fall" "winter"
## 'data.frame': 17379 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ season : chr "spring" "spring" "spring" "spring" ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : chr "Saturday" "Saturday" "Saturday" "Saturday" ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: chr "clear" "clear" "clear" "clear" ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
## $ DoW : Factor w/ 7 levels "Sunday","Monday",..: 7 7 7 7 7 7 7 7 7 7 ...
## instant dteday season yr
## Min. : 1 2011-01-01: 24 Length:17379 Min. :0.0000
## 1st Qu.: 4346 2011-01-08: 24 Class :character 1st Qu.:0.0000
## Median : 8690 2011-01-09: 24 Mode :character Median :1.0000
## Mean : 8690 2011-01-10: 24 Mean :0.5026
## 3rd Qu.:13034 2011-01-13: 24 3rd Qu.:1.0000
## Max. :17379 2011-01-15: 24 Max. :1.0000
## (Other) :17235
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Length:17379
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 Class :character
## Median : 7.000 Median :12.00 Median :0.00000 Mode :character
## Mean : 6.538 Mean :11.55 Mean :0.02877
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000
## Max. :12.000 Max. :23.00 Max. :1.00000
##
## workingday weathersit temp atemp
## Min. :0.0000 Length:17379 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 Class :character 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Mode :character Median :0.500 Median :0.4848
## Mean :0.6827 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :1.000 Max. :1.0000
##
## hum windspeed cnt DoW
## Min. :0.0000 Min. :0.0000 Min. : 1.0 Sunday :2502
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 40.0 Monday :2479
## Median :0.6300 Median :0.1940 Median :142.0 Tuesday :2453
## Mean :0.6272 Mean :0.1901 Mean :189.5 Wednesday:2475
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.:281.0 Thursday :2471
## Max. :1.0000 Max. :0.8507 Max. :977.0 Friday :2487
## Saturday :2512
| Feature | Explanation |
|---|---|
| instant | Unique identifier for each observation in the dataset. |
| dteday | The date on which the observation was recorded. |
| season | The prevailing season for which the observation was recorded. 1: Spring, 2: Summer, 3: Fall, 4: Winter. |
| yr | The year for which the observation was recorded. 0: 2011, 1: 2012. |
| mnth | The month for which the observation was recorded. 1 (Jan) to 12 (Dec). |
| hr | The hour, in 24-hour format, for which the observation was recorded. |
| holiday | Whether the day of observation was a holiday (1) or not (0). |
| weekday | The day of the week for which the observation was recorded. |
| workingday | If the day of observation is neither weekend nor holiday, then this value is 1, otherwise it is 0. |
| weathersit | Weather situation prevailing on the day of observation; recoded in this report as clear, mild, light precip and heavy precip. |
| temp | Normalized temperature in Celsius. The values are divided by 41 (max). |
| atemp | Normalized feeling temperature in Celsius. The values are divided by 50 (max). |
| hum | Normalized humidity. The values are divided by 100 (max). |
| windspeed | Normalized wind speed. The values are divided by 67 (max). |
| casual | Count of casual users. |
| registered | Count of registered users. |
| cnt | Count of total rental bikes, including both casual and registered. |
We see from the summary that the minimum ridership in any given hour is 1 and the maximum in any given hour is 977, just shy of 1000. Aggregating the hourly counts by day gives the lowest daily totals:
## dteday cnt
## 668 2012-10-29 22
## 27 2011-01-27 431
## 726 2012-12-26 441
## 26 2011-01-26 506
## 65 2011-03-06 605
## 69 2011-03-10 623
We see from the preceding dataframe dump that the lowest daily bicycle rental total in our dataset occurred on 29th October 2012. A simple web search shows us that this was the day Hurricane Sandy struck the Washington, DC area; a majority of schools, colleges and offices had shut down. The next low occurred on 27th January 2011. On this day, it was a blizzard that had shut down Washington, DC. Thus, analyzing transportation statistics over a period of time and correlating them with incidents that coincide with departures from the trend may help us build predictive models.
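As a rough illustration of how such a table of low-ridership days could be produced, here is a minimal sketch in base R. The data frame `hourly` and its values are made up for the example; only the column names `dteday` and `cnt` come from the dataset, and the report's actual code may differ.

```r
# Synthetic stand-in for the hourly data (dates and counts are invented)
hourly <- data.frame(
  dteday = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                     "2020-01-02", "2020-01-03")),
  cnt    = c(120L, 80L, 10L, 15L, 300L)
)

# Collapse the hourly counts into one total per day
daily <- aggregate(cnt ~ dteday, data = hourly, FUN = sum)

# Sort ascending so the quietest days come first
daily <- daily[order(daily$cnt), ]
print(head(daily))
```

Sorting by the daily total rather than the hourly count matters here: the hourly minimum of 1 occurs on many ordinary nights, whereas a low daily total points to a disrupted day.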
Let us look at some plots now.
The first plot on the left shows the total number of users renting bicycles in each of the two years for which we have data. At a very high level, we see that the total number of users in the second year exceeds the total in the first year. I attribute this trend to a number of possible factors, such as the rising popularity of the rental service, growth in the travelling population and increased crowding on other modes of public transportation.
The second plot shows the total number of casual users renting bicycles in the two years for which we have data. We see that the number of casual users in the second year exceeds the number of casual users in the first year.
The last plot shows the total number of registered users renting bicycles in the two years for which we have data. We see that the number of registered users in the second year exceeds the number of registered users in the first year. We also see that the year-over-year increase in registered users is greater than the year-over-year increase in casual users. This may be due to casual-to-registered user conversion, or simply to more people choosing to register than to rent casually.
I chose to include these plots because I wanted to look at the overall trends and see whether there were any opposing trends among casual and registered users. We do not see any opposing trends, but it looks like registered ridership is signaling the popularity of the service.
The bicycle rental frequencies seem to progressively decline for higher counts. I chose to include this plot because I wanted to convey a sense of popularity for the service. We see that daily/hourly instances with lower total bicycle rentals happen more frequently than daily/hourly instances with higher total rentals.
I wonder how the frequencies will vary by season and weather conditions.
As is intuitive, we see that bicycle rentals are highest when the weather is clear, next highest when the weather is just misty, a few brave souls renting bicycles when there is light precipitation and least when there is heavy precipitation. I chose to include this plot because bicycle transportation is open to the elements and is therefore very different from other modes of transportation. I wanted to get an initial sense of whether the prevailing conditions played a major role in influencing bicycle rentals.
I cannot fathom the reason for this, but it looks like bicycle rentals are highest in the fall, next highest in summer, third in winter and lowest in spring. As in the previous plot, I wanted to explore whether prevailing conditions played a part in influencing bicycle rentals. Looking at the seasonal trends, we get a fair sense of how general cold weather/warm weather, etc. influence bicycle rentals.
Next, I split the data into two subsets, one for holidays and one for working days. Here are the summaries of the bicycle rental numbers for the respective subsets (working days first, then holidays):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 40.0 144.0 190.4 281.5 977.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 28.75 97.00 156.90 253.20 712.00
From the preceding stats, we see that, all else set aside, bicycle rental numbers are higher on working days than on holidays (a median of 144 against 97), indicating that bicycle rentals may not be primarily for recreational use.
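A comparison like the one above can be sketched with `tapply()`, which applies a function to each group of a vector. The toy data below is invented; only the column names `holiday` and `cnt` are taken from the dataset, and this is not necessarily how the report computed its summaries.

```r
# Toy data: holiday flag (0 = not a holiday, 1 = holiday) and hourly counts
toy <- data.frame(
  holiday = c(0L, 0L, 0L, 1L, 1L, 1L),
  cnt     = c(150L, 200L, 250L, 90L, 100L, 140L)
)

# Six-number summary of cnt for each holiday group
by_holiday <- tapply(toy$cnt, toy$holiday, summary)
print(by_holiday)

# Group medians alone, for a direct comparison
medians <- tapply(toy$cnt, toy$holiday, median)
print(medians)
```

Grouping a single data frame this way avoids keeping two separate subsets in sync.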
Next, let’s explore how bicycle rentals stack up by day of the week.
From the plot on the left, we see that the busiest day for bicycle rentals is Friday, and the least busy day is Sunday. Rental numbers climb up through the work week and then fall off on the weekend.
From the plot on the right, we see that bicycle rentals peak at 8:00am and 5:00pm and drop off on either side, suggesting that the peaks may be driven up by people using bicycles to get to and from work.
I chose to include these plots because I wanted to see which days of the week and which hours of the day were the busiest, and which were the least busy.
The plot on the left shows the daily total bicycle rentals by the day of the week. In the previous section, we saw which were the busiest days of the week for bicycle rentals; I chose this plot so that we could see how noisy the data really is. During the weekend, we see the least variance, while we see greater variance for data on Monday and Wednesday.
The plot on the right shows the hourly total bicycle rentals. In the previous section, we saw which were the busiest hours of the day for bicycle rentals; I chose this plot so that we could see how noisy the data really is. This plot turned out to be really interesting; we see the medians following the trend we saw in the previous section and we also see that most of the variance in hourly bicycle rentals occurs during the middle portion of the day. I have limited the y-axis to a value of 300, so that the outliers are left out of view and the charts are more readable.
The preceding plot shows the median total bicycle rentals on working days for each hour of the day (line in red). On working days, the trend is pretty close to the overall trend, suggesting that bicycle rentals are primarily driven by office- and school-goers. This plot continues the exploration from where we left off in the previous plot. The line in blue shows bicycle rentals by the hour on holidays; it shows higher numbers during the middle portion of the day. I chose these plots because I wanted to see whether the hourly trends on working days matched the hourly trends on all days. They do follow closely, reinforcing our belief that bicycle rentals are primarily used by commuters.
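The hourly medians behind such a plot can be computed with `aggregate()` and a two-variable formula. This is only a sketch under assumptions: the toy values are invented, and grouping on the `workingday` flag is my choice for the example rather than something the report confirms.

```r
# Toy data: two hours of the day, on working days (1) and non-working days (0)
toy <- data.frame(
  hr         = c(8L, 8L, 8L, 13L, 13L, 13L, 8L, 8L, 13L, 13L),
  workingday = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L),
  cnt        = c(400L, 420L, 380L, 150L, 170L, 160L, 60L, 80L, 300L, 320L)
)

# Median rentals for every (hour, day type) combination
med <- aggregate(cnt ~ hr + workingday, data = toy, FUN = median)
print(med)
```

The resulting long-format table has one row per hour and day type, which is the shape a line plot with one line per day type needs.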
Let us now see how the numbers of casual and registered renters vary by the day of the week and the hour of the day.
From the plot on the left, we see that casual ridership is higher on weekends than on weekdays, while the trend is reversed for registered ridership. We also see that the hourly ridership plot for registered riders is similar to the overall hourly ridership plot, which means that registered ridership is a pretty good predictor of total bicycle rentals.
Though this plot may have better belonged in the bivariate section, I chose to include it here because it nicely wraps up the conclusions drawn from the preceding graphs. We see how registered users (more likely commuters than not) drive up rentals in the middle portion of the week, while casual users (more likely recreational users than not) drive up rentals during the weekend.
What is the structure of your dataset?
I am working with the bicycle rentals dataset, which contains bicycle rental information from an agency in Washington, DC, for the years 2011-2012. The dataset also contains weather data for the same days as the bicycle rentals.
The dataset contains 731 daily records as well as 17379 hourly records for bicycle rentals at one agency in the city.
There are 17 features in the hourly dataset, and they are listed here:
instant - An index for the record
dteday - The date
season - The prevailing climatic season
yr - Year
mnth - Month
hr - The hour of observation
holiday - Whether or not the particular day was a holiday (weekends are not counted as holidays; only about 2.9% of the hourly records fall on a holiday)
weekday - The day of the week
workingday - Whether or not the particular day was a working day (neither a weekend nor a holiday)
weathersit - What kind of weather prevailed at the time of observation
temp - The normalized hourly temperature
atemp - The normalized hourly feeling temperature
hum - The normalized humidity at the time of observation
windspeed - The normalized windspeed at the time of observation
casual - The number of rentals by non-registered customers
registered - The number of rentals by registered customers
cnt - Total number of rentals
The variables weathersit, season, holiday and workingday are categorical variables.
weathersit has four states that I have simplified into: clear, misty, light_precipi (light rain/snow) and heavy_precipi (heavy rain/sleet/snow).
season has four states: spring, summer, fall and winter.
holiday and workingday are categorical variables that have binary values. They are mutually exclusive in each observation (a holiday is never a working day), but they are not exact inverses, since weekends are neither holidays nor working days.
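For readers who want to reproduce the recoding of the categorical variables, a minimal sketch using `factor()` follows. The season code-to-label mapping comes from the feature table; the order of the weather labels is an assumption based on the order in which the levels are printed earlier, and the data frame `raw` is invented for the example.

```r
# Hypothetical raw integer codes, as they might look before recoding
raw <- data.frame(
  season     = c(1L, 2L, 3L, 4L, 1L),
  weathersit = c(1L, 2L, 3L, 4L, 1L)
)

# Labels as printed in this report (weather label order is assumed)
season_labels  <- c("spring", "summer", "fall", "winter")
weather_labels <- c("clear", "mild", "light precip", "heavy precip")

# Map each integer code 1..4 to its label
raw$season     <- factor(raw$season, levels = 1:4, labels = season_labels)
raw$weathersit <- factor(raw$weathersit, levels = 1:4, labels = weather_labels)

print(levels(raw$season))
print(table(raw$weathersit))
```

The `str()` output earlier shows these columns stored as character vectors; wrapping the result in `as.character()` would give that form instead of a factor.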
Other observations:
1) More bicycle rentals are made by registered customers than by casual customers.
2) More bicycle rentals are made by people commuting to their jobs/schools than by recreational customers.
3) Bicycle rentals are higher in dry weather than in wet weather.
What is/are the main feature(s) of interest in your dataset?
The main feature of interest in the dataset is the bicycle rental count. I am trying to build a prediction model for bicycle rentals. From our univariate analyses, we have seen that the general weather situation and season have pretty good predictive power.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
We need to analyse some more to build a prediction model with higher confidence.
I suspect windspeed, humidity and temperature will play a role in predicting bicycle rentals, but we shall see.
Did you create any new variables from existing variables in the dataset?
No, I did not. Had the total count of bicycle rentals been missing, I could have used “casual” and “registered” to derive the total count.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
I log-transformed the total bicycle rentals for each day, to get a sense for the distribution.
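A log transform of this kind is a one-liner in R. The sketch below uses a handful of illustrative counts (not the real distribution) and base-10 logs, which is an assumption; the report does not say which base was used. Because the minimum count in the data is 1, the logarithm is defined for every observation.

```r
# Illustrative right-skewed counts (invented for the example)
cnt <- c(1L, 2L, 5L, 13L, 40L, 142L, 281L, 977L)

# Base-10 log compresses the long right tail
log_cnt <- log10(cnt)
print(round(log_cnt, 2))

# Gap between mean and median, relative to the range, before and after
skew_raw <- (mean(cnt) - median(cnt)) / diff(range(cnt))
skew_log <- (mean(log_cnt) - median(log_cnt)) / diff(range(log_cnt))
print(c(raw = skew_raw, logged = skew_log))
```

The mean-median gap is used here only as a quick, informal indicator of skew.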
1) When I looked at the distribution of total bicycle rentals as a whole, I found that the lower numbers occurred far more frequently. There was a tiny gap in the frequencies around the mark, which remains unexplained.
2) When I plotted the busiest hours of the day overall, I saw two peaks, which corresponded to the commute hours.
3) When I looked at the mean trends on a daily basis, I saw that registered users rented mainly during the work week and less on weekends. For casual users, this trend was reversed: casual users rented less during the work week than on weekends.
I did not have to do anything to clean the dataset.
From the univariate analyses performed so far, I have seen that some features tend to correlate with each other. For example, average temperatures and precipitation seem to correlate with the season in a broader sense; these variables also seem to correlate with the weather situation in a much more specific sense.
A quick look at the correlation data shows that bicycle rentals are mostly influenced by the feeling temperature and hour of the day. I chose to include the correlation chart because it shows correlation scores for different pairs of features and therefore will give us a good idea of what relationships we should be exploring in detail.
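The numbers behind a correlation chart come from a correlation matrix, which `cor()` computes directly. The sketch below uses simulated data whose coefficients and noise level are arbitrary assumptions; only the column names mirror the dataset.

```r
set.seed(1)
n <- 200

# Simulated weather features on a 0..1 scale (values are not real)
toy <- data.frame(
  atemp     = runif(n),
  hum       = runif(n),
  windspeed = runif(n)
)

# Simulated rentals: rise with temperature, fall with humidity, plus noise
toy$cnt <- 300 * toy$atemp - 100 * toy$hum + rnorm(n, sd = 20)

# Pairwise Pearson correlations between all columns
cm <- cor(toy)
print(round(cm, 3))

# Correlations of each feature with cnt, strongest first
cnt_cor <- cm["cnt", colnames(cm) != "cnt"]
print(sort(abs(cnt_cor), decreasing = TRUE))
```

Pearson correlation only measures linear association, so a feature with a strong but non-linear effect, such as the hour of the day with its two commute peaks, can still show a modest score.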
I will now explore some scatter plots involving total bicycle rentals together with normalized feeling temperature, humidity and windspeed.
From the preceding plot, we can see that bicycle rentals gradually climb as the temperature increases, and are lowest at the extremities, i.e., when the feeling temperatures are at their lowest and highest. From the correlation chart above, we see a positive correlation between the total bicycle rental numbers and the prevailing temperatures, with a correlation score of 0.397. I chose to include this plot because we saw in the previous section how prevailing weather conditions influence bicycle rentals. I wanted to see whether prevailing temperatures had any role to play.
From the preceding plots, we see that most bicycle rentals occur when the normalized humidity is between 0.30 and 0.80. Taking a closer look at the rental number distributions between normalized humidities of 0.75 and 1.0, we see that there are certain “pockets” of distributions. At this point, it is not entirely clear what might be causing these distributions. It may be because certain values of normalized humidity never occur. From the correlation chart above, we see that there is a negative correlation between the total bicycle rental numbers and the prevailing humidity, with a correlation score of -0.325. I chose to include this plot because it reinforces, in greater detail, the relationship between the total number of bicycle rentals and the prevailing humidity that was revealed by the correlation chart.
From the preceding plots, we see that bicycle rentals are pretty steady at lower windspeeds, dropping off at higher windspeeds. In the correlation chart above, we see that the correlation between total bicycle rental numbers and the prevailing windspeeds is positive, but weak, with a correlation score of 0.0893. I chose to include this plot because it nicely wraps up our investigation into how the three main weather factors influence bicycle rentals individually.
I chose to include scatterplots in the above charts to investigate if there were any patterns. I did not see any particular patterns, other than the fact that a combination of moderate-high humidity, moderate temperature and low-moderate windspeeds might work well for bicycle renters!
From the preceding plot, we see that bicycle rentals mostly happen in “moderate” feeling temperature. We see that clear weather rentals occur across the spectrum of feeling temperature. We also see that bicycle rentals in misty and light precipitation weather situations happen in a much narrower band of temperature. A peak stands out at a normalized feeling temperature of about 0.63, and we see that the rentals spike in clear, light precipitation and misty weather. Of course, very few rentals occur during periods of heavy precipitation.
Since the weather factors tend to straddle seasons and even occur unseasonally, I included this chart to explore how bicycle rentals vary with the feeling temperatures. Overlaying it with the prevailing weather situation shows us the different weather situations in which the temperature values occur.
From the preceding plot, we see that most of the humidity range is dominated by clear weather and this is where a bulk of bicycle rentals take place. We do see a few peaks towards the higher end of the humidity spectrum, and this quartile seems to be dominated by misty weather for the most part and precipitative weather to a lesser extent.
Since the weather factors tend to straddle seasons and even occur unseasonally, I included this chart to explore how bicycle rentals vary with the prevailing humidity. Overlaying it with the prevailing weather situation shows us the different weather situations in which the humidity values occur.
From the preceding plot, we see that most of the bicycle rentals happen in the first two quartiles of normalized windspeed. Rentals are highest when the normalized windspeed is 0, in fact. Beyond a normalized windspeed of 0.30, however, bicycle rentals fall away steadily and are almost non-existent in very windy weather. Most of the visible spectrum is dominated by clear weather.
Since the weather factors tend to straddle seasons and even occur unseasonally, I included this chart to explore how bicycle rentals vary with the prevailing windspeed. Overlaying it with the prevailing weather situation shows us the different weather situations in which the windspeed values occur.
In fact, most of the bicycle rentals happen in clear weather, as shown by the summary below.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## clear 1 46.0 159 204.90 304.0 977
## heavy precip 23 29.5 36 74.33 100.0 164
## light precip 1 21.0 63 111.60 152.5 891
## mild 1 40.0 133 175.20 257.0 957
Let’s also look at a summary of bicycle rentals by season.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## fall 1 68 199.0 236.0 345 977
## spring 1 23 76.0 111.1 158 801
## summer 1 46 165.0 208.3 311 957
## winter 1 46 155.5 198.9 295 967
From the above summary, it looks like bicycle rentals are high in the fall and summer and taper off in the winter and spring.
Let’s now look at the number of casual bicycle rentals per season.
In the plot on the left, we see bar plots for casual bicycle rentals by the prevailing season. We see that casual ridership is highest in fall, followed by summer, spring and winter. In the plot on the right, we see bar plots for registered bicycle rentals by the prevailing season. As in the case of casual users, we see that ridership is highest in summer. For registered users, though, next highest is in winter followed by spring. As in the case of casual users, lowest ridership occurs in spring.
I chose to include these plots because we can, at a glance, see how the seasonal trends for bicycle rentals differ between casual renters and registered renters.
In the plot on the left, we see how casual bicycle rentals stack up against the prevailing temperature. The highest number of casual bicycle rentals happen between normalized temperatures of 0.55 and 0.75.
In the plot on the right, we see how registered bicycle rentals stack up against the prevailing temperatures. As in the case of casual riders, normalized temperatures 0.48 to 0.75 seem to be most popular.
I chose to include these plots because it is interesting to see how the distribution of casual and registered bicycle rentals each vary by the prevailing temperatures. It also helps us appreciate that the relationship between bicycle rental numbers and prevailing temperatures is non-linear and why temperature may not have too much of an influence on a linear model built to predict casual bicycle rentals.
In the plot on the left, we see how casual bicycle rentals stack up against the prevailing humidity. We see that normalized humidity ranges between 0.30 and 0.55 seem to be very popular among casual bicycle rentals.
In the plot on the right, we explore how registered bicycle rentals stack up against the prevailing humidity. We see that normalized humidity values between 0.30 and 0.75 are most popular among registered riders.
I chose to include these plots because it helps us appreciate that the relationship between bicycle rental numbers and the prevailing humidity is non-linear and why the humidity factor is likely to not have a great influence on any linear model we build to predict bicycle rentals.
In the preceding plot, we see how total bicycle rentals stack up against the prevailing wind conditions. We see that bicycle rentals are highest when the windspeed is low, dropping off as the prevailing windspeed picks up.
The correlation between bicycle rentals and windspeed is very low, as evidenced by the correlation chart we saw earlier. However, I chose to include this plot because it shows the distribution of bicycle rental numbers across the spectrum of windspeed and helps us appreciate the fact that there is no linear relationship between the two.
The preceding plot is a frequency plot of bicycle rentals. Most rentals happen in small numbers at a time, which suggests that the popularity of the service is still picking up. As popularity increases, we should expect to see more occurrences of higher rental counts.
Though the number of registered riders using the service is consistently higher than the number of casual renters, we see that the general trends remain the same. Registered users seem to be a little more loyal than casual users, in that the winter ridership is about the same as the summer ridership - whereas it is much lower in the case of casual users.
I chose to include this plot in the report because this chart shows what we may call the “popularity score” for the bicycle rental service as a whole.
We have already seen that average feeling temperatures and hour of the day are the main influencers of bicycle rentals. Let’s look at how different weather conditions vary with the categorical variables.
We already have a feel for how feeling temperatures are connected with weather conditions. We will see how average feeling temperatures vary by season.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## fall 0.2424 0.6061 0.6667 0.6560 0.7121 1.0000
## spring 0.0000 0.2121 0.2879 0.2981 0.3788 0.6515
## summer 0.1667 0.4394 0.5303 0.5205 0.6212 0.8788
## winter 0.1515 0.3333 0.4242 0.4157 0.5000 0.6818
The feeling temperatures are highest in fall and summer, followed by winter and spring (median values of 0.67, 0.53, 0.42 and 0.29 respectively). This follows the general trend of bicycle rentals by season. I chose to plot this chart because I wanted to explore the relationship between the different seasons and prevailing temperatures in a given season. With a boxplot, I am able to see the median value for each season, as well as the outliers.
Let’s now look at the trend of humidity by seasons.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## fall 0.16 0.50 0.65 0.6332 0.78 1
## spring 0.00 0.43 0.56 0.5813 0.74 1
## summer 0.16 0.46 0.64 0.6270 0.81 1
## winter 0.16 0.53 0.66 0.6671 0.82 1
Humidity values are highest in winter, about the same in fall and summer, and lowest in spring. This is contrary to the trends we see for bicycle rentals. This reinforces our finding that humidity is not a great predictor of bicycle rentals. I chose to include this plot because I wanted to explore the relationship between the humidity and season features. I chose a boxplot because it lets us see the seasonal medians as well as any noisy data points.
Let us take a look at how windspeeds vary by season.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## fall 0 0.1045 0.1642 0.1716 0.2537 0.8507
## spring 0 0.1045 0.1940 0.2151 0.2985 0.8060
## summer 0 0.1343 0.1940 0.2034 0.2836 0.7164
## winter 0 0.0896 0.1642 0.1708 0.2537 0.6418
Windspeeds are highest in spring and summer, and about the same in fall and winter. This trend is contrary to the bicycle rental trends.
From the preceding boxplots, we see that only average feeling temperature trends mimic that of bicycle rentals. The choice of the preceding boxplots was driven by my desire to explore the relationship between features and how they strengthen or weaken each other. I used boxplots because while a particular range of values of a given feature is dominant for the season, unseasonal values do occur; I wanted to quantify how frequent these unseasonal values would be.
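One way to put a number on how often unseasonal values occur is to count the points that a boxplot would draw as outliers. The sketch below does this per season with `boxplot.stats()`, whose `out` element holds the points lying beyond the whiskers (by default, more than 1.5 times the hinge spread away from the box). The `atemp` values are invented for illustration.

```r
# Toy feeling temperatures: each season has one clearly unseasonal reading
toy <- data.frame(
  season = rep(c("spring", "fall"), each = 8),
  atemp  = c(0.20, 0.22, 0.25, 0.28, 0.30, 0.31, 0.33, 0.90,
             0.60, 0.62, 0.65, 0.66, 0.68, 0.70, 0.71, 0.10)
)

# Per-season summaries, in the same form as the tables in this section
print(tapply(toy$atemp, toy$season, summary))

# Number of boxplot outliers in each season
n_out <- tapply(toy$atemp, toy$season,
                function(x) length(boxplot.stats(x)$out))
print(n_out)
```

Dividing `n_out` by the number of observations per season would turn these counts into a frequency of unseasonal readings.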
##
## Call:
## lm(formula = cnt ~ registered, data = df_master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.031 -16.264 -9.957 4.730 304.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.296458 0.459718 22.4 <2e-16 ***
## registered 1.165032 0.002131 546.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42.51 on 17377 degrees of freedom
## Multiple R-squared: 0.9451, Adjusted R-squared: 0.9451
## F-statistic: 2.99e+05 on 1 and 17377 DF, p-value: < 2.2e-16
The linear model above shows how the registered user count influences the total count. We see that the fit is very good, with an r^2 value of 0.945. This is expected, since the total count is the sum of the casual and registered counts.
##
## Call:
## lm(formula = cnt ~ temp, data = df_master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -291.37 -110.23 -32.86 76.77 744.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0356 3.4827 -0.01 0.992
## temp 381.2949 6.5344 58.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 165.9 on 17377 degrees of freedom
## Multiple R-squared: 0.1638, Adjusted R-squared: 0.1638
## F-statistic: 3405 on 1 and 17377 DF, p-value: < 2.2e-16
The linear model above shows how the prevailing temperatures influence the total bicycle rental count. There is a moderate correlation, with an r^2 value of 0.1638.
##
## Call:
## lm(formula = cnt ~ hr, data = df_master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -305.86 -102.80 -45.85 66.50 730.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.0952 2.4616 28.48 <2e-16 ***
## hr 10.3378 0.1829 56.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 166.7 on 17377 degrees of freedom
## Multiple R-squared: 0.1553, Adjusted R-squared: 0.1552
## F-statistic: 3195 on 1 and 17377 DF, p-value: < 2.2e-16
The linear model above shows how the total bicycle rentals are influenced by the hour of the day. We see that there is a moderate correlation, with an r^2 value of 0.1552.
##
## Call:
## lm(formula = cnt ~ season, data = df_master)
##
## Residuals:
## Min 1Q Median 3Q Max
## -235.02 -118.87 -39.11 85.89 768.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 236.016 2.615 90.268 < 2e-16 ***
## seasonspring -124.902 3.753 -33.284 < 2e-16 ***
## seasonsummer -27.672 3.716 -7.447 9.99e-14 ***
## seasonwinter -37.147 3.755 -9.893 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 175.3 on 17375 degrees of freedom
## Multiple R-squared: 0.06599, Adjusted R-squared: 0.06583
## F-statistic: 409.2 on 3 and 17375 DF, p-value: < 2.2e-16
The linear model above shows how the total bicycle rentals are influenced by the prevailing season. We see that there is a slight correlation, with an r^2 value of 0.06583.
I chose to include the linear models above because, in anticipation of building a linear prediction model, I am trying to decide on a set of features. From the correlation chart and other plots above, we already have a fair idea of what features may have a strong correlation with the bicycle rental numbers; the linear models above help us to quantify that influence.
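For reference, the r^2 values quoted above can be pulled straight out of a fitted model rather than read off the printed summary. The sketch below fits a single-predictor model on simulated data (the slope and noise level are arbitrary assumptions, not estimates from the report) and checks a standard identity: with one numeric predictor, r^2 equals the squared Pearson correlation.

```r
set.seed(42)
n <- 500

# Simulated predictor and response (not the real data)
temp <- runif(n)
cnt  <- 380 * temp + rnorm(n, sd = 160)

# Fit a simple linear regression, as in the models above
fit <- lm(cnt ~ temp)

# Extract the multiple and adjusted R-squared from the model summary
r2     <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
print(c(r.squared = r2, adj.r.squared = adj_r2))

# With a single predictor, R-squared is the squared correlation
print(cor(cnt, temp)^2)
```

This also explains why a correlation score near 0.4 corresponds to an r^2 of only about 0.16: squaring shrinks moderate correlations considerably.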
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
The number of bicycle rentals correlates strongly with the year and the hour of the day. The former may reflect the rental service growing in popularity in its second year, but this explanation remains tentative because we only have data from two years. I do not see any particularly strong relationships between the number of bicycle rentals and other features in the dataset such as humidity, temperature or feeling temperature (I consider a relationship strong when its R^2 value exceeds 0.50).
From the exploration that I conducted into the data, I see that the prevailing temperature explains about 16% of the variance in total bicycle rentals.
I also see that a whopping 94% of the variance in total bicycle rentals is explained by the number of rentals made by registered users (the “registered” feature).
Given the preceding findings, I looked to see if there is a strong relationship between the number of casual bicycle rentals and any other feature in the dataset. The strongest correlation I found was between casual ridership and normalized temperature: normalized temperature explains roughly 21% of the variance in casual bicycle rentals.
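One caveat on the figure for registered rentals: in the UCI release of this dataset, `cnt` is documented as the sum of `casual` and `registered`. Under that assumption, a high R^2 between `cnt` and `registered` is partly built in. The sketch below uses simulated counts (invented, not the real data) to show the effect even when the two components are generated independently.

```r
set.seed(1)
n <- 1000

# Simulated hourly counts; the Poisson means are arbitrary illustrative choices.
registered <- rpois(n, lambda = 150)
casual     <- rpois(n, lambda = 30)
cnt        <- casual + registered   # total is a sum of its parts by construction

fit <- lm(cnt ~ registered)
r2    <- summary(fit)$r.squared
slope <- unname(coef(fit)["registered"])

# The slope sits near 1 and R-squared is high purely because
# `registered` is a component of `cnt`.
print(c(r2 = r2, slope = slope))
```

This suggests treating `registered` as a component of the target rather than as an independent predictor when judging how well the other features explain rentals.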
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
I looked at the relationships between some more features in the dataset. There is a moderate relationship between humidity and the prevailing weather situation of the day, and a strong relationship between the normalized temperature on a given day and the corresponding season.
What was the strongest relationship you found?
The strongest relationship I found was between the total bicycle rental numbers and number of registered renters. The other strong relationship I found was between the prevailing daily normalized temperatures and the season.
We see that daily median ridership for each day of the week is lowest in spring and highest in fall. The daily median numbers follow a closer trend early in the week in summer and winter. I chose to include this plot because it shows very clearly the seasonal trends by day of the week. From this plot, we see why the prevailing season may be an acceptable indicator of bicycle rental numbers.
Looking at the median hourly ridership numbers, we see that the numbers are lowest in spring, while they follow a closer trend in the other seasons. The numbers are pretty much the same in the early hours of the day (owing to the extremely low rentals). The hourly numbers in summer, fall and winter closely converge at around 9:00am and between 4:00pm and 5:00pm, which suggests that these are primarily planned rentals. I chose to include this chart because it supplements the investigation we performed in the previous plot. When we look at the plot above, we see that the seasonal trends hold for pretty much the entire range of hours.
The preceding plot shows the median daily bicycle rentals depending on the prevailing weather situation of the day. Heavy precipitation seems not to deter bicycle renters on Mondays and Tuesdays, while on other days it coincides with low ridership. As expected, the highest bicycle ridership happens on clear days, followed by mild weather days and then days that see light precipitation. I chose to include this plot because it reveals the daily bicycle rental trends by prevailing weather situation, and why there may not be a close correlation between the rental numbers and the prevailing weather.
Lastly, we look at median hourly bicycle rentals by weather situation. As expected, heavy precipitation sees the lowest bicycle ridership except, curiously, in the earliest hours of the day, when it is higher than in any other weather situation. Up until 9:00am, ridership in clear weather and in mild weather is almost the same. After 9:00am, rental numbers in clear weather are higher than rental numbers in mild weather. Bicycle rentals on days with light precipitation are higher than in heavy precipitation, but lower than in clear and mild weather, at all hours of the day. I chose to include this plot because it nicely wraps up the investigation we have been conducting over the past few plots. The trends are more pronounced for three of the prevailing weather classes, which reveals why the correlation between prevailing weather and bicycle rental numbers may be weak.
The plots above show how the hourly and daily bicycle rentals vary with the weather situation and the seasons. I wonder how these trends would vary with temperature, humidity and windspeed too.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## fall 1 68 199.0 236.0 345 977
## spring 1 23 76.0 111.1 158 801
## summer 1 46 165.0 208.3 311 957
## winter 1 46 155.5 198.9 295 967
The preceding plot shows the total bicycle rentals broken down by season. Viewing the data as a boxplot shows the median values and lets us compare the spread. Going by the quartiles in the summary above, the middle half of the observations covers the widest range in fall (68 to 345) and summer (46 to 311), and the narrowest in spring (23 to 158).
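The spread per season can be computed directly from the summary table. A minimal sketch, with the quartiles typed in by hand from the output above (the data frame `q` is a hypothetical helper, not part of the original analysis):

```r
# Quartiles copied from the per-season summary output above.
q <- data.frame(
  season = c("fall", "spring", "summer", "winter"),
  q1     = c(68, 23, 46, 46),
  median = c(199, 76, 165, 155.5),
  q3     = c(345, 158, 311, 295)
)

# Interquartile range: the width of the box in a boxplot.
q$iqr <- q$q3 - q$q1

# Seasons ordered from widest to narrowest box.
print(q[order(-q$iqr), c("season", "median", "iqr")])
```

On the real data frame, the same quantity should be obtainable per group with something like `tapply(df_master$cnt, df_master$season, IQR)`.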
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## clear 1 46.0 159 204.90 304.0 977
## heavy precip 23 29.5 36 74.33 100.0 164
## light precip 1 21.0 63 111.60 152.5 891
## mild 1 40.0 133 175.20 257.0 957
The preceding boxplots reinforce our findings that bicycle rentals are lowest in spring, followed by winter and summer, and highest in the fall. Looking at the stats for each quartile, we see that the trends are close for the three seasons other than spring. We also see that periods of precipitation show lower bicycle rentals in the lower quartiles than in the higher quartiles. The main anomaly is that, for the first quartile, bicycle rentals are higher in periods of heavy precipitation (29.5) than in periods of light precipitation (21.0). The heavy-precipitation row spans only 23 to 164 rentals, so this anomaly should be read with caution. Towards the higher quartiles, though, bicycle rentals in non-precipitative weather dominate.
Since bicycle rentals are dominated by registered users, and because we saw from prior analysis that bicycle rental trends closely match office/school-going patterns, I thought that it might be a good idea to explore the trends among casual users. I added a new column to the dataframe to represent the ratio of casual user rentals to the total rentals.
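A quick arithmetic check on the heavy-precipitation row: its summary is reproduced exactly by just three values. This is a reconstruction from the printed statistics, not something read from the data, so treat the vector `heavy` as an assumption.

```r
# Hypothetical reconstruction: min, median and max of the heavy precip row.
heavy <- c(23, 36, 164)

# R's default quantile method on three values gives
# 1st Qu. = 29.5 and 3rd Qu. = 100, and the mean is 223 / 3 = 74.33.
print(summary(heavy))
```

If the group really is this small, the medians and quartiles for heavy precipitation rest on a handful of hours, and comparisons against the other weather classes should be read accordingly.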
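A minimal sketch of how such a ratio column might be computed. The column name `casual_ratio` and the toy values are assumptions for illustration; the original code is not shown in this report.

```r
# Toy rows with invented values; the real data frame is df_master.
toy <- data.frame(
  casual     = c(3, 8, 5, 0),
  registered = c(13, 32, 27, 1)
)
toy$cnt <- toy$casual + toy$registered

# Share of rentals made by casual users; guard against hours with no rentals.
toy$casual_ratio <- ifelse(toy$cnt > 0, toy$casual / toy$cnt, NA_real_)

print(toy)
```

A ratio of 0.5 corresponds to the 1:2 casual:total mark, so values below 0.5 indicate hours dominated by registered users.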
We see that the bulk of ratio frequencies is below 1:2 casual:total rentals. Of these, we see that casual ridership is highest in winter, followed by summer, spring and fall. I think this may be because winter ridership may be primarily ad-hoc, while in better weather, people make plans and register for bicycle rentals. I chose to include this plot because it reinforces our earlier findings that bicycle rentals are dominated by registered users. Coloring the numbers by the prevailing season, we easily see the variance by season.
The highest occurrences of casual ridership are seen in mild and light precipitative weather, followed by clear weather. There is very little casual ridership in heavy precipitation. This may be because casual riders are those who rent bicycles ad-hoc in slightly inclement weather, while they make plans and register for the service when clear weather is expected. I chose to include this plot because it reinforces our earlier findings that bicycle rentals are dominated by registered users. Coloring the numbers by the prevailing weather situation, we easily see the variance by weather.
On plotting the frequency of the casual-to-total rental ratio by day of the week, we see that casual ridership is higher earlier in the week than later. Interestingly, casual ridership is higher on Sundays than on Saturdays. I chose to include this plot because it reinforces our earlier findings that bicycle rentals are dominated by registered users. Coloring the numbers by day of the week, we easily see the variance across the week.
Finally, on plotting the frequencies of the casual-to-total rental ratio split by holidays and working days, we see that the raw frequencies are higher on working days than on holidays. On holidays, however, we see higher ratios of casual riders to total rentals. This may be explained by the fact that people like to explore the city on holidays.
I chose to include this plot because, in prior charts and analyses, we concluded that registered rentals dominate total bicycle rental numbers. That leaves us wondering whether this trend holds on “non-commuter days”, i.e., holidays. This chart answers that question.
All of these plots reinforce the findings that we reached using earlier histograms and plots. Now, in order to build a linear model to predict bicycle ridership, I’m going to decide on a set of variables that correlate with ridership.
I chose to include this plot because we already found out, from previous charts and analyses, how certain hours of the day are popular with bicycle renters. I wanted to see how the bicycle rental numbers varied by hour of the day and the prevailing seasons, and how the data points were distributed. We see that the trends in this chart follow the trends we saw earlier, that the middle portion of the day is a clear favorite among bicycle renters except in spring, where the trends are a little diluted.
From the plots above, we see how clear weather is a favorite among bicycle renters, with a higher number of hourly rentals happening in this weather. The next favorite is mild weather. Periods of light precipitation see moderate rentals, while periods of heavy precipitation see almost no rentals. I chose to include this plot because we already found out, from previous charts and analyses, how certain hours of the day are popular with bicycle renters, and I wanted to see how the bicycle rental numbers varied by hour of the day and the prevailing weather situation, and how the data points were distributed.
From the plots above and the r^2 values of the linear models above, we can deduce that the features that may help us predict total bicycle rentals on any given day are “registered”, “temp”, “hr”, “season”, “weathersit”, in the order of confidence. Let’s try to build a linear prediction model using these features and see if we can predict bicycle rental numbers with a high level of confidence.
##
## Calls:
## m1: lm(formula = I(cnt) ~ I(registered), data = df_master)
## m2: lm(formula = I(cnt) ~ I(registered) + temp, data = df_master)
## m3: lm(formula = I(cnt) ~ I(registered) + temp + hr, data = df_master)
## m4: lm(formula = I(cnt) ~ I(registered) + temp + hr + season, data = df_master)
## m5: lm(formula = I(cnt) ~ I(registered) + temp + hr + season + weathersit,
## data = df_master)
## m6: lm(formula = I(cnt) ~ I(registered) + temp + hr + season + weathersit +
## windspeed, data = df_master)
##
## =============================================================================================================
## m1 m2 m3 m4 m5 m6
## -------------------------------------------------------------------------------------------------------------
## (Intercept) 10.296*** -25.759*** -33.535*** -56.904*** -53.708*** -55.745***
## (0.460) (0.835) (0.920) (1.857) (1.866) (1.884)
## I(registered) 1.165*** 1.129*** 1.114*** 1.112*** 1.109*** 1.108***
## (0.002) (0.002) (0.002) (0.002) (0.002) (0.002)
## temp 83.584*** 83.145*** 109.100*** 107.789*** 107.100***
## (1.661) (1.644) (2.622) (2.611) (2.609)
## hr 0.893*** 0.805*** 0.827*** 0.796***
## (0.046) (0.047) (0.047) (0.047)
## season: spring/fall 18.307*** 18.247*** 17.099***
## (1.332) (1.326) (1.333)
## season: summer/fall 16.161*** 16.560*** 15.863***
## (0.924) (0.920) (0.923)
## season: winter/fall 13.241*** 13.651*** 13.454***
## (1.111) (1.107) (1.105)
## weathersit: heavy precip/clear 0.538 0.633
## (22.388) (22.355)
## weathersit: light precip/clear -12.852*** -13.401***
## (1.103) (1.104)
## weathersit: mild/clear -5.546*** -5.379***
## (0.683) (0.683)
## windspeed 17.916***
## (2.470)
## -------------------------------------------------------------------------------------------------------------
## R-squared 0.945 0.952 0.953 0.954 0.954 0.955
## adj. R-squared 0.945 0.952 0.953 0.954 0.954 0.954
## sigma 42.511 39.716 39.302 38.948 38.759 38.701
## F 299011.649 172556.049 117596.137 59925.300 40359.602 36436.869
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -89825.690 -88643.242 -88460.671 -88301.904 -88215.940 -88189.651
## Deviance 31402836.983 27407601.645 26837762.249 26351857.617 26092447.751 26013625.792
## AIC 179657.381 177294.484 176931.343 176619.808 176453.880 176403.301
## BIC 179680.670 177325.536 176970.158 176681.913 176539.274 176496.457
## N 17379 17379 17379 17379 17379 17379
## =============================================================================================================
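To the variables I mentioned in the section above, I added the “windspeed” variable to see if factoring it in would make my prediction model more accurate. It barely changes the fit: the adjusted R^2 stays at 0.954 and the residual standard error (sigma) drops only from 38.759 to 38.701. The final model (m6) has an R^2 of 0.955, meaning that it explains about 95.5% of the variance in bicycle rentals; this is a measure of fit rather than a confidence level.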
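The sigma and AIC rows of the table can be reproduced from the deviance, the log-likelihood and the number of coefficients, which helps in reading them. A minimal sketch for m1 and m6, with the figures typed in from the table above (the data frame `models` is a hypothetical helper; the coefficient counts are read off the table rows):

```r
n <- 17379  # observations, from the N row of the table

models <- data.frame(
  model    = c("m1", "m6"),
  deviance = c(31402836.983, 26013625.792),  # residual sum of squares
  loglik   = c(-89825.690, -88189.651),
  n_coef   = c(2, 11)                        # intercept plus slopes
)

# Residual standard error: typical size of a prediction error, in rentals.
models$sigma <- sqrt(models$deviance / (n - models$n_coef))

# AIC = -2 * logLik + 2 * (coefficients + 1 for the error variance).
models$aic <- -2 * models$loglik + 2 * (models$n_coef + 1)

print(models[, c("model", "sigma", "aic")])
```

Sigma is arguably the more practical number here: a value near 39 means that the fitted totals are typically off by a few dozen rentals, whichever of the six models is used.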
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Many of the variables interact and strengthen each other’s influence over bicycle rental numbers. For example, temperature, humidity and windspeed are tightly related to the weather situation. Each season has a weather situation that dominates a majority of the days in that season.
Were there any interesting or surprising interactions between features?
I did not see any surprising relationships between features in the dataset.
Did you create any models with your dataset? Discuss the strengths and limitations of your model.
Yes, I created a prediction model with the dataset. The main strength of my prediction model is that it explains about 95.5% of the variance in bicycle rental numbers (R^2 = 0.955). Some limitations of the model:
1) The dataset spans only about two years. As we collect more and more data, we may need to tweak the model.
2) The model does not take into account any special events happening in the city on a given day (this is actually a limitation of the dataset!). If we knew whether there were any significant events in the city (rally, game, etc.) on a given day, we could use that as an input into our model.
In this plot, we see how the registered bicycle rental numbers trump casual bicycle rentals. Looking at these trends gave us an early indication that looking at the number of registered rentals per hour/day would give us a good idea of what the total bicycle rental numbers would look like.
In the previous plot, we see the hourly trends for bicycle rentals across all the observations in the dataset, colored by season. In all seasons, one of the hourly peaks occurs at 8:00 AM and the second one at 5:00 PM. These peaks suggest that bicycle rentals are used by people who ride to work or school (since these activities tend to happen between fixed times every day). The bicycle rental numbers fall on either side of each peak, reaching their lowest points during the night hours. The general trend holds in all seasons, but the numbers are highest in the fall, followed by summer, winter and spring. I think this is mainly because of temperature, precipitation and other factors associated with the seasons.
In the previous plot, we see hourly trends for bicycle rentals across all observations in the dataset, colored by the prevailing weather situation. The trends we see in the plot colored by season are, to some extent, seen in this plot too. Let’s first consider three of the weather situations: clear, mild and light precipitation. We see the conventional “commuter peaks” at 8:00 AM and 5:00 PM, with ridership falling off on either side of the peaks. Interestingly, renter behavior is very different during periods of heavy precipitation. Total rentals during heavy precipitation stay close to none in the early hours of the day, peaking slightly at around 6:00 PM.
The bicycle rentals dataset consists of 17379 hourly observations of bicycle rentals at a rental agency in Washington DC. I chose this dataset because I wanted to gain some experience in choosing a tidy dataset to explore. I see that it is mostly weather-related features that closely influence the bicycle rental numbers on a given day. Factors like temperature, precipitation and windspeed tie closely into the prevailing weather situations and seasons. Therefore, to some extent, these factors influence and strengthen each other. Another feature that influences bicycle rentals is the hour of the day. I think this may be driven by office/school-goers who use bicycles to commute, and I see peaks occurring at the conventional working hours. Also, as expected, we see fewer bicycle rentals during the cold season and during periods of heavy precipitation.
I searched for a long time and at various sources to find a good, tidy dataset to practice my skills on. Many datasets were interesting, but did not meet the conditions for this project. I downloaded and melted several datasets from the gapminder collection and explored them, only to find that the main features did not influence each other and the prediction model that I tried to put together was very weak. Looking back, I consider those attempts to be good ones to cut my teeth on, just that they weren’t impressive enough for a final project.
Ultimately, I found the bicycle rentals dataset, which met all the criteria for the project. One of the first things I did was to plot a ggpairs chart and put together a linear regression model. When I had concluded that the dataset was interesting enough, I set out to explore it in a systematic manner. During the course of the project, I was able to put together a linear prediction model for bicycle rentals. The model explains about 95% of the variance in bicycle rentals.
In conclusion, I would like to correlate this dataset with events that happened in Washington DC on each of these days. That would help me see if casual bicycle rentals were driven by people wanting to avoid event-driven crowds on the popular transportation systems. Also, I would love to explore the data for a few more years than are contained in this dataset.
[1] Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.