Updated on Sun May 21 19:57:52 2017
Create an RMarkdown document named Bike_Sharing.Rmd. Include the code and markup for exercises E-L.
E. Exercise: Import the bike sharing data
library(readr)
bikeshare <- read_csv("bikesharedailydata.csv")
This data spans the District of Columbia, Arlington County, Alexandria, Montgomery County and Fairfax County. The Capital Bikeshare system is owned by the participating jurisdictions and is operated by Motivate, a Brooklyn, NY-based company that operates several other bikesharing systems including Citibike in New York City, Hubway in Boston and Divvy Bikes in Chicago.
F. Exercise: Take a look at the data
We preview the data using the head function to show the first few observations.
head(bikeshare)
## # A tibble: 6 × 16
## instant dteday season yr mnth holiday weekday workingday weathersit
## <int> <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 1 1/1/11 1 0 1 0 6 0 2
## 2 2 1/2/11 1 0 1 0 0 0 2
## 3 3 1/3/11 1 0 1 0 1 1 1
## 4 4 1/4/11 1 0 1 0 2 1 1
## 5 5 1/5/11 1 0 1 0 3 1 1
## 6 6 1/6/11 1 0 1 0 4 1 1
## # ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
## # windspeed <dbl>, casual <int>, registered <int>, cnt <int>
Next, we view the variables and types by using the str function.
str(bikeshare)
## Classes 'tbl_df', 'tbl' and 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "1/1/11" "1/2/11" "1/3/11" "1/4/11" ...
## $ season : int 1 1 1 1 1 1 NA 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 NA ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 16
## .. ..$ instant : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ dteday : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ season : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ yr : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ mnth : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ holiday : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ weekday : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ workingday: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ weathersit: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ temp : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ atemp : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ hum : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ windspeed : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ casual : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ registered: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ cnt : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
We see that the dataset contains 731 observations of 16 variables. We can use the dim function to see this as well.
dim(bikeshare)
## [1] 731 16
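The str output above shows that dteday is stored as character strings such as "1/1/11". The following is a minimal sketch of converting such strings to Date values, assuming a month/day/two-digit-year layout (inferred from the first rows, not confirmed by the data dictionary). The vector is a small hypothetical sample, not the full column.

```r
# Hypothetical sample of values in the style of bikeshare$dteday
dteday <- c("1/1/11", "1/2/11", "1/3/11", "12/31/12")

# Assumed format: month/day/two-digit year
dates <- as.Date(dteday, format = "%m/%d/%y")

class(dates)                 # "Date"
format(dates, "%Y-%m-%d")

# Numeric day of week, using the same coding as the weekday column
# (0 = Sunday ... 6 = Saturday)
weekdays_num <- as.POSIXlt(dates)$wday
print(weekdays_num)
```

Comparing the derived day of week with the existing weekday column is one way to cross-check that the assumed date format is right.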
G. Exercise: Understanding the variables
Take a look at the column named season. What is the meaning of season? What are the possible values for this variable? What type of variable is it?
bikeshare$season
## [1] 1 1 1 1 1 1 NA 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [24] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [47] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [70] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [93] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [116] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [139] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [162] 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3
## [185] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [208] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [231] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [254] 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4
## [277] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [300] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [323] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [346] 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [369] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [392] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [438] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [461] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [484] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [507] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [530] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [553] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [576] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [599] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [622] 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
## [645] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [668] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [691] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [714] 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1
unique(bikeshare$season)
## [1] 1 NA 2 3 4
class(bikeshare$season)
## [1] "integer"
We see that the season column contains the numbers 1, 2, 3 and 4, as well as a missing value (NA). The variable is stored as an integer.
The numbers represent the seasons of the year. From the data dictionary we know that 1 represents spring, 2 represents summer, 3 represents fall and 4 represents winter.
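Since the codes stand for categories rather than quantities, it can help to store them as a labelled factor. The sketch below uses a short hypothetical vector of season codes; the labels follow the mapping stated above.

```r
# Hypothetical sample of season codes, including one missing value
season <- c(1L, 1L, NA, 2L, 3L, 4L, 1L)

# Labels follow the mapping described in the text (1 = spring, ..., 4 = winter)
season_f <- factor(season,
                   levels = 1:4,
                   labels = c("spring", "summer", "fall", "winter"))

levels(season_f)
table(season_f, useNA = "ifany")  # counts per season, plus the NA
```

A factor keeps the missing value as NA, so it still has to be handled separately.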
H. Exercise: Renaming columns
Preparing your data
It may be useful to rename the columns in the dataset to improve the readability.
Let’s remind ourselves what the column names are.
names(bikeshare)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "holiday" "weekday" "workingday" "weathersit" "temp"
## [11] "atemp" "hum" "windspeed" "casual" "registered"
## [16] "cnt"
We can rename columns with the rename function from the dplyr package.
library(dplyr)
bikeshare <- rename(bikeshare, humidity = hum, date = dteday, year = yr, month = mnth, weather = weathersit, temperature = temp, feeltemp = atemp, count = cnt)
names(bikeshare)
## [1] "instant" "date" "season" "year" "month"
## [6] "holiday" "weekday" "workingday" "weather" "temperature"
## [11] "feeltemp" "humidity" "windspeed" "casual" "registered"
## [16] "count"
The names now are more reflective of the data.
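We can also rename columns with base R functions. For example, the following renames the column ‘yr’ to ‘year’. Because that column was already renamed above, the call leaves the names unchanged here.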
names(bikeshare)[names(bikeshare) == "yr"] <- "year"
names(bikeshare)
## [1] "instant" "date" "season" "year" "month"
## [6] "holiday" "weekday" "workingday" "weather" "temperature"
## [11] "feeltemp" "humidity" "windspeed" "casual" "registered"
## [16] "count"
I. Exercise: Dealing with missing values
It is important to ensure that the data is formatted appropriately. The rows should correspond to observations and the columns should correspond to the observed variables. This makes it easier to map the data to visual properties such as position, color, size, or shape. A preprocessing step is necessary to check the dataset for correctness and consistency. Incomplete information can easily lead to incorrect results.
Tactics
There are several ways to tackle data that are incomplete. Each has its pros and cons.
Problem
Row 7, column 3: The season variable has no value
Row 10, column 5: The month has no value.
Solution
In these two cases it’s easy to replace the missing value with a known value. We wouldn’t want to discard the records because the values can be easily determined. The missing “season” value is very likely 1, since the records on the preceding and following days are 1. Similarly, the missing “month” value (column “mnth” in the raw file) is very likely 1, since the preceding records all have month 1 and the rows are consecutive days starting on 1/1/11.
Updating the records
bikeshare$season[7]
## [1] NA
bikeshare$season[7] <- 1
bikeshare$season[7]
## [1] 1
bikeshare$month[10]
## [1] NA
bikeshare$month[10] <- 1
bikeshare$month[10]
## [1] 1
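The two gaps above were found by inspection. As a sketch of how they could be located programmatically, the example below uses a small hypothetical data frame with the same two gaps; the same calls should work on bikeshare itself.

```r
# Small stand-in data frame with the same two gaps (hypothetical values)
df <- data.frame(
  season = c(1, 1, 1, 1, 1, 1, NA, 1, 1, 1),
  month  = c(1, 1, 1, 1, 1, 1, 1, 1, 1, NA)
)

# Number of missing values per column
na_counts <- colSums(is.na(df))

# Row/column positions of every missing cell
na_positions <- which(is.na(df), arr.ind = TRUE)

print(na_counts)
print(na_positions)

# Fill the gaps as in the text, then confirm that none remain
df$season[7] <- 1
df$month[10] <- 1
stopifnot(!anyNA(df))
```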
J. Exercise: Understand - Calculate basic summary statistics
It is helpful to calculate some summary statistics about the dataset to learn more about the distribution, the median, minimum, maximum values, variance, standard deviation, number of observations and attributes.
summary(bikeshare)
## instant date season year
## Min. : 1.0 Length:731 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Mode :character Median :3.000 Median :1.0000
## Mean :366.0 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :4.000 Max. :1.0000
## month holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weather temperature feeltemp humidity
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered count
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
The summary function shows the minimum, quartiles, median, mean, and maximum for each numeric variable in the data set. This is particularly useful for quantitative variables such as temperature, count, casual, and registered. For example, you can easily see the average number of customers (casual and registered) per day. Note that no NAs are reported for “season” and “month”, because we filled in the two missing values in the previous exercise.
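summary does not report the variance or standard deviation, so these have to be computed separately. A minimal sketch, using the first ten cnt values from the str output earlier as sample data:

```r
# First ten daily counts, taken from the str() output shown earlier
count <- c(985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321)

spread <- c(variance = var(count), sd = sd(count), n = length(count))
print(round(spread, 2))

# The same idea applied to every numeric column of a data frame:
# sapply(bikeshare[sapply(bikeshare, is.numeric)], sd, na.rm = TRUE)
```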
K. Exercise: Understand - Visualize
Exploring the data visually is usually very helpful. As a first step, consider scatterplots to show relationships between variables, histograms for frequencies, density plots to show distributions, and box plots to show the range of values.
Kernel density plot
To see the distribution of ridership, we can use kernel density plots, which are an effective way to view the distribution of a variable. We create the plot using plot(density(x)), where x is a numeric vector.
Consider a density plot that shows the shape of the data for the number of riders per day.
density_riders = density(bikeshare$count)
plot(density_riders, main = "Number of riders per day", sub = paste("Mean =", round(mean(bikeshare$count), 2)), frame = FALSE)
polygon(density_riders, col="gray", border="gray")
We see from the density plot that daily ridership clusters around 4500 and has two smaller peaks around 2000 and 7000 per day.
Histogram
Consider a histogram that shows the frequency of the weather situation by day.
hist(bikeshare$weather, col="gray", border = "gray", xlab="Weather", main="Frequency of weather situations", xlim = c(1,4))
From the data dictionary we know that the weather codes have the following meanings:
Value | Meaning |
---|---|
1 | Clear, Few clouds, Partly cloudy |
2 | Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist |
3 | Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds |
4 | Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog |
From the histogram, we see that 1 happens on most days, while there is no day on which 4 occurred.
We can see this by calling the table function.
table(bikeshare$weather)
##
## 1 2 3
## 463 247 21
L. Exercise: Scatter plots
To see relationships, scatter plots are useful. In this case, we are looking for positive or negative correlations.
Scatter plot
Consider a simple scatter plot that shows the relationship between the rentals and temperature
plot(bikeshare$count, bikeshare$feeltemp, main= "Relationship between bike rentals and average daily temperature", frame=FALSE, xlab="Number of rentals per day", ylab="Normalized feels-like temperature", col = "blue")
Scatter plot with fit lines
To aid interpretation, it is helpful to add a linear regression line if the relationship is linear, or a lowess line otherwise. A lowess line is a locally weighted smoother, so it can follow curvature in the data that a straight line cannot.
plot(bikeshare$count, bikeshare$feeltemp, main= "Relationship between bike rentals and average daily temperature", frame=FALSE, xlab="Number of rentals per day", ylab="Normalized feels-like temperature")
# Add fit lines
abline(lm(bikeshare$feeltemp~bikeshare$count), col="blue") # regression line (y~x)
lines(lowess(bikeshare$count, bikeshare$feeltemp), col="orange") # lowess line (x,y)
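A correlation coefficient gives a number to go with the visual impression of a positive or negative relationship. The sketch below uses hypothetical stand-in values; with the real data, bikeshare$feeltemp and bikeshare$count would take their place.

```r
# Hypothetical stand-in values for feels-like temperature and daily rentals
feeltemp <- c(0.19, 0.21, 0.23, 0.35, 0.36, 0.48, 0.55, 0.61, 0.70, 0.78)
count    <- c(1349, 1562, 1600, 801, 985, 3100, 4200, 4800, 5600, 6100)

# Pearson measures linear association
r_pearson <- cor(feeltemp, count)

# Spearman is rank-based and measures monotone association
r_spearman <- cor(feeltemp, count, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

A clear gap between the two coefficients can hint that the relationship is monotone but not linear, which is the case where a lowess line is more informative than a straight one.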
Scatter plot with grouped categorical data (season)
It is useful to use color to group categorical data. In this example, we are grouping the points by season. We can do so using the ggvis package.
#static chart
library(ggvis)
bikeshare %>%
ggvis(x=~count, y=~feeltemp) %>%
layer_points(fill = ~factor(season)) %>%
add_axis("x", title = "Number of rentals per day") %>%
add_axis("y", title = "Normalized feels-like temperature")
Scatter plot with grouped categorical data (year)
We can also look at the data by year.
#static chart
library(ggvis)
bikeshare %>%
ggvis(x=~count, y=~feeltemp) %>%
layer_points(fill = ~factor(year)) %>%
add_axis("x", title = "Number of rentals per day") %>%
add_axis("y", title = "Normalized feels-like temperature")
Note from the data dictionary that year = 0 in this case refers to 2011. 1 refers to 2012.
Write a script to determine the average ridership on weekends versus weekdays. Next, let’s imagine it costs $10 per day to rent a bike on a weekday and $12 on a weekend. What is the annual weekday rental revenue in 2011 and 2012? What is the annual weekend revenue in 2011 and 2012?
Hint: Use a for loop and if/else logic.
First, we note that the days of the week are numbered as follows:
Value | Meaning |
---|---|
0 | Sunday |
1 | Monday |
2 | Tuesday |
3 | Wednesday |
4 | Thursday |
5 | Friday |
6 | Saturday |
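Let us compute the average ridership on weekends (i.e. Sunday and Saturday). The expression below adds up all rentals recorded on Sundays and on Saturdays across both years and divides by 2, so the result is the average two-year total per weekend day of the week, not the average for a single calendar day.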
ridership_weekends <- (sum(bikeshare$count[bikeshare$weekday == 0]) + sum(bikeshare$count[bikeshare$weekday == 6])) / 2
print(ridership_weekends)
## [1] 460917
Next, we compute the corresponding figure for weekdays (Monday through Friday): the two-year total for each weekday, averaged over the five weekdays.
ridership_weekdays <- (sum(bikeshare$count[bikeshare$weekday == 1]) + sum(bikeshare$count[bikeshare$weekday == 2]) + sum(bikeshare$count[bikeshare$weekday == 3]) + sum(bikeshare$count[bikeshare$weekday == 4]) + sum(bikeshare$count[bikeshare$weekday == 5]))/ 5
print(ridership_weekdays)
## [1] 474169
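The figures above are totals averaged over days of the week. If the average is wanted per calendar day instead, one option is to take the mean of the daily counts directly. This is a sketch on a hypothetical two-week sample, not the author's method; on the real data, bikeshare would replace df.

```r
# Hypothetical two-week sample: weekday codes (0 = Sunday ... 6 = Saturday)
# and daily rental counts
df <- data.frame(
  weekday = rep(c(6, 0, 1, 2, 3, 4, 5), times = 2),
  count   = c(900, 800, 1300, 1500, 1600, 1600, 1500,
              950, 820, 1320, 1260, 1160, 1400, 1420)
)

# TRUE for Sundays and Saturdays, FALSE otherwise
is_weekend <- df$weekday %in% c(0, 6)

avg_weekend <- mean(df$count[is_weekend])   # mean rentals per weekend day
avg_weekday <- mean(df$count[!is_weekend])  # mean rentals per weekday

print(c(weekend = avg_weekend, weekday = avg_weekday))
```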
The annual weekend rental revenue in 2011 is computed as follows:
rental_weekends2011 <- (sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 0]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 6])) * 12
print(rental_weekends2011)
## [1] 4281804
The annual weekend rental revenue in 2012 is computed as follows:
rental_weekends2012 <- (sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 0]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 6])) * 12
print(rental_weekends2012)
## [1] 6780204
The annual weekday rental revenue in 2011 is computed as follows:
rental_weekdays2011 <- (sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 1]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 2]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 3]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 4]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 5])) * 10
print(rental_weekdays2011)
## [1] 8862860
The annual weekday rental revenue in 2012 is computed as follows:
rental_weekdays2012 <- (sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 1]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 2]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 3]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 4]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 5])) * 10
print(rental_weekdays2012)
## [1] 14845590
We can also compute the above using a loop and if/else logic to make it easier to read the code and cross-check the computations (plus get some practice).
# extract the data for each day of the week
# (bikeshare$weekday is a plain vector, so the row filter gets a logical vector)
for (i in 0:6) {
  if (i == 0) {
    sundays <- bikeshare[bikeshare$weekday == i, ]
  } else if (i == 1) {
    mondays <- bikeshare[bikeshare$weekday == i, ]
  } else if (i == 2) {
    tuesdays <- bikeshare[bikeshare$weekday == i, ]
  } else if (i == 3) {
    wednesdays <- bikeshare[bikeshare$weekday == i, ]
  } else if (i == 4) {
    thursdays <- bikeshare[bikeshare$weekday == i, ]
  } else if (i == 5) {
    fridays <- bikeshare[bikeshare$weekday == i, ]
  } else {
    saturdays <- bikeshare[bikeshare$weekday == i, ]
  }
}
# extract the ridership for 2011 for weekends, multiply by $12
rental_weekends2011_loop <- sum(rbind(sundays[sundays$year == 0,"count"], saturdays[saturdays$year == 0,"count"]))*12
rental_weekends2011_loop <- format(rental_weekends2011_loop,big.mark=",")
print(paste("The annual weekend rental revenue in 2011 is $", rental_weekends2011_loop))
## [1] "The annual weekend rental revenue in 2011 is $ 4,281,804"
# extract the ridership for 2012 for weekends, multiply by $12
rental_weekends2012_loop <- sum(rbind(sundays[sundays$year == 1,"count"], saturdays[saturdays$year == 1,"count"]))*12
rental_weekends2012_loop <- format(rental_weekends2012_loop,big.mark=",")
print(paste("The annual weekend rental revenue in 2012 is $", rental_weekends2012_loop))
## [1] "The annual weekend rental revenue in 2012 is $ 6,780,204"
# extract the ridership for 2011 for weekdays, multiply by $10
rental_weekdays2011_loop <- sum(rbind(mondays[mondays$year == 0,"count"], tuesdays[tuesdays$year == 0, "count"], wednesdays[wednesdays$year == 0,"count"], thursdays[thursdays$year == 0,"count"],fridays[fridays$year == 0,"count"]))*10
rental_weekdays2011_loop <- format(rental_weekdays2011_loop,big.mark=",")
print(paste("The annual weekday rental revenue in 2011 is $", rental_weekdays2011_loop))
## [1] "The annual weekday rental revenue in 2011 is $ 8,862,860"
# extract the ridership for 2012 for weekdays, multiply by $10
rental_weekdays2012_loop <- sum(rbind(mondays[mondays$year == 1,"count"], tuesdays[tuesdays$year == 1, "count"], wednesdays[wednesdays$year == 1,"count"], thursdays[thursdays$year == 1,"count"],fridays[fridays$year == 1,"count"]))*10
rental_weekdays2012_loop <- format(rental_weekdays2012_loop,big.mark=",")
print(paste("The annual weekday rental revenue in 2012 is $", rental_weekdays2012_loop))
## [1] "The annual weekday rental revenue in 2012 is $ 14,845,590"
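As a further cross-check, the four revenue figures can be produced in one grouped aggregation. This is a sketch on hypothetical sample rows that use the renamed columns year, weekday and count; the helper columns day_type and revenue are introduced here for illustration only.

```r
# Hypothetical sample rows with the renamed columns year, weekday and count
df <- data.frame(
  year    = c(0, 0, 0, 0, 1, 1, 1, 1),
  weekday = c(6, 0, 1, 2, 6, 0, 1, 2),
  count   = c(100, 200, 300, 400, 150, 250, 350, 450)
)

# Price per rental: $12 on weekends (Sunday = 0, Saturday = 6), $10 otherwise
df$day_type <- ifelse(df$weekday %in% c(0, 6), "weekend", "weekday")
df$revenue  <- df$count * ifelse(df$day_type == "weekend", 12, 10)

# Total revenue for each year / day-type combination
revenue_by_group <- aggregate(revenue ~ year + day_type, data = df, FUN = sum)
print(revenue_by_group)
```

Run against bikeshare, the four rows of the result should match the four totals computed above.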
At this point in the process, you should have gained enough insight to frame a question to guide the rest of your analysis. Sometimes you don’t know what to ask of the data, and other times the questions you have cannot be answered by the data that you have. In most visual analytical explorations there will be a back and forth between defining the questions and identifying the data sources that contain the information you need to extract.
Often your question will fall into one of three categories: Past, present, or future.
Try to answer the following questions. Show your work as a data visualization.
Let’s take a quick visual look at the data distribution using kernel density plots to see if weather affects rental behaviour.
library(ggplot2)
ggplot(bikeshare, aes(count)) +
geom_density() +
facet_wrap(~ weather) +
xlab("Number of rentals per day") +
ylab("Density") +
ggtitle("Distribution of Daily Ridership under Different Weather Situations")
We see that the distribution of daily ridership varies under different weather situations. When the weather is fairly good (i.e. 1 and 2), the ridership distributions are broadly similar, although we move from three visible peaks in 1 to two peaks in 2. However, when the weather is poor (i.e. 3) with rain and snow, the ridership distribution is strongly skewed and ridership falls noticeably.
Let’s dig a bit deeper to see if there is a relationship between temperature and ridership under the various weather situations.
library(ggplot2)
ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weather))) +
geom_point(size = 1) +
geom_smooth(method = "lm") +
xlab("Normalized feels-like temperature") +
ylab("Number of rentals per day") +
ggtitle("Impact of Weather & Temperature on Daily Ridership")
We can make several observations from the plot:
* There is a positive relationship between temperature and ridership under all three weather situations
* The relationships are similar (similar slopes)
* The variability under weather 3 is wider than under weathers 1 and 2 (note that weather 3 occurs far less often, though)
* Weather 3 affects rental behaviour most. Ridership under weather 3 is noticeably lower.
* Given the same temperature, ridership is consistently higher when weather is good (1).
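The observation about similar slopes can be checked numerically by fitting a separate linear model for each weather code and comparing the feeltemp coefficients. The sketch below runs on simulated stand-in data, so its numbers say nothing about the real dataset; with the real data, bikeshare would replace df.

```r
# Simulated stand-in data: feels-like temperature, weather code, daily count
set.seed(1)
n <- 60
df <- data.frame(
  feeltemp = runif(n, 0.1, 0.8),
  weather  = rep(1:3, each = n / 3)
)
df$count <- 1000 + 6000 * df$feeltemp - 500 * (df$weather - 1) +
  rnorm(n, sd = 300)

# Fit count ~ feeltemp separately for each weather code and keep the slope
slopes <- sapply(split(df, df$weather), function(d) {
  coef(lm(count ~ feeltemp, data = d))[["feeltemp"]]
})
print(round(slopes, 1))
```

With few observations in a group, as for weather 3 in the real data, the estimated slope is imprecise and should be read with caution.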
Let’s see if these relationships hold across all seasons.
ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weather))) +
geom_point(size = 1) +
geom_smooth(method = "lm") +
facet_wrap(~ season, nrow = 1) +
xlab("Normalized feels-like temperature") +
ylab("Number of rentals per day") +
ggtitle("Impact of Weather & Temperature on Daily Ridership Across Seasons")
Interestingly, the relationships do not hold across seasons:
* In spring and summer (i.e. 1 and 2), there seems to be a negative relationship between temperature and ridership when the weather is poor (3). However, the data is sparse, so the relationship may not be robust
* In fall (3), there is notably a negative relationship between temperature and ridership when the weather is good (1). There seems to be no relationship between temperature and ridership when the weather is poor (3), but again, note the sparseness of the data
* In winter (4), the relationships are positive, which makes intuitive sense: one would expect people to travel more when the temperature rises in winter
Let’s see if these relationships hold in both years.
ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weather))) +
geom_point(size = 1) +
geom_smooth(method = "lm") +
facet_wrap(~ year) +
xlab("Normalized feels-like temperature") +
ylab("Number of rentals per day") +
ggtitle("Impact of Weather & Temperature on Daily Ridership Across Years")
The relationships do seem to hold across both years. The variability of the data for weather = 3 in 2012 is noticeably wider, though.
Let’s see if these relationships hold across days.
ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weather))) +
geom_point(size = 1) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ weekday) +
xlab("Normalized feels-like temperature") +
ylab("Number of rentals per day") +
ggtitle("Impact of Weather & Temperature on Daily Ridership Across Days")
The relationships seem to be broadly similar across days (ignoring weather = 3 given the sparseness of the data).
From the visualizations, it is clear that weather conditions do affect rental behaviors.
What about precipitation? Does it affect rental behaviour? What about day of week, season, hour of the day, etc.?
Let’s recall what variables are available in the dataset.
names(bikeshare)
## [1] "instant" "date" "season" "year" "month"
## [6] "holiday" "weekday" "workingday" "weather" "temperature"
## [11] "feeltemp" "humidity" "windspeed" "casual" "registered"
## [16] "count"
There is no data on precipitation to allow us to investigate whether precipitation affects rental behaviour. That said, to the extent that weather = 3 reflects a rainy day, one could infer that precipitation would affect rental behaviour.
Similarly, there is no data on hour of the day to allow us to investigate whether the hour of the day affects rental behaviour.
Let’s examine day of the week more closely.
library(ggplot2)
ggplot(bikeshare, aes(count)) +
geom_density() +
facet_wrap(~ weekday) +
xlab("Number of rentals per day") +
ylab("Density") +
ggtitle("Distribution of Daily Ridership by Day")
There are some noticeable differences in the distributions. Tuesday, in particular, is relatively more leptokurtic.
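Leptokurtic means positive excess kurtosis, that is, a sharper peak and heavier tails than a normal distribution. A minimal sketch of computing it per weekday in base R follows. The excess_kurtosis helper is a hypothetical name defined here, and the data is simulated, so the printed values do not describe the real dataset.

```r
# Moment-based excess kurtosis (approximately 0 for normally distributed data)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)  # second central moment
  m4 <- mean((x - mean(x))^4)  # fourth central moment
  m4 / m2^2 - 3
}

# Simulated stand-in data; with the real data, use bikeshare$count and
# bikeshare$weekday instead
set.seed(42)
df <- data.frame(
  weekday = rep(0:6, each = 50),
  count   = round(rnorm(350, mean = 4500, sd = 1900))
)

kurt_by_day <- tapply(df$count, df$weekday, excess_kurtosis)
print(round(kurt_by_day, 2))
```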
A similar picture appears when we plot the ridership frequency for each day using a histogram.
library(ggplot2)
ggplot(bikeshare, aes(count, fill = factor(weekday))) +
geom_histogram() +
facet_wrap(~ weekday, ncol = 3) +
xlab("Number of rentals per day") +
ggtitle("Histogram of Daily Ridership by Day")
Next, we investigate whether there is a relationship between temperature and ridership across days.
library(ggplot2)
ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weekday))) +
geom_point(size = 1) +
geom_smooth(method = "lm", se = FALSE) +
xlab("Normalized feels-like temperature") +
ylab("Number of rentals per day") +
ggtitle("Impact of Day & Temperature on Daily Ridership")
The relationship is positive across all days, with the regression lines exhibiting similar slopes and all bunched together.
Likewise, there is no evidence that the day of the week affects rental behaviour when we plot ridership against other factors such as humidity and windspeed.
library(ggplot2)
ggplot(bikeshare, aes(x = humidity, y = count, colour = factor(weekday))) +
geom_point(size = 1) +
geom_smooth(method = "lm", se = FALSE) +
xlab("Humidity") +
ylab("Number of rentals per day") +
ggtitle("Impact of Day & Humidity on Daily Ridership")
library(ggplot2)
ggplot(bikeshare, aes(x = windspeed, y = count, colour = factor(weekday))) +
geom_point(size = 1) +
geom_smooth(method = "lm", se = FALSE) +
xlab("Windspeed") +
ylab("Number of rentals per day") +
ggtitle("Impact of Day & Windspeed on Daily Ridership")
The visualizations suggest that days of the week do not affect rental behaviour.
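The plots compare days one factor at a time. To hold temperature, humidity and windspeed constant at once, one option is to compare two regression models, one with and one without day-of-week indicators. This is a sketch on simulated data, generated here with no weekday effect, not a result for the real dataset; with the real data, bikeshare would replace df.

```r
# Simulated stand-in data (no weekday effect is built into count)
set.seed(7)
n <- 210
df <- data.frame(
  weekday   = rep(0:6, times = n / 7),
  feeltemp  = runif(n, 0.1, 0.8),
  humidity  = runif(n, 0.3, 0.9),
  windspeed = runif(n, 0.05, 0.4)
)
df$count <- 2000 + 6000 * df$feeltemp - 1500 * df$humidity -
  2000 * df$windspeed + rnorm(n, sd = 500)

# Models without and with day-of-week indicators
fit_base <- lm(count ~ feeltemp + humidity + windspeed, data = df)
fit_day  <- lm(count ~ feeltemp + humidity + windspeed + factor(weekday),
               data = df)

# F-test: do the weekday indicators add explanatory power?
comparison <- anova(fit_base, fit_day)
print(comparison)
```

A large p-value in the second row would be consistent with the visual impression that the day of the week adds little once the other variables are accounted for.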