Foundations of Statistics Using R: Homework

Updated on Sun May 21 19:57:52 2017

N. Homework: Communicate - Create an RMarkdown document

Create an RMarkdown document named Bike_Sharing.Rmd. Include the code and markup for exercises E-L.

E. Exercise: Import the bike sharing data

library(readr)
bikeshare <- read_csv("bikesharedailydata.csv")

This data spans the District of Columbia, Arlington County, Alexandria, Montgomery County and Fairfax County. The Capital Bikeshare system is owned by the participating jurisdictions and is operated by Motivate, a Brooklyn, NY-based company that operates several other bikesharing systems including Citibike in New York City, Hubway in Boston and Divvy Bikes in Chicago.

F. Exercise: Take a look at the data

We preview the data using the head function to show the first few observations.

head(bikeshare)

## # A tibble: 6 × 16
##   instant dteday season    yr  mnth holiday weekday workingday weathersit
##     <int>  <chr>  <int> <int> <int>   <int>   <int>      <int>      <int>
## 1       1 1/1/11      1     0     1       0       6          0          2
## 2       2 1/2/11      1     0     1       0       0          0          2
## 3       3 1/3/11      1     0     1       0       1          1          1
## 4       4 1/4/11      1     0     1       0       2          1          1
## 5       5 1/5/11      1     0     1       0       3          1          1
## 6       6 1/6/11      1     0     1       0       4          1          1
## # ... with 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>,
## #   windspeed <dbl>, casual <int>, registered <int>, cnt <int>

Next, we view the variables and types by using the str function.

str(bikeshare)

## Classes 'tbl_df', 'tbl' and 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "1/1/11" "1/2/11" "1/3/11" "1/4/11" ...
##  $ season    : int  1 1 1 1 1 1 NA 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 NA ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 16
##   .. ..$ instant   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ dteday    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ season    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ yr        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ mnth      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ holiday   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ weekday   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ workingday: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ weathersit: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ temp      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ atemp     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ hum       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ windspeed : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ casual    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ registered: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ cnt       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

We see that the dataset contains 731 observations of 16 variables. We can use the dim function to see this as well.

dim(bikeshare)

## [1] 731  16

G. Exercise: Understanding the variables

Take a look at the column named season. What is the meaning of season? What are the possible values for this variable? What type of variable is it?

bikeshare$season

##   [1]  1  1  1  1  1  1 NA  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [24]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [47]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##  [70]  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2
##  [93]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [116]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [139]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [162]  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3
## [185]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [208]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [231]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [254]  3  3  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4
## [277]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [300]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [323]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [346]  4  4  4  4  4  4  4  4  4  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [369]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [392]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [415]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [438]  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [461]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [484]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [507]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
## [530]  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [553]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [576]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [599]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [622]  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  4  4  4
## [645]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [668]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [691]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [714]  4  4  4  4  4  4  4  1  1  1  1  1  1  1  1  1  1  1

unique(bikeshare$season)

## [1]  1 NA  2  3  4

class(bikeshare$season)

## [1] "integer"

We see that the season column contains the numbers 1, 2, 3 and 4, as well as a missing value (NA). The variable is stored as an integer, although it represents a categorical variable.

The numbers represent the seasons of the year. From the data dictionary we know that 1 represents spring, 2 represents summer, 3 represents fall and 4 represents winter.
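
Because the codes stand for categories, it can help to attach the labels from the data dictionary. The following is a minimal sketch on a hypothetical toy vector (season_codes is an assumption, not the real column) that converts the integer codes into a labeled factor.

```r
# Hypothetical toy vector standing in for bikeshare$season
season_codes <- c(1, 1, NA, 2, 3, 4, 4)

# Map the integer codes to the labels given in the data dictionary
season_labels <- factor(
  season_codes,
  levels = 1:4,
  labels = c("spring", "summer", "fall", "winter")
)

# Count observations per season, keeping the missing value visible
season_counts <- table(season_labels, useNA = "ifany")
print(season_counts)
```

A factor keeps the ordering of the seasons and lets plotting functions treat the variable as discrete rather than continuous.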

H. Exercise: Renaming columns
Preparing your data

It may be useful to rename the columns in the dataset to improve the readability.
Let’s remind ourselves what the column names are.

names(bikeshare)

##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "hum"        "windspeed"  "casual"     "registered"
## [16] "cnt"

We can rename columns with the rename function from the dplyr library.

library(dplyr)
bikeshare <- rename(bikeshare, humidity = hum, date = dteday, year = yr, month = mnth, weather = weathersit, temperature = temp, feeltemp = atemp, count = cnt)
names(bikeshare)

##  [1] "instant"     "date"        "season"      "year"        "month"      
##  [6] "holiday"     "weekday"     "workingday"  "weather"     "temperature"
## [11] "feeltemp"    "humidity"    "windspeed"   "casual"      "registered" 
## [16] "count"

The names now are more reflective of the data.

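We can also rename columns with base R functions. For example, the following statement renames the column ‘yr’ to ‘year’. Because the column was already renamed with dplyr above, the statement matches nothing here and the names are unchanged.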

names(bikeshare)[names(bikeshare) == "yr"] <- "year"
names(bikeshare)

##  [1] "instant"     "date"        "season"      "year"        "month"      
##  [6] "holiday"     "weekday"     "workingday"  "weather"     "temperature"
## [11] "feeltemp"    "humidity"    "windspeed"   "casual"      "registered" 
## [16] "count"

I. Exercise: Dealing with missing values

It is important to ensure that the data is formatted appropriately. The rows should correspond to observations and the columns should correspond to the observed variables. This makes it easier to map the data to visual properties such as position, color, size, or shape. A preprocessing step is necessary to check the dataset for correctness and consistency, because incomplete information can easily lead to incorrect results.

Tactics

There are several ways to tackle data that are incomplete. Each has its pros and cons.

Ignore any record with missing values
Replace empty fields with a pre-defined value
Replace empty fields with the most frequently appeared value
Use the mean value
Manual approach
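
The first four tactics can be sketched in a few lines of base R. The vector x below is a hypothetical example, and the fill value 0 in the second tactic is an arbitrary choice made only for illustration.

```r
# Hypothetical vector with one missing entry
x <- c(1, 1, 1, NA, 2, 2, 1)

# Tactic 1: ignore the record with the missing value
dropped <- x[!is.na(x)]

# Tactic 2: replace with a pre-defined value (0 is an arbitrary choice)
constant_fill <- ifelse(is.na(x), 0, x)

# Tactic 3: replace with the most frequently appearing value
most_frequent <- as.numeric(names(which.max(table(x))))
mode_fill <- ifelse(is.na(x), most_frequent, x)

# Tactic 4: replace with the mean of the observed values
mean_fill <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
```

For a coded categorical variable such as season, the mean is usually not meaningful, which is one reason to prefer a known or most frequent value there.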

Problem

Row 7, column 3: The season variable has no value
Row 10, column 5: The month has no value.

Solution
In these two cases it’s easy to replace the value with a known value. We wouldn’t want to ignore the records because the values can be easily determined. The missing value in “season” is likely to be 1, since the records on the preceding and following days are 1. The same reasoning applies to the missing value in “month”.

Updating the records

bikeshare$season[7]

## [1] NA

bikeshare$season[7] <- 1
bikeshare$season[7]

## [1] 1

bikeshare$month[10]

## [1] NA

bikeshare$month[10] <- 1
bikeshare$month[10]

## [1] 1

J. Exercise: Understand - Calculate basic summary statistics

It is helpful to calculate some summary statistics about the dataset to learn more about the distribution, the median, minimum, maximum values, variance, standard deviation, number of observations and attributes.

summary(bikeshare)

##     instant          date               season           year       
##  Min.   :  1.0   Length:731         Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   Class :character   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Mode  :character   Median :3.000   Median :1.0000  
##  Mean   :366.0                      Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5                      3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0                      Max.   :4.000   Max.   :1.0000  
##      month          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##     weather       temperature         feeltemp          humidity     
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered       count     
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714

The summary function shows the mean, median, quartiles, minimum, and maximum values for each variable in the data set. This is particularly useful for numeric variables such as temperature, count, casual, and registered. For example, you can easily see the average number of customers (casual and registered) per day. Note that summary reports no NAs for “season” and “month”, because we filled in the two missing values in the previous exercise.
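
The summary function does not report the variance or standard deviation. A minimal sketch of how these could be computed per column is shown below; the toy data frame reuses the first five count and casual values from the preview above.

```r
# Toy data: the first five rows of count and casual from the preview
toy <- data.frame(
  count = c(985, 801, 1349, 1562, 1600),
  casual = c(331, 131, 120, 108, 82)
)

# One row per statistic, one column per variable
spread <- sapply(toy, function(v) {
  c(variance = var(v), sd = sd(v), n = length(v))
})
print(round(spread, 2))
```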

K. Exercise: Understand - Visualize

Exploring the data visually is usually very helpful. As a first step, consider scatterplots to show relationships between variables, histograms for frequencies, density plots to show distributions, and box plots to show the range of values.

Kernel density plot

To see the distribution of the ridership, we can use a kernel density plot, which is an effective way to view the distribution of a variable. We create the plot using plot(density(x)), where x is a numeric vector.

Consider a density plot that shows the shape of the data for the number of riders per day.

density_riders <- density(bikeshare$count)
# Show the mean in the subtitle; paste() joins the label and the rounded value
plot(density_riders, main = "Number of riders per day", sub = paste("Mean =", round(mean(bikeshare$count), 2)), frame = FALSE)
polygon(density_riders, col="gray", border="gray")

We see from the density plot that ridership clusters around 4,500 per day, with two smaller peaks around 2,000 and 7,000 per day.

Histogram

Consider a histogram that shows the frequency of the weather situation by day.

hist(bikeshare$weather, col="gray", border = "gray", xlab="Weather", main="Frequency of weather situations", xlim = c(1,4))

From the data dictionary we know that 1 refers to “Clear, Few clouds, Partly cloudy, Partly cloudy”, 2 refers to “Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist”, 3 refers to “Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds”, while 4 refers to “Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog”.

Value	Meaning
1	Clear, Few clouds, Partly cloudy, Partly cloudy
2	Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3	Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4	Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

From the histogram, we see that 1 happens on most days, while there is no day on which 4 occurred.

We can confirm this by calling the table function.

table(bikeshare$weather)

## 
##   1   2   3 
## 463 247  21
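
The table above omits category 4 because it never occurs. As a sketch on hypothetical codes (weather_codes is an assumption), declaring all four levels keeps the empty category visible, and a bar plot then draws one bar per category.

```r
# Hypothetical weather codes; code 4 never occurs, as in the data above
weather_codes <- c(1, 1, 2, 1, 3, 2, 1)

# Declaring all four levels keeps the empty category in the table
weather_counts <- table(factor(weather_codes, levels = 1:4))
print(weather_counts)

# A bar plot draws one bar per category, including the empty one
barplot(weather_counts, xlab = "Weather", ylab = "Number of days")
```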

L. Exercise: Scatter plots

To see relationships, scatter plots are useful. In this case, we are looking for positive or negative correlations.

Scatter plot

Consider a simple scatter plot that shows the relationship between the rentals and temperature

plot(bikeshare$count, bikeshare$feeltemp, main = "Relationship between bike rentals and average daily temperature", frame = FALSE, xlab = "Number of rentals per day", ylab = "Normalized feels-like temperature", col = "blue")

Scatter plot with fit lines

To aid interpretation, it is helpful to add a linear regression line if the relationship is linear, or a lowess line otherwise. A lowess line is a locally weighted smoother, so it follows local trends in the data more closely than a single straight line.

plot(bikeshare$count, bikeshare$feeltemp, main = "Relationship between bike rentals and average daily temperature", frame = FALSE, xlab = "Number of rentals per day", ylab = "Normalized feels-like temperature")

# Add fit lines
abline(lm(bikeshare$feeltemp~bikeshare$count), col="blue") # regression line (y~x) 
lines(lowess(bikeshare$count, bikeshare$feeltemp), col="orange") # lowess line (x,y)
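
The direction and strength of a linear relationship can also be quantified. The sketch below uses simulated data (the coefficients 1000 and 6000 and the noise level are assumptions chosen for illustration) and treats rentals as the response, which is a modelling choice and differs from the axis order used in the plots above.

```r
# Hypothetical toy data: rentals rise with temperature, plus some noise
set.seed(1)
feeltemp <- seq(0.1, 0.8, length.out = 50)
count <- 1000 + 6000 * feeltemp + rnorm(50, sd = 300)

# Pearson correlation: the sign gives direction, the magnitude gives strength
r <- cor(feeltemp, count)

# Regress the response (rentals) on the predictor (temperature)
fit <- lm(count ~ feeltemp)
slope <- coef(fit)[["feeltemp"]]

print(c(correlation = r, slope = slope))
```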

Scatter plot with grouped categorical data (season)

It is useful to use color to group categorical data. In this example, we are grouping the points by season. We can do so using the ggvis package.

# static chart
library(ggvis)
# season is stored as an integer, so convert it to a factor for discrete colors
bikeshare %>%
  ggvis(x = ~count, y = ~feeltemp) %>%
  layer_points(fill = ~factor(season)) %>%
  add_axis("x", title = "Number of rentals per day") %>%
  add_axis("y", title = "Normalized feels-like temperature")

Scatter plot with grouped categorical data (year)

We can also look at the data by year.

# static chart
library(ggvis)
# year is stored as an integer, so convert it to a factor for discrete colors
bikeshare %>%
  ggvis(x = ~count, y = ~feeltemp) %>%
  layer_points(fill = ~factor(year)) %>%
  add_axis("x", title = "Number of rentals per day") %>%
  add_axis("y", title = "Normalized feels-like temperature")

Note from the data dictionary that year = 0 refers to 2011 and year = 1 refers to 2012.

R. Homework: Determine the average ridership

Write a script to determine the average ridership on weekends versus weekdays. Next, let’s imagine it costs $10 per day to rent a bike on a weekday and $12 on a weekend. What is the annual weekday rental revenue in 2011 and 2012? What is the annual weekend revenue in 2011 and 2012?

Hint: Use a for loop and if/else logic.

First, we note that the days of the week are numbered as follows:

Value	Meaning
0	Sunday
1	Monday
2	Tuesday
3	Wednesday
4	Thursday
5	Friday
6	Saturday

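Let us compute the average ridership on weekends (i.e. Sunday and Saturday). The code sums all rentals recorded on Sundays and on Saturdays over both years and divides by 2, so the result is the average two-year total per weekend day of the week rather than the mean number of rentals per calendar day.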

ridership_weekends <- (sum(bikeshare$count[bikeshare$weekday == 0]) + sum(bikeshare$count[bikeshare$weekday == 6])) / 2
print(ridership_weekends)

## [1] 460917

Next, we compute the same quantity for weekdays (Monday through Friday): the two-year total divided by the 5 weekday types.

ridership_weekdays <- (sum(bikeshare$count[bikeshare$weekday == 1]) + sum(bikeshare$count[bikeshare$weekday == 2]) + sum(bikeshare$count[bikeshare$weekday == 3]) + sum(bikeshare$count[bikeshare$weekday == 4]) + sum(bikeshare$count[bikeshare$weekday == 5]))/ 5
print(ridership_weekdays)

## [1] 474169
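
If the quantity of interest is the mean number of rentals per calendar day, one possible approach is to average the daily counts directly. This is a sketch on a hypothetical nine-day data frame (toy is an assumption), not a recomputation on the real data.

```r
# Hypothetical toy data: one row per day, weekday coded 0 (Sunday) to 6 (Saturday)
toy <- data.frame(
  weekday = c(6, 0, 1, 2, 3, 4, 5, 6, 0),
  count = c(100, 200, 300, 400, 500, 600, 700, 300, 400)
)

# Flag weekend rows, then average the daily counts within each group
is_weekend <- toy$weekday %in% c(0, 6)
mean_weekend <- mean(toy$count[is_weekend])
mean_weekday <- mean(toy$count[!is_weekend])

print(c(weekend = mean_weekend, weekday = mean_weekday))
```

Dividing by the number of days in each group, rather than by the number of day types, makes the two figures directly comparable.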

The annual weekend rental revenue in 2011 is computed as follows:

rental_weekends2011 <- (sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 0]) +  sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 6])) * 12
print(rental_weekends2011)

## [1] 4281804

The annual weekend rental revenue in 2012 is computed as follows:

rental_weekends2012 <- (sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 0]) +  sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 6])) * 12
print(rental_weekends2012)

## [1] 6780204

The annual weekday rental revenue in 2011 is computed as follows:

rental_weekdays2011 <- (sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 1]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 2]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 3]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 4]) + sum(bikeshare$count[bikeshare$year == 0 & bikeshare$weekday == 5])) * 10
print(rental_weekdays2011)

## [1] 8862860

The annual weekday rental revenue in 2012 is computed as follows:

rental_weekdays2012 <- (sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 1]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 2]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 3]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 4]) + sum(bikeshare$count[bikeshare$year == 1 & bikeshare$weekday == 5])) * 10
print(rental_weekdays2012)

## [1] 14845590

We can also compute the above using a loop and if/else logic to make it easier to read the code and cross-check the computations (plus get some practice).

# extract the data for each day of the week 
for (i in c(0:6)){ 
  if (i == 0){
    sundays <- bikeshare[bikeshare[, "weekday"] == i,]}
  else if (i == 1){
    mondays <- bikeshare[bikeshare[, "weekday"] == i,]}
  else if (i == 2){
    tuesdays <- bikeshare[bikeshare[, "weekday"] == i,]}
  else if (i == 3){
    wednesdays <- bikeshare[bikeshare[, "weekday"] == i,]}
  else if (i == 4){
    thursdays <- bikeshare[bikeshare[, "weekday"] == i,]}
  else if (i == 5){
    fridays <- bikeshare[bikeshare[, "weekday"] == i,]}
  else{
    saturdays <- bikeshare[bikeshare[, "weekday"] == i,]}
}

# extract the ridership for 2011 for weekends, multiply by $12
rental_weekends2011_loop <- sum(rbind(sundays[sundays$year == 0,"count"], saturdays[saturdays$year == 0,"count"]))*12
rental_weekends2011_loop <- format(rental_weekends2011_loop,big.mark=",")
print(paste("The annual weekend rental revenue in 2011 is $", rental_weekends2011_loop))

## [1] "The annual weekend rental revenue in 2011 is $ 4,281,804"

# extract the ridership for 2012 for weekends, multiply by $12
rental_weekends2012_loop <- sum(rbind(sundays[sundays$year == 1,"count"], saturdays[saturdays$year == 1,"count"]))*12
rental_weekends2012_loop <- format(rental_weekends2012_loop,big.mark=",")
print(paste("The annual weekend rental revenue in 2012 is $", rental_weekends2012_loop))

## [1] "The annual weekend rental revenue in 2012 is $ 6,780,204"

# extract the ridership for 2011 for weekdays, multiply by $10
rental_weekdays2011_loop <- sum(rbind(mondays[mondays$year == 0,"count"], tuesdays[tuesdays$year == 0, "count"], wednesdays[wednesdays$year == 0,"count"], thursdays[thursdays$year == 0,"count"],fridays[fridays$year == 0,"count"]))*10
rental_weekdays2011_loop <- format(rental_weekdays2011_loop,big.mark=",")
print(paste("The annual weekday rental revenue in 2011 is $", rental_weekdays2011_loop))

## [1] "The annual weekday rental revenue in 2011 is $ 8,862,860"

# extract the ridership for 2012 for weekdays, multiply by $10
rental_weekdays2012_loop <- sum(rbind(mondays[mondays$year == 1,"count"], tuesdays[tuesdays$year == 1, "count"], wednesdays[wednesdays$year == 1,"count"], thursdays[thursdays$year == 1,"count"],fridays[fridays$year == 1,"count"]))*10
rental_weekdays2012_loop <- format(rental_weekdays2012_loop,big.mark=",")
print(paste("The annual weekday rental revenue in 2012 is $", rental_weekdays2012_loop))

## [1] "The annual weekday rental revenue in 2012 is $ 14,845,590"
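
The same revenue figures can also be obtained without a loop. The sketch below is one possible vectorized version on a hypothetical six-row data frame (toy, day_type and revenue are illustrative names); it applies the $12 weekend and $10 weekday prices from the exercise and sums by year and day type.

```r
# Hypothetical toy data: year coded 0/1, weekday coded 0 (Sunday) to 6 (Saturday)
toy <- data.frame(
  year = c(0, 0, 0, 1, 1, 1),
  weekday = c(0, 1, 6, 2, 6, 3),
  count = c(10, 20, 30, 40, 50, 60)
)

# Classify each day, then price it: $12 on weekends, $10 on weekdays
toy$day_type <- ifelse(toy$weekday %in% c(0, 6), "weekend", "weekday")
toy$revenue <- toy$count * ifelse(toy$day_type == "weekend", 12, 10)

# Sum revenue for every combination of year and day type
revenue_table <- tapply(toy$revenue, list(toy$year, toy$day_type), sum)
print(revenue_table)
```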

V. Homework: Devise the problem, challenge, and/or questions

At this point in the process, you should have gained enough insight to frame a question to guide the rest of your analysis. Sometimes you don’t know what to ask of the data, and other times the questions you have cannot be answered by the data that you have. In most visual analytical explorations there will be a back and forth between defining the questions and identifying the data sources that contain the information you need to extract.

Often your question will fall into one of three categories: Past, present, or future.

Try to answer the following questions. Show your work as a data visualization.

Do weather conditions affect rental behaviors?
Does the precipitation, day of week, season, hour of the day, etc. affect rental behavior?
Which weather conditions affect behavior the most? Do they differ by season?

Let’s take a quick visual look at the data distribution using kernel density plots to see if weather affects rental behaviour.

library(ggplot2)
ggplot(bikeshare, aes(count)) + 
  geom_density() +
  facet_wrap(~ weather) +
  xlab("Number of rentals per day") + 
  ylab("Density") + 
  ggtitle("Distribution of Daily Ridership under Different Weather Situations")

We see that the distribution of the daily ridership varies under different weather situations. When the weather is fairly good (i.e. 1 and 2), the ridership distribution is broadly similar, although we move from three visible peaks in 1 to two peaks in 2. However, when the weather is poor (i.e. 3), with rain and snow, the ridership distribution is very skewed and ridership falls noticeably.

Let’s dig a bit deeper to see if there is a relationship between temperature and ridership under the various weather situations.

library(ggplot2)
ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weather))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm") +
  xlab("Normalized feels-like temperature") + 
  ylab("Number of rentals per day") + 
  ggtitle("Impact of Weather & Temperature on Daily Ridership")

We can make several observations from the plot:
* There is a positive relationship between temperature and ridership under all three weather situations.
* The relationships are similar (similar slopes).
* The variability under weather 3 is wider than under weathers 1 and 2 (note, though, that weather 3 occurs far less often).
* Weather 3 affects rental behaviour the most: ridership under weather 3 is noticeably lower.
* Given the same temperature, ridership is consistently higher when the weather is good (1).
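
Whether the slopes are similar can be checked numerically by fitting one regression per weather group. The sketch below uses a hypothetical toy data frame built with a known slope of 4000 in every group (all numbers are assumptions for illustration), so the extracted slopes should agree.

```r
# Hypothetical toy data with a known slope in each weather group
toy <- data.frame(
  weather = rep(c(1, 2, 3), each = 4),
  feeltemp = rep(c(0.2, 0.4, 0.6, 0.8), times = 3)
)
# Different intercepts per group, same slope of 4000
toy$count <- c(5000, 4000, 2000)[toy$weather] + 4000 * toy$feeltemp

# Fit one regression per weather group and extract the slopes
slopes <- sapply(split(toy, toy$weather), function(d) {
  coef(lm(count ~ feeltemp, data = d))[["feeltemp"]]
})
print(slopes)
```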

Let’s see if these relationships hold across all seasons.

ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weather))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm") +
  facet_wrap(~ season, nrow = 1) + 
  xlab("Normalized feels-like temperature") + 
  ylab("Number of rentals per day") + 
  ggtitle("Impact of Weather & Temperature on Daily Ridership Across Seasons")

Interestingly, the relationships do not hold across seasons:
* In spring and summer (i.e. 1 and 2), there seems to be a negative relationship between temperature and ridership when the weather is poor (3). However, the data is sparse, so the relationship may not be robust.
* In autumn (3), there is notably a negative relationship between temperature and ridership when the weather is good (1). There seems to be no relationship between temperature and ridership when the weather is poor (3), but again, note the sparseness of the data.
* In winter (4), the relationships are positive, which makes intuitive sense: one would expect people to travel more when the temperature rises in winter.

Let’s see if these relationships hold in both years.

ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weather))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm") +
  facet_wrap(~ year) + 
  xlab("Normalized feels-like temperature") + 
  ylab("Number of rentals per day") + 
  ggtitle("Impact of Weather & Temperature on Daily Ridership Across Years")

The relationships do seem to hold across both years. The variability of the data for weather = 3 in 2012 is noticeably wider, though.

Let’s see if these relationships hold across days.

ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weather))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ weekday) + 
  xlab("Normalized feels-like temperature") + 
  ylab("Number of rentals per day") + 
  ggtitle("Impact of Weather & Temperature on Daily Ridership Across Days")

The relationships seem to be broadly similar across days (ignoring weather = 3 given the sparseness of the data).

From the visualizations, it is clear that weather conditions do affect rental behaviors.

What about precipitation? Does it affect rental behaviour? What about day of week, season, hour of the day, etc.?

Let’s recall what variables are available in the dataset.

names(bikeshare)

##  [1] "instant"     "date"        "season"      "year"        "month"      
##  [6] "holiday"     "weekday"     "workingday"  "weather"     "temperature"
## [11] "feeltemp"    "humidity"    "windspeed"   "casual"      "registered" 
## [16] "count"

There is no data on precipitation to allow us to investigate whether precipitation affects rental behaviour. That said, to the extent that weather = 3 reflects a rainy day, one could infer that precipitation would affect rental behaviour.

Similarly, there is no data on hour of the day to allow us to investigate whether the hour of the day affects rental behaviour.

Let’s examine day of the week more closely.

library(ggplot2)
ggplot(bikeshare, aes(count)) + 
  geom_density() +
  facet_wrap(~ weekday) +
  xlab("Number of rentals per day") + 
  ylab("Density") + 
  ggtitle("Distribution of Daily Ridership by Day")

There are some noticeable differences in the distributions. Tuesday, in particular, is relatively more leptokurtic.
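
Kurtosis can be compared numerically as well as visually. The sketch below defines a hypothetical helper, excess_kurtosis (not a base R function), from the population-moment definition and applies it per weekday on made-up counts; higher values indicate a more peaked, heavier-tailed distribution.

```r
# Excess kurtosis from the moment definition: m4 / m2^2 - 3
# (population moments; a hypothetical helper, not a base R function)
excess_kurtosis <- function(v) {
  centred <- v - mean(v)
  mean(centred^4) / mean(centred^2)^2 - 3
}

# Hypothetical toy data: two weekdays with five daily counts each
toy <- data.frame(
  weekday = rep(c(1, 2), each = 5),
  count = c(1, 2, 3, 4, 5, 0, 3, 3, 3, 6)
)

# Apply the helper to the counts of each weekday
kurt_by_day <- tapply(toy$count, toy$weekday, excess_kurtosis)
print(kurt_by_day)
```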

A similar picture appears when we look at the ridership frequency for each day using a histogram.

library(ggplot2)
ggplot(bikeshare, aes(count, fill = factor(weekday))) + 
  geom_histogram() +
  facet_wrap(~ weekday, ncol = 3) +
  xlab("Number of rentals per day") +
  ggtitle("Histogram of Daily Ridership by Day")

Next, we investigate whether there is a relationship between temperature and ridership across days.

library(ggplot2)
ggplot(bikeshare, aes(x = feeltemp, y = count, colour = factor(weekday))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Normalized feels-like temperature") + 
  ylab("Number of rentals per day") + 
  ggtitle("Impact of Day & Temperature on Daily Ridership")

The relationship is positive across all days, with the regression lines exhibiting similar slopes and all bunched together.

Likewise, the day of the week does not appear to change how ridership relates to other factors such as humidity and windspeed.

library(ggplot2)
ggplot(bikeshare, aes(x = humidity, y = count, colour = factor(weekday))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Humidity") + 
  ylab("Number of rentals per day") + 
  ggtitle("Impact of Day & Humidity on Daily Ridership")

library(ggplot2)
ggplot(bikeshare, aes(x = windspeed, y = count, colour = factor(weekday))) + 
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Windspeed") + 
  ylab("Number of rentals per day") + 
  ggtitle("Impact of Day & Windspeed on Daily Ridership")

The visualizations suggest that days of the week do not affect rental behaviour.
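
A visual impression like this can be complemented by a model that holds the other factors constant. The following is a sketch on simulated data (all coefficients and the noise level are assumptions, and the simulated rentals deliberately do not depend on weekday); it compares a regression with and without weekday using an F-test.

```r
# Hypothetical simulated data: 20 weeks of daily observations
set.seed(42)
n <- 140
toy <- data.frame(
  weekday = rep(0:6, times = n / 7),
  humidity = runif(n, 0.3, 0.9),
  windspeed = runif(n, 0.05, 0.4)
)
# Simulated rentals depend on humidity and windspeed but not on weekday
toy$count <- 6000 - 2000 * toy$humidity - 3000 * toy$windspeed +
  rnorm(n, sd = 500)

# Models without and with the day of the week as a categorical predictor
reduced <- lm(count ~ humidity + windspeed, data = toy)
full <- lm(count ~ humidity + windspeed + factor(weekday), data = toy)

# F-test: does weekday improve the fit once the other factors are included?
comparison <- anova(reduced, full)
print(comparison)
```

A large p-value in the second row would be consistent with weekday adding little once humidity and windspeed are accounted for.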

Alvin Eng