DATA 606 Data Project Proposal

Data Preparation

The data set can be loaded from a CSV file saved in my GitHub repository.

# load data
library(tidyverse)
library(knitr)

repo <- "https://raw.githubusercontent.com/kecbenson/DATA_606_Project/master/day.csv"
df <- read_csv(repo)
df

## # A tibble: 731 x 16
##    instant dteday     season    yr  mnth holiday weekday workingday
##      <int> <date>      <int> <int> <int>   <int>   <int>      <int>
##  1       1 2011-01-01      1     0     1       0       6          0
##  2       2 2011-01-02      1     0     1       0       0          0
##  3       3 2011-01-03      1     0     1       0       1          1
##  4       4 2011-01-04      1     0     1       0       2          1
##  5       5 2011-01-05      1     0     1       0       3          1
##  6       6 2011-01-06      1     0     1       0       4          1
##  7       7 2011-01-07      1     0     1       0       5          1
##  8       8 2011-01-08      1     0     1       0       6          0
##  9       9 2011-01-09      1     0     1       0       0          0
## 10      10 2011-01-10      1     0     1       0       1          1
## # ... with 721 more rows, and 8 more variables: weathersit <int>,
## #   temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>, casual <int>,
## #   registered <int>, cnt <int>

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is the bike rental behavior of registered users more or less sensitive than that of casual users to changes in weather conditions?

Cases

What are the cases, and how many are there?

The data set includes 731 cases. Each case is one day in 2011 or 2012, described by 16 variables covering (a) calendar information, (b) weather in the DC metro area, and (c) the number of bike rentals in the DC bike sharing program.

glimpse(df)

## Observations: 731
## Variables: 16
## $ instant    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ dteday     <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 20...
## $ season     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ yr         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ mnth       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ holiday    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ weekday    <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, ...
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, ...
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
## $ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.22695...
## $ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.22927...
## $ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0...
## $ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.18690...
## $ casual     <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25...
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1...
## $ cnt        <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1...
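
The case count is consistent with one row per calendar day: 2011 has 365 days and 2012, a leap year, has 366, which gives 731. A minimal sketch of that check follows. The helper name `has_complete_daily_coverage` is hypothetical and not part of the original analysis, and the commented-out call assumes `df` has been loaded as above.

```r
# Hypothetical helper: TRUE if `dates` contains every day from `start` to `end`
# exactly once (no gaps, no duplicates).
has_complete_daily_coverage <- function(dates, start, end) {
  expected <- seq(as.Date(start), as.Date(end), by = "day")
  length(dates) == length(expected) && all(sort(dates) == expected)
}

# Calendar days in 2011 and 2012 (2012 is a leap year)
expected_days <- seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "day")
length(expected_days)

# With the full data set loaded as `df`:
# has_complete_daily_coverage(df$dteday, "2011-01-01", "2012-12-31")
```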

Data collection

Describe the method of data collection.

I downloaded the data set from the UC Irvine Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

There are two data sets available in the repository, which contain hourly or daily counts of bike rentals during 2011 and 2012 in the Washington, D.C. bike sharing program, along with corresponding weather information.

For this project, I chose to work with the daily data set.

Type of study

What type of study is this (observational/experiment)?

This is an observational study of historical data.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The data set was available from the UC Irvine Machine Learning Repository (cited above). The authors collected their data from the following sources:

DC bike share data: http://capitalbikeshare.com/system-data
Weather data: http://www.freemeteo.com
Holiday schedule: http://dchr.dc.gov/page/holiday-schedule

The authors have already cleaned and tidied the data set.

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

I haven’t decided yet, but potential choices include:

cnt: daily count of bike rentals by all users
registered: daily count of bike rentals by registered users
casual: daily count of bike rentals by non-registered users.

These are all quantitative variables.
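
In the rows printed above, cnt equals casual plus registered (for example, 331 + 654 = 985 on the first day). The sketch below checks that relationship. To stay runnable without a download, it uses only the first nine days as they appear in the glimpse() output; `sample_days` and `counts_add_up` are names introduced here for illustration.

```r
# First nine days, copied from the glimpse() output above
sample_days <- data.frame(
  casual     = c(331, 131, 120, 108, 82, 88, 148, 68, 54),
  registered = c(654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768),
  cnt        = c(985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822)
)

# Hypothetical helper: TRUE if every total equals the sum of its two parts
counts_add_up <- function(data) {
  all(data$cnt == data$casual + data$registered)
}

counts_add_up(sample_days)

# With the full data set loaded as `df`:
# counts_add_up(df)
```

If the same holds for all 731 rows, cnt carries no information beyond the two user-type counts, which is worth keeping in mind when choosing among the three candidates.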

Independent Variables

You should have two independent variables, one quantitative and one qualitative.

The data set has a number of variables to choose from, including:

Quantitative variables:
- temp or atemp: normalized temperature or normalized “feels-like” temperature (derived from degrees Celsius and rescaled to a 0-1 range)
- hum: normalized humidity
- windspeed: normalized wind speed
Qualitative variables:
- season: season (1 to 4)
- mnth: month (1 to 12)
- workingday: indicator of whether the day is a workday (1) or a weekend day / holiday (0)
- weathersit: weather category (1 to 4), e.g., 1 = “Clear / few clouds / partly cloudy / cloudy” while 4 = “Heavy rain / hail / thunderstorm / mist / snow / fog”; the maximum observed in the daily data is 3

I haven’t decided yet, but I’m leaning toward using the temp (quantitative) and workingday (qualitative) variables.
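
The qualitative variables are stored as integers (`<int>` in the output above), so R would treat them as numeric unless they are converted. One possible recoding is sketched below, assuming the column names shown in the data. `recode_qualitative` and `sample_days` are hypothetical names, the sample rows are the first four days from the printed output, and the workingday labels follow the description in the list above.

```r
library(dplyr)

# First four days, copied from the printed output above
sample_days <- data.frame(
  season     = c(1L, 1L, 1L, 1L),
  mnth       = c(1L, 1L, 1L, 1L),
  workingday = c(0L, 0L, 1L, 1L),
  weathersit = c(2L, 2L, 1L, 1L)
)

# Hypothetical helper: convert the integer-coded categories to factors
recode_qualitative <- function(data) {
  data %>%
    mutate(
      season     = factor(season),
      mnth       = factor(mnth),
      workingday = factor(workingday, levels = c(0, 1),
                          labels = c("weekend/holiday", "workday")),
      weathersit = factor(weathersit)
    )
}

recoded <- recode_qualitative(sample_days)
str(recoded)

# With the full data set loaded as `df`:
# df_factors <- recode_qualitative(df)
```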

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

Below are summary statistics for the dataset, and a histogram showing the distribution of total daily rental counts.

# summary statistics
summary(df)

##     instant          dteday               season            yr        
##  Min.   :  1.0   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   1st Qu.:2011-07-02   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Median :2012-01-01   Median :3.000   Median :1.0000  
##  Mean   :366.0   Mean   :2012-01-01   Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5   3rd Qu.:2012-07-01   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##       mnth          holiday           weekday        workingday   
##  Min.   : 1.00   Min.   :0.00000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Median :3.000   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Mean   :2.997   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Max.   :6.000   Max.   :1.000  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714

# histogram of daily rental counts: all users
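# bins = 30 matches the ggplot2 default; stating it explicitly avoids the "pick better value" message
ggplot(df) + geom_histogram(aes(x = cnt), bins = 30) +
    labs(title = "Distribution of daily rental count: All users")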

The data show that registered and casual users exhibit different rental behavior during the week, across seasons, and under different weather conditions. For instance, registered users tend to ride more during the work week, whereas casual users tend to ride more on the weekends and holidays.

# boxplot of rental counts, by weekday or weekend/holiday: registered vs. casual users
ggplot(df) + geom_boxplot(aes(y = registered)) + facet_wrap(~ workingday) +
    labs(title = "Daily rental count by weekday (1) vs. weekend/holiday (0): Registered users")

ggplot(df) + geom_boxplot(aes(y = casual)) + facet_wrap(~ workingday) +
    labs(title = "Daily rental count by weekday (1) vs. weekend/holiday (0): Casual users")
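
The boxplot comparison can also be put in numbers by summarising each user type by day type. The sketch below is one way to do that under stated assumptions: `summarise_by_daytype` and `sample_days` are hypothetical names, and the sample holds only the first nine days from the glimpse() output so the block runs without a download. Nine days are far too few to draw conclusions from; the commented-out call shows the intended use on the full data.

```r
library(dplyr)

# First nine days, copied from the glimpse() output above
sample_days <- data.frame(
  workingday = c(0, 0, 1, 1, 1, 1, 1, 0, 0),
  casual     = c(331, 131, 120, 108, 82, 88, 148, 68, 54),
  registered = c(654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768)
)

# Hypothetical helper: median daily rentals per user type, split by workingday
summarise_by_daytype <- function(data) {
  data %>%
    group_by(workingday) %>%
    summarise(
      days              = n(),
      median_registered = median(registered),
      median_casual     = median(casual),
      .groups = "drop"
    )
}

by_daytype <- summarise_by_daytype(sample_days)
by_daytype

# With the full data set loaded as `df`:
# summarise_by_daytype(df)
```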

Also, it appears that registered users may be more sensitive to temperature conditions than casual users. In particular, the slope of the regression line of rental count vs. temperature appears to be steeper for registered users than for casual users during cold weather (season = 1) and hot weather (season = 3). The project can investigate whether this difference is statistically significant.

# scatter plots of rental count vs temp, by season: registered vs. casual users
# season is stored as an integer, so wrap it in factor() to get a discrete color scale
ggplot(df, aes(x = temp, y = registered)) + geom_point(aes(color = factor(season))) + facet_wrap(~ season) + 
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Daily rental count vs. temperature, by season (1-4): Registered users", color = "season")

ggplot(df, aes(x = temp, y = casual)) + geom_point(aes(color = factor(season))) + facet_wrap(~ season) + 
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Daily rental count vs. temperature, by season (1-4): Casual users", color = "season")
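
One possible way to test whether the two user types differ in temperature sensitivity is a single regression with an interaction term. The sketch below is an assumption about how the project might proceed, not a method stated in this proposal. It stacks the casual and registered counts into one column and fits log(rentals) on temp, user type, and their interaction. The log scale is a choice made here because registered counts are several times larger than casual counts, so raw slopes are not directly comparable; it is defined for every row since the summary statistics show a minimum of 2 for casual and 20 for registered. `compare_temp_sensitivity` and `sample_days` are hypothetical names. The sample holds only the first four days from the glimpse() output, enough to show that the code runs but not enough for a meaningful estimate.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: stack the two user types and fit an interaction model.
# The temp:user_type coefficient estimates how much the temperature slope
# (on the log scale) differs between registered and casual users.
compare_temp_sensitivity <- function(data) {
  long <- data %>%
    pivot_longer(cols = c(casual, registered),
                 names_to = "user_type", values_to = "rentals")
  lm(log(rentals) ~ temp * user_type, data = long)
}

# First four days, copied from the glimpse() output above (illustration only)
sample_days <- data.frame(
  temp       = c(0.3441670, 0.3634780, 0.1963640, 0.2000000),
  casual     = c(331, 131, 120, 108),
  registered = c(654, 670, 1229, 1454)
)

fit <- compare_temp_sensitivity(sample_days)
names(coef(fit))

# With the full data set loaded as `df`:
# summary(compare_temp_sensitivity(df))
```

In the summary of the full-data fit, the row for the interaction term would carry the test of whether the slopes differ. The same model could be fitted within each season to match the faceted plots.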