The data set can be loaded from a CSV file saved in my GitHub repository.
# load data
library(tidyverse)
library(knitr)
repo <- "https://raw.githubusercontent.com/kecbenson/DATA_606_Project/master/day.csv"
df <- read_csv(repo)
df
## # A tibble: 731 x 16
## instant dteday season yr mnth holiday weekday workingday
## <int> <date> <int> <int> <int> <int> <int> <int>
## 1 1 2011-01-01 1 0 1 0 6 0
## 2 2 2011-01-02 1 0 1 0 0 0
## 3 3 2011-01-03 1 0 1 0 1 1
## 4 4 2011-01-04 1 0 1 0 2 1
## 5 5 2011-01-05 1 0 1 0 3 1
## 6 6 2011-01-06 1 0 1 0 4 1
## 7 7 2011-01-07 1 0 1 0 5 1
## 8 8 2011-01-08 1 0 1 0 6 0
## 9 9 2011-01-09 1 0 1 0 0 0
## 10 10 2011-01-10 1 0 1 0 1 1
## # ... with 721 more rows, and 8 more variables: weathersit <int>,
## # temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>, casual <int>,
## # registered <int>, cnt <int>
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
Is the bike rental behavior of registered users more or less sensitive than that of casual users to changes in weather conditions?
What are the cases, and how many are there?
The data set includes 731 cases. Each case is one day in 2011 or 2012, described by 16 variables covering (a) calendar information, (b) weather in the DC metro area, and (c) the number of bike rentals in the DC bike sharing program.
glimpse(df)
## Observations: 731
## Variables: 16
## $ instant <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ dteday <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 20...
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ yr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ mnth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ holiday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ weekday <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, ...
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, ...
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, ...
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.22695...
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.22927...
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0...
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.18690...
## $ casual <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25...
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1...
## $ cnt <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1...
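Before going further, it may be worth confirming that each case really is one distinct day and that the three count columns agree with each other. The sketch below assumes that `cnt` is the sum of `casual` and `registered`, which the first rows printed above are consistent with. The helper name `check_daily_counts` is hypothetical and not part of the data set or any package. The small data frame reuses the first nine rows shown above so that the block runs on its own; in the project the helper would be called on `df`.

```r
# Hypothetical helper: basic integrity checks for the daily bike data.
# Assumes columns dteday, casual, registered and cnt, as in the glimpse() output.
check_daily_counts <- function(data) {
  list(
    # TRUE if no date appears more than once (one case per day)
    one_row_per_day = !any(duplicated(data$dteday)),
    # TRUE if the total equals casual + registered on every row
    counts_add_up = all(data$casual + data$registered == data$cnt),
    # TRUE if no variable has missing values
    no_missing = !any(is.na(data))
  )
}

# First nine rows from the output above, used here only for illustration
first_rows <- data.frame(
  dteday = seq(as.Date("2011-01-01"), by = "day", length.out = 9),
  casual = c(331, 131, 120, 108, 82, 88, 148, 68, 54),
  registered = c(654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768),
  cnt = c(985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822)
)

checks <- check_daily_counts(first_rows)
print(checks)
```

If any of the three checks came back `FALSE` on the full data, that would be worth resolving before choosing a response variable.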
Describe the method of data collection.
I downloaded the data set from the UC Irvine Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
There are two data sets available in the repository, which contain hourly or daily counts of bike rentals during 2011 and 2012 in the Washington, D.C. bike sharing program, along with corresponding weather information.
For this project, I chose to work with the daily data set.
What type of study is this (observational/experiment)?
This is an observational study of historical data.
If you collected the data, state self-collected. If not, provide a citation/link.
The data set was available from the UC Irvine Machine Learning Repository (cited above). The authors collected their data from the following sources:
The data set has already been cleaned and tidied by the authors.
What is the response variable? Is it quantitative or qualitative?
I haven’t decided yet, but potential choices include:
- cnt: daily count of bike rentals by all users
- registered: daily count of bike rentals by registered users
- casual: daily count of bike rentals by non-registered users

These are all quantitative variables.
You should have two independent variables, one quantitative and one qualitative.
The data set has a number of variables to choose from, including:
- temp or atemp: normalized temperature or normalized "feels-like" temperature (both rescaled from degrees Celsius to a 0-1 scale)
- hum: normalized humidity
- windspeed: normalized wind speed
- season: season (1 to 4)
- mnth: month (1 to 12)
- workingday: indicator of whether the day is a workday (1) or a weekend day / holiday (0)
- weathersit: weather category (1 to 4), e.g., 1 = “Clear / few clouds / partly cloudy / cloudy” while 4 = “Heavy rain / hail / thunderstorm / mist / snow / fog”; only categories 1 to 3 occur in the daily data

I haven’t decided yet, but I’m leaning toward using the temp (quantitative) and workingday (qualitative) variables.
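Because the qualitative variables are stored as integer codes, R will treat them as numbers unless they are converted to factors. The sketch below shows one way to do that. The helper name `recode_categories` is hypothetical, and the labels for `workingday` are taken from the description above; `season` and `weathersit` keep their numeric codes as labels because the meaning of every level is not spelled out here. The toy rows copy the first four days of the printed data.

```r
# Hypothetical helper: recode integer-coded columns as factors.
# workingday labels follow the variable description; season and weathersit
# keep their numeric codes as level labels (an assumption made for simplicity).
recode_categories <- function(data) {
  data$workingday <- factor(data$workingday, levels = c(0, 1),
                            labels = c("weekend/holiday", "workday"))
  data$season <- factor(data$season, levels = 1:4)
  data$weathersit <- factor(data$weathersit, levels = 1:4)
  data
}

# First four days of the printed data, used here only for illustration
toy <- data.frame(
  season = c(1, 1, 1, 1),
  workingday = c(0, 0, 1, 1),
  weathersit = c(2, 2, 1, 1)
)

toy_recoded <- recode_categories(toy)
str(toy_recoded)
table(toy_recoded$workingday)
```

On the full data this would be applied as `df <- recode_categories(df)` before plotting or modelling, so that `workingday` enters a model as a two-level category rather than a numeric slope.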
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
Below are summary statistics for the dataset, and a histogram showing the distribution of total daily rental counts.
# summary statistics
summary(df)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
# histogram of daily rental counts: all users
ggplot(df) + geom_histogram(aes(x = cnt), bins = 30) + labs(title = "Distribution of daily rental count: All users")
The data show that registered and casual users exhibit different rental behavior during the week, across seasons, and under different weather conditions. For instance, registered users tend to ride more during the work week, whereas casual users tend to ride more on the weekends and holidays.
# boxplots of rental counts, by workday vs. weekend/holiday: registered vs. casual users
ggplot(df) + geom_boxplot(aes(y = registered)) + facet_wrap(~ workingday) +
  labs(title = "Daily rental count by workday (1) vs. weekend/holiday (0): Registered users")
ggplot(df) + geom_boxplot(aes(y = casual)) + facet_wrap(~ workingday) +
  labs(title = "Daily rental count by workday (1) vs. weekend/holiday (0): Casual users")
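The comparison between workdays and weekends/holidays can also be put into numbers with a small table of group medians. The sketch below uses base R's `aggregate()`; the helper name `median_by_daytype` is hypothetical. The toy rows are the first nine days printed earlier, so the result illustrates the mechanics only and says nothing about the full two years.

```r
# Hypothetical helper: median daily rentals per user type, split by workingday.
median_by_daytype <- function(data) {
  aggregate(cbind(registered, casual) ~ workingday, data = data, FUN = median)
}

# First nine days of the printed data, used here only for illustration
first_rows <- data.frame(
  workingday = c(0, 0, 1, 1, 1, 1, 1, 0, 0),
  casual = c(331, 131, 120, 108, 82, 88, 148, 68, 54),
  registered = c(654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768)
)

medians <- median_by_daytype(first_rows)
print(medians)
```

Calling `median_by_daytype(df)` on the full data would give one row per `workingday` value, which can be read alongside the boxplots.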
Also, it appears that registered users may be more sensitive to temperature than casual users. In particular, the fitted regression line of rental count vs. temperature appears steeper for registered users than for casual users during cold weather (season = 1) and hot weather (season = 3). Because registered counts are roughly four times larger on average (mean of 3,656 vs. 848 rentals per day), part of this gap may reflect scale rather than sensitivity, so the comparison should also be made in relative terms. Whether the difference is statistically significant can be investigated in the project.
# scatter plots of rental count vs. temp, by season: registered vs. casual users
# factor(season) gives a discrete color per season instead of a continuous gradient
ggplot(df, aes(x = temp, y = registered)) + geom_point(aes(color = factor(season))) + facet_wrap(~ season) +
  geom_smooth(method = "lm", se = FALSE) + labs(title = "Daily rental count vs. temperature, by season (1-4): Registered users", color = "season")
ggplot(df, aes(x = temp, y = casual)) + geom_point(aes(color = factor(season))) + facet_wrap(~ season) +
  geom_smooth(method = "lm", se = FALSE) + labs(title = "Daily rental count vs. temperature, by season (1-4): Casual users", color = "season")
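One possible way to test whether the two user types differ in temperature sensitivity is a single regression with an interaction term, sketched below. This is only an assumption about how the project might proceed, not a settled method. The data are simulated with arbitrary slopes so that the block runs on its own; they are not values from the bike data, and the names `sim`, `user_type` and `count` are made up for the example. With the real data, the `registered` and `casual` columns would first be stacked into one long `count` column with a `user_type` label.

```r
# Simulated data for illustration only; these are NOT values from the bike data.
set.seed(606)
n <- 200
temp <- runif(n, 0.1, 0.9)
sim <- data.frame(
  temp = rep(temp, times = 2),
  user_type = factor(rep(c("casual", "registered"), each = n)),
  # log-linear trends with different, arbitrary slopes for the two groups
  count = c(exp(5 + 3.0 * temp + rnorm(n, sd = 0.3)),
            exp(7 + 1.0 * temp + rnorm(n, sd = 0.3)))
)

# Modelling log(count) puts both user types on a relative scale, so the
# slopes are comparable even when one group has much larger counts.
fit <- lm(log(count) ~ temp * user_type, data = sim)

# The interaction term estimates how much the registered slope differs from
# the casual slope; its t-test addresses the difference in sensitivity.
coef(summary(fit))["temp:user_typeregistered", ]
```

Two caveats apply to the real data: both user types are observed on the same days, and consecutive days are likely correlated, so the standard errors from a plain `lm()` fit should be treated with caution.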