Choose a dataset
You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job or on something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS, or other open data portals.
Select one of the methodologies studied in weeks 1-10 and another methodology from weeks 11-15 to apply to the selected dataset.
To complete this task, we first load the required libraries.
library(tidyverse)
library(cowplot)
library(lattice)
library(reshape2)
library(corrplot)
library(caTools)
library(caret)
library(Hmisc)
For this analysis, I have decided to work with the Bike Sharing dataset. The dataset can be accessed at https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset#
This dataset contains the hourly and daily counts of rental bikes between 2011 and 2012 in the Capital Bikeshare system, along with the corresponding weather and seasonal information. The hourly subset contains 17,379 records, while the daily subset contains 731 records; each subset has 16 attributes. For the purposes of this analysis, I will be working with the daily counts.
The objective of this analysis is to predict the bike rental count based on different environmental settings.
As the first step of the exploratory analysis, we take a glimpse at the data.
data <- read.csv("day.csv")
head(data)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606
Looking at the data types of the columns, all but ‘dteday’ are numeric; ‘dteday’ will be converted to the Date type. We will be predicting the total rental count, ‘cnt’.
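The column types can be confirmed directly with a quick structure check (a minimal sketch; output omitted here):
str(data)  # all columns numeric/integer except 'dteday', which read.csv imports as character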
data$dteday <- as.Date(data$dteday)
To make sure we have complete data, we check for missing values.
sum(is.na(data))
## [1] 0
Now that we have confirmed that no missing values exist in the data and have properly converted the column type, we can start with the statistical and visual summaries.
summary(data)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
# Create a boxplot for each variable to inspect distributions and outliers
meltD <- melt(data)
p <- ggplot(meltD, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales = "free")
# Histograms of every column (from Hmisc)
hist.data.frame(data)
# Drop the date column and plot the correlation matrix
df <- data %>% select(-dteday)
corrplot(cor(df),
         method = 'circle', order = 'alphabet', type = 'lower', diag = FALSE,
         number.cex = 0.75, tl.cex = 0.5,
         col = colorRampPalette(c("blue", "white", "red"))(200))
For this analysis, we will be using two methodologies: linear regression and a neural network. We will be predicting the count of rental bikes, ‘cnt’, from the weather-related attributes: ‘weathersit’, ‘temp’, ‘atemp’, ‘hum’, and ‘windspeed’.
The tracker data frame will record the name, RMSE, and R-squared of each model.
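For reference, the two metrics are computed as follows, where $y_i$ are the observed counts and $\hat{y}_i$ the predictions; note that the R-squared used throughout is the squared correlation between observed and predicted values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = \operatorname{cor}(y, \hat{y})^2$$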
tracker <- data.frame(Name = character(), RMSE = character(), R.Squared = character(),
                      stringsAsFactors = FALSE)
The dataset is split into training and testing sets at an 85:15 ratio.
set.seed(123)
train_ind <- sample(seq_len(nrow(data)), size = floor(0.85 * nrow(data)))
train <- data[train_ind, ]
test <- data[-train_ind, ]
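A quick dimension check can confirm the split (a minimal sanity check; output omitted here):
dim(train)  # expected: 621 rows, i.e. floor(0.85 * 731)
dim(test)   # expected: the remaining 110 rows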
lm_fit <- lm(cnt ~ weathersit + temp + atemp + hum + windspeed, data = train)
lm_fit
##
## Call:
## lm(formula = cnt ~ weathersit + temp + atemp + hum + windspeed,
## data = train)
##
## Coefficients:
## (Intercept) weathersit temp atemp hum windspeed
## 3548.4 -480.7 -5527.6 13227.0 -1873.6 -3802.3
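The coefficients on ‘temp’ and ‘atemp’ have large magnitudes and opposite signs, a classic symptom of collinearity between the two temperature measures (which the correlation plot also suggested). A quick check could use variance inflation factors; a minimal sketch, assuming the car package is installed:
library(car)
vif(lm_fit)  # VIF values well above 10 for temp and atemp would confirm the collinearity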
lm_pred <- predict(lm_fit, newdata = test)
# RMSE
lm_rmse <- sqrt(mean((test$cnt - lm_pred)^2))
# R-squared (squared correlation between observed and predicted)
lm_rs <- (cor(test$cnt, lm_pred))^2
tracker[nrow(tracker) + 1,] <- c("Linear Regression", lm_rmse, lm_rs)
# Diagnostic plots for the linear model
plot(lm_fit, col = "purple")
# Tuning grid for the averaged neural network (avNNet)
nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1),
                        .size = 1:10,
                        .bag = FALSE)
# 10-fold cross-validation for model tuning
ctrl <- trainControl(method = "cv", number = 10)
nnet <- train(cnt ~ weathersit + temp + atemp + hum + windspeed, data = train,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("YeoJohnson", "center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 84851,
maxit = 500)
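Before scoring the test set, it can be useful to see which tuning combination cross-validation selected (a minimal follow-up; output omitted here):
nnet$bestTune  # decay and size chosen by 10-fold CV
plot(nnet)     # RMSE profile across the tuning grid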
nnet_pred <- predict(nnet, newdata = test)
# RMSE
nnet_rmse <- sqrt(mean((test$cnt - nnet_pred)^2))
# R-squared (squared correlation between observed and predicted)
nnet_rs <- (cor(test$cnt, nnet_pred))^2
tracker[nrow(tracker) + 1,] <- c("Neural Network", nnet_rmse, nnet_rs)
# Note: c() coerced the tracker columns to character; the lexical sort happens to match the numeric order here
tracker %>% arrange(RMSE)
## Name RMSE R.Squared
## 1 Neural Network 1297.83870692773 0.590298664438584
## 2 Linear Regression 1502.19416554519 0.445070379921359
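As a sanity check, the same metrics could also be computed with caret’s postResample() helper, which returns RMSE, R-squared, and MAE in one call (a sketch; output not shown):
postResample(pred = lm_pred, obs = test$cnt)
postResample(pred = nnet_pred, obs = test$cnt)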
From the tracker table above, we can see that of the two methodologies, the neural network is the better approach on this test set, with a lower RMSE (about 1298 vs. 1502) and a higher R-squared (about 0.59 vs. 0.45). This result is plausible: the averaged neural network can capture nonlinear relationships between the weather variables and ridership that the linear model cannot.
The data and this analysis provide information that can be used to make future business decisions. The results give insight into the conditions under which more bikes are rented and those under which business is slow. This study used only the weather-related attributes, but one shining aspect of this dataset is its timestamped structure. The timed data, whether hourly or daily, allows for time series analysis that could yield further insight for future predictions. This analysis focused on a set number of attributes; analysis of the remaining attributes, or other combinations, would likely prove useful as well.
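As a concrete starting point for that time series angle, a simple trend plot of the daily counts might look like the following sketch (not run here):
# Daily rentals over the two-year span with a smoothed trend line (sketch only)
ggplot(data, aes(x = dteday, y = cnt)) +
  geom_line(color = "steelblue") +
  geom_smooth(method = "loess", se = FALSE, color = "darkred") +
  labs(x = "Date", y = "Daily rentals", title = "Daily bike rentals, 2011-2012")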