Choose a dataset
You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job or on something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS, or other open data portals.
Select one of the methodologies studied in weeks 1-10 and another methodology from weeks 11-15 to apply to the selected dataset.
To complete this task, we first load the required libraries.
library(tidyverse)
library(cowplot)
library(lattice)
library(reshape2)
library(corrplot)
library(caTools)
library(caret)
library(Hmisc)
For this analysis, I have decided to work with the Bike Sharing dataset. The dataset can be accessed at https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset#
This dataset contains the hourly and daily counts of rental bikes between 2011 and 2012 in the Capital Bikeshare system, along with the corresponding weather and seasonal information. The hourly subset contains 17,379 records, while the daily subset contains 731 records; each subset has 16 attributes. For the purposes of this analysis, I will be working with the daily counts.
The objective of this analysis is to predict the bike rental count based on different environmental settings.
As the first step of the exploratory analysis, we take a glimpse at the data.
data <- read.csv("day.csv")
head(data)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606
Looking at the data types of the columns, all but ‘dteday’ are numeric; ‘dteday’ will be converted to the Date type. We will be predicting the total rental count, ‘cnt’.
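The column types can be confirmed directly with a quick structure check (a minimal sketch; output omitted here):
str(data)  # all columns numeric/integer except 'dteday', which read.csv imports as character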
data$dteday <- as.Date(data$dteday)
To make sure we have complete data, we check for missing values.
sum(is.na(data))
## [1] 0
Now that we have confirmed that no missing values exist in the data and have properly converted the column type, we can start with the statistical and visual summaries.
summary(data)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
# Create a boxplot for each variable to inspect distributions and outliers
meltD <- melt(data)
p <- ggplot(meltD, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scales = "free")
# Histograms of every column (from Hmisc)
hist.data.frame(data)
# Drop the date column and plot the correlation matrix
df <- data %>% select(-dteday)
corrplot(cor(df),
         method = 'circle', order = 'alphabet', type = 'lower', diag = FALSE,
         number.cex = 0.75, tl.cex = 0.5,
         col = colorRampPalette(c("blue", "white", "red"))(200))
For this analysis, we will be using two methodologies: linear regression and a neural network. We will be predicting the count of rental bikes, ‘cnt’, from the weather-related attributes: ‘weathersit’, ‘temp’, ‘atemp’, ‘hum’, and ‘windspeed’.
The tracker data frame will record the name, RMSE, and R-squared of each model.
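For reference, the two metrics are computed as follows, where $y_i$ are the observed counts and $\hat{y}_i$ the predictions; note that the R-squared used throughout is the squared correlation between observed and predicted values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = \operatorname{cor}(y, \hat{y})^2$$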
tracker <- data.frame(Name = character(), RMSE = character(), R.Squared = character(),
                      stringsAsFactors = FALSE)
The dataset is split into training and testing sets at an 85:15 ratio.
set.seed(123)
train_ind <- sample(seq_len(nrow(data)), size = floor(0.85 * nrow(data)))
train <- data[train_ind, ]
test <- data[-train_ind, ]
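A quick dimension check can confirm the split (a minimal sanity check; output omitted here):
dim(train)  # expected: 621 rows, i.e. floor(0.85 * 731)
dim(test)   # expected: the remaining 110 rows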
lm_fit <- lm(cnt ~ weathersit + temp + atemp + hum + windspeed, data = train)
lm_fit
##
## Call:
## lm(formula = cnt ~ weathersit + temp + atemp + hum + windspeed,
## data = train)
##
## Coefficients:
## (Intercept) weathersit temp atemp hum windspeed
## 3548.4 -480.7 -5527.6 13227.0 -1873.6 -3802.3
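The coefficients on ‘temp’ and ‘atemp’ have large magnitudes and opposite signs, a classic symptom of collinearity between the two temperature measures (which the correlation plot also suggested). A quick check could use variance inflation factors; a minimal sketch, assuming the car package is installed:
library(car)
vif(lm_fit)  # VIF values well above 10 for temp and atemp would confirm the collinearity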
lm_pred <- predict(lm_fit, newdata = test)
# RMSE
lm_rmse <- sqrt(mean((test$cnt - lm_pred)^2))
# R-squared (squared correlation between observed and predicted)
lm_rs <- (cor(test$cnt, lm_pred))^2
tracker[nrow(tracker) + 1,] <- c("Linear Regression", lm_rmse, lm_rs)
# Diagnostic plots for the linear model
plot(lm_fit, col = "purple")
# Tuning grid for the averaged neural network (avNNet)
nnetGrid <- expand.grid(.decay = c(0, 0.01, 0.1),
                        .size = 1:10,
                        .bag = FALSE)
# 10-fold cross-validation for model tuning
ctrl <- trainControl(method = "cv", number = 10)
nnet <- train(cnt ~ weathersit + temp + atemp + hum + windspeed, data = train,
method = "avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProc = c("YeoJohnson", "center", "scale"),
linout = TRUE,
trace = FALSE,
MaxNWts = 84851,
maxit = 500)
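Before scoring the test set, it can be useful to see which tuning combination cross-validation selected (a minimal follow-up; output omitted here):
nnet$bestTune  # decay and size chosen by 10-fold CV
plot(nnet)     # RMSE profile across the tuning grid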
nnet_pred <- predict(nnet, newdata = test)
# RMSE
nnet_rmse <- sqrt(mean((test$cnt - nnet_pred)^2))
# R-squared (squared correlation between observed and predicted)
nnet_rs <- (cor(test$cnt, nnet_pred))^2
tracker[nrow(tracker) + 1,] <- c("Neural Network", nnet_rmse, nnet_rs)
# Note: c() coerced the tracker columns to character; the lexical sort happens to match the numeric order here
tracker %>% arrange(RMSE)
## Name RMSE R.Squared
## 1 Neural Network 1297.83870692773 0.590298664438584
## 2 Linear Regression 1502.19416554519 0.445070379921359
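As a sanity check, the same metrics could also be computed with caret’s postResample() helper, which returns RMSE, R-squared, and MAE in one call (a sketch; output not shown):
postResample(pred = lm_pred, obs = test$cnt)
postResample(pred = nnet_pred, obs = test$cnt)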
From the tracker table above, we can see that of the two methodologies, the neural network is the better approach on this test set, with a lower RMSE (about 1298 vs. 1502) and a higher R-squared (about 0.59 vs. 0.45). This result is plausible: the averaged neural network can capture nonlinear relationships between the weather variables and ridership that the linear model cannot.
The data and this analysis provide information that can be used to make future business decisions. The results give insight into the conditions under which more bikes are rented and those under which business is slow. This study used only the weather-related attributes, but one shining aspect of this dataset is its timestamped structure. The timed data, whether hourly or daily, allows for time series analysis that could yield further insight for future predictions. This analysis focused on a set number of attributes; analysis of the remaining attributes, or other combinations, would likely prove useful as well.
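As a concrete starting point for that time series angle, a simple trend plot of the daily counts might look like the following sketch (not run here):
# Daily rentals over the two-year span with a smoothed trend line (sketch only)
ggplot(data, aes(x = dteday, y = cnt)) +
  geom_line(color = "steelblue") +
  geom_smooth(method = "loess", se = FALSE, color = "darkred") +
  labs(x = "Date", y = "Daily rentals", title = "Daily bike rentals, 2011-2012")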