Mateo Arcos 2023-08-14
Bike-sharing systems offer a sustainable and convenient mode of transportation for urban residents and visitors, promoting eco-friendly commuting and reducing traffic congestion.
The primary objective of this report is to develop a predictive model using linear regression to forecast the number of bike rentals in the Capital Bikeshare system based on the available data, which covers time of day, calendar, and weather conditions. We aim to identify the relationships between these factors and bike rental demand. This knowledge can empower decision-makers to optimize resource allocation, anticipate surges in demand, and tailor operational strategies to accommodate varying conditions.
To develop the predictive model, we follow a systematic approach. First, we conduct an exploratory data analysis (EDA) to gain an understanding of the dataset’s features and their distributions. This EDA includes scatter plots for continuous variables and histograms and box plots for categorical variables. Next, we preprocess the data, addressing the representation of categorical variables, outliers, and feature transformations. With a refined dataset, we fit a linear regression, after first checking that the four assumptions of OLS (constant variance, normality of errors, independence of errors, and linearity between X and Y) are not violated.
The question we aim to answer in this paper is: how can we predict the demand for rental bikes in Washington, DC at a given time of the year? The proposed model aligns with our overarching purpose in several ways. By accurately predicting bike rental demand, we can assist bike-sharing service providers in efficiently allocating their fleet, ensuring a satisfactory user experience during peak and off-peak times. Urban planners can use the insights to develop proactive strategies for enhancing cycling infrastructure and promoting biking as a viable mode of urban transportation. Furthermore, by understanding the impact of weather conditions and seasonality, we contribute to building resilient bike-sharing systems that can adapt to various external influences.
The data set this model is trained on includes the following attributes:
1. **instant:** The unique index assigned to each record, facilitating easy referencing.
2. **dteday:** The date corresponding to each record, allowing temporal analysis over the two-year span.
3. **season:** Represents the season in which the rental data was recorded, categorized as winter, spring, summer, or fall.
4. **yr:** Indicates the year of the rental data, with 0 representing 2011 and 1 representing 2012.
5. **mnth:** Refers to the numeric representation of the month (ranging from 1 to 12).
6. **hr:** Denotes the hour of the day (ranging from 0 to 23).
7. **holiday:** A binary indicator of whether the given day was a holiday or not.
8. **weekday:** Indicates the day of the week, providing insights into weekly variations.
9. **workingday:** A binary indicator that distinguishes between regular workdays (1) and weekends/holidays (0).
10. **weathersit:** Categorizes the weather conditions into four distinct classes:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
11. **temp:** Normalized temperature in Celsius, derived based on the minimum and maximum temperatures.
12. **atemp:** Normalized feeling temperature in Celsius, similarly derived based on the minimum and maximum feeling temperatures.
13. **hum:** Normalized humidity, ranging between 0 and 1, where 1 corresponds to 100% humidity.
14. **windspeed:** Normalized wind speed, scaled between 0 and 1.
15. **casual:** The count of casual users who rented bikes.
16. **registered:** The count of registered users who rented bikes.
17. **cnt:** The total count of bike rentals, encompassing both casual and registered users.
Distribution analysis:
To better understand the distribution of bike rentals across factors such as time of day, month, and weather, histograms were generated. These included:
Histogram of rental count for Each Hour of the Day
Histogram of rental count for Each Month
Histogram of rental count for Each Working Day
Histogram of rental count for Each Weekday
Histogram of rental count for Each Season
Histogram of rental count for Each Weather Situation
library(readr)
library(ggplot2)
library(ggcorrplot)
library(car)
## Loading required package: carData
library(fastDummies)
## Thank you for using fastDummies!
## To acknowledge our work, please cite the package:
## Kaplan, J. & Schlegel, B. (2023). fastDummies: Fast Creation of Dummy (Binary) Columns and Rows from Categorical Variables. Version 1.7.1. URL: https://github.com/jacobkap/fastDummies, https://jacobkap.github.io/fastDummies/.
bikes <- read_csv("C:/Users/mateo/OneDrive/Desktop/STA302/hour.csv")
## Rows: 17379 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): instant, season, yr, mnth, hr, holiday, weekday, workingday, weat...
## date (1): dteday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Outliers are identified using the IQR method and removed.
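The report does not show the code for this step, so the following is only a minimal sketch of the IQR rule, assuming the common 1.5 × IQR fences. The helper name `iqr_outliers` and the toy values are hypothetical and stand in for `bikes$cnt`.

```r
# Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR].
# k = 1.5 is the conventional choice and is an assumption here.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up values standing in for bikes$cnt
toy_cnt <- c(5, 12, 18, 20, 22, 25, 30, 31, 35, 400)

flagged <- iqr_outliers(toy_cnt)   # logical vector, TRUE for outliers
toy_clean <- toy_cnt[!flagged]     # keep only the non-outlying values
```

On a data frame, the same logical vector could be used to subset rows, for example `bikes[!iqr_outliers(bikes$cnt), ]`.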
Normality of the response variable: A histogram of our response variable “cnt” shows that its distribution is right-skewed. To reduce the risk of violating the normality of errors assumption, we transform the response variable so that it is closer to a normal distribution. We used the Box-Cox method to determine an appropriate lambda. The optimal lambda is plugged into the transformation function; once the transformation is applied, the response variable is approximately normally distributed, which we verify using a normal QQ plot.
# Estimate the Box-Cox transformation parameter
# (powerTransform() uses the Box-Cox family by default)
boxcox_result <- powerTransform(bikes$cnt + 1) # Adding 1 to avoid issues with zeros
# Get the optimal lambda from the results
optimal_lambda <- boxcox_result$lambda
# Apply the Box-Cox transformation with the optimal lambda
transformed_cnt <- bcPower(bikes$cnt + 1, optimal_lambda)
hist(transformed_cnt,
     main = "Distribution of Transformed Bike Rental Counts",
     xlab = "Transformed Bike Rental Counts",
     ylab = "Frequency",
     col = "skyblue")
qqnorm(transformed_cnt)
qqline(transformed_cnt)
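Because the model is fitted on the transformed scale, its predictions have to be mapped back before they can be read as rental counts. The sketch below writes out the Box-Cox transform and its inverse in base R, including the `+ 1` shift used above. The function names and `lambda_demo = 0.3` are illustrative assumptions, not the lambda estimated in this report.

```r
# Box-Cox transform of a strictly positive value u
bc_forward <- function(u, lambda) {
  if (abs(lambda) < 1e-8) log(u) else (u^lambda - 1) / lambda
}

# Inverse Box-Cox transform, mapping z back to the original scale
bc_inverse <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

lambda_demo <- 0.3                   # illustrative value only
counts <- c(0, 1, 16, 142, 977)      # made-up rental counts

z <- bc_forward(counts + 1, lambda_demo)    # same +1 shift as in the report
back <- bc_inverse(z, lambda_demo) - 1      # undo the transform, then the shift

all.equal(back, counts)
```

With a fitted model, the same inverse would be applied to the output of `predict()`, using the estimated lambda in place of `lambda_demo`.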
A critical aspect of developing a predictive model is assessing the relationships between features to identify potential multicollinearity. Multicollinearity occurs when two or more independent variables are highly correlated, which makes the model coefficients unstable. We use a correlation plot to identify any numerical variables that are highly correlated.
categorical <- c("workingday", "weekday", "holiday", "hr", "mnth", "yr","season","weathersit")
# Numerical variables
num <- c("atemp", "temp", "hum", "windspeed", "casual", "registered","cnt")
# Subset the numerical columns of interest
subset_data <- bikes[num]
# Calculate the correlation matrix
cor_matrix <- cor(subset_data)
# Plot the correlation matrix using ggcorrplot
ggcorrplot(cor_matrix, type = "lower", lab = TRUE)
The correlation plot shows a very high degree of correlation between adjusted (feeling) temperature and temperature. This suggests the presence of multicollinearity, so we should include only one of the two in our model to avoid instability. We opt for adjusted temperature because it represents the feeling temperature, which takes into account factors such as wind chill.
pairs(subset_data)
library(ggplot2)
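A correlation plot only shows pairwise relationships. A variance inflation factor (VIF) is a common complementary check, and it is not part of the original analysis. The sketch below computes it by hand in base R on simulated data. All variable names are hypothetical, and `atemp_sim` is constructed to be almost a copy of `temp_sim` to mimic the temp/atemp situation.

```r
set.seed(1)
n <- 200
temp_sim  <- runif(n)
atemp_sim <- temp_sim + rnorm(n, sd = 0.02)  # nearly collinear with temp_sim
hum_sim   <- runif(n)                        # unrelated to the other two
sim <- data.frame(temp_sim, atemp_sim, hum_sim)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(sim)
vifs
```

A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity. The `car` package already loaded in this report provides `vif()` for fitted models.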
library(gridExtra)
# Create individual plots
plot_hour <- ggplot(bikes, aes(x = hr, y = cnt)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Histogram of 'cnt' for Each Hour",
x = "Hour of the Day",
y = "Count of Bikes")
plot_month <- ggplot(bikes, aes(x = mnth, y = cnt)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Histogram of 'cnt' for Each Month",
x = "Month",
y = "Count of Bikes")
plot_workingday <- ggplot(bikes, aes(x = workingday, y = cnt)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Histogram of 'cnt' for Each Working Day",
x = "Working Day",
y = "Count of Bikes")
plot_weekday <- ggplot(bikes, aes(x = weekday, y = cnt)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Histogram of 'cnt' for Each Weekday",
x = "Weekday",
y = "Count of Bikes")
# Arrange plots in a grid
grid <- grid.arrange(plot_hour, plot_month, plot_workingday, plot_weekday, ncol = 2)
# Print the grid
print(grid)
## TableGrob (2 x 2) "arrange": 4 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (2-2,1-1) arrange gtable[layout]
## 4 4 (2-2,2-2) arrange gtable[layout]
library(ggplot2)
library(gridExtra)
# Create individual plots
plot_season <- ggplot(bikes, aes(x = season, y = cnt)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Histogram of 'cnt' for Each Season",
x = "Season",
y = "Count of Bikes")
plot_weathersit <- ggplot(bikes, aes(x = weathersit, y = cnt)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Histogram of 'cnt' for Each Weather Situation",
x = "Weather Situation",
y = "Count of Bikes")
# Arrange plots in a grid with one row and two columns
grid <- grid.arrange(plot_season, plot_weathersit, nrow = 1)
# Print the grid
print(grid)
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Hourly Bike Rentals: We observe distinct peaks in the morning around 7 to 8 am and in the late afternoon around 5 to 6 pm. These peaks coincide with the rush hours of typical daily commutes, suggesting a strong connection between bike rentals and regular working hours. Bike rental activity is therefore highest during prime commuting times, in line with users’ need to travel conveniently to and from work.
Working Days and Bike Demand: The histogram of bike rentals by working day reinforces this observation. Bike rentals are considerably higher on working days than on weekends. Note that these bars show total rentals and the dataset contains more working days than non-working days, so part of the gap reflects the number of days in each group.
Weather and Bike Demand:
Seasonality was also observed: the months from May to October saw the highest number of bike rentals. Furthermore, favorable weather conditions saw the largest number of bike rentals.
As we progress with predictive modeling, these insights will be integral in selecting features that accurately capture the dynamics of bike rental demand, contributing to the creation of a robust and actionable predictive model.
library(ggplot2)
library(gridExtra)
# Create individual box plots
plot_month <- ggplot(bikes, aes(x = factor(mnth), y = cnt)) +
geom_boxplot() +
labs(title = "Box Plot: Month vs. cnt",
x = "Month",
y = "Count of Bikes")
plot_hour <- ggplot(bikes, aes(x = factor(hr), y = cnt)) +
geom_boxplot() +
labs(title = "Box Plot: Hour vs. cnt",
x = "Hour",
y = "Count of Bikes")
plot_holiday <- ggplot(bikes, aes(x = factor(holiday), y = cnt)) +
geom_boxplot() +
labs(title = "Box Plot: Holiday vs. cnt",
x = "Holiday",
y = "Count of Bikes")
plot_weekday <- ggplot(bikes, aes(x = factor(weekday), y = cnt)) +
geom_boxplot() +
labs(title = "Box Plot: Weekday vs. cnt",
x = "Weekday",
y = "Count of Bikes")
plot_workingday <- ggplot(bikes, aes(x = factor(workingday), y = cnt)) +
geom_boxplot() +
labs(title = "Box Plot: Workingday vs. cnt",
x = "Workingday",
y = "Count of Bikes")
plot_weathersit <- ggplot(bikes, aes(x = factor(weathersit), y = cnt)) +
geom_boxplot() +
labs(title = "Box Plot: Weathersit vs. cnt",
x = "Weathersit",
y = "Count of Bikes")
# Arrange plots in a grid layout (2 rows by 3 columns)
grid <- grid.arrange(plot_month, plot_hour, plot_holiday,
                     plot_weekday, plot_workingday, plot_weathersit,
                     ncol = 3)
# Print the grid
print(grid)
## TableGrob (2 x 3) "arrange": 6 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## 4 4 (2-2,1-1) arrange gtable[layout]
## 5 5 (2-2,2-2) arrange gtable[layout]
## 6 6 (2-2,3-3) arrange gtable[layout]
library(ggplot2)
library(gridExtra)
# Create individual scatter plots
plot_atemp <- ggplot(bikes, aes(x = atemp, y = cnt)) +
geom_point() + geom_smooth(method = "lm") +
labs(title = "Scatter Plot: Feeling Temperature vs. Count of Bikes",
x = "Feeling Temperature",
y = "Count of Bikes")
plot_hum <- ggplot(bikes, aes(x = hum, y = cnt)) +
geom_point() + geom_smooth(method = "lm") +
labs(title = "Scatter Plot: Humidity vs. Count of Bikes",
x = "Humidity",
y = "Count of Bikes")
plot_windspeed <- ggplot(bikes, aes(x = windspeed, y = cnt)) +
geom_point() + geom_smooth(method = "lm") +
labs(title = "Scatter Plot: Windspeed vs. Count of Bikes",
x = "Windspeed",
y = "Count of Bikes")
# Arrange plots side by side
grid <- grid.arrange(plot_atemp, plot_hum, plot_windspeed, ncol = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Print the grid
print(grid)
## TableGrob (1 x 3) "arrange": 3 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
Feeling Temperature vs. Count of Bikes: The scatter plot depicting the relationship between feeling temperature (atemp) and the count of bike rentals (cnt) reveals a distinct pattern. It is evident that there exists a relatively strong positive linear relationship between feeling temperature and bike rental count. The plotted line of best fit aligns with the upward trend, indicating that as the feeling temperature increases, bike rentals tend to rise as well.
Humidity vs. Count of Bikes: Similarly, the scatter plot representing humidity (hum) against the count of bike rentals (cnt) reveals an inverse relationship. The plotted line of best fit slopes downward, indicating that as humidity increases, bike rentals experience a slight decrease. This suggests that individuals might be less inclined to rent bikes when humidity levels are high.
Windspeed vs. Count of Bikes: In contrast, the scatter plot of windspeed against the count of bike rentals (cnt) displays a relatively weaker linear relationship. The scattered data points and the less pronounced slope of the line of best fit indicate that windspeed might not be as influential in predicting bike rental demand as compared to other factors.
Encoding categorical Variables
In the process of preparing categorical variables for analysis, each unique category within a categorical variable is transformed into a distinct binary column. In this binary representation, the presence or absence of a category for a particular observation is indicated by a value of 1 or 0, respectively.
For instance, the hour predictor was considered a categorical variable due to its discrete nature. Given the 24 potential categories corresponding to each hour of the day, a total of 24 binary columns were generated to represent this variable’s categories.
Similarly, the variable **weathersit** underwent binary encoding, producing a set of binary columns, each dedicated to one of the four weather categories.
The variable **mnth**, representing the months, was encoded with a set of binary columns corresponding to the twelve distinct months.
Likewise, the **weekday** variable was encoded, producing individual binary columns for each day of the week.
Furthermore, the **season** variable underwent encoding to generate binary columns for each of the four seasons.
Lastly, the **workingday** and **holiday** variables were each encoded into two binary columns. For **workingday**, these are:
- **Workingday_1:** Indicates a regular workday (1)
- **Workingday_0:** Indicates a weekend or holiday (0)
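The encoding described above can be illustrated on a tiny example. This sketch uses base R’s `model.matrix()` rather than the `dummy_cols()` call used later in the report, and the toy data frame is made up. It also shows why one column per variable is dropped before fitting: with an intercept in the model, a full set of indicator columns is perfectly collinear.

```r
# Made-up data with a four-level categorical predictor
toy <- data.frame(
  weathersit = factor(c(1, 2, 3, 1, 4, 2)),
  cnt        = c(120, 80, 30, 150, 5, 90)
)

# One binary column per category (no intercept)
full_dummies <- model.matrix(~ weathersit - 1, data = toy)

# Intercept plus (levels - 1) columns: the first level is the reference
ref_dummies <- model.matrix(~ weathersit, data = toy)

full_dummies
colnames(ref_dummies)
```

Every row of `full_dummies` contains exactly one 1, so its columns always sum to the intercept column. This is the reason a reference category has to be left out of the regression.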
Inclusion of Predictor Variables:
A benchmark model incorporates all available predictor variables, regardless of their individual predictive power. This approach allows us to consider all factors that could potentially influence bike rental.
Predictor Selection using p-values:
To further refine the model and select the most impactful predictor variables, we used p-values as a criterion. The p-value is a statistical metric that helps determine the significance of individual predictors in explaining the variance in the response variable. Variables with lower p-values are deemed more significant contributors to the model’s predictive capacity. Using R, we determined that the variables that contribute least to the model’s predictive capacity are workingday_0, weathersit_1, weathersit_2, weathersit_3, mnth_11, and mnth_8.
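As a minimal sketch of how such a ranking can be obtained, the coefficient table returned by `summary()` on an `lm` fit holds the p-values in its `Pr(>|t|)` column. The data below are simulated and the variable names are hypothetical. The 0.05 cut-off is a conventional choice, not a threshold stated in this report.

```r
set.seed(42)
n <- 300
x_signal <- rnorm(n)                 # truly related to the response
x_noise  <- rnorm(n)                 # unrelated to the response
y <- 2 * x_signal + rnorm(n)

fit <- lm(y ~ x_signal + x_noise)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients
p_values <- coef_table[, "Pr(>|t|)"]

# Largest p-values first: candidates for removal
sort(p_values, decreasing = TRUE)

# Terms that fail the assumed 0.05 threshold
weak_terms <- names(p_values)[p_values > 0.05]
```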
# Load the library
library(fastDummies)
# Work on a copy of the bikes data
data <- bikes
# Install and load necessary library (if not already installed)
# install.packages("dummy")
library(dummy)
## dummy 0.1.3
## dummyNews()
data <- dummy_cols(data,
select_columns = c('hr', 'season','holiday','workingday','weathersit','mnth','weekday'))
# Remove specified columns from the data
data <- data[, !(colnames(data) %in% c('hr',"season", "holiday", "workingday", "yr", "weathersit","mnth",'weekday'))]
X_test <- data[, !(colnames(data) %in% c("registered","casual","instant","dteday",'temp'))]
X_test <- X_test[, !(colnames(X_test) %in% c("mnth_12","weathersit_4","season_4", "hr_23","weekday_5",'weekday_5','weekday_6','holiday_1','workingday_1'))]
#predictor variables, categorical vars
X <- data[, !(colnames(data) %in% c("cnt","registered","casual","instant","dteday",'temp'))]
X <- X[, !(colnames(X) %in% c("mnth_12","weathersit_4","season_4", "hr_23","weekday_5",'weekday_5','weekday_6','holiday_1','workingday_1'))]
model.all <- lm(transformed_cnt ~ ., data=X)
most.predictive = lm(transformed_cnt~ hr_1 + hr_2 + hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 + hr_19 +
hr_20 + hr_21 + hr_22 + mnth_2+ mnth_3+ mnth_5+mnth_9 +weekday_0+weekday_1+ weekday_2+weekday_3+weekday_4+ season_3 + season_2 + hum +
windspeed + atemp, data=X)
Model Assumptions Evaluation:
In the pursuit of a reliable predictive model, it’s crucial to assess the assumptions that underpin linear regression. These assumptions provide a foundation for the accuracy and robustness of the model’s predictions. In this section, we delve into the evaluation of four key assumptions: Equal Variance, Linearity, Independence of Errors, and Normality of Errors.
Equal Variance (Homoscedasticity): A fundamental assumption is that the variance of residuals should remain relatively constant across the range of fitted values. To evaluate this assumption, we examined the residuals against the fitted values. Our residuals vs. fitted plot shows a roughly consistent variance of residuals, indicating that the spread of errors remains relatively uniform across different fitted values.
par(mfrow=c(1,2))
plot(most.predictive, 1:2)
Linearity: For non-categorical variables, the residuals should be evenly distributed around zero for most fitted values. By plotting the relationships between the predictors and the response variable, we can detect any curvature that might indicate deviations from linearity. Our scatter plots of feeling temperature, humidity, and windspeed against the count of bikes reveal linear trends with minimal curvature, thus supporting the assumption of linearity.
Independence of Errors: Independence of errors implies that there is no relationship between residuals and predictor variables. We scrutinized residual plots against various predictor variables, including temperature, humidity, windspeed, and hr_8, for any discernible patterns. These plots validate the independence of residuals, as no systematic patterns emerge, affirming that our model satisfies this assumption.
multi.res = resid(most.predictive)
par(mfrow=c(2, 2))
plot(X$atemp, multi.res,
ylab = "Residuals", xlab = "Temperature",
main = "Residual Plot: Temperature vs. Residuals")
plot(X$hum, multi.res,
ylab = "Residuals", xlab = "Humidity",
main = "Residual Plot: Humidity vs. Residuals")
plot(X$windspeed, multi.res,
ylab = "Residuals", xlab = "Windspeed",
main = "Residual Plot: Windspeed vs. Residuals")
plot(X$hr_8, multi.res,
     ylab = "Residuals", xlab = "hr_8",
     main = "Residual Plot: hr_8 vs. Residuals")
Normality of Errors: The assumption of normality posits that the residuals should follow a normal distribution. We assessed this with a normal QQ plot of the standardized residuals. In our QQ plot the bulk of the points lie close to the reference line, indicating close adherence to a normal distribution. A minor deviation is observed at the outset of the plot, but this does not signify a violation of the normality assumption.
By systematically evaluating these four key assumptions, we affirm that our predictive model adheres to them. These assessments underscore the model’s robustness, providing us with the confidence to proceed with its application in predicting bike rental demand.
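Because the observations are hourly and ordered in time, a complementary check on the independence of errors is to look for serial correlation in the residuals. The sketch below computes the Durbin-Watson statistic by hand on simulated data. This check is not part of the original analysis, and the helper `dw_stat` and all variables are hypothetical. Values near 2 suggest little first-order autocorrelation, and values well below 2 suggest positive autocorrelation.

```r
set.seed(7)
n <- 500
x <- rnorm(n)

# Independent errors vs. AR(1) errors with coefficient 0.8
e_iid <- rnorm(n)
e_ar  <- as.numeric(stats::filter(rnorm(n), filter = 0.8, method = "recursive"))

y_iid <- 1 + 2 * x + e_iid
y_ar  <- 1 + 2 * x + e_ar

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares
dw_stat <- function(model) {
  e <- resid(model)
  sum(diff(e)^2) / sum(e^2)
}

dw_iid <- dw_stat(lm(y_iid ~ x))   # expected to be close to 2
dw_ar  <- dw_stat(lm(y_ar ~ x))    # expected to be well below 2
c(dw_iid = dw_iid, dw_ar = dw_ar)
```

The `car` package loaded in this report also offers `durbinWatsonTest()`, which adds a p-value to the same statistic.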
#model selection
library(MASS)
final_model <- stepAIC(most.predictive, direction = "both")
## Start: AIC=31619.13
## transformed_cnt ~ hr_1 + hr_2 + hr_3 + hr_4 + hr_5 + hr_6 + hr_7 +
## hr_8 + hr_9 + hr_10 + hr_11 + hr_12 + hr_13 + hr_14 + hr_15 +
## hr_16 + hr_17 + hr_18 + hr_19 + hr_20 + hr_21 + hr_22 + mnth_2 +
## mnth_3 + mnth_5 + mnth_9 + weekday_0 + weekday_1 + weekday_2 +
## weekday_3 + weekday_4 + season_3 + season_2 + hum + windspeed +
## atemp
##
## Df Sum of Sq RSS AIC
## <none> 106740 31619
## - mnth_5 1 54.2 106794 31626
## - weekday_4 1 96.9 106837 31633
## - hr_6 1 115.6 106855 31636
## - season_2 1 118.9 106858 31636
## - weekday_3 1 222.4 106962 31653
## - mnth_3 1 230.8 106970 31655
## - weekday_0 1 263.4 107003 31660
## - weekday_2 1 286.8 107026 31664
## - weekday_1 1 336.8 107076 31672
## - windspeed 1 586.3 107326 31712
## - mnth_2 1 829.3 107569 31752
## - mnth_9 1 840.0 107580 31753
## - season_3 1 1214.4 107954 31814
## - hr_22 1 2531.0 109271 32024
## - hr_1 1 2637.2 109377 32041
## - hr_5 1 4159.7 110899 32282
## - hum 1 4906.6 111646 32398
## - hr_10 1 5115.1 111855 32431
## - hr_21 1 5151.4 111891 32436
## - hr_2 1 5401.7 112141 32475
## - hr_11 1 6500.4 113240 32645
## - hr_14 1 6714.0 113454 32677
## - hr_15 1 7460.4 114200 32791
## - hr_13 1 8299.4 115039 32918
## - hr_7 1 8384.9 115124 32931
## - hr_20 1 8908.2 115648 33010
## - hr_12 1 9196.7 115936 33053
## - hr_3 1 9376.8 116116 33080
## - hr_9 1 11181.0 117921 33348
## - hr_4 1 11820.8 118560 33442
## - hr_16 1 12901.4 119641 33600
## - hr_19 1 15531.5 122271 33978
## - atemp 1 17734.0 124474 34288
## - hr_8 1 22980.3 129720 35006
## - hr_18 1 24078.7 130818 35152
## - hr_17 1 26290.1 133030 35444
summary(final_model)
##
## Call:
## lm(formula = transformed_cnt ~ hr_1 + hr_2 + hr_3 + hr_4 + hr_5 +
## hr_6 + hr_7 + hr_8 + hr_9 + hr_10 + hr_11 + hr_12 + hr_13 +
## hr_14 + hr_15 + hr_16 + hr_17 + hr_18 + hr_19 + hr_20 + hr_21 +
## hr_22 + mnth_2 + mnth_3 + mnth_5 + mnth_9 + weekday_0 + weekday_1 +
## weekday_2 + weekday_3 + weekday_4 + season_3 + season_2 +
## hum + windspeed + atemp, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3170 -1.5420 0.0178 1.6355 7.9526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.71043 0.13126 51.125 < 2e-16 ***
## hr_1 -2.33687 0.11290 -20.699 < 2e-16 ***
## hr_2 -3.36050 0.11344 -29.624 < 2e-16 ***
## hr_3 -4.46896 0.11450 -39.031 < 2e-16 ***
## hr_4 -5.02253 0.11461 -43.824 < 2e-16 ***
## hr_5 -2.95298 0.11359 -25.997 < 2e-16 ***
## hr_6 0.49066 0.11320 4.335 1.47e-05 ***
## hr_7 4.16908 0.11295 36.909 < 2e-16 ***
## hr_8 6.89075 0.11277 61.103 < 2e-16 ***
## hr_9 4.80754 0.11280 42.621 < 2e-16 ***
## hr_10 3.26177 0.11315 28.828 < 2e-16 ***
## hr_11 3.69803 0.11379 32.498 < 2e-16 ***
## hr_12 4.42680 0.11452 38.655 < 2e-16 ***
## hr_13 4.22981 0.11519 36.721 < 2e-16 ***
## hr_14 3.82378 0.11577 33.028 < 2e-16 ***
## hr_15 4.03718 0.11596 34.815 < 2e-16 ***
## hr_16 5.30044 0.11577 45.783 < 2e-16 ***
## hr_17 7.52866 0.11520 65.356 < 2e-16 ***
## hr_18 7.16940 0.11463 62.547 < 2e-16 ***
## hr_19 5.71588 0.11379 50.234 < 2e-16 ***
## hr_20 4.30815 0.11324 38.044 < 2e-16 ***
## hr_21 3.26559 0.11288 28.930 < 2e-16 ***
## hr_22 2.28574 0.11272 20.278 < 2e-16 ***
## mnth_2 -0.88238 0.07602 -11.608 < 2e-16 ***
## mnth_3 -0.43273 0.07066 -6.124 9.34e-10 ***
## mnth_5 0.24123 0.08132 2.966 0.00302 **
## mnth_9 0.85824 0.07347 11.682 < 2e-16 ***
## weekday_0 -0.39765 0.06078 -6.542 6.23e-11 ***
## weekday_1 -0.45134 0.06101 -7.398 1.45e-13 ***
## weekday_2 -0.41854 0.06131 -6.826 9.02e-12 ***
## weekday_3 -0.36750 0.06113 -6.012 1.87e-09 ***
## weekday_4 -0.24241 0.06108 -3.969 7.25e-05 ***
## season_3 -0.99189 0.07061 -14.047 < 2e-16 ***
## season_2 -0.26303 0.05985 -4.395 1.11e-05 ***
## hum -3.26119 0.11550 -28.234 < 2e-16 ***
## windspeed -1.60383 0.16432 -9.760 < 2e-16 ***
## atemp 9.47022 0.17643 53.677 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.481 on 17342 degrees of freedom
## Multiple R-squared: 0.7543, Adjusted R-squared: 0.7538
## F-statistic: 1479 on 36 and 17342 DF, p-value: < 2.2e-16
final_model$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## transformed_cnt ~ hr_1 + hr_2 + hr_3 + hr_4 + hr_5 + hr_6 + hr_7 +
## hr_8 + hr_9 + hr_10 + hr_11 + hr_12 + hr_13 + hr_14 + hr_15 +
## hr_16 + hr_17 + hr_18 + hr_19 + hr_20 + hr_21 + hr_22 + mnth_2 +
## mnth_3 + mnth_5 + mnth_9 + weekday_0 + weekday_1 + weekday_2 +
## weekday_3 + weekday_4 + season_3 + season_2 + hum + windspeed +
## atemp
##
## Final Model:
## transformed_cnt ~ hr_1 + hr_2 + hr_3 + hr_4 + hr_5 + hr_6 + hr_7 +
## hr_8 + hr_9 + hr_10 + hr_11 + hr_12 + hr_13 + hr_14 + hr_15 +
## hr_16 + hr_17 + hr_18 + hr_19 + hr_20 + hr_21 + hr_22 + mnth_2 +
## mnth_3 + mnth_5 + mnth_9 + weekday_0 + weekday_1 + weekday_2 +
## weekday_3 + weekday_4 + season_3 + season_2 + hum + windspeed +
## atemp
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 17342 106739.6 31619.13
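The stepwise search prints every candidate step by default, which produces very long output. A minimal sketch of a quieter call, on simulated data with hypothetical variable names, uses the `trace` argument of `stepAIC()` and then inspects the recorded path through the `anova` component of the result:

```r
library(MASS)

set.seed(3)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 3 * d$x1 + rnorm(n)     # only x1 truly affects y

full <- lm(y ~ x1 + x2 + x3, data = d)

# trace = FALSE suppresses the step-by-step tables
quiet <- stepAIC(full, direction = "both", trace = FALSE)

quiet$anova       # compact summary of the steps taken
formula(quiet)    # formula of the selected model
```

When no row other than the initial one appears in the path table, as in the output above, no single addition or removal lowered the AIC and the starting model was kept unchanged.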
final_model2 <- stepAIC(model.all, direction = "both")
## Start: AIC=29873.79
## transformed_cnt ~ atemp + hum + windspeed + hr_0 + hr_1 + hr_2 +
## hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
## hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 +
## hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 + season_3 +
## holiday_0 + workingday_0 + weathersit_1 + weathersit_2 +
## weathersit_3 + mnth_1 + mnth_2 + mnth_3 + mnth_4 + mnth_5 +
## mnth_6 + mnth_7 + mnth_8 + mnth_9 + mnth_10 + mnth_11 + weekday_0 +
## weekday_1 + weekday_2 + weekday_3 + weekday_4
##
## Df Sum of Sq RSS AIC
## - weathersit_2 1 0.1 96385 29872
## - weathersit_1 1 0.2 96385 29872
## - mnth_8 1 0.6 96386 29872
## - mnth_6 1 4.0 96389 29873
## - mnth_11 1 7.4 96393 29873
## - weathersit_3 1 8.1 96393 29873
## - mnth_4 1 10.7 96396 29874
## <none> 96385 29874
## - workingday_0 1 16.0 96401 29875
## - mnth_1 1 25.7 96411 29876
## - mnth_10 1 28.2 96413 29877
## - mnth_7 1 40.4 96426 29879
## - holiday_0 1 84.2 96469 29887
## - mnth_2 1 93.4 96479 29889
## - hr_6 1 98.2 96483 29889
## - mnth_9 1 101.4 96487 29890
## - mnth_5 1 101.6 96487 29890
## - weekday_4 1 115.2 96500 29893
## - mnth_3 1 118.8 96504 29893
## - windspeed 1 150.0 96535 29899
## - weekday_0 1 163.6 96549 29901
## - weekday_3 1 177.4 96563 29904
## - weekday_1 1 225.0 96610 29912
## - weekday_2 1 258.4 96644 29918
## - season_2 1 295.5 96681 29925
## - season_3 1 335.4 96721 29932
## - hr_22 1 749.2 97134 30006
## - hr_0 1 1128.9 97514 30074
## - hum 1 1878.7 98264 30207
## - season_1 1 1909.3 98295 30213
## - hr_10 1 2166.3 98551 30258
## - hr_21 1 2187.7 98573 30262
## - hr_11 1 3100.5 99486 30422
## - hr_14 1 3596.7 99982 30508
## - hr_7 1 3735.6 100121 30533
## - hr_1 1 3872.9 100258 30556
## - hr_15 1 4119.7 100505 30599
## - hr_13 1 4528.8 100914 30670
## - hr_20 1 4580.9 100966 30679
## - hr_12 1 4965.1 101350 30745
## - hr_9 1 5593.0 101978 30852
## - hr_5 1 5643.8 102029 30861
## - atemp 1 6015.8 102401 30924
## - hr_2 1 6677.9 103063 31036
## - hr_16 1 7683.9 104069 31205
## - hr_19 1 8907.5 105293 31408
## - hr_3 1 10539.9 106925 31675
## - hr_4 1 12796.0 109181 32038
## - hr_8 1 12857.0 109242 32048
## - hr_18 1 14976.8 111362 32382
## - hr_17 1 16756.6 113142 32657
##
## Step: AIC=29871.81
## transformed_cnt ~ atemp + hum + windspeed + hr_0 + hr_1 + hr_2 +
## hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
## hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 +
## hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 + season_3 +
## holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
## mnth_1 + mnth_2 + mnth_3 + mnth_4 + mnth_5 + mnth_6 + mnth_7 +
## mnth_8 + mnth_9 + mnth_10 + mnth_11 + weekday_0 + weekday_1 +
## weekday_2 + weekday_3 + weekday_4
##
## Df Sum of Sq RSS AIC
## - mnth_8 1 0.6 96386 29870
## - mnth_6 1 4.0 96389 29871
## - mnth_11 1 7.4 96393 29871
## - mnth_4 1 10.7 96396 29872
## <none> 96385 29872
## - workingday_0 1 16.0 96401 29873
## + weathersit_2 1 0.1 96385 29874
## - weathersit_1 1 25.2 96411 29874
## - mnth_1 1 25.6 96411 29874
## - mnth_10 1 28.2 96413 29875
## - mnth_7 1 40.4 96426 29877
## - holiday_0 1 84.1 96469 29885
## - mnth_2 1 93.4 96479 29887
## - hr_6 1 98.2 96483 29888
## - mnth_9 1 101.3 96487 29888
## - mnth_5 1 101.6 96487 29888
## - weekday_4 1 115.2 96501 29891
## - mnth_3 1 118.8 96504 29891
## - windspeed 1 150.0 96535 29897
## - weekday_0 1 163.5 96549 29899
## - weekday_3 1 177.5 96563 29902
## - weekday_1 1 225.1 96610 29910
## - weekday_2 1 258.4 96644 29916
## - season_2 1 295.5 96681 29923
## - season_3 1 335.4 96721 29930
## - hr_22 1 749.2 97134 30004
## - hr_0 1 1128.9 97514 30072
## - hum 1 1880.7 98266 30206
## - season_1 1 1909.3 98295 30211
## - hr_10 1 2166.3 98552 30256
## - hr_21 1 2187.7 98573 30260
## - hr_11 1 3100.4 99486 30420
## - weathersit_3 1 3306.1 99691 30456
## - hr_14 1 3596.6 99982 30506
## - hr_7 1 3735.7 100121 30531
## - hr_1 1 3874.2 100259 30555
## - hr_15 1 4119.6 100505 30597
## - hr_13 1 4528.7 100914 30668
## - hr_20 1 4580.8 100966 30677
## - hr_12 1 4965.0 101350 30743
## - hr_9 1 5593.0 101978 30850
## - hr_5 1 5643.7 102029 30859
## - atemp 1 6017.5 102403 30922
## - hr_2 1 6677.9 103063 31034
## - hr_16 1 7686.1 104071 31203
## - hr_19 1 8907.4 105293 31406
## - hr_3 1 10539.8 106925 31673
## - hr_4 1 12795.9 109181 32036
## - hr_8 1 12857.1 109242 32046
## - hr_18 1 14981.1 111366 32381
## - hr_17 1 16756.6 113142 32655
##
## Step: AIC=29869.92
## transformed_cnt ~ atemp + hum + windspeed + hr_0 + hr_1 + hr_2 +
## hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
## hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 +
## hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 + season_3 +
## holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
## mnth_1 + mnth_2 + mnth_3 + mnth_4 + mnth_5 + mnth_6 + mnth_7 +
## mnth_9 + mnth_10 + mnth_11 + weekday_0 + weekday_1 + weekday_2 +
## weekday_3 + weekday_4
##
## Df Sum of Sq RSS AIC
## - mnth_6 1 4.2 96390 29869
## - mnth_11 1 9.3 96395 29870
## <none> 96386 29870
## - mnth_4 1 11.3 96397 29870
## - workingday_0 1 16.1 96402 29871
## + mnth_8 1 0.6 96385 29872
## + weathersit_2 1 0.1 96386 29872
## - mnth_1 1 25.0 96411 29872
## - weathersit_1 1 25.6 96412 29873
## - mnth_10 1 29.0 96415 29873
## - holiday_0 1 84.9 96471 29883
## - mnth_2 1 94.7 96481 29885
## - hr_6 1 97.8 96484 29886
## - weekday_4 1 116.1 96502 29889
## - mnth_3 1 134.2 96520 29892
## - mnth_5 1 137.6 96524 29893
## - windspeed 1 150.0 96536 29895
## - weekday_0 1 163.2 96549 29897
## - mnth_7 1 176.2 96562 29900
## - weekday_3 1 178.1 96564 29900
## - weekday_1 1 225.2 96611 29908
## - mnth_9 1 242.4 96628 29912
## - weekday_2 1 259.2 96645 29915
## - season_2 1 321.4 96707 29926
## - season_3 1 561.5 96947 29969
## - hr_22 1 749.0 97135 30002
## - hr_0 1 1128.6 97514 30070
## - hum 1 1883.4 98269 30204
## - season_1 1 1910.4 98296 30209
## - hr_10 1 2165.7 98552 30254
## - hr_21 1 2187.0 98573 30258
## - hr_11 1 3100.5 99486 30418
## - weathersit_3 1 3308.4 99694 30454
## - hr_14 1 3603.7 99990 30506
## - hr_7 1 3741.3 100127 30530
## - hr_1 1 3873.5 100259 30553
## - hr_15 1 4129.1 100515 30597
## - hr_13 1 4535.7 100922 30667
## - hr_20 1 4580.9 100967 30675
## - hr_12 1 4969.2 101355 30742
## - hr_9 1 5593.2 101979 30848
## - hr_5 1 5645.4 102031 30857
## - hr_2 1 6677.6 103064 31032
## - atemp 1 7349.6 103736 31145
## - hr_16 1 7705.2 104091 31204
## - hr_19 1 8912.5 105298 31405
## - hr_3 1 10541.4 106927 31672
## - hr_4 1 12800.7 109187 32035
## - hr_8 1 12863.5 109249 32045
## - hr_18 1 15002.2 111388 32382
## - hr_17 1 16792.6 113179 32659
##
## Step: AIC=29868.68
## transformed_cnt ~ atemp + hum + windspeed + hr_0 + hr_1 + hr_2 +
## hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
## hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 +
## hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 + season_3 +
## holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
## mnth_1 + mnth_2 + mnth_3 + mnth_4 + mnth_5 + mnth_7 + mnth_9 +
## mnth_10 + mnth_11 + weekday_0 + weekday_1 + weekday_2 + weekday_3 +
## weekday_4
##
## Df Sum of Sq RSS AIC
## - mnth_4 1 7.1 96397 29868
## - mnth_11 1 9.8 96400 29868
## <none> 96390 29869
## - workingday_0 1 16.0 96406 29870
## + mnth_6 1 4.2 96386 29870
## + mnth_8 1 0.8 96389 29871
## - mnth_1 1 21.7 96412 29871
## + weathersit_2 1 0.1 96390 29871
## - weathersit_1 1 25.8 96416 29871
## - mnth_10 1 27.1 96417 29872
## - holiday_0 1 86.2 96476 29882
## - mnth_2 1 90.7 96481 29883
## - hr_6 1 96.9 96487 29884
## - weekday_4 1 116.2 96506 29888
## - mnth_3 1 144.9 96535 29893
## - windspeed 1 149.6 96540 29894
## - weekday_0 1 163.3 96553 29896
## - weekday_3 1 178.8 96569 29899
## - mnth_5 1 199.3 96589 29903
## - mnth_7 1 217.3 96607 29906
## - weekday_1 1 225.5 96616 29907
## - mnth_9 1 240.7 96631 29910
## - weekday_2 1 259.4 96649 29913
## - season_2 1 420.2 96810 29942
## - season_3 1 557.8 96948 29967
## - hr_22 1 748.4 97138 30001
## - hr_0 1 1127.9 97518 30069
## - hum 1 1916.6 98307 30209
## - season_1 1 1991.4 98382 30222
## - hr_10 1 2163.4 98553 30252
## - hr_21 1 2185.0 98575 30256
## - hr_11 1 3096.7 99487 30416
## - weathersit_3 1 3304.4 99694 30452
## - hr_14 1 3600.5 99991 30504
## - hr_7 1 3747.8 100138 30530
## - hr_1 1 3871.4 100261 30551
## - hr_15 1 4126.6 100517 30595
## - hr_13 1 4532.5 100923 30665
## - hr_20 1 4577.2 100967 30673
## - hr_12 1 4965.3 101355 30740
## - hr_9 1 5592.9 101983 30847
## - hr_5 1 5641.3 102031 30855
## - hr_2 1 6674.4 103064 31030
## - hr_16 1 7706.6 104097 31203
## - atemp 1 8036.0 104426 31258
## - hr_19 1 8908.7 105299 31403
## - hr_3 1 10537.3 106927 31670
## - hr_4 1 12796.8 109187 32033
## - hr_8 1 12870.2 109260 32045
## - hr_18 1 15004.6 111395 32381
## - hr_17 1 16802.6 113193 32659
##
## Step: AIC=29867.96
## transformed_cnt ~ atemp + hum + windspeed + hr_0 + hr_1 + hr_2 +
## hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
## hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 +
## hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 + season_3 +
## holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
## mnth_1 + mnth_2 + mnth_3 + mnth_5 + mnth_7 + mnth_9 + mnth_10 +
## mnth_11 + weekday_0 + weekday_1 + weekday_2 + weekday_3 +
## weekday_4
##
## Df Sum of Sq RSS AIC
## - mnth_11 1 10.2 96407 29868
## <none> 96397 29868
## + mnth_4 1 7.1 96390 29869
## - workingday_0 1 16.1 96413 29869
## - mnth_1 1 17.1 96414 29869
## + mnth_8 1 0.7 96397 29870
## + weathersit_2 1 0.1 96397 29870
## + mnth_6 1 0.0 96397 29870
## - weathersit_1 1 25.3 96423 29871
## - mnth_10 1 28.0 96425 29871
## - mnth_2 1 83.9 96481 29881
## - holiday_0 1 85.2 96482 29881
## - hr_6 1 97.5 96495 29884
## - weekday_4 1 115.9 96513 29887
## - mnth_3 1 141.3 96539 29891
## - windspeed 1 146.3 96544 29892
## - weekday_0 1 162.6 96560 29895
## - weekday_3 1 178.8 96576 29898
## - mnth_7 1 214.5 96612 29905
## - weekday_1 1 225.4 96623 29907
## - mnth_5 1 225.9 96623 29907
## - mnth_9 1 241.5 96639 29909
## - weekday_2 1 258.1 96655 29912
## - season_2 1 519.7 96917 29959
## - season_3 1 550.7 96948 29965
## - hr_22 1 748.6 97146 30000
## - hr_0 1 1128.2 97525 30068
## - hum 1 1922.7 98320 30209
## - season_1 1 2010.4 98408 30225
## - hr_10 1 2163.9 98561 30252
## - hr_21 1 2186.1 98583 30256
## - hr_11 1 3098.9 99496 30416
## - weathersit_3 1 3306.6 99704 30452
## - hr_14 1 3606.6 100004 30504
## - hr_7 1 3744.9 100142 30528
## - hr_1 1 3872.7 100270 30551
## - hr_15 1 4133.5 100531 30596
## - hr_13 1 4538.6 100936 30666
## - hr_20 1 4579.6 100977 30673
## - hr_12 1 4970.0 101367 30740
## - hr_9 1 5591.8 101989 30846
## - hr_5 1 5646.0 102043 30855
## - hr_2 1 6677.3 103075 31030
## - hr_16 1 7715.8 104113 31204
## - atemp 1 8314.2 104711 31304
## - hr_19 1 8913.5 105311 31403
## - hr_3 1 10542.0 106939 31670
## - hr_4 1 12802.8 109200 32033
## - hr_8 1 12866.6 109264 32043
## - hr_18 1 15013.4 111411 32381
## - hr_17 1 16814.7 113212 32660
##
## Step: AIC=29867.8
## transformed_cnt ~ atemp + hum + windspeed + hr_0 + hr_1 + hr_2 +
## hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
## hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 +
## hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 + season_3 +
## holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
## mnth_1 + mnth_2 + mnth_3 + mnth_5 + mnth_7 + mnth_9 + mnth_10 +
## weekday_0 + weekday_1 + weekday_2 + weekday_3 + weekday_4
##
## Df Sum of Sq RSS AIC
## <none> 96407 29868
## + mnth_11 1 10.2 96397 29868
## + mnth_4 1 7.5 96400 29868
## - workingday_0 1 16.0 96423 29869
## - mnth_1 1 17.5 96425 29869
## + weathersit_2 1 0.1 96407 29870
## + mnth_8 1 0.1 96407 29870
## + mnth_6 1 0.0 96407 29870
## - weathersit_1 1 24.8 96432 29870
## - mnth_10 1 67.4 96475 29878
## - mnth_2 1 84.4 96492 29881
## - holiday_0 1 90.8 96498 29882
## - hr_6 1 97.4 96505 29883
## - weekday_4 1 115.8 96523 29887
## - mnth_3 1 141.4 96549 29891
## - windspeed 1 148.7 96556 29893
## - weekday_0 1 162.3 96570 29895
## - weekday_3 1 179.5 96587 29898
## - mnth_7 1 209.1 96617 29903
## - weekday_1 1 223.2 96631 29906
## - mnth_5 1 224.4 96632 29906
## - weekday_2 1 258.3 96666 29912
## - mnth_9 1 279.4 96687 29916
## - season_3 1 613.6 97021 29976
## - season_2 1 657.4 97065 29984
## - hr_22 1 748.6 97156 30000
## - hr_0 1 1128.2 97536 30068
## - hum 1 1915.5 98323 30208
## - hr_10 1 2164.1 98572 30252
## - hr_21 1 2186.1 98594 30255
## - season_1 1 2425.7 98833 30298
## - hr_11 1 3099.1 99507 30416
## - weathersit_3 1 3312.0 99719 30453
## - hr_14 1 3606.7 100014 30504
## - hr_7 1 3745.4 100153 30528
## - hr_1 1 3872.6 100280 30550
## - hr_15 1 4133.6 100541 30595
## - hr_13 1 4538.7 100946 30665
## - hr_20 1 4579.9 100987 30672
## - hr_12 1 4970.2 101378 30739
## - hr_9 1 5592.2 102000 30846
## - hr_5 1 5645.6 102053 30855
## - hr_2 1 6676.8 103084 31030
## - hr_16 1 7716.4 104124 31204
## - atemp 1 8393.4 104801 31317
## - hr_19 1 8914.0 105321 31403
## - hr_3 1 10541.3 106949 31669
## - hr_4 1 12802.3 109210 32033
## - hr_8 1 12867.3 109275 32043
## - hr_18 1 15014.4 111422 32381
## - hr_17 1 16815.8 113223 32660
summary(final_model2)
##
## Call:
## lm(formula = transformed_cnt ~ atemp + hum + windspeed + hr_0 +
## hr_1 + hr_2 + hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 +
## hr_10 + hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 +
## hr_18 + hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 +
## season_3 + holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
## mnth_1 + mnth_2 + mnth_3 + mnth_5 + mnth_7 + mnth_9 + mnth_10 +
## weekday_0 + weekday_1 + weekday_2 + weekday_3 + weekday_4,
## data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0469 -1.5204 0.0182 1.5406 7.7131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.79702 0.21790 35.783 < 2e-16 ***
## atemp 7.44762 0.19172 38.846 < 2e-16 ***
## hum -2.35796 0.12706 -18.558 < 2e-16 ***
## windspeed -0.81818 0.15826 -5.170 2.37e-07 ***
## hr_0 -1.76240 0.12374 -14.242 < 2e-16 ***
## hr_1 -3.26925 0.12390 -26.387 < 2e-16 ***
## hr_2 -4.30918 0.12437 -34.647 < 2e-16 ***
## hr_3 -5.45446 0.12529 -43.534 < 2e-16 ***
## hr_4 -6.01719 0.12542 -47.976 < 2e-16 ***
## hr_5 -3.97052 0.12463 -31.859 < 2e-16 ***
## hr_6 -0.52018 0.12429 -4.185 2.86e-05 ***
## hr_7 3.21773 0.12400 25.949 < 2e-16 ***
## hr_8 5.95501 0.12381 48.098 < 2e-16 ***
## hr_9 3.92596 0.12382 31.708 < 2e-16 ***
## hr_10 2.44778 0.12409 19.725 < 2e-16 ***
## hr_11 2.94387 0.12472 23.605 < 2e-16 ***
## hr_12 3.75074 0.12547 29.893 < 2e-16 ***
## hr_13 3.60135 0.12607 28.566 < 2e-16 ***
## hr_14 3.22367 0.12659 25.465 < 2e-16 ***
## hr_15 3.45532 0.12675 27.261 < 2e-16 ***
## hr_16 4.71123 0.12649 37.247 < 2e-16 ***
## hr_17 6.92580 0.12596 54.985 < 2e-16 ***
## hr_18 6.51438 0.12538 51.956 < 2e-16 ***
## hr_19 4.98629 0.12455 40.033 < 2e-16 ***
## hr_20 3.56044 0.12408 28.695 < 2e-16 ***
## hr_21 2.45367 0.12376 19.825 < 2e-16 ***
## hr_22 1.43454 0.12365 11.602 < 2e-16 ***
## season_1 -1.97936 0.09478 -20.884 < 2e-16 ***
## season_2 -0.75963 0.06987 -10.872 < 2e-16 ***
## season_3 -0.83139 0.07916 -10.503 < 2e-16 ***
## holiday_0 0.51989 0.12864 4.041 5.34e-05 ***
## workingday_0 -0.11318 0.06680 -1.694 0.090212 .
## weathersit_1 0.09372 0.04439 2.111 0.034753 *
## weathersit_3 -1.82337 0.07472 -24.402 < 2e-16 ***
## mnth_1 0.18959 0.10690 1.773 0.076166 .
## mnth_2 0.41262 0.10591 3.896 9.82e-05 ***
## mnth_3 0.43264 0.08579 5.043 4.64e-07 ***
## mnth_5 0.49732 0.07829 6.352 2.18e-10 ***
## mnth_7 -0.49477 0.08069 -6.132 8.88e-10 ***
## mnth_9 0.53027 0.07482 7.088 1.42e-12 ***
## mnth_10 0.27586 0.07926 3.480 0.000502 ***
## weekday_0 -0.36020 0.06668 -5.402 6.69e-08 ***
## weekday_1 -0.43390 0.06849 -6.335 2.43e-10 ***
## weekday_2 -0.45860 0.06729 -6.815 9.73e-12 ***
## weekday_3 -0.38151 0.06716 -5.681 1.36e-08 ***
## weekday_4 -0.30592 0.06706 -4.562 5.10e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.358 on 17333 degrees of freedom
## Multiple R-squared: 0.7781, Adjusted R-squared: 0.7775
## F-statistic: 1351 on 45 and 17333 DF, p-value: < 2.2e-16
final_model2$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## transformed_cnt ~ atemp + hum + windspeed + hr_0 + hr_1 + hr_2 +
## hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
## hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 +
## hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 + season_3 +
## holiday_0 + workingday_0 + weathersit_1 + weathersit_2 +
## weathersit_3 + mnth_1 + mnth_2 + mnth_3 + mnth_4 + mnth_5 +
## mnth_6 + mnth_7 + mnth_8 + mnth_9 + mnth_10 + mnth_11 + weekday_0 +
## weekday_1 + weekday_2 + weekday_3 + weekday_4
##
## Final Model:
## transformed_cnt ~ atemp + hum + windspeed + hr_0 + hr_1 + hr_2 +
## hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 + hr_10 +
## hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 + hr_18 +
## hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 + season_3 +
## holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
## mnth_1 + mnth_2 + mnth_3 + mnth_5 + mnth_7 + mnth_9 + mnth_10 +
## weekday_0 + weekday_1 + weekday_2 + weekday_3 + weekday_4
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 17328 96385.18 29873.79
## 2 - weathersit_2 1 0.09491342 17329 96385.28 29871.81
## 3 - mnth_8 1 0.63374174 17330 96385.91 29869.92
## 4 - mnth_6 1 4.18418304 17331 96390.09 29868.68
## 5 - mnth_4 1 7.14846484 17332 96397.24 29867.96
## 6 - mnth_11 1 10.18119155 17333 96407.42 29867.80
Feature Selection using AIC Stepwise Selection
To build a predictive model that is both accurate and interpretable for stakeholders, we select a subset of predictor variables that contribute meaningfully to the model's performance. Feature selection balances model complexity against predictive power. In this section, we describe our approach, which uses stepwise selection on the Akaike Information Criterion (AIC) with both forward and backward steps.
Because our primary objective is prediction and we want a model that is easy to interpret, we favored a simpler model as long as it did not sacrifice predictive accuracy.
Model Selection Process: We used the stepAIC function from the MASS package, which performs stepwise selection based on the AIC. At each step, the procedure considered both dropping a predictor (backward) and re-adding a previously dropped one (forward). We started from model.all, the initial model containing all predictors.
The stepwise procedure removed five predictors from the initial model, lowering the AIC from 29873.79 to 29867.80.
Key Differences: The following predictors were present in the initial model but excluded from the final model:
- weathersit_2
- mnth_4, mnth_6, mnth_8, mnth_11
By carefully selecting the predictor variables, we aim for a balance between predictive power and model simplicity, in line with our goal of a model that is both effective and easy for stakeholders to understand.
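As a rough illustration of the procedure described above, the following sketch runs stepAIC in both directions on R's built-in mtcars data. The dataset, the response mpg, and the object names full_fit, step_fit and dropped are stand-ins chosen for this example; they are not the models fitted in this report.

```r
library(MASS)  # provides stepAIC

# Stand-in for the report's initial model: regress on every available predictor.
full_fit <- lm(mpg ~ ., data = mtcars)

# direction = "both" lets each step either drop a term or re-add a dropped one.
# trace = 0 suppresses the step-by-step printout.
step_fit <- stepAIC(full_fit, direction = "both", trace = 0)

# The $anova component records which terms were removed or added at each step.
print(step_fit$anova)

# Terms that were in the full model but are absent from the selected model.
dropped <- setdiff(attr(terms(full_fit), "term.labels"),
                   attr(terms(step_fit), "term.labels"))
print(dropped)
```

Comparing the term labels of the two fits, as in the last lines, is one simple way to list exactly which predictors the search removed.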
par(mfrow=c(1,2))
plot(final_model2, 1:2)
Based on the residuals-vs-fitted and normal Q-Q plots, this subset of predictor variables does not appear to violate the OLS assumptions.
Interactions
# Create an interaction plot
X_test$cnt <- transformed_cnt
interaction.plot(
x.factor = bikes$atemp,
trace.factor = bikes$season,
response = bikes$cnt,
fun = median, #metric to plot
col = c("pink", "blue","black","red"),
lty = 1, #line type
lwd = 2, #line width
xlab = "Temperature",
ylab = "Count",
main = "Interaction Plot: season vs. Count"
)
# Create an interaction plot
X_test$cnt <- transformed_cnt
interaction.plot(
x.factor = bikes$hum,
trace.factor = bikes$season,
response = bikes$cnt,
fun = median, #metric to plot
col = c("pink", "blue","black","red"),
lty = 1, #line type
lwd = 2, #line width
xlab = "Humidity",
ylab = "Count",
main = "Interaction Plot: season vs. Count"
)
Model with Interactions
# Create the linear regression model with interactions
full_model_with_interactions <- lm(formula = transformed_cnt ~ atemp + hum + windspeed + hr_0 +
hr_1 + hr_2 + hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 +
hr_10 + hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 +
hr_18 + hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 +
season_3 + holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
mnth_1 + mnth_2 + mnth_3 + mnth_5 + mnth_7 + mnth_9 + mnth_10 +
weekday_0 + weekday_1 + weekday_2 + weekday_3 + weekday_4 + atemp:season_1 + hum:season_1,
data = X)
# Print the summary of the model with interactions
summary(full_model_with_interactions)
##
## Call:
## lm(formula = transformed_cnt ~ atemp + hum + windspeed + hr_0 +
## hr_1 + hr_2 + hr_3 + hr_4 + hr_5 + hr_6 + hr_7 + hr_8 + hr_9 +
## hr_10 + hr_11 + hr_12 + hr_13 + hr_14 + hr_15 + hr_16 + hr_17 +
## hr_18 + hr_19 + hr_20 + hr_21 + hr_22 + season_1 + season_2 +
## season_3 + holiday_0 + workingday_0 + weathersit_1 + weathersit_3 +
## mnth_1 + mnth_2 + mnth_3 + mnth_5 + mnth_7 + mnth_9 + mnth_10 +
## weekday_0 + weekday_1 + weekday_2 + weekday_3 + weekday_4 +
## atemp:season_1 + hum:season_1, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3564 -1.4924 0.0031 1.5273 7.5109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.73603 0.22684 38.511 < 2e-16 ***
## atemp 6.36944 0.22336 28.516 < 2e-16 ***
## hum -3.22522 0.14164 -22.770 < 2e-16 ***
## windspeed -0.67017 0.15773 -4.249 2.16e-05 ***
## hr_0 -1.76320 0.12301 -14.334 < 2e-16 ***
## hr_1 -3.27008 0.12316 -26.551 < 2e-16 ***
## hr_2 -4.30568 0.12364 -34.824 < 2e-16 ***
## hr_3 -5.45138 0.12456 -43.766 < 2e-16 ***
## hr_4 -6.01306 0.12469 -48.226 < 2e-16 ***
## hr_5 -3.96357 0.12389 -31.992 < 2e-16 ***
## hr_6 -0.51740 0.12355 -4.188 2.83e-05 ***
## hr_7 3.21577 0.12327 26.088 < 2e-16 ***
## hr_8 5.94291 0.12311 48.273 < 2e-16 ***
## hr_9 3.90506 0.12315 31.710 < 2e-16 ***
## hr_10 2.41893 0.12345 19.595 < 2e-16 ***
## hr_11 2.91321 0.12406 23.482 < 2e-16 ***
## hr_12 3.71297 0.12481 29.749 < 2e-16 ***
## hr_13 3.55790 0.12540 28.372 < 2e-16 ***
## hr_14 3.17680 0.12592 25.230 < 2e-16 ***
## hr_15 3.40967 0.12606 27.049 < 2e-16 ***
## hr_16 4.66495 0.12580 37.083 < 2e-16 ***
## hr_17 6.88232 0.12526 54.943 < 2e-16 ***
## hr_18 6.47758 0.12468 51.954 < 2e-16 ***
## hr_19 4.95765 0.12384 40.033 < 2e-16 ***
## hr_20 3.53780 0.12336 28.679 < 2e-16 ***
## hr_21 2.44411 0.12303 19.865 < 2e-16 ***
## hr_22 1.42846 0.12292 11.621 < 2e-16 ***
## season_1 -4.81298 0.22254 -21.627 < 2e-16 ***
## season_2 -0.66470 0.07139 -9.311 < 2e-16 ***
## season_3 -0.59026 0.08347 -7.071 1.59e-12 ***
## holiday_0 0.50370 0.12788 3.939 8.22e-05 ***
## workingday_0 -0.09386 0.06642 -1.413 0.157623
## weathersit_1 0.10253 0.04413 2.323 0.020175 *
## weathersit_3 -1.83923 0.07429 -24.758 < 2e-16 ***
## mnth_1 0.40163 0.10821 3.712 0.000207 ***
## mnth_2 0.51537 0.10554 4.883 1.05e-06 ***
## mnth_3 0.33236 0.08783 3.784 0.000155 ***
## mnth_5 0.63682 0.07842 8.120 4.96e-16 ***
## mnth_7 -0.43069 0.08068 -5.339 9.49e-08 ***
## mnth_9 0.62923 0.07469 8.425 < 2e-16 ***
## mnth_10 0.41425 0.07969 5.198 2.03e-07 ***
## weekday_0 -0.31508 0.06636 -4.748 2.07e-06 ***
## weekday_1 -0.36844 0.06826 -5.397 6.85e-08 ***
## weekday_2 -0.41591 0.06698 -6.209 5.45e-10 ***
## weekday_3 -0.34717 0.06683 -5.195 2.07e-07 ***
## weekday_4 -0.28468 0.06668 -4.270 1.97e-05 ***
## atemp:season_1 3.35639 0.41533 8.081 6.82e-16 ***
## hum:season_1 2.75290 0.21662 12.708 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.344 on 17331 degrees of freedom
## Multiple R-squared: 0.7807, Adjusted R-squared: 0.7801
## F-statistic: 1313 on 47 and 17331 DF, p-value: < 2.2e-16
Interactions in a linear model capture how the relationship between a predictor and the response variable changes with the value of another predictor.
Temperature and Season Interaction: The effect of temperature on bike rentals depends on the season. For instance, an increase in temperature might raise rentals more during colder seasons than during warmer ones.
Humidity and Season Interaction: Similarly, the effect of humidity on bike rentals differs by season.
In the fitted model, both interaction terms, atemp:season_1 and hum:season_1, are statistically significant.
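To make the interaction concrete, the sketch below computes the implied slopes of atemp and hum from the coefficients reported in the summary of full_model_with_interactions. It assumes season_1 is a 0/1 indicator, which the variable name suggests but the output does not state. The helper slope() is introduced here for illustration only, and all slopes are on the transformed_cnt scale.

```r
# Coefficients copied from the summary(full_model_with_interactions) output.
b <- c("atemp" = 6.36944, "hum" = -3.22522,
       "atemp:season_1" = 3.35639, "hum:season_1" = 2.75290)

# Assumption: season_1 is a 0/1 indicator. The slope of a continuous predictor
# is then the main effect when season_1 = 0 and the main effect plus the
# interaction coefficient when season_1 = 1.
slope <- function(main, inter, season_1) b[[main]] + season_1 * b[[inter]]

atemp_other   <- slope("atemp", "atemp:season_1", 0)
atemp_season1 <- slope("atemp", "atemp:season_1", 1)
hum_other     <- slope("hum", "hum:season_1", 0)
hum_season1   <- slope("hum", "hum:season_1", 1)

print(c(atemp_other = atemp_other, atemp_season1 = atemp_season1))
print(c(hum_other = hum_other, hum_season1 = hum_season1))
```

Under that assumption, the atemp slope rises from about 6.37 to about 9.73 in season 1, while the hum slope shrinks from about -3.23 to about -0.47.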
Model Comparison
Three candidate models were considered during model building. The first used only the most predictive variables, identified by examining the model summary in R. The second used the variables selected by forward and backward stepwise AIC selection. The third extended the second by adding interactions of temperature and humidity with the season_1 indicator (atemp:season_1 and hum:season_1).
Train-Test Split
To validate each model, the dataset was divided into an 80% training set and a 20% testing set. Because the dataset is large, the 20% test set is still big enough to give a stable error estimate. Mean Squared Error (MSE) on the test set was computed and compared across the models.
Comparing Adjusted R-Squared
Among the three models, the one incorporating interactions had the highest adjusted R-squared, 0.7801. The model from forward and backward stepwise feature selection had an adjusted R-squared of 0.7775. The model built from the most predictive features (those with the lowest p-values) had an adjusted R-squared of 0.7438.
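As a quick check on these figures, adjusted R-squared can be recomputed from the multiple R-squared, the sample size and the number of predictors reported in each summary. The sketch below does this for the stepwise and interaction models; adj_r2() is a helper defined here for illustration. The most predictive model is left out because its summary is not shown in this section.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual degrees of freedom + number of predictors + 1 (intercept).
# Both summaries give the same n: 17333 + 45 + 1 = 17331 + 47 + 1 = 17379.
n <- 17333 + 45 + 1

stepwise_adj    <- adj_r2(0.7781, n, 45)  # stepwise AIC model
interaction_adj <- adj_r2(0.7807, n, 47)  # model with interactions

print(round(c(stepwise = stepwise_adj, interactions = interaction_adj), 4))
```

The recomputed values round to 0.7775 and 0.7801, matching the summaries. With this many observations the penalty for two extra terms is tiny, so the gain comes almost entirely from the better fit.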
AIC, BIC
predictive_aic <- AIC(most.predictive)
predictive_bic <- BIC(most.predictive)
inter_aic <- AIC(full_model_with_interactions)
inter_bic <-BIC(full_model_with_interactions)
stepwise_aic <- AIC(final_model2)
stepwise_bic <- BIC(final_model2)
model_names <- c("Predictive Model", "Interactions", "Stepwise AIC Model")
aic_values <- c(predictive_aic, inter_aic, stepwise_aic)
bic_values <- c(predictive_bic, inter_bic, stepwise_bic)
model_comparison <- data.frame(Model = model_names, AIC = aic_values, BIC = bic_values)
# Print the table
print(model_comparison)
## Model AIC BIC
## 1 Predictive Model 80940.60 81235.59
## 2 Interactions 78984.48 79364.87
## 3 Stepwise AIC Model 79189.26 79554.13
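The AIC values in this table (around 79,000) are on a different scale from those printed in the stepwise trace (around 29,868). stepAIC relies on extractAIC(), which for lm fits omits additive constants of the log-likelihood, whereas AIC() uses the full log-likelihood. The sketch below shows the relationship on the built-in mtcars data. The model and variable names are illustrative only.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)

aic_full    <- AIC(fit)            # based on the full Gaussian log-likelihood
aic_reduced <- extractAIC(fit)[2]  # n * log(RSS / n) + 2 * edf

# For an unweighted lm the gap depends only on n: the dropped constants
# n * (log(2 * pi) + 1), plus 2 because AIC() also counts the error variance.
offset <- n * (log(2 * pi) + 1) + 2

print(c(AIC = aic_full, extractAIC = aic_reduced,
        difference = aic_full - aic_reduced, offset = offset))
```

Because the offset is identical for models fitted to the same rows, both versions rank the models in the same order.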
Model Validation
After validating the models on the test set to assess their accuracy on unseen data, we obtained the following results:
The model created through AIC stepwise selection produced an MSE of 5.541.
The model based on the most predictive variables produced an MSE of 6.172.
The model featuring interaction terms produced an MSE of 5.506.
Choosing a model
Adding the interactions to the model obtained from stepwise feature selection improves every metric considered: test MSE, adjusted R-squared, AIC and BIC.
We therefore select the model with interactions as the final model. The validation code and the final model's coefficients follow:
# Load necessary libraries
library(caret)
## Loading required package: lattice
# Set a random seed for reproducibility
set.seed(123)
# Perform the train-test split
trainIndex <- createDataPartition(X_test$cnt, p = 0.8, list = FALSE)
train_data <- X_test[trainIndex, ]
test_data <- X_test[-trainIndex, ]
# Test-set mean squared error (MSE) for each model
predictions.finalmodel2 <- predict(final_model2, newdata = test_data)
accuracy1 <- mean((predictions.finalmodel2 - test_data$cnt)^2) # stepwise AIC model
predictions.finalmodel1 <- predict(final_model, newdata = test_data)
accuracy2 <- mean((predictions.finalmodel1 - test_data$cnt)^2) # most predictive model
predictions.full_model_with_interactions <- predict(full_model_with_interactions, newdata = test_data)
accuracy3 <- mean((predictions.full_model_with_interactions - test_data$cnt)^2) # interactions model
accuracy1 #model aic
## [1] 5.541376
accuracy2 #model most predictive
## [1] 6.171635
accuracy3 #interactions
## [1] 5.505514
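The model summaries earlier in this report show data = X in each Call. If X includes the rows later placed in test_data, which cannot be confirmed from the code shown here, the reported MSEs would partly reflect in-sample fit. One way to keep the test rows unseen is to fit each model on the training rows only. The sketch below illustrates this with base R on the built-in mtcars data; the dataset, formula and object names are placeholders, not the report's models.

```r
set.seed(123)

# 80/20 split of row indices (base R; caret::createDataPartition is an alternative).
n <- nrow(mtcars)
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- mtcars[train_idx, ]
test_data  <- mtcars[-train_idx, ]

# Fit on the training rows only, so the test rows play no part in estimation.
fit <- lm(mpg ~ wt + hp, data = train_data)

# Mean squared error on the held-out rows.
pred     <- predict(fit, newdata = test_data)
test_mse <- mean((pred - test_data$mpg)^2)
print(test_mse)
```

Applied to this report, the same pattern would mean refitting each candidate formula with data = train_data before calling predict() on test_data.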
| Term | Estimate |
|---|---|
| (Intercept) | 8.7360310 |
| atemp | 6.3694392 |
| hum | -3.2252152 |
| windspeed | -0.6701685 |
| hr_0 | -1.7632018 |
| hr_1 | -3.2700784 |
| hr_2 | -4.3056782 |
| hr_3 | -5.4513768 |
| hr_4 | -6.0130602 |
| hr_5 | -3.9635660 |
| hr_6 | -0.5173983 |
| hr_7 | 3.2157680 |
| hr_8 | 5.9429135 |
| hr_9 | 3.9050611 |
| hr_10 | 2.4189285 |
| hr_11 | 2.9132143 |
| hr_12 | 3.7129743 |
| hr_13 | 3.5578955 |
| hr_14 | 3.1768046 |
| hr_15 | 3.4096720 |
| hr_16 | 4.6649531 |
| hr_17 | 6.8823209 |
| hr_18 | 6.4775755 |
| hr_19 | 4.9576486 |
| hr_20 | 3.5378034 |
| hr_21 | 2.4441070 |
| hr_22 | 1.4284566 |
| season_1 | -4.8129811 |
| season_2 | -0.6647039 |
| season_3 | -0.5902581 |
| holiday_0 | 0.5037029 |
| workingday_0 | -0.0938581 |
| weathersit_1 | 0.1025285 |
| weathersit_3 | -1.8392282 |
| mnth_1 | 0.4016336 |
| mnth_2 | 0.5153697 |
| mnth_3 | 0.3323629 |
| mnth_5 | 0.6368207 |
| mnth_7 | -0.4306901 |
| mnth_9 | 0.6292341 |
| mnth_10 | 0.4142468 |
| weekday_0 | -0.3150824 |
| weekday_1 | -0.3684438 |
| weekday_2 | -0.4159068 |
| weekday_3 | -0.3471668 |
| weekday_4 | -0.2846836 |
| atemp:season_1 | 3.3563923 |
| hum:season_1 | 2.7529042 |
Explain the Model