Logistic Regression
Analyze Motor Trend Car Data
You are a data scientist at a top dealership group in the USA. Your boss, Mr. Buffet, has asked you to analyze the Motor Trend car data. You are given a dataset, provided in the file mtcars.xlsx, containing fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.
Data Source: the datasets package in R.
Data Dictionary
This dataset contains the following columns:
Variable | Data Type | Description | Constraints/Rules |
---|---|---|---|
mpg | Numeric | Miles per gallon | Positive values only (mpg > 0) |
cyl | Integer | Number of cylinders | Categorical: {4, 6, 8} |
disp | Numeric | Displacement (cubic inches) | Positive values only (disp > 0) |
hp | Integer | Gross horsepower | Positive values only (hp > 0) |
drat | Numeric | Rear axle ratio | Positive values only (drat > 0) |
wt | Numeric | Weight (1000 lbs) | Positive values only (wt > 0) |
qsec | Numeric | 1/4 mile time (seconds) | Positive values only (qsec > 0) |
vs | Integer | Engine type: 0 = V-shaped, 1 = straight | Binary: {0, 1} |
am | Integer | Transmission: 0 = automatic, 1 = manual | Binary: {0, 1} |
gear | Integer | Number of forward gears | Categorical: {3, 4, 5} |
carb | Integer | Number of carburetors | Positive integer values only |
Question 1:
Load the dataset mtcars.xlsx into memory and convert column am to a factor using the factor() function.
Load the dataset mtcars.xlsx into memory
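The loading code is not shown in the report; a minimal sketch, assuming the readxl package and that mtcars.xlsx sits in the working directory (the name mtcars.df is our choice and is reused below):

library(readxl)
mtcars.df <- read_excel("mtcars.xlsx")  # returns a tibble, matching the output below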
Display the dimension of the data frame
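A one-line sketch, reusing the assumed mtcars.df name:

dim(mtcars.df)  # rows, columns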
## [1] 41 12
Display first six rows of the data frame
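Likely produced by head(); a sketch:

head(mtcars.df)  # first six rows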
## # A tibble: 6 × 12
## name mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
Convert column am to a factor using the factor() function.
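A hedged sketch; the 0/1 coding comes from the data dictionary and the level labels from the str() output below:

mtcars.df$am <- factor(mtcars.df$am,
                       levels = c(0, 1),
                       labels = c("automatic", "manual"))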
Display the structure of the data frame
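A one-line sketch:

str(mtcars.df)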
## tibble [41 × 12] (S3: tbl_df/tbl/data.frame)
## $ name: chr [1:41] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num [1:41] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num [1:41] 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num [1:41] 160 160 108 258 360 ...
## $ hp : num [1:41] 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num [1:41] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num [1:41] 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num [1:41] 16.5 17 18.6 19.4 17 ...
## $ vs : num [1:41] 0 0 1 1 0 1 0 1 1 1 ...
## $ am : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
## $ gear: num [1:41] 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num [1:41] 4 4 1 1 2 1 4 2 2 4 ...
Display the statistical summary of the data frame
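A one-line sketch:

summary(mtcars.df)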
## name mpg cyl disp
## Length:41 Min. :10.40 Min. :4.000 Min. : 71.1
## Class :character 1st Qu.:15.80 1st Qu.:4.000 1st Qu.:121.0
## Mode :character Median :19.70 Median :6.000 Median :167.6
## Mean :20.15 Mean :6.098 Mean :226.4
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:318.0
## Max. :33.90 Max. :8.000 Max. :472.0
## hp drat wt qsec
## Min. : 52.0 Min. :2.760 Min. :1.513 Min. :14.50
## 1st Qu.: 97.0 1st Qu.:3.080 1st Qu.:2.620 1st Qu.:16.90
## Median :110.0 Median :3.690 Median :3.215 Median :17.82
## Mean :141.8 Mean :3.579 Mean :3.181 Mean :17.91
## 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.570 3rd Qu.:18.90
## Max. :335.0 Max. :4.930 Max. :5.424 Max. :22.90
## vs am gear carb
## Min. :0.0000 automatic:24 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 manual :17 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4634 Mean :3.659 Mean :2.707
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Question 2:
Split the data into a training set and a test set. The training set contains the first 35 observations; the test set contains the remaining observations.
Select the first 35 rows of the dataset to create the training dataset and display the first 10 rows
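A sketch using row indexing; the name train_dataset matches the model call later in the report:

train_dataset <- mtcars.df[1:35, ]  # first 35 observations
head(train_dataset, 10)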
## # A tibble: 10 × 12
## name mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 manu… 4 4
## 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 manu… 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 manu… 4 1
## 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 auto… 3 1
## 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 auto… 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 auto… 3 1
## 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 auto… 3 4
## 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 auto… 4 2
## 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 auto… 4 2
## 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 auto… 4 4
Use the remaining rows of the dataset to create the testing dataset
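A sketch; the name test_dataset matches the prediction code later in the report:

test_dataset <- mtcars.df[36:nrow(mtcars.df), ]  # remaining observations (rows 36-41)
head(test_dataset)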
## # A tibble: 6 × 12
## name mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 Hornet 4 Dr… 21.4 6 258 110 3.08 3.22 19.4 1 auto… 3 1
## 2 Hornet Spor… 18.7 8 360 175 3.15 3.44 17.0 0 auto… 3 2
## 3 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 auto… 3 1
## 4 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 auto… 3 4
## 5 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 auto… 4 2
## 6 Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 manu… 4 2
Question 3:
Build a logistic regression model with am as the response and mpg, cyl, hp, and wt as the predictors, using the glm() function.
Build a logistic regression model using the glm() function
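The call below is reconstructed from the Call: line in the summary output that follows:

model.fit <- glm(am ~ mpg + cyl + hp + wt,
                 family = binomial,
                 data = train_dataset)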
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
This warning indicates (quasi-)complete separation: the predictors separate the two transmission classes almost perfectly in the training data, which drives some fitted probabilities to 0 or 1 and inflates the coefficient standard errors seen below.
Display the statistical summary of the model
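A one-line sketch:

summary(model.fit)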
##
## Call:
## glm(formula = am ~ mpg + cyl + hp + wt, family = binomial, data = train_dataset)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -38.75178 68.12958 -0.569 0.569
## mpg 2.33124 2.94811 0.791 0.429
## cyl 1.67965 1.55458 1.080 0.280
## hp 0.09628 0.10450 0.921 0.357
## wt -10.46900 5.67417 -1.845 0.065 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 48.2628 on 34 degrees of freedom
## Residual deviance: 8.9219 on 30 degrees of freedom
## AIC: 18.922
##
## Number of Fisher Scoring iterations: 11
Question 4:
Compute the test error on the test data set using a confusion matrix. Is it a good model based on test error?
# Predicted probabilities of "manual" on the test set
test_predictions <- predict(model.fit,
                            newdata = test_dataset,
                            type = "response")
# Classify using a 0.5 probability cutoff
test_pred_class <- ifelse(test_predictions > 0.5, "manual", "automatic")
Create the confusion matrix
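A sketch assuming the caret package; the arguments mirror the warning message below:

library(caret)
conf_matrix <- confusionMatrix(factor(test_pred_class), test_dataset$am)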
## Warning in confusionMatrix.default(factor(test_pred_class), test_dataset$am):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
Print the confusion matrix
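One line, printing the caret object created above:

conf_matrix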
## Confusion Matrix and Statistics
##
## Reference
## Prediction automatic manual
## automatic 5 1
## manual 0 0
##
## Accuracy : 0.8333
## 95% CI : (0.3588, 0.9958)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.7368
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.8333
## Neg Pred Value : NaN
## Prevalence : 0.8333
## Detection Rate : 0.8333
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : automatic
##
Despite its 83.33% accuracy, the model is ineffective on this test set. It labels every test car “automatic”, so it identifies all automatic cases (sensitivity = 1.00) but misses the single manual case (specificity = 0.00). The accuracy merely matches the no-information rate of 0.8333, and the Kappa of 0 indicates no classification ability beyond always guessing the majority class. The balanced accuracy of 0.50 and McNemar’s test p-value of 1.000 likewise show no discrimination between the classes. With only six test observations and a training fit that exhibited complete separation, the model is overfit and unreliable for classifying transmission type.
Bike Sharing System
You are working as a data scientist for the Washington, D.C. city government. Washington, D.C. operates a bike sharing system: people can rent a bike at one location and return it at another. You are given historical usage patterns with weather data in the file bike.csv, and you are asked to forecast bike rental demand in the Capital Bikeshare program.
Data Dictionary
Data Source: The data is from Kaggle at https://www.kaggle.com/c/bike-sharing-demand, and contains the following columns:
Variable | Data Type | Description | Constraints/Rules |
---|---|---|---|
datetime | Datetime | Hourly date and timestamp | Standard datetime format (YYYY-MM-DD HH:MM:SS) |
season | Integer | Season of the year | Categorical: {1 = Spring, 2 = Summer, 3 = Fall, 4 = Winter} |
holiday | Binary (Integer) | Whether the day is a holiday | {0 = No, 1 = Yes} |
workingday | Binary (Integer) | Whether the day is a working day (not a weekend/holiday) | {0 = No, 1 = Yes} |
weather | Integer | Weather condition | Categorical: {1 = Clear/Few Clouds, 2 = Mist/Cloudy, 3 = Light Snow/Rain, 4 = Heavy Rain/Snow/Fog} |
temp | Numeric | Temperature in Celsius | Continuous, typically within [-10, 45]°C |
atemp | Numeric | “Feels like” temperature in Celsius | Continuous, typically within [-10, 50]°C |
humidity | Numeric | Relative humidity (%) | Ranges from 0 to 100 |
windspeed | Numeric | Wind speed (m/s) | Non-negative values (windspeed ≥ 0) |
casual | Integer | Number of rentals by non-registered users | Non-negative integer (casual ≥ 0) |
registered | Integer | Number of rentals by registered users | Non-negative integer (registered ≥ 0) |
count | Integer | Total number of rentals (casual + registered) | Non-negative integer (count = casual + registered) |
Question 1:
Build a linear model to forecast the number of total rentals (count) using the potential predictors season, holiday, workingday, weather, atemp, and registered.
Load the dataset
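A minimal sketch, assuming bike.csv is in the working directory; the name bike.df matches the code later in this section:

bike.df <- read.csv("bike.csv")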
Display first few rows
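A one-line sketch:

head(bike.df)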
## datetime season holiday workingday weather temp atemp humidity
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.880 75
## windspeed casual registered count
## 1 0.0000 3 13 16
## 2 0.0000 8 32 40
## 3 0.0000 5 27 32
## 4 0.0000 3 10 13
## 5 0.0000 0 1 1
## 6 6.0032 0 1 1
Display the dimension of the data frame
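A one-line sketch:

dim(bike.df)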
## [1] 10886 12
Display column names
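A one-line sketch:

colnames(bike.df)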
## [1] "datetime" "season" "holiday" "workingday" "weather"
## [6] "temp" "atemp" "humidity" "windspeed" "casual"
## [11] "registered" "count"
Convert datetime to Date format if needed
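A hedged one-liner; as.POSIXct preserves the hourly timestamp (as.Date would drop the time of day):

bike.df$datetime <- as.POSIXct(bike.df$datetime, format = "%Y-%m-%d %H:%M:%S")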
Convert season categorical variables to factors
bike.df$season <- factor(bike.df$season,
                         levels = c(1, 2, 3, 4),
                         labels = c("Spring", "Summer", "Fall", "Winter"))
Convert holiday categorical variables to factors
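A sketch mirroring the season conversion; the No/Yes labels are inferred from the coefficient name holidayYes in the model summary below:

bike.df$holiday <- factor(bike.df$holiday,
                          levels = c(0, 1),
                          labels = c("No", "Yes"))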
Convert workingday categorical variables to factors
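Likewise for workingday (the coefficient workingdayYes below implies the same labels):

bike.df$workingday <- factor(bike.df$workingday,
                             levels = c(0, 1),
                             labels = c("No", "Yes"))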
Convert weather categorical variables to factors
bike.df$weather <- factor(bike.df$weather,
                          levels = c(1, 2, 3, 4),
                          labels = c("Clear", "Misty_cloudy",
                                     "Light_snow", "Heavy_rain"))
Build linear model for count prediction
linear_model <- lm(count ~ season + holiday + workingday +
                     weather + atemp + registered,
                   data = bike.df)
Display the statistical summary of the model
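A one-line sketch:

summary(linear_model)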
##
## Call:
## lm(formula = count ~ season + holiday + workingday + weather +
## atemp + registered, data = bike.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.215 -20.295 -3.242 14.058 265.418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.742071 1.296078 -4.430 9.50e-06 ***
## seasonSummer -5.372699 1.192074 -4.507 6.64e-06 ***
## seasonFall -18.014488 1.448610 -12.436 < 2e-16 ***
## seasonWinter -8.639163 1.003388 -8.610 < 2e-16 ***
## holidayYes -11.827273 2.086374 -5.669 1.47e-08 ***
## workingdayYes -41.745729 0.751281 -55.566 < 2e-16 ***
## weatherMisty_cloudy -4.962998 0.780422 -6.359 2.11e-10 ***
## weatherLight_snow -8.562471 1.276496 -6.708 2.07e-11 ***
## weatherHeavy_rain 3.030360 35.080964 0.086 0.931
## atemp 2.475376 0.065339 37.885 < 2e-16 ***
## registered 1.141296 0.002414 472.758 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.07 on 10875 degrees of freedom
## Multiple R-squared: 0.9625, Adjusted R-squared: 0.9625
## F-statistic: 2.795e+04 on 10 and 10875 DF, p-value: < 2.2e-16
Generate predictions using the model and round the predictions
Show first 15 rounded predictions
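A sketch of these two steps; predictions are made on the full dataset since no train/test split is used in this question, and the name predictions is our choice:

predictions <- round(predict(linear_model, newdata = bike.df))
head(predictions, 15)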
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 45 65 59 41 31 22 28 28 38 44 70 70 110 99 127
The model explains bike rental patterns well: seasonality, holidays, and working days all have substantial effects. Misty/cloudy and light snow/rain conditions decrease rentals, while higher temperature increases them. Heavy rain shows no significant effect, but its very large standard error (35.08) suggests that almost no observations fall in that category, so little can be concluded from that coefficient. Finally, the number of registered users is by far the strongest predictor, reinforcing the idea that the bike-sharing system relies heavily on frequent users rather than occasional riders.
Question 2:
Perform best subset selection using the bestglm() function based on BIC. What’s the best model based on BIC?
Prepare the data for bestglm() (it needs a data frame containing only the predictors and the response)
model_data <- model.matrix(~ season + holiday + workingday + weather +
                             atemp + registered + count, data = bike.df)
Remove the first column from the model_data dataset
Convert model_data to a data frame
Find the best model based on the BIC.
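A hedged sketch of the three steps above, assuming the bestglm package; bestglm() requires a data frame whose last column is the response, which the model.matrix() call arranges by listing count last:

model_data <- model_data[, -1]              # drop the "(Intercept)" column
model_data.df <- as.data.frame(model_data)  # bestglm() needs a plain data frame
library(bestglm)
best_bic_model <- bestglm(Xy = model_data.df, family = gaussian, IC = "BIC")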
Display the statistical summary of the best model stored in best_bic_model
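The fitted best model is stored in the BestModel component:

summary(best_bic_model$BestModel)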
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.216 -20.295 -3.242 14.057 265.418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.740447 1.295882 -4.430 9.53e-06 ***
## seasonSummer -5.373357 1.191995 -4.508 6.62e-06 ***
## seasonFall -18.014895 1.448537 -12.437 < 2e-16 ***
## seasonWinter -8.640114 1.003281 -8.612 < 2e-16 ***
## holidayYes -11.827253 2.086279 -5.669 1.47e-08 ***
## workingdayYes -41.745324 0.751232 -55.569 < 2e-16 ***
## weatherMisty_cloudy -4.963428 0.780370 -6.360 2.09e-10 ***
## weatherLight_snow -8.562900 1.276428 -6.708 2.06e-11 ***
## atemp 2.475328 0.065334 37.888 < 2e-16 ***
## registered 1.141297 0.002414 472.785 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.07 on 10876 degrees of freedom
## Multiple R-squared: 0.9625, Adjusted R-squared: 0.9625
## F-statistic: 3.106e+04 on 9 and 10876 DF, p-value: < 2.2e-16
This model suggests strong predictive performance, with seasonal, weather, holiday, and temperature-related factors all significantly influencing the outcome variable. The model’s high R-squared and significance levels indicate it is likely capturing the key drivers in the data effectively.
Question 3:
Compute the test error of the best model based on BIC using LOOCV.
Test error using LOOCV
library(caret)  # assumed; provides trainControl() and train()
loocv_control <- trainControl(method = "LOOCV")
loocv_model <- train(
  count ~ season + holiday + workingday + weather + atemp + registered,
  data = bike.df,
  method = "lm",
  trControl = loocv_control
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
This warning likely stems from the nearly empty Heavy_rain weather level: when its few observations are held out of a fold, the corresponding dummy column is constant in the training data, making the fit rank-deficient.
Print the loocv_model results
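One line prints the caret object:

loocv_model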
## Linear Regression
##
## 10886 samples
## 6 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 10885, 10885, 10885, 10885, 10885, 10885, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 35.09127 0.9624692 23.88103
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Access the Root Mean Squared Error (RMSE) value from the results of a Leave-One-Out Cross-Validation (LOOCV) model
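The RMSE sits in the results table of the caret object:

loocv_model$results$RMSE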
## [1] 35.09127
Question 4:
Calculate the test error of the best model based on BIC using 10-fold CV.
Test error using 10-fold CV
cv_control <- trainControl(method = "cv", number = 10)
cv_model <- train(
  count ~ season + holiday + workingday +
    weather + atemp + registered,
  data = bike.df,
  method = "lm",
  trControl = cv_control
)
## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases
Print the results of the cv_model
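One line prints the caret object:

cv_model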
## Linear Regression
##
## 10886 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 9797, 9796, 9796, 9798, 9798, 9798, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 35.06708 0.9625144 23.88112
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Access the Root Mean Squared Error (RMSE) value from the results of a Cross-Validation model
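As before, from the results table of the caret object:

cv_model$results$RMSE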
## [1] 35.06708
Note Qn3 & Qn4:
An RMSE value of about 35 indicates that, on average, the model’s predictions deviate from the actual rental counts by roughly 35 rentals. In general, lower RMSE values signify better model performance, as they indicate smaller discrepancies between predicted and actual values.
Question 5:
Perform best subset selection using the bestglm() function based on CV. What’s the best model based on CV?
Best subset selection based on CV
best_cv_model <- bestglm(
  Xy = model_data.df,
  family = gaussian,
  IC = "CV",
  CVArgs = list(Method = "HTF", K = 10, REP = 1)
)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
Print the results of the best_cv_model
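One line prints the bestglm object:

best_cv_model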
## CV(K = 10, REP = 1)
## No BICq equivalent
## Best Model:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.334389 1.117091227 -4.775249 1.818258e-06
## workingdayYes -40.685205 0.736932772 -55.208842 0.000000e+00
## atemp 1.967543 0.042399253 46.405140 0.000000e+00
## registered 1.144745 0.002395396 477.894018 0.000000e+00
Display the statistical summary of the best model stored in best_cv_model
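As in Question 2, via the BestModel component:

summary(best_cv_model$BestModel)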
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.476 -20.471 -3.179 14.285 270.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.334389 1.117091 -4.775 1.82e-06 ***
## workingdayYes -40.685205 0.736933 -55.209 < 2e-16 ***
## atemp 1.967543 0.042399 46.405 < 2e-16 ***
## registered 1.144745 0.002395 477.894 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.58 on 10882 degrees of freedom
## Multiple R-squared: 0.9614, Adjusted R-squared: 0.9614
## F-statistic: 9.042e+04 on 3 and 10882 DF, p-value: < 2.2e-16
The model’s high R-squared value and significant predictors indicate a good fit, with meaningful relationships between the predictors and the outcome. However, the large effect of workingdayYes warrants careful interpretation to ensure it aligns with domain expectations and is not driven by multicollinearity with the other predictors. Furthermore, the broad range of residuals suggests that while the model is generally accurate, there may be outliers or variation not fully captured by the three retained predictors.
Question 6:
Perform backward stepwise selection using the stepAIC() function. What’s the best model?
Full model for count prediction
full_model <- lm(count ~ season + holiday + workingday +
                   weather + atemp + registered, data = bike.df)
Backward stepwise selection using stepAIC
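A sketch assuming the MASS package, which provides stepAIC():

library(MASS)
step_model <- stepAIC(full_model, direction = "backward")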
## Start: AIC=77462.43
## count ~ season + holiday + workingday + weather + atemp + registered
##
## Df Sum of Sq RSS AIC
## <none> 13376322 77462
## - holiday 1 39527 13415849 77493
## - weather 3 89177 13465499 77529
## - season 3 263668 13639990 77669
## - atemp 1 1765415 15141737 78810
## - workingday 1 3797752 17174074 80181
## - registered 1 274906725 288283047 110885
Display the statistical summary of the best model
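A one-line sketch:

summary(step_model)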
##
## Call:
## lm(formula = count ~ season + holiday + workingday + weather +
## atemp + registered, data = bike.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101.215 -20.295 -3.242 14.058 265.418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.742071 1.296078 -4.430 9.50e-06 ***
## seasonSummer -5.372699 1.192074 -4.507 6.64e-06 ***
## seasonFall -18.014488 1.448610 -12.436 < 2e-16 ***
## seasonWinter -8.639163 1.003388 -8.610 < 2e-16 ***
## holidayYes -11.827273 2.086374 -5.669 1.47e-08 ***
## workingdayYes -41.745729 0.751281 -55.566 < 2e-16 ***
## weatherMisty_cloudy -4.962998 0.780422 -6.359 2.11e-10 ***
## weatherLight_snow -8.562471 1.276496 -6.708 2.07e-11 ***
## weatherHeavy_rain 3.030360 35.080964 0.086 0.931
## atemp 2.475376 0.065339 37.885 < 2e-16 ***
## registered 1.141296 0.002414 472.758 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.07 on 10875 degrees of freedom
## Multiple R-squared: 0.9625, Adjusted R-squared: 0.9625
## F-statistic: 2.795e+04 on 10 and 10875 DF, p-value: < 2.2e-16
Backward stepwise selection retained every variable: removing any predictor increased the Akaike Information Criterion (AIC), so the full model with season, holiday, workingday, weather, atemp, and registered is the best model.