Logistic Regression

Analyze Motor Car Data Trends

You are a data scientist at a top dealership group in the USA. Your boss, Mr. Buffet, has asked you to analyze the Motor Trend car data. You are given a dataset containing fuel consumption and 10 aspects of automobile design and performance for 32 automobiles, stored in the file mtcars.xlsx.

Data Source: the mtcars dataset from the datasets package in R.

Data Dictionary

This dataset contains the following columns:

Variable	Data Type	Description	Constraints/Rules
`mpg`	Numeric	Miles per gallon	Positive values only (mpg > 0)
`cyl`	Integer	Number of cylinders	Categorical: {4, 6, 8}
`disp`	Numeric	Displacement (cubic inches)	Positive values only (disp > 0)
`hp`	Integer	Gross horsepower	Positive values only (hp > 0)
`drat`	Numeric	Rear axle ratio	Positive values only (drat > 0)
`wt`	Numeric	Weight (1000 lbs)	Positive values only (wt > 0)
`qsec`	Numeric	1/4 mile time (seconds)	Positive values only (qsec > 0)
`vs`	Integer	Engine type: 0 = V-shaped, 1 = straight	Binary: {0, 1}
`am`	Integer	Transmission: 0 = automatic, 1 = manual	Binary: {0, 1}
`gear`	Integer	Number of forward gears	Categorical: {3, 4, 5}
`carb`	Integer	Number of carburetors	Positive integer values only

Question 1:

Load the dataset mtcars.xlsx into memory and convert the column am to a factor using the factor() function.

Load the dataset mtcars.xlsx into memory

library(readxl)  # provides read_excel()
mtcars.df <- read_excel("data/mtcars.xlsx")

Display the dimension of the data frame

dim(mtcars.df)

## [1] 41 12

Display first six rows of the data frame

head(mtcars.df)

## # A tibble: 6 × 12
##   name           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## 4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
## 5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
## 6 Valiant       18.1     6   225   105  2.76  3.46  20.2     1     0     3     1

Convert column am to a factor using factor() function.

mtcars.df$am <- factor(mtcars.df$am,
                       levels = c(0, 1),
                       labels = c("automatic", "manual"))
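The effect of the levels and labels arguments can be checked on a small vector. The sketch below uses made-up 0/1 codes (am_codes is a hypothetical stand-in for the am column) and shows that the order given in levels fixes the order of the labels. This matters for glm() with family = binomial, which treats the first factor level as the "failure" category and models the probability of the other level.

```r
# Hypothetical 0/1 codes standing in for the am column
am_codes <- c(1, 1, 0, 0, 1)

am_factor <- factor(am_codes,
                    levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Level order follows `levels`, so "automatic" is the first (reference) level
print(levels(am_factor))

# Counts per label
print(table(am_factor))

# The internal integer codes are now 1 and 2, not the original 0 and 1
print(as.integer(am_factor))
```

With this ordering, a fitted probability above 0.5 corresponds to "manual".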

Display the structure of the data frame

str(mtcars.df)

## tibble [41 × 12] (S3: tbl_df/tbl/data.frame)
##  $ name: chr [1:41] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg : num [1:41] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num [1:41] 6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num [1:41] 160 160 108 258 360 ...
##  $ hp  : num [1:41] 110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num [1:41] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num [1:41] 2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num [1:41] 16.5 17 18.6 19.4 17 ...
##  $ vs  : num [1:41] 0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : Factor w/ 2 levels "automatic","manual": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: num [1:41] 4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num [1:41] 4 4 1 1 2 1 4 2 2 4 ...

Display the statistical summary of the data frame

summary(mtcars.df)

##      name                mpg             cyl             disp      
##  Length:41          Min.   :10.40   Min.   :4.000   Min.   : 71.1  
##  Class :character   1st Qu.:15.80   1st Qu.:4.000   1st Qu.:121.0  
##  Mode  :character   Median :19.70   Median :6.000   Median :167.6  
##                     Mean   :20.15   Mean   :6.098   Mean   :226.4  
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:318.0  
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0  
##        hp             drat             wt             qsec      
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50  
##  1st Qu.: 97.0   1st Qu.:3.080   1st Qu.:2.620   1st Qu.:16.90  
##  Median :110.0   Median :3.690   Median :3.215   Median :17.82  
##  Mean   :141.8   Mean   :3.579   Mean   :3.181   Mean   :17.91  
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.570   3rd Qu.:18.90  
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90  
##        vs                 am          gear            carb      
##  Min.   :0.0000   automatic:24   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   manual   :17   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000                  Median :4.000   Median :2.000  
##  Mean   :0.4634                  Mean   :3.659   Mean   :2.707  
##  3rd Qu.:1.0000                  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000                  Max.   :5.000   Max.   :8.000

Question 2:

Split the data into a training set and a test set. The training set contains the first 35 observations, and the test set contains the remaining observations.

Select the first 35 rows of the dataset to create training dataset and display the first 10 rows

train_dataset <- mtcars.df[1:35, ]

head(train_dataset, 10)

## # A tibble: 10 × 12
##    name          mpg   cyl  disp    hp  drat    wt  qsec    vs am     gear  carb
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0 manu…     4     4
##  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0 manu…     4     4
##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1 manu…     4     1
##  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1 auto…     3     1
##  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0 auto…     3     2
##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1 auto…     3     1
##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0 auto…     3     4
##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1 auto…     4     2
##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1 auto…     4     2
## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1 auto…     4     4

Use the remaining rows of the dataset to create the test dataset and display them

test_dataset <- mtcars.df[-(1:35), ]

head(test_dataset, 10)

## # A tibble: 6 × 12
##   name           mpg   cyl  disp    hp  drat    wt  qsec    vs am     gear  carb
##   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 Hornet 4 Dr…  21.4     6  258    110  3.08  3.22  19.4     1 auto…     3     1
## 2 Hornet Spor…  18.7     8  360    175  3.15  3.44  17.0     0 auto…     3     2
## 3 Valiant       18.1     6  225    105  2.76  3.46  20.2     1 auto…     3     1
## 4 Duster 360    14.3     8  360    245  3.21  3.57  15.8     0 auto…     3     4
## 5 Merc 240D     24.4     4  147.    62  3.69  3.19  20       1 auto…     4     2
## 6 Volvo 142E    21.4     4  121    109  4.11  2.78  18.6     1 manu…     4     2
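A split by row position can be sanity-checked in a few lines. The sketch below is only an illustration: it uses the built-in mtcars data frame (32 rows) instead of the workbook, and a hypothetical split point of 25 rows. It confirms that the two pieces partition the data and counts exact duplicate rows, since a row that appears in both sets would make the test error look better than it is.

```r
# Built-in mtcars (32 rows) stands in for the workbook used above
cars <- datasets::mtcars
n_train <- 25  # hypothetical split point, not the 35 used in the exercise

train_idx <- seq_len(n_train)
train <- cars[train_idx, ]
test  <- cars[-train_idx, ]   # negative index keeps everything else

# The two pieces should cover all rows without overlap
print(nrow(train) + nrow(test) == nrow(cars))
print(length(intersect(rownames(train), rownames(test))))

# Number of rows that exactly repeat an earlier row after stacking the sets
dup_rows <- sum(duplicated(rbind(train, test)))
print(dup_rows)
```

The same duplicated() check could be run on the stacked training and test sets of any split before trusting a test error.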

Question 3:

Build a logistic regression model with am as the response and mpg, cyl, hp, and wt as the predictors, using the glm() function.

Build a logistic regression model using glm() function

model.fit <- glm(am ~ mpg + cyl + hp + wt,
             data = train_dataset,
             family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Display the statistical summary of the model

summary(model.fit)

## 
## Call:
## glm(formula = am ~ mpg + cyl + hp + wt, family = binomial, data = train_dataset)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -38.75178   68.12958  -0.569    0.569  
## mpg           2.33124    2.94811   0.791    0.429  
## cyl           1.67965    1.55458   1.080    0.280  
## hp            0.09628    0.10450   0.921    0.357  
## wt          -10.46900    5.67417  -1.845    0.065 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48.2628  on 34  degrees of freedom
## Residual deviance:  8.9219  on 30  degrees of freedom
## AIC: 18.922
## 
## Number of Fisher Scoring iterations: 11
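Logistic regression coefficients are on the log-odds scale, so they are often easier to read after exponentiation. The sketch below is a minimal illustration on the built-in mtcars data with a smaller, assumed formula (am ~ wt + hp), not the model fitted above. It also checks whether any fitted probability is numerically 0 or 1, which is the situation the glm() warning refers to; the tolerance eps is an arbitrary choice.

```r
cars <- datasets::mtcars

# Smaller illustrative model; am is coded 0/1 in the built-in data
fit <- glm(am ~ wt + hp, data = cars, family = binomial)

# exp(coefficient) = multiplicative change in the odds of am = 1
# for a one-unit increase in the predictor, other predictors held fixed
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# Fitted probabilities on the training data
p_hat <- fitted(fit)
print(range(p_hat))

# Flag probabilities that are numerically 0 or 1 (eps is an arbitrary tolerance)
eps <- 1e-8
print(any(p_hat < eps | p_hat > 1 - eps))
```

Very large standard errors together with fitted probabilities at 0 or 1 usually point to perfect or near-perfect separation of the classes in the training data.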

Question 4:

Compute the test error on the test data set using a confusion matrix. Is it a good model based on test error?

test_predictions <- predict(model.fit, 
                            newdata = test_dataset, 
                            type = "response")

test_pred_class <- ifelse(test_predictions > 0.5, "manual", "automatic")

Create the confusion matrix

library(caret)  # provides confusionMatrix()
conf_matrix <- confusionMatrix(factor(test_pred_class), 
                               test_dataset$am)

## Warning in confusionMatrix.default(factor(test_pred_class), test_dataset$am):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
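This warning most likely appears because every prediction is "automatic", so factor() creates a factor with a single level while the reference has two. A minimal sketch of one way to avoid it, using made-up vectors (reference and predicted_chr are hypothetical), is to pass the reference levels explicitly when building the predicted factor:

```r
# Hypothetical reference classes and predictions
reference <- factor(c("automatic", "automatic", "manual"),
                    levels = c("automatic", "manual"))
predicted_chr <- c("automatic", "automatic", "automatic")

# factor() alone only creates the levels that actually occur
print(levels(factor(predicted_chr)))

# Supplying the reference levels keeps both classes, in the same order
predicted <- factor(predicted_chr, levels = levels(reference))
print(levels(predicted))

# The cross-tabulation now has a row for "manual" even though it is empty
tab <- table(Prediction = predicted, Reference = reference)
print(tab)
```

The resulting table has the same shape as a two-class confusion matrix, with an all-zero row for the class that was never predicted.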

Print the confusion matrix

print(conf_matrix)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  automatic manual
##   automatic         5      1
##   manual            0      0
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.3588, 0.9958)
##     No Information Rate : 0.8333          
##     P-Value [Acc > NIR] : 0.7368          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8333          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8333          
##          Detection Rate : 0.8333          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : automatic       
##

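The test error is 1/6 ≈ 16.67% (accuracy 83.33%), but this does not make it a good model. The accuracy equals the no-information rate of 83.33%, which is what a rule that always predicts “automatic” would achieve on this test set, and that is exactly what the model does here: it classifies all five automatic cars correctly and misses the only manual car. With “automatic” as the positive class, this gives a sensitivity of 1.00 and a specificity of 0.00, a Kappa of 0 (no agreement beyond chance), and a balanced accuracy of 0.50. The estimate is also very imprecise. With six test observations, the 95% confidence interval for accuracy runs from 0.36 to 1.00, and five of the six test rows match cars shown among the first ten training rows (Hornet 4 Drive through Merc 240D), so the test set is neither large nor clearly independent of the training set. The glm() warning about fitted probabilities of 0 or 1, together with the large standard errors in the model summary, further suggests that the training data are (nearly) perfectly separated, which makes the coefficient estimates unstable. Overall, the test error gives no evidence that the model can identify manual cars, and it should not be relied on for classification.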
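The main statistics in the caret output can be reproduced by hand from the four cell counts of the confusion matrix. The sketch below re-enters the counts printed above (5, 1, 0, 0) and computes accuracy, test error, sensitivity, specificity, balanced accuracy, and Cohen's Kappa in base R; the variable names are illustrative.

```r
# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(5, 0, 1, 0), nrow = 2,
             dimnames = list(Prediction = c("automatic", "manual"),
                             Reference  = c("automatic", "manual")))

n <- sum(cm)
accuracy   <- sum(diag(cm)) / n
test_error <- 1 - accuracy

# "automatic" is the positive class
sensitivity <- cm["automatic", "automatic"] / sum(cm[, "automatic"])
specificity <- cm["manual", "manual"] / sum(cm[, "manual"])
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's Kappa: agreement beyond what the row and column totals imply by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, test_error = test_error,
              sensitivity = sensitivity, specificity = specificity,
              balanced_accuracy = balanced_accuracy, kappa = kappa), 4))
```

Because the chance agreement (5/6) equals the observed accuracy, Kappa is exactly 0 even though the accuracy looks high.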

Bike Sharing System

You are working as a data scientist for the government of Washington, D.C. The city currently has a bike sharing system in which people can rent a bike at one location and return it at another. You are given historical usage patterns with weather data in the CSV file bike.csv, and you are asked to forecast bike rental demand in the Capital Bikeshare program.

Data Dictionary

Data Source: The data is from Kaggle at https://www.kaggle.com/c/bike-sharing-demand, and contains the following columns:

Variable	Data Type	Description	Constraints/Rules
`datetime`	Datetime	Hourly date and timestamp	Must follow a standard datetime format (`YYYY-MM-DD HH:MM:SS`)
`season`	Integer	Season of the year	Categorical: {1 = Spring, 2 = Summer, 3 = Fall, 4 = Winter}
`holiday`	Binary (Integer)	Whether the day is a holiday	{0 = No, 1 = Yes}
`workingday`	Binary (Integer)	Whether the day is a working day (not a weekend/holiday)	{0 = No, 1 = Yes}
`weather`	Integer	Weather condition	Categorical: {1 = Clear/Few Clouds, 2 = Mist/Cloudy, 3 = Light Snow/Rain, 4 = Heavy Rain/Snow/Fog}
`temp`	Numeric	Temperature in Celsius	Continuous values, typically within a range of [-10, 45]°C
`atemp`	Numeric	“Feels like” temperature in Celsius	Continuous values, typically within a range of [-10, 50]°C
`humidity`	Numeric	Relative humidity (%)	Ranges from 0 to 100
`windspeed`	Numeric	Wind speed (m/s)	Non-negative values (windspeed ≥ 0)
`casual`	Integer	Number of rentals by non-registered users	Non-negative integer (casual ≥ 0)
`registered`	Integer	Number of rentals by registered users	Non-negative integer (registered ≥ 0)
`count`	Integer	Total number of rentals (casual + registered)	Non-negative integer (count = casual + registered)

Question 1:

Build a linear model to forecast the total number of rentals (count) using the potential predictors season, holiday, workingday, weather, atemp, and registered.

Load the dataset

bike.df <- read.csv("data/Bike.csv")

Display first few rows

head(bike.df)

##              datetime season holiday workingday weather temp  atemp humidity
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395       81
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635       80
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635       80
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395       75
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395       75
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880       75
##   windspeed casual registered count
## 1    0.0000      3         13    16
## 2    0.0000      8         32    40
## 3    0.0000      5         27    32
## 4    0.0000      3         10    13
## 5    0.0000      0          1     1
## 6    6.0032      0          1     1

Display dimension of the dataframe

dim(bike.df)

## [1] 10886    12

Display column names

colnames(bike.df)

##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"

Convert datetime from character to POSIXct date-time format

bike.df$datetime <- as.POSIXct(bike.df$datetime, format = "%Y-%m-%d %H:%M:%S")

Convert the categorical variable season to a factor

bike.df$season <- factor(bike.df$season,
                        levels = c(1, 2, 3, 4),
                        labels = c("Spring", "Summer", "Fall", "Winter")
)

Convert the categorical variable holiday to a factor

bike.df$holiday <- factor(bike.df$holiday, 
                          levels = c(0,1), 
                          labels = c("No", "Yes")
)

Convert the categorical variable workingday to a factor

bike.df$workingday <- factor(bike.df$workingday,
                             levels = c(0,1), 
                             labels = c("No", "Yes")
)

Convert the categorical variable weather to a factor

bike.df$weather <- factor(bike.df$weather,
                          levels = c(1, 2, 3, 4),
                          labels = c("Clear", "Misty_cloudy",
                                     "Light_snow", "Heavy_rain")
)

Build linear model for count prediction

linear_model <- lm(count ~ season + holiday + workingday +
                     weather + atemp + registered,
                   data = bike.df)

Display the statistical summary of the model

summary(linear_model)

## 
## Call:
## lm(formula = count ~ season + holiday + workingday + weather + 
##     atemp + registered, data = bike.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.215  -20.295   -3.242   14.058  265.418 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.742071   1.296078  -4.430 9.50e-06 ***
## seasonSummer         -5.372699   1.192074  -4.507 6.64e-06 ***
## seasonFall          -18.014488   1.448610 -12.436  < 2e-16 ***
## seasonWinter         -8.639163   1.003388  -8.610  < 2e-16 ***
## holidayYes          -11.827273   2.086374  -5.669 1.47e-08 ***
## workingdayYes       -41.745729   0.751281 -55.566  < 2e-16 ***
## weatherMisty_cloudy  -4.962998   0.780422  -6.359 2.11e-10 ***
## weatherLight_snow    -8.562471   1.276496  -6.708 2.07e-11 ***
## weatherHeavy_rain     3.030360  35.080964   0.086    0.931    
## atemp                 2.475376   0.065339  37.885  < 2e-16 ***
## registered            1.141296   0.002414 472.758  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.07 on 10875 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:  0.9625 
## F-statistic: 2.795e+04 on 10 and 10875 DF,  p-value: < 2.2e-16

Generate predictions using the model and round the predictions

predictions <- predict(linear_model, bike.df)
rounded_predictions <- round(predictions)

Show first 15 rounded predictions

head(rounded_predictions, 15)

##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
##  45  65  59  41  31  22  28  28  38  44  70  70 110  99 127

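The model fits the data closely (R-squared = 0.9625), but most of that fit comes from registered: count is defined as casual + registered, so registered is a component of the response rather than an independent driver, and its coefficient (1.14, t ≈ 473) dominates the model. This also limits the model's use for forecasting, because registered is not known before the hour being predicted. Holding registered fixed, the remaining coefficients describe how the rest of the total varies: rentals are lower on holidays and working days, lower in summer, fall, and winter than in spring, lower in misty/cloudy and light snow/rain conditions than in clear weather, and higher at higher “feels like” temperatures. The heavy rain coefficient is not significant, but its standard error (35.08) is almost identical to the residual standard error (35.07), which is the pattern expected when a factor level has only one or very few observations, so this result says little about the effect of heavy rain.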
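The heavy rain result can be explored with a small simulation. The sketch below uses made-up data (x, w, and y are hypothetical) in which one factor level occurs exactly once. In that case the dummy variable fits its single observation perfectly, and the standard error of its coefficient is at least as large as the residual standard error, so the coefficient is almost never significant regardless of the true effect.

```r
set.seed(1)
n <- 200
x <- rnorm(n)

# Factor with a level "rare" that occurs exactly once (last observation)
w <- factor(c(rep("common", n - 1), "rare"))
y <- 2 + 3 * x + rnorm(n)

fit <- lm(y ~ x + w)

# Level counts: a quick check worth running on any categorical predictor
print(table(w))

# The single "rare" observation is fitted exactly, so its residual is ~0
rare_resid <- residuals(fit)[[n]]
print(rare_resid)

# Standard error of the "rare" dummy compared with the residual standard error
se_rare <- coef(summary(fit))["wrare", "Std. Error"]
print(c(se_rare = se_rare, sigma = sigma(fit)))
```

Running table() on the weather column of the bike data would show directly how many hours fall into the heavy rain category.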

Question 2:

Perform best subset selection using the bestglm() function based on BIC. What’s the best model based on BIC?

Prepare the data for bestglm(), which expects a data frame containing only the predictors and the response, with the response in the last column

model_data <- model.matrix(~ season + holiday + workingday + weather + 
                             atemp + registered + count, data = bike.df)

Remove the first column (the intercept) from model_data

model_data <- model_data[,-1]

Convert model_data to dataframe

model_data.df <- data.frame(model_data)
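The three preparation steps can be traced on a toy example. The sketch below uses a made-up four-row data frame (toy is hypothetical) to show that model.matrix() expands a factor into one dummy column per non-reference level, adds an intercept column first, and leaves the response last when it is the last term in the formula.

```r
# Hypothetical miniature version of the bike data
toy <- data.frame(
  season = factor(c("Spring", "Summer", "Fall", "Spring"),
                  levels = c("Spring", "Summer", "Fall")),
  atemp  = c(14.4, 25.0, 20.5, 12.9),
  count  = c(16, 120, 90, 13)
)

# Expand factors into dummy columns; count is listed last so it ends up last
X <- model.matrix(~ season + atemp + count, data = toy)
print(colnames(X))

# Drop the intercept column and convert to a data frame
Xy <- data.frame(X[, -1])
print(names(Xy))
print(Xy)
```

Because each dummy column is a separate variable in this layout, a subset search can keep some levels of a factor and drop others.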

Find the best model based on the BIC.

library(bestglm)  # provides bestglm()
best_bic_model <- bestglm(model_data.df, IC = "BIC", family = gaussian)

Display the statistical summary of the best model stored in best_bic_model

summary(best_bic_model$BestModel)

## 
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.216  -20.295   -3.242   14.057  265.418 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.740447   1.295882  -4.430 9.53e-06 ***
## seasonSummer         -5.373357   1.191995  -4.508 6.62e-06 ***
## seasonFall          -18.014895   1.448537 -12.437  < 2e-16 ***
## seasonWinter         -8.640114   1.003281  -8.612  < 2e-16 ***
## holidayYes          -11.827253   2.086279  -5.669 1.47e-08 ***
## workingdayYes       -41.745324   0.751232 -55.569  < 2e-16 ***
## weatherMisty_cloudy  -4.963428   0.780370  -6.360 2.09e-10 ***
## weatherLight_snow    -8.562900   1.276428  -6.708 2.06e-11 ***
## atemp                 2.475328   0.065334  37.888  < 2e-16 ***
## registered            1.141297   0.002414 472.785  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.07 on 10876 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:  0.9625 
## F-statistic: 3.106e+04 on 9 and 10876 DF,  p-value: < 2.2e-16

The best model based on BIC keeps nine of the ten candidate terms: seasonSummer, seasonFall, seasonWinter, holidayYes, workingdayYes, weatherMisty_cloudy, weatherLight_snow, atemp, and registered. Only weatherHeavy_rain is dropped. All retained terms are highly significant, and the fit is essentially unchanged from the full model (R-squared = 0.9625, residual standard error = 35.07).
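BIC trades goodness of fit against model size: it is -2 times the log-likelihood plus log(n) times the number of estimated parameters, and the model with the smallest value is preferred. The sketch below is a minimal illustration on the built-in mtcars data with two assumed formulas, not the bike models above; bic_by_hand is a hypothetical helper that reproduces what stats::BIC() returns.

```r
cars <- datasets::mtcars

# Two candidate linear models of different size (illustrative formulas)
fit_small <- lm(mpg ~ wt, data = cars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = cars)

# BIC = -2 * logLik + log(n) * (number of estimated parameters)
bic_by_hand <- function(fit) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + attr(ll, "df") * log(nobs(fit))
}

bic_values <- c(small = BIC(fit_small), large = BIC(fit_large))
print(bic_values)
print(c(small = bic_by_hand(fit_small), large = bic_by_hand(fit_large)))

# The preferred model is the one with the smaller BIC
print(names(which.min(bic_values)))
```

Note that for lm objects the parameter count in logLik() includes the error variance, so it is one more than the number of regression coefficients.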

Question 3:

Compute the test error of the best model based on BIC using LOOCV.

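Estimate the test error using LOOCV. The formula below keeps weather as a whole factor, so the fitted model is the BIC model plus the weatherHeavy_rain term.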

loocv_control <- trainControl(method = "LOOCV")
loocv_model <- train(
  count ~ season + holiday + workingday + weather + atemp + registered,
  data = bike.df,
  method = "lm",
  trControl = loocv_control
)

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

Print the loocv_model results

print(loocv_model)

## Linear Regression 
## 
## 10886 samples
##     6 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 10885, 10885, 10885, 10885, 10885, 10885, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   35.09127  0.9624692  23.88103
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Extract the root mean squared error (RMSE) from the LOOCV results

loocv_model$results$RMSE

## [1] 35.09127
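For a linear model, LOOCV does not require refitting the model once per observation: each leave-one-out residual equals the ordinary residual divided by 1 minus the leverage of that observation. The sketch below demonstrates the identity on the built-in mtcars data with an assumed formula (mpg ~ wt + hp) and compares it with an explicit refitting loop.

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

# Shortcut: leave-one-out residual = residual / (1 - leverage)
loo_resid <- residuals(fit) / (1 - hatvalues(fit))
loocv_rmse_shortcut <- sqrt(mean(loo_resid^2))

# Explicit LOOCV: refit without row i, then predict row i
n <- nrow(cars)
loo_err <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(mpg ~ wt + hp, data = cars[-i, ])
  loo_err[i] <- cars$mpg[i] - predict(fit_i, newdata = cars[i, ])
}
loocv_rmse_loop <- sqrt(mean(loo_err^2))

# Both routes give the same LOOCV RMSE up to floating-point error
print(c(shortcut = loocv_rmse_shortcut, loop = loocv_rmse_loop))
print(all.equal(loocv_rmse_shortcut, loocv_rmse_loop))
```

On a dataset with more than ten thousand rows, the shortcut needs a single fit instead of one fit per row.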

Question 4:

Calculate the test error of the best model based on BIC using 10-fold CV.

Estimate the test error using 10-fold CV, with the same formula as in Question 3

cv_control <- trainControl(method = "cv", number = 10)

cv_model <- train(
  count ~ season + holiday + workingday +
    weather + atemp + registered,
  data = bike.df,
  method = "lm",
  trControl = cv_control
)

## Warning in predict.lm(modelFit, newdata): prediction from rank-deficient fit;
## attr(*, "non-estim") has doubtful cases

Print the results of the cv_model

print(cv_model)

## Linear Regression 
## 
## 10886 samples
##     6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9797, 9796, 9796, 9798, 9798, 9798, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   35.06708  0.9625144  23.88112
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Extract the root mean squared error (RMSE) from the 10-fold CV results

cv_model$results$RMSE

## [1] 35.06708

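Note on Questions 3 and 4:
The RMSE of about 35 means that the typical prediction error, measured as the square root of the mean squared error, is about 35 rentals per hour. Because squaring weights large errors more heavily, the RMSE is larger than the mean absolute error of about 23.9 rentals. LOOCV (35.09) and 10-fold CV (35.07) give almost the same estimate, and both are close to the residual standard error of the fitted model (35.07), so there is little sign of overfitting. Lower RMSE values indicate better predictive performance.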
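K-fold cross-validation and the two error measures can also be written out by hand. The sketch below is an illustration on the built-in mtcars data with an assumed formula and k = 5 (both are arbitrary choices, not the settings used above). It pools the held-out errors from all folds before computing RMSE and MAE, which can differ slightly from averaging per-fold values.

```r
set.seed(123)
cars <- datasets::mtcars
k <- 5  # illustrative number of folds

# Randomly assign each row to one of k folds of nearly equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(cars)))

# Predict each row from a model fitted without that row's fold
pred <- numeric(nrow(cars))
for (f in seq_len(k)) {
  held_out <- fold_id == f
  fit_f <- lm(mpg ~ wt + hp, data = cars[!held_out, ])
  pred[held_out] <- predict(fit_f, newdata = cars[held_out, ])
}

# Pooled out-of-fold errors
err  <- cars$mpg - pred
rmse <- sqrt(mean(err^2))   # penalizes large errors more heavily
mae  <- mean(abs(err))      # average absolute deviation

print(c(RMSE = rmse, MAE = mae))
```

RMSE is never smaller than MAE on the same errors, and the gap grows when a few errors are much larger than the rest.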

Question 5:

Perform best subset selection using the bestglm() function based on CV. What’s the best model based on CV?

Best subset selection based on CV

best_cv_model <- bestglm(
  Xy = model_data.df,
  family = gaussian,
  IC = "CV",
  CVArgs = list(Method = "HTF", K = 10, REP = 1)
)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

Print the results of the best_cv_model

print(best_cv_model)

## CV(K = 10, REP = 1)
## No BICq equivalent
## Best Model:
##                 Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)    -5.334389 1.117091227  -4.775249 1.818258e-06
## workingdayYes -40.685205 0.736932772 -55.208842 0.000000e+00
## atemp           1.967543 0.042399253  46.405140 0.000000e+00
## registered      1.144745 0.002395396 477.894018 0.000000e+00

Display the statistical summary of the best model stored in best_cv_model

summary(best_cv_model$BestModel)

## 
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -105.476  -20.471   -3.179   14.285  270.954 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.334389   1.117091  -4.775 1.82e-06 ***
## workingdayYes -40.685205   0.736933 -55.209  < 2e-16 ***
## atemp           1.967543   0.042399  46.405  < 2e-16 ***
## registered      1.144745   0.002395 477.894  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.58 on 10882 degrees of freedom
## Multiple R-squared:  0.9614, Adjusted R-squared:  0.9614 
## F-statistic: 9.042e+04 on 3 and 10882 DF,  p-value: < 2.2e-16

The best model based on CV is much smaller than the BIC model: it keeps only workingdayYes, atemp, and registered, and drops the season, holiday, and weather terms. The loss in fit is small (R-squared = 0.9614 versus 0.9625, residual standard error 35.58 versus 35.07), so the simpler model predicts almost as well with three predictors instead of nine. The large workingdayYes coefficient (-40.69) should be interpreted conditionally on registered: for a given number of registered rentals, working days have fewer total rentals, presumably because casual rentals are lower. The residuals range from about -105 to 271, so some hours are still predicted poorly and may reflect outliers or effects that the current predictors do not capture.

Question 6:

Perform backward stepwise selection using the stepAIC() function. What’s the best model?

Full model for count prediction

full_model <- lm(count ~ season + holiday + workingday + 
                   weather + atemp + registered, data = bike.df)

Backward stepwise selection using stepAIC

library(MASS)  # provides stepAIC()
stepwise_model <- stepAIC(full_model, direction = "backward")

## Start:  AIC=77462.43
## count ~ season + holiday + workingday + weather + atemp + registered
## 
##              Df Sum of Sq       RSS    AIC
## <none>                     13376322  77462
## - holiday     1     39527  13415849  77493
## - weather     3     89177  13465499  77529
## - season      3    263668  13639990  77669
## - atemp       1   1765415  15141737  78810
## - workingday  1   3797752  17174074  80181
## - registered  1 274906725 288283047 110885

Display the statistical summary of the best model

summary(stepwise_model)

## 
## Call:
## lm(formula = count ~ season + holiday + workingday + weather + 
##     atemp + registered, data = bike.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -101.215  -20.295   -3.242   14.058  265.418 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.742071   1.296078  -4.430 9.50e-06 ***
## seasonSummer         -5.372699   1.192074  -4.507 6.64e-06 ***
## seasonFall          -18.014488   1.448610 -12.436  < 2e-16 ***
## seasonWinter         -8.639163   1.003388  -8.610  < 2e-16 ***
## holidayYes          -11.827273   2.086374  -5.669 1.47e-08 ***
## workingdayYes       -41.745729   0.751281 -55.566  < 2e-16 ***
## weatherMisty_cloudy  -4.962998   0.780422  -6.359 2.11e-10 ***
## weatherLight_snow    -8.562471   1.276496  -6.708 2.07e-11 ***
## weatherHeavy_rain     3.030360  35.080964   0.086    0.931    
## atemp                 2.475376   0.065339  37.885  < 2e-16 ***
## registered            1.141296   0.002414 472.758  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.07 on 10875 degrees of freedom
## Multiple R-squared:  0.9625, Adjusted R-squared:  0.9625 
## F-statistic: 2.795e+04 on 10 and 10875 DF,  p-value: < 2.2e-16

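Backward stepwise selection stops at the first step: removing any term increases the AIC above the starting value of 77462.43 (the smallest increase, from dropping holiday, gives 77493), so the best model is the full model with season, holiday, workingday, weather, atemp, and registered. Note that stepAIC() adds or drops a factor as a whole, so weather stays in the model with all of its levels, including the non-significant weatherHeavy_rain term that the BIC-based best subset search dropped.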
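The AIC values in the stepAIC() trace come from extractAIC(), which for a linear model computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below checks this formula on the built-in mtcars data with an assumed formula, then plugs in the values printed in the trace above (n = 10886, RSS = 13376322, and 11 coefficients including the intercept).

```r
cars <- datasets::mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
edf <- n - df.residual(fit)   # number of estimated coefficients

# AIC on the scale used in the stepwise trace
aic_by_hand <- n * log(rss / n) + 2 * edf
print(c(by_hand = aic_by_hand, extractAIC = extractAIC(fit)[2]))

# Same formula with the numbers from the bike model trace;
# RSS is rounded in the printout, so the result is approximate
bike_aic <- 10886 * log(13376322 / 10886) + 2 * 11
print(round(bike_aic, 2))
```

This scale differs from stats::AIC() by an additive constant that depends only on n, so both rank models fitted to the same data in the same order.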