Introduction

Linear regression is a fundamental statistical technique used to model relationships between a dependent variable and one or more independent variables. In this tutorial, we will explore how to perform linear regression in R using the day.csv dataset.

Objectives:

  • Understand and apply linear regression models
  • Interpret model results
  • Visualize relationships and residuals
  • Make predictions using trained models

0. Loading the Data

# Load the required libraries
library(readr)
library(dplyr)
library(gplots)

# Load the dataset
day <- read_csv("Bike Sharing Dataset/day.csv")

# View the dataset structure
head(day)
## # A tibble: 6 × 16
##   instant dteday     season    yr  mnth holiday weekday workingday weathersit
##     <dbl> <date>      <dbl> <dbl> <dbl>   <dbl>   <dbl>      <dbl>      <dbl>
## 1       1 2011-01-01      1     0     1       0       6          0          2
## 2       2 2011-01-02      1     0     1       0       0          0          2
## 3       3 2011-01-03      1     0     1       0       1          1          1
## 4       4 2011-01-04      1     0     1       0       2          1          1
## 5       5 2011-01-05      1     0     1       0       3          1          1
## 6       6 2011-01-06      1     0     1       0       4          1          1
## # ℹ 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>,
## #   casual <dbl>, registered <dbl>, cnt <dbl>
# Skim the dataset using the skimr package
skimr::skim(day)
Data summary
Name day
Number of rows 731
Number of columns 16
_______________________
Column type frequency:
Date 1
numeric 15
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
dteday 0 1 2011-01-01 2012-12-31 2012-01-01 731

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
instant 0 1 366.00 211.17 1.00 183.50 366.00 548.50 731.00 ▇▇▇▇▇
season 0 1 2.50 1.11 1.00 2.00 3.00 3.00 4.00 ▇▇▁▇▇
yr 0 1 0.50 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
mnth 0 1 6.52 3.45 1.00 4.00 7.00 10.00 12.00 ▇▅▅▅▇
holiday 0 1 0.03 0.17 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
weekday 0 1 3.00 2.00 0.00 1.00 3.00 5.00 6.00 ▇▃▃▃▇
workingday 0 1 0.68 0.47 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
weathersit 0 1 1.40 0.54 1.00 1.00 1.00 2.00 3.00 ▇▁▅▁▁
temp 0 1 0.50 0.18 0.06 0.34 0.50 0.66 0.86 ▂▇▇▇▅
atemp 0 1 0.47 0.16 0.08 0.34 0.49 0.61 0.84 ▂▇▆▇▂
hum 0 1 0.63 0.14 0.00 0.52 0.63 0.73 0.97 ▁▁▆▇▂
windspeed 0 1 0.19 0.08 0.02 0.13 0.18 0.23 0.51 ▃▇▅▁▁
casual 0 1 848.18 686.62 2.00 315.50 713.00 1096.00 3410.00 ▇▆▂▁▁
registered 0 1 3656.17 1560.26 20.00 2497.00 3662.00 4776.50 6946.00 ▂▅▇▅▃
cnt 0 1 4504.35 1937.21 22.00 3152.00 4548.00 5956.00 8714.00 ▂▅▇▅▃

1. Data Preprocessing

Standart data preprocessing steps

  1. Handle missing values & outliers
  2. Handle non-numeric variables
    1. categorical variables:
      • convert to factors or one-hot encode
      • ordinal encoding if applicable
    2. date/time variables:
      • extract features like month, day, etc.
      • convert to numeric if needed
    3. text variables:
      • convert to numeric using techniques like TF-IDF
      • remove stop words, punctuation, etc.
  3. Handle numerical variables:
    • scale/normalize if needed
    • transform if needed (e.g., log transformation)
    • convert to categorical if needed: binning, discretization, factorization (e.g., age groups, day of the week, months)

Data preprocessing is a big topic. In this tutorial, we will do a simple preprocessing step by converting categorical variables to factors.

# Convert necessary variables to factors
day$weathersit <- as.factor(day$weathersit)
day$workingday <- as.factor(day$workingday)
day$season <- as.factor(day$season)
day$mnth <- as.factor(day$mnth)

# Attach the dataset to easily access variables
attach(day)

2. Correlation Analysis

This section analyzes correlations between numeric variables and visualizes them using a heatmap. Correlation analysis helps identify relationships between variables, which can be useful for feature selection and model building.

# Select numeric variables for correlation analysis
day_numeric <- data.frame(temp, atemp, hum, windspeed, casual, registered)

# Generate a correlation matrix
cor_matrix <- cor(day_numeric)

# Visualize the correlation matrix with a heatmap
heatmap.2(cor_matrix, trace = "none", col = bluered(100), 
          main = "Correlation Matrix", key = TRUE, symkey = FALSE, 
          density.info = "none", cellnote = round(cor_matrix, 2), 
          notecol = "black", margins = c(8, 16), cexRow = 1, cexCol = 1)

Exercise: What can you infer from the correlation matrix about the relationship between temp and atemp?

3. Linear Regression Model

We build a linear regression model to predict the total bike rentals (cnt) using atemp, hum, and windspeed.

# Build the first linear regression model with numeric variables
model1 <- lm(cnt ~ atemp + hum + windspeed, data = day)

# Display model summary
summary(model1)
## 
## Call:
## lm(formula = cnt ~ atemp + hum + windspeed, data = day)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4890  -1043    -82   1072   4384 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3774.0      342.9  11.007  < 2e-16 ***
## atemp         7504.1      330.2  22.723  < 2e-16 ***
## hum          -3167.5      383.4  -8.261 6.84e-16 ***
## windspeed    -4411.7      709.8  -6.215 8.64e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1422 on 727 degrees of freedom
## Multiple R-squared:  0.4632, Adjusted R-squared:  0.461 
## F-statistic: 209.1 on 3 and 727 DF,  p-value: < 2.2e-16

Explanation

Call

The formula used to fit the model: the dependent variable cnt is being modeled as a function of atemp, hum, and windspeed.

Residuals

Residuals are the differences between the observed values and the predicted values. The summary provides statistics about the residuals, such as the minimum, maximum, and quartiles.

Coefficients

The coefficients represent the estimated effect of each independent variable on the dependent variable.

  • Estimate column shows the coefficient values
  • Pr(>|t|) column indicates the statistical significance

Interpretation:

  • For every unit increase in atemp, the number of bike rentals (cnt) increases by 7504.1, holding other variables constant.
  • For every unit increase in hum, the number of bike rentals decreases by 3167.5, holding other variables constant.
  • For every unit increase in windspeed, the number of bike rentals decreases by 4411.7, holding other variables constant.

Regression formula: cnt = 3774.0 + 7504.1 * atemp - 3167.5 * hum - 4411.7 * windspeed

Model Summary

  • The residual standard error (RSE) measures the typical size of the residuals (errors). Here, it is 1422, meaning that on average, the model’s predictions deviate from the actual values by about 1422 bike rentals.
  • Degrees of freedom: This indicates the number of observations minus the number of estimated parameters. In this case, it’s 727 because there are 731 observations and 4 estimated parameters (3 predictors + 1 intercept).
  • R-squared (0.4632): This indicates that 46.32% of the variance in cnt is explained by the model.
  • Adjusted R-squared (0.461): This value adjusts the R-squared for the number of predictors in the model. It penalizes the model for adding variables that don’t significantly improve its fit. Since the Adjusted R-squared is slightly lower than the R-squared, it suggests that the predictors are adding useful information.
  • F-statistic (209.1) and p-value (< 2.2e-16): overall significance of the model. The p-value is very low, indicating that the model is statistically significant.

4. Improving the Model

# Add categorical variables to the model
model2 <- lm(cnt ~ atemp + hum + windspeed + weathersit + workingday + season, data = day)

# Display updated model summary
summary(model2)
## 
## Call:
## lm(formula = cnt ~ atemp + hum + windspeed + weathersit + workingday + 
##     season, data = day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3881.0  -925.0  -255.7  1045.5  4216.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2844.0      355.8   7.993 5.21e-15 ***
## atemp         6595.5      528.1  12.489  < 2e-16 ***
## hum          -2643.0      464.1  -5.695 1.79e-08 ***
## windspeed    -2973.5      678.2  -4.384 1.34e-05 ***
## weathersit2   -229.1      128.4  -1.783  0.07494 .  
## weathersit3  -1899.8      328.7  -5.779 1.12e-08 ***
## workingday1    159.7      104.1   1.534  0.12537    
## season2        983.3      178.3   5.515 4.86e-08 ***
## season3        647.9      229.2   2.827  0.00484 ** 
## season4       1504.1      153.3   9.809  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1302 on 721 degrees of freedom
## Multiple R-squared:  0.5537, Adjusted R-squared:  0.5482 
## F-statistic: 99.41 on 9 and 721 DF,  p-value: < 2.2e-16

Explanation

New Variables:

  • weathersit3: Bad weather (weathersit3) has a significant negative effect on bike rentals (-1899.8).
  • workingday: Positive effect (159.7) but not statistically significant.

weathersit1 is not included in the model to avoid multicollinearity.It is considered the reference category, so the coefficients for weathersit2 and weathersit3 represent the difference from weathersit1.

5. Diagnostics and Residual Analysis

# Diagnostic plots for model
plot(model2)

# Calculate residuals
residuals <- resid(model2)

# Plot the density of residuals
plot(density(residuals))

# Create a PP plot for residuals
standardized_residuals <- scale(residuals)
prob_dist <- pnorm(standardized_residuals)

plot(ppoints(length(standardized_residuals)), sort(prob_dist), 
     main = 'PP Plot', xlab = 'Observed Probability', ylab = 'Expected Probability')
abline(0, 1, col = "red")

Explanation

We use diagnostic plots to check for potential issues in the model, such as non-normality of residuals or heteroscedasticity. Residuals should ideally follow a normal distribution with a mean of zero.

6. Making Predictions

# Predicting bike rentals for a new record
new_record <- data.frame(season = 1, yr = 0, mnth = 1, holiday = 0,
                         weekday = 6, workingday = 0, weathersit = 2,
                         temp = 0.344, atemp = 0.363, hum = 0.80, windspeed = 0.16)

# Convert categorical variables to factors
new_record$weathersit <- as.factor(new_record$weathersit)
new_record$workingday <- as.factor(new_record$workingday)
new_record$season <- as.factor(new_record$season)
new_record$mnth <- as.factor(new_record$mnth)

# Use the model to make a prediction
predicted_cnt <- predict(model2, newdata = new_record)

# Print the predicted value
predicted_cnt
##        1 
## 2418.928

Explanation

Here, we use the model 2 to predict the number of bike rentals (cnt) for a hypothetical day based on the provided values for weather conditions, season, etc.

7. Log-Transform Model

# Build a model with a log-transformed dependent variable
model_log <- lm(log(cnt) ~ atemp + hum + windspeed + weathersit + workingday + season, data = day)

# Display model summary
summary(model_log)
## 
## Call:
## lm(formula = log(cnt) ~ atemp + hum + windspeed + weathersit + 
##     workingday + season, data = day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1487 -0.2126 -0.0035  0.2628  1.2551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.75155    0.10344  74.934  < 2e-16 ***
## atemp        1.89226    0.15354  12.324  < 2e-16 ***
## hum         -0.71403    0.13493  -5.292 1.61e-07 ***
## windspeed   -0.87146    0.19720  -4.419 1.14e-05 ***
## weathersit2 -0.05056    0.03734  -1.354  0.17615    
## weathersit3 -0.94900    0.09558  -9.929  < 2e-16 ***
## workingday1  0.06677    0.03027   2.206  0.02769 *  
## season2      0.31770    0.05184   6.129 1.46e-09 ***
## season3      0.19629    0.06665   2.945  0.00333 ** 
## season4      0.47950    0.04458  10.755  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3786 on 721 degrees of freedom
## Multiple R-squared:  0.584,  Adjusted R-squared:  0.5788 
## F-statistic: 112.5 on 9 and 721 DF,  p-value: < 2.2e-16
# Diagnostic plots for log-transformed model
plot(model_log)

Explanation

Significance of Predictors:

  • Model 2: workingday1 was not significant.
  • Log-Transform Model: workingday1 becomes significant (p = 0.02769), showing its importance in the log-transformed data.

R-squared:

  • Model 2: R-squared = 0.5537, Adjusted R-squared = 0.5482.
  • Log-Transform Model: R-squared = 0.584, Adjusted R-squared = 0.5788, showing improved variance explanation.

Conclusion

In this tutorial, you learned how to:

  • Perform correlation analysis
  • Build and interpret linear regression models
  • Diagnose model issues using residual analysis
  • Make predictions with a linear regression model

Continue practicing by exploring other machine learning models in R, and remember to evaluate your models critically based on the diagnostics.