Linear regression is a fundamental statistical technique used to
model relationships between a dependent variable and one or more
independent variables. In this tutorial, we will explore how to perform
linear regression in R using the `day.csv` file from the Bike Sharing Dataset.
# Load the required libraries
library(readr)
library(dplyr)
library(gplots)
# Load the dataset
day <- read_csv("Bike Sharing Dataset/day.csv")
# View the dataset structure
head(day)
## # A tibble: 6 × 16
## instant dteday season yr mnth holiday weekday workingday weathersit
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## 6 6 2011-01-06 1 0 1 0 4 1 1
## # ℹ 7 more variables: temp <dbl>, atemp <dbl>, hum <dbl>, windspeed <dbl>,
## # casual <dbl>, registered <dbl>, cnt <dbl>
# Skim the dataset using the skimr package
skimr::skim(day)
|                                 |      |
|---------------------------------|------|
| Name                            | day  |
| Number of rows                  | 731  |
| Number of columns               | 16   |
| Column type frequency: Date     | 1    |
| Column type frequency: numeric  | 15   |
| Group variables                 | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
dteday | 0 | 1 | 2011-01-01 | 2012-12-31 | 2012-01-01 | 731 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
instant | 0 | 1 | 366.00 | 211.17 | 1.00 | 183.50 | 366.00 | 548.50 | 731.00 | ▇▇▇▇▇ |
season | 0 | 1 | 2.50 | 1.11 | 1.00 | 2.00 | 3.00 | 3.00 | 4.00 | ▇▇▁▇▇ |
yr | 0 | 1 | 0.50 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
mnth | 0 | 1 | 6.52 | 3.45 | 1.00 | 4.00 | 7.00 | 10.00 | 12.00 | ▇▅▅▅▇ |
holiday | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
weekday | 0 | 1 | 3.00 | 2.00 | 0.00 | 1.00 | 3.00 | 5.00 | 6.00 | ▇▃▃▃▇ |
workingday | 0 | 1 | 0.68 | 0.47 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
weathersit | 0 | 1 | 1.40 | 0.54 | 1.00 | 1.00 | 1.00 | 2.00 | 3.00 | ▇▁▅▁▁ |
temp | 0 | 1 | 0.50 | 0.18 | 0.06 | 0.34 | 0.50 | 0.66 | 0.86 | ▂▇▇▇▅ |
atemp | 0 | 1 | 0.47 | 0.16 | 0.08 | 0.34 | 0.49 | 0.61 | 0.84 | ▂▇▆▇▂ |
hum | 0 | 1 | 0.63 | 0.14 | 0.00 | 0.52 | 0.63 | 0.73 | 0.97 | ▁▁▆▇▂ |
windspeed | 0 | 1 | 0.19 | 0.08 | 0.02 | 0.13 | 0.18 | 0.23 | 0.51 | ▃▇▅▁▁ |
casual | 0 | 1 | 848.18 | 686.62 | 2.00 | 315.50 | 713.00 | 1096.00 | 3410.00 | ▇▆▂▁▁ |
registered | 0 | 1 | 3656.17 | 1560.26 | 20.00 | 2497.00 | 3662.00 | 4776.50 | 6946.00 | ▂▅▇▅▃ |
cnt | 0 | 1 | 4504.35 | 1937.21 | 22.00 | 3152.00 | 4548.00 | 5956.00 | 8714.00 | ▂▅▇▅▃ |
Data preprocessing is a broad topic in its own right. In this tutorial, we perform a single, simple step: converting the categorical variables to factors so that R treats them as discrete categories rather than as numbers.
# Convert necessary variables to factors
day$weathersit <- as.factor(day$weathersit)
day$workingday <- as.factor(day$workingday)
day$season <- as.factor(day$season)
day$mnth <- as.factor(day$mnth)
# Attach the dataset to easily access variables
attach(day)
This section analyzes correlations between numeric variables and visualizes them using a heatmap. Correlation analysis helps identify relationships between variables, which can be useful for feature selection and model building.
# Select numeric variables for correlation analysis
day_numeric <- data.frame(temp, atemp, hum, windspeed, casual, registered)
# Generate a correlation matrix
cor_matrix <- cor(day_numeric)
# Visualize the correlation matrix with a heatmap
heatmap.2(cor_matrix, trace = "none", col = bluered(100),
main = "Correlation Matrix", key = TRUE, symkey = FALSE,
density.info = "none", cellnote = round(cor_matrix, 2),
notecol = "black", margins = c(8, 16), cexRow = 1, cexCol = 1)
Exercise: What can you infer from the correlation matrix about the relationship between `temp` and `atemp`?
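As a hint, you can compute the correlation directly (this snippet is an addition, not the exercise solution):
# Pearson correlation between actual and feels-like temperature
cor(day$temp, day$atemp)
A value close to 1 would mean the two temperature measures carry nearly the same information, which matters when deciding which of them to include as a predictor.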
We build a linear regression model to predict the total bike rentals (`cnt`) using `atemp`, `hum`, and `windspeed`.
# Build the first linear regression model with numeric variables
model1 <- lm(cnt ~ atemp + hum + windspeed, data = day)
# Display model summary
summary(model1)
##
## Call:
## lm(formula = cnt ~ atemp + hum + windspeed, data = day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4890 -1043 -82 1072 4384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3774.0 342.9 11.007 < 2e-16 ***
## atemp 7504.1 330.2 22.723 < 2e-16 ***
## hum -3167.5 383.4 -8.261 6.84e-16 ***
## windspeed -4411.7 709.8 -6.215 8.64e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1422 on 727 degrees of freedom
## Multiple R-squared: 0.4632, Adjusted R-squared: 0.461
## F-statistic: 209.1 on 3 and 727 DF, p-value: < 2.2e-16
The formula used to fit the model: the dependent variable `cnt` is being modeled as a function of `atemp`, `hum`, and `windspeed`.
Residuals are the differences between the observed values and the predicted values. The summary provides statistics about the residuals, such as the minimum, maximum, and quartiles.
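You can confirm this definition directly from the fitted model (a small check added here, not part of the original tutorial):
# Residuals equal the observed counts minus the fitted values
head(cbind(residuals = resid(model1), observed_minus_fitted = day$cnt - fitted(model1)))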
The coefficients represent the estimated effect of each independent variable on the dependent variable.
- The `Estimate` column shows the coefficient values.
- The `Pr(>|t|)` column indicates the statistical significance.

Interpretation:

- For each one-unit increase in `atemp`, the number of bike rentals (`cnt`) increases by 7504.1, holding other variables constant.
- For each one-unit increase in `hum`, the number of bike rentals decreases by 3167.5, holding other variables constant.
- For each one-unit increase in `windspeed`, the number of bike rentals decreases by 4411.7, holding other variables constant.

Regression formula:

cnt = 3774.0 + 7504.1 * atemp - 3167.5 * hum - 4411.7 * windspeed

The Multiple R-squared of 0.4632 means that about 46% of the variance in `cnt` is explained by the model.
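As a quick sanity check (this snippet is an addition to the original write-up), you can rebuild the fitted values directly from the estimated coefficients and confirm they match what `lm()` stored in the model object:
# Extract the estimated coefficients and reconstruct the fitted values by hand
b <- coef(model1)
manual_fit <- b["(Intercept)"] + b["atemp"] * day$atemp +
  b["hum"] * day$hum + b["windspeed"] * day$windspeed
# These should agree with the fitted values stored in model1
all.equal(unname(manual_fit), unname(fitted(model1)))
With the numeric-only model in place, we next add the categorical predictors and check whether they improve the fit.
# Add categorical variables to the model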
model2 <- lm(cnt ~ atemp + hum + windspeed + weathersit + workingday + season, data = day)
# Display updated model summary
summary(model2)
##
## Call:
## lm(formula = cnt ~ atemp + hum + windspeed + weathersit + workingday +
## season, data = day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3881.0 -925.0 -255.7 1045.5 4216.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2844.0 355.8 7.993 5.21e-15 ***
## atemp 6595.5 528.1 12.489 < 2e-16 ***
## hum -2643.0 464.1 -5.695 1.79e-08 ***
## windspeed -2973.5 678.2 -4.384 1.34e-05 ***
## weathersit2 -229.1 128.4 -1.783 0.07494 .
## weathersit3 -1899.8 328.7 -5.779 1.12e-08 ***
## workingday1 159.7 104.1 1.534 0.12537
## season2 983.3 178.3 5.515 4.86e-08 ***
## season3 647.9 229.2 2.827 0.00484 **
## season4 1504.1 153.3 9.809 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1302 on 721 degrees of freedom
## Multiple R-squared: 0.5537, Adjusted R-squared: 0.5482
## F-statistic: 99.41 on 9 and 721 DF, p-value: < 2.2e-16
`weathersit1` is not included in the model because it serves as the reference (baseline) category: the coefficients for `weathersit2` and `weathersit3` represent the expected difference in rentals relative to `weathersit1`, holding the other predictors constant. Leaving one level of each factor out of the model also avoids perfect multicollinearity among the dummy variables.
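To see the dummy coding R applies behind the scenes, you can inspect the design matrix it builds for the factor (an illustrative sketch added here, not part of the original tutorial):
# Each non-reference level of weathersit becomes a 0/1 indicator column;
# level 1 is absorbed into the intercept as the baseline
head(model.matrix(~ weathersit, data = day))
# To use a different baseline, relevel the factor before refitting, e.g.:
# day$weathersit <- relevel(day$weathersit, ref = "2")
The same logic applies to `workingday` and `season`.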
# Diagnostic plots for model
plot(model2)
# Calculate residuals
residuals <- resid(model2)
# Plot the density of residuals
plot(density(residuals))
# Create a PP plot for residuals
standardized_residuals <- scale(residuals)
prob_dist <- pnorm(standardized_residuals)
plot(ppoints(length(standardized_residuals)), sort(prob_dist),
main = 'PP Plot', xlab = 'Observed Probability', ylab = 'Expected Probability')
abline(0, 1, col = "red")
We use diagnostic plots to check for potential issues in the model, such as non-normality of residuals or heteroscedasticity. Residuals should ideally follow a normal distribution with a mean of zero.
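The plots can be complemented with simple numerical checks. The choices below are optional additions (not part of the original tutorial) and use only base R:
# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
shapiro.test(resid(model2))
# Residuals vs. fitted values: the spread should look roughly constant (homoscedasticity)
plot(fitted(model2), resid(model2),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")
Note that plot(model2) above already produces a residuals-vs-fitted panel; the explicit version simply makes the check transparent.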
# Predicting bike rentals for a new record
new_record <- data.frame(season = 1, yr = 0, mnth = 1, holiday = 0,
weekday = 6, workingday = 0, weathersit = 2,
temp = 0.344, atemp = 0.363, hum = 0.80, windspeed = 0.16)
# Convert categorical variables to factors
new_record$weathersit <- as.factor(new_record$weathersit)
new_record$workingday <- as.factor(new_record$workingday)
new_record$season <- as.factor(new_record$season)
new_record$mnth <- as.factor(new_record$mnth)
# Use the model to make a prediction
predicted_cnt <- predict(model2, newdata = new_record)
# Print the predicted value
predicted_cnt
## 1
## 2418.928
Here, we use `model2` to predict the number of bike rentals (`cnt`) for a hypothetical day based on the provided values for weather conditions, season, and so on.
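Beyond the point estimate, `predict()` can also return an interval. For example, a 95% prediction interval for the same hypothetical day (an optional extension, not part of the original tutorial):
# Point estimate together with the bounds of a 95% prediction interval
predict(model2, newdata = new_record, interval = "prediction", level = 0.95)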
# Build a model with a log-transformed dependent variable
model_log <- lm(log(cnt) ~ atemp + hum + windspeed + weathersit + workingday + season, data = day)
# Display model summary
summary(model_log)
##
## Call:
## lm(formula = log(cnt) ~ atemp + hum + windspeed + weathersit +
## workingday + season, data = day)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1487 -0.2126 -0.0035 0.2628 1.2551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.75155 0.10344 74.934 < 2e-16 ***
## atemp 1.89226 0.15354 12.324 < 2e-16 ***
## hum -0.71403 0.13493 -5.292 1.61e-07 ***
## windspeed -0.87146 0.19720 -4.419 1.14e-05 ***
## weathersit2 -0.05056 0.03734 -1.354 0.17615
## weathersit3 -0.94900 0.09558 -9.929 < 2e-16 ***
## workingday1 0.06677 0.03027 2.206 0.02769 *
## season2 0.31770 0.05184 6.129 1.46e-09 ***
## season3 0.19629 0.06665 2.945 0.00333 **
## season4 0.47950 0.04458 10.755 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3786 on 721 degrees of freedom
## Multiple R-squared: 0.584, Adjusted R-squared: 0.5788
## F-statistic: 112.5 on 9 and 721 DF, p-value: < 2.2e-16
# Diagnostic plots for log-transformed model
plot(model_log)
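Because `model_log` predicts `log(cnt)`, its output must be back-transformed with `exp()` before it can be compared with predictions from `model2`. A minimal sketch (added here; note that exponentiating the predicted mean of the logs approximates the conditional median of `cnt` rather than its mean):
# Predict on the log scale, then back-transform to the original count scale
log_pred <- predict(model_log, newdata = new_record)
exp(log_pred)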
In this tutorial, you learned how to:

- Load and explore the Bike Sharing `day.csv` dataset with `readr` and `skimr`
- Convert categorical variables to factors
- Examine correlations between numeric variables and visualize them with a heatmap
- Build and interpret linear regression models with numeric and categorical predictors
- Check model assumptions with diagnostic plots
- Use a fitted model to predict bike rentals for a new record
- Fit a model with a log-transformed dependent variable

Continue practicing by exploring other machine learning models in R, and remember to evaluate your models critically based on the diagnostics.