library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.1
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
day_data <- read_csv("day.csv")
## Rows: 731 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
## date (1): dteday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
For the linear regression, I would use cnt, the total count of rental bikes (casual and registered combined), as the response variable. cnt represents the total demand for bikes, so it can show how the other variables in the data set may affect that demand.
lm_model <- lm(day_data$cnt ~ temp, data = day_data)
lm_model
##
## Call:
## lm(formula = day_data$cnt ~ temp, data = day_data)
##
## Coefficients:
## (Intercept) temp
## 1215 6641
plot(day_data$temp, day_data$cnt, xlab = "Normalized Temperature (Celsius)", ylab = "Count of Rental Bikes",
main = "Scatter Plot of Temperature vs. Rental Bike Count", col = as.numeric(day_data$season), pch = 16)
fit <- lm(cnt ~ temp, data = day_data)
abline(fit, col = "red")
legend("topleft", legend = levels(factor(day_data$season)), col = 1:4, pch = 16, title = "Season")
summary(lm_model)
##
## Call:
## lm(formula = day_data$cnt ~ temp, data = day_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4615.3 -1134.9 -104.4 1044.3 3737.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1214.6 161.2 7.537 1.43e-13 ***
## temp 6640.7 305.2 21.759 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared: 0.3937, Adjusted R-squared: 0.3929
## F-statistic: 473.5 on 1 and 729 DF, p-value: < 2.2e-16
The model is statistically significant: the F-test p-value is less than 2.2e-16, far below the usual 0.05 threshold, so we reject the null hypothesis that the slope on temp is zero.
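As a rough check on where that p-value comes from, the sketch below recomputes it from the F-statistic and degrees of freedom printed in the summary output. The numbers are copied by hand from that output, and the variable names are only illustrative.

```r
# Values copied from the summary(lm_model) output above
f_stat <- 473.5   # F-statistic
df1    <- 1       # numerator degrees of freedom
df2    <- 729     # denominator degrees of freedom
t_temp <- 21.759  # t value for the temp coefficient

# The p-value of the overall F-test is the upper-tail probability
# of the F distribution at the observed statistic
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(p_value)

# R prints "< 2.2e-16" when a p-value falls below machine epsilon
print(p_value < .Machine$double.eps)

# With a single predictor, the squared t value for the slope
# matches the F-statistic up to rounding
print(t_temp^2)
```

Because the model has only one predictor, the F-test and the t-test on temp are testing the same hypothesis, which is why both report the same level of significance.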
correlation_coef <- cor(day_data$temp, day_data$cnt)
cat("Correlation coefficient between temperature and total rental bikes:", correlation_coef)
## Correlation coefficient between temperature and total rental bikes: 0.627494
r_squared <- summary(lm_model)$r.squared
cat("R-squared value of the linear model:", r_squared)
## R-squared value of the linear model: 0.3937487
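The correlation coefficient of about 0.63 indicates a moderate positive linear relationship between temperature and rental count. The R-squared of about 0.39 means temperature explains roughly 39% of the variation in cnt, so the remaining variation is likely driven by one or more other variables.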
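In a simple linear regression with one predictor, R-squared equals the square of the correlation coefficient. A minimal check using the two values printed above (copied by hand, so they agree only up to rounding):

```r
# Values copied from the output above
r_reported  <- 0.627494    # cor(day_data$temp, day_data$cnt)
r2_reported <- 0.3937487   # summary(lm_model)$r.squared

# Squaring the correlation should reproduce R-squared
r_squared_from_cor <- r_reported^2
print(r_squared_from_cor)

# Compare with a tolerance that allows for rounding in the printed values
print(isTRUE(all.equal(r_squared_from_cor, r2_reported, tolerance = 1e-6)))
```

This identity holds only for a model with a single predictor; with several predictors, R-squared is no longer the square of any one pairwise correlation.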
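# Residuals vs. predictor: look for curvature or a funnel shape
model_resid <- resid(lm_model)  # renamed so the variable does not shadow resid()
plot(day_data$temp, model_resid, xlab = "Normalized Temperature", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero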
Based on the residual plot, I judge the relationship to be approximately linear and the residuals to be independent, approximately normally distributed, and of constant variance, with no clear outliers. Multicollinearity is not a concern here because the model has only one predictor. Therefore I consider the linear regression valid.
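The assumptions listed above could be checked more directly than with a single residual plot. The sketch below is one possible way to do that with base R only. `check_assumptions` is a hypothetical helper, not part of the original analysis, and the built-in `cars` data set stands in for `day_data` so the block runs on its own. In this analysis the call would be `check_assumptions(lm_model)`.

```r
# Hypothetical helper: basic diagnostics for a fitted lm object
check_assumptions <- function(model) {
  r <- residuals(model)

  plot(model, which = 1)  # residuals vs fitted: linearity, constant variance
  plot(model, which = 2)  # normal Q-Q plot: normality of residuals

  list(
    # Shapiro-Wilk test; a small p-value suggests non-normal residuals
    shapiro_p = shapiro.test(r)$p.value,
    # Lag-1 autocorrelation of residuals; only meaningful when rows are
    # in time order. Values far from 0 suggest the residuals are dependent
    lag1_autocorr = cor(head(r, -1), tail(r, -1)),
    # Largest absolute standardized residual; values well above 3
    # point to possible outliers
    max_abs_std_resid = max(abs(rstandard(model)))
  )
}

# Stand-in example on a built-in data set
demo_model <- lm(dist ~ speed, data = cars)
demo_checks <- check_assumptions(demo_model)
print(demo_checks)
```

The thresholds mentioned in the comments are common rules of thumb, not hard cut-offs, so the plots should still be inspected alongside the numbers.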