Read in the CSV file

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.1
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
day_data <- read_csv("day.csv")
## Rows: 731 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (15): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
## date  (1): dteday
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4

For the linear regression, I would use cnt, the total count of rental bikes (both casual and registered), as the response variable. cnt represents the total demand for bikes, so it can be used to show how the other variables in the data set may affect that demand.

5

lm_model <- lm(day_data$cnt ~ temp, data = day_data)
lm_model
## 
## Call:
## lm(formula = day_data$cnt ~ temp, data = day_data)
## 
## Coefficients:
## (Intercept)         temp  
##        1215         6641

6

# Scatter plot of cnt against temp, with points colored by season code (1-4)
plot(day_data$temp, day_data$cnt, xlab = "Normalized Temperature (Celsius)", ylab = "Count of Rental Bikes",
     main = "Scatter Plot of Temperature vs. Rental Bike Count", col = as.numeric(day_data$season), pch = 16)
# Overlay the fitted regression line from the model in question 5
abline(lm_model, col = "red")
legend("topleft", legend = levels(factor(day_data$season)), col = 1:4, pch = 16, title = "Season")

7

summary(lm_model)
## 
## Call:
## lm(formula = day_data$cnt ~ temp, data = day_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4615.3 -1134.9  -104.4  1044.3  3737.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1214.6      161.2   7.537 1.43e-13 ***
## temp          6640.7      305.2  21.759  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1509 on 729 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3929 
## F-statistic: 473.5 on 1 and 729 DF,  p-value: < 2.2e-16

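The model is statistically significant: the p-value of the F-test is less than 2.2e-16, far below the usual 0.05 threshold, and the coefficient on temp is likewise significant (t = 21.759, p < 2e-16).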
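As a quick sanity check on that p-value, the F statistic and degrees of freedom printed in the summary can be passed to `pf()`. This is only a sketch that reuses the printed values (473.5 on 1 and 729 DF) rather than the fitted object, so it is accurate only to the rounding of that output.

```r
# Values copied from the summary() output above (rounded as printed)
f_stat <- 473.5
df1 <- 1
df2 <- 729

# Upper-tail probability of the F distribution = p-value of the overall F-test
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(p_value < 2.2e-16)
```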

8

correlation_coef <- cor(day_data$temp, day_data$cnt)
cat("Correlation coefficient between temperature and total rental bikes:", correlation_coef)
## Correlation coefficient between temperature and total rental bikes: 0.627494

9

r_squared <- summary(lm_model)$r.squared
cat("R-squared value of the linear model:", r_squared)
## R-squared value of the linear model: 0.3937487

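This means that temperature explains about 39% of the variation in cnt. There is a linear relationship between the two variables, but the remaining 61% or so of the variation is unexplained, so the relationship between them may be affected by one or more other variables.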
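For a simple linear regression with one predictor, R-squared equals the square of the correlation coefficient, which ties questions 8 and 9 together. The sketch below checks this first with the values reported above, then on simulated data; `x`, `y`, and `sim_fit` are hypothetical names and are not part of the bike data.

```r
# Reported correlation from question 8; its square should match R-squared
r_reported <- 0.627494
print(r_reported^2)

# Simulated data (hypothetical, for illustration only)
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)
sim_fit <- lm(y ~ x)

# cor()^2 and the model's R-squared agree up to floating-point tolerance
same <- isTRUE(all.equal(cor(x, y)^2, summary(sim_fit)$r.squared))
print(same)
```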

10

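# Store residuals under a name that does not shadow the resid() function
model_resid <- resid(lm_model)
plot(day_data$temp, model_resid, xlab = "Normalized Temperature", ylab = "Residuals")
abline(h = 0, lty = 2)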

11

The relationship is approximately linear; the residuals are independent, approximately normally distributed, and have constant variance; and there are no outliers. Multicollinearity is not a concern because the model has only one predictor. Therefore the linear regression is valid.
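These assumptions can also be checked numerically rather than only by eye. The following is a minimal sketch, not a full diagnostic workflow: `check_residuals()` is a hypothetical helper, and it is demonstrated on R's built-in `cars` data so that it runs on its own. In this report it could be called as `check_residuals(lm_model)`.

```r
# Hypothetical helper: a few numeric residual diagnostics for an lm fit
check_residuals <- function(model) {
  res <- residuals(model)
  list(
    # Shapiro-Wilk test: a small p-value suggests non-normal residuals
    shapiro_p = shapiro.test(res)$p.value,
    # Lag-1 autocorrelation: values far from 0 suggest dependence over time
    lag1_autocorr = acf(res, plot = FALSE)$acf[2],
    # Largest standardized residual: values well above 3 flag possible outliers
    max_abs_std_resid = max(abs(rstandard(model)))
  )
}

# Demonstration on the built-in cars data set (not the bike data)
demo_fit <- lm(dist ~ speed, data = cars)
checks <- check_residuals(demo_fit)
print(checks)
```

Because the bike data are daily observations in date order, the lag-1 autocorrelation is probably the most informative of the three checks for the independence assumption.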