Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I will use a dataset from an old project where I completed a linear regression ( my research question was: Does the temperature predict the quantity of daily bike rentals?) and I will add more variables to it to complete the discussion board.
# load data
library (readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
urlfile="https://raw.githubusercontent.com/MarjeteV/606-Project/main/Dataset%20Days"
mydata<-read_csv(url(urlfile))
## Rows: 731 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): instant, season, yr, mnth, holiday, weekday, workingday, weathers...
## date (1): dteday
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(mydata)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
#temp was rescaled so the range of these values is between 0-1
# how do I determine quadratic variable? look for potential non-linear relationship
#I decided to use humidity as the quadratic variable
modela <- lm(cnt ~ hum + I(hum^2), data = mydata)
# Plot the data and the quadratic regression line
ggplot(mydata, aes(x = hum, y = cnt)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, col = "blue") +
labs(title = "Scatter Plot with Quadratic Regression Line",
x = "hum",
y = "Bike Rentals")
summary(modela)
##
## Call:
## lm(formula = cnt ~ hum + I(hum^2), data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4281.4 -1252.3 2.3 1515.0 3936.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1245.5 954.2 -1.305 0.192
## hum 21021.7 3093.3 6.796 2.24e-11 ***
## I(hum^2) -17971.9 2452.3 -7.329 6.19e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1863 on 728 degrees of freedom
## Multiple R-squared: 0.07814, Adjusted R-squared: 0.07561
## F-statistic: 30.86 on 2 and 728 DF, p-value: 1.372e-13
#Dischotomous Term is a binary - yes or no/ 1 or 0; working day and holiday could be selected from the dataset.
#One Dichotomous vs Quantitative Interaction Term- will try to use cnt and working day
modelb <- lm(cnt ~ temp * workingday, data = mydata)
summary(modelb)
##
## Call:
## lm(formula = cnt ~ temp * workingday, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4652.8 -1121.4 -100.5 1059.8 3781.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 850.0 274.2 3.100 0.00201 **
## temp 7232.1 531.1 13.618 < 2e-16 ***
## workingday 560.1 338.9 1.653 0.09880 .
## temp:workingday -907.1 649.2 -1.397 0.16277
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1508 on 727 degrees of freedom
## Multiple R-squared: 0.3962, Adjusted R-squared: 0.3937
## F-statistic: 159 on 3 and 727 DF, p-value: < 2.2e-16
#Next I will create a dataframe & add new columns
dfa <- mydata %>%
mutate( hum_squared = `hum`^2,
DvsQ= `temp` * `workingday`) #this seems off since any working day = 0 will make the new variable 0
#I want to remove some columns that seem unlikely to help in MLR
remove <- c("instant", "dteday", "yr")
df <- dfa [, !(names(dfa) %in% remove)]
glimpse(df)
## Rows: 731
## Columns: 15
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mnth <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ weekday <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4…
## $ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1…
## $ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2…
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.2…
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.2…
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261…
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.0…
## $ casual <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 5…
## $ registered <dbl> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 12…
## $ cnt <dbl> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 12…
## $ hum_squared <dbl> 0.6493668, 0.4845371, 0.1912077, 0.3486135, 0.1909314, 0.2…
## $ DvsQ <dbl> 0.0000000, 0.0000000, 0.1963640, 0.2000000, 0.2269570, 0.2…
pairs(df, gap = .5)
model.all <- lm(cnt ~ ., data = df)
summary(model.all)
## Warning in summary.lm(model.all): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = cnt ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.466e-11 -3.890e-13 -4.800e-14 2.430e-13 4.394e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.378e-13 1.362e-12 -6.150e-01 0.53867
## season 6.988e-13 1.445e-13 4.837e+00 1.61e-06 ***
## mnth -8.440e-14 4.400e-14 -1.918e+00 0.05548 .
## holiday -1.194e-12 5.188e-13 -2.302e+00 0.02163 *
## weekday 4.148e-14 4.211e-14 9.850e-01 0.32500
## workingday -1.472e-12 5.088e-13 -2.894e+00 0.00392 **
## weathersit 1.932e-12 2.141e-13 9.027e+00 < 2e-16 ***
## temp -1.073e-12 3.758e-12 -2.850e-01 0.77538
## atemp -6.743e-12 4.085e-12 -1.651e+00 0.09921 .
## hum 3.650e-12 4.013e-12 9.090e-01 0.36343
## windspeed -3.425e-13 1.190e-12 -2.880e-01 0.77354
## casual 1.000e+00 2.535e-16 3.944e+15 < 2e-16 ***
## registered 1.000e+00 8.880e-17 1.126e+16 < 2e-16 ***
## hum_squared -1.772e-12 3.280e-12 -5.400e-01 0.58907
## DvsQ 6.401e-13 1.083e-12 5.910e-01 0.55474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.234e-12 on 716 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.92e+31 on 14 and 716 DF, p-value: < 2.2e-16
#Warning: essentially perfect fit: summary may be unreliable
#Backward Elimination- take the predictor with the largest p-value & recompute
model1 <- update(model.all, .~. - temp, data = df)
summary(model1)
## Warning in summary.lm(model1): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = cnt ~ season + mnth + holiday + weekday + workingday +
## weathersit + atemp + hum + windspeed + casual + registered +
## hum_squared + DvsQ, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.560e-11 -3.010e-13 -5.500e-14 1.850e-13 4.446e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.096e-12 1.334e-12 -2.320e+00 0.0206 *
## season 1.807e-13 1.416e-13 1.276e+00 0.2024
## mnth -3.370e-15 4.312e-14 -7.800e-02 0.9377
## holiday -7.950e-13 5.079e-13 -1.565e+00 0.1179
## weekday 2.822e-14 4.126e-14 6.840e-01 0.4943
## workingday 4.734e-13 4.888e-13 9.680e-01 0.3331
## weathersit 2.587e-12 2.097e-13 1.234e+01 <2e-16 ***
## atemp 2.452e-12 1.159e-12 2.115e+00 0.0348 *
## hum -5.865e-13 3.931e-12 -1.490e-01 0.8815
## windspeed -2.814e-12 1.149e-12 -2.448e+00 0.0146 *
## casual 1.000e+00 2.443e-16 4.094e+15 <2e-16 ***
## registered 1.000e+00 8.675e-17 1.153e+16 <2e-16 ***
## hum_squared -4.034e-13 3.214e-12 -1.260e-01 0.9001
## DvsQ -1.240e-12 1.019e-12 -1.217e+00 0.2241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.19e-12 on 717 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.394e+31 on 13 and 717 DF, p-value: < 2.2e-16
model2 <- update(model1, .~. - mnth, data = df)
summary(model2)
## Warning in summary.lm(model2): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = cnt ~ season + holiday + weekday + workingday +
## weathersit + atemp + hum + windspeed + casual + registered +
## hum_squared + DvsQ, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.376e-11 -2.570e-13 1.400e-14 2.780e-13 4.822e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.629e-12 1.515e-12 -1.075e+00 0.282663
## season 1.238e-13 9.725e-14 1.273e+00 0.203571
## holiday 7.136e-14 5.780e-13 1.230e-01 0.901777
## weekday 9.401e-14 4.695e-14 2.002e+00 0.045616 *
## workingday -2.760e-12 5.565e-13 -4.960e+00 8.81e-07 ***
## weathersit 1.604e-12 2.388e-13 6.714e+00 3.85e-11 ***
## atemp -3.566e-12 1.319e-12 -2.702e+00 0.007047 **
## hum -1.725e-12 4.453e-12 -3.870e-01 0.698671
## windspeed 4.958e-12 1.309e-12 3.788e+00 0.000164 ***
## casual 1.000e+00 2.777e-16 3.600e+15 < 2e-16 ***
## registered 1.000e+00 9.879e-17 1.012e+16 < 2e-16 ***
## hum_squared 2.866e-12 3.646e-12 7.860e-01 0.432075
## DvsQ 5.786e-13 1.160e-12 4.990e-01 0.618180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.494e-12 on 718 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.67e+31 on 12 and 718 DF, p-value: < 2.2e-16
model3 <- update(model2, .~. - holiday, data = df)
summary(model3)
## Warning in summary.lm(model3): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = cnt ~ season + weekday + workingday + weathersit +
## atemp + hum + windspeed + casual + registered + hum_squared +
## DvsQ, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.563e-11 -2.830e-13 1.000e-15 2.440e-13 4.546e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.061e-12 1.442e-12 -7.360e-01 0.4621
## season 8.882e-14 9.258e-14 9.590e-01 0.3377
## weekday -4.392e-14 4.454e-14 -9.860e-01 0.3245
## workingday -2.527e-12 5.257e-13 -4.808e+00 1.86e-06 ***
## weathersit -3.967e-13 2.274e-13 -1.745e+00 0.0814 .
## atemp 5.181e-13 1.255e-12 4.130e-01 0.6799
## hum 5.617e-13 4.233e-12 1.330e-01 0.8945
## windspeed 7.650e-12 1.246e-12 6.140e+00 1.36e-09 ***
## casual 1.000e+00 2.632e-16 3.800e+15 < 2e-16 ***
## registered 1.000e+00 9.404e-17 1.063e+16 < 2e-16 ***
## hum_squared 5.959e-13 3.464e-12 1.720e-01 0.8635
## DvsQ 5.285e-13 1.104e-12 4.790e-01 0.6324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.374e-12 on 719 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.419e+31 on 11 and 719 DF, p-value: < 2.2e-16
model4 <- update(model3, .~. - hum, data = df)
summary(model4)
## Warning in summary.lm(model4): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = cnt ~ season + weekday + workingday + weathersit +
## atemp + windspeed + casual + registered + hum_squared + DvsQ,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.614e-11 -3.110e-13 -2.800e-14 1.780e-13 4.346e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.421e-12 5.987e-13 4.044e+00 5.81e-05 ***
## season 2.010e-14 8.603e-14 2.340e-01 0.81531
## weekday 5.060e-15 4.139e-14 1.220e-01 0.90274
## workingday -4.131e-12 4.882e-13 -8.460e+00 < 2e-16 ***
## weathersit -6.090e-13 2.069e-13 -2.943e+00 0.00335 **
## atemp 3.443e-12 1.164e-12 2.958e+00 0.00320 **
## windspeed -7.838e-13 1.147e-12 -6.840e-01 0.49442
## casual 1.000e+00 2.442e-16 4.095e+15 < 2e-16 ***
## registered 1.000e+00 8.664e-17 1.154e+16 < 2e-16 ***
## hum_squared -1.323e-14 6.526e-13 -2.000e-02 0.98383
## DvsQ 1.625e-12 1.024e-12 1.586e+00 0.11312
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.206e-12 on 720 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.629e+31 on 10 and 720 DF, p-value: < 2.2e-16
model5 <- update(model4, .~. - hum_squared, data = df)
summary(model5)
## Warning in summary.lm(model5): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = cnt ~ season + weekday + workingday + weathersit +
## atemp + windspeed + casual + registered + DvsQ, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.614e-11 -3.110e-13 -2.700e-14 1.780e-13 4.346e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.421e-12 5.914e-13 4.094e+00 4.71e-05 ***
## season 2.010e-14 8.500e-14 2.360e-01 0.813112
## weekday 5.060e-15 4.128e-14 1.230e-01 0.902465
## workingday -4.131e-12 4.879e-13 -8.467e+00 < 2e-16 ***
## weathersit -6.090e-13 1.604e-13 -3.797e+00 0.000159 ***
## atemp 3.443e-12 1.143e-12 3.012e+00 0.002686 **
## windspeed -7.838e-13 1.098e-12 -7.140e-01 0.475751
## casual 1.000e+00 2.430e-16 4.116e+15 < 2e-16 ***
## registered 1.000e+00 8.603e-17 1.162e+16 < 2e-16 ***
## DvsQ 1.626e-12 1.023e-12 1.590e+00 0.112332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.205e-12 on 721 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.263e+31 on 9 and 721 DF, p-value: < 2.2e-16
model6 <- update(model5, .~. - weekday, data = df)
summary(model6)
## Warning in summary.lm(model6): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = cnt ~ season + workingday + weathersit + atemp +
## windspeed + casual + registered + DvsQ, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.238e-11 -2.400e-13 -2.400e-14 1.780e-13 4.631e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.159e-12 5.140e-13 -2.256e+00 0.0244 *
## season 3.272e-13 7.528e-14 4.346e+00 1.58e-05 ***
## workingday -6.873e-13 4.321e-13 -1.591e+00 0.1121
## weathersit 9.839e-13 1.417e-13 6.941e+00 8.66e-12 ***
## atemp -1.420e-12 1.006e-12 -1.412e+00 0.1585
## windspeed 1.868e-12 9.725e-13 1.921e+00 0.0552 .
## casual 1.000e+00 2.131e-16 4.692e+15 < 2e-16 ***
## registered 1.000e+00 7.619e-17 1.313e+16 < 2e-16 ***
## DvsQ 2.590e-13 9.036e-13 2.870e-01 0.7745
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.952e-12 on 722 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.983e+31 on 8 and 722 DF, p-value: < 2.2e-16
model7 <- update(model6, .~. - DvsQ, data = df)
summary(model7)
## Warning in summary.lm(model7): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = cnt ~ season + workingday + weathersit + atemp +
## windspeed + casual + registered, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.238e-11 -2.440e-13 -2.500e-14 1.740e-13 4.632e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.257e-12 4.228e-13 -2.972e+00 0.00306 **
## season 3.264e-13 7.505e-14 4.349e+00 1.56e-05 ***
## workingday -5.426e-13 2.521e-13 -2.153e+00 0.03169 *
## weathersit 9.839e-13 1.415e-13 6.953e+00 8.03e-12 ***
## atemp -1.420e-12 6.199e-13 -2.291e+00 0.02225 *
## windspeed 1.868e-12 9.710e-13 1.924e+00 0.05479 .
## casual 1.000e+00 1.937e-16 5.164e+15 < 2e-16 ***
## registered 1.000e+00 7.328e-17 1.365e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.951e-12 on 723 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.028e+32 on 7 and 723 DF, p-value: < 2.2e-16
summary(model7)$r.squared
## Warning in summary.lm(model7): essentially perfect fit: summary may be
## unreliable
## [1] 1
par(mmfrow = c(2.2))
## Warning in par(mmfrow = c(2.2)): "mmfrow" is not a graphical parameter
plot(model7)
##conclusion
Rsquared continues to be 1 in all the models so obviously something went wrong, making interpretation of residuals not worth it since there are errors in the model , and I do not feel confident in stating either way if MLR is a good model for this data.