Download the homework1.Rmd file from the course website and open homework1.Rmd in RStudio.

This portion of the lab gets you to carry out the Lab in §3.6 of ISLR (Pages 109 - 118). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you're doing.
Please run all of the code indicated in §3.6 of ISLR, even if I don’t explicitly ask you to do so in this document.
Note: You may want to use the View(Boston) command instead of fix(Boston).

Use the dim() command to figure out the number of rows and columns in the Boston housing data.

library(MASS) # contains the Boston dataset
data("Boston") # load it into your session
# quick sanity checks
dim(Boston) # rows & cols
## [1] 506 14
names(Boston) # variable names
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(Boston, 3) # first few rows
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## medv
## 1 24.0
## 2 21.6
## 3 34.7
Use the nrow() and ncol() commands to figure out the number of rows and columns in the Boston housing data.

nrow(Boston)
## [1] 506
ncol(Boston)
## [1] 14
Use the names() command to see which variables exist in the data. Which of these variables is our response variable? What does this response variable refer to? How many input variables do we have?

names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Use the lm() function to fit a linear regression of medv on lstat. Save the output of your linear regression in a variable called lm.fit.

lm.fit <- lm(medv ~ lstat, data = Boston)
Use the summary() command on your lm.fit object to get a print-out of your regression results.

summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
library(knitr)
kable(summary(lm.fit)$coefficients, digits = c(4, 5, 2, 4))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 34.5538 | 0.56263 | 61.42 | 0 |
| lstat | -0.9500 | 0.03873 | -24.53 | 0 |
Use names() on lm.fit to explore what values this linear model object contains.

names(lm.fit)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Use the coef() function to get the estimated coefficients. What is the estimated intercept? What is the coefficient of lstat in the model? Interpret this coefficient.

coef(lm.fit)
## (Intercept) lstat
## 34.5538409 -0.9500494
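# One reading of these estimates: the intercept (34.55) is the predicted medv when lstat = 0,
# and the lstat coefficient (-0.95) says each 1 percentage-point increase in lstat is associated
# with about a $950 decrease in median home value (medv is measured in $1000s).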
Plot medv vs. lstat. Make sure to produce more meaningful axis labels. Does the linear model appear to fit the data well? Explain.

library(ggplot2)
ggplot(Boston, aes(x = lstat, y = medv)) +
geom_point(color = "steelblue", alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Linear Regression of medv on lstat (Boston Housing Data)",
x = "Percent Lower-Status Population (lstat)",
y = "Median Home Value (medv, in $1000s)"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# 95% confidence interval for the slope and intercept
confint(lm.fit)
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lstat -1.026148 -0.8739505
# Example: predict median value for specific lstat values
new.data <- data.frame(lstat = c(5, 10, 15))
predict(lm.fit, newdata = new.data, interval = "confidence")
## fit lwr upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit, newdata = new.data, interval = "prediction")
## fit lwr upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310 8.077742 32.52846
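# Note that the prediction intervals are much wider than the confidence intervals: both are
# centered at the same fitted values, but prediction intervals also account for the irreducible
# error of an individual response, not just the uncertainty in the estimated regression line.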
Use ?Boston to figure out what the age variable means. What does age mean in the Boston Housing data?

?Boston # opens the help page (or use help("Boston", package = "MASS"))
# Answer in words (put this in your Rmd text, not code):
# age = proportion of owner-occupied units built prior to 1940 (in percent).
Use the qplot() command to construct a scatterplot of medv versus age. Make sure to specify meaningful x and y axis names. Overlay a linear regression line. Does a linear relationship appear to hold between the two variables?

library(ggplot2)
qplot(
x = age, y = medv, data = Boston,
geom = c("point", "smooth"),
method = "lm", se = TRUE,
xlab = "Proportion of owner-occupied units built before 1940 (%)",
ylab = "Median home value (medv, $1000s)",
main = "Boston Housing: medv vs age with linear fit"
)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in geom_point(method = "lm", se = TRUE): Ignoring unknown parameters:
## `method` and `se`
## `geom_smooth()` using formula = 'y ~ x'
# The relationship is weakly positive and not perfectly linear.
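Since qplot() is deprecated in ggplot2 >= 3.4.0 and, as the warnings show, ignores the method and se arguments here (so the overlay drawn is the default smoother rather than a linear fit), here is a sketch of an equivalent ggplot() version that produces the intended linear overlay:

ggplot(Boston, aes(x = age, y = medv)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE) + # explicit linear fit
  labs(
    x = "Proportion of owner-occupied units built before 1940 (%)",
    y = "Median home value (medv, $1000s)",
    title = "Boston Housing: medv vs age with linear fit"
  )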
Use the lm() command to fit a linear regression of medv on lstat and age. Save your regression model in a variable called lm.fit2.

lm.fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
What is the coefficient of age in your model? Interpret this coefficient.

coef(lm.fit2)["age"]
## age
## 0.03454434
summary(lm.fit2)$coefficients["age", ]
## Estimate Std. Error t value Pr(>|t|)
## 0.034544339 0.012225467 2.825604893 0.004906776
confint(lm.fit2)["age", ]
## 2.5 % 97.5 %
## 0.01052507 0.05856361
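# One interpretation: holding lstat fixed, each additional percentage point of pre-1940
# housing stock is associated with about 0.035 higher medv, i.e. roughly $35 in median
# home value; the 95% CI (0.011, 0.059) excludes zero, so the effect is significant.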
Use the medv ~ . syntax to fit a model regressing medv on all the other variables. Use the summary() and kable() functions to produce a coefficients table in nice formatting.

lm.fit.full <- lm(medv ~ ., data = Boston)
summary(lm.fit.full)
##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.4594884 5.1034588 7.144 3.28e-12 ***
## crim -0.1080114 0.0328650 -3.287 0.001087 **
## zn 0.0464205 0.0137275 3.382 0.000778 ***
## indus 0.0205586 0.0614957 0.334 0.738288
## chas 2.6867338 0.8615798 3.118 0.001925 **
## nox -17.7666112 3.8197437 -4.651 4.25e-06 ***
## rm 3.8098652 0.4179253 9.116 < 2e-16 ***
## age 0.0006922 0.0132098 0.052 0.958229
## dis -1.4755668 0.1994547 -7.398 6.01e-13 ***
## rad 0.3060495 0.0663464 4.613 5.07e-06 ***
## tax -0.0123346 0.0037605 -3.280 0.001112 **
## ptratio -0.9527472 0.1308268 -7.283 1.31e-12 ***
## black 0.0093117 0.0026860 3.467 0.000573 ***
## lstat -0.5247584 0.0507153 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
library(knitr)
kable(summary(lm.fit.full)$coefficients, digits = c(4, 5, 2, 4))
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 36.4595 | 5.10346 | 7.14 | 0.0000 |
| crim | -0.1080 | 0.03286 | -3.29 | 0.0011 |
| zn | 0.0464 | 0.01373 | 3.38 | 0.0008 |
| indus | 0.0206 | 0.06150 | 0.33 | 0.7383 |
| chas | 2.6867 | 0.86158 | 3.12 | 0.0019 |
| nox | -17.7666 | 3.81974 | -4.65 | 0.0000 |
| rm | 3.8099 | 0.41793 | 9.12 | 0.0000 |
| age | 0.0007 | 0.01321 | 0.05 | 0.9582 |
| dis | -1.4756 | 0.19945 | -7.40 | 0.0000 |
| rad | 0.3060 | 0.06635 | 4.61 | 0.0000 |
| tax | -0.0123 | 0.00376 | -3.28 | 0.0011 |
| ptratio | -0.9527 | 0.13083 | -7.28 | 0.0000 |
| black | 0.0093 | 0.00269 | 3.47 | 0.0006 |
| lstat | -0.5248 | 0.05072 | -10.35 | 0.0000 |
# Extract the p-value for each coefficient
summary(lm.fit.full)$coefficients[, 4]
## (Intercept) crim zn indus chas nox
## 3.283438e-12 1.086810e-03 7.781097e-04 7.382881e-01 1.925030e-03 4.245644e-06
## rm age dis rad tax ptratio
## 1.979441e-18 9.582293e-01 6.013491e-13 5.070529e-06 1.111637e-03 1.308835e-12
## black lstat
## 5.728592e-04 7.776912e-23
# Keep only the coefficients significant at the 0.05 level
sig.vars <- summary(lm.fit.full)$coefficients[
  summary(lm.fit.full)$coefficients[, 4] < 0.05, ]
kable(sig.vars, digits = 4)
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 36.4595 | 5.1035 | 7.1441 | 0.0000 |
| crim | -0.1080 | 0.0329 | -3.2865 | 0.0011 |
| zn | 0.0464 | 0.0137 | 3.3816 | 0.0008 |
| chas | 2.6867 | 0.8616 | 3.1184 | 0.0019 |
| nox | -17.7666 | 3.8197 | -4.6513 | 0.0000 |
| rm | 3.8099 | 0.4179 | 9.1161 | 0.0000 |
| dis | -1.4756 | 0.1995 | -7.3980 | 0.0000 |
| rad | 0.3060 | 0.0663 | 4.6129 | 0.0000 |
| tax | -0.0123 | 0.0038 | -3.2800 | 0.0011 |
| ptratio | -0.9527 | 0.1308 | -7.2825 | 0.0000 |
| black | 0.0093 | 0.0027 | 3.4668 | 0.0006 |
| lstat | -0.5248 | 0.0507 | -10.3471 | 0.0000 |
Fit a regression of medv onto a quadratic polynomial of lstat by using the formula medv ~ lstat + I(lstat^2). Use the summary() function to display the estimated coefficients. Is the coefficient of the squared term statistically significant?

lm.fit.poly <- lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit.poly)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
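# Yes: the I(lstat^2) term has t = 11.63 with p < 2e-16, so the squared term is highly significant.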
Now try fitting medv ~ lstat + lstat^2 instead. What happens?

lm.fit.poly2 <- lm(medv ~ lstat + lstat^2, data = Boston)
summary(lm.fit.poly2)
##
## Call:
## lm(formula = medv ~ lstat + lstat^2, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
# Without I(), the ^ operator has its formula meaning (factor crossing), and lstat^2
# (lstat crossed with itself) simplifies to just lstat. The squared term is silently
# dropped, so we recover the simple linear fit from before.
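One way to confirm this, as a quick sketch: a raw (non-orthogonal) polynomial via poly(..., raw = TRUE) reproduces the I(lstat^2) fit exactly.

lm.fit.raw <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston)
coef(lm.fit.raw) # same estimates as lm.fit.poly: 42.862, -2.333, 0.0435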
Now fit the model using medv ~ poly(lstat, 2). Compare your results to part (a).

lm.fit.poly3 <- lm(medv ~ poly(lstat, 2), data = Boston)
summary(lm.fit.poly3)
##
## Call:
## lm(formula = medv ~ poly(lstat, 2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2456 91.76 <2e-16 ***
## poly(lstat, 2)1 -152.4595 5.5237 -27.60 <2e-16 ***
## poly(lstat, 2)2 64.2272 5.5237 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
# Both models confirm strong non-linearity between lstat and medv: the fitted values,
# residual standard error, and R-squared are identical.
# The orthogonal polynomial (poly) approach is numerically more stable, but the raw
# form (I(lstat^2)) is easier to interpret coefficient-by-coefficient.
ggplot's stat_smooth command allows us to visualize simple regression models in a really easy way. This set of problems helps you get accustomed to specifying polynomial and step function formulas for the purpose of visualization.

For this problem, please refer to the code posted in oj.R, where I add polynomials and step functions.
Use ggplot graphics to construct a scatterplot of medv vs. lstat, overlaying a 2nd degree polynomial. Does this appear to be a good model of the data? Construct plots with higher degree polynomial fits. Do any of them appear to describe the data particularly well?

library(ggplot2)
ggplot(Boston, aes(x = lstat, y = medv)) +
geom_point(color = "steelblue", alpha = 0.6) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE, color = "red", linewidth = 1) +
labs(
title = "Boston Housing: medv vs lstat with 2nd-Degree Polynomial Fit",
x = "Percent Lower-Status Population (lstat)",
y = "Median Home Value (medv, $1000s)"
) +
theme_minimal()
# The 2nd-degree polynomial captures the smooth, non-linear decline in home values as lstat rises. It fits the curvature well without overfitting—showing that housing values drop quickly at low lstat and level off at higher values. Higher-degree polynomials add noise rather than improvement.
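For comparison, a sketch of how several higher-degree fits could be overlaid on one plot (only the degree changes):

ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point(color = "gray60", alpha = 0.5) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, aes(color = "degree 2")) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 5), se = FALSE, aes(color = "degree 5")) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 10), se = FALSE, aes(color = "degree 10")) +
  labs(color = "Polynomial fit",
       x = "Percent Lower-Status Population (lstat)",
       y = "Median Home Value (medv, $1000s)") +
  theme_minimal()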
ggplot(Boston, aes(x = lstat, y = medv)) +
geom_point(color = "gray40", alpha = 0.6) +
stat_smooth(
method = "lm",
formula = y ~ cut(x, breaks = c(0, 10, 20, 30)),
se = FALSE,
color = "darkred",
linewidth = 1
) +
labs(
title = "Step Function Fit: medv vs lstat",
x = "Percent Lower-Status Population (lstat)",
y = "Median Home Value (medv, $1000s)"
) +
theme_minimal()
## Warning: Removed 18 rows containing missing values or values outside the scale range
## (`geom_smooth()`).
# The step function captures broad downward shifts but oversimplifies the trend, missing
# the smooth curvature. While easy to interpret by groups, it's less accurate than the
# polynomial model. (The warning above appears because the breaks (0, 10, 20, 30) don't
# span the full range of lstat, whose maximum is about 38, so the 18 observations with
# lstat above 30 are dropped from the smooth.)
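One way to keep those dropped observations, as a sketch: extend the top break past the maximum of lstat.

ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point(color = "gray40", alpha = 0.6) +
  stat_smooth(
    method = "lm",
    formula = y ~ cut(x, breaks = c(0, 10, 20, 30, 40)), # 40 exceeds max(lstat)
    se = FALSE, color = "darkred", linewidth = 1
  ) +
  theme_minimal()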
Construct the same polynomial plot using ptratio as the x-axis variable, and medv still as the y-axis variable.

ggplot(Boston, aes(x = ptratio, y = medv)) +
geom_point(color = "steelblue", alpha = 0.6) +
stat_smooth(method = "lm", formula = y ~ poly(x, 2), se = TRUE, color = "red", size = 1) +
labs(
title = "Boston Housing: medv vs ptratio with 2nd-Degree Polynomial Fit",
x = "Pupil-Teacher Ratio (ptratio)",
y = "Median Home Value (medv, $1000s)"
) +
theme_minimal()
Construct the same step function plot using ptratio instead of lstat.

ggplot(Boston, aes(x = ptratio, y = medv)) +
geom_point(color = "gray40", alpha = 0.6) +
stat_smooth(
method = "lm",
formula = y ~ cut(x, breaks = c(10, 15, 20, 25)),
se = FALSE,
color = "darkred",
linewidth = 1
) +
labs(
title = "Step Function Fit: medv vs ptratio",
x = "Pupil-Teacher Ratio (ptratio)",
y = "Median Home Value (medv, $1000s)"
) +
theme_minimal()
library(ggplot2)
library(plyr)
library(ISLR)
library(MASS)
library(knitr)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
options(scipen = 4)
For this problem we'll be working with two years of bikeshare data from the Capital Bikeshare system in Washington DC. This is in the same folder as this assignment on the course website. The dataset contains daily bikeshare counts, along with daily measurements on environmental and seasonal information that may affect bikeshare usage.
Let’s start by loading the data.
bikes <- read.csv("~/Downloads/bikes (1).csv", header = TRUE) # adjust this path to wherever you saved bikes.csv
str(bikes)
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "1/1/2011" "1/2/2011" "1/3/2011" "1/4/2011" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
head(bikes)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 1/1/2011 1 0 1 0 6 0 2
## 2 2 1/2/2011 1 0 1 0 0 0 2
## 3 3 1/3/2011 1 0 1 0 1 1 1
## 4 4 1/4/2011 1 0 1 0 2 1 1
## 5 5 1/5/2011 1 0 1 0 3 1 1
## 6 6 1/6/2011 1 0 1 0 4 1 1
## temp atemp hum windspeed casual registered cnt
## 1 0.344167 0.363625 0.805833 0.1604460 331 654 985
## 2 0.363478 0.353739 0.696087 0.2485390 131 670 801
## 3 0.196364 0.189405 0.437273 0.2483090 120 1229 1349
## 4 0.200000 0.212122 0.590435 0.1602960 108 1454 1562
## 5 0.226957 0.229270 0.436957 0.1869000 82 1518 1600
## 6 0.204348 0.233209 0.518261 0.0895652 88 1518 1606
bikes <- transform(bikes,
temp = 47 * temp - 8, # Convert normalized temp back to Celsius
atemp = 66 * atemp - 16, # Convert normalized feeling temp back to Celsius
hum = 100 * hum, # Convert humidity to percentage
windspeed = 67 * windspeed # Convert normalized windspeed to actual scale
)
library(plyr)
bikes <- transform(bikes,
season = mapvalues(season, c(1, 2, 3, 4),
c("Winter", "Spring", "Summer", "Fall"))
)
str(bikes)
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "1/1/2011" "1/2/2011" "1/3/2011" "1/4/2011" ...
## $ season : chr "Winter" "Winter" "Winter" "Winter" ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 8.18 9.08 1.23 1.4 2.67 ...
## $ atemp : num 7.999 7.347 -3.499 -2 -0.868 ...
## $ hum : num 80.6 69.6 43.7 59 43.7 ...
## $ windspeed : num 10.7 16.7 16.6 10.7 12.5 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
summary(bikes)
## instant dteday season yr
## Min. : 1.0 Length:731 Length:731 Min. :0.0000
## 1st Qu.:183.5 Class :character Class :character 1st Qu.:0.0000
## Median :366.0 Mode :character Mode :character Median :1.0000
## Mean :366.0 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:1.0000
## Max. :731.0 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :-5.221 Min. :-10.781 Min. : 0.00
## 1st Qu.:1.000 1st Qu.: 7.843 1st Qu.: 6.298 1st Qu.:52.00
## Median :1.000 Median :15.422 Median : 16.124 Median :62.67
## Mean :1.395 Mean :15.283 Mean : 15.307 Mean :62.79
## 3rd Qu.:2.000 3rd Qu.:22.805 3rd Qu.: 24.168 3rd Qu.:73.02
## Max. :3.000 Max. :32.498 Max. : 39.499 Max. :97.25
## windspeed casual registered cnt
## Min. : 1.500 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.: 9.042 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :12.125 Median : 713.0 Median :3662 Median :4548
## Mean :12.763 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:15.625 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :34.000 Max. :3410.0 Max. :6946 Max. :8714
Let’s look at some boxplots of how bikeshare ride count varies with season.
qplot(data = bikes, x = season, y = cnt, fill = I(cbPalette[3]), geom = "boxplot")
There’s something funny going on here. Instead of showing up in seasonal order, the seasons in the plot are showing up in alphabetical order. The following command reorders the seasons appropriately.
bikes <- transform(bikes, season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall")))
Now let’s try that plot again.
qplot(data = bikes, x = season, y = cnt, fill = I(cbPalette[3]), geom = "boxplot")
Information on what the variables mean is provided with the dataset on the course website.

The season variable is an example of what's called a qualitative or categorical predictor. In R, such variables are called factors. This problem gets you to fit a model with a qualitative predictor and to interpret the findings. (This is like how we handled race in class.)
Fit a linear regression model with cnt as the response and season as the input. Use the summary() and kable() commands to produce a nice looking coefficients table.

lm.season <- lm(cnt ~ season, data = bikes)
library(knitr)
kable(summary(lm.season)$coefficients, digits = 3)
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 2604.133 | 116.598 | 22.334 | 0 |
| seasonSpring | 2388.199 | 164.221 | 14.543 | 0 |
| seasonSummer | 3040.171 | 163.352 | 18.611 | 0 |
| seasonFall | 2124.030 | 165.588 | 12.827 | 0 |
The model estimates 4 coefficients:

- 1 intercept (the baseline category: Winter)
- 3 dummy coefficients, for Spring, Summer, and Fall

This is because season has 4 levels, and R automatically omits one (Winter) as the reference group.
Which coefficients correspond to the season variable?

seasonSpring, seasonSummer, and seasonFall. Each represents the difference in the mean number of total rides (cnt) in that season compared to Winter.
Interpret the coefficients of season in the model.

The intercept represents the average number of total rides per day during Winter, approximately 2,604.
The coefficient for Spring (2388.19) means that, on average, there are about 2,388 more rides per day in Spring compared to Winter.
The coefficient for Summer (3040.17) indicates that, on average, there are about 3,040 more rides per day in Summer compared to Winter — this is the largest seasonal increase.
The coefficient for Fall (2124.03) shows that, on average, there are about 2,124 more rides per day in Fall compared to Winter.
All coefficients are statistically significant (p < 0.001), suggesting that bike rentals increase substantially across all other seasons relative to Winter, with Summer having the highest average ridership.
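As a sanity check on this interpretation, the seasonal group means should equal the intercept plus the corresponding dummy coefficient (a minimal sketch):

# Seasonal means of cnt: Winter = intercept; each other season = intercept + its dummy
with(bikes, tapply(cnt, season, mean))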
Hint: If you have not previously studied how to interpret qualitative variables in regressions, begin by reading through the relevant sections of the Suggested readings for the Week 1 lectures.
In this problem we’ll practice fitting and interpreting the results of a multiple linear regression.
Fit a linear regression model with cnt as the response and the following variables as inputs: temp, atemp, mnth, hum, windspeed. Use the summary() and kable() commands to produce a nice looking coefficients table.

lm.multi <- lm(cnt ~ temp + atemp + mnth + hum + windspeed, data = bikes)
summary(lm.multi)
##
## Call:
## lm(formula = cnt ~ temp + atemp + mnth + hum + windspeed, data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4829.6 -992.7 -188.5 1089.4 3615.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5057.781 342.624 14.762 < 2e-16 ***
## temp 45.387 47.414 0.957 0.3388
## atemp 72.014 38.138 1.888 0.0594 .
## mnth 95.040 15.742 6.037 0.0000000025 ***
## hum -35.262 3.801 -9.277 < 2e-16 ***
## windspeed -59.159 10.601 -5.580 0.0000000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1389 on 725 degrees of freedom
## Multiple R-squared: 0.4895, Adjusted R-squared: 0.486
## F-statistic: 139 on 5 and 725 DF, p-value: < 2.2e-16
library(knitr)
kable(summary(lm.multi)$coefficients, digits = 3)
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 5057.781 | 342.624 | 14.762 | 0.000 |
| temp | 45.387 | 47.414 | 0.957 | 0.339 |
| atemp | 72.014 | 38.138 | 1.888 | 0.059 |
| mnth | 95.040 | 15.742 | 6.037 | 0.000 |
| hum | -35.262 | 3.801 | -9.277 | 0.000 |
| windspeed | -59.159 | 10.601 | -5.580 | 0.000 |
Interpret the coefficients of mnth, windspeed and atemp in the model.

The coefficient for mnth (95.04) indicates that, on average, daily ridership increases by about 95 rides with each additional month into the year, holding the other variables constant.

The coefficient for windspeed (-59.16) means that each one-unit increase in windspeed (now on its original scale, after undoing the normalization) is associated with about 59 fewer rides per day, holding all other variables constant.

The coefficient for atemp (72.01) suggests that each 1°C increase in the apparent ("feels-like") temperature is associated with roughly 72 more rides per day, controlling for the other predictors.
Predictors with positive coefficients (associated with increased ridership): temp, atemp, and mnth.

Predictors with negative coefficients (associated with decreased ridership): hum and windspeed.

From the p-values in the output, the predictors mnth, hum, and windspeed are statistically significant at the 0.05 level (all p < 0.001).

Neither temp (p = 0.34) nor atemp (p = 0.059) is significant at the 0.05 level; as the next problem shows, this reflects the strong collinearity between the two temperature measures rather than evidence that temperature is irrelevant.
As your recent econometrics class covered, collinear or highly correlated predictors can make interpreting regression coefficients problematic. In this problem you will try to diagnose and address collinearity issues in the data.
Refit the model, this time omitting the temp variable. Display the coefficients table for this model.

lm.multi.noTemp <- lm(cnt ~ atemp + mnth + hum + windspeed, data = bikes)
summary(lm.multi.noTemp)
##
## Call:
## lm(formula = cnt ~ atemp + mnth + hum + windspeed, data = bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4864.6 -993.4 -171.3 1102.3 4114.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5186.624 315.061 16.462 < 2e-16 ***
## atemp 108.210 4.970 21.774 < 2e-16 ***
## mnth 95.016 15.741 6.036 2.51e-09 ***
## hum -35.446 3.796 -9.338 < 2e-16 ***
## windspeed -57.397 10.440 -5.498 5.33e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1389 on 726 degrees of freedom
## Multiple R-squared: 0.4889, Adjusted R-squared: 0.486
## F-statistic: 173.6 on 4 and 726 DF, p-value: < 2.2e-16
library(knitr)
kable(summary(lm.multi.noTemp)$coefficients, digits = 3)
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 5186.624 | 315.061 | 16.462 | 0 |
| atemp | 108.210 | 4.970 | 21.774 | 0 |
| mnth | 95.016 | 15.741 | 6.036 | 0 |
| hum | -35.446 | 3.796 | -9.338 | 0 |
| windspeed | -57.397 | 10.440 | -5.498 | 0 |
What is the coefficient of atemp in this new model? Is it very different from the atemp coefficient estimated in part (b)? Is it statistically significant? Explain your findings.

The coefficient of atemp in this new model is 108.210 and is statistically significant (p < 0.001).

Compared to the earlier model (where atemp = 72.014 with p = 0.059), the coefficient increased substantially and became clearly significant after removing temp.

This indicates that the strong collinearity between temp and atemp in the previous model had masked the effect of atemp.

Without temp, atemp now clearly captures the positive relationship between warmer apparent temperatures and increased bike ridership.
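A direct check of the collinearity, as a minimal sketch: the correlation between temp and atemp is nearly 1, which is what inflates the standard errors when both enter the model together.

# Correlation between the two temperature measures
cor(bikes$temp, bikes$atemp) # very close to 1
# Variance inflation factors offer another diagnostic (requires the car package):
# car::vif(lm.multi)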