Run all the code and show all the output (2pts).
We’ll use the Boston dataset from the MASS
package. It contains information on housing prices and neighborhood
features in Boston suburbs.
library(MASS)
# Load data
data(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
We will focus on medv (median home value, in $1000s) and lstat (% lower-status population).
# Summary
summary(Boston$medv)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 17.02 21.20 22.53 25.00 50.00
summary(Boston$lstat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.73 6.95 11.36 12.65 16.95 37.97
# Visualize
plot(Boston$lstat, Boston$medv,
main = "Housing Prices vs. % Lower Status Population",
xlab = "% Lower Status (lstat)", ylab = "Median House Price ($1000s)",
pch = 19, col = "darkgreen")
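Question 1 (2pts): Based on the scatterplot, describe the
relationship between lstat and medv. Is it
linear? The relationship is negative but only roughly linear: medv drops
quickly as lstat rises from low values and then levels off, so the points
follow a curve (closer to a negative logarithmic shape) rather than a
straight line.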
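One way to check this visual impression numerically is to compare a straight-line fit with a fit on log(lstat). This is only a sketch: the log transform and the names fit_linear, fit_log, r2_linear, and r2_log are illustrative choices, not part of the assignment.

```r
library(MASS)
data(Boston)

# Straight-line fit versus a fit on log(lstat); the log transform is an assumption
fit_linear <- lm(medv ~ lstat, data = Boston)
fit_log    <- lm(medv ~ log(lstat), data = Boston)

# Share of the variance in medv explained by each model
r2_linear <- summary(fit_linear)$r.squared
r2_log    <- summary(fit_log)$r.squared
print(c(linear = r2_linear, log = r2_log))
```

If the log model explains noticeably more variance, that supports reading the scatterplot as curved rather than linear.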
model <- lm(medv ~ lstat, data = Boston)
summary(model)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
Question 2 (2pts): Write out the estimated regression equation. Predicted medv = 34.55384 − 0.95005 × lstat
Question 3 (2pts): Interpret the slope. What does it mean in the context of housing? The slope is -0.95005, meaning that for every 1 percentage point increase in the lower-status share of the population (lstat), the median value of owner-occupied homes (medv) is expected to decrease by about 0.95 thousand dollars (roughly $950), on average.
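The slope interpretation can be confirmed by comparing predictions one percentage point apart. A minimal sketch, where the lstat values 10 and 11 and the name slope_check are arbitrary choices for illustration:

```r
library(MASS)
data(Boston)
model <- lm(medv ~ lstat, data = Boston)

# Predicted medv at two lstat values one percentage point apart (10 and 11 are arbitrary)
preds <- predict(model, data.frame(lstat = c(10, 11)))

# The difference between the two predictions equals the slope coefficient
slope_check <- unname(preds[2] - preds[1])
print(slope_check)
print(coef(model)[["lstat"]])
```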
Question 4 (2pts): Is lstat a statistically significant
predictor of housing prices? Yes. The p-value for lstat is below 2e-16
(t = -24.53), far under the usual 0.05 threshold, so lstat is a
statistically significant predictor of housing prices.
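The significance test can also be read directly from the fitted model object instead of the printed summary. A sketch, assuming the conventional 0.05 cutoff; the names coef_table, lstat_p, and slope_ci are illustrative:

```r
library(MASS)
data(Boston)
model <- lm(medv ~ lstat, data = Boston)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- summary(model)$coefficients
lstat_p <- coef_table["lstat", "Pr(>|t|)"]
print(lstat_p < 0.05)

# 95% confidence interval for the slope; it excludes 0 when lstat is significant at 5%
slope_ci <- confint(model, "lstat", level = 0.95)
print(slope_ci)
```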
Question 5 (2pts): What is the R square? Is this model a good fit? R-squared = 0.5441, so lstat alone explains about 54% of the variation in medv. This is a moderately good fit for a single-variable regression.
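To see where the R-squared value comes from, it can be recomputed from the residual and total sums of squares. A minimal sketch; the names rss, tss, and r2_manual are illustrative:

```r
library(MASS)
data(Boston)
model <- lm(medv ~ lstat, data = Boston)

# Residual sum of squares: variation in medv left unexplained by the model
rss <- sum(residuals(model)^2)

# Total sum of squares: variation in medv around its mean
tss <- sum((Boston$medv - mean(Boston$medv))^2)

# R-squared is the share of total variation explained by the model
r2_manual <- 1 - rss / tss
print(r2_manual)
print(summary(model)$r.squared)
```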
# Residual plot
plot(model$fitted.values, model$residuals,
main = "Residuals vs. Fitted Values",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "gray")
# QQ Plot
qqnorm(model$residuals)
qqline(model$residuals)
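Question 6 (4pts): Look at the residual plot and QQ plot. Are the assumptions of linear regression reasonably met? No, the assumptions of linear regression are not reasonably met, so the simple linear regression model is not appropriate for this data as fitted. The curved relationship between lstat and medv works against the linearity assumption, and the residuals are skewed to the right (median -1.318, maximum 24.500 versus minimum -15.168), which works against the normality assumption.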
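The plots can be backed up with two simple numerical checks. This is a sketch under stated assumptions: the Shapiro-Wilk test and the added squared term are one possible choice of diagnostics, not the method required by the assignment, and the names res, sw, model_quad, and quad_p are illustrative.

```r
library(MASS)
data(Boston)
model <- lm(medv ~ lstat, data = Boston)
res <- residuals(model)

# Normality: Shapiro-Wilk test on the residuals (a small p-value suggests non-normal errors)
sw <- shapiro.test(res)
print(sw)

# Linearity: add a squared term; a significant lstat^2 coefficient suggests curvature
model_quad <- lm(medv ~ lstat + I(lstat^2), data = Boston)
quad_p <- summary(model_quad)$coefficients["I(lstat^2)", "Pr(>|t|)"]
print(quad_p)
```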
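Remember the 4 key assumptions of simple linear regression: linearity, independence of errors, constant variance of errors (homoscedasticity), and normality of errors.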
Question 7 (4pts): What is the predicted value of a house when lstat = 12? Provide a confidence interval. The predicted value when lstat = 12 is 23.15 (about $23,150), and the 95% confidence interval for the mean medv at that lstat is (22.61, 23.70).
# Predict medv for a neighborhood with lstat = 12
new_data <- data.frame(lstat = 12)
predict(model, new_data, interval = "confidence")
## fit lwr upr
## 1 23.15325 22.60809 23.69841
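The interval above describes the average medv across neighborhoods with lstat = 12. For a single new neighborhood, a prediction interval may be the more relevant quantity. A sketch comparing the two; the names conf_int and pred_int are illustrative:

```r
library(MASS)
data(Boston)
model <- lm(medv ~ lstat, data = Boston)
new_data <- data.frame(lstat = 12)

# Confidence interval: uncertainty about the mean medv at lstat = 12
conf_int <- predict(model, new_data, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about medv for one new neighborhood at lstat = 12
pred_int <- predict(model, new_data, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both intervals share the same fitted value; the prediction interval is wider because it also accounts for the residual scatter around the regression line.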