Run all the code and show all the output (2pts).

Dataset: Boston Housing Data

We’ll use the Boston dataset from the MASS package. It contains information on housing prices and neighborhood features in Boston suburbs.

library(MASS)

Part 1: Explore the Dataset

# Load data
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

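We will focus on two variables: medv (median value of owner-occupied homes, in $1000s) and lstat (percentage of lower-status population).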

# Summary
summary(Boston$medv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00
summary(Boston$lstat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.73    6.95   11.36   12.65   16.95   37.97
# Visualize
plot(Boston$lstat, Boston$medv,
     main = "Housing Prices vs. % Lower Status Population",
     xlab = "% Lower Status (lstat)", ylab = "Median House Price ($1000s)",
     pch = 19, col = "darkgreen")

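Question 1 (2pts): Based on the scatterplot, describe the relationship between lstat and medv. Is it linear? The relationship is negative but only roughly linear: medv drops steeply at low values of lstat and levels off at high values, so the pattern looks more like a negative logarithmic curve than a straight line.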
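One hedged way to probe the linearity question visually is to overlay both a straight-line fit and a flexible smoother on the same scatterplot. The sketch below uses lowess() as the smoother; that choice, the colors, and the name smooth_fit are assumptions made for illustration and are not part of the original assignment.

```r
library(MASS)
data(Boston)

# Scatterplot of the two variables under study
plot(Boston$lstat, Boston$medv,
     xlab = "% Lower Status (lstat)", ylab = "Median House Price ($1000s)",
     pch = 19, col = "darkgreen")

# Straight-line least-squares fit
abline(lm(medv ~ lstat, data = Boston), col = "red", lwd = 2)

# Locally weighted smoother; it is free to bend where the data bend
smooth_fit <- lowess(Boston$lstat, Boston$medv)
lines(smooth_fit, col = "blue", lwd = 2)
```

If the smoother tracks the straight line closely, a linear model is plausible; a visible bend away from the line suggests curvature.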

Part 2: Fit a Simple Linear Regression

model <- lm(medv ~ lstat, data = Boston)
summary(model)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

Question 2 (2pts): Write out the estimated regression equation. Predicted medv = 34.55384 − 0.95005 × lstat
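As a small sketch, the coefficients in the equation can be pulled from the fitted model with coef() rather than copied by hand from the summary table. The names b0 and b1 are introduced here for illustration only.

```r
library(MASS)
model <- lm(medv ~ lstat, data = Boston)

# Named vector of estimates: "(Intercept)" and "lstat"
b <- coef(model)
b0 <- unname(b["(Intercept)"])
b1 <- unname(b["lstat"])

# Print the estimated regression equation
cat(sprintf("medv = %.5f + (%.5f) * lstat\n", b0, b1))
```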

Question 3 (2pts): Interpret the slope. What does it mean in the context of housing? The slope is -0.95005: for every 1 percentage point increase in lstat (the percentage of lower-status population), the median value of owner-occupied homes (medv) is expected to decrease by about 0.95 thousand dollars (roughly $950).

Question 4 (2pts): Is lstat a statistically significant predictor of housing prices? Yes. The p-value for the lstat coefficient is below 2e-16 (t = -24.53), far smaller than 0.05, so lstat is a statistically significant predictor of housing prices.
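The significance check can also be done programmatically. The sketch below reads the p-value from the coefficient table and, as an optional extra that the assignment does not ask for, computes a 95% confidence interval for the slope with confint(). The names coef_table, p_lstat, and ci_slope are illustrative.

```r
library(MASS)
model <- lm(medv ~ lstat, data = Boston)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(model)$coefficients
p_lstat <- coef_table["lstat", "Pr(>|t|)"]

# 95% confidence interval for the slope (lower and upper bounds)
ci_slope <- confint(model, "lstat", level = 0.95)

p_lstat < 0.05   # TRUE when lstat is significant at the 5% level
ci_slope
```

An interval for the slope that excludes 0 leads to the same conclusion as the t-test at the matching level.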

Question 5 (2pts): What is the R-squared? Is this model a good fit? R-squared = 0.5441, so lstat alone explains about 54% of the variation in medv. This is a moderately good fit for a single-variable regression.
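To make the meaning of R-squared concrete, the sketch below recomputes it as 1 minus the ratio of the residual sum of squares to the total sum of squares, then compares it with the value stored in the model summary. The names rss, tss, r2_manual, and r2_summary are illustrative.

```r
library(MASS)
model <- lm(medv ~ lstat, data = Boston)

# Residual sum of squares: variation left unexplained by the model
rss <- sum(residuals(model)^2)

# Total sum of squares: variation of medv around its mean
tss <- sum((Boston$medv - mean(Boston$medv))^2)

r2_manual <- 1 - rss / tss
r2_summary <- summary(model)$r.squared

all.equal(r2_manual, r2_summary)
```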

Part 3: Analyze Fit

# Residual plot
plot(model$fitted.values, model$residuals,
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "gray")

# QQ Plot
qqnorm(model$residuals)
qqline(model$residuals)

Question 6 (4pts): Look at the residual plot and QQ plot. Are the assumptions of linear regression reasonably met? No, the assumptions of linear regression are not reasonably met, so the simple linear regression model is not appropriate for this data.

How to Check These Assumptions with Plots

Remember the 4 key assumptions of simple linear regression: linearity, independence of errors, homoscedasticity (equal variance), and normality of residuals.

Residual Plot: Check Linearity & Homoscedasticity

  • Random scatter (no pattern) around 0 line: Linearity and homoscedasticity likely met
  • Curved pattern (e.g., smile or frown): Violation of linearity
  • Funnel shape (wider spread as fitted values increase): Violation of equal variance (heteroscedasticity)
  • Clusters or gaps: May suggest missing variables or non-independence

QQ Plot: Check Normality of Residuals

  • Points follow straight line: Residuals are approximately normal
  • S-shaped curve or heavy tail divergence: Violation of normality (outliers or skewed residuals)
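As an alternative sketch to hand-built plots, base R's plot() method for lm objects can draw both diagnostics described above. Here which = 1 requests the residuals-vs-fitted plot (with a smoother that helps reveal curvature) and which = 2 requests the normal QQ plot of standardized residuals. The side-by-side layout is a choice made for this illustration.

```r
library(MASS)
model <- lm(medv ~ lstat, data = Boston)

# Show the two diagnostic plots side by side
old_par <- par(mfrow = c(1, 2))
plot(model, which = 1)  # Residuals vs Fitted: linearity, equal variance
plot(model, which = 2)  # Normal Q-Q: normality of residuals
par(old_par)            # restore the previous layout
```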

Part 4: Prediction

Question 7 (4pts): What is the predicted value of a house when lstat = 12? Provide a confidence interval. The predicted value when lstat = 12 is 23.15 (about $23,150), with a 95% confidence interval for the mean value of (22.61, 23.70).

# Predict medv for a neighborhood with lstat = 12
new_data <- data.frame(lstat = 12)
predict(model, new_data, interval = "confidence")
##        fit      lwr      upr
## 1 23.15325 22.60809 23.69841
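The interval above is for the mean medv across neighborhoods with lstat = 12. As a hedged sketch, the code below reproduces the point estimate by hand from the coefficients and also requests a prediction interval, which covers a single new neighborhood and is therefore wider. The prediction interval goes beyond what the question asks, and the names conf_int, pred_int, and manual_fit are illustrative.

```r
library(MASS)
model <- lm(medv ~ lstat, data = Boston)
new_data <- data.frame(lstat = 12)

# Point estimate by hand: intercept + slope * 12
manual_fit <- sum(coef(model) * c(1, 12))

# Interval for the mean response at lstat = 12
conf_int <- predict(model, new_data, interval = "confidence", level = 0.95)

# Interval for one new observation at lstat = 12 (wider)
pred_int <- predict(model, new_data, interval = "prediction", level = 0.95)

manual_fit
conf_int
pred_int
```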