Kaggle Link: https://www.kaggle.com/datasets/sakshisatre/the-boston-housing-dataset

file <- "C:\\Users\\Jonathan Burns\\OneDrive\\Documents\\Masters Data Science\\Spring 2024\\DATA 605\\Boston (1).csv"
df <- read.csv(file)

The dataset has more variables, but I picked the five I thought would have the most impact on MEDV.

model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)
summary(model)
## 
## Call:
## lm(formula = MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.614  -2.911  -0.833   1.987  40.959 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.796901   3.236248  -3.645 0.000295 ***
## CRIM         -0.140933   0.037531  -3.755 0.000194 ***
## AGE          -0.079425   0.014272  -5.565 4.28e-08 ***
## DIS          -0.944654   0.194106  -4.867 1.52e-06 ***
## RM            7.731058   0.390879  19.779  < 2e-16 ***
## TAX          -0.011553   0.002144  -5.389 1.09e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.856 on 500 degrees of freedom
## Multiple R-squared:  0.5986, Adjusted R-squared:  0.5946 
## F-statistic: 149.2 on 5 and 500 DF,  p-value: < 2.2e-16

Looking at the summary table, the p-values for all five predictor variables are well below 0.001, which is promising. Moreover, the median residual (-0.833) is close to zero, and the Multiple R-squared shows that around 60% of the variance in MEDV is explained by the predictor variables. Lastly, the Adjusted R-squared (0.5946) is almost identical to the Multiple R-squared (0.5986), which suggests that the variables are contributing to the regression rather than just inflating the fit.
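To make the R-squared comparison concrete, here is a minimal sketch that pulls both values out of the summary and recomputes the adjusted value by hand. It assumes the `Boston` data frame from the MASS package is the same data as the Kaggle CSV (with lower-case column names, upper-cased here to match); the names `boston`, `fit`, and `adj_manual` are just illustrative.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used above.
data(Boston, package = "MASS")
boston <- Boston
names(boston) <- toupper(names(boston)) # match the CSV's upper-case names

fit <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = boston)
s <- summary(fit)

n <- nrow(boston)          # number of observations
p <- length(coef(fit)) - 1 # number of predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for every extra predictor
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

c(r_squared = s$r.squared,
  adj_r_squared = s$adj.r.squared,
  adj_manual = adj_manual)
```

If the two R-squared values are close, the penalty for the extra predictors is small relative to what they explain.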

plot(model, which = 1) # Residual vs. fitted plot

Residuals vs Fitted:

The linear model works reasonably well here, but the shape of the plot makes me think there is a non-linear relationship in the data, so linear regression might not be the best way to model this.

plot(model, which = 2) # Normal Q-Q plot

QQ Plot:

This shows that the residuals deviate from the reference line, with some serious deviation toward the upper end of the plot.
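The visual read of the QQ plot can be backed up with a formal normality test. This is a sketch using the Shapiro-Wilk test from base R, again assuming MASS's `Boston` data matches the Kaggle CSV; `boston`, `fit`, and `sw` are illustrative names.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used above.
data(Boston, package = "MASS")
boston <- Boston
names(boston) <- toupper(names(boston))

fit <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = boston)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
sw$statistic # W statistic; values near 1 are consistent with normality
sw$p.value   # a small p-value is evidence against normal residuals
```

With roughly 500 observations the test is sensitive to small departures, so it is best read alongside the plot rather than instead of it.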

plot(model, which = 3) # Scale-Location plot

Scale-Location:

This shows two issues: first, the residuals are clustered rather than evenly spread, and second, the trend line is not linear, which suggests the variance of the residuals may not be constant.
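One way to check the spread of the residuals numerically is a hand-rolled version of the studentized Breusch-Pagan idea: regress the squared residuals on the same predictors and compare n times that R-squared to a chi-squared distribution. This is only a sketch using base R, assuming MASS's `Boston` data matches the Kaggle CSV; `RESID_SQ`, `aux`, and `lm_stat` are names made up for the example.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used above.
data(Boston, package = "MASS")
boston <- Boston
names(boston) <- toupper(names(boston))

fit <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = boston)

# Auxiliary regression: do the predictors explain the squared residuals?
boston$RESID_SQ <- resid(fit)^2
aux <- lm(RESID_SQ ~ CRIM + AGE + DIS + RM + TAX, data = boston)

n <- nrow(boston)
lm_stat <- n * summary(aux)$r.squared # LM statistic, n * R-squared
p_value <- pchisq(lm_stat, df = 5, lower.tail = FALSE) # 5 predictors

c(lm_stat = lm_stat, p_value = p_value)
```

A small p-value would support the impression from the plot that the residual variance changes with the predictors.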

plot(model, which = 4) # Cook's distance plot

plot(model, which = 5) # Residuals vs. leverage plot

Planning to ask about Cook’s Distance and Residuals vs. Leverage in class.
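In the meantime, the numbers behind these two plots can be pulled out directly. The sketch below uses `cooks.distance()` and `hatvalues()` from base R and flags points with two common rules of thumb (4/n for Cook's distance, 2k/n for leverage); those cutoffs are conventions, not hard rules. It assumes MASS's `Boston` data matches the Kaggle CSV, and the object names are illustrative.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used above.
data(Boston, package = "MASS")
boston <- Boston
names(boston) <- toupper(names(boston))

fit <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = boston)

cd <- cooks.distance(fit) # how much each point shifts the fitted model
lev <- hatvalues(fit)     # leverage: how unusual each point's predictors are

n <- nrow(boston)
k <- length(coef(fit))    # number of coefficients, including the intercept

# Rule-of-thumb cutoffs (assumptions, not strict thresholds)
flag_cook <- which(cd > 4 / n)
flag_lev <- which(lev > 2 * k / n)

head(sort(cd, decreasing = TRUE), 5) # the five most influential rows
length(flag_cook)
length(flag_lev)
```

Points that are flagged on both measures are the ones worth inspecting first, since they combine unusual predictor values with a large effect on the fit.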

hist(resid(model), breaks = 15, main = "Resid Hist")

The residuals show right skewness, but the bulk of them are roughly normally distributed.

Was this the correct model?

I do not believe this was the correct model, and a polynomial regression should be considered for any further research.
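As a starting point for that follow-up, here is a minimal sketch that adds a quadratic term and compares it to the linear model with an F-test. Putting the quadratic on RM is my assumption for illustration (it had the largest t value above), not something the diagnostics prove; the data is again assumed to match MASS's `Boston`, and `fit_lin` and `fit_poly` are illustrative names.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used above.
data(Boston, package = "MASS")
boston <- Boston
names(boston) <- toupper(names(boston))

# Original linear model
fit_lin <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = boston)

# Same model with a second-degree polynomial in RM (hypothetical choice)
fit_poly <- lm(MEDV ~ CRIM + AGE + DIS + poly(RM, 2) + TAX, data = boston)

# F-test for nested models: does the quadratic term improve the fit?
anova(fit_lin, fit_poly)

# Adjusted R-squared for both models, side by side
c(linear = summary(fit_lin)$adj.r.squared,
  quadratic = summary(fit_poly)$adj.r.squared)
```

Re-running the residual plots on `fit_poly` would show whether the pattern in the Residuals vs Fitted plot goes away.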