Kaggle Link: https://www.kaggle.com/datasets/sakshisatre/the-boston-housing-dataset
file = "C:\\Users\\Jonathan Burns\\OneDrive\\Documents\\Masters Data Science\\Spring 2024\\DATA 605\\Boston (1).csv"
df <- read.csv(file)
The dataset has quite a few more variables, but I picked the five I thought would be the most impactful.
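One way to sanity-check a hand-picked set of predictors is to rank every column by its correlation with MEDV. The snippet below is only a sketch: it assumes `MASS::Boston` (lower-case column names, renamed here to upper case) is an acceptable stand-in for the Kaggle CSV, which may not match it exactly.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used in this report.
df <- MASS::Boston
names(df) <- toupper(names(df))

# Correlation of each column with MEDV, dropping MEDV itself
cors <- cor(df)[, "MEDV"]
cors <- cors[names(cors) != "MEDV"]

# Rank by absolute strength, strongest first
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Pairwise correlation ignores overlap between predictors, so this is a screening aid rather than a substitute for judgment.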
model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)
summary(model)
##
## Call:
## lm(formula = MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.614  -2.911  -0.833   1.987  40.959
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.796901   3.236248  -3.645 0.000295 ***
## CRIM         -0.140933   0.037531  -3.755 0.000194 ***
## AGE          -0.079425   0.014272  -5.565 4.28e-08 ***
## DIS          -0.944654   0.194106  -4.867 1.52e-06 ***
## RM            7.731058   0.390879  19.779  < 2e-16 ***
## TAX          -0.011553   0.002144  -5.389 1.09e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.856 on 500 degrees of freedom
## Multiple R-squared: 0.5986, Adjusted R-squared: 0.5946
## F-statistic: 149.2 on 5 and 500 DF, p-value: < 2.2e-16
Looking at the summary table, the p-values for all five predictor variables are well below 0.001, which is promising. Moreover, the median residual (-0.833) is close to zero, and the Multiple R-squared shows that around 60% of the variance in MEDV is explained by the predictor variables. Lastly, the Adjusted R-squared (0.5946) is almost identical to the Multiple R-squared (0.5986), which suggests that the predictors are contributing to the regression rather than just inflating the fit.
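To see where the Adjusted R-squared comes from, it can be recomputed by hand as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below again assumes `MASS::Boston` as a stand-in for the CSV, and adds `drop1()` as one possible way to check whether each predictor earns its place.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used in this report.
df <- MASS::Boston
names(df) <- toupper(names(df))
model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)

s <- summary(model)
n <- nobs(model)                 # number of observations
p <- length(coef(model)) - 1     # number of predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
print(c(manual = adj_manual, reported = s$adj.r.squared))

# F-test for removing each predictor in turn from the full model
print(drop1(model, test = "F"))
```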
plot(model, which = 1) # Residual vs. fitted plot
The linear model works reasonably well here, but the shape of the residual trend makes me think there is a non-linear relationship in the data, so linear regression might not be the best way to model it.
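To narrow down which predictor might be behind the non-linear pattern, the residuals can be plotted against each predictor with a smoothed trend line. This is a sketch rather than a definitive diagnostic, and it assumes `MASS::Boston` as a stand-in for the CSV.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used in this report.
df <- MASS::Boston
names(df) <- toupper(names(df))
model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)

res <- resid(model)
op <- par(mfrow = c(2, 3))       # grid of panels, one per predictor
for (v in c("CRIM", "AGE", "DIS", "RM", "TAX")) {
  plot(df[[v]], res, xlab = v, ylab = "Residual",
       main = paste("Residuals vs.", v))
  lines(lowess(df[[v]], res), col = "red")   # smoothed trend
  abline(h = 0, lty = 2)                     # reference line at zero
}
par(op)                          # restore the previous plot layout
```

A red trend that bends away from the dashed zero line in one panel points to that predictor as a candidate for a transformation or a polynomial term.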
plot(model, which = 2) # Normal Q-Q plot
This shows that the residuals deviate from the normal reference line, with some serious deviation toward the upper end of the plot.
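A numerical complement to the Q-Q plot is the Shapiro-Wilk test, where a small p-value is evidence against normally distributed residuals. The sketch assumes `MASS::Boston` as a stand-in for the CSV.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used in this report.
df <- MASS::Boston
names(df) <- toupper(names(df))
model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(resid(model))
print(sw)
```

With several hundred observations the test can flag even mild departures, so it is best read alongside the plot rather than on its own.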
plot(model, which = 3) # Scale-Location plot
This shows two issues: first, the residuals are clustered rather than evenly spread, and second, the trend line is not linear. Together these suggest that the variance of the residuals is not constant.
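The impression of non-constant variance can be checked with a Breusch-Pagan style test. The version sketched here is the studentized (Koenker) variant computed by hand in base R: regress the squared residuals on the predictors and compare n times the auxiliary R-squared to a chi-squared distribution. It assumes `MASS::Boston` as a stand-in for the CSV.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used in this report.
df <- MASS::Boston
names(df) <- toupper(names(df))
model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)

# Auxiliary regression: squared residuals on the same predictors
e2 <- resid(model)^2
aux <- lm(e2 ~ CRIM + AGE + DIS + RM + TAX, data = df)

bp_stat <- nobs(aux) * summary(aux)$r.squared   # LM statistic: n * R^2
bp_df   <- length(coef(aux)) - 1                # one df per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(statistic = bp_stat, df = bp_df, p.value = bp_p))
```

A small p-value would support the reading that the residual variance changes with the predictors.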
plot(model, which = 4) # Cook's distance
plot(model, which = 5) # Residuals vs. leverage
I am planning to ask about Cook’s Distance and the Residuals vs. Leverage plot in class.
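As a starting point for that question: Cook's distance measures how much the fitted values would change if a single observation were dropped, and leverage (the diagonal of the hat matrix) measures how unusual an observation's predictor values are. The sketch below pulls both out of the model. The cutoffs 4/n and 2k/n are common rules of thumb and not hard limits, and `MASS::Boston` is assumed as a stand-in for the CSV.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used in this report.
df <- MASS::Boston
names(df) <- toupper(names(df))
model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)

cd  <- cooks.distance(model)   # influence of each observation on the fit
lev <- hatvalues(model)        # leverage of each observation
n <- nobs(model)
k <- length(coef(model))       # number of coefficients, intercept included

# Rule-of-thumb flags (assumed cutoffs, not hard limits)
flag_cd  <- which(cd > 4 / n)
flag_lev <- which(lev > 2 * k / n)

print(head(sort(cd, decreasing = TRUE)))   # most influential rows
print(length(flag_cd))
print(length(flag_lev))
```

Points that combine high leverage with a large residual are the ones that land in the corners of the Residuals vs. Leverage plot.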
hist(resid(model), breaks = 15, main = "Resid Hist")
The residuals show right skewness, but the bulk of them are approximately normally distributed.
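The skewness seen in the histogram can be put into a number with the moment coefficient of skewness, which is positive for a right-skewed sample and zero for a symmetric one. `skewness_moment` is a hypothetical helper written for this sketch, and `MASS::Boston` is assumed as a stand-in for the CSV.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used in this report.
df <- MASS::Boston
names(df) <- toupper(names(df))
model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)

# Hypothetical helper: third central moment over the 1.5 power of the second
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew <- skewness_moment(resid(model))
print(skew)
```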
I do not believe a linear model was the correct choice here, and a polynomial regression should be considered in any further research.
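A minimal sketch of what that follow-up could look like: add a quadratic term to one predictor and compare the two nested models with an F-test. Choosing RM for the squared term is an assumption made purely for illustration, as is the use of `MASS::Boston` as a stand-in for the CSV.

```r
# Assumption: MASS::Boston stands in for the Kaggle CSV used in this report.
df <- MASS::Boston
names(df) <- toupper(names(df))
model <- lm(MEDV ~ CRIM + AGE + DIS + RM + TAX, data = df)

# Assumption: RM is the predictor given a quadratic term
model_poly <- lm(MEDV ~ CRIM + AGE + DIS + poly(RM, 2) + TAX, data = df)

# F-test comparing the linear model with the quadratic one
cmp <- anova(model, model_poly)
print(cmp)

# Adjusted R-squared for both models, side by side
print(c(linear    = summary(model)$adj.r.squared,
        quadratic = summary(model_poly)$adj.r.squared))
```

If the F-test is significant and the diagnostic plots for `model_poly` look flatter, that would support the polynomial approach.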