summary(house)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
Since there are over 20,000 observations originally, we will drop the rows with missing values and then randomly narrow the data down to 2,000 observations. After that, we will divide the data into a training set and a test set with a ratio of 7:3.
house <- na.omit(house)
set.seed(123)
# Randomly select 2,000 observations.
house <- house %>% sample_n(2000)
# Create training and testing sets using the indices
train_indices <- sample(seq_len(nrow(house)), size = floor(0.7 * nrow(house)))
train <- house[train_indices, ]
test <- house[-train_indices, ]
summary(train)
## longitude latitude housing_median_age total_rooms
## Min. :-124.2 Min. :32.57 Min. : 2.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1424
## Median :-118.4 Median :34.23 Median :30.00 Median : 2045
## Mean :-119.5 Mean :35.58 Mean :29.11 Mean : 2591
## 3rd Qu.:-118.0 3rd Qu.:37.68 3rd Qu.:38.00 3rd Qu.: 3062
## Max. :-114.5 Max. :41.88 Max. :52.00 Max. :22128
## total_bedrooms population households median_income
## Min. : 2.0 Min. : 6 Min. : 2.0 Min. : 0.4999
## 1st Qu.: 292.0 1st Qu.: 777 1st Qu.: 279.8 1st Qu.: 2.5777
## Median : 419.5 Median : 1132 Median : 392.5 Median : 3.5878
## Mean : 529.1 Mean : 1398 Mean : 491.5 Mean : 3.8525
## 3rd Qu.: 625.5 3rd Qu.: 1663 3rd Qu.: 581.0 3rd Qu.: 4.6935
## Max. :4457.0 Max. :11139 Max. :4204.0 Max. :15.0001
## median_house_value ocean_proximity
## Min. : 14999 Length:1400
## 1st Qu.:115800 Class :character
## Median :181950 Mode :character
## Mean :205992
## 3rd Qu.:263800
## Max. :500001
Let’s take a look at the histograms of our data and decide which variables to drop and which to transform.
I decided not to include the coordinate variables (longitude and latitude) in the model because they may not have a linear relationship with our dependent variable, the median house value.
Since our y variable is only moderately right-skewed, a sqrt() transformation should be enough to bring it closer to normality.
train %>% ggplot(aes(sqrt(median_house_value)))+
geom_histogram(bins = 30, fill = "turquoise", color = "black")
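As a numeric companion to the histogram, the effect of a square-root transformation on skewness can be sketched as follows. This is only an illustration: the data are simulated right-skewed values standing in for median_house_value, and the `skewness()` helper is a hypothetical function written for this example, not part of the analysis.

```r
# Simulated right-skewed values standing in for median_house_value (assumption)
set.seed(1)
y <- rlnorm(1000, meanlog = 12, sdlog = 0.5)

# Hypothetical helper: sample skewness as the third standardised moment
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw  <- skewness(y)
skew_sqrt <- skewness(sqrt(y))

# A value closer to 0 means a more symmetric distribution
print(c(raw = skew_raw, sqrt = skew_sqrt))
```

Applied to the real data, the same helper could be called on train$median_house_value before and after sqrt().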
For the other variables, here are their plots against our y variable after transformation.
train %>% ggplot(aes(sqrt(total_rooms), sqrt(median_house_value)))+
geom_point()+geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
train %>% ggplot(aes(sqrt(total_bedrooms), sqrt(median_house_value)))+
geom_point()+geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
train %>% ggplot(aes(sqrt(population), sqrt(median_house_value)))+
geom_point()+geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
train %>% ggplot(aes(sqrt(households), sqrt(median_house_value)))+
geom_point()+geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
train %>% ggplot(aes(log(median_income), sqrt(median_house_value)))+
geom_point()+geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
We can see from these scatter plots that the transformed variables still maintain a roughly linear relationship with our y variable.
We also suspected an interaction between income and total rooms, so we split the training data into three income bands and plotted each one separately.
income1 <- train %>% filter(median_income < 4)
income2 <- train %>% filter(median_income >= 4 & median_income < 8)
income3 <- train %>% filter(median_income >= 8)
income1 %>% ggplot(aes(total_rooms,median_house_value)) +
geom_point() +geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
income2 %>% ggplot(aes(total_rooms,median_house_value)) +
geom_point() +geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
income3 %>% ggplot(aes(total_rooms,median_house_value)) +
geom_point() +geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
For the first interval (income < 4), the best-fit line has a noticeable upward slope. For the two remaining intervals, an upward slope can also be seen, but it is much flatter. Because the slope of the median housing price on total rooms changes from one income band to the next, the effect of total rooms appears to depend on income, which is what an interaction term captures.
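One way to check an interaction more formally than by eye is to compare a main-effects model with a model that adds the product term. The sketch below uses simulated data in which the rooms slope shrinks as income grows; the variable names and coefficients are assumptions made for illustration and are not estimates from the housing data.

```r
set.seed(2)
n <- 500
rooms  <- runif(n, 500, 5000)
income <- runif(n, 1, 10)

# Simulated response: the effect of rooms gets weaker as income rises (assumption)
value <- 50000 + 40 * rooms + 20000 * income - 3 * rooms * income +
  rnorm(n, sd = 20000)
sim <- data.frame(value, rooms, income)

fit_main <- lm(value ~ rooms + income, data = sim)   # no interaction
fit_int  <- lm(value ~ rooms * income, data = sim)   # adds rooms:income

# F-test for whether the interaction term improves the fit
cmp <- anova(fit_main, fit_int)
print(cmp)
p_interaction <- cmp[["Pr(>F)"]][2]
```

A small p-value in the second row of the table supports keeping the interaction term.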
# Outlier removal
exclude_rows_with_outliers <- function(df, columns, k = 1.5) {
  # TRUE marks a row that is an outlier in at least one of the columns
  outlier_indices <- logical(nrow(df))
  for (col in columns) {
    Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
    Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
    iqr <- Q3 - Q1  # lower-case name so it is not confused with stats::IQR()
    lower_bound <- Q1 - k * iqr
    upper_bound <- Q3 + k * iqr
    outlier_indices <- outlier_indices | (df[[col]] < lower_bound | df[[col]] > upper_bound)
  }
  return(df[!outlier_indices, ])
}
columns_with_outliers <- c("total_rooms", "total_bedrooms", "population", "households", "median_income", "median_house_value")
H1 <- exclude_rows_with_outliers(train, columns_with_outliers, k = 1.5)
# Transform data
# Note: the assignment below starts again from `train` and overwrites the filtered H1,
# so the models that follow are fitted on all 1,400 training rows.
H1 <- train %>%
mutate(
SRoom = sqrt(total_rooms),
SBed = sqrt(total_bedrooms),
SPop = sqrt(population),
SHouse = sqrt(households),
LIncome = log(median_income),
SValue = sqrt(median_house_value),
)
H2 <- H1 %>%
dplyr::select(SValue, longitude, latitude, housing_median_age,SRoom, SBed, SPop, SHouse ,LIncome )
# Update test set
T1 <- test %>%
mutate(
SRoom = sqrt(total_rooms),
SBed = sqrt(total_bedrooms),
SPop = sqrt(population),
SHouse = sqrt(households),
LIncome = log(median_income),
SValue = sqrt(median_house_value),
)
T2 <- T1 %>%
dplyr::select(SValue, longitude, latitude, housing_median_age,SRoom, SBed, SPop, SHouse ,LIncome )
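The IQR rule inside exclude_rows_with_outliers() can be checked by hand on a toy vector. The numbers below are made up for illustration; only the rule itself (Q1 - k * IQR and Q3 + k * IQR with k = 1.5) comes from the function above.

```r
# Toy data with one obvious outlier (illustrative values, not from the housing data)
x <- c(10, 12, 13, 15, 16, 18, 20, 100)
k <- 1.5

q1  <- quantile(x, 0.25, names = FALSE)   # 12.75
q3  <- quantile(x, 0.75, names = FALSE)   # 18.5
iqr <- q3 - q1                            # 5.75

lower <- q1 - k * iqr                     # 4.125
upper <- q3 + k * iqr                     # 27.125

# Values outside [lower, upper] are treated as outliers
flagged <- x[x < lower | x > upper]
print(flagged)                            # 100
```

With k = 1.5 only the value 100 falls outside the bounds; a larger k makes the rule more tolerant.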
We then run best-subsets and stepwise regression to find the best variables for our model.
best<-regsubsets(SValue ~ housing_median_age + SRoom + SBed+ SPop +SHouse +LIncome, H2)
summary(best)
## Subset selection object
## Call: regsubsets.formula(SValue ~ housing_median_age + SRoom + SBed +
## SPop + SHouse + LIncome, H2)
## 6 Variables (and intercept)
## Forced in Forced out
## housing_median_age FALSE FALSE
## SRoom FALSE FALSE
## SBed FALSE FALSE
## SPop FALSE FALSE
## SHouse FALSE FALSE
## LIncome FALSE FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: exhaustive
## housing_median_age SRoom SBed SPop SHouse LIncome
## 1 ( 1 ) " " " " " " " " " " "*"
## 2 ( 1 ) "*" " " " " " " " " "*"
## 3 ( 1 ) " " "*" "*" " " " " "*"
## 4 ( 1 ) "*" " " " " "*" "*" "*"
## 5 ( 1 ) "*" "*" " " "*" "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" "*" "*"
stepwise <- step(lm(SValue ~ housing_median_age + SRoom + SBed+ SPop +SHouse +LIncome, H2))
## Start: AIC=12318.39
## SValue ~ housing_median_age + SRoom + SBed + SPop + SHouse +
## LIncome
##
## Df Sum of Sq RSS AIC
## <none> 9184968 12318
## - SBed 1 126535 9311504 12336
## - SHouse 1 193002 9377971 12346
## - SRoom 1 439999 9624967 12382
## - housing_median_age 1 442803 9627772 12382
## - SPop 1 518330 9703299 12393
## - LIncome 1 7517068 16702037 13154
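The stepwise search (backward elimination by AIC) did not drop any variable, and the largest best-subsets model contains the same six predictors, so our starting model will include housing_median_age + SRoom + SBed + SPop + SHouse + LIncome.
We then compute the correlation matrix to see the correlations between these variables.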
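The AIC column printed by step() can be reproduced from the residual sum of squares. For a linear model, step() relies on extractAIC(), which uses n * log(RSS / n) + 2 * p, where p counts the estimated coefficients including the intercept. The sketch below plugs in the RSS values from the output above, with n = 1400 training rows.

```r
n <- 1400                 # rows in the training set

# Full model: six predictors plus the intercept
rss_full <- 9184968
p_full   <- 7
aic_full <- n * log(rss_full / n) + 2 * p_full

# Model without SBed: five predictors plus the intercept
rss_no_sbed <- 9311504
p_no_sbed   <- 6
aic_no_sbed <- n * log(rss_no_sbed / n) + 2 * p_no_sbed

print(round(c(full = aic_full, without_SBed = aic_no_sbed), 2))
```

Both values agree with the printed table (12318.39 for the full model and about 12336 without SBed), which is why no variable is removed: every deletion raises the AIC.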
cor(H2)
## SValue longitude latitude housing_median_age
## SValue 1.00000000 -0.04990532 -0.169918377 0.087492767
## longitude -0.04990532 1.00000000 -0.922857820 -0.104619587
## latitude -0.16991838 -0.92285782 1.000000000 0.003312031
## housing_median_age 0.08749277 -0.10461959 0.003312031 1.000000000
## SRoom 0.19542198 0.03877581 -0.026579782 -0.409897025
## SBed 0.08981654 0.07665027 -0.063943521 -0.361738410
## SPop 0.02502464 0.10923136 -0.112296442 -0.344859591
## SHouse 0.12149813 0.06780751 -0.079055226 -0.333302955
## LIncome 0.69112079 -0.05079379 -0.069642869 -0.107500239
## SRoom SBed SPop SHouse LIncome
## SValue 0.19542198 0.08981654 0.02502464 0.12149813 0.69112079
## longitude 0.03877581 0.07665027 0.10923136 0.06780751 -0.05079379
## latitude -0.02657978 -0.06394352 -0.11229644 -0.07905523 -0.06964287
## housing_median_age -0.40989702 -0.36173841 -0.34485959 -0.33330296 -0.10750024
## SRoom 1.00000000 0.93761712 0.86797360 0.92859091 0.26636941
## SBed 0.93761712 1.00000000 0.89641159 0.97977673 0.02891163
## SPop 0.86797360 0.89641159 1.00000000 0.92404805 0.05007196
## SHouse 0.92859091 0.97977673 0.92404805 1.00000000 0.06705520
## LIncome 0.26636941 0.02891163 0.05007196 0.06705520 1.00000000
We can see that the four variables SRoom, SBed, SPop and SHouse are very highly correlated with one another (pairwise correlations between 0.87 and 0.98), so we will pick one of them to drop.
reg1 <- lm(SValue ~ housing_median_age + SBed + SPop + SHouse + LIncome, H2)
summary(reg1)
##
## Call:
## lm(formula = SValue ~ housing_median_age + SBed + SPop + SHouse +
## LIncome, data = H2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -382.32 -58.36 -4.89 51.05 464.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.1335 12.7244 9.048 < 2e-16 ***
## housing_median_age 1.8878 0.1959 9.635 < 2e-16 ***
## SBed -0.6015 1.5475 -0.389 0.698
## SPop -5.0754 0.4861 -10.441 < 2e-16 ***
## SHouse 10.9851 1.8771 5.852 6.04e-09 ***
## LIncome 183.9799 4.8616 37.844 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.09 on 1394 degrees of freedom
## Multiple R-squared: 0.5578, Adjusted R-squared: 0.5562
## F-statistic: 351.7 on 5 and 1394 DF, p-value: < 2.2e-16
reg2 <- lm(SValue ~ housing_median_age + SRoom + SPop + SHouse + LIncome, H2)
summary(reg2)
##
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SPop + SHouse +
## LIncome, data = H2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -399.68 -56.60 -6.37 49.81 441.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.4260 12.1076 9.038 < 2e-16 ***
## housing_median_age 1.5329 0.1955 7.841 8.84e-15 ***
## SRoom -2.9825 0.4346 -6.862 1.02e-11 ***
## SPop -4.7935 0.4741 -10.111 < 2e-16 ***
## SHouse 16.1889 1.1701 13.836 < 2e-16 ***
## LIncome 205.6736 5.6080 36.675 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.73 on 1394 degrees of freedom
## Multiple R-squared: 0.5722, Adjusted R-squared: 0.5706
## F-statistic: 372.9 on 5 and 1394 DF, p-value: < 2.2e-16
reg3 <- lm(SValue ~ housing_median_age + SRoom + SBed + SHouse + LIncome, H2)
summary(reg3)
##
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SBed + SHouse +
## LIncome, data = H2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -429.32 -59.84 -7.85 51.45 325.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.0159 12.7419 4.867 1.26e-06 ***
## housing_median_age 1.7593 0.1992 8.831 < 2e-16 ***
## SRoom -5.2383 0.5322 -9.843 < 2e-16 ***
## SBed 11.9504 1.8479 6.467 1.38e-10 ***
## SHouse 1.3146 1.6026 0.820 0.412
## LIncome 231.4206 6.5962 35.084 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.43 on 1394 degrees of freedom
## Multiple R-squared: 0.5542, Adjusted R-squared: 0.5526
## F-statistic: 346.6 on 5 and 1394 DF, p-value: < 2.2e-16
reg4 <- lm(SValue ~ housing_median_age + SRoom + SBed + SPop + LIncome, H2)
summary(reg4)
##
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SBed + SPop +
## LIncome, data = H2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -431.48 -55.95 -5.25 50.61 406.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.6959 12.6830 6.047 1.89e-09 ***
## housing_median_age 1.7756 0.1939 9.159 < 2e-16 ***
## SRoom -4.5130 0.5324 -8.477 < 2e-16 ***
## SBed 15.8598 1.1815 13.424 < 2e-16 ***
## SPop -2.9077 0.4151 -7.004 3.86e-12 ***
## LIncome 227.7889 6.4634 35.243 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.02 on 1394 degrees of freedom
## Multiple R-squared: 0.5691, Adjusted R-squared: 0.5676
## F-statistic: 368.3 on 5 and 1394 DF, p-value: < 2.2e-16
We can see that dropping SBed has the least effect on our R-squared value (0.5722, against 0.5578, 0.5542 and 0.5691 for the other three candidates), so we will stick with that model.
reg <- lm(SValue ~ housing_median_age + SRoom + SPop + SHouse + LIncome, H2)
summary(reg)
##
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SPop + SHouse +
## LIncome, data = H2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -399.68 -56.60 -6.37 49.81 441.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.4260 12.1076 9.038 < 2e-16 ***
## housing_median_age 1.5329 0.1955 7.841 8.84e-15 ***
## SRoom -2.9825 0.4346 -6.862 1.02e-11 ***
## SPop -4.7935 0.4741 -10.111 < 2e-16 ***
## SHouse 16.1889 1.1701 13.836 < 2e-16 ***
## LIncome 205.6736 5.6080 36.675 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.73 on 1394 degrees of freedom
## Multiple R-squared: 0.5722, Adjusted R-squared: 0.5706
## F-statistic: 372.9 on 5 and 1394 DF, p-value: < 2.2e-16
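Pairwise correlations can be complemented by a variance inflation factor (VIF) for each predictor, defined as 1 / (1 - R²) where R² comes from regressing that predictor on all the others. The sketch below computes it in base R on simulated data; the variable names, the data and the `vif_manual()` helper are assumptions for illustration, not results from the housing model.

```r
set.seed(3)
n <- 300
households <- rnorm(n, mean = 20, sd = 5)
bedrooms   <- households + rnorm(n, sd = 1)   # nearly collinear with households
income     <- rnorm(n, mean = 1.3, sd = 0.4)  # unrelated to the other two
sim <- data.frame(households, bedrooms, income)

# Hypothetical helper: VIF of each column, regressing it on the remaining columns
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(sim)
print(round(vifs, 2))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity. The same helper could be applied to the predictor columns of H2.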
We will take a look at two main assumptions: normality and homoscedasticity.
For normality, we will look at the histogram of residuals and conduct a Shapiro-Wilk test.
H3 <- H2 %>% mutate(res=residuals(reg),fit=fitted.values(reg))
H3 %>% ggplot(aes(res))+
geom_histogram(bins = 30, fill = "turquoise", color = "black") + labs(title = "Histogram of residuals")
shapiro.test(reg$residuals)
##
## Shapiro-Wilk normality test
##
## data: reg$residuals
## W = 0.98519, p-value = 9.033e-11
Even though the histogram has a bell shape, the Shapiro-Wilk test gave a p-value lower than 0.05, so the null hypothesis of normality is rejected. However, the Shapiro-Wilk test is very sensitive at large sample sizes: with 1,400 residuals it flags even small departures from normality, and the statistic W = 0.98519 is close to 1. We therefore treat the residuals as approximately normal, based on the visual check.
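A normal Q-Q plot is another visual check that is less affected by sample size than a formal test. The sketch below uses a simulated regression (an assumption, not the housing model) and summarises the plot by the correlation between theoretical and sample quantiles, which is close to 1 when the residuals are approximately normal.

```r
set.seed(4)
n <- 500
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)   # simulated data with normal errors
fit <- lm(y ~ x)

res <- rstandard(fit)                 # standardised residuals

# plot.it = FALSE returns the plot coordinates without drawing
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)

# To draw the plot itself: qqnorm(res); qqline(res)
```

For the model in this report, `fit` would be replaced by reg.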
We will now look at a scatter plot for residuals and ncvTest to check for homoscedasticity.
H3 %>% ggplot(aes(fit,res))+
geom_point(color = "turquoise") + labs(title = "Scatterplot of residuals")
ncvTest(reg)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5.479215, Df = 1, p = 0.019244
The residuals in the scatter plot appear to be scattered randomly around the 0 line, which suggests homoscedasticity. However, ncvTest disagrees: its p-value (0.019) is lower than 0.05, so the null hypothesis of homoscedasticity is rejected and the model shows heteroskedasticity. As a first response, we run coeftest with heteroskedasticity-robust (HC4) standard errors. This does not remove the heteroskedasticity, but it makes the inference on the coefficients robust to it.
coeftest(reg, vcov = vcovHC(reg, type = "HC4"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.42604 13.19987 8.2899 2.635e-16 ***
## housing_median_age 1.53286 0.20924 7.3257 4.006e-13 ***
## SRoom -2.98247 0.52721 -5.6570 1.866e-08 ***
## SPop -4.79349 0.78874 -6.0774 1.574e-09 ***
## SHouse 16.18892 1.65877 9.7596 < 2.2e-16 ***
## LIncome 205.67356 6.18572 33.2497 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
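To show what a heteroskedasticity-consistent covariance matrix does, here is a base R sketch of the simplest variant, HC0, computed as (X'X)^-1 X' diag(e²) X (X'X)^-1. Note that this is HC0 rather than the HC4 variant used above (HC4 additionally rescales the squared residuals by leverage), and the data are simulated for illustration.

```r
set.seed(5)
n <- 2000
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = x^2 / 10)   # error spread grows with x
fit <- lm(y ~ x)

X <- model.matrix(fit)
e <- residuals(fit)

bread <- solve(crossprod(X))    # (X'X)^-1
meat  <- crossprod(X * e)       # X' diag(e^2) X (each row of X scaled by its residual)
vcov_hc0 <- bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_robust    <- sqrt(diag(vcov_hc0))
print(rbind(classical = se_classical, robust = se_robust))
```

The coefficient estimates are unchanged; only the standard errors differ, which matches the coeftest output above, where the estimates are identical to those of summary(reg).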
After trying to remove each variable in turn, we found that removing SPop deals with the heteroskedasticity. If we run ncvTest again, the p-value is now above 0.05 (p = 0.276), which means that we can no longer reject the null hypothesis of homoscedasticity. Hence we have our final model.
reg <- lm(SValue ~ housing_median_age + SRoom + SHouse + LIncome, H2)
ncvTest(reg)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.188404, Df = 1, p = 0.27565
summary(reg)
##
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SHouse + LIncome,
## data = H2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -401.53 -61.31 -7.98 52.35 360.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.8748 12.3247 7.049 2.83e-12 ***
## housing_median_age 1.6901 0.2018 8.374 < 2e-16 ***
## SRoom -3.3249 0.4488 -7.409 2.20e-13 ***
## SHouse 9.4917 0.9989 9.502 < 2e-16 ***
## LIncome 210.0397 5.7907 36.272 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.64 on 1395 degrees of freedom
## Multiple R-squared: 0.5408, Adjusted R-squared: 0.5395
## F-statistic: 410.7 on 4 and 1395 DF, p-value: < 2.2e-16
Intercept (86.8748): When all predictors are zero, the expected value of the square-rooted median house value (SValue) is approximately 86.8748. This is an extrapolation, since no area in the training data has zero rooms or zero households.
housing_median_age (1.6901): For each one-unit increase in the median housing age of the area, SValue is expected to increase by approximately 1.6901 units, assuming the other variables are held constant.
SRoom (-3.3249): For each one-unit increase in the square root of total rooms, SValue decreases by about 3.3249 units, holding the other variables constant.
SHouse (9.4917): For each one-unit increase in the square root of the number of households in the area, SValue is expected to increase by approximately 9.4917 units, assuming the other variables are held constant.
LIncome (210.0397): Each one-unit increase in the log-transformed median income is associated with an increase of about 210.0397 units in SValue, keeping the other variables constant.
Residual Standard Error (RSE): The RSE of 84.64 suggests that the actual values of SValue typically deviate from the predicted values by approximately 84.64 units.
Multiple R-squared: With a value of 0.5408, the model explains 54.08% of the variability in SValue, which we consider an acceptable fit.
Adjusted R-squared: Adjusted for the number of predictors, the value of 0.5395 is almost identical to the multiple R-squared, so the fit is not inflated by the number of predictors.
F-Statistic: The high F-statistic of 410.7 and the very low p-value (< 2.2e-16) indicate that the predictors are jointly statistically significant.
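Written out on the original variables, the fitted model and the effect of a 1% rise in median income look as follows. The coefficients are taken from the summary above; the 1% scenario is only a worked example of how to read a coefficient on a log-transformed predictor.

```latex
\sqrt{\widehat{\text{median\_house\_value}}}
  = 86.8748
  + 1.6901\,\text{housing\_median\_age}
  - 3.3249\,\sqrt{\text{total\_rooms}}
  + 9.4917\,\sqrt{\text{households}}
  + 210.0397\,\log(\text{median\_income})

% Effect of a 1% rise in median_income, other predictors held fixed:
\Delta\,\text{SValue}
  = 210.0397 \times \log(1.01)
  \approx 210.0397 \times 0.00995
  \approx 2.09
```

So a 1% increase in median income is associated with an increase of roughly 2.09 units in SValue, not 210 units; the full coefficient corresponds to multiplying income by e (about 2.72).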
We will now take a look at the 95% confidence interval of our coefficients.
confint(reg)
## 2.5 % 97.5 %
## (Intercept) 62.697843 111.051683
## housing_median_age 1.294200 2.086035
## SRoom -4.205203 -2.444502
## SHouse 7.532154 11.451200
## LIncome 198.680244 221.399075
We can see that none of these intervals includes 0, which is consistent with every predictor being statistically significant at the 5% level.
We will now make a prediction for a new set of predictor values from our model.
new <- data.frame(
housing_median_age = c(29),
SRoom = c(47),
SHouse = c(8),
LIncome = c(1.2)
)
predictions <- predict(reg, newdata = new, interval = "prediction", level = 0.95)
print(predictions)
## fit lwr upr
## 1 307.6011 139.6507 475.5515
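The prediction interval (about 139.7 to 475.6 on the SValue scale) is wide. It is not directly comparable with the output of confint, which describes the uncertainty in the coefficients. A prediction interval for a single new observation also includes the residual variation (residual standard error of 84.64), so it is wide by nature.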
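Because the model predicts SValue = sqrt(median_house_value), the prediction can be squared to return to the original units. The sketch below copies the three numbers from the predict() output above. One caveat, stated as a general property rather than something checked here: the squared point prediction is not the conditional mean of the house value, so it is best read as a typical value.

```r
# Values copied from the predict() output above (SValue scale)
pred_sqrt <- c(fit = 307.6011, lwr = 139.6507, upr = 475.5515)

# Squaring is monotone for positive values, so the interval endpoints carry over
pred_value <- pred_sqrt^2
print(round(pred_value))
```

This gives a predicted median house value of about 94,600, with a 95% prediction interval from roughly 19,500 to 226,000.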
# Refit the final specification on the test set to see how stable the estimates are
regTest <- lm(SValue ~ housing_median_age + SRoom + SHouse + LIncome, T2)
summary(regTest)
##
## Call:
## lm(formula = SValue ~ housing_median_age + SRoom + SHouse + LIncome,
## data = T2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -215.40 -57.28 -10.59 46.43 662.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.1427 19.6257 4.338 1.69e-05 ***
## housing_median_age 2.0775 0.3044 6.826 2.16e-11 ***
## SRoom -2.2938 0.6737 -3.405 0.000706 ***
## SHouse 6.6075 1.5578 4.242 2.57e-05 ***
## LIncome 210.8892 9.2086 22.901 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 86 on 595 degrees of freedom
## Multiple R-squared: 0.5319, Adjusted R-squared: 0.5287
## F-statistic: 169 on 4 and 595 DF, p-value: < 2.2e-16
We can observe some slight changes in our coefficients as well as in the model’s fit.
Intercept (85.1427): When all predictors are zero, the expected value of the square-rooted median house value (SValue) is approximately 85.1427.
housing_median_age (2.0775): For each one-unit increase in the median housing age of the area, SValue is expected to increase by approximately 2.0775 units, assuming the other variables are held constant.
SRoom (-2.2938): For each one-unit increase in the square root of total rooms, SValue decreases by about 2.2938 units, holding the other variables constant.
SHouse (6.6075): For each one-unit increase in the square root of the number of households in the area, SValue is expected to increase by approximately 6.6075 units, assuming the other variables are held constant.
LIncome (210.8892): Each one-unit increase in the log-transformed median income is associated with an increase of about 210.8892 units in SValue, keeping the other variables constant.
Residual Standard Error (RSE): The RSE has increased slightly to 86, which suggests that the actual values typically deviate from the predicted SValue by approximately 86 units.
Adjusted R-squared: The adjusted R-squared has dropped slightly from 0.5395 to 0.5287, which could be due to overfitting or simply to the smaller test sample (600 observations).
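A complementary way to use a test set is to keep the model fitted on the training data and measure its prediction error on the held-out rows, instead of refitting. The sketch below shows the idea on simulated data with a 7:3 split; the data and object names are assumptions for illustration and are not the housing results.

```r
set.seed(6)
n <- 1000
x <- runif(n, 0, 10)
y <- 5 + 3 * x + rnorm(n, sd = 1)          # simulated data, noise sd = 1
sim <- data.frame(x, y)

# 7:3 split, mirroring the split used earlier in the report
idx <- sample(seq_len(n), size = floor(0.7 * n))
train_sim <- sim[idx, ]
test_sim  <- sim[-idx, ]

fit  <- lm(y ~ x, data = train_sim)        # fitted on training rows only
pred <- predict(fit, newdata = test_sim)   # predictions for unseen rows

rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_test  <- sqrt(mean((test_sim$y - pred)^2))
print(c(train = rmse_train, test = rmse_test))
```

A test RMSE much larger than the training RMSE would point to overfitting. For this report, the equivalent call would be predict(reg, newdata = T2), compared against T2$SValue.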
Overfitting due to multicollinearity: several predictors were highly correlated, which can cause multicollinearity and may in turn contribute to overfitting. Overfitting makes a model perform poorly on unseen data and therefore reduces the accuracy of its predictions. One fix is to drop highly correlated variables, but because this dataset has so few variables, other methods need to be considered.
Small sample size: to match what we’re used to in class, I reduced the sample size from over 20,000 to 2,000, which can affect the model’s accuracy.
Outdated data set: this data set was collected in 1990, so before applying this model we would first need to update the data to reflect current prices.