Our aim is to predict house values. Before doing any analysis, we should check whether the dataset contains missing values. We do so by typing:
library(dplyr)    # for glimpse(), %>%, rename(), mutate(), group_by(), summarize()
library(ggplot2)  # for the plots below
taiwan_real_estate <- read.csv("real_estates.csv", row.names = 1)
attach(taiwan_real_estate)
any(is.na(taiwan_real_estate))
## [1] FALSE
Let’s take a look at the structure of the dataset:
glimpse(taiwan_real_estate)
## Rows: 414
## Columns: 6
## $ house.age <dbl> 32.0, 19.5, 13.3, 13.3, 5.0, 7.1, …
## $ distance.to.the.nearest.MRT.station <dbl> 84.87882, 306.59470, 561.98450, 56…
## $ number.of.convenience.stores <int> 10, 9, 5, 5, 5, 3, 7, 6, 1, 3, 1, …
## $ latitude <dbl> 24.98298, 24.98034, 24.98746, 24.9…
## $ longitude <dbl> 121.5402, 121.5395, 121.5439, 121.…
## $ house.price.of.unit.area <dbl> 37.9, 42.2, 47.3, 54.8, 43.1, 32.1…
Let’s simplify the variable names:
taiwan_real_estate <- taiwan_real_estate %>%
  rename(house_age_years = house.age,
         price_twd_msq = house.price.of.unit.area,
         n_convenience = number.of.convenience.stores,
         dist_to_mrt_m = distance.to.the.nearest.MRT.station)
We can also perform binning for “house_age_years”:
# Perform binning of house age into three intervals: [0, 15), [15, 30), [30, 45]
taiwan_real_estate <- taiwan_real_estate %>%
  mutate(house_age_cat = cut(house_age_years,
                             breaks = c(0, 15, 30, 45),
                             include.lowest = TRUE, right = FALSE))
Prepare a heatmap with correlation coefficients on it:
library(corrplot)
# Correlation matrix of the numeric variables (columns 1 to 6)
M <- cor(taiwan_real_estate[, 1:6])
corrplot(M, method = 'number')
Draw a scatter plot of n_convenience vs. price_twd_msq:
# Scatter plot of number of convenience stores vs. price per square meter
ggplot(taiwan_real_estate, aes(x = n_convenience, y = price_twd_msq)) +
geom_point(color = "red") +
labs(title = "Number of Convenience Stores vs. Price per Square Meter",
x = "Number of Convenience Stores", y = "Price per Square Meter (TWD)")
Draw a scatter plot of house_age_years vs. price_twd_msq:
# Scatter plot of house age vs. price per square meter
ggplot(taiwan_real_estate, aes(x = house_age_years, y = price_twd_msq)) +
geom_point(color = "blue") +
labs(title = "House Age vs. Price per Square Meter",
x = "House Age (years)", y = "Price per Square Meter (TWD)")
Draw a scatter plot of distance to nearest MRT station vs. price_twd_msq:
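The corresponding chunk is not shown in the original; a sketch following the same pattern as the plots above (the point color is an arbitrary choice):
# Scatter plot of distance to the nearest MRT station vs. price per square meter
ggplot(taiwan_real_estate, aes(x = dist_to_mrt_m, y = price_twd_msq)) +
  geom_point(color = "purple") +
  labs(title = "Distance to Nearest MRT Station vs. Price per Square Meter",
       x = "Distance to Nearest MRT Station (m)", y = "Price per Square Meter (TWD)")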
Plot a histogram of price_twd_msq with 10 bins, facet the plot so each house age group gets its own panel:
# Histogram of price per square meter with 10 bins, faceted by house age category
ggplot(taiwan_real_estate, aes(x = price_twd_msq)) +
geom_histogram(bins = 10, fill = "green", color = "black") +
facet_wrap(~house_age_cat) +
labs(title = "Histogram of Price per Square Meter by House Age Category",
x = "Price per Square Meter (TWD)", y = "Count")
Summarize the house price per unit area by house age group (mean, standard deviation, median, etc.):
# Summarize statistics of price per square meter by house age category
summary_stats <- taiwan_real_estate %>%
group_by(house_age_cat) %>%
summarize(
mean_price = mean(price_twd_msq, na.rm = TRUE),
sd_price = sd(price_twd_msq, na.rm = TRUE),
median_price = median(price_twd_msq, na.rm = TRUE),
min_price = min(price_twd_msq, na.rm = TRUE),
max_price = max(price_twd_msq, na.rm = TRUE)
)
summary_stats
## # A tibble: 3 × 6
## house_age_cat mean_price sd_price median_price min_price max_price
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 [0,15) 41.8 14.2 42.6 7.6 118.
## 2 [15,30) 32.6 11.4 32.9 11.2 59.6
## 3 [30,45] 37.7 12.8 38.3 12.2 78.3
Run a linear regression of price_twd_msq on a single predictor; we use dist_to_mrt_m, the variable most strongly correlated with the price:
# Linear regression of price_twd_msq vs. dist_to_mrt_m
lm_model <- lm(price_twd_msq ~ dist_to_mrt_m, data = taiwan_real_estate)
summary(lm_model)
##
## Call:
## lm(formula = price_twd_msq ~ dist_to_mrt_m, data = taiwan_real_estate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.396 -6.007 -1.195 4.831 73.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.8514271 0.6526105 70.26 <2e-16 ***
## dist_to_mrt_m -0.0072621 0.0003925 -18.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.07 on 412 degrees of freedom
## Multiple R-squared: 0.4538, Adjusted R-squared: 0.4524
## F-statistic: 342.2 on 1 and 412 DF, p-value: < 2.2e-16
You can access lots of different aspects of the regression object. To see what’s inside, use names():
# Display the names of elements in the regression model object
names(lm_model)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
What do they mean?
The elements in the regression model object lm_model can be accessed using the names() function, which lists the following components:
coefficients: the regression coefficients, including the intercept and the slope.
residuals: the differences between the observed values and the values predicted by the model.
fitted.values: the values predicted by the model.
rank: the rank of the model matrix.
df.residual: the degrees of freedom for the residuals.
call: the function call used to create the model.
terms: an object containing information about the terms in the model.
model: the input data on which the model is based.
assign: indices mapping the columns of the model matrix to the terms in the model.
qr: the QR decomposition used in the model calculations.
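For example, the most commonly used components can be pulled out directly (a minimal sketch):
# Access individual components of the fitted model object
lm_model$coefficients          # or coef(lm_model)
head(lm_model$residuals)       # or head(residuals(lm_model))
head(lm_model$fitted.values)   # or head(fitted(lm_model))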
Discuss model accuracy:
Model accuracy can be evaluated using several metrics obtained from the model summary (summary(lm_model)):
R-squared: the coefficient of determination, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A higher R-squared value means a better model fit.
Adjusted R-squared: the adjusted coefficient of determination, which corrects for the number of predictors in the model. It provides a more honest measure of fit for models with multiple predictors.
F-statistic: tests the overall significance of the model. A high F-statistic with a low p-value (usually < 0.05) indicates that the model is statistically significant.
p-values: the p-values for the regression coefficients indicate whether each predictor is statistically significantly different from zero. Small p-values (usually < 0.05) suggest that the predictor is significant.
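These quantities can also be extracted programmatically from the summary object (a minimal sketch):
# Extract fit statistics from the model summary
model_summary <- summary(lm_model)
model_summary$r.squared       # R-squared
model_summary$adj.r.squared   # adjusted R-squared
model_summary$fstatistic      # F-statistic and its degrees of freedom
model_summary$sigma           # residual standard error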
Model diagnostics:
par(mfrow = c(2, 2))
plot(lm_model)
The four plots show the following:
Residuals vs. Fitted: checks whether the residuals are randomly distributed around zero. This plot helps to identify non-linearity, unequal error variances, and outliers.
Normal Q-Q: checks whether the residuals are normally distributed. Deviations from the diagonal line indicate departures from normality.
Scale-Location (or Spread-Location): checks the homoscedasticity (constant variance) of the residuals. Ideally, residuals should be spread equally along the range of predictors.
Residuals vs. Leverage: identifies points that have a high influence on the model fit. Points with high leverage and large residuals can disproportionately affect the model.
Create the diagnostic plots using ggfortify:
library(ggfortify)
autoplot(lm_model)
Outliers and high leverage points:
plot(lm_model, 5)
Influential values:
# Cook's distance
plot(lm_model, 4)
Or plot all of the diagnostic plots together:
autoplot(lm_model, which = 1:6, label.size = 3)
Discussion:
Based on the diagnostic plots, the following conclusions can be drawn:
Outliers: the residuals vs. fitted plot and the normal Q-Q plot help identify potential outliers. Observations with large residuals (far from the horizontal line in the residuals vs. fitted plot) and deviations from the line in the normal Q-Q plot indicate outliers.
High leverage points: the residuals vs. leverage plot helps identify high leverage points. Points far from the center in the horizontal direction are high leverage points.
Influential values: Cook’s distance highlights influential observations. Points with Cook’s distance greater than 1 are considered highly influential and should be examined closely.
Overall, the diagnostic plots provide valuable insight into the fit and assumptions of the linear regression model. By identifying and addressing outliers, high leverage points, and influential values, we can improve the robustness and accuracy of the model.
We begin by splitting the dataset into two parts, a training set and a testing set. In this example we randomly assign 75% of the rows to the training set and the remaining 25% to the testing set:
# Determine the sample size for the training set (75% of the dataset)
smp_size <- floor(0.75 * nrow(taiwan_real_estate))
# Set the seed for reproducibility
set.seed(12)
# Randomly sample row indices for the training set
train_ind <- sample(seq_len(nrow(taiwan_real_estate)), size = smp_size)
# Split the dataset into training and testing sets
train <- taiwan_real_estate[train_ind, ]
test <- taiwan_real_estate[-train_ind, ]
Note that floor() returns the largest integer that is not greater than its argument.
set.seed() sets the seed of R’s random number generator; it is used here so that the results of this example can be reproduced easily.
Now we have our training set and testing set.
Generally, variable selection for linear regression is a debated topic.
There are many methods for variable selection, such as forward stepwise selection and backward stepwise selection; some are valid, while others are heavily criticized.
I recommend this document: https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/26/lecture-26.pdf and Gung’s comment: https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 if you want to learn more about the variable selection process.
If our goal is prediction, it is safer to include all predictors in our model; removing variables without understanding the science behind them usually does more harm than good.
We begin to create our multiple linear regression model:
model2 <- lm(price_twd_msq ~ ., data = train)
summary(model2)
##
## Call:
## lm(formula = price_twd_msq ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.009 -4.953 -1.296 4.461 75.042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.861e+03 7.542e+03 -0.379 0.704717
## house_age_years -4.209e-01 1.223e-01 -3.442 0.000659 ***
## dist_to_mrt_m -4.558e-03 8.681e-04 -5.251 2.86e-07 ***
## n_convenience 9.826e-01 2.287e-01 4.297 2.34e-05 ***
## latitude 2.505e+02 5.375e+01 4.660 4.74e-06 ***
## longitude -2.755e+01 6.032e+01 -0.457 0.648221
## house_age_cat[15,30) -1.089e+00 1.916e+00 -0.568 0.570124
## house_age_cat[30,45] 5.789e+00 3.577e+00 1.618 0.106655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.331 on 302 degrees of freedom
## Multiple R-squared: 0.5676, Adjusted R-squared: 0.5575
## F-statistic: 56.62 on 7 and 302 DF, p-value: < 2.2e-16
Discuss the results…
There are various methods for variable selection, each with its pros and cons; the choice should be guided by the specific goals of the analysis, domain knowledge, and rigorous validation techniques. Our approach of including all predictors provides a comprehensive view, but future work could explore alternative variable selection methods to refine the model further.
Looking at the model summary, we see that longitude and the house_age_cat dummies are not statistically significant. As a simple point of comparison, let’s also estimate a model with n_convenience as the only predictor:
# Fit a simpler model with n_convenience as the only predictor
model3 <- lm(price_twd_msq ~ n_convenience, data = train)
summary(model3)
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.573 -7.659 -2.303 6.754 87.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.6626 1.1670 23.70 <2e-16 ***
## n_convenience 2.5850 0.2245 11.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.75 on 308 degrees of freedom
## Multiple R-squared: 0.3009, Adjusted R-squared: 0.2986
## F-statistic: 132.6 on 1 and 308 DF, p-value: < 2.2e-16
To check for multicollinearity among the predictors, we look at variance inflation factors (VIF). There are many standards researchers apply when deciding whether a VIF is too large: in some domains a VIF over 2 is worthy of suspicion, others set the bar higher at 5 or 10, and others still will say you shouldn’t pay attention to these at all. Ultimately, the main thing to consider is that small effects are more likely to be “drowned out” by higher VIFs, but this may simply be a natural, unavoidable feature of your model.
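The chunk producing the values below is not shown in the original; a minimal sketch, assuming the car package and the full model model2 fitted above:
# Generalized variance inflation factors for the full multiple regression model
library(car)
vif(model2)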
## GVIF Df GVIF^(1/(2*Df))
## house_age_years 6.920312 1 2.630649
## dist_to_mrt_m 4.244914 1 2.060319
## n_convenience 1.644746 1 1.282476
## latitude 1.522464 1 1.233882
## longitude 3.014812 1 1.736321
## house_age_cat 7.480049 2 1.653774
Finally, we test our model on the test dataset:
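The evaluation chunk is not shown in the original; a minimal sketch of how the metrics below can be computed, assuming the full model model2 is the one being evaluated:
# Predict on the test set and compute MSE, RMSE and R-squared
predictions <- predict(model2, newdata = test)
mse  <- mean((test$price_twd_msq - predictions)^2)
rmse <- sqrt(mse)
r_squared <- 1 - sum((test$price_twd_msq - predictions)^2) /
                 sum((test$price_twd_msq - mean(test$price_twd_msq))^2)
list(MSE = mse, RMSE = rmse, R_squared = r_squared)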
## $MSE
## [1] 55.5732
##
## $RMSE
## [1] 7.454743
##
## $R_squared
## [1] 0.6155939
Interpret results…
Our multiple regression model was tested on the test dataset to evaluate its predictive performance:
Mean Squared Error (MSE) = 55.5732: This value indicates the average squared error of the predictions. A lower MSE suggests better prediction accuracy, and in this case, the relatively low MSE reflects good overall model accuracy.
Root Mean Squared Error (RMSE) = 7.454743: This value suggests that, on average, the model’s predictions are within approximately 7.45 units of the actual house prices per square meter. This low RMSE indicates that the model has high predictive accuracy.
R-squared = 0.6155939: This value indicates that approximately 61.56% of the variability in house prices is explained by the predictors in the model. This suggests a strong relationship between the predictors and the dependent variable, but also indicates that about 38.44% of the variability remains unexplained by the model.
Best subset and stepwise (forward, backward, both) variable selection techniques can be used to arrive at the best linear regression model for the dependent variable price_twd_msq.
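The best-subset output below can be reproduced along these lines (a sketch assuming the leaps package; the original chunk is not shown):
# Best subset selection over the three continuous predictors
library(leaps)
best_subset <- regsubsets(price_twd_msq ~ house_age_years + n_convenience + dist_to_mrt_m,
                          data = train, nvmax = 3)
best_summary <- summary(best_subset)
best_summary$which    # which predictors enter the best model of each size
best_summary$adjr2    # adjusted R-squared
best_summary$bic      # BIC
best_summary$cp       # Mallows' Cp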
## (Intercept) house_age_years n_convenience dist_to_mrt_m
## 1 TRUE FALSE FALSE TRUE
## 2 TRUE FALSE TRUE TRUE
## 3 TRUE TRUE TRUE TRUE
## [1] 0.4546357 0.4830094 0.5165886
## [1] -177.4851 -189.3198 -205.4132
## [1] 41.47258 24.32511 4.00000
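The stepwise traces and model summaries below can be generated with MASS::stepAIC(); a sketch of the forward, backward, and both-direction calls (assumed, since the original chunk is not shown):
# Null and full models over the three continuous predictors
library(MASS)
null_model <- lm(price_twd_msq ~ 1, data = train)
full_model <- lm(price_twd_msq ~ house_age_years + n_convenience + dist_to_mrt_m, data = train)
# Forward selection, starting from the null model
model_forward <- stepAIC(null_model, direction = "forward",
                         scope = list(lower = ~ 1,
                                      upper = ~ house_age_years + n_convenience + dist_to_mrt_m))
summary(model_forward)
# Backward elimination and both-direction search, starting from the full model
model_backward <- stepAIC(full_model, direction = "backward")
summary(model_backward)
model_step <- stepAIC(full_model, direction = "both")
summary(model_step)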
## Start: AIC=1638.47
## price_twd_msq ~ 1
##
## Df Sum of Sq RSS AIC
## + dist_to_mrt_m 1 27754.0 33057 1451.5
## + n_convenience 1 18298.4 42512 1529.5
## + house_age_years 1 2037.6 58773 1629.9
## <none> 60811 1638.5
##
## Step: AIC=1451.52
## price_twd_msq ~ dist_to_mrt_m
##
## Df Sum of Sq RSS AIC
## + n_convenience 1 1821.6 31235 1435.9
## + house_age_years 1 1773.7 31283 1436.4
## <none> 33057 1451.5
##
## Step: AIC=1435.94
## price_twd_msq ~ dist_to_mrt_m + n_convenience
##
## Df Sum of Sq RSS AIC
## + house_age_years 1 2123.9 29111 1416.1
## <none> 31235 1435.9
##
## Step: AIC=1416.11
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years
##
## Call:
## lm(formula = price_twd_msq ~ dist_to_mrt_m + n_convenience +
## house_age_years, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.650 -5.386 -1.862 4.183 76.138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.2063132 1.7085698 25.873 < 2e-16 ***
## dist_to_mrt_m -0.0058636 0.0005564 -10.538 < 2e-16 ***
## n_convenience 1.1269464 0.2358640 4.778 2.75e-06 ***
## house_age_years -0.2305409 0.0487923 -4.725 3.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.754 on 306 degrees of freedom
## Multiple R-squared: 0.5213, Adjusted R-squared: 0.5166
## F-statistic: 111.1 on 3 and 306 DF, p-value: < 2.2e-16
## Start: AIC=1416.11
## price_twd_msq ~ house_age_years + n_convenience + dist_to_mrt_m
##
## Df Sum of Sq RSS AIC
## <none> 29111 1416.1
## - house_age_years 1 2123.9 31235 1435.9
## - n_convenience 1 2171.8 31283 1436.4
## - dist_to_mrt_m 1 10564.4 39676 1510.1
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years + n_convenience +
## dist_to_mrt_m, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.650 -5.386 -1.862 4.183 76.138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.2063132 1.7085698 25.873 < 2e-16 ***
## house_age_years -0.2305409 0.0487923 -4.725 3.51e-06 ***
## n_convenience 1.1269464 0.2358640 4.778 2.75e-06 ***
## dist_to_mrt_m -0.0058636 0.0005564 -10.538 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.754 on 306 degrees of freedom
## Multiple R-squared: 0.5213, Adjusted R-squared: 0.5166
## F-statistic: 111.1 on 3 and 306 DF, p-value: < 2.2e-16
## Start: AIC=1416.11
## price_twd_msq ~ house_age_years + n_convenience + dist_to_mrt_m
##
## Df Sum of Sq RSS AIC
## <none> 29111 1416.1
## - house_age_years 1 2123.9 31235 1435.9
## - n_convenience 1 2171.8 31283 1436.4
## - dist_to_mrt_m 1 10564.4 39676 1510.1
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years + n_convenience +
## dist_to_mrt_m, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.650 -5.386 -1.862 4.183 76.138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.2063132 1.7085698 25.873 < 2e-16 ***
## house_age_years -0.2305409 0.0487923 -4.725 3.51e-06 ***
## n_convenience 1.1269464 0.2358640 4.778 2.75e-06 ***
## dist_to_mrt_m -0.0058636 0.0005564 -10.538 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.754 on 306 degrees of freedom
## Multiple R-squared: 0.5213, Adjusted R-squared: 0.5166
## F-statistic: 111.1 on 3 and 306 DF, p-value: < 2.2e-16
By comparing different variable selection methods, we can identify the best combination of predictors that provide the most reliable and interpretable model. This approach ensures that our final model is both statistically significant and practically meaningful.
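The AIC comparison and the final model summary below can be reproduced as follows (a sketch, reusing the model objects assumed above):
# Compare the three selected models on AIC; they all pick the same three predictors
AIC(model_forward, model_backward, model_step)
summary(model_step)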
## df AIC
## model_forward 5 2297.856
## model_backward 5 2297.856
## model_step 5 2297.856
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years + n_convenience +
## dist_to_mrt_m, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.650 -5.386 -1.862 4.183 76.138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.2063132 1.7085698 25.873 < 2e-16 ***
## house_age_years -0.2305409 0.0487923 -4.725 3.51e-06 ***
## n_convenience 1.1269464 0.2358640 4.778 2.75e-06 ***
## dist_to_mrt_m -0.0058636 0.0005564 -10.538 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.754 on 306 degrees of freedom
## Multiple R-squared: 0.5213, Adjusted R-squared: 0.5166
## F-statistic: 111.1 on 3 and 306 DF, p-value: < 2.2e-16
From best subset regression and stepwise selection (forward, backward, both), we see that all approaches point to the same model: price_twd_msq regressed on dist_to_mrt_m, n_convenience, and house_age_years. Among the candidate models, this one has the highest adjusted R-squared and the lowest BIC and Cp.
The diagnostic plots show that the residuals are approximately normally distributed and that there are no obvious patterns or heteroscedasticity issues. The model appears to be a good fit for the data.
Models are compared based on adjusted R-squared, AIC, and BIC for in-sample performance, and on the mean squared prediction error (MSPE) for out-of-sample performance.
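A sketch of how the in-sample figures below (mean squared error, R-squared, adjusted R-squared, AIC, BIC) can be obtained, assuming model_step from above is the selected model:
# In-sample performance of the selected model
mean(model_step$residuals^2)        # in-sample mean squared error
summary(model_step)$r.squared       # R-squared
summary(model_step)$adj.r.squared   # adjusted R-squared
list(AIC = AIC(model_step), BIC = BIC(model_step))
The out-of-sample MSPE reported a few lines further down can be computed analogously with predict(model_step, newdata = test).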
## [1] 93.90698
## [1] 0.5212819
## [1] 0.5165886
## $AIC
## [1] 2297.856
##
## $BIC
## [1] 2316.539
Finally, we can check the out-of-sample prediction (test) error, the MSPE:
## [1] 59.32984
The comparison of models based on adjusted R-squared, AIC, BIC criteria for in-sample performance, and MSPE for out-of-sample performance helps us select the most robust and accurate model. In this case, the stepwise selection model (model_step) emerges as the best overall model, balancing fit and complexity effectively.
Please check how the function cv.glm() works (see ?cv.glm).
From the object created by cv.glm() we will extract just the raw cross-validation estimate of the prediction error.
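The cross-validation chunk is not shown in the original; a minimal sketch, assuming the competing models are re-fit with glm() (cv.glm() expects glm objects) and 10-fold cross-validation:
# Re-fit the competing models with glm() so that cv.glm() can be applied
library(boot)
glm_step    <- glm(price_twd_msq ~ house_age_years + n_convenience + dist_to_mrt_m, data = train)
glm_forward <- glm(price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years, data = train)
# The first component of delta is the raw cross-validation estimate of prediction error
cv_step    <- cv.glm(train, glm_step,    K = 10)
cv_forward <- cv.glm(train, glm_forward, K = 10)
list(Stepwise_Model_MSPE          = cv_step$delta[1],
     Forward_Selection_Model_MSPE = cv_forward$delta[1])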
## $Stepwise_Model_MSPE
## [1] NaN
##
## $Forward_Selection_Model_MSPE
## [1] NaN
Based on the AIC criteria, we previously identified the stepwise selection model as the best-fitting model. Now we evaluate the out-of-sample MSPE for both the stepwise selection model and the forward selection model.
We need to check the out-of-sample MSPE for both models. Based on out-of-sample prediction error, the stepwise selection model has a lower MSPE and thus provides better predictive accuracy than the forward selection model. This further validates our initial model selection based on AIC and other criteria.