STA6013 Midterm Regression Analysis

Author

Ryan Murray

Linear Regression Analysis - Real Estate Data:

Exploring Factors Affecting House Prices in Sindian District, New Taipei City, Taiwan.

Introduction

Objective

The objective of this analysis is to identify factors associated with real estate unit price in Sindian District, New Taipei City, Taiwan.

Specifically, we investigate how:

Transaction date
House age
Distance to nearest MRT station
Number of convenience stores
Geographic location (latitude)

are associated with house price per unit area.

Response Variable:

\[ Y=House \ price\ of\ unit\ area \]

Candidate Predictors

𝑋 1: Transaction date

𝑋 2: House age (years)

𝑋 3: Distance to nearest MRT station (meters)

𝑋 4: Number of convenience stores

𝑋 5: Latitude

𝑋 6: Longitude

Detailed Chart:

Variable Name	Role	Type	Description	Units	Missing Values
No	ID	Integer			no
X1 transaction date	Feature	Continuous	for example, 2013.250=2013 March, 2013.500=2013 June, etc.		no
X2 house age	Feature	Continuous		year	no
X3 distance to the nearest MRT station	Feature	Continuous		meter	no
X4 number of convenience stores	Feature	Integer	number of convenience stores in the living circle on foot	integer	no
X5 latitude	Feature	Continuous	geographic coordinate, latitude	degree	no
X6 longitude	Feature	Continuous	geographic coordinate, longitude	degree	no
Y house price of unit area	Target	Continuous	10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 meter squared	10000 New Taiwan Dollar/Ping	no

Libraries used in Analysis

Code

# Core
library(tidyverse)

# Class-style visuals
library(GGally)

# For VIF / optional Anova() / Breusch-Pagan / Box-Cox Method
library(car)
library(lmtest)
library(MASS)

# plots
library(dplyr)
library(tidyr)
library(tibble)  
library(ggplot2)
library(plotly)

# rename columns
library(janitor)

Checking Structure of Data

Code

dat <- read.csv("Real estate.csv")

# Quick structure check
str(dat)

'data.frame':   414 obs. of  8 variables:
 $ No                                    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ X1.transaction.date                   : num  2013 2013 2014 2014 2013 ...
 $ X2.house.age                          : num  32 19.5 13.3 13.3 5 7.1 34.5 20.3 31.7 17.9 ...
 $ X3.distance.to.the.nearest.MRT.station: num  84.9 306.6 562 562 390.6 ...
 $ X4.number.of.convenience.stores       : int  10 9 5 5 5 3 7 6 1 3 ...
 $ X5.latitude                           : num  25 25 25 25 25 ...
 $ X6.longitude                          : num  122 122 122 122 122 ...
 $ Y.house.price.of.unit.area            : num  37.9 42.2 47.3 54.8 43.1 32.1 40.3 46.7 18.8 22.1 ...

Code

summary(dat)

       No        X1.transaction.date  X2.house.age   
 Min.   :  1.0   Min.   :2013        Min.   : 0.000  
 1st Qu.:104.2   1st Qu.:2013        1st Qu.: 9.025  
 Median :207.5   Median :2013        Median :16.100  
 Mean   :207.5   Mean   :2013        Mean   :17.713  
 3rd Qu.:310.8   3rd Qu.:2013        3rd Qu.:28.150  
 Max.   :414.0   Max.   :2014        Max.   :43.800  
 X3.distance.to.the.nearest.MRT.station X4.number.of.convenience.stores
 Min.   :  23.38                        Min.   : 0.000                 
 1st Qu.: 289.32                        1st Qu.: 1.000                 
 Median : 492.23                        Median : 4.000                 
 Mean   :1083.89                        Mean   : 4.094                 
 3rd Qu.:1454.28                        3rd Qu.: 6.000                 
 Max.   :6488.02                        Max.   :10.000                 
  X5.latitude     X6.longitude   Y.house.price.of.unit.area
 Min.   :24.93   Min.   :121.5   Min.   :  7.60            
 1st Qu.:24.96   1st Qu.:121.5   1st Qu.: 27.70            
 Median :24.97   Median :121.5   Median : 38.45            
 Mean   :24.97   Mean   :121.5   Mean   : 37.98            
 3rd Qu.:24.98   3rd Qu.:121.5   3rd Qu.: 46.60            
 Max.   :25.01   Max.   :121.6   Max.   :117.50

Code

colnames(dat)

[1] "No"                                    
[2] "X1.transaction.date"                   
[3] "X2.house.age"                          
[4] "X3.distance.to.the.nearest.MRT.station"
[5] "X4.number.of.convenience.stores"       
[6] "X5.latitude"                           
[7] "X6.longitude"                          
[8] "Y.house.price.of.unit.area"

Renaming Variables

Code

# Rename multiple columns at once
dat <- dat %>%
  rename(
    trans_date = X1.transaction.date,
    house_age = X2.house.age,  
    dist_mrt = X3.distance.to.the.nearest.MRT.station,
    num_conv_stores = X4.number.of.convenience.stores,
    latitude = X5.latitude,
    longitude = X6.longitude,
    house_price = Y.house.price.of.unit.area 
  )
colnames(dat)

[1] "No"              "trans_date"      "house_age"       "dist_mrt"       
[5] "num_conv_stores" "latitude"        "longitude"       "house_price"

Exploratory Analysis

Code

ggpairs(dat)

Impressions:

“No” is a categorical variable used to assign a number label to each variable. It is not needed for this analysis.

num_conv_stores, latitude, and longitude have the highest correlation scores 0.571, 0.546, and 0.523, respectively.

dist_mrt has (-0.674) the largest negative correlation and this interesting to note as it may have a substantial affect on the house_price variable.

Plots

Code

top_correlations <- dat %>% 
  cor() %>%
  as.data.frame() %>%
  rownames_to_column(var = "Variable") %>% 
  select(Variable, house_price) %>%
  filter(Variable != "house_price") %>%
  mutate(AbsCorr = abs(house_price)) %>%
  arrange(desc(AbsCorr)) %>%
  head(5)


top_vars <- top_correlations$Variable
print(top_vars)

[1] "dist_mrt"        "num_conv_stores" "latitude"        "longitude"      
[5] "house_age"

Code

dat %>%
  select(house_price, all_of(top_vars)) %>%
  pivot_longer(cols = -house_price, names_to = "Predictor", values_to = "Value") %>%
  ggplot(aes(x = Value, y = house_price, color = Predictor)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "black") +
  facet_wrap(~Predictor, scales = "free_x") +
  theme_minimal() +
  labs(title = "Top 5 Predictors for house price of unit area",
       y = "house_price",
       x = "Predictor Value") +
  theme(legend.position = "none")

Pairwise scatterplots and correlation analysis revealed:

A strong negative association between price and distance to MRT.
A negative relationship between price and house age.
Positive association between price and convenience stores.
Spatial clustering effects related to latitude.

The relationship between price and distance to MRT appeared nonlinear, suggesting diminishing marginal impact at larger distances.*

\[ *\ This\ observation\ motivated\ later\ transformation. \]

Null Hypothesis:

\[ H_0: \ \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0 \]

Alternative Hypothesis:

\[ H_1 : at\ least\ 1\ \beta\ \ne \ 0 \]

Fitting the Full Model

Code

form <- lm(house_price ~ trans_date + house_age + dist_mrt + num_conv_stores + latitude + longitude, data = dat)
print(form)


Call:
lm(formula = house_price ~ trans_date + house_age + dist_mrt + 
    num_conv_stores + latitude + longitude, data = dat)

Coefficients:
    (Intercept)       trans_date        house_age         dist_mrt  
     -1.444e+04        5.146e+00       -2.697e-01       -4.487e-03  
num_conv_stores         latitude        longitude  
      1.133e+00        2.255e+02       -1.242e+01

Code

# Full model
full_mod <- lm(form, data = dat)

summary(full_mod)


Call:
lm(formula = form, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.664  -5.410  -0.966   4.217  75.193 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.444e+04  6.776e+03  -2.131  0.03371 *  
trans_date       5.146e+00  1.557e+00   3.305  0.00103 ** 
house_age       -2.697e-01  3.853e-02  -7.000 1.06e-11 ***
dist_mrt        -4.488e-03  7.180e-04  -6.250 1.04e-09 ***
num_conv_stores  1.133e+00  1.882e-01   6.023 3.84e-09 ***
latitude         2.255e+02  4.457e+01   5.059 6.38e-07 ***
longitude       -1.242e+01  4.858e+01  -0.256  0.79829    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.858 on 407 degrees of freedom
Multiple R-squared:  0.5824,    Adjusted R-squared:  0.5762 
F-statistic: 94.59 on 6 and 407 DF,  p-value: < 2.2e-16

Code

anova(full_mod)

Analysis of Variance Table

Response: house_price
                 Df Sum Sq Mean Sq  F value    Pr(>F)    
trans_date        1    585     585   7.4598  0.006584 ** 
house_age         1   3441    3441  43.8559  1.12e-10 ***
dist_mrt          1  34857   34857 444.2734 < 2.2e-16 ***
num_conv_stores   1   3576    3576  45.5748  5.08e-11 ***
latitude          1   2065    2065  26.3187  4.49e-07 ***
longitude         1      5       5   0.0654  0.798293    
Residuals       407  31933      78                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

summary(full_mod)$coefficients

                     Estimate   Std. Error    t value     Pr(>|t|)
(Intercept)     -1.443710e+04 6.775671e+03 -2.1307265 3.371059e-02
trans_date       5.146227e+00 1.557073e+00  3.3050650 1.033686e-03
house_age       -2.696954e-01 3.853065e-02 -6.9995036 1.064969e-11
dist_mrt        -4.487461e-03 7.180273e-04 -6.2497080 1.038560e-09
num_conv_stores  1.133277e+00 1.881639e-01  6.0228169 3.835416e-09
latitude         2.254730e+02 4.456668e+01  5.0592270 6.383415e-07
longitude       -1.242360e+01 4.858200e+01 -0.2557244 7.982928e-01

Full Model:

\[ house\_price \sim \beta_0 + \beta_1(trans\_date) + \beta_2(house\_age) + \beta_3(dist\_mrt) + \\\beta_4(num\_conv\_stores) + \beta_5(latitude) + \beta_6(longitude) + \epsilon \]

Overall Model Significance

I conducted an overall F-test to determine the collective significance of the regression model. The null hypothesis posits that none of the chosen features have a relationship with the house price.

The model yielded a highly significant F-statistic of 94.59 with a p-value of < 2.2 X 10^-16

Because the p-value is well below the standard alpha threshold of 0.05, we reject the null hypothesis. This result provides strong empirical evidence that at least one of the predictors in the model has a statistically significant relationship with house price, confirming that the model as a whole offers a significantly better fit than a null model (intercept only).

Interpretations:

trans_date ** (5.146227e+00)

Holding all other variables constant, for each 1 year increase in the transaction date, house price has an expected increase by approximately 51, 462 TWD per ping. (10000 New Taiwanese Dollar Per Ping)

house_age *** (-2.696954e-01)

Holding all other variables constant, for each 1 year increase in the age of the house, we can expect a decrease of approximately 2,697 TWD per ping.

dist_mrt *** (-4.487461e-03)

Holding all other variables constant, for every 100-meter increase in distance to the nearest MRT station, the house price is expected to decrease by approximately 4,487 TWD per ping. (a MRT station is the Taiwanese Rapid Transit subway metro network)

num_conv_stores *** (1.133277e+00)

Holding all other variables constant, the addition of one convenience store within the immediate vicinity is associated with an expected increase in house price of approximately, 11,333 TWD per ping.

latitude *** (2.254730e+02)

Holding all other variables constant, for every 0.01 degree increase in latitude (moving approximately 1.1 kilometers North), the house price is expected to increase by approximately 22,547 TWD per ping.

longitude - not significant

The longitude variable shows no evidence of a linear relationship with house price when controlling for the other factors. This discrepancy suggests that while the overall model is sound, a reduced model may be more appropriate for explaining the data without redundant variables.

Model Testing

Longitude (𝑋6) was not statistically significant.

\[ p = 0.79829 \]

A partial F-test confirmed that removing 𝑋6 did not significantly reduce explanatory power.

Therefore, longitude was removed for parsimony.

Reduced Model

Code

reduced_mod <- lm(house_price ~ trans_date + house_age + dist_mrt + num_conv_stores + latitude, data = dat)
summary(reduced_mod)


Call:
lm(formula = house_price ~ trans_date + house_age + dist_mrt + 
    num_conv_stores + latitude, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.623  -5.371  -1.020   4.244  75.346 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.596e+04  3.233e+03  -4.936 1.17e-06 ***
trans_date       5.135e+00  1.555e+00   3.303  0.00104 ** 
house_age       -2.694e-01  3.847e-02  -7.003 1.04e-11 ***
dist_mrt        -4.353e-03  4.899e-04  -8.887  < 2e-16 ***
num_conv_stores  1.136e+00  1.876e-01   6.056 3.17e-09 ***
latitude         2.269e+02  4.417e+01   5.136 4.36e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.848 on 408 degrees of freedom
Multiple R-squared:  0.5823,    Adjusted R-squared:  0.5772 
F-statistic: 113.8 on 5 and 408 DF,  p-value: < 2.2e-16

Code

anova(reduced_mod, full_mod)

Analysis of Variance Table

Model 1: house_price ~ trans_date + house_age + dist_mrt + num_conv_stores + 
    latitude
Model 2: house_price ~ trans_date + house_age + dist_mrt + num_conv_stores + 
    latitude + longitude
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    408 31938                           
2    407 31933  1    5.1308 0.0654 0.7983

After controlling for the predictors in the reduced model, the additional predictor (longitude) does not significantly reduce residual variation ( F = 0.065, p = 0.798 ). Specifically, adding longitude only reduced the Residual Sum of Squares (RSS) by a negligible 5.13 units, which accounts for less than 0.02% of the total remaining variation. Therefore, the reduced model is preferred for parsimony.

Diagnostics (residuals, outliers, assumptions)

Code

plot(fitted(reduced_mod), resid(reduced_mod),
     pch = 19, cex = 0.7,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lwd = 2)
lines(lowess(fitted(reduced_mod), resid(reduced_mod), f = 2/3), lwd = 2)

Observation:

The residuals vs fitted plot reveals a subtle non-linear trend, evidenced by a slight curvature in the mean residual line. Additionally, the data points appear to form two distinct clusters, suggesting a possible bimodal distribution in the underlying features. Finally, two prominent outliers are visible, which may exert undue influence on the model’s coefficients.

Normal Q-Q plot (normality of errors)

Code

qqnorm(resid(reduced_mod), main = "Normal Q–Q Plot (Residuals)", pch = 19, cex = 0.7)
qqline(resid(reduced_mod), lwd = 2)

The Normal Q-Q plot reveals a significant right-tail deviation, indicating a departure from the assumption of normality. Specifically, two extreme observations suggest the presence of potential leverage points. The extreme observations are visible in the upper-right portion of the plot, and a less pronounced outlier in the lower-left. To formally identify whether these observations are statistically significant outliers that might bias the model estimates, a Bonferroni outlier screening will be conducted.

Bonferroni Outlier Screening

Code

rstud <- rstudent(reduced_mod)
n <- nobs(reduced_mod)
p <- length(coef(reduced_mod))
df <- n - p

alpha <- 0.05
cutoff_bonf <- abs(qt(alpha / (2 * n), df))

outlier_ids <- which(abs(rstud) > cutoff_bonf)

data.frame(
  ID = outlier_ids,
  R_student = rstud[outlier_ids]
)

     ID R_student
114 114 -4.122015
271 271  9.459045

The Bonferroni outlier test formally identified observations 114 and 271 as significant outliers. Observation 271 exhibited an externally studentized residual (R-student) of 9.459045, while observation 114 showed a score of -4.122015. Given that both absolute values exceed the critical threshold of 3, and confirmed by the Bonferroni adjusted p-values. The data points represent highly influential cases that warrant further investigation or removal to ensure model stability.

Checking the leverage

Code

hatvalues(reduced_mod)[outlier_ids]

        114         271 
0.008494229 0.013689602

Residuals DOF for the reduced model is 408. n = 408 + (no. of predictors) 5 = 413

We use this formula to determine leverage:
\[ h_{ii}>\frac{2p}n \]

2 * 5 (no. of predictors) / divided by n = 413

Based on these results we can say that the outliers do not have a leverage effect. Due to both values for 114 and 271 are less than our h_ii.

\[ Observation \ 114: \\ 0.0008... < 0.02421308 \\ Observation \ 271: \\ 0.0136... < 0.02421308 \]

Code

x = 10/413
print(cat("10/413 = ",{x}))

10/413 =  0.02421308NULL

Conclusion:

While observations 114 and 271 were identified as significant outliers, a leverage analysis reveals that neither point exceeds the critical threshold of 0.024. This indicates that these observations are primarily vertical outliers (anomalous in the response variable Y ) rather than high-leverage points (extreme in the predictor space, X).

However, for observation 271, the magnitude of the studentized residual is so extreme that it results in a high Cook’s Distance despite the lack of high leverage. This confirms that an observation does not require high leverage to be influential if its residual is sufficiently large.

Given that observation 271 was identified as both a significant outlier and a highly influential case (D_i = 0.170 ), I will proceed by refitting the model without this observation. This step will determine if the previously observed nonlinearity in the residuals is a structural issue or merely an artifact driven by a single influential data point.

Remove observation 271

Code

mydata_no271 <- dat[-271, ]
model_no271 <- lm(formula(reduced_mod), data = mydata_no271)

# compare models
summary(reduced_mod)


Call:
lm(formula = house_price ~ trans_date + house_age + dist_mrt + 
    num_conv_stores + latitude, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.623  -5.371  -1.020   4.244  75.346 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.596e+04  3.233e+03  -4.936 1.17e-06 ***
trans_date       5.135e+00  1.555e+00   3.303  0.00104 ** 
house_age       -2.694e-01  3.847e-02  -7.003 1.04e-11 ***
dist_mrt        -4.353e-03  4.899e-04  -8.887  < 2e-16 ***
num_conv_stores  1.136e+00  1.876e-01   6.056 3.17e-09 ***
latitude         2.269e+02  4.417e+01   5.136 4.36e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.848 on 408 degrees of freedom
Multiple R-squared:  0.5823,    Adjusted R-squared:  0.5772 
F-statistic: 113.8 on 5 and 408 DF,  p-value: < 2.2e-16

Code

summary(model_no271)


Call:
lm(formula = formula(reduced_mod), data = mydata_no271)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.451  -5.235  -1.025   4.305  33.042 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.479e+04  2.934e+03  -5.040 7.03e-07 ***
trans_date       4.620e+00  1.410e+00   3.276  0.00114 ** 
house_age       -2.616e-01  3.488e-02  -7.501 4.02e-13 ***
dist_mrt        -4.078e-03  4.450e-04  -9.163  < 2e-16 ***
num_conv_stores  1.283e+00  1.708e-01   7.510 3.78e-13 ***
latitude         2.213e+02  4.005e+01   5.526 5.84e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.021 on 407 degrees of freedom
Multiple R-squared:  0.6266,    Adjusted R-squared:  0.622 
F-statistic: 136.6 on 5 and 407 DF,  p-value: < 2.2e-16

The removal of observation 271 significantly improved the model’s goodness-of-fit, with the Adjusted R-squared increasing from 0.5772 to 0.622, a nearly five percentage point gain in explained variance. Furthermore, the Residual Standard Error (RSE) decreased from 8.848 to 8.021, indicating more precise predictions.

Despite these improvements in fit, a visual inspection of the residuals is required to determine if the underlying structural patterns have been resolved.

Plots for comparison:

Code

# reduced model
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(reduced_mod, sub.caption = "") 
mtext("Diagnostic Plots for Reduced Model", outer = TRUE, side = 3, cex = 1.5, line = 0)

Code

# new model without obs 271
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(model_no271, sub.caption = "") 
mtext("Diagnostic Plots for 'no' 271 Model", outer = TRUE, side = 3, cex = 1.5, line = 0)

Post-Diagnostic Model Assessment

Refitting the model without observation 271 resulted in a marginal improvement in the distribution of residuals but failed to resolve underlying structural issues. While the Normal Q-Q plot showed a slight reduction in right-tail deviation after the removal of this extreme outlier, the remaining residuals still exhibit non-normal behavior, particularly in the upper and lower tails (observation 114). Furthermore, the Scale-Location plot indicates that the mild heteroskedasticity persisted. The upward trend in spread at higher fitted values remains largely unchanged which suggests that the variance is not stabilized by the exclusion of a single influential point.

The Residuals vs Fitted and Residuals vs Leverage plots further confirm that the model’s primary deficiencies are systemic rather than anecdotal. The characteristic curvature in the LOESS line remained nearly identical after removal, indicating a persistent violation of the linearity assumption. Additionally, the removal of 271 merely redistributed influence to the next tier of obsevations, such as 149 and 313, rather than stabilizing the leverage structure. Because the exclusion of the most influential outlier failed to rectify the heteroskedasticity or the non-linear pattern, it is evident that the model suffers from functional misspecification, and not data contamination.

To addresss the persistent non-linearity and heteroscedasticity, I will perform a Box-Cox transformation anlaysis to determine the optimal functional form for the response variable.

Box-Cox method to determine transformation

Code

#use reduced model
# Run Box-Cox to see the log-likelihood plot
bc <- boxcox(reduced_mod, lambda = seq(-2, 2, 0.1))

Code

# Extract the exact optimal lambda
best_lambda <- bc$x[which.max(bc$y)]

The dashed vertical line represents the confidence interval for \(\lambda\).

Optimal \(\lambda\) value: The maximum log-likelihood is at the peak of the curve and occurs at a value very close to 0.

The Box-Cox power transformation plot identifies an optimal \(\lambda\) near zero. Because a 95% confidence interval for the maximum likelihood estimate encompasses \(\lambda\) = 0, a logarithmic transformation of the response variable ( Y) is statistically indicated to stabilize variance and linearize the relationship.

Log Transform of Response Variable on Reduced Model

Code

# Log transform response:
mod_logY <- lm(log(house_price) ~ trans_date + house_age + dist_mrt + num_conv_stores + latitude, data = dat)

summary(mod_logY)


Call:
lm(formula = log(house_price) ~ trans_date + house_age + dist_mrt + 
    num_conv_stores + latitude, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.68218 -0.11505  0.00055  0.11262  1.04395 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -4.665e+02  8.091e+01  -5.766 1.61e-08 ***
trans_date       1.358e-01  3.890e-02   3.491 0.000533 ***
house_age       -6.977e-03  9.625e-04  -7.248 2.13e-12 ***
dist_mrt        -1.495e-04  1.226e-05 -12.194  < 2e-16 ***
num_conv_stores  2.766e-02  4.694e-03   5.892 7.97e-09 ***
latitude         7.883e+00  1.105e+00   7.132 4.54e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2214 on 408 degrees of freedom
Multiple R-squared:  0.6857,    Adjusted R-squared:  0.6818 
F-statistic:   178 on 5 and 408 DF,  p-value: < 2.2e-16

The logarithmic transformation significantly enhanced the model’s performance, increasing the Adjusted R-squared from 0.577 to 0.682. This suggests that the log-linear functional form explains approximately 68% of the variance in house prices, a substantial improvement over the initial linear model. All predictors remain highly statistically significant with Residual Standard Error notably stabilized. Now, to confirm that this transformation successfully addressed the previous violations of linearity and homoscedasticity, I will now evaluate the updated diagnostic plot.

Updated Diagnostic Plot for log model Y

Code

par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(mod_logY, sub.caption = "") 
mtext("Diagnostic Plots for Mod Log Y", outer = TRUE, side = 3, cex = 1.5, line = 0)

Conclusion:

While the log-transformation of the response variable successfully linearized the model and improved overall fit, the persistence of heteroscedasticity in the Scale-Location plot suggests that the model is still struggling with non-constant variance. To formally diagnose this, I conducted a Breusch-Pagan Test.

Breusch-Pagan Test

Code

bptest(mod_logY)


    studentized Breusch-Pagan test

data:  mod_logY
BP = 8.8336, df = 5, p-value = 0.1159

The B-P test confirms that the residuals are not uniformly distributed wih the p = 0.1159 (greater than 0.05). Furthermore, a secondary Cook’s Distance analysis on the log-transformed model revealed that despite the improved R² certain observations continue to exert disproportionate influence on the regression coefficients.

Cook’s Distance Re-visited

Code

# 1. Calculate Cook's Distance
D <- cooks.distance(mod_logY)

# 2. Define your threshold (e.g., 4/n or 1.0)
n <- nrow(mod_logY$model)
threshold <- 4 / n 

# 3. Refit the model excluding points above the threshold
mod_no_outliers <- update(mod_logY, subset = D <= threshold)


# Compare Coefficients and R-squared
summary(mod_logY)


Call:
lm(formula = log(house_price) ~ trans_date + house_age + dist_mrt + 
    num_conv_stores + latitude, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.68218 -0.11505  0.00055  0.11262  1.04395 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -4.665e+02  8.091e+01  -5.766 1.61e-08 ***
trans_date       1.358e-01  3.890e-02   3.491 0.000533 ***
house_age       -6.977e-03  9.625e-04  -7.248 2.13e-12 ***
dist_mrt        -1.495e-04  1.226e-05 -12.194  < 2e-16 ***
num_conv_stores  2.766e-02  4.694e-03   5.892 7.97e-09 ***
latitude         7.883e+00  1.105e+00   7.132 4.54e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2214 on 408 degrees of freedom
Multiple R-squared:  0.6857,    Adjusted R-squared:  0.6818 
F-statistic:   178 on 5 and 408 DF,  p-value: < 2.2e-16

Code

summary(mod_no_outliers)


Call:
lm(formula = log(house_price) ~ trans_date + house_age + dist_mrt + 
    num_conv_stores + latitude, data = dat, subset = D <= threshold)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48694 -0.09694  0.00347  0.09920  0.47296 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -3.288e+02  6.005e+01  -5.476 7.90e-08 ***
trans_date       7.980e-02  2.868e-02   2.783  0.00566 ** 
house_age       -7.790e-03  7.275e-04 -10.707  < 2e-16 ***
dist_mrt        -1.730e-04  9.766e-06 -17.711  < 2e-16 ***
num_conv_stores  2.445e-02  3.488e-03   7.010 1.08e-11 ***
latitude         6.887e+00  8.826e-01   7.802 5.87e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1576 on 382 degrees of freedom
Multiple R-squared:  0.8102,    Adjusted R-squared:  0.8077 
F-statistic: 326.1 on 5 and 382 DF,  p-value: < 2.2e-16

Code

# Compare Residual Plots visually
par(mfrow = c(1, 2))
plot(mod_logY, which = 1, main = "With Outliers")
plot(mod_no_outliers, which = 1, main = "Without Outliers")

Code

par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(mod_no_outliers, sub.caption = "") 
mtext("Diagnostic Plots for Mod Log Y Without Outliers", outer = TRUE, side = 3, cex = 1.5, line = 0)

Still noting the same issues with non-linearity and homoscedasticity even after removing all outliers, I noticed a significant scale disparity in the distance to the nearest MRT station (dist_mrt), where values range form a minimum of 26 meters to a maximum of 6,488 meters. This extreme variance in the X space suggests that the relationship between distance and price may be multiplicative rather than additive. The importance of distance seems to make sense at closer values rather than values that are much further away resulting diminishing returns and not a strict linear relationship. Therefore, a log-log model may be necessary to stabilize the variance and more accurately capture the diminishing returns effect of the distance to the MRT station.

Quick Plot to Check a log-log model effect

Code

ggplot(data = dat, aes(x = log(house_price), y = log(dist_mrt))) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(title = "Scatter plot of log Y and log X3",
       x = "House Price Per Unit Area",
       y = "Distance in meters")

The double-log transformation of Log-Y and Log-X3 successfully linearized the relationship between distance and house price while preserving the expected inverse correlation. Visually, the observations now exhibit a tighter, more uniform grouping along the regression line, suggesting that the transformation has reduced the impact of extreme values and improved the model’s structural fit. This clear visual alignment provides strong empirical justification to proceed with a log-transformed X3 variable in the final model.

Log-Log Model

Code

mod_logY.X3 <- lm(log(house_price) ~ trans_date + house_age + log(dist_mrt) + num_conv_stores + latitude, data = dat)

summary(mod_logY.X3)


Call:
lm(formula = log(house_price) ~ trans_date + house_age + log(dist_mrt) + 
    num_conv_stores + latitude, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.61005 -0.10840  0.01182  0.10975  0.95976 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -5.974e+02  7.673e+01  -7.786 5.78e-14 ***
trans_date       1.674e-01  3.707e-02   4.516 8.25e-06 ***
house_age       -6.068e-03  9.180e-04  -6.610 1.21e-10 ***
log(dist_mrt)   -1.944e-01  1.333e-02 -14.582  < 2e-16 ***
num_conv_stores  1.027e-02  4.966e-03   2.068   0.0393 *  
latitude         1.062e+01  9.591e-01  11.076  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2097 on 408 degrees of freedom
Multiple R-squared:  0.7181,    Adjusted R-squared:  0.7146 
F-statistic: 207.8 on 5 and 408 DF,  p-value: < 2.2e-16

The implementation of the log-log transformation has produced the most robust thus far, with the adjusted R-squared increasing to 0.715. This indicates that the final model now explains approximately 71.5% of the variance in house prices. A substantial improvement over the initial linear mdoel’s 58% and the Log(Y) model’s 68%. Notably, the log-transformed distance to the MRT variable is highly significant with it’s p = <2e -16 confirming that the relationship between proximity to transit and property value is indeed elastic and non-linear. With the Residual Standard Error further reduced to 0.210, the model demonstrates improved predictive precision. I will now create the final residual plots to verify that this functional form has addressed the previously observed structural violations.

Residual Plot for Log-Log Model

Code

par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(mod_logY.X3, sub.caption = "") 
mtext("Diagnostic Plots for Mod Log Y and log X3", outer = TRUE, side = 3, cex = 1.5, line = 0)

The diagnostic plots for the log-log specification confirm that this model is the most statistically sound iteration. The Residuals vs Fitted plot now shows a a random, horizontal cloud of points with the previous curvature successfully eliminated, validating the linearity of the double-log relationship. While the Q-Q Residuals still exhibit some minor deviation in the extreme tails specifically 114 and 271, the vast majority of residuals now align closely with the theoretical normal distribution. Additionally, the Scale-Location plot indicates a more stabilized variance, and the Residuals vs Leverage plot shows no remaining observations exceeding the critical Cook’s Distance thresholds. Both the structural non-linearity of the predictors and the skewness of the response, this model provides the most reliable and precise estimates for house prices.

The Final Selected Model

The final selected model is the log–log specification:

\[ log(hosue\_price) \sim \beta_0 + \beta_1(trans_date) + \beta_2(house\_age) + \\ \beta_3log(dist_mart) + \beta_4(num_conv_stores) + \beta_5(latitude) + \epsilon \]

Final Coefficient Interpretations:

Transaction Date:

Holding all other variables constant, each one-year increase in the transaction date is associated with an approximate 16.7% increase in house prices.

House Age:

Holding all other variables constant, for every additional year of house age, the model predicts an approximate 0.6% decrease in house prices.

Log Distance to MRT:

This represents price elasticity, holding all other variables constant, a 1% increase in the distance to the nearest MRT station results in an approximate 0.19% decrease in house price.

Convenience Stores:

Each additional convenience store in the vicinity is associated with an expected 1.0% increase in house prices, holding all other variables constant.

Latitude:

Holding all other variables constant, for every 0.01 degree increase in latitude (moving approximately 1.1 km North), the house price is expected to increase by approximately 10.6%.

Code

op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# ---- Panel A: Reduced model (original scale) ----
y_obs_A <- dat$house_price
y_hat_A <- predict(reduced_mod)

plot(y_hat_A, y_obs_A,
     xlab = "Predicted house price",
     ylab = "Observed house price",
     main = "Reduced Model: Observed vs Predicted",
     pch = 19, cex = 0.7,
     cex.main = 0.9)

abline(a = 0, b = 1, lwd = 2, col="red")  # 45-degree line

# Add quick fit stats on plot
r2_A <- summary(reduced_mod)$r.squared
rmse_A <- sqrt(mean((y_obs_A - y_hat_A)^2))
mtext(sprintf("R² = %.3f | RMSE = %.2f", r2_A, rmse_A), side = 3, line = -1, adj = 0, cex = 0.9)

# ---- Panel B: Log model (log scale) ----
y_obs_B <- log(dat$house_price)
y_hat_B <- predict(mod_logY)  # assumes mod_logY predicts log(Y)

plot(y_hat_B, y_obs_B,
     xlab = "Predicted log(price)",
     ylab = "Observed log(price)",
     main = "Log Model: Observed vs Predicted",
     pch = 19, cex = 0.7,
     cex.main = 0.9)

abline(a = 0, b = 1, lwd = 2, col="red")

r2_B <- summary(mod_logY)$r.squared
rmse_B <- sqrt(mean((y_obs_B - y_hat_B)^2))
mtext(sprintf("R² = %.3f | RMSE = %.3f", r2_B, rmse_B), side = 3, line = -1, adj = 0, cex = 0.9)

Code

par(op)

Visually, the log model on the right shows a much tighter alignment with the 45-degree reference line, confirming that our transformation successfully stabilized the variance and secured a significant boost in predictive power.