No              X1.transaction.date  X2.house.age
Min.   :  1.0   Min.   :2013         Min.   : 0.000
1st Qu.:104.2   1st Qu.:2013         1st Qu.: 9.025
Median :207.5   Median :2013         Median :16.100
Mean   :207.5   Mean   :2013         Mean   :17.713
3rd Qu.:310.8   3rd Qu.:2013         3rd Qu.:28.150
Max.   :414.0   Max.   :2014         Max.   :43.800

X3.distance.to.the.nearest.MRT.station  X4.number.of.convenience.stores
Min.   :  23.38                         Min.   : 0.000
1st Qu.: 289.32                         1st Qu.: 1.000
Median : 492.23                         Median : 4.000
Mean   :1083.89                         Mean   : 4.094
3rd Qu.:1454.28                         3rd Qu.: 6.000
Max.   :6488.02                         Max.   :10.000

X5.latitude     X6.longitude    Y.house.price.of.unit.area
Min.   :24.93   Min.   :121.5   Min.   :  7.60
1st Qu.:24.96   1st Qu.:121.5   1st Qu.: 27.70
Median :24.97   Median :121.5   Median : 38.45
Mean   :24.97   Mean   :121.5   Mean   : 37.98
3rd Qu.:24.98   3rd Qu.:121.5   3rd Qu.: 46.60
Max.   :25.01   Max.   :121.6   Max.   :117.50
I conducted an overall F-test to determine the collective significance of the regression model. The null hypothesis posits that none of the chosen features have a relationship with the house price.
The model yielded a highly significant F-statistic of 94.59, with a p-value below \(2.2 \times 10^{-16}\).
Because the p-value is well below the standard alpha threshold of 0.05, we reject the null hypothesis. This result provides strong empirical evidence that at least one of the predictors in the model has a statistically significant relationship with house price, confirming that the model as a whole offers a significantly better fit than a null model (intercept only).
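As a quick check on the reported p-value, the upper tail of the F distribution can be evaluated directly. This is a minimal sketch that assumes the full model has 6 predictors and 407 residual degrees of freedom (the values that appear in the ANOVA table for the full model); the variable names are illustrative only.

```r
# Overall F-test p-value from the reported statistic.
# Assumed degrees of freedom: 6 predictors, 407 residual df (full model).
f_stat   <- 94.59
df_model <- 6
df_resid <- 407

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

# R prints "< 2.2e-16" when a p-value falls below machine epsilon
p_value < .Machine$double.eps
```

The printed "< 2.2e-16" is therefore a display floor rather than the exact p-value, which is far smaller.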
Interpretations:
trans_date ** (5.146227e+00)
Holding all other variables constant, each one-year increase in the transaction date is associated with an expected increase in house price of approximately 51,462 TWD per ping. (The response is recorded in units of 10,000 New Taiwan Dollars per ping.)
house_age *** (-2.696954e-01)
Holding all other variables constant, for each 1 year increase in the age of the house, we can expect a decrease of approximately 2,697 TWD per ping.
dist_mrt *** (-4.487461e-03)
Holding all other variables constant, for every 100-meter increase in distance to the nearest MRT station, the house price is expected to decrease by approximately 4,487 TWD per ping. (MRT refers to the Mass Rapid Transit metro network in Taiwan.)
num_conv_stores *** (1.133277e+00)
Holding all other variables constant, the addition of one convenience store within the immediate vicinity is associated with an expected increase in house price of approximately 11,333 TWD per ping.
latitude *** (2.254730e+02)
Holding all other variables constant, for every 0.01 degree increase in latitude (moving approximately 1.1 kilometers North), the house price is expected to increase by approximately 22,547 TWD per ping.
longitude - not significant
The longitude variable shows no evidence of a linear relationship with house price once the other factors are controlled for. The overall model is significant, but this result suggests that a reduced model without the redundant variable may be more appropriate for explaining the data.
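The unit conversions behind these interpretations can be reproduced in a few lines. The sketch below copies the coefficients listed above and assumes the response is in units of 10,000 TWD per ping; the names `coefs`, `step`, and `effect_twd` are illustrative.

```r
# Coefficients from the full model (response in units of 10,000 TWD per ping)
coefs <- c(trans_date = 5.146227, house_age = -0.2696954,
           dist_mrt = -4.487461e-03, num_conv_stores = 1.133277,
           latitude = 225.4730)

# Size of the change in each predictor used in the interpretation
step <- c(trans_date = 1, house_age = 1, dist_mrt = 100,
          num_conv_stores = 1, latitude = 0.01)

# Expected change in price, converted to TWD per ping
effect_twd <- coefs * step * 10000
round(effect_twd)
```

Scaling the step (100 meters, 0.01 degrees) before multiplying keeps each effect on a scale that is realistic for the data.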
Model Testing
Longitude (𝑋6) was not statistically significant.
\[
p = 0.79829
\]
A partial F-test confirmed that removing 𝑋6 did not significantly reduce explanatory power.
Call:
lm(formula = house_price ~ trans_date + house_age + dist_mrt +
num_conv_stores + latitude, data = dat)
Residuals:
Min 1Q Median 3Q Max
-35.623 -5.371 -1.020 4.244 75.346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.596e+04 3.233e+03 -4.936 1.17e-06 ***
trans_date 5.135e+00 1.555e+00 3.303 0.00104 **
house_age -2.694e-01 3.847e-02 -7.003 1.04e-11 ***
dist_mrt -4.353e-03 4.899e-04 -8.887 < 2e-16 ***
num_conv_stores 1.136e+00 1.876e-01 6.056 3.17e-09 ***
latitude 2.269e+02 4.417e+01 5.136 4.36e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.848 on 408 degrees of freedom
Multiple R-squared: 0.5823, Adjusted R-squared: 0.5772
F-statistic: 113.8 on 5 and 408 DF, p-value: < 2.2e-16
Code
anova(reduced_mod, full_mod)
Analysis of Variance Table
Model 1: house_price ~ trans_date + house_age + dist_mrt + num_conv_stores +
latitude
Model 2: house_price ~ trans_date + house_age + dist_mrt + num_conv_stores +
latitude + longitude
Res.Df RSS Df Sum of Sq F Pr(>F)
1 408 31938
2 407 31933 1 5.1308 0.0654 0.7983
After controlling for the predictors in the reduced model, the additional predictor (longitude) does not significantly reduce residual variation ( F = 0.065, p = 0.798 ). Specifically, adding longitude only reduced the Residual Sum of Squares (RSS) by a negligible 5.13 units, which accounts for less than 0.02% of the total remaining variation. Therefore, the reduced model is preferred for parsimony.
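For reference, the partial F-statistic in the table can be recovered by hand from the two residual sums of squares. Here \(q = 1\) is the number of predictors dropped and 407 is the residual degrees of freedom of the full model, both read from the ANOVA output:

\[
F = \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}})/q}{\mathrm{RSS}_{\text{full}}/\mathrm{df}_{\text{full}}}
  = \frac{5.1308/1}{31933/407}
  \approx \frac{5.1308}{78.46}
  \approx 0.0654
\]

The share of residual variation attributable to longitude is \(5.1308 / 31938 \approx 0.00016\), or about 0.016%.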
Diagnostics (residuals, outliers, assumptions)
Code
plot(fitted(reduced_mod), resid(reduced_mod),
     pch = 19, cex = 0.7,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lwd = 2)
lines(lowess(fitted(reduced_mod), resid(reduced_mod), f = 2/3), lwd = 2)
Observation:
The residuals vs fitted plot reveals a subtle non-linear trend, evidenced by a slight curvature in the mean residual line. Additionally, the data points appear to form two distinct clusters, suggesting a possible bimodal distribution in the underlying features. Finally, two prominent outliers are visible, which may exert undue influence on the model’s coefficients.
The Normal Q-Q plot reveals a pronounced right-tail deviation, indicating a departure from the assumption of normality. Two extreme observations are visible in the upper-right portion of the plot, along with a less pronounced one in the lower-left, which suggests the presence of potential outliers. To formally assess whether these observations are statistically significant outliers that might bias the model estimates, a Bonferroni outlier screening will be conducted.
The Bonferroni outlier test formally identified observations 114 and 271 as significant outliers. Observation 271 exhibited an externally studentized residual (R-student) of 9.459045, while observation 114 showed a value of -4.122015. Both absolute values exceed the conventional threshold of 3, and the Bonferroni-adjusted p-values confirm the result. These data points are extreme cases that warrant further investigation, and possibly removal, to ensure model stability.
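The Bonferroni adjustment itself is simple to reproduce from the reported residuals. The sketch below assumes n = 414 observations and 6 estimated coefficients (five slopes plus the intercept), so each externally studentized residual is compared with a t distribution on n - p - 1 degrees of freedom; these counts are inferred from the output rather than stated by the test.

```r
# Reported externally studentized residuals
t_stud <- c(`114` = -4.122015, `271` = 9.459045)

n  <- 414          # observations (assumed from the data summary)
p  <- 6            # estimated coefficients, including the intercept (assumed)
df <- n - p - 1    # df for an externally studentized residual

# Two-sided p-value for each residual, then Bonferroni adjustment (times n)
p_raw  <- 2 * pt(abs(t_stud), df = df, lower.tail = FALSE)
p_bonf <- pmin(1, n * p_raw)
p_bonf
```

Multiplying by n accounts for the fact that the largest of 414 residuals is being tested, not a residual chosen in advance.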
Checking the leverage
Code
hatvalues(reduced_mod)[outlier_ids]
114 271
0.008494229 0.013689602
The residual degrees of freedom for the reduced model are 408. With six estimated coefficients (five predictors plus the intercept), n = 408 + 6 = 414, which matches the 414 observations in the data.
We use this rule of thumb to flag high leverage: \[
h_{ii}>\frac{2p}n
\]
Taking p = 6 as the number of estimated coefficients, the cutoff is 2 × 6 / 414 ≈ 0.029.
Based on these results, the outliers do not have a leverage effect, because the hat values for both 114 and 271 are below this cutoff.
While observations 114 and 271 were identified as significant outliers, a leverage analysis reveals that neither point exceeds the critical threshold of 0.029. This indicates that these observations are primarily vertical outliers (anomalous in the response variable Y) rather than high-leverage points (extreme in the predictor space, X).
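The leverage comparison can be written out explicitly. This sketch assumes n = 414 and counts p = 6 estimated coefficients (a common convention for the 2p/n rule); counting only the five predictors gives a slightly smaller cutoff and the same conclusion.

```r
# Leverages reported for the two flagged observations
h <- c(`114` = 0.008494229, `271` = 0.013689602)

n <- 414  # observations (assumed from the data summary)
p <- 6    # five slopes plus the intercept (assumed convention)

cutoff <- 2 * p / n
round(cutoff, 4)

# TRUE would flag a high-leverage point
h > cutoff
```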
However, for observation 271, the magnitude of the studentized residual is so extreme that it results in a high Cook’s Distance despite the lack of high leverage. This confirms that an observation does not require high leverage to be influential if its residual is sufficiently large.
Given that observation 271 was identified as both a significant outlier and a highly influential case (\(D_i = 0.170\)), I will proceed by refitting the model without this observation. This step will determine if the previously observed nonlinearity in the residuals is a structural issue or merely an artifact driven by a single influential data point.
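The way a large residual and a modest leverage combine into Cook's Distance can be checked with the values already reported for observation 271. The sketch assumes n = 414 and p = 6 estimated coefficients, converts the externally studentized residual to its internally studentized form, and then applies the standard Cook's Distance identity.

```r
t_ext <- 9.459045      # externally studentized residual, obs 271
h     <- 0.013689602   # leverage, obs 271
n <- 414               # observations (assumed)
p <- 6                 # coefficients, including the intercept (assumed)

# Squared internally studentized residual recovered from the external one
r2_int <- t_ext^2 * (n - p) / (n - p - 1 + t_ext^2)

# Cook's distance: D_i = (r_i^2 / p) * h_ii / (1 - h_ii)
cooks_d <- (r2_int / p) * h / (1 - h)
round(cooks_d, 3)
```

The leverage term h / (1 - h) is only about 0.014 here, so nearly all of the influence comes from the squared residual.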
Remove observation 271
Code
# Refit the reduced model without observation 271
mydata_no271 <- dat[-271, ]
model_no271 <- lm(formula(reduced_mod), data = mydata_no271)
# compare models
summary(reduced_mod)
Call:
lm(formula = house_price ~ trans_date + house_age + dist_mrt +
num_conv_stores + latitude, data = dat)
Residuals:
Min 1Q Median 3Q Max
-35.623 -5.371 -1.020 4.244 75.346
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.596e+04 3.233e+03 -4.936 1.17e-06 ***
trans_date 5.135e+00 1.555e+00 3.303 0.00104 **
house_age -2.694e-01 3.847e-02 -7.003 1.04e-11 ***
dist_mrt -4.353e-03 4.899e-04 -8.887 < 2e-16 ***
num_conv_stores 1.136e+00 1.876e-01 6.056 3.17e-09 ***
latitude 2.269e+02 4.417e+01 5.136 4.36e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.848 on 408 degrees of freedom
Multiple R-squared: 0.5823, Adjusted R-squared: 0.5772
F-statistic: 113.8 on 5 and 408 DF, p-value: < 2.2e-16
Code
summary(model_no271)
Call:
lm(formula = formula(reduced_mod), data = mydata_no271)
Residuals:
Min 1Q Median 3Q Max
-35.451 -5.235 -1.025 4.305 33.042
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.479e+04 2.934e+03 -5.040 7.03e-07 ***
trans_date 4.620e+00 1.410e+00 3.276 0.00114 **
house_age -2.616e-01 3.488e-02 -7.501 4.02e-13 ***
dist_mrt -4.078e-03 4.450e-04 -9.163 < 2e-16 ***
num_conv_stores 1.283e+00 1.708e-01 7.510 3.78e-13 ***
latitude 2.213e+02 4.005e+01 5.526 5.84e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.021 on 407 degrees of freedom
Multiple R-squared: 0.6266, Adjusted R-squared: 0.622
F-statistic: 136.6 on 5 and 407 DF, p-value: < 2.2e-16
The removal of observation 271 noticeably improved the model’s goodness-of-fit, with the Adjusted R-squared increasing from 0.5772 to 0.622, a gain of roughly 4.5 percentage points in explained variance. Furthermore, the Residual Standard Error (RSE) decreased from 8.848 to 8.021, indicating more precise predictions.
Despite these improvements in fit, a visual inspection of the residuals is required to determine if the underlying structural patterns have been resolved.
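As a cross-check on the two fit summaries, the adjusted values follow from the multiple R-squared, the sample size n, and the number of predictors k = 5 (the symbol k is introduced here only for this check):

\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
\]

\[
1 - (1 - 0.5823)\,\frac{413}{408} \approx 0.5772
\qquad\text{and}\qquad
1 - (1 - 0.6266)\,\frac{412}{407} \approx 0.6220
\]

The first value uses all n = 414 observations and the second uses the n = 413 that remain after dropping observation 271.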
Plots for comparison:
Code
# reduced model
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(reduced_mod, sub.caption = "")
mtext("Diagnostic Plots for Reduced Model", outer = TRUE, side = 3, cex = 1.5, line = 0)
Code
# new model without obs 271
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(model_no271, sub.caption = "")
mtext("Diagnostic Plots for 'no' 271 Model", outer = TRUE, side = 3, cex = 1.5, line = 0)
Post-Diagnostic Model Assessment
Refitting the model without observation 271 resulted in a marginal improvement in the distribution of residuals but failed to resolve underlying structural issues. While the Normal Q-Q plot showed a slight reduction in right-tail deviation after the removal of this extreme outlier, the remaining residuals still exhibit non-normal behavior in both the upper and lower tails (the latter including observation 114). Furthermore, the Scale-Location plot indicates that the mild heteroskedasticity persisted. The upward trend in spread at higher fitted values remains largely unchanged, which suggests that the variance is not stabilized by the exclusion of a single influential point.
The Residuals vs Fitted and Residuals vs Leverage plots further confirm that the model’s primary deficiencies are systemic rather than anecdotal. The characteristic curvature in the LOESS line remained nearly identical after removal, indicating a persistent violation of the linearity assumption. Additionally, the removal of 271 merely redistributed influence to the next tier of observations, such as 149 and 313, rather than stabilizing the leverage structure. Because the exclusion of the most influential outlier failed to rectify the heteroskedasticity or the non-linear pattern, the evidence points to functional misspecification rather than data contamination.
To address the persistent non-linearity and heteroskedasticity, I will perform a Box-Cox transformation analysis to determine the optimal functional form for the response variable.
Box-Cox method to determine transformation
Code
library(MASS)  # provides boxcox()
# use reduced model
# Run Box-Cox to see the log-likelihood plot
bc <- boxcox(reduced_mod, lambda = seq(-2, 2, 0.1))
Code
# Extract the exact optimal lambda
best_lambda <- bc$x[which.max(bc$y)]
The dashed vertical lines mark the 95% confidence interval for \(\lambda\).
Optimal \(\lambda\) value: the maximum log-likelihood occurs at the peak of the curve, at a value very close to 0.
The Box-Cox power transformation plot identifies an optimal \(\lambda\) near zero. Because a 95% confidence interval for the maximum likelihood estimate encompasses \(\lambda\) = 0, a logarithmic transformation of the response variable ( Y) is statistically indicated to stabilize variance and linearize the relationship.
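The confidence interval read off the plot can also be computed numerically from the profile log-likelihood. The following is a minimal sketch on simulated data (the objects `x`, `y`, and `fit` are hypothetical stand-ins, not the housing data), using the usual likelihood-ratio cutoff of `qchisq(0.95, 1) / 2` below the maximum.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated data (hypothetical): the response is log-linear in x,
# so lambda = 0 is the appropriate transformation by construction
x <- runif(200, 0, 4)
y <- exp(1 + 0.8 * x + rnorm(200, sd = 0.3))
fit <- lm(y ~ x)

# Profile log-likelihood over a grid of lambda values (no plot)
bc <- boxcox(fit, lambda = seq(-2, 2, 0.01), plotit = FALSE)

# Grid value with the highest log-likelihood
best_lambda <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[in_ci])

best_lambda
ci
```

If the interval contains 0, a log transformation is a defensible choice; if it contains 1, no transformation is indicated.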
Log Transform of Response Variable on Reduced Model
Call:
lm(formula = log(house_price) ~ trans_date + house_age + dist_mrt +
num_conv_stores + latitude, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.68218 -0.11505 0.00055 0.11262 1.04395
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.665e+02 8.091e+01 -5.766 1.61e-08 ***
trans_date 1.358e-01 3.890e-02 3.491 0.000533 ***
house_age -6.977e-03 9.625e-04 -7.248 2.13e-12 ***
dist_mrt -1.495e-04 1.226e-05 -12.194 < 2e-16 ***
num_conv_stores 2.766e-02 4.694e-03 5.892 7.97e-09 ***
latitude 7.883e+00 1.105e+00 7.132 4.54e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2214 on 408 degrees of freedom
Multiple R-squared: 0.6857, Adjusted R-squared: 0.6818
F-statistic: 178 on 5 and 408 DF, p-value: < 2.2e-16
The logarithmic transformation substantially improved the model’s fit, increasing the Adjusted R-squared from 0.577 to 0.682. The log-linear functional form explains approximately 68% of the variance in log house prices, compared with about 58% of the variance in raw prices for the initial linear model. All predictors remain highly statistically significant, and the Residual Standard Error is now 0.2214 on the log scale. To confirm that this transformation addressed the previous violations of linearity and homoscedasticity, I will now evaluate the updated diagnostic plots.
Updated Diagnostic Plot for log model Y
Code
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(mod_logY, sub.caption = "")
mtext("Diagnostic Plots for Mod Log Y", outer = TRUE, side = 3, cex = 1.5, line = 0)
Conclusion:
While the log-transformation of the response variable successfully linearized the model and improved overall fit, the persistence of heteroscedasticity in the Scale-Location plot suggests that the model is still struggling with non-constant variance. To formally diagnose this, I conducted a Breusch-Pagan Test.
Breusch-Pagan Test
Code
library(lmtest)  # provides bptest()
bptest(mod_logY)
studentized Breusch-Pagan test
data: mod_logY
BP = 8.8336, df = 5, p-value = 0.1159
The B-P test does not reject the null hypothesis of constant error variance (BP = 8.83, df = 5, p = 0.1159, which is greater than 0.05), so the heteroscedasticity suggested by the Scale-Location plot is not statistically confirmed. However, a secondary Cook’s Distance analysis on the log-transformed model revealed that, despite the improved \(R^2\), certain observations continue to exert disproportionate influence on the regression coefficients.
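When reading a Breusch-Pagan result, it helps to keep the direction of the hypothesis in view: the null hypothesis is constant error variance, and the p-value is the upper tail of a chi-square distribution. A short sketch using the reported statistic (the names `bp_stat`, `bp_df`, and `alpha` are illustrative):

```r
bp_stat <- 8.8336  # reported BP statistic
bp_df   <- 5       # one df per predictor in the auxiliary regression

# Upper-tail chi-square probability
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
round(p_value, 4)

# H0: constant error variance; reject only if p_value < alpha
alpha <- 0.05
reject_h0 <- p_value < alpha
reject_h0
```

A p-value above alpha means the test provides no significant evidence of heteroscedasticity; it does not prove that the variance is constant.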
Cook’s Distance Re-visited
Code
# 1. Calculate Cook's Distance
D <- cooks.distance(mod_logY)
# 2. Define your threshold (e.g., 4/n or 1.0)
n <- nrow(mod_logY$model)
threshold <- 4 / n
# 3. Refit the model excluding points above the threshold
mod_no_outliers <- update(mod_logY, subset = D <= threshold)
# Compare Coefficients and R-squared
summary(mod_logY)
Call:
lm(formula = log(house_price) ~ trans_date + house_age + dist_mrt +
num_conv_stores + latitude, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.68218 -0.11505 0.00055 0.11262 1.04395
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.665e+02 8.091e+01 -5.766 1.61e-08 ***
trans_date 1.358e-01 3.890e-02 3.491 0.000533 ***
house_age -6.977e-03 9.625e-04 -7.248 2.13e-12 ***
dist_mrt -1.495e-04 1.226e-05 -12.194 < 2e-16 ***
num_conv_stores 2.766e-02 4.694e-03 5.892 7.97e-09 ***
latitude 7.883e+00 1.105e+00 7.132 4.54e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2214 on 408 degrees of freedom
Multiple R-squared: 0.6857, Adjusted R-squared: 0.6818
F-statistic: 178 on 5 and 408 DF, p-value: < 2.2e-16
Code
summary(mod_no_outliers)
Call:
lm(formula = log(house_price) ~ trans_date + house_age + dist_mrt +
num_conv_stores + latitude, data = dat, subset = D <= threshold)
Residuals:
Min 1Q Median 3Q Max
-0.48694 -0.09694 0.00347 0.09920 0.47296
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.288e+02 6.005e+01 -5.476 7.90e-08 ***
trans_date 7.980e-02 2.868e-02 2.783 0.00566 **
house_age -7.790e-03 7.275e-04 -10.707 < 2e-16 ***
dist_mrt -1.730e-04 9.766e-06 -17.711 < 2e-16 ***
num_conv_stores 2.445e-02 3.488e-03 7.010 1.08e-11 ***
latitude 6.887e+00 8.826e-01 7.802 5.87e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1576 on 382 degrees of freedom
Multiple R-squared: 0.8102, Adjusted R-squared: 0.8077
F-statistic: 326.1 on 5 and 382 DF, p-value: < 2.2e-16
Code
# Compare Residual Plots visually
par(mfrow = c(1, 2))
plot(mod_logY, which = 1, main = "With Outliers")
plot(mod_no_outliers, which = 1, main = "Without Outliers")
Code
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(mod_no_outliers, sub.caption = "")
mtext("Diagnostic Plots for Mod Log Y Without Outliers", outer = TRUE, side = 3, cex = 1.5, line = 0)
Because the same issues with non-linearity and heteroscedasticity persisted even after removing the observations flagged by Cook’s Distance, I examined the predictors more closely and noticed a significant scale disparity in the distance to the nearest MRT station (dist_mrt), where values range from a minimum of about 23 meters to a maximum of 6,488 meters. This wide spread in the predictor space suggests that the relationship between distance and price may be multiplicative rather than additive. Differences in distance appear to matter most at short distances and much less at long distances, which implies diminishing returns rather than a strictly linear relationship. Therefore, a log-log model may be necessary to stabilize the variance and more accurately capture the diminishing returns effect of the distance to the MRT station.
Quick Plot to Check a log-log model effect
Code
library(ggplot2)
ggplot(data = dat, aes(x = log(house_price), y = log(dist_mrt))) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(title = "Scatter plot of log Y and log X3",
       x = "log(House Price Per Unit Area)",
       y = "log(Distance in meters)")
The double-log transformation of Log-Y and Log-X3 successfully linearized the relationship between distance and house price while preserving the expected inverse correlation. Visually, the observations now exhibit a tighter, more uniform grouping along the regression line, suggesting that the transformation has reduced the impact of extreme values and improved the model’s structural fit. This clear visual alignment provides strong empirical justification to proceed with a log-transformed X3 variable in the final model.
Call:
lm(formula = log(house_price) ~ trans_date + house_age + log(dist_mrt) +
num_conv_stores + latitude, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.61005 -0.10840 0.01182 0.10975 0.95976
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.974e+02 7.673e+01 -7.786 5.78e-14 ***
trans_date 1.674e-01 3.707e-02 4.516 8.25e-06 ***
house_age -6.068e-03 9.180e-04 -6.610 1.21e-10 ***
log(dist_mrt) -1.944e-01 1.333e-02 -14.582 < 2e-16 ***
num_conv_stores 1.027e-02 4.966e-03 2.068 0.0393 *
latitude 1.062e+01 9.591e-01 11.076 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2097 on 408 degrees of freedom
Multiple R-squared: 0.7181, Adjusted R-squared: 0.7146
F-statistic: 207.8 on 5 and 408 DF, p-value: < 2.2e-16
The log-log transformation has produced the most robust model thus far, with the adjusted R-squared increasing to 0.715. The final model now explains approximately 71.5% of the variance in log house prices, a substantial improvement over the initial linear model’s 58% and the Log(Y) model’s 68%. Notably, the log-transformed distance to the MRT variable is highly significant (p < 2e-16), confirming that the relationship between proximity to transit and property value is non-linear and well summarized by a constant elasticity. With the Residual Standard Error further reduced to 0.210 (from 0.221 for the Log(Y) model), the model demonstrates improved predictive precision. I will now create the final residual plots to verify that this functional form has addressed the previously observed structural violations.
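To make the elasticity reading concrete: with both the response and \(X_3\) on the log scale, a proportional change in distance maps to a proportional change in price, holding the other predictors fixed. Using the estimated coefficient \(\beta_3 = -0.1944\) from the output above:

\[
\frac{Y_{\text{new}}}{Y_{\text{old}}} = \left(\frac{X_{3,\text{new}}}{X_{3,\text{old}}}\right)^{\beta_3}
\]

\[
1.01^{-0.1944} \approx 0.9981 \quad (\text{about } -0.19\%),
\qquad
2^{-0.1944} \approx 0.874 \quad (\text{about } -12.6\%)
\]

So a 1% increase in distance lowers the expected price by roughly 0.19%, while doubling the distance lowers it by roughly 12.6%, whether the move is from 100 m to 200 m or from 2 km to 4 km.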
Residual Plot for Log-Log Model
Code
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(mod_logY.X3, sub.caption = "")
mtext("Diagnostic Plots for Mod Log Y and log X3", outer = TRUE, side = 3, cex = 1.5, line = 0)
The diagnostic plots for the log-log specification confirm that this model is the most statistically sound iteration. The Residuals vs Fitted plot now shows a random, horizontal cloud of points with the previous curvature successfully eliminated, validating the linearity of the double-log relationship. While the Q-Q Residuals still exhibit some minor deviation in the extreme tails, specifically at observations 114 and 271, the vast majority of residuals now align closely with the theoretical normal distribution. Additionally, the Scale-Location plot indicates a more stabilized variance, and the Residuals vs Leverage plot shows no remaining observations exceeding the critical Cook’s Distance thresholds. Because it addresses both the structural non-linearity of the predictors and the skewness of the response, this model provides the most reliable and precise estimates for house prices.
The Final Selected Model
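The final selected model is the log–log specification:
\[
\log(\widehat{Y}) = -597.4 + 0.1674\,X_1 - 0.006068\,X_2 - 0.1944\,\log(X_3) + 0.01027\,X_4 + 10.62\,X_5
\]
Transaction Date:
Holding all other variables constant, each one-year increase in the transaction date is associated with an approximate 16.7% increase in house prices.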
House Age:
Holding all other variables constant, for every additional year of house age, the model predicts an approximate 0.6% decrease in house prices.
Log Distance to MRT:
This represents price elasticity, holding all other variables constant, a 1% increase in the distance to the nearest MRT station results in an approximate 0.19% decrease in house price.
Convenience Stores:
Each additional convenience store in the vicinity is associated with an expected 1.0% increase in house prices, holding all other variables constant.
Latitude:
Holding all other variables constant, for every 0.01 degree increase in latitude (moving approximately 1.1 km North), the house price is expected to increase by approximately 10.6%.
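The percentages above treat each coefficient as a direct percentage change, which is a good approximation for small coefficients. For larger ones the exact change is `100 * (exp(b) - 1)`. A brief sketch with the coefficients copied from the log-log output (the vector names are illustrative):

```r
# Coefficients from the log-log model output
b <- c(trans_date = 0.1674, house_age = -0.006068,
       num_conv_stores = 0.01027, latitude = 10.62)

# Change in each predictor used in the interpretation
step <- c(trans_date = 1, house_age = 1, num_conv_stores = 1, latitude = 0.01)

# Exact percentage change in price: 100 * (exp(b * step) - 1)
pct_exact <- 100 * (exp(b * step) - 1)
round(pct_exact, 2)

# log(dist_mrt) is an elasticity: effect of a 1% increase in distance
pct_dist <- 100 * (1.01^(-0.1944) - 1)
round(pct_dist, 2)
```

The approximation and the exact value nearly coincide for house age and convenience stores, while for transaction date and latitude the exact changes are somewhat larger (about 18.2% and 11.2%).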
Code
op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# ---- Panel A: Reduced model (original scale) ----
y_obs_A <- dat$house_price
y_hat_A <- predict(reduced_mod)
plot(y_hat_A, y_obs_A,
     xlab = "Predicted house price",
     ylab = "Observed house price",
     main = "Reduced Model: Observed vs Predicted",
     pch = 19, cex = 0.7, cex.main = 0.9)
abline(a = 0, b = 1, lwd = 2, col = "red")  # 45-degree line
# Add quick fit stats on plot
r2_A <- summary(reduced_mod)$r.squared
rmse_A <- sqrt(mean((y_obs_A - y_hat_A)^2))
mtext(sprintf("R² = %.3f | RMSE = %.2f", r2_A, rmse_A), side = 3, line = -1, adj = 0, cex = 0.9)

# ---- Panel B: Log model (log scale) ----
y_obs_B <- log(dat$house_price)
y_hat_B <- predict(mod_logY)  # assumes mod_logY predicts log(Y)
plot(y_hat_B, y_obs_B,
     xlab = "Predicted log(price)",
     ylab = "Observed log(price)",
     main = "Log Model: Observed vs Predicted",
     pch = 19, cex = 0.7, cex.main = 0.9)
abline(a = 0, b = 1, lwd = 2, col = "red")
r2_B <- summary(mod_logY)$r.squared
rmse_B <- sqrt(mean((y_obs_B - y_hat_B)^2))
mtext(sprintf("R² = %.3f | RMSE = %.3f", r2_B, rmse_B), side = 3, line = -1, adj = 0, cex = 0.9)
Code
par(op)
Visually, the log model on the right shows a much tighter alignment with the 45-degree reference line, confirming that our transformation successfully stabilized the variance and secured a significant boost in predictive power.
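Because the two panels are drawn on different scales, their RMSE values are not directly comparable. One way to put them on equal footing is to back-transform the log-model predictions before computing the error. The sketch below illustrates the idea on simulated data; `x`, `y`, `lin_mod`, and `log_mod` are hypothetical stand-ins, and this comparison is a suggestion rather than part of the analysis above.

```r
set.seed(42)
# Simulated data (hypothetical), standing in for the housing data
x <- runif(300, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(300, sd = 0.25))

lin_mod <- lm(y ~ x)
log_mod <- lm(log(y) ~ x)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Linear model: predictions are already on the original scale
rmse_lin <- rmse(y, predict(lin_mod))

# Log model: back-transform before comparing
# (exp() of a log-scale prediction estimates the conditional median, not the mean)
rmse_log <- rmse(y, exp(predict(log_mod)))

c(linear = rmse_lin, log_backtransformed = rmse_log)
```

Applied to the real models, the same two lines would use `predict(reduced_mod)` and `exp(predict(mod_logY))` against `dat$house_price`.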