The purpose of this project is to perform a regression analysis using the data collected for home appraisal of the properties from 6309-6350 88th Street in Lubbock, TX. Using this regression, the objective is to determine whether or not the appraisal of the property at 6321 88th Street is reasonable.
library(readr)
urlfile='https://raw.githubusercontent.com/dhostas/Stats-Term-Project-Data/main/Home%20Appraisal%20Project%20Data.csv'
data<-read.csv(url(urlfile))
rownames(data)<-data$House..
# Assigning House # as observation #
data<-data[,-1]
# Removal of House # as a predictor
data<-data[-6,]
data<-data[-13,]
# Removal of Houses 6314 & 6322
data<-data[,-4]
# Removal of Homestead Cap Loss predictor
To begin the treatment of the data, the house number column in the original data frame was assigned as the observation name for easier identification of the properties, and the column itself was then deleted because it is not needed in the regression. Next, the houses at 6314 and 6322 were treated as outliers and removed, because they were the only properties that included a pool house and could skew the regression. Finally, the “Homestead Cap Loss” predictor was removed. It applies only to the current tax year and will only increase over time, and a large portion of the observations record a value of 0 for it, which would strongly affect the regression. Homestead cap loss can further be defined, specifically in Texas, as a tax break given to homestead owners on taxes due on their property. It is calculated by limiting the annual increase in the appraised value to at most 10% of the previous year’s appraisal.
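The positional deletions above (`data[-6,]`, then `data[-13,]`) rely on knowing where each house sits after the previous deletion. As a minimal sketch, the same removal can be done by row name. The data frame `toy` and its market values are made up for illustration and are not the project data.

```r
# Toy stand-in for the appraisal data; the market values are made up.
toy <- data.frame(House = 6309:6324,
                  MV = seq(450000, by = 10000, length.out = 16))
rownames(toy) <- toy$House
toy <- toy[, "MV", drop = FALSE]   # drop the house-number column

# Drop the two pool-house properties by name instead of by position.
drop_houses <- c("6314", "6322")
toy_kept <- toy[!(rownames(toy) %in% drop_houses), , drop = FALSE]
nrow(toy_kept)   # 14 of the 16 toy rows remain
```

Selecting by name keeps the code correct even if rows are reordered or another house is removed first.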
str(data)
## 'data.frame': 40 obs. of 9 variables:
## $ X2023.Market.Value : int 754207 703481 590665 617689 471762 563861 762392 595123 491900 479919 ...
## $ Total.Improvement.Market.Value: int 696207 642796 532665 553939 426609 518917 694921 550179 444305 434975 ...
## $ Total.Land.Market.Value : int 58000 60685 58000 63750 45153 44944 67471 44944 47595 44944 ...
## $ Total.Main.Area : int 4525 4304 4001 4186 2747 3524 4022 3852 3812 2902 ...
## $ Main.Area : int 3462 3226 3036 3277 2241 2582 3147 2910 2842 2382 ...
## $ Main.Area.Value : int 610216 558773 466010 490753 387261 445738 617642 487891 385151 396067 ...
## $ Garage.Area : int 1063 1078 965 909 506 942 875 942 970 520 ...
## $ Garage.Value : int 85991 84023 66655 63186 39348 73179 77279 62288 59154 38908 ...
## $ Land : int 10000 10463 10000 10625 7785 7749 11633 7749 8206 7749 ...
MV<-data$X2023.Market.Value
TIMV<-data$Total.Improvement.Market.Value
TLMV<-data$Total.Land.Market.Value
TMA<-data$Total.Main.Area
MA<-data$Main.Area
MAV<-data$Main.Area.Value
GA<-data$Garage.Area
GV<-data$Garage.Value
L<-data$Land
The str() output above shows that every column is stored as an integer, which is acceptable for this regression. Each column was then assigned to a short variable name for convenience in the regression code.
The predictors considered in this project are the aspects of each property that contribute or potentially contribute to the home’s assessed value. These predictors will be considered in different forms and combinations in order to provide the best possible model(s) that shows the importance of certain predictors for future reference.
mod<-lm(MV~TMA+MA+GA+L, data=data)
summary(mod)
##
## Call:
## lm(formula = MV ~ TMA + MA + GA + L, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94261 -3099 3978 10217 85284
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34079.839 51189.955 0.666 0.50981
## TMA 71.541 29.303 2.441 0.01967 *
## MA 46.205 43.977 1.051 0.30042
## GA NA NA NA NA
## L 18.042 5.707 3.162 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33210 on 36 degrees of freedom
## Multiple R-squared: 0.7869, Adjusted R-squared: 0.7691
## F-statistic: 44.3 on 3 and 36 DF, p-value: 3.597e-12
The initial first-order regression without interactions shows significance for the predictors TMA (Total Main Area) and L (Land). The coefficient for GA is reported as NA because TMA is the sum of MA and GA, so the three area predictors are linearly dependent and only two of them can be estimated at once. The variables TIMV (Total Improvement Market Value), TLMV (Total Land Market Value), MAV (Main Area Value), and GV (Garage Value) are not considered as predictors because they are direct contributors to the response variable.
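The linear dependence behind the NA coefficient for GA can be checked directly. The sketch below uses only the first ten rows printed in the str() output above; the vector names ending in `10` are introduced here for illustration.

```r
# First ten rows of the data, copied from the str() output above.
MV10  <- c(754207, 703481, 590665, 617689, 471762,
           563861, 762392, 595123, 491900, 479919)
TMA10 <- c(4525, 4304, 4001, 4186, 2747, 3524, 4022, 3852, 3812, 2902)
MA10  <- c(3462, 3226, 3036, 3277, 2241, 2582, 3147, 2910, 2842, 2382)
GA10  <- c(1063, 1078,  965,  909,  506,  942,  875,  942,  970,  520)
L10   <- c(10000, 10463, 10000, 10625, 7785, 7749, 11633, 7749, 8206, 7749)

# Total main area is exactly main area plus garage area in every row.
all(TMA10 == MA10 + GA10)

# lm() drops the last of the dependent columns and reports its coefficient as NA.
fit10 <- lm(MV10 ~ TMA10 + MA10 + GA10 + L10)
is.na(coef(fit10))
```

Because of this dependence, any two of TMA, MA, and GA carry the same information as all three.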
modinter<-lm(MV~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, data = data)
summary(modinter)
##
## Call:
## lm(formula = MV ~ TMA + MA + GA + L + TMA:MA + TMA:GA + TMA:L +
## MA:GA + MA:L + GA:L, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86921 -4385 3429 11591 91818
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.566e+05 9.227e+05 0.820 0.418
## TMA -3.921e+02 4.958e+02 -0.791 0.435
## MA 4.041e+02 7.533e+02 0.536 0.595
## GA NA NA NA NA
## L -7.651e+01 1.389e+02 -0.551 0.586
## TMA:MA -1.883e-02 5.539e-02 -0.340 0.736
## TMA:GA 2.083e-01 2.787e-01 0.747 0.460
## TMA:L 1.462e-02 5.266e-02 0.278 0.783
## MA:GA -1.830e-01 3.737e-01 -0.490 0.628
## MA:L 1.185e-02 7.629e-02 0.155 0.878
## GA:L NA NA NA NA
##
## Residual standard error: 34620 on 31 degrees of freedom
## Multiple R-squared: 0.8005, Adjusted R-squared: 0.749
## F-statistic: 15.55 on 8 and 31 DF, p-value: 6.729e-09
The summary of the first-order regression with interactions shows no significance in any single predictor. Although the R-squared value increases from 0.7869 to 0.8005 because more terms were added, the adjusted R-squared drops from 0.7691 to 0.749 and the overall p-value of the model increases from 3.597e-12 to 6.729e-09. At this stage it is reasonable to conclude that the interactions are not significant enough to be included in the model.
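The adjusted R-squared values in the two summaries can be reproduced from the reported R-squared, the 40 observations, and the number of estimated slope terms (3 and 8, matching the F-statistic degrees of freedom). The helper name `adj_r2` is introduced here for illustration.

```r
# Adjusted R-squared: penalizes R-squared for the number of slope terms p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_no_inter <- adj_r2(0.7869, n = 40, p = 3)   # first-order model
adj_inter    <- adj_r2(0.8005, n = 40, p = 8)   # model with interactions

round(adj_no_inter, 4)   # 0.7691
round(adj_inter, 3)      # 0.749
```

The interaction model explains slightly more variance but pays for five extra terms, so its adjusted value is lower.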
The “dredge” function from the MuMIn package and stepwise regression will be used to identify the “best” model and to confirm that the choice is consistent across methods.
library(MuMIn)
dredge1<-lm(MV~TMA+MA+GA+L, data=data, na.action="na.fail")
summary.fit<-dredge(dredge1)
## Fixed term is "(Intercept)"
head(summary.fit)
According to the “dredge” analysis, the top four models are models 11, 8, 12, and 15. They rank highest on the information criterion used by “dredge” (AICc by default), where lower values are better.
dredge_11<-lm(MV~TMA+L, data=data)
summary(dredge_11)
##
## Call:
## lm(formula = MV ~ TMA + L, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100648 -4988 3618 11196 83766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61828.656 43912.153 1.408 0.16748
## TMA 98.093 14.853 6.604 9.59e-08 ***
## L 19.107 5.624 3.397 0.00164 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33260 on 37 degrees of freedom
## Multiple R-squared: 0.7803, Adjusted R-squared: 0.7685
## F-statistic: 65.72 on 2 and 37 DF, p-value: 6.652e-13
dredge_8<-lm(MV~MA+GA+L, data=data)
summary(dredge_8)
##
## Call:
## lm(formula = MV ~ MA + GA + L, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94261 -3099 3978 10217 85284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34079.839 51189.955 0.666 0.50981
## MA 117.746 23.872 4.932 1.85e-05 ***
## GA 71.541 29.303 2.441 0.01967 *
## L 18.042 5.707 3.162 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33210 on 36 degrees of freedom
## Multiple R-squared: 0.7869, Adjusted R-squared: 0.7691
## F-statistic: 44.3 on 3 and 36 DF, p-value: 3.597e-12
dredge_12<-lm(MV~TMA+GA+L, data=data)
summary(dredge_12)
##
## Call:
## lm(formula = MV ~ TMA + GA + L, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94261 -3099 3978 10217 85284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34079.839 51189.955 0.666 0.50981
## TMA 117.746 23.872 4.932 1.85e-05 ***
## GA -46.205 43.977 -1.051 0.30042
## L 18.042 5.707 3.162 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33210 on 36 degrees of freedom
## Multiple R-squared: 0.7869, Adjusted R-squared: 0.7691
## F-statistic: 44.3 on 3 and 36 DF, p-value: 3.597e-12
dredge_15<-lm(MV~TMA+MA+L, data=data)
summary(dredge_15)
##
## Call:
## lm(formula = MV ~ TMA + MA + L, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94261 -3099 3978 10217 85284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34079.839 51189.955 0.666 0.50981
## TMA 71.541 29.303 2.441 0.01967 *
## MA 46.205 43.977 1.051 0.30042
## L 18.042 5.707 3.162 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33210 on 36 degrees of freedom
## Multiple R-squared: 0.7869, Adjusted R-squared: 0.7691
## F-statistic: 44.3 on 3 and 36 DF, p-value: 3.597e-12
With the models from the “dredge” analysis narrowed down to the above models, said models will be compared with ANOVA.
anova(dredge_11,dredge_8)
anova(dredge_11,dredge_12)
anova(dredge_11,dredge_15)
Based on the “dredge” ranking and the ANOVA comparisons above, the “best” models are MV ~ TMA + L and MV ~ MA + GA + L. Put simply, the first model says that the market value of a home depends only on the land and the total main area square footage. The second model is also being considered because TMA is the sum of MA and GA; it allows the main area and the garage area to carry different weights instead of sharing one coefficient. The three-predictor models 8, 12, and 15 have identical residuals and R-squared values for the same reason: they are different parameterizations of one fit. Both models are being considered due to their similarity.
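Since TMA = MA + GA, the coefficients of the three-predictor models can be converted into one another by substitution. The short check below starts from the MA and GA coefficients reported for MV ~ MA + GA + L and recovers the coefficients printed for the other two parameterizations; the variable names are introduced here for illustration.

```r
# Coefficients reported in summary(dredge_8): MV ~ MA + GA + L
b_MA <- 117.746
b_GA <- 71.541

# Substitute GA = TMA - MA:  b_MA*MA + b_GA*GA = b_GA*TMA + (b_MA - b_GA)*MA
coef_TMA_MA <- c(TMA = b_GA, MA = b_MA - b_GA)   # matches summary(dredge_15)

# Substitute MA = TMA - GA:  b_MA*MA + b_GA*GA = b_MA*TMA + (b_GA - b_MA)*GA
coef_TMA_GA <- c(TMA = b_MA, GA = b_GA - b_MA)   # matches summary(dredge_12)

coef_TMA_MA
coef_TMA_GA
```

The recovered values (46.205 for MA and -46.205 for GA) agree with the summaries above, which is why the three fits share the same residuals.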
The last built-in R function used to determine the best model is stepwise regression. It comes in three forms: backward, forward, and both. For this project, since only certain variables are being considered as predictors, only forward and backward stepwise regression will be used.
backreg_sc<-lm(MV~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, data = data)
step(backreg_sc,direction="backward")
## Start: AIC=843.99
## MV ~ TMA + MA + GA + L + TMA:MA + TMA:GA + TMA:L + MA:GA + MA:L +
## GA:L
##
##
## Step: AIC=843.99
## MV ~ TMA + MA + GA + L + TMA:MA + TMA:GA + TMA:L + MA:GA + MA:L
##
## Df Sum of Sq RSS AIC
## - MA:L 1 28922364 3.7193e+10 842.02
## - TMA:L 1 92442914 3.7256e+10 842.09
## - TMA:MA 1 138491127 3.7302e+10 842.14
## - MA:GA 1 287394135 3.7451e+10 842.30
## - TMA:GA 1 669678623 3.7834e+10 842.70
## <none> 3.7164e+10 843.99
##
## Step: AIC=842.02
## MV ~ TMA + MA + GA + L + TMA:MA + TMA:GA + TMA:L + MA:GA
##
## Df Sum of Sq RSS AIC
## - TMA:MA 1 119314428 3.7312e+10 840.15
## - MA:GA 1 296992886 3.7490e+10 840.34
## - TMA:L 1 574084758 3.7767e+10 840.63
## - TMA:GA 1 647976839 3.7841e+10 840.71
## <none> 3.7193e+10 842.02
##
## Step: AIC=840.15
## MV ~ TMA + MA + GA + L + TMA:GA + TMA:L + MA:GA
##
## Df Sum of Sq RSS AIC
## - MA:GA 1 334860791 3.7647e+10 838.51
## - TMA:L 1 475784284 3.7788e+10 838.65
## - TMA:GA 1 584188259 3.7896e+10 838.77
## <none> 3.7312e+10 840.15
##
## Step: AIC=838.51
## MV ~ TMA + MA + GA + L + TMA:GA + TMA:L
##
##
## Step: AIC=838.51
## MV ~ TMA + GA + L + TMA:GA + TMA:L
##
## Df Sum of Sq RSS AIC
## - TMA:L 1 276068929 3.7923e+10 836.80
## - TMA:GA 1 307059392 3.7954e+10 836.83
## <none> 3.7647e+10 838.51
##
## Step: AIC=836.8
## MV ~ TMA + GA + L + TMA:GA
##
## Df Sum of Sq RSS AIC
## - TMA:GA 1 1785427866 3.9708e+10 836.64
## <none> 3.7923e+10 836.80
## - L 1 7279204174 4.5202e+10 841.82
##
## Step: AIC=836.64
## MV ~ TMA + GA + L
##
## Df Sum of Sq RSS AIC
## - GA 1 1.2176e+09 4.0926e+10 835.85
## <none> 3.9708e+10 836.64
## - L 1 1.1025e+10 5.0734e+10 844.44
## - TMA 1 2.6835e+10 6.6543e+10 855.29
##
## Step: AIC=835.85
## MV ~ TMA + L
##
## Df Sum of Sq RSS AIC
## <none> 4.0926e+10 835.85
## - L 1 1.2767e+10 5.3693e+10 844.71
## - TMA 1 4.8247e+10 8.9173e+10 865.00
##
## Call:
## lm(formula = MV ~ TMA + L, data = data)
##
## Coefficients:
## (Intercept) TMA L
## 61828.66 98.09 19.11
forwardreg_sc<-lm(MV~1, data = data)
step(forwardreg_sc,scope=~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, direction="forward")
## Start: AIC=892.47
## MV ~ 1
##
## Df Sum of Sq RSS AIC
## + TMA 1 1.3261e+11 5.3693e+10 844.71
## + MA 1 1.2596e+11 6.0343e+10 849.38
## + L 1 9.7130e+10 8.9173e+10 865.00
## + GA 1 7.1402e+10 1.1490e+11 875.14
## <none> 1.8630e+11 892.47
##
## Step: AIC=844.71
## MV ~ TMA
##
## Df Sum of Sq RSS AIC
## + L 1 1.2767e+10 4.0926e+10 835.85
## + MA 1 2.9595e+09 5.0734e+10 844.44
## + GA 1 2.9595e+09 5.0734e+10 844.44
## <none> 5.3693e+10 844.71
##
## Step: AIC=835.85
## MV ~ TMA + L
##
## Df Sum of Sq RSS AIC
## <none> 4.0926e+10 835.85
## + TMA:L 1 1304973001 3.9621e+10 836.55
## + MA 1 1217598557 3.9708e+10 836.64
## + GA 1 1217598557 3.9708e+10 836.64
##
## Call:
## lm(formula = MV ~ TMA + L, data = data)
##
## Coefficients:
## (Intercept) TMA L
## 61828.66 98.09 19.11
Both the forward and the backward stepwise regressions agree with the “dredge” analysis above: according to these functions, the “best” model is MV ~ TMA + L.
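The AIC values printed by step() for a linear model can be reproduced from the residual sum of squares as n*log(RSS/n) + 2*k, where k is the number of estimated coefficients. The sketch below checks two rows of the output above; the helper name `step_aic` is introduced here for illustration.

```r
# AIC as reported by step() for lm fits (up to an additive constant).
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

aic_null <- step_aic(1.8630e11, n = 40, k = 1)   # MV ~ 1
aic_best <- step_aic(4.0926e10, n = 40, k = 3)   # MV ~ TMA + L

round(aic_null, 2)
round(aic_best, 2)
```

These agree with the 892.47 and 835.85 shown in the step() traces, up to rounding of the printed RSS.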
bestmod<-lm(MV~TMA+L, data=data)
plot(bestmod)
bestmod2<-lm(MV~MA+GA+L, data=data)
plot(bestmod2)
After observing the adequacy plots of the “best” models, there is an increasing pattern in the constant variance plots and skewness with light tails in the normal probability plots. Because of this, the models will next be analyzed through Box-Cox analysis to see whether a transformation is needed.
The Box-Cox analysis determines whether a power transformation is needed by examining the value of lambda in the plot. If lambda = 1 lies inside the 95% confidence interval, there is no major need for a transformation. Otherwise, the response will need a transformation for model adequacy.
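The rule described above can also be checked numerically instead of by eye. The following is a minimal sketch on simulated data (not the appraisal data): it takes the profile log-likelihood returned by `MASS::boxcox`, finds the maximizing lambda, and builds the approximate 95% interval from the usual chi-square cutoff. The variable names are illustrative assumptions.

```r
library(MASS)

# Simulated data whose response is roughly log-normal (an assumption for the demo).
set.seed(1)
x <- runif(60, 1, 10)
y <- exp(0.3 * x + rnorm(60, sd = 0.2))
fit <- lm(y ~ x)

# Profile log-likelihood over the default lambda grid, without drawing the plot.
bc <- boxcox(fit, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1)/2 of the maximum.
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[bc$y >= cutoff])

# TRUE would mean no strong need for a transformation.
one_in_ci <- ci[1] <= 1 && 1 <= ci[2]
c(lambda_hat = lambda_hat, lower = ci[1], upper = ci[2])
```

The interval is only as fine as the lambda grid, so a denser `lambda` argument gives a sharper estimate.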
library(MASS)
b<-boxcox(bestmod)
b2<-boxcox(bestmod2)
According to the Box-Cox plots above, the value of lambda = 1 is clearly not in the 95% confidence intervals for either model. Next, the value for lambda will be found at the maximum of the function in the Box-Cox plots to determine the power of the transformations.
lambda<-b$x
likelihood<-b$y
lambda[which.max(likelihood)]
## [1] -1.191919
The lambda value needed for the power transformation of “bestmod” is -1.191919.
lambda<-b2$x
likelihood<-b2$y
lambda[which.max(likelihood)]
## [1] -1.272727
The lambda value needed for the power transformation of “bestmod2” is -1.272727.
tfMV<-MV^(-1.191919)
tfMV2<-MV^(-1.272727)
Now that the response has been transformed, a new regression can be made to observe if there was any change in the model adequacy.
tfmod<-lm(tfMV~TMA+L, data=data)
summary(tfmod)
##
## Call:
## lm(formula = tfMV ~ TMA + L, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.681e-08 -4.572e-09 -2.360e-09 3.260e-09 3.111e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.766e-07 1.213e-08 22.804 < 2e-16 ***
## TMA -2.939e-11 4.103e-12 -7.162 1.73e-08 ***
## L -3.824e-12 1.554e-12 -2.461 0.0186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.188e-09 on 37 degrees of freedom
## Multiple R-squared: 0.7724, Adjusted R-squared: 0.7601
## F-statistic: 62.79 on 2 and 37 DF, p-value: 1.28e-12
plot(tfmod)
tfmod2<-lm(tfMV2~MA+GA+L, data=data)
summary(tfmod2)
##
## Call:
## lm(formula = tfMV2 ~ MA + GA + L, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.147e-09 -1.558e-09 -5.725e-10 8.047e-10 1.090e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.016e-07 5.132e-09 19.795 < 2e-16 ***
## MA -1.334e-11 2.393e-12 -5.574 2.58e-06 ***
## GA -7.314e-12 2.938e-12 -2.489 0.0176 *
## L -1.241e-12 5.721e-13 -2.168 0.0368 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.33e-09 on 36 degrees of freedom
## Multiple R-squared: 0.7826, Adjusted R-squared: 0.7644
## F-statistic: 43.19 on 3 and 36 DF, p-value: 5.137e-12
plot(tfmod2)
After the transformation, the model is more adequate in terms of the normal probability plot and constant variance, with the exception of a few outliers (6316, 6318, 6324) that still need to be addressed.
Eliminating the above-mentioned outliers could potentially solidify the adequacy of this transformed model.
dataelim<-data[-7,]
dataelim<-dataelim[-8,]
dataelim<-dataelim[-12,]
tf_mod_elim<-lm(tfMV~TMA+L, data=dataelim)
summary(tf_mod_elim)
##
## Call:
## lm(formula = tfMV ~ TMA + L, data = dataelim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.681e-08 -4.572e-09 -2.360e-09 3.260e-09 3.111e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.766e-07 1.213e-08 22.804 < 2e-16 ***
## TMA -2.939e-11 4.103e-12 -7.162 1.73e-08 ***
## L -3.824e-12 1.554e-12 -2.461 0.0186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.188e-09 on 37 degrees of freedom
## Multiple R-squared: 0.7724, Adjusted R-squared: 0.7601
## F-statistic: 62.79 on 2 and 37 DF, p-value: 1.28e-12
plot(tf_mod_elim)
tf_mod_elim2<-lm(tfMV2~MA+GA+L, data=dataelim)
summary(tf_mod_elim2)
##
## Call:
## lm(formula = tfMV2 ~ MA + GA + L, data = dataelim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.147e-09 -1.558e-09 -5.725e-10 8.047e-10 1.090e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.016e-07 5.132e-09 19.795 < 2e-16 ***
## MA -1.334e-11 2.393e-12 -5.574 2.58e-06 ***
## GA -7.314e-12 2.938e-12 -2.489 0.0176 *
## L -1.241e-12 5.721e-13 -2.168 0.0368 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.33e-09 on 36 degrees of freedom
## Multiple R-squared: 0.7826, Adjusted R-squared: 0.7644
## F-statistic: 43.19 on 3 and 36 DF, p-value: 5.137e-12
plot(tf_mod_elim2)
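The summaries of the refitted models are identical to those of “tfmod” and “tfmod2”, with the same coefficients and the same 37 and 36 residual degrees of freedom. This is because tfMV, tfMV2, TMA, MA, GA, and L are global vectors of length 40 rather than columns of “dataelim”, so lm() takes them from the workspace and still uses all 40 observations. As written, the elimination therefore changes neither the r-squared values nor the p-values, and the models fitted on the full data are carried forward.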
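To make a refit actually drop observations, the formula has to refer to columns of the reduced data frame. The sketch below illustrates the difference on simulated data; the data frame `toy`, its values, and the `_g` vectors are made up for illustration.

```r
# Simulated stand-in data (not the appraisal data).
set.seed(42)
toy <- data.frame(Total.Main.Area = round(runif(20, 2500, 4500)),
                  Land = round(runif(20, 7500, 11500)))
toy$MV <- 60000 + 100 * toy$Total.Main.Area + 19 * toy$Land +
  rnorm(20, sd = 20000)

# Global copies of the columns, as in the report's workflow.
MV_g  <- toy$MV
TMA_g <- toy$Total.Main.Area
L_g   <- toy$Land

toy_elim <- toy[-c(3, 7), ]   # remove two rows

# Variables come from the workspace: all 20 rows are still used.
fit_global <- lm(MV_g ~ TMA_g + L_g, data = toy_elim)

# Variables come from the data frame: only the 18 remaining rows are used.
fit_cols <- lm(MV ~ Total.Main.Area + Land, data = toy_elim)

c(global = nobs(fit_global), columns = nobs(fit_cols))
```

For the project data, a formula such as `I(X2023.Market.Value^(-1.191919)) ~ Total.Main.Area + Land` with `data = dataelim` would follow the same pattern.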
fitted_tfmod <- predict(tfmod,data.frame(data))
conf_tfmod<-predict(tfmod,data.frame(data),interval="confidence")
pred_tfmod<-predict(tfmod,data.frame(data),interval="prediction")
fitted_trans_back<-exp(log(fitted_tfmod)/-1.191919)
conf_trans_back<-exp(log(conf_tfmod)/-1.191919)
pred_trans_back<-exp(log(pred_tfmod)/-1.191919)
fitted_tfmod2<- predict(tfmod2,data.frame(data))
conf_tfmod2<-predict(tfmod2,data.frame(data),interval="confidence")
pred_tfmod2<-predict(tfmod2,data.frame(data),interval="prediction")
fitted_trans_back2<-exp(log(fitted_tfmod2)/-1.272727)
conf_trans_back2<-exp(log(conf_tfmod2)/-1.272727)
pred_trans_back2<-exp(log(pred_tfmod2)/-1.272727)
data$X2023.Market.Value[12]
## [1] 552073
conf_trans_back[12,]
## fit lwr upr
## 534383.3 545591.7 523667.6
pred_trans_back[12,]
## fit lwr upr
## 534383.3 599110.7 483237.9
conf_trans_back2[12,]
## fit lwr upr
## 537530.1 549802.0 525863.5
pred_trans_back2[12,]
## fit lwr upr
## 537530.1 602762.7 486403.4
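In the output above, the value printed under lwr is larger than the one under upr. A power transformation with a negative exponent is decreasing, so back-transforming reverses the order of the bounds. The sketch below demonstrates this with the prediction-interval values printed for “tfmod”; the helper name `back` is introduced here for illustration.

```r
lambda <- -1.191919

# Back-transformation used in the report: exp(log(z)/lambda) equals z^(1/lambda).
back <- function(z) z^(1 / lambda)

# Market values in increasing order (taken from pred_trans_back[12,]).
mv <- c(483237.9, 534383.3, 599110.7)
z  <- mv^lambda              # transformed scale: the order is reversed
all(diff(z) < 0)             # TRUE, the transformed values decrease

# The round trip recovers the original values.
all.equal(back(z), mv)
all.equal(exp(log(z) / lambda), back(z))

# Relabel the bounds so that lwr <= upr on the original scale.
bounds <- c(lwr = 545591.7, upr = 523667.6)   # as printed for conf_trans_back[12,]
fixed  <- c(lwr = min(bounds), upr = max(bounds))
fixed
```

Swapping the labels after back-transforming keeps tables and plot legends consistent with the original scale.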
The plots of the confidence and prediction intervals for the “tfmod” model are shown below.
conf_fit <- conf_trans_back[,1]
conf_lwr <- conf_trans_back[,2]
conf_upr <- conf_trans_back[,3]
pred_fit <- pred_trans_back[,1]
pred_lwr <- pred_trans_back[,2]
pred_upr <- pred_trans_back[,3]
house_nums <- seq(from = 6309, to = 6350, by = 1)
# Remove values 6314 and 6322
house_nums <- house_nums[house_nums != 6314 & house_nums != 6322]
plot(house_nums, y=fitted_trans_back,
xlab = "House Number",
ylab = "MV from 'tfmod'",
main = "Confidence/Prediction Intervals on Fitted MV",
pch = 16,
col = "blue",
ylim = c(450000, 750000),
xlim = c(6309, 6350))
lines(house_nums,fitted_trans_back,col="black")
lines(house_nums,conf_lwr,col="red")
lines(house_nums,conf_upr, col="red")
lines(house_nums,pred_lwr, col="green")
lines(house_nums,pred_upr, col="green")
legend("topright", legend=c("fitted values","confidence interval","prediction interval"), col=c("black","red","green"),
lty = 1,
lwd = 1)
The assessed 2023 Market Value for the property at 6321 88th Street ($552,073) falls inside the 95% prediction intervals of both models but outside their 95% confidence intervals. Because the transformation power is negative, back-transforming reverses the order of the bounds, so the values printed under lwr and upr are swapped. Based on this regression, the assessed value is therefore not considered a reasonable estimate of the home’s appraisal value, because it lies above the upper confidence bound of each model ($545,592 and $549,802).
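A 95% confidence interval can be further defined as an interval estimate of the mean market value for properties with these predictor values, constructed so that about 95% of such intervals would contain the true mean under repeated sampling. The prediction interval, by contrast, covers the value of an individual property and is therefore wider. Since the assessed value lies above the confidence interval, it is higher than the model’s estimate of the typical value for a comparable property, which is the basis for calling the assessed value unreasonable here.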
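The difference between the two kinds of interval can be seen with `predict()` on any linear model. The following is a minimal sketch on simulated data (not the appraisal data), with made-up variable names.

```r
# Simulated data: value as a linear function of area plus noise.
set.seed(7)
area  <- runif(40, 2500, 4500)
value <- 60000 + 120 * area + rnorm(40, sd = 30000)
fit <- lm(value ~ area)

new_home <- data.frame(area = 3500)

# Interval for the mean value of homes with this area.
conf_int <- predict(fit, new_home, interval = "confidence", level = 0.95)
# Interval for the value of one individual home with this area.
pred_int <- predict(fit, new_home, interval = "prediction", level = 0.95)

rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

Both intervals share the same fitted value, but the prediction interval also accounts for the scatter of individual homes around the mean, so it always contains the confidence interval.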
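After performing the above analysis, the two models ultimately decided on are “tfmod” (Model 1), which regresses MV^(-1.191919) on TMA and L, and “tfmod2” (Model 2), which regresses MV^(-1.272727) on MA, GA, and L.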
Both of these models are very similar in terms of adequacy, r-squared, and overall p-values, but produce slightly different responses. Both are under transformations to improve model adequacy, so when using either model the response will need to be transformed back. Next, both models will be exhibited to show how the models predict the Market Value of the properties based on regression and how it compares to the given Market Values.
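The equations for the two models, using the coefficients from the summaries above, are as follows.
Model 1: MV^(-1.191919) = 2.766e-07 - 2.939e-11(TMA) - 3.824e-12(L)
Model 2: MV^(-1.272727) = 1.016e-07 - 1.334e-11(MA) - 7.314e-12(GA) - 1.241e-12(L)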
eval_tfMV<-(2.766e-07 - 2.939e-11*(data$Total.Main.Area) - 3.824e-12*(data$Land))
eval_MV<-exp(log(eval_tfMV)/-1.191919)
eval_MV[12]
## [1] 534511.3
data$X2023.Market.Value[12]
## [1] 552073
Based on the output above, the model regressed on the Total Main Area (TMA) and Land (L) predictors indicates that the assessed value of the home is greater than the model’s estimate by about $17,600 ($552,073 versus $534,511).
eval_tfMV2<-(1.016e-07 - 1.334e-11*(data$Main.Area) - 7.314e-12*(data$Garage.Area) - 1.241e-12*(data$Land))
eval_MV2<-exp(log(eval_tfMV2)/-1.272727)
eval_MV2[12]
## [1] 537459.1
The evaluation of “Model 2” gives a result similar to that of “Model 1”: the model regressed on the Main Area (MA), Garage Area (GA), and Land (L) predictors indicates that the assessed value of the home is still greater than the model’s estimate, by about $14,600 ($552,073 versus $537,459).
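The two gaps between the assessed value and the model estimates, and their average, follow directly from the numbers printed above. The variable names are introduced here for illustration.

```r
assessed  <- 552073
model_est <- c(model1 = 534511.3, model2 = 537459.1)

# Amount by which the assessed value exceeds each model's estimate.
gaps <- assessed - model_est
round(gaps)         # model1: 17562, model2: 14614

# Average gap across the two models.
round(mean(gaps))   # 16088
```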
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
eval_MV_dataset<-cbind(data$X2023.Market.Value,eval_MV,data$Total.Main.Area,data$Land)
rownames(eval_MV_dataset)<-rownames(data)
colnames(eval_MV_dataset)<-c("Market.Value.2023","eval_MV","Total.Main.Area","Land")
eval_MV_dataset<-as.data.frame(eval_MV_dataset)
less_than<-filter(eval_MV_dataset, eval_MV_dataset$eval_MV >= eval_MV_dataset$Market.Value.2023)
greater<-filter(eval_MV_dataset, eval_MV_dataset$eval_MV <= eval_MV_dataset$Market.Value.2023)
count(less_than)
count(greater)
eval_MV_dataset2<-cbind(data$X2023.Market.Value,eval_MV2,data$Total.Main.Area,data$Land)
rownames(eval_MV_dataset2)<-rownames(data)
colnames(eval_MV_dataset2)<-c("Market.Value.2023","eval_MV2","Total.Main.Area","Land")
eval_MV_dataset2<-as.data.frame(eval_MV_dataset2)
less_than2<-filter(eval_MV_dataset2, eval_MV_dataset2$eval_MV2 >= eval_MV_dataset2$Market.Value.2023)
greater2<-filter(eval_MV_dataset2, eval_MV_dataset2$eval_MV2 <= eval_MV_dataset2$Market.Value.2023)
count(less_than2)
count(greater2)
Furthermore, observing the entire neighborhood with the regression models concluded in the project, a total of 14 properties have appraisal values less than the evaluated regression values and 26 properties have appraisal values greater than the regressed values. This is accurate for both regression models being observed.
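The counts can also be obtained with plain comparisons, using strict inequalities so that a tie would not be counted in both groups. The sketch below uses made-up values for the comparison and then converts the reported counts of 14 and 26 into percentages.

```r
# Hypothetical assessed and model-based values for five homes.
assessed <- c(500000, 520000, 610000, 480000, 555000)
modelled <- c(510000, 515000, 600000, 495000, 540000)

under <- sum(assessed < modelled)   # assessed below the model estimate: 2
over  <- sum(assessed > modelled)   # assessed above the model estimate: 3

# Percentages for the counts reported in the text.
reported <- c(under = 14, over = 26)
pct <- 100 * reported / sum(reported)
pct   # under: 35, over: 65
```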
In conclusion, based on the regression and analysis performed in this project, both models indicate that the assessed value of the property at 6321 88th Street is too high, by an average of roughly $16,100 between the two models (gaps of about $17,600 and $14,600). Also, when looking at all the properties in the neighborhood contributing to this analysis, 35% of them are undervalued and 65% of them are overvalued based on the two regression models.