Objective

The purpose of this project is to perform a regression-based statistical analysis using the home appraisal data collected for the properties at 6309-6350 88th Street in Lubbock, TX. Using this regression, the objective is to determine whether or not the appraisal of the property at 6321 88th Street is reasonable.

Data Pre-treatment

Importing and cleaning the data.

library(readr)
urlfile='https://raw.githubusercontent.com/dhostas/Stats-Term-Project-Data/main/Home%20Appraisal%20Project%20Data.csv'
data<-read.csv(url(urlfile))
rownames(data)<-data$House..
  # Assigning House # as observation #
data<-data[,-1]
  # Removal of House # as a predictor
data<-data[-6,]
data<-data[-13,]
  # Removal of Houses 6314 & 6322
data<-data[,-4]
  # Removal of Homestead Cap Loss predictor 

To begin the treatment of the data, the house number column in the original data frame was assigned as the observation (row) name for easier identification of the properties, and that column was then deleted since it is not needed in the regression. Next, the houses at 6314 and 6322 were treated as outliers because they are the only properties that include a pool house, which could potentially skew the regression. Finally, the “Homestead Cap Loss” predictor was removed: it applies only to the current tax year and will only increase over time, and a large portion of observations record a 0 for this predictor, which would heavily affect the regression. Homestead cap loss can further be defined, specifically in Texas, as a tax break given to homestead owners on the taxes due on their property; it is calculated by limiting the increase in appraised value to at most 10% over the previous year’s appraised value.
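As a simple illustration of that calculation (a sketch; all values below are hypothetical and not taken from the data set):

# Hypothetical example of the homestead cap: the appraised value may rise at
# most 10% over the prior year's appraised value, and the "cap loss" is the
# market value in excess of that capped value.
prev_appraised <- 500000
market_value   <- 580000
capped_value   <- min(market_value, prev_appraised * 1.10)   # 550000
cap_loss       <- market_value - capped_value                # 30000
cap_loss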

The next step in pre-treatment is to observe the structure of the data.

str(data)
## 'data.frame':    40 obs. of  9 variables:
##  $ X2023.Market.Value            : int  754207 703481 590665 617689 471762 563861 762392 595123 491900 479919 ...
##  $ Total.Improvement.Market.Value: int  696207 642796 532665 553939 426609 518917 694921 550179 444305 434975 ...
##  $ Total.Land.Market.Value       : int  58000 60685 58000 63750 45153 44944 67471 44944 47595 44944 ...
##  $ Total.Main.Area               : int  4525 4304 4001 4186 2747 3524 4022 3852 3812 2902 ...
##  $ Main.Area                     : int  3462 3226 3036 3277 2241 2582 3147 2910 2842 2382 ...
##  $ Main.Area.Value               : int  610216 558773 466010 490753 387261 445738 617642 487891 385151 396067 ...
##  $ Garage.Area                   : int  1063 1078 965 909 506 942 875 942 970 520 ...
##  $ Garage.Value                  : int  85991 84023 66655 63186 39348 73179 77279 62288 59154 38908 ...
##  $ Land                          : int  10000 10463 10000 10625 7785 7749 11633 7749 8206 7749 ...
MV<-data$X2023.Market.Value
TIMV<-data$Total.Improvement.Market.Value
TLMV<-data$Total.Land.Market.Value
TMA<-data$Total.Main.Area
MA<-data$Main.Area
MAV<-data$Main.Area.Value
GA<-data$Garage.Area
GV<-data$Garage.Value
L<-data$Land

In the code above, the structure of the data was examined and all predictors are stored as integers, which is acceptable for this regression. Each column, or predictor, was then assigned to a shorter variable name for convenience in the regression code.

The predictors considered in this project are the aspects of each property that contribute, or potentially contribute, to the home’s assessed value. These predictors will be considered in different forms and combinations in order to provide the best possible model(s) and to show the importance of certain predictors for future reference.

Initial Regression

Raw Regressions

mod<-lm(MV~TMA+MA+GA+L, data=data)
summary(mod)
## 
## Call:
## lm(formula = MV ~ TMA + MA + GA + L, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94261  -3099   3978  10217  85284 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 34079.839  51189.955   0.666  0.50981   
## TMA            71.541     29.303   2.441  0.01967 * 
## MA             46.205     43.977   1.051  0.30042   
## GA                 NA         NA      NA       NA   
## L              18.042      5.707   3.162  0.00318 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33210 on 36 degrees of freedom
## Multiple R-squared:  0.7869, Adjusted R-squared:  0.7691 
## F-statistic:  44.3 on 3 and 36 DF,  p-value: 3.597e-12

The initial raw first-order regression without interactions shows significance for the predictors TMA (Total Main Area) and L (Land). The coefficient for GA (Garage Area) is reported as NA because TMA is the sum of MA and GA, so one of the three is aliased (the singularity noted in the output). The variables TIMV (Total Improvement Market Value), TLMV (Total Land Market Value), MAV (Main Area Value), and GV (Garage Value) are not considered as predictors because they are direct components of the response variable.
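The aliasing can be verified directly (a quick sketch) by checking whether Total Main Area equals the sum of Main Area and Garage Area for every observation:

# If this returns TRUE, TMA is an exact linear combination of MA and GA,
# which explains the NA (aliased) coefficient for GA above
all(TMA == MA + GA)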

modinter<-lm(MV~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, data = data)
summary(modinter)
## 
## Call:
## lm(formula = MV ~ TMA + MA + GA + L + TMA:MA + TMA:GA + TMA:L + 
##     MA:GA + MA:L + GA:L, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -86921  -4385   3429  11591  91818 
## 
## Coefficients: (2 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.566e+05  9.227e+05   0.820    0.418
## TMA         -3.921e+02  4.958e+02  -0.791    0.435
## MA           4.041e+02  7.533e+02   0.536    0.595
## GA                  NA         NA      NA       NA
## L           -7.651e+01  1.389e+02  -0.551    0.586
## TMA:MA      -1.883e-02  5.539e-02  -0.340    0.736
## TMA:GA       2.083e-01  2.787e-01   0.747    0.460
## TMA:L        1.462e-02  5.266e-02   0.278    0.783
## MA:GA       -1.830e-01  3.737e-01  -0.490    0.628
## MA:L         1.185e-02  7.629e-02   0.155    0.878
## GA:L                NA         NA      NA       NA
## 
## Residual standard error: 34620 on 31 degrees of freedom
## Multiple R-squared:  0.8005, Adjusted R-squared:  0.749 
## F-statistic: 15.55 on 8 and 31 DF,  p-value: 6.729e-09

Observing the summary of the initial first-order regression with interactions shows no significance for any single predictor. Although the R-squared value increases slightly because more terms were added, the overall p-value of the model becomes larger (less significant). At this point it is safe to say that interactions between the predictors will not be significant enough to be included in the model; this is checked more formally with the nested-model comparison below.
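As a complementary check (a sketch, not part of the original output), the first-order model and the model with all two-way interactions can be compared directly with a partial F-test, since the two models are nested:

# Partial F-test: a large p-value supports dropping the interaction terms
anova(mod, modinter)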

After the initial raw regression, the “best” model is being considered as MV ~ TMA + L.

Dredge, Step Regression, and ANOVA

The “dredge” function and stepwise regression will be used to confirm that the stated “best” model is accurate.

Dredge

library(MuMIn)
dredge1<-lm(MV~TMA+MA+GA+L, data=data, na.action="na.fail")
summary.fit<-dredge(dredge1)
## Fixed term is "(Intercept)"
head(summary.fit)

According to the “dredge” analysis, the top models are models 11, 8, 12, and 15, ranked by their AIC values.

Observing the models from the dredge analysis

dredge_11<-lm(MV~TMA+L, data=data)
summary(dredge_11)
## 
## Call:
## lm(formula = MV ~ TMA + L, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -100648   -4988    3618   11196   83766 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 61828.656  43912.153   1.408  0.16748    
## TMA            98.093     14.853   6.604 9.59e-08 ***
## L              19.107      5.624   3.397  0.00164 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33260 on 37 degrees of freedom
## Multiple R-squared:  0.7803, Adjusted R-squared:  0.7685 
## F-statistic: 65.72 on 2 and 37 DF,  p-value: 6.652e-13
dredge_8<-lm(MV~MA+GA+L, data=data)
summary(dredge_8)
## 
## Call:
## lm(formula = MV ~ MA + GA + L, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94261  -3099   3978  10217  85284 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34079.839  51189.955   0.666  0.50981    
## MA            117.746     23.872   4.932 1.85e-05 ***
## GA             71.541     29.303   2.441  0.01967 *  
## L              18.042      5.707   3.162  0.00318 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33210 on 36 degrees of freedom
## Multiple R-squared:  0.7869, Adjusted R-squared:  0.7691 
## F-statistic:  44.3 on 3 and 36 DF,  p-value: 3.597e-12
dredge_12<-lm(MV~TMA+GA+L, data=data)
summary(dredge_12)
## 
## Call:
## lm(formula = MV ~ TMA + GA + L, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94261  -3099   3978  10217  85284 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34079.839  51189.955   0.666  0.50981    
## TMA           117.746     23.872   4.932 1.85e-05 ***
## GA            -46.205     43.977  -1.051  0.30042    
## L              18.042      5.707   3.162  0.00318 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33210 on 36 degrees of freedom
## Multiple R-squared:  0.7869, Adjusted R-squared:  0.7691 
## F-statistic:  44.3 on 3 and 36 DF,  p-value: 3.597e-12
dredge_15<-lm(MV~TMA+MA+L, data=data)
summary(dredge_15)
## 
## Call:
## lm(formula = MV ~ TMA + MA + L, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94261  -3099   3978  10217  85284 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 34079.839  51189.955   0.666  0.50981   
## TMA            71.541     29.303   2.441  0.01967 * 
## MA             46.205     43.977   1.051  0.30042   
## L              18.042      5.707   3.162  0.00318 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33210 on 36 degrees of freedom
## Multiple R-squared:  0.7869, Adjusted R-squared:  0.7691 
## F-statistic:  44.3 on 3 and 36 DF,  p-value: 3.597e-12

With the models from the “dredge” analysis narrowed down to the above models, said models will be compared with ANOVA.

Comparing dredge models with ANOVA

anova(dredge_11,dredge_8)
anova(dredge_11,dredge_12)
anova(dredge_11,dredge_15)

Observing the above models, the “best” models according to the “dredge” analysis, and confirmed via ANOVA, are MV ~ TMA + L and MV ~ MA + GA + L. Simply put, the market value of a home depends primarily on the land and the total main area square footage. The additional model is also being considered at this point because TMA consists of both MA and GA. The difference between the models is that the coefficients are weighted slightly differently in terms of which predictors affect the regression more. Both are being considered due to their similarity.

Step Regression

The last built-in R function used to help determine the best model is stepwise regression. It comes in three forms: backward, forward, and both. For this project, since only certain variables are being considered as predictors, only forward and backward stepwise regression will be used.

backreg_sc<-lm(MV~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, data = data)
step(backreg_sc,direction="backward")
## Start:  AIC=843.99
## MV ~ TMA + MA + GA + L + TMA:MA + TMA:GA + TMA:L + MA:GA + MA:L + 
##     GA:L
## 
## 
## Step:  AIC=843.99
## MV ~ TMA + MA + GA + L + TMA:MA + TMA:GA + TMA:L + MA:GA + MA:L
## 
##          Df Sum of Sq        RSS    AIC
## - MA:L    1  28922364 3.7193e+10 842.02
## - TMA:L   1  92442914 3.7256e+10 842.09
## - TMA:MA  1 138491127 3.7302e+10 842.14
## - MA:GA   1 287394135 3.7451e+10 842.30
## - TMA:GA  1 669678623 3.7834e+10 842.70
## <none>                3.7164e+10 843.99
## 
## Step:  AIC=842.02
## MV ~ TMA + MA + GA + L + TMA:MA + TMA:GA + TMA:L + MA:GA
## 
##          Df Sum of Sq        RSS    AIC
## - TMA:MA  1 119314428 3.7312e+10 840.15
## - MA:GA   1 296992886 3.7490e+10 840.34
## - TMA:L   1 574084758 3.7767e+10 840.63
## - TMA:GA  1 647976839 3.7841e+10 840.71
## <none>                3.7193e+10 842.02
## 
## Step:  AIC=840.15
## MV ~ TMA + MA + GA + L + TMA:GA + TMA:L + MA:GA
## 
##          Df Sum of Sq        RSS    AIC
## - MA:GA   1 334860791 3.7647e+10 838.51
## - TMA:L   1 475784284 3.7788e+10 838.65
## - TMA:GA  1 584188259 3.7896e+10 838.77
## <none>                3.7312e+10 840.15
## 
## Step:  AIC=838.51
## MV ~ TMA + MA + GA + L + TMA:GA + TMA:L
## 
## 
## Step:  AIC=838.51
## MV ~ TMA + GA + L + TMA:GA + TMA:L
## 
##          Df Sum of Sq        RSS    AIC
## - TMA:L   1 276068929 3.7923e+10 836.80
## - TMA:GA  1 307059392 3.7954e+10 836.83
## <none>                3.7647e+10 838.51
## 
## Step:  AIC=836.8
## MV ~ TMA + GA + L + TMA:GA
## 
##          Df  Sum of Sq        RSS    AIC
## - TMA:GA  1 1785427866 3.9708e+10 836.64
## <none>                 3.7923e+10 836.80
## - L       1 7279204174 4.5202e+10 841.82
## 
## Step:  AIC=836.64
## MV ~ TMA + GA + L
## 
##        Df  Sum of Sq        RSS    AIC
## - GA    1 1.2176e+09 4.0926e+10 835.85
## <none>               3.9708e+10 836.64
## - L     1 1.1025e+10 5.0734e+10 844.44
## - TMA   1 2.6835e+10 6.6543e+10 855.29
## 
## Step:  AIC=835.85
## MV ~ TMA + L
## 
##        Df  Sum of Sq        RSS    AIC
## <none>               4.0926e+10 835.85
## - L     1 1.2767e+10 5.3693e+10 844.71
## - TMA   1 4.8247e+10 8.9173e+10 865.00
## 
## Call:
## lm(formula = MV ~ TMA + L, data = data)
## 
## Coefficients:
## (Intercept)          TMA            L  
##    61828.66        98.09        19.11
forwardreg_sc<-lm(MV~1, data = data)
step(forwardreg_sc,scope~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, direction="forward")
## Start:  AIC=892.47
## MV ~ 1
## 
##        Df  Sum of Sq        RSS    AIC
## + TMA   1 1.3261e+11 5.3693e+10 844.71
## + MA    1 1.2596e+11 6.0343e+10 849.38
## + L     1 9.7130e+10 8.9173e+10 865.00
## + GA    1 7.1402e+10 1.1490e+11 875.14
## <none>               1.8630e+11 892.47
## 
## Step:  AIC=844.71
## MV ~ TMA
## 
##        Df  Sum of Sq        RSS    AIC
## + L     1 1.2767e+10 4.0926e+10 835.85
## + MA    1 2.9595e+09 5.0734e+10 844.44
## + GA    1 2.9595e+09 5.0734e+10 844.44
## <none>               5.3693e+10 844.71
## 
## Step:  AIC=835.85
## MV ~ TMA + L
## 
##         Df  Sum of Sq        RSS    AIC
## <none>                4.0926e+10 835.85
## + TMA:L  1 1304973001 3.9621e+10 836.55
## + MA     1 1217598557 3.9708e+10 836.64
## + GA     1 1217598557 3.9708e+10 836.64
## 
## Call:
## lm(formula = MV ~ TMA + L, data = data)
## 
## Coefficients:
## (Intercept)          TMA            L  
##    61828.66        98.09        19.11

Both the forward and backward stepwise regressions confirm the “dredge” analysis above: according to these functions, the “best” model is MV ~ TMA + L.

Currently, the “best” models being considered at this stage of the analysis are MV ~ TMA + L and MV ~ MA + GA + L.

Plot of “Best” Models

bestmod<-lm(MV~TMA+L, data=data)
plot(bestmod)

bestmod2<-lm(MV~MA+GA+L, data=data)
plot(bestmod2)

After observing the adequacy plots of the “best” models, it is clear that there is an increasing pattern in the constant-variance (residuals vs. fitted) plots and skewness with light tails in the normal probability plots. Due to these observations, the models will next be analyzed with a Box-Cox analysis to see whether a transformation is needed.
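As an optional, more formal check of the normality concern (a sketch, not part of the original analysis), a Shapiro-Wilk test can be applied to the residuals of each candidate model:

# Shapiro-Wilk tests of residual normality; small p-values indicate departure
# from normality, consistent with the skewness seen in the Q-Q plots
shapiro.test(residuals(bestmod))
shapiro.test(residuals(bestmod2))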

Box-Cox Analysis

The Box-Cox analysis determines whether a power transformation of the response is needed by observing the value of lambda in the plot. If lambda = 1 falls within the 95% confidence interval, there is not a major need for a transformation. Otherwise, the response should be transformed for model adequacy.

library(MASS)
b<-boxcox(bestmod)

b2<-boxcox(bestmod2)

According to the Box-Cox plots above, the value of lambda = 1 is clearly not in the 95% confidence intervals for either model. Next, the value for lambda will be found at the maximum of the function in the Box-Cox plots to determine the power of the transformations.
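This can also be read off programmatically (a sketch): the 95% confidence region drawn by boxcox() consists of the lambda values whose log-likelihood lies within qchisq(0.95, 1)/2 of the maximum.

# Approximate 95% confidence interval for lambda from a boxcox object
ci_lambda <- function(bc) range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
ci_lambda(b)    # interval for "bestmod"
ci_lambda(b2)   # interval for "bestmod2"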

Transformation Power

lambda<-b$x
likelihood<-b$y
lambda[which.max(likelihood)]
## [1] -1.191919

The lambda value needed for the power transformation for “bestmod” resulted in being -1.191919.

lambda<-b2$x
likelihood<-b2$y
lambda[which.max(likelihood)]
## [1] -1.272727

The lambda value needed for the power transformation for “bestmod2” resulted in being -1.272727.

Response Transformations

tfMV<-MV^(-1.191919)
tfMV2<-MV^(-1.272727)

Now that the response has been transformed, a new regression can be made to observe if there was any change in the model adequacy.

Transformation Regressions

tfmod<-lm(tfMV~TMA+L, data=data)
summary(tfmod)
## 
## Call:
## lm(formula = tfMV ~ TMA + L, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.681e-08 -4.572e-09 -2.360e-09  3.260e-09  3.111e-08 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.766e-07  1.213e-08  22.804  < 2e-16 ***
## TMA         -2.939e-11  4.103e-12  -7.162 1.73e-08 ***
## L           -3.824e-12  1.554e-12  -2.461   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.188e-09 on 37 degrees of freedom
## Multiple R-squared:  0.7724, Adjusted R-squared:  0.7601 
## F-statistic: 62.79 on 2 and 37 DF,  p-value: 1.28e-12
plot(tfmod)

tfmod2<-lm(tfMV2~MA+GA+L, data=data)
summary(tfmod2)
## 
## Call:
## lm(formula = tfMV2 ~ MA + GA + L, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6.147e-09 -1.558e-09 -5.725e-10  8.047e-10  1.090e-08 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.016e-07  5.132e-09  19.795  < 2e-16 ***
## MA          -1.334e-11  2.393e-12  -5.574 2.58e-06 ***
## GA          -7.314e-12  2.938e-12  -2.489   0.0176 *  
## L           -1.241e-12  5.721e-13  -2.168   0.0368 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.33e-09 on 36 degrees of freedom
## Multiple R-squared:  0.7826, Adjusted R-squared:  0.7644 
## F-statistic: 43.19 on 3 and 36 DF,  p-value: 5.137e-12
plot(tfmod2)

After the transformation, the model is more adequate in terms of the normal probability plot and constant variance, with the exception of a few outliers (6316, 6318, 6324) that still need to be addressed.
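One way to flag such points more formally (a sketch; the 4/n cutoff is a common rule of thumb and not part of the original analysis) is Cook's distance on the transformed model:

# Observations with Cook's distance above 4/n are often flagged as influential
cd <- cooks.distance(tfmod)
sort(cd[cd > 4 / length(cd)], decreasing = TRUE)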

Eliminating Outliers

Eliminating the above-mentioned outliers could potentially solidify the adequacy of this transformed model.

dataelim<-data[-7,]
dataelim<-dataelim[-8,]
dataelim<-dataelim[-12,]
  # Removal of Houses 6316, 6318 & 6324 (each index refers to the current, already-shrunk data frame)

tf_mod_elim<-lm(tfMV~TMA+L, data=dataelim)
summary(tf_mod_elim)
## 
## Call:
## lm(formula = tfMV ~ TMA + L, data = dataelim)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.681e-08 -4.572e-09 -2.360e-09  3.260e-09  3.111e-08 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.766e-07  1.213e-08  22.804  < 2e-16 ***
## TMA         -2.939e-11  4.103e-12  -7.162 1.73e-08 ***
## L           -3.824e-12  1.554e-12  -2.461   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.188e-09 on 37 degrees of freedom
## Multiple R-squared:  0.7724, Adjusted R-squared:  0.7601 
## F-statistic: 62.79 on 2 and 37 DF,  p-value: 1.28e-12
plot(tf_mod_elim)

tf_mod_elim2<-lm(tfMV2~MA+GA+L, data=dataelim)
summary(tf_mod_elim2)
## 
## Call:
## lm(formula = tfMV2 ~ MA + GA + L, data = dataelim)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6.147e-09 -1.558e-09 -5.725e-10  8.047e-10  1.090e-08 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.016e-07  5.132e-09  19.795  < 2e-16 ***
## MA          -1.334e-11  2.393e-12  -5.574 2.58e-06 ***
## GA          -7.314e-12  2.938e-12  -2.489   0.0176 *  
## L           -1.241e-12  5.721e-13  -2.168   0.0368 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.33e-09 on 36 degrees of freedom
## Multiple R-squared:  0.7826, Adjusted R-squared:  0.7644 
## F-statistic: 43.19 on 3 and 36 DF,  p-value: 5.137e-12
plot(tf_mod_elim2)

After eliminating the previously mentioned outliers, the models’ adequacy in terms of the plots improved slightly, but the significance of the models in terms of R-squared and p-values did not improve, so eliminating these observations is not a major concern.

Confidence and Prediction Intervals on the Transformed Models

fitted_tfmod <- predict(tfmod,data.frame(data))
conf_tfmod<-predict(tfmod,data.frame(data),interval="confidence")
pred_tfmod<-predict(tfmod,data.frame(data),interval="prediction")

fitted_trans_back<-exp(log(fitted_tfmod)/-1.191919)
conf_trans_back<-exp(log(conf_tfmod)/-1.191919)
pred_trans_back<-exp(log(pred_tfmod)/-1.191919)

fitted_tfmod2<- predict(tfmod2,data.frame(data))
conf_tfmod2<-predict(tfmod2,data.frame(data),interval="confidence")
pred_tfmod2<-predict(tfmod2,data.frame(data),interval="prediction")

fitted_trans_back2<-exp(log(fitted_tfmod2)/-1.272727)
conf_trans_back2<-exp(log(conf_tfmod2)/-1.272727)
pred_trans_back2<-exp(log(pred_tfmod2)/-1.272727)

data$X2023.Market.Value[12]
## [1] 552073
conf_trans_back[12,]
##      fit      lwr      upr 
## 534383.3 545591.7 523667.6
pred_trans_back[12,]
##      fit      lwr      upr 
## 534383.3 599110.7 483237.9
conf_trans_back2[12,]
##      fit      lwr      upr 
## 537530.1 549802.0 525863.5
pred_trans_back2[12,]
##      fit      lwr      upr 
## 537530.1 602762.7 486403.4

The plots of the confidence and prediction intervals for the “tfmod” model are shown below.

conf_fit <- conf_trans_back[,1]
conf_lwr <- conf_trans_back[,2]
conf_upr <- conf_trans_back[,3]
pred_fit <- pred_trans_back[,1]
pred_lwr <- pred_trans_back[,2]
pred_upr <- pred_trans_back[,3]


house_nums <- seq(from = 6309, to = 6350, by = 1)
# Remove values 6314 and 6322
house_nums <- house_nums[house_nums != 6314 & house_nums != 6322]

plot(house_nums, y=fitted_trans_back, 
     xlab = "House Number", 
     ylab = "MV from 'tfmod'", 
     main = "Confidence/Prediction Intervals on Fitted MV", 
     pch = 16, 
     col = "blue",
     ylim = c(450000, 750000),
     xlim = c(6309, 6350))

lines(house_nums,fitted_trans_back,col="black")
lines(house_nums,conf_lwr,col="red")
lines(house_nums,conf_upr, col="red")
lines(house_nums,pred_lwr, col="green")
lines(house_nums,pred_upr, col="green")


legend("topright", legend=c("fitted values","confidence interval","prediction interval"), col=c("black","red","green"),
       lty = 1, 
       lwd = 1)

The assessed 2023 Market Value for the property at 6321 88th Street does fall within the prediction intervals of both models, but not within the confidence intervals. This point of reference suggests that the assessed value is not a reasonable estimate of the home’s appraisal value based on this regression, because it does not fall within the 95% confidence interval.

A 95% confidence interval can be further described as the range that, with 95% confidence, contains the mean market value the model predicts for a property with these characteristics. Since the assessed value does not fall within this confidence interval, this supports the conclusion that the assessed value is not reasonable.
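The same check can be applied to every property in the data set (a sketch, not part of the original analysis; note that because the transformation power is negative, the back-transformed lower and upper bounds swap order, so they are re-ordered with pmin()/pmax()):

# For each property, does the assessed 2023 Market Value fall inside the
# back-transformed 95% prediction interval from "tfmod"?
pi_low    <- pmin(pred_trans_back[, 2], pred_trans_back[, 3])
pi_high   <- pmax(pred_trans_back[, 2], pred_trans_back[, 3])
inside_pi <- data$X2023.Market.Value >= pi_low & data$X2023.Market.Value <= pi_high
table(inside_pi)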

Discussion and Final Determination

After performing the above analysis, the two models that were ultimately decided on are as follows.

Model 1: tfMV ~ TMA + L

Model 2: tfMV2 ~ MA + GA + L

Both of these models are very similar in terms of adequacy, R-squared, and overall p-values, but produce slightly different responses. Both use a transformed response to improve model adequacy, so when using either model the prediction must be transformed back to the original scale. Next, both models will be used to show how they predict the Market Value of the properties and how those predictions compare to the given Market Values.

The equations for each model are as follows.

tfMV = 2.766e-07 - 2.939e-11(TMA) - 3.824e-12(L)

tfMV2 = 1.016e-07 - 1.334e-11(MA) - 7.314e-12(GA) - 1.241e-12(L)
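As a sketch of using Model 1 through predict() instead of the hand-written equation, the prediction is made on the transformed scale and then raised to the power 1/(-1.191919) to return to dollars. The example property below (TMA = 3500, L = 9000) is hypothetical and not part of the data set.

# Hypothetical property: 3500 sq ft total main area on a 9000 sq ft lot
new_house <- data.frame(TMA = 3500, L = 9000)
tf_hat <- predict(tfmod, new_house)   # prediction on the transformed scale
mv_hat <- tf_hat^(1 / -1.191919)      # back-transform to dollars
mv_hat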

6321 Evaluation

eval_tfMV<-(2.766e-07 - 2.939e-11*(data$Total.Main.Area) - 3.824e-12*(data$Land))
eval_MV<-exp(log(eval_tfMV)/-1.191919)
eval_MV[12]
## [1] 534511.3
data$X2023.Market.Value[12]
## [1] 552073

Observing the output above, the model regressed on the Total Main Area (TMA) and Land (L) predictors indicates that the assessed value of the home is greater than the model’s estimate by about $18,000.

eval_tfMV2<-(1.016e-07 - 1.334e-11*(data$Main.Area) - 7.314e-12*(data$Garage.Area) - 1.241e-12*(data$Land))
eval_MV2<-exp(log(eval_tfMV2)/-1.272727)
eval_MV2[12]
## [1] 537459.1

Observing the “Model 2” output, and similarly to the “Model 1” result, the model regressed on the Main Area (MA), Garage Area (GA), and Land (L) predictors shows that the assessed value of the home is still greater than the model’s estimate, this time by about $15,000.
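The dollar differences behind those approximations can be computed directly (a quick sketch):

# Assessed value minus each model's fitted value for 6321 88th Street
data$X2023.Market.Value[12] - eval_MV[12]    # Model 1: roughly 17,600
data$X2023.Market.Value[12] - eval_MV2[12]   # Model 2: roughly 14,600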

Neighborhood Evaluation

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
eval_MV_dataset<-cbind(data$X2023.Market.Value,eval_MV,data$Total.Main.Area,data$Land)
rownames(eval_MV_dataset)<-rownames(data)
colnames(eval_MV_dataset)<-c("Market.Value.2023","eval_MV","Total.Main.Area","Land")
eval_MV_dataset<-as.data.frame(eval_MV_dataset)
less_than<-filter(eval_MV_dataset, eval_MV_dataset$eval_MV >= eval_MV_dataset$Market.Value.2023)
greater<-filter(eval_MV_dataset, eval_MV_dataset$eval_MV <= eval_MV_dataset$Market.Value.2023)
count(less_than)
count(greater)
eval_MV_dataset2<-cbind(data$X2023.Market.Value,eval_MV2,data$Total.Main.Area,data$Land)
rownames(eval_MV_dataset2)<-rownames(data)
colnames(eval_MV_dataset2)<-c("Market.Value.2023","eval_MV2","Total.Main.Area","Land")
eval_MV_dataset2<-as.data.frame(eval_MV_dataset2)
less_than2<-filter(eval_MV_dataset2, eval_MV_dataset2$eval_MV2 >= eval_MV_dataset2$Market.Value.2023)
greater2<-filter(eval_MV_dataset2, eval_MV_dataset2$eval_MV2 <= eval_MV_dataset2$Market.Value.2023)
count(less_than2)
count(greater2)

Furthermore, observing the entire neighborhood with the regression models concluded in this project, a total of 14 properties have appraisal values less than the regression-evaluated values and 26 properties have appraisal values greater than the regression values. This result holds for both regression models.
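Expressed as proportions of the 40 properties (a quick sketch using the data frames built above):

# Share of properties appraised below / above the Model 1 regression value
nrow(less_than) / nrow(eval_MV_dataset)   # 14/40 = 0.35
nrow(greater) / nrow(eval_MV_dataset)     # 26/40 = 0.65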

Conclusion

In conclusion, based on the regression and analysis performed in this project, both models indicate that the assessed value of the property at 6321 88th Street is overvalued, by an average of about $16,500 between the two models. Also, when looking at all the properties in the neighborhood contributing to this analysis, 35% of them are undervalued and 65% of them are overvalued according to the two regression models.

Unevaluated Code

library(readr)
urlfile='https://raw.githubusercontent.com/dhostas/Stats-Term-Project-Data/main/Home%20Appraisal%20Project%20Data.csv'
data<-read.csv(url(urlfile))
rownames(data)<-data$House..
  # Assigning House # as observation #
data<-data[,-1]
  # Removal of House # as a predictor
data<-data[-6,]
data<-data[-13,]
  # Removal of Houses 6314 & 6322
data<-data[,-4]
  # Removal of Homestead Cap Loss predictor 

str(data)

MV<-data$X2023.Market.Value
TIMV<-data$Total.Improvement.Market.Value
TLMV<-data$Total.Land.Market.Value
TMA<-data$Total.Main.Area
MA<-data$Main.Area
MAV<-data$Main.Area.Value
GA<-data$Garage.Area
GV<-data$Garage.Value
L<-data$Land

mod<-lm(MV~TMA+MA+GA+L, data=data)
summary(mod)

modinter<-lm(MV~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, data = data)
summary(modinter)

library(MuMIn)
dredge1<-lm(MV~TMA+MA+GA+L, data=data, na.action="na.fail")
summary.fit<-dredge(dredge1)
head(summary.fit)

dredge_11<-lm(MV~TMA+L, data=data)
summary(dredge_11)
dredge_8<-lm(MV~MA+GA+L, data=data)
summary(dredge_8)
dredge_12<-lm(MV~TMA+GA+L, data=data)
summary(dredge_12)
dredge_15<-lm(MV~TMA+MA+L, data=data)
summary(dredge_15)

anova(dredge_11,dredge_8)
anova(dredge_11,dredge_12)
anova(dredge_11,dredge_15)

backreg_sc<-lm(MV~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, data = data)
step(backreg_sc,direction="backward")

forwardreg_sc<-lm(MV~1, data = data)
step(forwardreg_sc,scope~TMA+MA+GA+L+TMA:MA+TMA:GA+TMA:L+MA:GA+MA:L+GA:L, direction="forward")

bestmod<-lm(MV~TMA+L, data=data)
plot(bestmod)

bestmod2<-lm(MV~MA+GA+L, data=data)
plot(bestmod2)

library(MASS)
b<-boxcox(bestmod)
b2<-boxcox(bestmod2)

lambda<-b$x
likelihood<-b$y
lambda[which.max(likelihood)]

lambda<-b2$x
likelihood<-b2$y
lambda[which.max(likelihood)]

tfMV<-MV^(-1.191919)
tfMV2<-MV^(-1.272727)

tfmod<-lm(tfMV~TMA+L, data=data)
summary(tfmod)
plot(tfmod)

tfmod2<-lm(tfMV2~MA+GA+L, data=data)
summary(tfmod2)
plot(tfmod2)

dataelim<-data[-7,]
dataelim<-dataelim[-8,]
dataelim<-dataelim[-12,]

tf_mod_elim<-lm(tfMV~TMA+L, data=dataelim)
summary(tf_mod_elim)
plot(tf_mod_elim)

tf_mod_elim2<-lm(tfMV2~MA+GA+L, data=dataelim)
summary(tf_mod_elim2)
plot(tf_mod_elim2)

fitted_tfmod <- predict(tfmod,data.frame(data))
conf_tfmod<-predict(tfmod,data.frame(data),interval="confidence")
pred_tfmod<-predict(tfmod,data.frame(data),interval="prediction")

fitted_trans_back<-exp(log(fitted_tfmod)/-1.191919)
conf_trans_back<-exp(log(conf_tfmod)/-1.191919)
pred_trans_back<-exp(log(pred_tfmod)/-1.191919)

fitted_tfmod2<- predict(tfmod2,data.frame(data))
conf_tfmod2<-predict(tfmod2,data.frame(data),interval="confidence")
pred_tfmod2<-predict(tfmod2,data.frame(data),interval="prediction")

fitted_trans_back2<-exp(log(fitted_tfmod2)/-1.272727)
conf_trans_back2<-exp(log(conf_tfmod2)/-1.272727)
pred_trans_back2<-exp(log(pred_tfmod2)/-1.272727)

conf_trans_back[12,]
pred_trans_back[12,]

conf_trans_back2[12,]
pred_trans_back2[12,]

conf_fit <- conf_trans_back[,1]
conf_lwr <- conf_trans_back[,2]
conf_upr <- conf_trans_back[,3]
pred_fit <- pred_trans_back[,1]
pred_lwr <- pred_trans_back[,2]
pred_upr <- pred_trans_back[,3]


house_nums <- seq(from = 6309, to = 6350, by = 1)
# Remove values 6314 and 6322
house_nums <- house_nums[house_nums != 6314 & house_nums != 6322]

plot(house_nums, y=fitted_trans_back, 
     xlab = "House Number", 
     ylab = "MV from 'tfmod'", 
     main = "Confidence/Prediction Intervals on Fitted MV", 
     pch = 16, 
     col = "blue",
     ylim = c(450000, 750000),
     xlim = c(6309, 6350))

lines(house_nums,fitted_trans_back,col="black")
lines(house_nums,conf_lwr,col="red")
lines(house_nums,conf_upr, col="red")
lines(house_nums,pred_lwr, col="green")
lines(house_nums,pred_upr, col="green")


legend("topright", legend=c("fitted values","confidence interval","prediction interval"), col=c("black","red","green"),
       lty = 1, 
       lwd = 1)
       
eval_tfMV<-(2.766e-07 - 2.939e-11*(data$Total.Main.Area) - 3.824e-12*(data$Land))
eval_MV<-exp(log(eval_tfMV)/-1.191919)
eval_MV[12]
data$X2023.Market.Value[12]

eval_tfMV2<-(1.016e-07 - 1.334e-11*(data$Main.Area) - 7.314e-12*(data$Garage.Area) - 1.241e-12*(data$Land))
eval_MV2<-exp(log(eval_tfMV2)/-1.272727)
eval_MV2[12]

eval_MV_dataset<-cbind(data$X2023.Market.Value,eval_MV,data$Total.Main.Area,data$Land)
rownames(eval_MV_dataset)<-rownames(data)
colnames(eval_MV_dataset)<-c("Market.Value.2023","eval_MV","Total.Main.Area","Land")
eval_MV_dataset<-as.data.frame(eval_MV_dataset)
less_than<-filter(eval_MV_dataset, eval_MV_dataset$eval_MV >= eval_MV_dataset$Market.Value.2023)
greater<-filter(eval_MV_dataset, eval_MV_dataset$eval_MV <= eval_MV_dataset$Market.Value.2023)
count(less_than)
count(greater)

eval_MV_dataset2<-cbind(data$X2023.Market.Value,eval_MV2,data$Total.Main.Area,data$Land)
rownames(eval_MV_dataset2)<-rownames(data)
colnames(eval_MV_dataset2)<-c("Market.Value.2023","eval_MV2","Total.Main.Area","Land")
eval_MV_dataset2<-as.data.frame(eval_MV_dataset2)
less_than2<-filter(eval_MV_dataset2, eval_MV_dataset2$eval_MV2 >= eval_MV_dataset2$Market.Value.2023)
greater2<-filter(eval_MV_dataset2, eval_MV_dataset2$eval_MV2 <= eval_MV_dataset2$Market.Value.2023)
count(less_than2)
count(greater2)