Project 2-Hannah Mitchell

Property Tax Assessment

Texas, a property-tax-reliant state, uses appraisal districts organized by county. These districts publish enough information about property parameters and appraised values that a model can be constructed to determine whether a given property is over- or undervalued and, as such, whether the owner is paying the proper property taxes.

Uploading Data & Cleaning It

The section of code below loads the .csv file containing all the raw data from the Lubbock County appraisal district website, which is available via GitHub. All of the information is available to the public simply by searching the district's website by address or by name.

github_url <- 'https://raw.githubusercontent.com/dshorselover/StatDatAnalysis/refs/heads/main/Project%202%20Data-fin.csv'
Property_Data <- read.csv(github_url)

The section of code below renames the columns and displays the first six rows of the .csv file. The columns are renamed to remove special characters that could cause problems when creating the model. The last row is also removed because it contains only text and would interfere with the analysis later.

##  [1] "House.Number"               "Property.ID"               
##  [3] "X2025.Market.Value"         "X2025.Improvement.Value"   
##  [5] "X2025.Land.Market.Value"    "X2025.Assessed.Value"      
##  [7] "Total.Area"                 "MA.Total"                  
##  [9] "MA.Total.Value"             "GAR.Total"                 
## [11] "GAR.Total.Value"            "Additional.Ammenities..ft."
## [13] "AA.Value"                   "Land.Area..ft."

##   HouseNum PropertyID MarketVal ImprovementVal LandMarketVal AssessedVal
## 1     6309    R322649    735026         677026         58000      735026
## 2     6310    R322646    663907         603222         60685      663907
## 3     6311    R322648    569992         511992         58000      569992
## 4     6312    R322647    602427         538677         63750      602427
## 5     6313    R330751    460288         415135         45153      460288
## 6     6314    R330752    968766         888796          7970      968766
##   TotalArea MATotal MATotalVal GARTotal GARTotalVal   AA AAVal LandArea
## 1      4525    3462   593404.0     1063     83622.0    0     0    10000
## 2      4304    3226   452136.2     1078    151085.8    0     0    10463
## 3      4001    3036   447924.0      965     64068.0    0     0    10000
## 4      4186    3277   477232.0      909     61445.0    0     0    10625
## 5      2747    2241   376845.0      506     38290.0    0     0     7785
## 6      6368    4188   723336.0      985     79833.0 1195 85607    13788
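The renaming and row-removal step described above can be sketched on a small stand-in data frame. The two columns and three rows here are hypothetical placeholders rather than the real file, and dropping the final row with `head(x, -1)` is only one possible way to do it without hard-coding the row index.

```r
# Hypothetical stand-in for the raw file: two data rows plus a trailing text-only row
raw <- data.frame(
  House.Number       = c("6309", "6310", "Notes"),
  X2025.Market.Value = c("735026", "663907", ""),
  stringsAsFactors   = FALSE
)

# Replace names containing dots and digits with short, plain identifiers
colnames(raw) <- c("HouseNum", "MarketVal")

# Drop the last row, whatever its index happens to be
clean <- head(raw, -1)

# Convert the remaining character columns back to numbers
clean[] <- lapply(clean, as.numeric)
str(clean)
```

Dropping the row by position from the end keeps working if rows are later added to the file, whereas a fixed index would need to be updated by hand.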

Exploratory Data Analysis

Starting with the summary of the raw data, basic properties of the data are determined. Properties such as class, minimum, maximum, quartiles, median, and mean are shown below. These properties give a broad-strokes view of the data and can show whether the .csv file was transferred correctly from the Excel file it was created in.

summary(Property_Data)

##     HouseNum     PropertyID          MarketVal       ImprovementVal   
##  Min.   :6309   Length:42          Min.   : 418386   Min.   : 373092  
##  1st Qu.:6319   Class :character   1st Qu.: 506390   1st Qu.: 460405  
##  Median :6330   Mode  :character   Median : 536431   Median : 490236  
##  Mean   :6330                      Mean   : 570191   Mean   : 519709  
##  3rd Qu.:6340                      3rd Qu.: 580772   3rd Qu.: 535879  
##  Max.   :6350                      Max.   :1218146   Max.   :1116617  
##  LandMarketVal     AssessedVal        TotalArea       MATotal    
##  Min.   :  7970   Min.   : 418386   Min.   :2747   Min.   :2241  
##  1st Qu.: 44944   1st Qu.: 506390   1st Qu.:3160   1st Qu.:2614  
##  Median : 45658   Median : 536431   Median :3362   Median :2763  
##  Mean   : 48768   Mean   : 570191   Mean   :3637   Mean   :2857  
##  3rd Qu.: 46631   3rd Qu.: 580772   3rd Qu.:3913   3rd Qu.:2953  
##  Max.   :101529   Max.   :1218146   Max.   :7117   Max.   :4582  
##    MATotalVal        GARTotal       GARTotalVal           AA         
##  Min.   :283458   Min.   : 479.0   Min.   : 32308   Min.   :   0.00  
##  1st Qu.:416911   1st Qu.: 528.0   1st Qu.: 38354   1st Qu.:   0.00  
##  Median :444626   Median : 552.0   Median : 40867   Median :   0.00  
##  Mean   :456314   Mean   : 704.3   Mean   : 53113   Mean   :  76.36  
##  3rd Qu.:458849   3rd Qu.: 936.0   3rd Qu.: 66713   3rd Qu.:   0.00  
##  Max.   :899451   Max.   :1119.0   Max.   :151086   Max.   :1468.00  
##      AAVal           LandArea    
##  Min.   :     0   Min.   : 7501  
##  1st Qu.:     0   1st Qu.: 7756  
##  Median :     0   Median : 7872  
##  Mean   :  5547   Mean   : 8695  
##  3rd Qu.:     0   3rd Qu.: 8169  
##  Max.   :120403   Max.   :17505

The next part of exploratory data analysis is the histogram. A histogram shows which values occur most often in the data set. The plot below shows that there are outliers beyond $800,000. This is most likely due to those properties having features beyond a house and a garage. Any additional aspects of the properties, such as pools or pool houses, are grouped together as amenities to simplify the model later on. These properties may interfere with the model's fit and may be removed from the model.

hist(Property_Data$MarketVal,
     main = 'Total Market Value',
     xlab = 'Market Value',
     col = 'lightpink3')
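As a rough numerical companion to the histogram, the sketch below applies the 1.5 x IQR rule to the quartiles reported by `summary(Property_Data)`. This rule is an assumption brought in for illustration, not the criterion used in this report, and it is applied only to the six market values printed by `head()` rather than to the full data set.

```r
# Quartiles of MarketVal as reported by summary(Property_Data)
q1  <- 506390
q3  <- 580772
iqr <- q3 - q1

# Tukey fences: values outside these bounds are flagged as potential outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# The six market values printed by head(Property_Data)
market_val <- c(735026, 663907, 569992, 602427, 460288, 968766)
flagged <- market_val[market_val < lower_fence | market_val > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(flagged)
```

The upper fence from this rule falls below the $800,000 cutoff read off the histogram, so it would flag more properties than the visual inspection does.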

Continuing with the data exploration, the plot below looks at how the area of the main building, or house, affects the total market value of the property. There are also outliers on the far right of the plot, and the model will likely require some kind of outlier correction or removal.

plot(Property_Data$MATotal, Property_Data$MarketVal,
     main = 'Main Building Area vs Total Market Value',
     xlab = 'Main Building Area (ft^2)',
     ylab = 'Total Property Market Value',
     pch = 20,
     col = 'lightpink3')

The garage area plot, similar to the main building area plot above, shows how the area of the garage affects the total market value of the property. The plot is split because garage area is not truly continuous; it is based on how many cars the garage is built to hold. For example, the cluster in the bottom left is most likely one- or two-car garages. The cluster in the bottom right is most likely either four-car garages or utility garages used to store boats. The outliers in the top right correspond to the outliers in the histogram, which are most likely the properties with additional amenities.

plot(Property_Data$GARTotal, Property_Data$MarketVal,
     main = 'Garage Area vs Total Market Value',
     xlab = 'Garage Area (ft^2)',
     ylab = 'Total Property Market Value',
     pch = 20,
     col = 'lightpink3')

The last exploratory plot, shown below, is the effect of additional amenities on the total market value. As expected, most of the values sit at 0 $ft^2$, since most of the properties do not have any additional amenities. Unsurprisingly, the larger the area of the amenities, the greater the total market value. This plot depicts the effect of the outliers shown in the earlier plots.

plot(Property_Data$AA, Property_Data$MarketVal,
     main = 'Amenity vs Total Market Value',
     xlab = 'Amenity Area (ft^2)',
     ylab = 'Total Property Market Value',
     pch = 20,
     col = 'lightpink3')

Making The Model

Due to the outliers and the multiple factors, building the model will take some trial and error. The model is built by putting the data into a linear model using the “lm” function, which is done in each attempt. In a simple linear regression there is only one predictor and one response variable. In multiple linear regression, a response variable is determined using multiple predictor variables, as shown in the equation below.

\[ y = \beta_0 + \sum_{i=1}^{k} X_i\beta_i + \epsilon \]

In this equation, $\beta_0$ is the intercept, every predictor variable’s effect on the response is represented by $X_i\beta_i$, and $\epsilon$ represents the random error in the model. The model is linear in the $\beta$ coefficients, which describe the effect of each $X_i$ on the response variable, y. In this model, the response variable is the total market value. The predictor variables include, but may not be limited to:

Main Building Area (represented by MATotal in the code)
Garage Area (represented by GARTotal in the code)
Land Area (represented by LandArea in the code)
Amenity Area (represented by AA in the code)
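
Written out for the four predictors listed above, and before any interaction terms are added, the equation takes the form below. The intercept $\beta_0$ and the numbering of the coefficients are notational assumptions made for this illustration.

```latex
\[ \text{MarketVal} = \beta_0 + \beta_1\,\text{MATotal} + \beta_2\,\text{GARTotal} + \beta_3\,\text{LandArea} + \beta_4\,\text{AA} + \epsilon \]
```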

Original Model

The code below fits the first model, which should describe the effects of each factor individually and in combination on the total market value. The summary, also shown below, reports the estimated $\beta_i$ values of the multiple linear regression, their t-statistics, the $R^2$ statistic, and an F-statistic. Focusing on the $R^2$ and the p-value from the F-statistic, the first-attempt model appears to be a good fit. The $R^2$ is about 96% (adjusted $R^2$ about 95%) and the p-value is well below the 0.05 threshold that is conventionally considered acceptable. Five of the interaction coefficients are reported as NA because of singularities, meaning those terms could not be estimated from this data. These summary statistics indicate that the model fits the data very well, and residual analysis follows to assess the model's adequacy.

The model is built on the belief that each area component individually affects the total market value and that the interactions between them do as well.

model <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA + MATotal*GARTotal*LandArea*AA, data = Property_Data) 
summary(model)

## 
## Call:
## lm(formula = MarketVal ~ MATotal + GARTotal + LandArea + AA + 
##     MATotal * GARTotal * LandArea * AA, data = Property_Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80687  -4474   4481  11722  60716 
## 
## Coefficients: (5 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -8.066e+06  3.612e+06  -2.233   0.0329 *
## MATotal                       2.847e+03  1.188e+03   2.397   0.0228 *
## GARTotal                      1.082e+04  4.407e+03   2.454   0.0199 *
## LandArea                      1.007e+03  4.491e+02   2.242   0.0323 *
## AA                            6.082e+03  2.731e+03   2.227   0.0333 *
## MATotal:GARTotal             -3.629e+00  1.477e+00  -2.456   0.0198 *
## MATotal:LandArea             -3.328e-01  1.463e-01  -2.275   0.0300 *
## GARTotal:LandArea            -1.284e+00  5.268e-01  -2.437   0.0207 *
## MATotal:AA                   -3.589e+00  1.633e+00  -2.198   0.0356 *
## GARTotal:AA                   8.793e+00  4.004e+00   2.196   0.0357 *
## LandArea:AA                          NA         NA      NA       NA  
## MATotal:GARTotal:LandArea     4.315e-04  1.740e-04   2.480   0.0188 *
## MATotal:GARTotal:AA                  NA         NA      NA       NA  
## MATotal:LandArea:AA                  NA         NA      NA       NA  
## GARTotal:LandArea:AA                 NA         NA      NA       NA  
## MATotal:GARTotal:LandArea:AA         NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29820 on 31 degrees of freedom
## Multiple R-squared:  0.9648, Adjusted R-squared:  0.9535 
## F-statistic: 85.03 on 10 and 31 DF,  p-value: < 2.2e-16
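
The coefficient table above can be cross-checked against how R expands the `*` operator in a formula. The sketch below uses `terms()`, which needs no data, to list the terms generated by the formula as written and by a shorter form that uses only `*`; the names `f_long` and `f_short` are introduced here for illustration.

```r
# The formula as written in the model, and the shorter form using only `*`
f_long  <- MarketVal ~ MATotal + GARTotal + LandArea + AA + MATotal * GARTotal * LandArea * AA
f_short <- MarketVal ~ MATotal * GARTotal * LandArea * AA

# terms() expands a formula without needing any data
terms_long  <- attr(terms(f_long),  "term.labels")
terms_short <- attr(terms(f_short), "term.labels")

print(terms_long)
print(setequal(terms_long, terms_short))
print(length(terms_long))
```

Both formulas expand to the same set of main effects and interactions, so listing the main effects separately is harmless but not required. The number of terms matches the coefficient rows in the summary, counting both the estimated and the NA rows and excluding the intercept.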

Below is the first plot for residual analysis, the Residuals vs Fitted plot. This plot is used to check the homoscedasticity (constant residual variance) of the model. Ideally the points would not be clustered to the left; instead they would be randomly distributed across the entire plot. Despite this, the red line is relatively straight across the graph, which suggests no strong systematic pattern in the residuals and is consistent with homoscedasticity. There are outliers that still appear to correspond to the properties with amenities such as pools and pool houses. These points, along with the amenities predictor variable, may be taken out in later attempts depending on how the rest of the residual analysis goes.

plot(model, 1,
     pch = 20,
     col = 'navy')

The plot below is the Q-Q plot, which depicts how closely the residuals follow a normal distribution. This is an acceptable plot for normally distributed residuals, as most of the points fall close to a straight line. There is a fat tail, with points 32, 10, and 8 labeled on the plot as the most extreme. These points are worth investigating if there are stronger indicators of a lack of model fit.

plot(model, 2,
     pch = 20,
     col = 'navy')

## Warning: not plotting observations with leverage one:
##   6, 14, 18

The final plot to check the residuals is the Residuals vs Leverage plot. This plot reveals any influential outliers that could be drastically affecting the model by placing such points outside the Cook's distance lines. While point 1 is very close, it is still within the bounds of the lines. Point 8 has been labeled on multiple plots, which warrants seeing how the model is affected when it is removed.

plot(model, 5,
     pch = 20,
     col = 'navy')

## Warning: not plotting observations with leverage one:
##   6, 14, 18
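
The warning above refers to observations whose leverage equals one. A minimal sketch of how that can happen, using made-up data rather than the property data: when a predictor is nonzero for only a single observation, the fit is forced through that observation. Something similar may be happening with the rows that have a nonzero amenity area, although that is an assumption and is not verified here.

```r
set.seed(1)
n <- 12
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Hypothetical predictor that is nonzero for only the last observation
only_last <- c(rep(0, n - 1), 5)

fit <- lm(y ~ x + only_last)

# Leverage is the diagonal of the hat matrix; a value of 1 means the fit is
# forced through that observation, so its residual is zero
lev <- hatvalues(fit)
leverage_one <- which(lev > 1 - 1e-8)

print(round(lev, 3))
print(unname(leverage_one))
```

Because the residual of such a point is always zero, its standardized residual is undefined, which is why `plot()` leaves it off the diagnostic plots.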

The final check for model adequacy is the Variance Inflation Factor (VIF), which tests for multicollinearity. The interaction terms in the model are too complex for VIF, so the analysis is done only on the main factors. None of the VIFs is over five, a common rule-of-thumb threshold, so multicollinearity is not a concern.

vif_model <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA, data = Property_Data) 
vif(vif_model)

##  MATotal GARTotal LandArea       AA 
## 4.554770 1.679801 4.197298 2.917910
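
For reference, a VIF can also be computed by hand in base R: regress each predictor on the others and take $1/(1-R_j^2)$. The sketch below does this on made-up data; the helper `vif_by_hand` and the three simulated columns are hypothetical and are not part of the analysis above, which uses `vif()` from the car package.

```r
set.seed(42)
n <- 40

# Hypothetical predictors: land area is built to correlate with main area
main_area   <- rnorm(n, mean = 2800, sd = 400)
land_area   <- 2.5 * main_area + rnorm(n, mean = 0, sd = 600)
garage_area <- rnorm(n, mean = 700, sd = 200)
X <- data.frame(main_area, garage_area, land_area)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all of the other predictors
vif_by_hand <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(X)
print(round(vifs, 3))
```

The two correlated columns should show larger values than the independent one, which is the pattern a VIF is meant to expose.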

Row 8’s Effect

To ensure that the original model is the best fit, row 8 is removed from the data set. It was the row most often flagged as abnormal and the most likely to be a slight outlier. To determine whether it is in fact an outlier, the row is removed to see if doing so improves model adequacy. If it does not, then the original model is the better model and will be used in the prediction.

The below line of code removes that row from the data set so that it can be put into the model. The same process for analyzing the model is used on this new model.

clean_data <- Property_Data[-c(8),]

The summary below shows that the $R^2$ value of the new model marginally improved without the outlier (0.9705 versus 0.9648), while the F-test p-value is unchanged. The coefficients changed, but their t-test p-values show that, with the outlier removed, no predictor variable needs to be dropped from the model. The coefficient estimates all shrank slightly in magnitude; the amenity coefficient, for example, fell from about 6,082 to about 5,849.

modell <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA + (MATotal*GARTotal*LandArea*AA), data = clean_data)
summary(modell)

## 
## Call:
## lm(formula = MarketVal ~ MATotal + GARTotal + LandArea + AA + 
##     (MATotal * GARTotal * LandArea * AA), data = clean_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80517  -4755   4796  11407  45467 
## 
## Coefficients: (5 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -7.543e+06  3.326e+06  -2.268   0.0307 *
## MATotal                       2.665e+03  1.094e+03   2.435   0.0211 *
## GARTotal                      1.030e+04  4.056e+03   2.539   0.0165 *
## LandArea                      9.520e+02  4.133e+02   2.303   0.0284 *
## AA                            5.849e+03  2.512e+03   2.329   0.0268 *
## MATotal:GARTotal             -3.418e+00  1.360e+00  -2.512   0.0176 *
## MATotal:LandArea             -3.139e-01  1.347e-01  -2.331   0.0267 *
## GARTotal:LandArea            -1.241e+00  4.845e-01  -2.562   0.0157 *
## MATotal:AA                   -3.444e+00  1.502e+00  -2.293   0.0291 *
## GARTotal:AA                   8.419e+00  3.683e+00   2.286   0.0295 *
## LandArea:AA                          NA         NA      NA       NA  
## MATotal:GARTotal:LandArea     4.132e-04  1.601e-04   2.580   0.0150 *
## MATotal:GARTotal:AA                  NA         NA      NA       NA  
## MATotal:LandArea:AA                  NA         NA      NA       NA  
## GARTotal:LandArea:AA                 NA         NA      NA       NA  
## MATotal:GARTotal:LandArea:AA         NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27410 on 30 degrees of freedom
## Multiple R-squared:  0.9705, Adjusted R-squared:  0.9607 
## F-statistic: 98.66 on 10 and 30 DF,  p-value: < 2.2e-16

The residual plot below lost its linearity, and the distribution of the points across the graph did not improve. Removing the outlier did not improve the Residuals vs Fitted plot.

plot(modell, 1,
     pch = 20,
     col = 'navy')

The Q-Q plot below appears to have changed very little from the original model. Because of this lack of change, it is not a good indicator of which model is the better fit.

plot(modell, 2,
     pch = 20,
     col = 'navy')

## Warning: not plotting observations with leverage one:
##   6, 13, 17

The final plot is the most telling with regard to this model's adequacy. The Residuals vs Leverage plot below is a drastic change from the original model's and shows a clear outlier in row 16 of this model. Because removing row 8 created a new outlier instead of reducing the effect of the possible ones in the original model, the new model cannot be used to predict the value of a given house. It is a worse model than the original.

plot(modell, 5,
     pch = 20,
     col = 'navy')

## Warning: not plotting observations with leverage one:
##   6, 13, 17

Below is the VIF for the new model. The values change only slightly from the original model, and all of them remain below five.

vif_modell <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA, data = clean_data) 
vif(vif_modell)

##  MATotal GARTotal LandArea       AA 
## 4.527360 1.659574 4.603337 3.209099

After all the model adequacy analysis, removing row 8 does not significantly improve the model. There is no reason to use the new model over the original when it decreased the quality of some of the residual plots instead of improving them.

Prediction Model

As requested, this section uses the original model to determine whether the owners of property 6321 are paying the correct property taxes.

House 6321 Data

The first step is to isolate the data for house 6321 so that the original model can be used to predict its value. The code below takes the actual data for house 6321 and puts it into its own data set, so that it can be passed to the prediction model to determine whether the property is over- or undervalued.

house_6321 <- subset(Property_Data, HouseNum == 6321)

house_predict <- data.frame(
  MATotal = house_6321$MATotal,
  GARTotal = house_6321$GARTotal,
  AA = house_6321$AA,
  LandArea = house_6321$LandArea)

Prediction For House 6321

The code below predicts the range of values for house 6321. It feeds the data for that property into the model to determine a lower bound, an upper bound, and a best-fit value for the house. These values are then stored in individual variables along with the county appraisal district's value for the property. The confidence interval is also computed, as it will be used for the plot. The best-fit value is used as the amount at which house 6321 should be valued. The lower and upper confidence bounds are used as the range of acceptable values because the confidence interval reflects how well the data from the entire street pin down the expected value of a house with these characteristics. The prediction interval is wider, since it also includes the variation of an individual house around that expected value, so only its best-fit value is used here.

pred <- predict(model, newdata = house_predict, interval = 'prediction')
conf <- predict(model, newdata = house_predict, interval = 'confidence')

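fit <- pred[1, 'fit']      # best-fit value from the model
lower <- conf[1, 'lwr']    # lower confidence bound
upper <- conf[1, 'upr']    # upper confidence bound
actual <- house_6321$MarketVal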
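
The difference between the two interval types can be seen on a small simulated example. The street of 40 houses below is made up, as are the names `street`, `fit_lm`, and `new_house`; the point is only that `predict()` returns the same fitted value for both interval types while the prediction interval is the wider of the two.

```r
set.seed(7)

# Hypothetical street of 40 houses: value driven by main building area
area   <- runif(40, min = 2200, max = 4600)
value  <- 100000 + 150 * area + rnorm(40, sd = 30000)
street <- data.frame(area, value)

fit_lm    <- lm(value ~ area, data = street)
new_house <- data.frame(area = 3000)

# Same fitted value, different interval widths
conf_int <- predict(fit_lm, newdata = new_house, interval = "confidence")
pred_int <- predict(fit_lm, newdata = new_house, interval = "prediction")

conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

Which interval to use depends on the question being asked: the confidence interval bounds the expected value for houses of that size, while the prediction interval bounds where a single house could plausibly fall.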

Plotting the Prediction

The code below creates a visual comparison of the predicted value and the actual value. A blue dot marks the model's best-fit prediction for the property, and the nearly black vertical line marks the interval around it. The tan dashed lines make it easier to read the predicted and actual values off the plot. The pink triangle depicts the county’s value for the property. It lies within the plotted interval, on the nearly black line, and most likely accounts for some other factor not included on the county appraisal district's public access page. These factors could be something as simple as how clean the yard looked during evaluation or how long the Christmas lights had been left up. Both factors legitimately change the value of the house, but are easy fixes.

y_min <- min(lower, actual)
y_max <- max(upper, actual)

plot(NA,
     xlim = c(0.75,1.25),
     ylim = c(y_min, y_max),
     ylab = 'Market Value ($)',
     main = '6321 Actual vs Model Prediction',
     xaxt = 'n')
points(1, actual,
       pch = 17,
       col = '#A63A50',
       cex = 1.25)
points(1, fit, 
       pch = 16,
       col = '#586994',
       cex = 1.25)
segments(1, lower, 1, upper,
         col = '#160F29', lwd = 2, lty = 1)
abline(h = fit, 
       col = '#AA8F66',
       lty = 2,
       lwd = 2)
abline(h = actual, 
       col = '#AA8F66',
       lty = 2,
       lwd = 2)

After analyzing the plot, the actual value is significantly higher than the predicted value of the house. Unless the county can provide a plausible reason for the large difference, an appeal is warranted. It may also be prudent to determine why there is such a significant difference in the county’s assessment.
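
The comparison read off the plot can also be expressed as numbers. The helper below, `compare_to_model`, is a hypothetical sketch, and the values passed to it are made up for illustration; they are not the values for house 6321.

```r
# Hypothetical helper: compare an appraised value with a model interval
compare_to_model <- function(actual, fit, lower, upper) {
  list(
    difference = actual - fit,                    # dollars above (+) or below (-) the fit
    pct_of_fit = 100 * (actual - fit) / fit,      # the same gap as a percentage of the fit
    inside     = actual >= lower && actual <= upper
  )
}

# Made-up numbers for illustration only, not the values for house 6321
res <- compare_to_model(actual = 550000, fit = 500000,
                        lower = 480000, upper = 520000)
print(res)
```

Reporting the gap in both dollars and percent, together with whether the appraised value falls inside the interval, gives an appeal something more concrete to cite than the plot alone.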

Entire Code

#Library
library(car)

#Data
github_url <- 'https://raw.githubusercontent.com/dshorselover/StatDatAnalysis/refs/heads/main/Project%202%20Data-fin.csv'
Property_Data <- read.csv(github_url)

names(Property_Data)

##  [1] "House.Number"               "Property.ID"               
##  [3] "X2025.Market.Value"         "X2025.Improvement.Value"   
##  [5] "X2025.Land.Market.Value"    "X2025.Assessed.Value"      
##  [7] "Total.Area"                 "MA.Total"                  
##  [9] "MA.Total.Value"             "GAR.Total"                 
## [11] "GAR.Total.Value"            "Additional.Ammenities..ft."
## [13] "AA.Value"                   "Land.Area..ft."

colnames(Property_Data) <- c('HouseNum', 'PropertyID', 'MarketVal', 'ImprovementVal', 'LandMarketVal', 'AssessedVal', 'TotalArea', 'MATotal', 'MATotalVal', 'GARTotal', 'GARTotalVal', 'AA', 'AAVal', 'LandArea')
Property_Data <- Property_Data[-c(43),]
head(Property_Data)

##   HouseNum PropertyID MarketVal ImprovementVal LandMarketVal AssessedVal
## 1     6309    R322649    735026         677026         58000      735026
## 2     6310    R322646    663907         603222         60685      663907
## 3     6311    R322648    569992         511992         58000      569992
## 4     6312    R322647    602427         538677         63750      602427
## 5     6313    R330751    460288         415135         45153      460288
## 6     6314    R330752    968766         888796          7970      968766
##   TotalArea MATotal MATotalVal GARTotal GARTotalVal   AA AAVal LandArea
## 1      4525    3462   593404.0     1063     83622.0    0     0    10000
## 2      4304    3226   452136.2     1078    151085.8    0     0    10463
## 3      4001    3036   447924.0      965     64068.0    0     0    10000
## 4      4186    3277   477232.0      909     61445.0    0     0    10625
## 5      2747    2241   376845.0      506     38290.0    0     0     7785
## 6      6368    4188   723336.0      985     79833.0 1195 85607    13788

#EDA
summary(Property_Data)

##     HouseNum     PropertyID          MarketVal       ImprovementVal   
##  Min.   :6309   Length:42          Min.   : 418386   Min.   : 373092  
##  1st Qu.:6319   Class :character   1st Qu.: 506390   1st Qu.: 460405  
##  Median :6330   Mode  :character   Median : 536431   Median : 490236  
##  Mean   :6330                      Mean   : 570191   Mean   : 519709  
##  3rd Qu.:6340                      3rd Qu.: 580772   3rd Qu.: 535879  
##  Max.   :6350                      Max.   :1218146   Max.   :1116617  
##  LandMarketVal     AssessedVal        TotalArea       MATotal    
##  Min.   :  7970   Min.   : 418386   Min.   :2747   Min.   :2241  
##  1st Qu.: 44944   1st Qu.: 506390   1st Qu.:3160   1st Qu.:2614  
##  Median : 45658   Median : 536431   Median :3362   Median :2763  
##  Mean   : 48768   Mean   : 570191   Mean   :3637   Mean   :2857  
##  3rd Qu.: 46631   3rd Qu.: 580772   3rd Qu.:3913   3rd Qu.:2953  
##  Max.   :101529   Max.   :1218146   Max.   :7117   Max.   :4582  
##    MATotalVal        GARTotal       GARTotalVal           AA         
##  Min.   :283458   Min.   : 479.0   Min.   : 32308   Min.   :   0.00  
##  1st Qu.:416911   1st Qu.: 528.0   1st Qu.: 38354   1st Qu.:   0.00  
##  Median :444626   Median : 552.0   Median : 40867   Median :   0.00  
##  Mean   :456314   Mean   : 704.3   Mean   : 53113   Mean   :  76.36  
##  3rd Qu.:458849   3rd Qu.: 936.0   3rd Qu.: 66713   3rd Qu.:   0.00  
##  Max.   :899451   Max.   :1119.0   Max.   :151086   Max.   :1468.00  
##      AAVal           LandArea    
##  Min.   :     0   Min.   : 7501  
##  1st Qu.:     0   1st Qu.: 7756  
##  Median :     0   Median : 7872  
##  Mean   :  5547   Mean   : 8695  
##  3rd Qu.:     0   3rd Qu.: 8169  
##  Max.   :120403   Max.   :17505

hist(Property_Data$MarketVal,
     main = 'Total Market Value',
     xlab = 'Market Value',
     col = 'lightpink3')

plot(Property_Data$MATotal, Property_Data$MarketVal,
     main = 'Main Building Area vs Total Market Value',
     xlab = 'Main Building Area (ft^2)',
     ylab = 'Total Property Market Value',
     pch = 20,
     col = 'lightpink3')

plot(Property_Data$GARTotal, Property_Data$MarketVal,
     main = 'Garage Area vs Total Market Value',
     xlab = 'Garage Area (ft^2)',
     ylab = 'Total Property Market Value',
     pch = 20,
     col = 'lightpink3')

plot(Property_Data$AA, Property_Data$MarketVal,
     main = 'Amenity vs Total Market Value',
     xlab = 'Amenity Area (ft^2)',
     ylab = 'Total Property Market Value',
     pch = 20,
     col = 'lightpink3')

#Models
model <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA + MATotal*GARTotal*LandArea*AA, data = Property_Data) 
summary(model)

## 
## Call:
## lm(formula = MarketVal ~ MATotal + GARTotal + LandArea + AA + 
##     MATotal * GARTotal * LandArea * AA, data = Property_Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80687  -4474   4481  11722  60716 
## 
## Coefficients: (5 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -8.066e+06  3.612e+06  -2.233   0.0329 *
## MATotal                       2.847e+03  1.188e+03   2.397   0.0228 *
## GARTotal                      1.082e+04  4.407e+03   2.454   0.0199 *
## LandArea                      1.007e+03  4.491e+02   2.242   0.0323 *
## AA                            6.082e+03  2.731e+03   2.227   0.0333 *
## MATotal:GARTotal             -3.629e+00  1.477e+00  -2.456   0.0198 *
## MATotal:LandArea             -3.328e-01  1.463e-01  -2.275   0.0300 *
## GARTotal:LandArea            -1.284e+00  5.268e-01  -2.437   0.0207 *
## MATotal:AA                   -3.589e+00  1.633e+00  -2.198   0.0356 *
## GARTotal:AA                   8.793e+00  4.004e+00   2.196   0.0357 *
## LandArea:AA                          NA         NA      NA       NA  
## MATotal:GARTotal:LandArea     4.315e-04  1.740e-04   2.480   0.0188 *
## MATotal:GARTotal:AA                  NA         NA      NA       NA  
## MATotal:LandArea:AA                  NA         NA      NA       NA  
## GARTotal:LandArea:AA                 NA         NA      NA       NA  
## MATotal:GARTotal:LandArea:AA         NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29820 on 31 degrees of freedom
## Multiple R-squared:  0.9648, Adjusted R-squared:  0.9535 
## F-statistic: 85.03 on 10 and 31 DF,  p-value: < 2.2e-16

plot(model, 1,
     pch = 20,
     col = 'navy')

plot(model, 2,
     pch = 20,
     col = 'navy')

## Warning: not plotting observations with leverage one:
##   6, 14, 18

plot(model, 5,
     pch = 20,
     col = 'navy')

## Warning: not plotting observations with leverage one:
##   6, 14, 18

vif_model <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA, data = Property_Data) 
vif(vif_model)

##  MATotal GARTotal LandArea       AA 
## 4.554770 1.679801 4.197298 2.917910

#Row 8 Removed
clean_data <- Property_Data[-c(8),]
modell <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA + (MATotal*GARTotal*LandArea*AA), data = clean_data)
summary(modell)

## 
## Call:
## lm(formula = MarketVal ~ MATotal + GARTotal + LandArea + AA + 
##     (MATotal * GARTotal * LandArea * AA), data = clean_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80517  -4755   4796  11407  45467 
## 
## Coefficients: (5 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  -7.543e+06  3.326e+06  -2.268   0.0307 *
## MATotal                       2.665e+03  1.094e+03   2.435   0.0211 *
## GARTotal                      1.030e+04  4.056e+03   2.539   0.0165 *
## LandArea                      9.520e+02  4.133e+02   2.303   0.0284 *
## AA                            5.849e+03  2.512e+03   2.329   0.0268 *
## MATotal:GARTotal             -3.418e+00  1.360e+00  -2.512   0.0176 *
## MATotal:LandArea             -3.139e-01  1.347e-01  -2.331   0.0267 *
## GARTotal:LandArea            -1.241e+00  4.845e-01  -2.562   0.0157 *
## MATotal:AA                   -3.444e+00  1.502e+00  -2.293   0.0291 *
## GARTotal:AA                   8.419e+00  3.683e+00   2.286   0.0295 *
## LandArea:AA                          NA         NA      NA       NA  
## MATotal:GARTotal:LandArea     4.132e-04  1.601e-04   2.580   0.0150 *
## MATotal:GARTotal:AA                  NA         NA      NA       NA  
## MATotal:LandArea:AA                  NA         NA      NA       NA  
## GARTotal:LandArea:AA                 NA         NA      NA       NA  
## MATotal:GARTotal:LandArea:AA         NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27410 on 30 degrees of freedom
## Multiple R-squared:  0.9705, Adjusted R-squared:  0.9607 
## F-statistic: 98.66 on 10 and 30 DF,  p-value: < 2.2e-16

plot(modell, 1,
     pch = 20,
     col = 'navy')

plot(modell, 2,
     pch = 20,
     col = 'navy')

## Warning: not plotting observations with leverage one:
##   6, 13, 17

plot(modell, 5,
     pch = 20,
     col = 'navy')

## Warning: not plotting observations with leverage one:
##   6, 13, 17

vif_modell <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA, data = clean_data) 
vif(vif_modell)

##  MATotal GARTotal LandArea       AA 
## 4.527360 1.659574 4.603337 3.209099

#Prediction Time
house_6321 <- subset(Property_Data, HouseNum == 6321)

house_predict <- data.frame(
  MATotal = house_6321$MATotal,
  GARTotal = house_6321$GARTotal,
  AA = house_6321$AA,
  LandArea = house_6321$LandArea)

pred <- predict(model, newdata = house_predict, interval = 'prediction')
conf <- predict(model, newdata = house_predict, interval = 'confidence')

#Plotting Prediction
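fit <- pred[1, 'fit']      # best-fit value from the model
lower <- conf[1, 'lwr']    # lower confidence bound
upper <- conf[1, 'upr']    # upper confidence bound
actual <- house_6321$MarketVal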

y_min <- min(lower, actual)
y_max <- max(upper, actual)

plot(NA,
     xlim = c(0.75,1.25),
     ylim = c(y_min, y_max),
     ylab = 'Market Value ($)',
     main = '6321 Actual vs Model Prediction',
     xaxt = 'n')
points(1, actual,
       pch = 17,
       col = '#A63A50',
       cex = 1.25)
points(1, fit, 
       pch = 16,
       col = '#586994',
       cex = 1.25)
segments(1, lower, 1, upper,
         col = '#160F29', lwd = 2, lty = 1)
abline(h = fit, 
       col = '#AA8F66',
       lty = 2,
       lwd = 2)
abline(h = actual, 
       col = '#AA8F66',
       lty = 2,
       lwd = 2)