Texas relies heavily on property taxes, and each county has its own appraisal district. These districts publish enough information about property characteristics and appraised values that a model can be constructed to determine whether a given property is over- or undervalued by its appraisal district and, as a result, whether the owner is paying the proper property taxes.
The code below reads the .csv file containing the raw data from the Lubbock County appraisal district website, which is available via GitHub. All of the information is available to the public by searching the district's site by address or by name.
github_url <- 'https://raw.githubusercontent.com/dshorselover/StatDatAnalysis/refs/heads/main/Project%202%20Data-fin.csv'
Property_Data <- read.csv(github_url)
The next step renames the columns and displays the first six rows of the data. The columns are renamed to remove special characters, which avoids problems when creating the model. The last row is also removed because it contains only character data and would interfere with later analysis. The original column names are listed first, followed by the renamed data.
## [1] "House.Number" "Property.ID"
## [3] "X2025.Market.Value" "X2025.Improvement.Value"
## [5] "X2025.Land.Market.Value" "X2025.Assessed.Value"
## [7] "Total.Area" "MA.Total"
## [9] "MA.Total.Value" "GAR.Total"
## [11] "GAR.Total.Value" "Additional.Ammenities..ft."
## [13] "AA.Value" "Land.Area..ft."
## HouseNum PropertyID MarketVal ImprovementVal LandMarketVal AssessedVal
## 1 6309 R322649 735026 677026 58000 735026
## 2 6310 R322646 663907 603222 60685 663907
## 3 6311 R322648 569992 511992 58000 569992
## 4 6312 R322647 602427 538677 63750 602427
## 5 6313 R330751 460288 415135 45153 460288
## 6 6314 R330752 968766 888796 7970 968766
## TotalArea MATotal MATotalVal GARTotal GARTotalVal AA AAVal LandArea
## 1 4525 3462 593404.0 1063 83622.0 0 0 10000
## 2 4304 3226 452136.2 1078 151085.8 0 0 10463
## 3 4001 3036 447924.0 965 64068.0 0 0 10000
## 4 4186 3277 477232.0 909 61445.0 0 0 10625
## 5 2747 2241 376845.0 506 38290.0 0 0 7785
## 6 6368 4188 723336.0 985 79833.0 1195 85607 13788
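As a minimal sketch of the renaming step, the raw names can be mapped through a lookup vector so that each short name is tied to its raw name rather than to its position in the file. The three data rows are copied from the output above; the lookup `name_map` and the reduced set of columns are illustrative assumptions, not the report's own code.

```r
# Raw names as produced by read.csv() (a subset of the columns listed above)
raw <- data.frame(
  House.Number = c(6309, 6310, 6311),
  Property.ID = c("R322649", "R322646", "R322648"),
  X2025.Market.Value = c(735026, 663907, 569992),
  MA.Total = c(3462, 3226, 3036)
)

# Hypothetical lookup: raw name -> short name used in the model
name_map <- c(
  House.Number = "HouseNum",
  Property.ID = "PropertyID",
  X2025.Market.Value = "MarketVal",
  MA.Total = "MATotal"
)

# Fail early if the file contains a column the lookup does not know about
stopifnot(all(names(raw) %in% names(name_map)))

renamed <- raw
names(renamed) <- unname(name_map[names(raw)])
names(renamed)
```

Renaming by lookup keeps working if the columns of the .csv file are ever reordered, which a positional `colnames()` assignment would not.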
Starting with the summary of the raw data, basic properties of each column are determined: class, minimum, maximum, quartiles, and mean. These give a broad-strokes view of the data and help confirm that the .csv file was transferred correctly from the Excel file it was created in.
summary(Property_Data)
## HouseNum PropertyID MarketVal ImprovementVal
## Min. :6309 Length:42 Min. : 418386 Min. : 373092
## 1st Qu.:6319 Class :character 1st Qu.: 506390 1st Qu.: 460405
## Median :6330 Mode :character Median : 536431 Median : 490236
## Mean :6330 Mean : 570191 Mean : 519709
## 3rd Qu.:6340 3rd Qu.: 580772 3rd Qu.: 535879
## Max. :6350 Max. :1218146 Max. :1116617
## LandMarketVal AssessedVal TotalArea MATotal
## Min. : 7970 Min. : 418386 Min. :2747 Min. :2241
## 1st Qu.: 44944 1st Qu.: 506390 1st Qu.:3160 1st Qu.:2614
## Median : 45658 Median : 536431 Median :3362 Median :2763
## Mean : 48768 Mean : 570191 Mean :3637 Mean :2857
## 3rd Qu.: 46631 3rd Qu.: 580772 3rd Qu.:3913 3rd Qu.:2953
## Max. :101529 Max. :1218146 Max. :7117 Max. :4582
## MATotalVal GARTotal GARTotalVal AA
## Min. :283458 Min. : 479.0 Min. : 32308 Min. : 0.00
## 1st Qu.:416911 1st Qu.: 528.0 1st Qu.: 38354 1st Qu.: 0.00
## Median :444626 Median : 552.0 Median : 40867 Median : 0.00
## Mean :456314 Mean : 704.3 Mean : 53113 Mean : 76.36
## 3rd Qu.:458849 3rd Qu.: 936.0 3rd Qu.: 66713 3rd Qu.: 0.00
## Max. :899451 Max. :1119.0 Max. :151086 Max. :1468.00
## AAVal LandArea
## Min. : 0 Min. : 7501
## 1st Qu.: 0 1st Qu.: 7756
## Median : 0 Median : 7872
## Mean : 5547 Mean : 8695
## 3rd Qu.: 0 3rd Qu.: 8169
## Max. :120403 Max. :17505
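One way to go a step beyond the summary when checking the transfer is to test identities the columns might be expected to satisfy. The sketch below uses the six rows printed earlier and assumes, without confirmation from the appraisal district, that market value equals improvement value plus land value and that total area equals main building plus garage plus amenity area.

```r
# First six rows, copied from the head() output shown earlier
six <- data.frame(
  HouseNum       = c(6309, 6310, 6311, 6312, 6313, 6314),
  MarketVal      = c(735026, 663907, 569992, 602427, 460288, 968766),
  ImprovementVal = c(677026, 603222, 511992, 538677, 415135, 888796),
  LandMarketVal  = c(58000, 60685, 58000, 63750, 45153, 7970),
  TotalArea      = c(4525, 4304, 4001, 4186, 2747, 6368),
  MATotal        = c(3462, 3226, 3036, 3277, 2241, 4188),
  GARTotal       = c(1063, 1078, 965, 909, 506, 985),
  AA             = c(0, 0, 0, 0, 0, 1195)
)

# Assumed identities (not stated by the appraisal district)
value_gap <- six$MarketVal - (six$ImprovementVal + six$LandMarketVal)
area_gap  <- six$TotalArea - (six$MATotal + six$GARTotal + six$AA)

# Rows where either identity fails are worth re-checking against the source
flagged <- six$HouseNum[value_gap != 0 | area_gap != 0]
flagged
```

On these six rows the area identity holds throughout, while house 6314 is flagged because its listed land value of 7970 does not add up to its market value. Whether that is a transcription error cannot be determined from the report alone.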
The next part of exploratory data analysis is the histogram, which shows which values occur most often in the data set. The plot below shows that there are outliers beyond $800,000. This is most likely because those properties have features beyond a house and a garage. Any additional features, such as pools or pool houses, are grouped together as amenities to simplify the model later on. These properties may interfere with the model's fit and may be removed from the model.
hist(Property_Data$MarketVal,
main = 'Total Market Value',
xlab = 'Market Value',
col = 'lightpink3')
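A numeric counterpart to reading outliers off the histogram is Tukey's fence, a common rule of thumb that the report itself does not use. The sketch below applies it to the market value quartiles reported in the summary output and to the six market values printed earlier.

```r
# Quartiles of MarketVal, taken from the summary() output above
q1 <- 506390
q3 <- 580772

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # Tukey's upper fence

# Market values of the first six rows shown earlier
market_val <- c(735026, 663907, 569992, 602427, 460288, 968766)
above_fence <- market_val[market_val > upper_fence]

upper_fence
above_fence
```

The fence lands at $692,345, which is stricter than the roughly $800,000 cut suggested by the histogram, so it should be read as a screening aid rather than a rule for dropping rows.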
Continuing with the data exploration, the plot below looks at how the area of the main building, or house, affects the total market value of the property. There are also outliers on the far right of the plot, and the model will likely require some kind of outlier correction or removal.
plot(Property_Data$MATotal, Property_Data$MarketVal,
main = 'Main Building Area vs Total Market Value',
xlab = 'Main Building Area (ft^2)',
ylab = 'Total Property Market Value',
pch = 20,
col = 'lightpink3')
The garage area plot, like the main building area plot above, shows how the area of the garage affects the total market value of the property. The points fall into separate clusters because garage area is not truly continuous; it is based on the number of cars. For example, the cluster in the bottom left is most likely one- or two-car garages. The cluster in the bottom right is most likely either four-car garages or utility garages used to store boats. The outliers in the top right correspond to the outliers in the histogram, which are most likely the properties with additional amenities.
plot(Property_Data$GARTotal, Property_Data$MarketVal,
main = 'Garage Area vs Total Market Value',
xlab = 'Garage Area (ft^2)',
ylab = 'Total Property Market Value',
pch = 20,
col = 'lightpink3')
The last exploratory plot, shown below, is the effect of additional amenities on the total market value. As expected, most of the values sit at 0 \(ft^2\) since most of the properties don’t have any additional amenities. Unsurprisingly, the larger the area of the amenities, the greater the total market value. This plot depicts the effect of the outliers shown in earlier plots.
plot(Property_Data$AA, Property_Data$MarketVal,
main = 'Amenity vs Total Market Value',
xlab = 'Amenity Area (ft^2)',
ylab = 'Total Property Market Value',
pch = 20,
col = 'lightpink3')
Due to the outliers and multiple factors, building the model will take some trial and error. Each attempt builds the model by fitting the data with the “lm” function. In simple linear regression there is only one predictor and one response variable. In multiple linear regression, the response variable is modeled using multiple predictor variables, as shown in the equation below.
\[ y = \sum_{i} X_i\beta_i + \epsilon \]
In this equation, each predictor variable’s effect on the response is represented by \(X_i\beta_i\), with \(\epsilon\) representing the random error in the model. The model is linear in the \(\beta\) coefficients, which describe the effects of X on the response variable, y. In this model, the response variable is the total market value. The predictor variables include, but may not be limited to:
Main Building Area (represented by MATotal in the code)
Garage Area (represented by GARTotal in the code)
Land Area (represented by LandArea in the code)
Amenity Area (represented by AA in the code)
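Written out for the four predictors listed above, and leaving any interaction terms aside, the additive part of the model would take the following form. The intercept \(\beta_0\) and the numbering of the coefficients are notation assumed here for illustration; the form agrees with the general equation when \(X_0 = 1\).
\[ \text{MarketVal} = \beta_0 + \beta_1\,\text{MATotal} + \beta_2\,\text{GARTotal} + \beta_3\,\text{LandArea} + \beta_4\,\text{AA} + \epsilon \]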
The code below fits the first model, which should describe the effects of each factor, individually and combined, on the total market value. The summary, also shown below, gives the estimated \(\beta_i\) values of the multiple linear regression, their t-statistics, the \(R^2\) statistic, and the F-statistic. Focusing on the \(R^2\) and the p-value from the F-statistic, the first-attempt model appears to be a good fit. The \(R^2\) is about 96% (adjusted \(R^2\) about 95%) and the p-value is well below the 0.05 threshold that most industries consider acceptable. These summary statistics indicate that the model fits the data very well. Residual analysis follows to assess the model's adequacy.
The model is built on the belief that each area component individually affects the total market value and that the interactions among them do as well.
model <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA + MATotal*GARTotal*LandArea*AA, data = Property_Data)
summary(model)
##
## Call:
## lm(formula = MarketVal ~ MATotal + GARTotal + LandArea + AA +
## MATotal * GARTotal * LandArea * AA, data = Property_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80687 -4474 4481 11722 60716
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.066e+06 3.612e+06 -2.233 0.0329 *
## MATotal 2.847e+03 1.188e+03 2.397 0.0228 *
## GARTotal 1.082e+04 4.407e+03 2.454 0.0199 *
## LandArea 1.007e+03 4.491e+02 2.242 0.0323 *
## AA 6.082e+03 2.731e+03 2.227 0.0333 *
## MATotal:GARTotal -3.629e+00 1.477e+00 -2.456 0.0198 *
## MATotal:LandArea -3.328e-01 1.463e-01 -2.275 0.0300 *
## GARTotal:LandArea -1.284e+00 5.268e-01 -2.437 0.0207 *
## MATotal:AA -3.589e+00 1.633e+00 -2.198 0.0356 *
## GARTotal:AA 8.793e+00 4.004e+00 2.196 0.0357 *
## LandArea:AA NA NA NA NA
## MATotal:GARTotal:LandArea 4.315e-04 1.740e-04 2.480 0.0188 *
## MATotal:GARTotal:AA NA NA NA NA
## MATotal:LandArea:AA NA NA NA NA
## GARTotal:LandArea:AA NA NA NA NA
## MATotal:GARTotal:LandArea:AA NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29820 on 31 degrees of freedom
## Multiple R-squared: 0.9648, Adjusted R-squared: 0.9535
## F-statistic: 85.03 on 10 and 31 DF, p-value: < 2.2e-16
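To see which terms the model formula actually requests, the formula can be expanded with `terms()` without fitting anything. The sketch below compares the formula as written in the report with a shorter form, `f_short`, which is introduced here only for comparison.

```r
# Formula as written in the report, and a shorter form for comparison
f_report <- MarketVal ~ MATotal + GARTotal + LandArea + AA +
  MATotal * GARTotal * LandArea * AA
f_short <- MarketVal ~ MATotal * GARTotal * LandArea * AA

# terms() expands a formula without needing any data
labels_report <- attr(terms(f_report), "term.labels")
labels_short  <- attr(terms(f_short), "term.labels")

length(labels_report)                  # number of model terms requested
setequal(labels_report, labels_short)  # TRUE: both formulas request the same terms
```

Because `*` already expands to all main effects and interactions, both formulas request the same fifteen terms. Ten of them are estimated and five are reported as NA in the summary above. A likely reason, offered here as an assumption, is that only a few properties have a nonzero amenity area, so several of the AA interaction columns carry no independent information.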
Below is the first plot for residual analysis, the Residuals vs Fitted plot, which is used to check the homoscedasticity of the model. Ideally the points would not be clustered to the left; instead they would be randomly distributed across the entire plot. Despite this, the red line is relatively flat across the graph, which indicates no strong trend in the residuals. There are outliers that still appear to correspond to the properties with amenities such as pools and pool houses. These points, along with the amenities predictor variable, may be taken out in later attempts depending on how the rest of the residual analysis goes.
plot(model, 1,
pch = 20,
col = 'navy')
The plot below is the Q-Q plot, which shows how close to normally distributed the residuals are. This is an acceptable plot for normally distributed residuals, as most of the points fall close to a straight line. The tails are heavy, with points 32, 10, and 8 labeled on the plot as the most extreme. These points are worth investigating if there are stronger indicators of a lack of model fit.
plot(model, 2,
pch = 20,
col = 'navy')
## Warning: not plotting observations with leverage one:
## 6, 14, 18
The final plot to check the residuals is the Residuals vs Leverage plot. It identifies influential outliers that could be drastically affecting the model by placing any such points outside the Cook's distance lines. While point 1 is very close, it is still within the bounds of the lines. Point 8 is labeled on multiple plots, which warrants checking how the model changes when it is removed.
plot(model, 5,
pch = 20,
col = 'navy')
## Warning: not plotting observations with leverage one:
## 6, 14, 18
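The warning about observations with leverage one means that those rows are fit exactly by the model, so they have no usable standardized residual. The toy example below, with made-up numbers, shows how this can arise when a single row is the only one with a nonzero value for a predictor. That this is what happens to rows 6, 14, and 18 in the report is an assumption, although row 6 is one of the few properties with a nonzero amenity area.

```r
# Toy data (made up): only observation 6 has a nonzero 'extra' value
toy <- data.frame(
  y     = c(10, 12, 15, 13, 18, 40),
  x     = c(1, 2, 3, 4, 5, 6),
  extra = c(0, 0, 0, 0, 0, 1)
)

fit <- lm(y ~ x + extra, data = toy)

# Leverage (hat value) of each observation; 1 means the point is fit exactly
h <- hatvalues(fit)
leverage_one <- which(abs(h - 1) < 1e-8)

round(h, 3)
leverage_one
```

Applied to the fitted property model, `hatvalues(model)` would list the leverage of every row, which makes it possible to inspect the affected properties directly rather than relying on the plot warning.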
The final check for model adequacy is the Variance Inflation Factor (VIF), which tests for multicollinearity. The interaction term in the model is too complex for VIF, so the analysis is done only on the main effects. None of the VIFs is over five, so multicollinearity is considered acceptable.
vif_model <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA, data = Property_Data)
vif(vif_model)
## MATotal GARTotal LandArea AA
## 4.554770 1.679801 4.197298 2.917910
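For a model with only main effects, the VIF of a predictor is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the others. The sketch below computes this by hand in base R; the helper name `vif_manual` is hypothetical, and R's built-in `mtcars` data is used as a stand-in because the property data is not reproduced here.

```r
# Hypothetical helper: VIF of each predictor from its regression on the others
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # R^2 from regressing predictor p on the remaining predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example on built-in data (not the property data)
vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
round(vifs, 2)
```

With the property data loaded, the call would presumably be `vif_manual(Property_Data, c('MATotal', 'GARTotal', 'LandArea', 'AA'))`, which should agree with the `vif()` output above.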
To check that the original model is the best fit, row 8 is removed from the data set. It was the row most often flagged as unusual and the most likely to be a slight outlier. To determine whether it is in fact an outlier, the model is refit without the row to see if model adequacy improves. If it does not, the original model is the better model and will be used in the prediction.
The line of code below removes that row from the data set so that the reduced data can be put into the model. The same analysis process is applied to this new model.
clean_data <- Property_Data[-c(8),]
The summary below shows that the \(R^2\) value of the new model improved marginally without the outlier (from 0.9648 to 0.9705), while the F-test p-value is unchanged. The coefficients changed, but their t-test p-values show that, with the outlier removed, no predictor variable needs to be dropped from the model. All of the coefficients shrank slightly in magnitude; the amenity coefficient, for example, fell from about 6,082 to about 5,849.
modell <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA + (MATotal*GARTotal*LandArea*AA), data = clean_data)
summary(modell)
##
## Call:
## lm(formula = MarketVal ~ MATotal + GARTotal + LandArea + AA +
## (MATotal * GARTotal * LandArea * AA), data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80517 -4755 4796 11407 45467
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.543e+06 3.326e+06 -2.268 0.0307 *
## MATotal 2.665e+03 1.094e+03 2.435 0.0211 *
## GARTotal 1.030e+04 4.056e+03 2.539 0.0165 *
## LandArea 9.520e+02 4.133e+02 2.303 0.0284 *
## AA 5.849e+03 2.512e+03 2.329 0.0268 *
## MATotal:GARTotal -3.418e+00 1.360e+00 -2.512 0.0176 *
## MATotal:LandArea -3.139e-01 1.347e-01 -2.331 0.0267 *
## GARTotal:LandArea -1.241e+00 4.845e-01 -2.562 0.0157 *
## MATotal:AA -3.444e+00 1.502e+00 -2.293 0.0291 *
## GARTotal:AA 8.419e+00 3.683e+00 2.286 0.0295 *
## LandArea:AA NA NA NA NA
## MATotal:GARTotal:LandArea 4.132e-04 1.601e-04 2.580 0.0150 *
## MATotal:GARTotal:AA NA NA NA NA
## MATotal:LandArea:AA NA NA NA NA
## GARTotal:LandArea:AA NA NA NA NA
## MATotal:GARTotal:LandArea:AA NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27410 on 30 degrees of freedom
## Multiple R-squared: 0.9705, Adjusted R-squared: 0.9607
## F-statistic: 98.66 on 10 and 30 DF, p-value: < 2.2e-16
The residual plot below lost the flatness of its trend line, and the distribution of the points across the graph did not improve. Removing the outlier did not improve the Residuals vs Fitted plot.
plot(modell, 1,
pch = 20,
col = 'navy')
The Q-Q plot below changed very little from the original model. Because of this lack of change, it is not a good indicator of which model is the better fit.
plot(modell, 2,
pch = 20,
col = 'navy')
## Warning: not plotting observations with leverage one:
## 6, 13, 17
The final plot is the most telling with regard to this model's adequacy. The Residuals vs Leverage plot below changed drastically from the original model's and shows a clear outlier in row 16 of this model. Because removing row 8 created a new outlier instead of reducing the effect of the possible outliers in the original model, the new model cannot be used to predict the value of a given house. It is a worse model than the original.
plot(modell, 5,
pch = 20,
col = 'navy')
## Warning: not plotting observations with leverage one:
## 6, 13, 17
Below is the VIF for the new model. The values shift slightly from the original model, but all remain below five.
vif_modell <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA, data = clean_data)
vif(vif_modell)
## MATotal GARTotal LandArea AA
## 4.527360 1.659574 4.603337 3.209099
After all the model adequacy analysis, the new model is not a significant improvement. There is no reason to use it over the original model when it degraded some of the residual plots instead of improving them.
As requested, the original model is used below to determine whether the owners of property 6321 are paying the correct property taxes.
The first step is to pull out the data needed to predict the value of house 6321. The code below takes the actual data for house 6321 and puts it into its own data set so that it can be passed to the prediction function to determine if the property is over- or undervalued.
house_6321 <- subset(Property_Data, HouseNum == 6321)
house_predict <- data.frame(
MATotal = house_6321$MATotal,
GARTotal = house_6321$GARTotal,
AA = house_6321$AA,
LandArea = house_6321$LandArea)
The code below predicts a range of values for house 6321. It passes the data for that property to the model to obtain a best-fit value along with lower and upper bounds. These values are stored in individual variables, along with the county appraisal district's value for the property. The best-fit value is the model's estimate of what house 6321 should be valued at. The lower and upper bounds used for the plot come from the confidence interval, which describes the uncertainty in the average market value that the model, fit to all the properties on the street, estimates for a house with these areas. The prediction interval is also computed; it additionally accounts for how much an individual property varies around that average, so it is wider, but here only its best-fit value is used.
pred <- predict(model, newdata = house_predict, interval = 'prediction')
conf <- predict(model, newdata = house_predict, interval = 'confidence')
fit <- pred[1]
lower <- conf[2]
upper <- conf[3]
actual <- house_6321$MarketVal
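To make the difference between the two interval types concrete, the sketch below fits a simple model to R's built-in `cars` data, used only as a stand-in for the property model, and compares the widths of the confidence and prediction intervals at one point. It also indexes the result by column name, which is an alternative to the positional indexing used above.

```r
# Stand-in model on R's built-in 'cars' data (not the property data)
fit_cars <- lm(dist ~ speed, data = cars)
new_obs <- data.frame(speed = 15)

conf_int <- predict(fit_cars, newdata = new_obs, interval = "confidence")
pred_int <- predict(fit_cars, newdata = new_obs, interval = "prediction")

# Both calls return a one-row matrix with columns fit, lwr, upr
conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]

c(confidence = conf_width, prediction = pred_width)
```

Both calls return the same fitted value, but the prediction interval is wider because it includes the scatter of individual observations around the mean. Which interval to judge a single appraisal against is a modeling choice, and using the prediction interval would be the more lenient test.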
The code below creates a visual comparison of the predicted value and the actual value. A blue dot marks the model's best-fit value for the property. The tan dashed lines make it easier to read the predicted fit and the actual value off the plot. The pink triangle marks the county’s value for the property. It lies within the plotted interval, shown as the nearly black line, and its distance from the fit most likely reflects some other factor not included on the county appraisal district’s public access page. These factors could be something as simple as how clean the yard looked during evaluation or how long the Christmas lights had been left up. Both factors legitimately change the value of the house but are easy fixes.
y_min <- min(lower, actual)
y_max <- max(upper, actual)
plot(NA,
xlim = c(0.75,1.25),
ylim = c(y_min, y_max),
ylab = 'Market Value ($)',
main = '6321 Actual vs Model Prediction',
xaxt = 'n')
points(1, actual,
pch = 17,
col = '#A63A50',
cex = 1.25)
points(1, fit,
pch = 16,
col = '#586994',
cex = 1.25)
segments(1, lower, 1, upper,
col = '#160F29', lwd = 2, lty = 1)
abline(h = fit,
col = '#AA8F66',
lty = 2,
lwd = 2)
abline(h = actual,
col = '#AA8F66',
lty = 2,
lwd = 2)
After analyzing the plot, the actual value is considerably higher than the predicted value of the house. Unless the county can provide a plausible reason for the difference, an appeal is warranted. It may also be prudent to determine why there is such a large difference in the county’s assessment.
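The comparison read off the plot can also be expressed as numbers. The helper below, `compare_appraisal`, is a hypothetical function, and the dollar amounts passed to it are made up purely to show the output format; they are not the values for house 6321.

```r
# Hypothetical helper: where does an appraised value sit relative to a fit?
compare_appraisal <- function(actual, fit, lower, upper) {
  list(
    difference  = actual - fit,                 # dollars above the fit
    pct_diff    = 100 * (actual - fit) / fit,   # percent above the fit
    in_interval = actual >= lower && actual <= upper
  )
}

# Made-up numbers, only to show the output format
res <- compare_appraisal(actual = 550000, fit = 500000,
                         lower = 480000, upper = 520000)
res$pct_diff
res$in_interval
```

With the objects from the report in memory, the call would be `compare_appraisal(actual, fit, lower, upper)`, giving a dollar and percentage difference that could be quoted in an appeal.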
#Library
library(car)
#Data
github_url <- 'https://raw.githubusercontent.com/dshorselover/StatDatAnalysis/refs/heads/main/Project%202%20Data-fin.csv'
Property_Data <- read.csv(github_url)
names(Property_Data)
## [1] "House.Number" "Property.ID"
## [3] "X2025.Market.Value" "X2025.Improvement.Value"
## [5] "X2025.Land.Market.Value" "X2025.Assessed.Value"
## [7] "Total.Area" "MA.Total"
## [9] "MA.Total.Value" "GAR.Total"
## [11] "GAR.Total.Value" "Additional.Ammenities..ft."
## [13] "AA.Value" "Land.Area..ft."
colnames(Property_Data) <- c('HouseNum', 'PropertyID', 'MarketVal', 'ImprovementVal', 'LandMarketVal', 'AssessedVal', 'TotalArea', 'MATotal', 'MATotalVal', 'GARTotal', 'GARTotalVal', 'AA', 'AAVal', 'LandArea')
Property_Data <- Property_Data[-c(43),]
head(Property_Data)
## HouseNum PropertyID MarketVal ImprovementVal LandMarketVal AssessedVal
## 1 6309 R322649 735026 677026 58000 735026
## 2 6310 R322646 663907 603222 60685 663907
## 3 6311 R322648 569992 511992 58000 569992
## 4 6312 R322647 602427 538677 63750 602427
## 5 6313 R330751 460288 415135 45153 460288
## 6 6314 R330752 968766 888796 7970 968766
## TotalArea MATotal MATotalVal GARTotal GARTotalVal AA AAVal LandArea
## 1 4525 3462 593404.0 1063 83622.0 0 0 10000
## 2 4304 3226 452136.2 1078 151085.8 0 0 10463
## 3 4001 3036 447924.0 965 64068.0 0 0 10000
## 4 4186 3277 477232.0 909 61445.0 0 0 10625
## 5 2747 2241 376845.0 506 38290.0 0 0 7785
## 6 6368 4188 723336.0 985 79833.0 1195 85607 13788
#EDA
summary(Property_Data)
## HouseNum PropertyID MarketVal ImprovementVal
## Min. :6309 Length:42 Min. : 418386 Min. : 373092
## 1st Qu.:6319 Class :character 1st Qu.: 506390 1st Qu.: 460405
## Median :6330 Mode :character Median : 536431 Median : 490236
## Mean :6330 Mean : 570191 Mean : 519709
## 3rd Qu.:6340 3rd Qu.: 580772 3rd Qu.: 535879
## Max. :6350 Max. :1218146 Max. :1116617
## LandMarketVal AssessedVal TotalArea MATotal
## Min. : 7970 Min. : 418386 Min. :2747 Min. :2241
## 1st Qu.: 44944 1st Qu.: 506390 1st Qu.:3160 1st Qu.:2614
## Median : 45658 Median : 536431 Median :3362 Median :2763
## Mean : 48768 Mean : 570191 Mean :3637 Mean :2857
## 3rd Qu.: 46631 3rd Qu.: 580772 3rd Qu.:3913 3rd Qu.:2953
## Max. :101529 Max. :1218146 Max. :7117 Max. :4582
## MATotalVal GARTotal GARTotalVal AA
## Min. :283458 Min. : 479.0 Min. : 32308 Min. : 0.00
## 1st Qu.:416911 1st Qu.: 528.0 1st Qu.: 38354 1st Qu.: 0.00
## Median :444626 Median : 552.0 Median : 40867 Median : 0.00
## Mean :456314 Mean : 704.3 Mean : 53113 Mean : 76.36
## 3rd Qu.:458849 3rd Qu.: 936.0 3rd Qu.: 66713 3rd Qu.: 0.00
## Max. :899451 Max. :1119.0 Max. :151086 Max. :1468.00
## AAVal LandArea
## Min. : 0 Min. : 7501
## 1st Qu.: 0 1st Qu.: 7756
## Median : 0 Median : 7872
## Mean : 5547 Mean : 8695
## 3rd Qu.: 0 3rd Qu.: 8169
## Max. :120403 Max. :17505
hist(Property_Data$MarketVal,
main = 'Total Market Value',
xlab = 'Market Value',
col = 'lightpink3')
plot(Property_Data$MATotal, Property_Data$MarketVal,
main = 'Main Building Area vs Total Market Value',
xlab = 'Main Building Area (ft^2)',
ylab = 'Total Property Market Value',
pch = 20,
col = 'lightpink3')
plot(Property_Data$GARTotal, Property_Data$MarketVal,
main = 'Garage Area vs Total Market Value',
xlab = 'Garage Area (ft^2)',
ylab = 'Total Property Market Value',
pch = 20,
col = 'lightpink3')
plot(Property_Data$AA, Property_Data$MarketVal,
main = 'Amenity vs Total Market Value',
xlab = 'Amenity Area (ft^2)',
ylab = 'Total Property Market Value',
pch = 20,
col = 'lightpink3')
#Model
model <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA + MATotal*GARTotal*LandArea*AA, data = Property_Data)
summary(model)
##
## Call:
## lm(formula = MarketVal ~ MATotal + GARTotal + LandArea + AA +
## MATotal * GARTotal * LandArea * AA, data = Property_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80687 -4474 4481 11722 60716
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.066e+06 3.612e+06 -2.233 0.0329 *
## MATotal 2.847e+03 1.188e+03 2.397 0.0228 *
## GARTotal 1.082e+04 4.407e+03 2.454 0.0199 *
## LandArea 1.007e+03 4.491e+02 2.242 0.0323 *
## AA 6.082e+03 2.731e+03 2.227 0.0333 *
## MATotal:GARTotal -3.629e+00 1.477e+00 -2.456 0.0198 *
## MATotal:LandArea -3.328e-01 1.463e-01 -2.275 0.0300 *
## GARTotal:LandArea -1.284e+00 5.268e-01 -2.437 0.0207 *
## MATotal:AA -3.589e+00 1.633e+00 -2.198 0.0356 *
## GARTotal:AA 8.793e+00 4.004e+00 2.196 0.0357 *
## LandArea:AA NA NA NA NA
## MATotal:GARTotal:LandArea 4.315e-04 1.740e-04 2.480 0.0188 *
## MATotal:GARTotal:AA NA NA NA NA
## MATotal:LandArea:AA NA NA NA NA
## GARTotal:LandArea:AA NA NA NA NA
## MATotal:GARTotal:LandArea:AA NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29820 on 31 degrees of freedom
## Multiple R-squared: 0.9648, Adjusted R-squared: 0.9535
## F-statistic: 85.03 on 10 and 31 DF, p-value: < 2.2e-16
plot(model, 1,
pch = 20,
col = 'navy')
plot(model, 2,
pch = 20,
col = 'navy')
## Warning: not plotting observations with leverage one:
## 6, 14, 18
plot(model, 5,
pch = 20,
col = 'navy')
## Warning: not plotting observations with leverage one:
## 6, 14, 18
vif_model <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA, data = Property_Data)
vif(vif_model)
## MATotal GARTotal LandArea AA
## 4.554770 1.679801 4.197298 2.917910
clean_data <- Property_Data[-c(8),]
modell <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA + (MATotal*GARTotal*LandArea*AA), data = clean_data)
summary(modell)
##
## Call:
## lm(formula = MarketVal ~ MATotal + GARTotal + LandArea + AA +
## (MATotal * GARTotal * LandArea * AA), data = clean_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80517 -4755 4796 11407 45467
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.543e+06 3.326e+06 -2.268 0.0307 *
## MATotal 2.665e+03 1.094e+03 2.435 0.0211 *
## GARTotal 1.030e+04 4.056e+03 2.539 0.0165 *
## LandArea 9.520e+02 4.133e+02 2.303 0.0284 *
## AA 5.849e+03 2.512e+03 2.329 0.0268 *
## MATotal:GARTotal -3.418e+00 1.360e+00 -2.512 0.0176 *
## MATotal:LandArea -3.139e-01 1.347e-01 -2.331 0.0267 *
## GARTotal:LandArea -1.241e+00 4.845e-01 -2.562 0.0157 *
## MATotal:AA -3.444e+00 1.502e+00 -2.293 0.0291 *
## GARTotal:AA 8.419e+00 3.683e+00 2.286 0.0295 *
## LandArea:AA NA NA NA NA
## MATotal:GARTotal:LandArea 4.132e-04 1.601e-04 2.580 0.0150 *
## MATotal:GARTotal:AA NA NA NA NA
## MATotal:LandArea:AA NA NA NA NA
## GARTotal:LandArea:AA NA NA NA NA
## MATotal:GARTotal:LandArea:AA NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27410 on 30 degrees of freedom
## Multiple R-squared: 0.9705, Adjusted R-squared: 0.9607
## F-statistic: 98.66 on 10 and 30 DF, p-value: < 2.2e-16
plot(modell, 1,
pch = 20,
col = 'navy')
plot(modell, 2,
pch = 20,
col = 'navy')
## Warning: not plotting observations with leverage one:
## 6, 13, 17
plot(modell, 5,
pch = 20,
col = 'navy')
## Warning: not plotting observations with leverage one:
## 6, 13, 17
vif_modell <- lm(MarketVal ~ MATotal + GARTotal + LandArea + AA, data = clean_data)
vif(vif_modell)
## MATotal GARTotal LandArea AA
## 4.527360 1.659574 4.603337 3.209099
#Prediction Time
house_6321 <- subset(Property_Data, HouseNum == 6321)
house_predict <- data.frame(
MATotal = house_6321$MATotal,
GARTotal = house_6321$GARTotal,
AA = house_6321$AA,
LandArea = house_6321$LandArea)
pred <- predict(model, newdata = house_predict, interval = 'prediction')
conf <- predict(model, newdata = house_predict, interval = 'confidence')
#Plotting Prediction
fit <- pred[1]
lower <- conf[2]
upper <- conf[3]
actual <- house_6321$MarketVal
y_min <- min(lower, actual)
y_max <- max(upper, actual)
plot(NA,
xlim = c(0.75,1.25),
ylim = c(y_min, y_max),
ylab = 'Market Value ($)',
main = '6321 Actual vs Model Prediction',
xaxt = 'n')
points(1, actual,
pch = 17,
col = '#A63A50',
cex = 1.25)
points(1, fit,
pch = 16,
col = '#586994',
cex = 1.25)
segments(1, lower, 1, upper,
col = '#160F29', lwd = 2, lty = 1)
abline(h = fit,
col = '#AA8F66',
lty = 2,
lwd = 2)
abline(h = actual,
col = '#AA8F66',
lty = 2,
lwd = 2)