Property taxes are a form of municipal revenue. Property taxes are based upon an assessed property value. Property values can be dependent upon the individual tax assessor who ascertained the value. Because of this, homeowners have the ability to appeal their assessment.
To use linear regression to predict the assessed value of the property at 6321 88th St Lubbock Tx using publicly available parameters of nearby properties.
Data was collected from lubbockcad.org for the properties 6309 - 6351 88th Street, Lubbock, Texas. The variables collected are summarized in the table below:
| Variables | Explanation |
|---|---|
| 2025 Market Value | Total value of property appraisal (house+land) |
| Total Improvement Market Value | Total value of house appraisal |
| Total Land Market Value | Total value of land appraisal |
| Homestead Cap Loss | Represents a discount only in the current tax year if the appraised value from the previous year went up by more than 10% |
| Total Main Area (Sq. Ft.) | Total square footage of house |
| Main Area (Sq. Ft.) | Total square footage of heated house area |
| Main Area (Value) | Total value of heated house area |
| Garage (Sq. Ft.) | Total square footage of non-heated house area |
| Garage (Value) | Total value of non-heated house area |
| Land (Sq. Ft.) | Total square footage of land |
The data was cleaned by removing entries which did not have complete data. Also entries which had multiple garage values were combined into a single garage value.
Histograms and Box Plots were used to examine the distribution of assessed property value across the sample.
Pairwise correlations were examined between all possible predictor variables using a correlation matrix.
Simple linear regression modeling using least squares estimation was used to regress 2025 Market Value on the other variables list in the Data Collection Section.
\[ y = \beta _{0} + \beta _{1}x_{1}+ \beta _{2}x_{2} + \beta _{3}x_{3} + \beta _{4}x_{4}+\varepsilon \] Model adequacy was evaluated by analyzing the variance with respect to the independent variable and the normality of the residuals. The models were also evaluated by the proportion of variance explained by the independent variable through the use of R-squared.
\[ R^{2}=1-\frac{SSR}{SST} \]
Multicollinearity was also evaluated by analyzing the variance inflation factor for the terms in the model.
\[
VIF_{i}=\frac{1}{1-R_{i}^{2}}
\]
The model which was evaluated as the strongest model based on adequacy and multicollinearity was then analyzed for outliers. This was evaluated by observing residuals. Also Cook’s distance was calculated for each residual using the following formula:
\[ D_{i}=\frac{\left ( \beta _{(i)}-\beta \right )^{T}X^{T}X\left (\hat{\beta_{(i)}-\hat{\beta} } \right )}{p*MSE} \]
A 95% confidence interval on the mean was calculated with the following formula.
\[(\hat{\beta}_{0}+\hat{\beta}_{1}x_{j})\pm t_{1-\alpha /2,n-2}\left ( \sqrt{MSE\left [ \frac{1}{n}+\frac{(x_{j}-\bar{x})^{2}}{\sum (x_{j}-\bar{x})^{2}} \right]} \right )\]
A prediction interval was calculated to account for variation due to point spread form the mean by the following formula.
\[(\hat{\beta}_{0}+\hat{\beta}_{1}x_{j})\pm t_{1-\alpha /2,n-2}\left ( \sqrt{MSE\left [1+ \frac{1}{n}+\frac{(x_{j}-\bar{x})^{2}}{\sum (x_{j}-\bar{x})^{2}} \right]} \right )\]
The data that was origianlly extracte from https://lubbockcad.org/ was cleaned and uploaded to github. The cleaned data set was read from https://raw.githubusercontent.com/jkupke9/IE5334/refs/heads/main/cleaned%20CAD%20data.csv
cad_df <- read.csv('https://raw.githubusercontent.com/jkupke9/IE5334/refs/heads/main/cleaned%20CAD%20data.csv')
cad_df[,2:11] <- lapply(cad_df[,2:11], as.numeric)
colnames(cad_df) <- c("Address","Market.Value", "Total.Imp.Val", "Total.Land.Val", "Homestead", "Total.Main.Area",
"Main.Area", "Main.Area.Val", "Garage.Area", "Garage.Value", "Land.Area")
cad_df <- cad_df[-c(12,42),]#the record for the property of interest and an extra row of NA values are removed
A histogram and box plot of the 2025 Market Value data set reveals a right skewed distribution. This means the data is not normally distributed. There are also several points that might be outliers based on where they are positioned relative to the rest of the distribution.
hist(cad_df$Market.Value, main="Frequency of 2025 Market Value", xlab="2025 Market Value ($)")
boxplot(cad_df$Market.Value, main="Boxplot of 2025 Market Value",
ylab="2025 Market Value ($)")
All the value parameters were removed as they are outputs of the assessment process and not suitable as predictors for our modeling purposes.A Correlation plot of all the remaining variables shows a very high correlation(nearly 1 to 1) between Total Main Area and Main Area. This is because for most of the records, these values are the same as the house area and heated area are the same. There was also a moderately high correlation between both Main Area parameters and Land Area. This is reasonable as larger homes normally have larger plots of land. There is also a moderate amount of correlation between both Main Area parameters and Garage Area. This is also reasonable as larger homes tend to have larger garages.
num_cad_df <- cad_df[,-c(1, 3, 4, 5, 8, 10)]
cor(num_cad_df[,-1])
## Total.Main.Area Main.Area Garage.Area Land.Area
## Total.Main.Area 1.0000000 0.9868392 0.5643433 0.8448745
## Main.Area 0.9868392 1.0000000 0.5310268 0.8366394
## Garage.Area 0.5643433 0.5310268 1.0000000 0.5319719
## Land.Area 0.8448745 0.8366394 0.5319719 1.0000000
#install.packages("corrplot")
library(corrplot)
#install.packages("car")
library(car)
corrplot(cor(num_cad_df[,-1]),method="number", type="upper", main="Correlation of Apprasal Data Parameters", mar=c(0,0,2,0))
Multiple regression models were evaluated to determine the best relationship to use to predict the response, 2025 Market Value, with a variety of square footage parameters as predictors.
The first model simply regressed 2025 Market Value on all square footage parameters using multiple linear regression. This can be represented by the formula $2025.Market.Value = *{0} +* (Total.Main.Area)+ *{2}(Main.Area) +* (Garage.Area) + _{4}(Land.Area) +$
model <- lm(Market.Value~.,data=num_cad_df)
summary(model)
##
## Call:
## lm(formula = Market.Value ~ ., data = num_cad_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109329 -8735 8804 19978 78042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.923e+05 4.368e+04 -4.403 9.59e-05 ***
## Total.Main.Area -6.464e+00 9.384e+01 -0.069 0.9455
## Main.Area 1.754e+02 9.273e+01 1.891 0.0669 .
## Garage.Area 3.566e+01 3.771e+01 0.946 0.3507
## Land.Area 2.973e+01 6.190e+00 4.803 2.91e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41440 on 35 degrees of freedom
## Multiple R-squared: 0.9223, Adjusted R-squared: 0.9135
## F-statistic: 103.9 on 4 and 35 DF, p-value: < 2.2e-16
vif(model)#total main area and main area are highly correlated
## Total.Main.Area Main.Area Garage.Area Land.Area
## 43.146631 39.877350 1.552542 3.562078
The simple multiple regression model with all parameters does a decent job predicting the response variable with an adjusted R squared value of 0.9135 and an overall p value of 2.2e-16. However most of the parameters are not significant and the coefficent for Total.Main.Area was negative which would suggest that as the size of the home increases the value would go down. This is most like due to the multicollinearity between the different parameters.The Variance Inflation Factors for Total.Main.Area and Main.Area were very high at 43.15 and 39.88 respectively.
The second model regressed 2025.Market Value on Total.Main.Area, Garage.Area, and Land.Area. Main area was removed as it was highly correlated with Total Main Area and did not add much additional information.
model2 <- lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area, data=num_cad_df)
summary(model2)
##
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + Garage.Area + Land.Area,
## data = num_cad_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105504 -5155 8256 20245 90251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.750e+05 4.421e+04 -3.959 0.00034 ***
## Total.Main.Area 1.631e+02 2.858e+01 5.709 1.70e-06 ***
## Garage.Area 2.142e+01 3.824e+01 0.560 0.57892
## Land.Area 3.042e+01 6.397e+00 4.756 3.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42900 on 36 degrees of freedom
## Multiple R-squared: 0.9144, Adjusted R-squared: 0.9073
## F-statistic: 128.2 on 3 and 36 DF, p-value: < 2.2e-16
vif(model2)#all below 5 now
## Total.Main.Area Garage.Area Land.Area
## 3.734461 1.490579 3.549618
This model has similar performance with respect to adjusted R squared with a value of 0.9073 and an overall p value of 2.2e-16. While the R squared value dropped slightly it is still relatively high. Also, Total.Main.Area is not significant and all the coefficients are positive which is expected as you would expect property value to rise for larger structures and land. Also, the variance inflation factors are now all below 5.
A third model was evaluated which included an interaction term between Total.Main.Area and Land.Area.
model3 <-lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area+Total.Main.Area:Land.Area, data=num_cad_df)
summary(model3)
##
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + Garage.Area + Land.Area +
## Total.Main.Area:Land.Area, data = num_cad_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89202 -6565 6431 13400 63234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.572e+05 1.430e+05 3.895 0.000423 ***
## Total.Main.Area -3.611e+01 4.360e+01 -0.828 0.413184
## Garage.Area 5.938e+01 2.986e+01 1.989 0.054583 .
## Land.Area -5.502e+01 1.694e+01 -3.249 0.002562 **
## Total.Main.Area:Land.Area 2.160e-02 4.104e-03 5.265 7.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32500 on 35 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.9468
## F-statistic: 174.4 on 4 and 35 DF, p-value: < 2.2e-16
vif(model3)
## Total.Main.Area Garage.Area Land.Area
## 15.145897 1.582902 43.354912
## Total.Main.Area:Land.Area
## 85.124239
This third model did show an improvement in adjusted R squared, with a value of 0.9468 and a p value of 2.2e-16. However multiple coefficients are negative and Total.Main.Area is no longer significant. So, while the model does now account for a greater percentage of the variation from the mean, the model is difficult to logically interpret and the variance inflation factors are very high, with the interaction parameter having a VIF of 85.12. Multiple higher order regression models were evaluated (models 3b and 3c in the appendix at the end) but they suffered from similar issues with negative coefficients and high VIF values.
An alternative model was evaluated with only Total.Main.Area and Land Area. While this did result in much less collinearity between terms, the model had higher residuals as determined by a higher adjusted R squared value. In an attempt to improve the adequacy, an additional term was added to the model for Land.Area squared. The formula for this model is as follows: \[2025.Market.Value = \beta _{0} + \beta _{1}(Total.Main.Area)+ \beta _{2}(Land.Area) + \beta _{3}{(Land.Area)^2} +\varepsilon \]
model5b<-lm(Market.Value~Total.Main.Area+poly(Land.Area,2), data=num_cad_df)
summary(model5b)
##
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + poly(Land.Area,
## 2), data = num_cad_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102134 -4787 2973 23766 62292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103210.10 65852.61 1.567 0.125796
## Total.Main.Area 163.34 23.02 7.094 2.47e-08 ***
## poly(Land.Area, 2)1 399500.60 66773.39 5.983 7.32e-07 ***
## poly(Land.Area, 2)2 144833.26 35730.19 4.054 0.000258 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35700 on 36 degrees of freedom
## Multiple R-squared: 0.9407, Adjusted R-squared: 0.9358
## F-statistic: 190.4 on 3 and 36 DF, p-value: < 2.2e-16
vif(model5b)
## GVIF Df GVIF^(1/(2*Df))
## Total.Main.Area 3.50024 1 1.870893
## poly(Land.Area, 2) 3.50024 2 1.367806
This model accounted for a large percentage of the variation from the mean with an adjusted R squared of 0.9358 and was very significant with an overal p-value of 2.2e-16. Also, the variance inflation factors all remained below 5.
Residuals were analyzed to determine outliers. This was combined with a review of cook distances to determine which points should be removed as influencial points with negative impacts on the model.
The second model, which was determined to be one of the most promising based on it’s adequacy and logical coefficients, was evaluated for outliers.
plot(model2)#point 13 has leverage
sort(model2$residuals)#point 13 also has a high residual so it's an outlier
## 15 3 31 9 2 26
## -105503.650 -99877.125 -97732.368 -78918.137 -75179.721 -41588.616
## 38 21 11 34 14 7
## -23512.849 -21976.458 -11127.600 -6513.752 -4701.892 -1427.624
## 35 40 24 39 36 33
## 232.648 2003.270 3808.280 5843.948 5900.198 6544.670
## 27 10 16 29 32 20
## 6608.127 7471.683 9040.981 10287.344 11022.728 11881.128
## 18 28 1 19 41 5
## 14537.228 16975.096 18254.573 19057.822 19641.753 19983.898
## 37 4 25 22 30 8
## 21026.586 22033.110 25346.786 27441.622 31173.575 33721.472
## 17 23 6 13
## 37772.656 42210.237 47987.449 90250.925
Based on an analysis of the residuals and the cook’s distance for possible outliers, record 13 was identified as an outlier that could impact the overall model. This record was removed and a fourth model was developed.
num_cad_df_trimmed <- num_cad_df[-12,]
model4 <- lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area, data=num_cad_df_trimmed)
summary(model4)
##
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + Garage.Area + Land.Area,
## data = num_cad_df_trimmed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97118 -2286 7218 17595 80543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -85600.758 46231.780 -1.852 0.07254 .
## Total.Main.Area 157.677 24.978 6.313 3e-07 ***
## Garage.Area 41.719 33.861 1.232 0.22613
## Land.Area 19.755 6.355 3.109 0.00372 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37420 on 35 degrees of freedom
## Multiple R-squared: 0.8563, Adjusted R-squared: 0.844
## F-statistic: 69.54 on 3 and 35 DF, p-value: 8.03e-15
The fourth model resulted in a slightly lower adjusted R squared of 0.844 which means the model is explaining less of the variation from the mean and has higher residuals. The Garage.Area coefficient is no longer significant.
The higher order model also had a high adjusted R squared and logical coefficients. It was also evaluated for outliers.
plot(model5b)
sort(model5b$residuals)
## 31 9 15 3 26 38
## -102134.4339 -68044.6482 -66408.5751 -54081.8216 -48900.9927 -42114.9605
## 2 21 14 13 35 20
## -33161.3878 -31705.0494 -6499.1081 -6303.6872 -4281.7081 -3751.4809
## 34 24 10 16 36 39
## -3688.1325 -3346.5664 -3177.0378 -1484.2056 -1266.4658 818.9205
## 33 27 18 32 29 40
## 1007.1887 2384.4584 3561.3428 3854.1812 5754.1397 5930.0445
## 41 19 28 4 25 8
## 6371.6340 10041.8821 10304.1240 12112.9493 14647.1369 23615.9205
## 22 37 17 11 30 23
## 24214.6402 25067.9389 28464.4907 28742.4484 32286.3232 41113.2801
## 7 5 6 1
## 45362.7343 46062.8106 46339.9009 62291.7718
While point 13 was not determined to be an outlier based on residuals, it was a point with leverage that was influencial based on it’s cook distance. A new model was evaluated with this point removed.
model5c<-lm(Market.Value~Total.Main.Area+poly(Land.Area,2), data=num_cad_df_trimmed)
summary(model5c)
##
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + poly(Land.Area,
## 2), data = num_cad_df_trimmed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101487 -4899 2704 23012 69166
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96116.19 65427.15 1.469 0.150749
## Total.Main.Area 162.51 23.23 6.995 3.88e-08 ***
## poly(Land.Area, 2)1 195098.20 53620.09 3.639 0.000876 ***
## poly(Land.Area, 2)2 76596.54 36047.72 2.125 0.040734 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35970 on 35 degrees of freedom
## Multiple R-squared: 0.8672, Adjusted R-squared: 0.8559
## F-statistic: 76.21 on 3 and 35 DF, p-value: 2.031e-15
vif(model5c)
## GVIF Df GVIF^(1/(2*Df))
## Total.Main.Area 2.225859 1 1.491931
## poly(Land.Area, 2) 2.225859 2 1.221446
plot(model5c)
This additional model had a lower adjusted R squared. While this new model did still have significant coefficients, the Total.Main.Area now contributes very little to the response. It is expected that the overall value of the property would not be so dependent upon the area of the land with little contribution from the improvements on the property.
In conclusion the higher order model with no points of influence removed had the overall best performance with regards to adequacy and multicollinearity. This model, \(2025.Market.Value = \beta _{0} + \beta _{1}(Total.Main.Area)+ \beta _{2}(Land.Area) + \beta _{3}{(Land.Area)^2} +\varepsilon\) , had an adjusted R squared value of 0.9358, meaning over 93 percent of the variation from the mean of the response variable is explained by the model.
Using this model we can predict with an alpha of 0.05 the 2025 Market Value for the property at 6321 88th St Lubbock Texas. The table below displays the prediction interval for the expected value of the property in question. This is the interval that we are 95% confident that an observation with the input parameters(square footage values) of 6321 88th St would fall between.
| Parameter | Value |
|---|---|
| Expected Value | 534639 |
| Prediction Interval | 460426-608851 |
# predict value of 6321 88th St using model_5b
new_df <-data.frame("Total.Main.Area"=2773, "Main.Area"=2773, "Garage.Area"=592, "Land.Area"=7546)
pred_value <-predict(model5b, newdata=new_df, interval="prediction")
This prediction interval shows that the expected value of 534639 is very close to the assessed value of 538409 for 6321 88th St. While the assessed value is higher than the expected value, it is well within the prediction interval. This means that based on the values of the nearby properties on 88th street, we can say with a 95% confidence level that the value of 6321 88th should fall between 460426 and 608851 and is accurately assessed at it’s current value.
cad_df <- read.csv('https://raw.githubusercontent.com/jkupke9/IE5334/refs/heads/main/cleaned%20CAD%20data.csv')
cad_df[,2:11] <- lapply(cad_df[,2:11], as.numeric)
colnames(cad_df) <- c("Address","Market.Value", "Total.Imp.Val", "Total.Land.Val", "Homestead", "Total.Main.Area",
"Main.Area", "Main.Area.Val", "Garage.Area", "Garage.Value", "Land.Area")
cad_df <- cad_df[-c(12,42),]
hist(cad_df$Market.Value, main="Frequency of 2025 Market Value", xlab="2025 Market Value ($)")
boxplot(cad_df$Market.Value, main="Boxplot of 2025 Market Value",
ylab="2025 Market Value ($)")
num_cad_df <- cad_df[,-c(1, 3, 4, 5, 8, 10)]
cor(num_cad_df[,-1])
#install.packages("corrplot")
library(corrplot)
#install.packages("car")
library(car)
corrplot(cor(num_cad_df[,-1]),method="number", type="upper", main="Correlation of Apprasal Data Parameters", mar=c(0,0,2,0))
model <- lm(Market.Value~.,data=num_cad_df)
summary(model)
vif(model)#total main area and main area are highly correlated
#main area
model2 <- lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area, data=num_cad_df)
summary(model2)
vif(model2)#all below 5 now
plot(model2)#point 13 has leverage
sort(model2$residuals)#point 13 also has a high residual so it's an outlier
num_cad_df_trimmed <- num_cad_df[-12,]
model4 <- lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area, data=num_cad_df_trimmed)
summary(model4)
vif(model4)
model3 <-lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area+Total.Main.Area:Land.Area, data=num_cad_df)
summary(model3)
vif(model3)
model5<-lm(Market.Value~Total.Main.Area+Land.Area, data=num_cad_df)
summary(model5)
vif(model5)
plot(model5)
model5b<-lm(Market.Value~Total.Main.Area+poly(Land.Area,2), data=num_cad_df)
summary(model5b)
vif(model5b)
plot(model5b)
sort(model5b$residuals)
model5c<-lm(Market.Value~Total.Main.Area+poly(Land.Area,2), data=num_cad_df_trimmed)
summary(model5c)
vif(model5c)
plot(model5c)
# predict value of 6321 88th St using model_5b
new_df <-data.frame("Total.Main.Area"=2773, "Main.Area"=2773, "Garage.Area"=592, "Land.Area"=7546)
pred_value <-predict(model5b, newdata=new_df, interval="prediction")
#additional alternates
model6 <- lm(Market.Value~Total.Main.Area+Land.Area, data=num_cad_df_trimmed)
summary(model6)
plot(model6)
num_cad_df_trimmed2 <- num_cad_df_trimmed[-5,]
model7 <- lm(Market.Value~Total.Main.Area+Land.Area, data=num_cad_df_trimmed2)
summary(model7)
vif(model7)
plot(model7)