1 Background

Property taxes are a form of municipal revenue. Property taxes are based upon an assessed property value. Property values can be dependent upon the individual tax assessor who ascertained the value. Because of this, homeowners have the ability to appeal their assessment.

1.1 Objective

To use linear regression to predict the assessed value of the property at 6321 88th St Lubbock Tx using publicly available parameters of nearby properties.

1.2 Data Collection and Cleaning

Data was collected from lubbockcad.org for the properties 6309 - 6351 88th Street, Lubbock, Texas. The variables collected are summarized in the table below:

Variables	Explanation
2025 Market Value	Total value of property appraisal (house+land)
Total Improvement Market Value	Total value of house appraisal
Total Land Market Value	Total value of land appraisal
Homestead Cap Loss	Represents a discount only in the current tax year if the appraised value from the previous year went up by more than 10%
Total Main Area (Sq. Ft.)	Total square footage of house
Main Area (Sq. Ft.)	Total square footage of heated house area
Main Area (Value)	Total value of heated house area
Garage (Sq. Ft.)	Total square footage of non-heated house area
Garage (Value)	Total value of non-heated house area
Land (Sq. Ft.)	Total square footage of land

The data was cleaned by removing entries which did not have complete data. Also entries which had multiple garage values were combined into a single garage value.

2 Methods

2.1 Exploratory Data Analysis

Histograms and Box Plots were used to examine the distribution of assessed property value across the sample.

Pairwise correlations were examined between all possible predictor variables using a correlation matrix.

2.2 Regression Modeling

Simple linear regression modeling using least squares estimation was used to regress 2025 Market Value on the other variables list in the Data Collection Section.

\[ y = \beta _{0} + \beta _{1}x_{1}+ \beta _{2}x_{2} + \beta _{3}x_{3} + \beta _{4}x_{4}+\varepsilon \] Model adequacy was evaluated by analyzing the variance with respect to the independent variable and the normality of the residuals. The models were also evaluated by the proportion of variance explained by the independent variable through the use of R-squared.

\[ R^{2}=1-\frac{SSR}{SST} \]

Multicollinearity was also evaluated by analyzing the variance inflation factor for the terms in the model.

\[ VIF_{i}=\frac{1}{1-R_{i}^{2}} \]

2.3 Outliers and Leverage

The model which was evaluated as the strongest model based on adequacy and multicollinearity was then analyzed for outliers. This was evaluated by observing residuals. Also Cook’s distance was calculated for each residual using the following formula:

\[ D_{i}=\frac{\left ( \beta _{(i)}-\beta \right )^{T}X^{T}X\left (\hat{\beta_{(i)}-\hat{\beta} } \right )}{p*MSE} \]

2.4 Prediction

A 95% confidence interval on the mean was calculated with the following formula.

\[(\hat{\beta}_{0}+\hat{\beta}_{1}x_{j})\pm t_{1-\alpha /2,n-2}\left ( \sqrt{MSE\left [ \frac{1}{n}+\frac{(x_{j}-\bar{x})^{2}}{\sum (x_{j}-\bar{x})^{2}} \right]} \right )\]

A prediction interval was calculated to account for variation due to point spread form the mean by the following formula.

\[(\hat{\beta}_{0}+\hat{\beta}_{1}x_{j})\pm t_{1-\alpha /2,n-2}\left ( \sqrt{MSE\left [1+ \frac{1}{n}+\frac{(x_{j}-\bar{x})^{2}}{\sum (x_{j}-\bar{x})^{2}} \right]} \right )\]

3 Exploratory Data Analysis

The data that was origianlly extracte from https://lubbockcad.org/ was cleaned and uploaded to github. The cleaned data set was read from https://raw.githubusercontent.com/jkupke9/IE5334/refs/heads/main/cleaned%20CAD%20data.csv

cad_df <- read.csv('https://raw.githubusercontent.com/jkupke9/IE5334/refs/heads/main/cleaned%20CAD%20data.csv')
cad_df[,2:11] <- lapply(cad_df[,2:11], as.numeric)
colnames(cad_df) <- c("Address","Market.Value", "Total.Imp.Val", "Total.Land.Val", "Homestead", "Total.Main.Area", 
                      "Main.Area", "Main.Area.Val", "Garage.Area", "Garage.Value", "Land.Area")
cad_df <- cad_df[-c(12,42),]#the record for the property of interest and an extra row of NA values are removed

A histogram and box plot of the 2025 Market Value data set reveals a right skewed distribution. This means the data is not normally distributed. There are also several points that might be outliers based on where they are positioned relative to the rest of the distribution.

hist(cad_df$Market.Value, main="Frequency of 2025 Market Value", xlab="2025 Market Value ($)")

boxplot(cad_df$Market.Value, main="Boxplot of 2025 Market Value",
        ylab="2025 Market Value ($)")

All the value parameters were removed as they are outputs of the assessment process and not suitable as predictors for our modeling purposes.A Correlation plot of all the remaining variables shows a very high correlation(nearly 1 to 1) between Total Main Area and Main Area. This is because for most of the records, these values are the same as the house area and heated area are the same. There was also a moderately high correlation between both Main Area parameters and Land Area. This is reasonable as larger homes normally have larger plots of land. There is also a moderate amount of correlation between both Main Area parameters and Garage Area. This is also reasonable as larger homes tend to have larger garages.

num_cad_df <- cad_df[,-c(1, 3, 4, 5, 8, 10)]
cor(num_cad_df[,-1])

##                 Total.Main.Area Main.Area Garage.Area Land.Area
## Total.Main.Area       1.0000000 0.9868392   0.5643433 0.8448745
## Main.Area             0.9868392 1.0000000   0.5310268 0.8366394
## Garage.Area           0.5643433 0.5310268   1.0000000 0.5319719
## Land.Area             0.8448745 0.8366394   0.5319719 1.0000000

#install.packages("corrplot")
library(corrplot)
#install.packages("car")
library(car)
corrplot(cor(num_cad_df[,-1]),method="number", type="upper", main="Correlation of Apprasal Data Parameters", mar=c(0,0,2,0))

4 Regression

Multiple regression models were evaluated to determine the best relationship to use to predict the response, 2025 Market Value, with a variety of square footage parameters as predictors.

4.1 Multiple Regression Linear Model using all parameters

The first model simply regressed 2025 Market Value on all square footage parameters using multiple linear regression. This can be represented by the formula $2025.Market.Value = *{0} +* (Total.Main.Area)+ *{2}(Main.Area) +* (Garage.Area) + _{4}(Land.Area) +$

model <- lm(Market.Value~.,data=num_cad_df)
summary(model)

## 
## Call:
## lm(formula = Market.Value ~ ., data = num_cad_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109329   -8735    8804   19978   78042 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.923e+05  4.368e+04  -4.403 9.59e-05 ***
## Total.Main.Area -6.464e+00  9.384e+01  -0.069   0.9455    
## Main.Area        1.754e+02  9.273e+01   1.891   0.0669 .  
## Garage.Area      3.566e+01  3.771e+01   0.946   0.3507    
## Land.Area        2.973e+01  6.190e+00   4.803 2.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41440 on 35 degrees of freedom
## Multiple R-squared:  0.9223, Adjusted R-squared:  0.9135 
## F-statistic: 103.9 on 4 and 35 DF,  p-value: < 2.2e-16

vif(model)#total main area and main area are highly correlated

## Total.Main.Area       Main.Area     Garage.Area       Land.Area 
##       43.146631       39.877350        1.552542        3.562078

The simple multiple regression model with all parameters does a decent job predicting the response variable with an adjusted R squared value of 0.9135 and an overall p value of 2.2e-16. However most of the parameters are not significant and the coefficent for Total.Main.Area was negative which would suggest that as the size of the home increases the value would go down. This is most like due to the multicollinearity between the different parameters.The Variance Inflation Factors for Total.Main.Area and Main.Area were very high at 43.15 and 39.88 respectively.

4.2 Mulitiple Regression Model with Main.Area removed

The second model regressed 2025.Market Value on Total.Main.Area, Garage.Area, and Land.Area. Main area was removed as it was highly correlated with Total Main Area and did not add much additional information.

model2 <- lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area, data=num_cad_df)
summary(model2)

## 
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + Garage.Area + Land.Area, 
##     data = num_cad_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105504   -5155    8256   20245   90251 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.750e+05  4.421e+04  -3.959  0.00034 ***
## Total.Main.Area  1.631e+02  2.858e+01   5.709 1.70e-06 ***
## Garage.Area      2.142e+01  3.824e+01   0.560  0.57892    
## Land.Area        3.042e+01  6.397e+00   4.756 3.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42900 on 36 degrees of freedom
## Multiple R-squared:  0.9144, Adjusted R-squared:  0.9073 
## F-statistic: 128.2 on 3 and 36 DF,  p-value: < 2.2e-16

vif(model2)#all below 5 now

## Total.Main.Area     Garage.Area       Land.Area 
##        3.734461        1.490579        3.549618

This model has similar performance with respect to adjusted R squared with a value of 0.9073 and an overall p value of 2.2e-16. While the R squared value dropped slightly it is still relatively high. Also, Total.Main.Area is not significant and all the coefficients are positive which is expected as you would expect property value to rise for larger structures and land. Also, the variance inflation factors are now all below 5.

4.3 Model using Interaction between Total Main Area and Land Area

A third model was evaluated which included an interaction term between Total.Main.Area and Land.Area.

model3 <-lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area+Total.Main.Area:Land.Area, data=num_cad_df)
summary(model3)

## 
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + Garage.Area + Land.Area + 
##     Total.Main.Area:Land.Area, data = num_cad_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -89202  -6565   6431  13400  63234 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                5.572e+05  1.430e+05   3.895 0.000423 ***
## Total.Main.Area           -3.611e+01  4.360e+01  -0.828 0.413184    
## Garage.Area                5.938e+01  2.986e+01   1.989 0.054583 .  
## Land.Area                 -5.502e+01  1.694e+01  -3.249 0.002562 ** 
## Total.Main.Area:Land.Area  2.160e-02  4.104e-03   5.265  7.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32500 on 35 degrees of freedom
## Multiple R-squared:  0.9522, Adjusted R-squared:  0.9468 
## F-statistic: 174.4 on 4 and 35 DF,  p-value: < 2.2e-16

vif(model3)

##           Total.Main.Area               Garage.Area                 Land.Area 
##                 15.145897                  1.582902                 43.354912 
## Total.Main.Area:Land.Area 
##                 85.124239

This third model did show an improvement in adjusted R squared, with a value of 0.9468 and a p value of 2.2e-16. However multiple coefficients are negative and Total.Main.Area is no longer significant. So, while the model does now account for a greater percentage of the variation from the mean, the model is difficult to logically interpret and the variance inflation factors are very high, with the interaction parameter having a VIF of 85.12. Multiple higher order regression models were evaluated (models 3b and 3c in the appendix at the end) but they suffered from similar issues with negative coefficients and high VIF values.

4.4 Multiple Regression with higher order terms

An alternative model was evaluated with only Total.Main.Area and Land Area. While this did result in much less collinearity between terms, the model had higher residuals as determined by a higher adjusted R squared value. In an attempt to improve the adequacy, an additional term was added to the model for Land.Area squared. The formula for this model is as follows: \[2025.Market.Value = \beta _{0} + \beta _{1}(Total.Main.Area)+ \beta _{2}(Land.Area) + \beta _{3}{(Land.Area)^2} +\varepsilon \]

model5b<-lm(Market.Value~Total.Main.Area+poly(Land.Area,2), data=num_cad_df)
summary(model5b)

## 
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + poly(Land.Area, 
##     2), data = num_cad_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -102134   -4787    2973   23766   62292 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         103210.10   65852.61   1.567 0.125796    
## Total.Main.Area        163.34      23.02   7.094 2.47e-08 ***
## poly(Land.Area, 2)1 399500.60   66773.39   5.983 7.32e-07 ***
## poly(Land.Area, 2)2 144833.26   35730.19   4.054 0.000258 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35700 on 36 degrees of freedom
## Multiple R-squared:  0.9407, Adjusted R-squared:  0.9358 
## F-statistic: 190.4 on 3 and 36 DF,  p-value: < 2.2e-16

vif(model5b)

##                       GVIF Df GVIF^(1/(2*Df))
## Total.Main.Area    3.50024  1        1.870893
## poly(Land.Area, 2) 3.50024  2        1.367806

This model accounted for a large percentage of the variation from the mean with an adjusted R squared of 0.9358 and was very significant with an overal p-value of 2.2e-16. Also, the variance inflation factors all remained below 5.

5 Outliers

Residuals were analyzed to determine outliers. This was combined with a review of cook distances to determine which points should be removed as influencial points with negative impacts on the model.

5.1 Outliers removed from Model 2

The second model, which was determined to be one of the most promising based on it’s adequacy and logical coefficients, was evaluated for outliers.

plot(model2)#point 13 has leverage

sort(model2$residuals)#point 13 also has a high residual so it's an outlier

##          15           3          31           9           2          26 
## -105503.650  -99877.125  -97732.368  -78918.137  -75179.721  -41588.616 
##          38          21          11          34          14           7 
##  -23512.849  -21976.458  -11127.600   -6513.752   -4701.892   -1427.624 
##          35          40          24          39          36          33 
##     232.648    2003.270    3808.280    5843.948    5900.198    6544.670 
##          27          10          16          29          32          20 
##    6608.127    7471.683    9040.981   10287.344   11022.728   11881.128 
##          18          28           1          19          41           5 
##   14537.228   16975.096   18254.573   19057.822   19641.753   19983.898 
##          37           4          25          22          30           8 
##   21026.586   22033.110   25346.786   27441.622   31173.575   33721.472 
##          17          23           6          13 
##   37772.656   42210.237   47987.449   90250.925

Based on an analysis of the residuals and the cook’s distance for possible outliers, record 13 was identified as an outlier that could impact the overall model. This record was removed and a fourth model was developed.

num_cad_df_trimmed <- num_cad_df[-12,]
model4 <- lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area, data=num_cad_df_trimmed)
summary(model4)

## 
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + Garage.Area + Land.Area, 
##     data = num_cad_df_trimmed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -97118  -2286   7218  17595  80543 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -85600.758  46231.780  -1.852  0.07254 .  
## Total.Main.Area    157.677     24.978   6.313    3e-07 ***
## Garage.Area         41.719     33.861   1.232  0.22613    
## Land.Area           19.755      6.355   3.109  0.00372 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37420 on 35 degrees of freedom
## Multiple R-squared:  0.8563, Adjusted R-squared:  0.844 
## F-statistic: 69.54 on 3 and 35 DF,  p-value: 8.03e-15

The fourth model resulted in a slightly lower adjusted R squared of 0.844 which means the model is explaining less of the variation from the mean and has higher residuals. The Garage.Area coefficient is no longer significant.

5.2 Outliers removed from Higher Order Model

The higher order model also had a high adjusted R squared and logical coefficients. It was also evaluated for outliers.

plot(model5b)

sort(model5b$residuals)

##           31            9           15            3           26           38 
## -102134.4339  -68044.6482  -66408.5751  -54081.8216  -48900.9927  -42114.9605 
##            2           21           14           13           35           20 
##  -33161.3878  -31705.0494   -6499.1081   -6303.6872   -4281.7081   -3751.4809 
##           34           24           10           16           36           39 
##   -3688.1325   -3346.5664   -3177.0378   -1484.2056   -1266.4658     818.9205 
##           33           27           18           32           29           40 
##    1007.1887    2384.4584    3561.3428    3854.1812    5754.1397    5930.0445 
##           41           19           28            4           25            8 
##    6371.6340   10041.8821   10304.1240   12112.9493   14647.1369   23615.9205 
##           22           37           17           11           30           23 
##   24214.6402   25067.9389   28464.4907   28742.4484   32286.3232   41113.2801 
##            7            5            6            1 
##   45362.7343   46062.8106   46339.9009   62291.7718

While point 13 was not determined to be an outlier based on residuals, it was a point with leverage that was influencial based on it’s cook distance. A new model was evaluated with this point removed.

model5c<-lm(Market.Value~Total.Main.Area+poly(Land.Area,2), data=num_cad_df_trimmed)
summary(model5c)

## 
## Call:
## lm(formula = Market.Value ~ Total.Main.Area + poly(Land.Area, 
##     2), data = num_cad_df_trimmed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101487   -4899    2704   23012   69166 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          96116.19   65427.15   1.469 0.150749    
## Total.Main.Area        162.51      23.23   6.995 3.88e-08 ***
## poly(Land.Area, 2)1 195098.20   53620.09   3.639 0.000876 ***
## poly(Land.Area, 2)2  76596.54   36047.72   2.125 0.040734 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35970 on 35 degrees of freedom
## Multiple R-squared:  0.8672, Adjusted R-squared:  0.8559 
## F-statistic: 76.21 on 3 and 35 DF,  p-value: 2.031e-15

vif(model5c)

##                        GVIF Df GVIF^(1/(2*Df))
## Total.Main.Area    2.225859  1        1.491931
## poly(Land.Area, 2) 2.225859  2        1.221446

plot(model5c)

This additional model had a lower adjusted R squared. While this new model did still have significant coefficients, the Total.Main.Area now contributes very little to the response. It is expected that the overall value of the property would not be so dependent upon the area of the land with little contribution from the improvements on the property.

6 Results and Conclusion

In conclusion the higher order model with no points of influence removed had the overall best performance with regards to adequacy and multicollinearity. This model, $2025.Market.Value = \beta _{0} + \beta _{1}(Total.Main.Area)+ \beta _{2}(Land.Area) + \beta _{3}{(Land.Area)^2} +\varepsilon$ , had an adjusted R squared value of 0.9358, meaning over 93 percent of the variation from the mean of the response variable is explained by the model.

Using this model we can predict with an alpha of 0.05 the 2025 Market Value for the property at 6321 88th St Lubbock Texas. The table below displays the prediction interval for the expected value of the property in question. This is the interval that we are 95% confident that an observation with the input parameters(square footage values) of 6321 88th St would fall between.

Parameter	Value
Expected Value	534639
Prediction Interval	460426-608851

# predict value of 6321 88th St using model_5b
new_df <-data.frame("Total.Main.Area"=2773, "Main.Area"=2773, "Garage.Area"=592, "Land.Area"=7546)
pred_value <-predict(model5b, newdata=new_df, interval="prediction")

This prediction interval shows that the expected value of 534639 is very close to the assessed value of 538409 for 6321 88th St. While the assessed value is higher than the expected value, it is well within the prediction interval. This means that based on the values of the nearby properties on 88th street, we can say with a 95% confidence level that the value of 6321 88th should fall between 460426 and 608851 and is accurately assessed at it’s current value.

7 Complete R Code

cad_df <- read.csv('https://raw.githubusercontent.com/jkupke9/IE5334/refs/heads/main/cleaned%20CAD%20data.csv')
cad_df[,2:11] <- lapply(cad_df[,2:11], as.numeric)
colnames(cad_df) <- c("Address","Market.Value", "Total.Imp.Val", "Total.Land.Val", "Homestead", "Total.Main.Area", 
                      "Main.Area", "Main.Area.Val", "Garage.Area", "Garage.Value", "Land.Area")
cad_df <- cad_df[-c(12,42),]
hist(cad_df$Market.Value, main="Frequency of 2025 Market Value", xlab="2025 Market Value ($)")
boxplot(cad_df$Market.Value, main="Boxplot of 2025 Market Value",
        ylab="2025 Market Value ($)")

num_cad_df <- cad_df[,-c(1, 3, 4, 5, 8, 10)]
cor(num_cad_df[,-1])
#install.packages("corrplot")
library(corrplot)
#install.packages("car")
library(car)
corrplot(cor(num_cad_df[,-1]),method="number", type="upper", main="Correlation of Apprasal Data Parameters", mar=c(0,0,2,0))
model <- lm(Market.Value~.,data=num_cad_df)
summary(model)
vif(model)#total main area and main area are highly correlated
#main area
model2 <- lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area, data=num_cad_df)
summary(model2)
vif(model2)#all below 5 now
plot(model2)#point 13 has leverage
sort(model2$residuals)#point 13 also has a high residual so it's an outlier
num_cad_df_trimmed <- num_cad_df[-12,]
model4 <- lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area, data=num_cad_df_trimmed)
summary(model4)
vif(model4)
model3 <-lm(Market.Value~Total.Main.Area+Garage.Area+Land.Area+Total.Main.Area:Land.Area, data=num_cad_df)
summary(model3)
vif(model3)
model5<-lm(Market.Value~Total.Main.Area+Land.Area, data=num_cad_df)
summary(model5)
vif(model5)
plot(model5)
model5b<-lm(Market.Value~Total.Main.Area+poly(Land.Area,2), data=num_cad_df)
summary(model5b)
vif(model5b)
plot(model5b)
sort(model5b$residuals)
model5c<-lm(Market.Value~Total.Main.Area+poly(Land.Area,2), data=num_cad_df_trimmed)
summary(model5c)
vif(model5c)
plot(model5c)



# predict value of 6321 88th St using model_5b
new_df <-data.frame("Total.Main.Area"=2773, "Main.Area"=2773, "Garage.Area"=592, "Land.Area"=7546)
pred_value <-predict(model5b, newdata=new_df, interval="prediction")

#additional alternates
model6 <- lm(Market.Value~Total.Main.Area+Land.Area, data=num_cad_df_trimmed)
summary(model6)
plot(model6)
num_cad_df_trimmed2 <- num_cad_df_trimmed[-5,]
model7 <- lm(Market.Value~Total.Main.Area+Land.Area, data=num_cad_df_trimmed2)
summary(model7)
vif(model7)
plot(model7)

IE 5344 Project 2

Joshua Kupke

Last compiled on April 16, 2025 at 8:49 PM - CDT