The objective of this project is to perform a statisctical analysis using multiple linear regression in order to validate whether the home at 6321 88th in Lubbock, TX valued at $ 538409 is over assessed or not. This study is a useful one. Property taxes or related homestead tax exemptions are calculated using the market values of homes. If this baseline is over or under evaluated, homeowners may end end up unfairly, to them or to society, overpay or underpay property taxes. Unfortunately, this process of baselining the market value of the properties in Texas does not follow a scientific process. This project is a worthwhile academic to provide a statistical inference to evaluate the market value of a a given home in Lubbock, Texas.
I used the following Lubbock Central Appraisal District website to collect the data needed for the project:
Several data types are available. After thorough review and analysis, the following variables were selected for collection:
MV = Market Value in dollars
TMA = Total Main Area in square feet
M = Main Area in square feet
G = Size of the Garage in square feet
L = Total Size of the Land in square feet
Here’s the sample data variables of 31 neighboring homes selected from the website:
First, using R, I imported the data by reading the csv file which represent the 5 variables for the 31 homes.
#download data into R
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
df<-read.csv("https://raw.githubusercontent.com/Acissokh/Project-2-Property-Assessment/refs/heads/main/Project%202%20Data_v1.csv")
head(df)
## MV TMA MA G L
## 1 735026 3462 3192 1063 10000
## 2 663907 3226 3226 1078 10463
## 3 569992 3036 3036 965 10000
## 4 602427 3277 2877 909 10625
## 5 460288 2241 2241 506 7785
## 6 968766 4188 2041 985 13788
The first 6 rows of data returned a dataset of valid content for all the 5 variables under study, confirming our data pull worked correctly. I also reviewed the structure of the entire dataset to see the variable type for each column. Below is the executable R cell.
#Structure of the data
str(df)
## 'data.frame': 31 obs. of 5 variables:
## $ MV : int 735026 663907 569992 602427 460288 968766 550119 709604 580969 480134 ...
## $ TMA: int 3462 3226 3036 3277 2241 4188 2582 3147 2910 2842 ...
## $ MA : int 3192 3226 3036 2877 2241 2041 2582 3147 2910 2842 ...
## $ G : int 1063 1078 965 909 506 985 942 875 550 970 ...
## $ L : int 10000 10463 10000 10625 7785 13788 7749 11633 7749 8206 ...
This returned integer variable types. I needed to turn these types to continuous in order to stay conformant with the project’s terms of reference. The following executable R command was written.
df$MV<-as.numeric(df$MV)
df$TMA<-as.numeric(df$TMA)
df$MA<-as.numeric(df$MA)
df$G<-as.numeric(df$G)
df$L<-as.numeric(df$L)
str(df)
## 'data.frame': 31 obs. of 5 variables:
## $ MV : num 735026 663907 569992 602427 460288 ...
## $ TMA: num 3462 3226 3036 3277 2241 ...
## $ MA : num 3192 3226 3036 2877 2241 ...
## $ G : num 1063 1078 965 909 506 ...
## $ L : num 10000 10463 10000 10625 7785 ...
The dataset is now continuous
The initial multiple linear regression of the response variable Market Values (MV) of homes in the Lubbock neighborhood under study over the potential predictor variables Total Main Area (TMA), Main Area (MA), Size of the Garage (G), and Total Size of the Land (L) was modeled using the following executable R command:
#Multiple Regression Model First Order
model<-lm(MV~TMA+MA+G+L,data=df)
summary(model) #Statistics
##
## Call:
## lm(formula = MV ~ TMA + MA + G + L, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97563 -5726 6784 20229 91775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.175e+05 9.071e+04 -1.296 0.206501
## TMA 1.724e+02 3.400e+01 5.069 2.8e-05 ***
## MA -2.768e+01 3.103e+01 -0.892 0.380659
## G 1.772e+01 4.652e+01 0.381 0.706411
## L 2.976e+01 7.713e+00 3.858 0.000677 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45110 on 26 degrees of freedom
## Multiple R-squared: 0.926, Adjusted R-squared: 0.9146
## F-statistic: 81.36 on 4 and 26 DF, p-value: 2.592e-14
plot(model) #Model Adequacy
The ANOVA table rendered a Multiple R-squared value of 0.926 and an Adjusted one of 0.9146 suggesting an overall strong correlation between the response and the predictor variables. A p-value of 2.592e-14 also suggests that overall the model is strong. Looking at the individual values of each predictor’s p, only Total Main Area (TMA) and Total Size of the Land (L) show significance. But one does know from practical experience that the other two predictor variables, Main Area (MA) and Size of the Garage (G) showing low significance p-values, do influence the market value of homes. So we do suspect intercorrellations between these latter two predictor variables of low significance p-values with the prime two predictor variables with high p-values and therefore I plan to check for collinearity.
Before this however, I must review model adequacy to ensure that any inference derived from this model withstands statistical logic.
There are four assumptions in general which we need to verify to test the adequacy of any regression model. All of these tests involve the use of graphic displays of residuals which are defined as the differences between the observed values and the corresponding fitted values, estimated from the model. They are also known as error terms of the model and denoted as \(\varepsilon_{i}\).
Constant Variance (Residuals vs Fitted Plot)
Normal Probability Plot
Scale-Location (looking for outliers)
At least 2 points are outliers. They seem to have leverage and appear to be points of influence.
Residuals vs Leverage (looking for Points of Influence)
No Cook’s distance \(D_{i} > 1\)
So we can conclude that’ there’s no major points of influence.
Checking for independence between the predictors. We need to run the following R command in order to obtain the correlation matrix.
# Correlation matrix
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
data <- df[-1]
corrplot(cor(data %>% select_if(is.numeric)), method = "circle")
In order to check for collinearity between the four different predictor variables, I wrote the following executable R command:
vif(model)
## TMA MA G L
## 4.287751 1.317579 1.731241 4.471134
Before this, the following library must be loaded:
library(car)
This allows us to obtain the Variance Inflation Factor of each predictor variable with the rest of the others, thus verifying how intercorrellated they are with one another.
Unsurprisingly, Total Main Area (TMA) and Total Size of the Land (L), with VIFs of 4.29 and 4.47 seem to have highly correlated with the other two predictor variables of Main Area (MA) and Size of the Garage (G).
Using the results obtained from our initial regression model fitting with its related model adequacy checking and multicollinearity review, I decided to make the following changes:
In my data pull of 31 homes, there are 5 cases where the Total Main Area (TMA) is greater than the Main Area (MA). In all the other cases, TMA equals MA. In the 3 cases out of the 5 where they are unequal, there’s a second unit built, very likely a casita, with sizes varying from about 1000 sq.ft to a little over 2000 sq.ft. These 3 cases seem to be contributing to outliers and applying leverage and influence in the model (see Model Adequacy Review). Dropping them from the model might actually increase the accuracy of the model prediction due to the fact that the market value of the home at 6321 88th street we are assessing does not have a casita and the Total Main Area (TMA) there is equal to the Main Area (MA). Therefore we will keep Main Area (MA) and drop Total Main Area (TMA) from the predictor variables set.
The Land (L) area also contains the Main Area (MA) and Garage (L). Therefore including it in the model will continue to cause multicollinearity. I shift my hypothesis that the best choice for independent predictors of the model are Main Area (MA) and Garage (L)
The suggestions above yielded the ANOVA table from executing the following R command:
#Remove 3 data points with the casitas
df1 <- df %>% slice(-6,-13,-22)
#Multiple Regression Model First Order
model<-lm(MV~MA+G,data=df1)
summary(model) #Statistics
##
## Call:
## lm(formula = MV ~ MA + G, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103812 -9964 2672 8967 69499
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -90001.50 73196.21 -1.230 0.230
## MA 218.85 30.60 7.152 1.7e-07 ***
## G 53.58 34.71 1.543 0.135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33180 on 25 degrees of freedom
## Multiple R-squared: 0.8028, Adjusted R-squared: 0.7871
## F-statistic: 50.9 on 2 and 25 DF, p-value: 1.532e-09
Although the residual plots improved a little bit, both the R-squared value and Adjusted R-squared values of 0.926 and 0.9146 went down to 0.7969 and 0.7784 respectively. The model p-value also went up from 2.592e-14 to 2.426e-08. This suggests that the changes I made did not improve the model.
The following several combinations of removing the 3 data points with the casitas , using only 2 regressor variables at a time yielded the results from the following executable R cells:
model<-lm(MV~L+G,data=df1)
summary(model) #Statistics
##
## Call:
## lm(formula = MV ~ L + G, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89082 -25211 1475 18103 101141
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.177e+05 5.369e+04 4.056 0.000429 ***
## L 3.075e+01 7.276e+00 4.226 0.000277 ***
## G 1.022e+02 4.395e+01 2.326 0.028444 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44230 on 25 degrees of freedom
## Multiple R-squared: 0.6497, Adjusted R-squared: 0.6217
## F-statistic: 23.18 on 2 and 25 DF, p-value: 2.02e-06
model<-lm(MV~TMA+G,data=df1)
summary(model) #Statistics
##
## Call:
## lm(formula = MV ~ TMA + G, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94379 -5714 4933 12459 78014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33710.22 62513.62 -0.539 0.594
## TMA 199.96 26.56 7.528 6.99e-08 ***
## G 41.17 34.32 1.200 0.241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32040 on 25 degrees of freedom
## Multiple R-squared: 0.8162, Adjusted R-squared: 0.8015
## F-statistic: 55.5 on 2 and 25 DF, p-value: 6.385e-10
model<-lm(MV~TMA+L,data=df1)
summary(model) #Statistics
##
## Call:
## lm(formula = MV ~ TMA + L, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80225 -10907 3235 15375 63140
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -46422.963 59929.046 -0.775 0.446
## TMA 189.709 33.149 5.723 5.82e-06 ***
## L 8.241 7.090 1.162 0.256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32090 on 25 degrees of freedom
## Multiple R-squared: 0.8156, Adjusted R-squared: 0.8008
## F-statistic: 55.27 on 2 and 25 DF, p-value: 6.657e-10
None of these models yielded an R-squared or Adjusted R-squared value better than the original ones of 0.926 and 0.9146. It is worth noticing that the best p-value obtained is when TMA is included in the model.
This exercise also revealed two additional things:
Garage (G) size does not seem to be significant in any of these models
Main Area (MA), although significant, seems to take a second place behind Total Main Area (TMA)
Dropping the homes with the casitas, although there weren’t many, did not help improve the model
Hence I decided to drop the 2 variables of Garage (G) and Main Area (MA) from the model. Interesting enough, with Total Main Area (TMA) and Land (L) remaining in the model with the homes with casitas included, this yielded the following result from the executable R cell below:
model<-lm(MV~TMA+L,data=df)
summary(model) #Statistics
##
## Call:
## lm(formula = MV ~ TMA + L, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -101603 -6737 6369 21872 89999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -183699.44 50672.81 -3.625 0.001137 **
## TMA 177.28 32.55 5.446 8.2e-06 ***
## L 28.54 7.23 3.947 0.000484 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44140 on 28 degrees of freedom
## Multiple R-squared: 0.9237, Adjusted R-squared: 0.9183
## F-statistic: 169.6 on 2 and 28 DF, p-value: 2.248e-16
So far in my model search, this model has yielded the best combinations of Adjusted R-Value of 0.9183 and a p-value of 2.248e-16, with both Total Main Area (TMA) and Land (L) showing significant p-values.
None of the 4 plots of Constant Variance, Normal Probability Plot, Scale-Location for Outliers and Residuals versus Leverage for influencers seems to display any violating abnormalities. See below:
Correlation Matrix and Variance Inflation Factor (VIF) between the predictor variables of Total Main Area (TMA) and Land (L) render the following results.
# Correlation matrix
library(tidyverse)
library(corrplot)
data <- df[-1]
corrplot(cor(data %>% select_if(is.numeric)), method = "circle")
VIF = 4.105.
Although one can suspect some collinerarity between these 2 regressor variables, VIF is still below 5. We can therefore move forward in etablishing a model to assess market values of homes in the 6321 88th street Lubbock area.
I used the executable R script cell below in order to find my model coefficients:
model$coefficients #Least Squares Estimates
## (Intercept) TMA L
## -183699.4396 177.2766 28.5384
\(B_{o} = -183699\)
\(B_{1} = 177.28\)
\(B_{2} = 28.54\)
Therefore, our Multiple Linear Regression model using least squares estimation is :
\[ MV = -183699 + 177.28 TMA + 28.54 L + \varepsilon \]
Is the 2025 Market Value of the home at 6321 88th street currently assessed at $538409 greater or less than it should be?
In order to answer this final question, I used the prediction and interval functionalities of R as follows:
# Create the data frame for prediction
Home_Value_6321_88th_street <- data.frame(TMA=2773,L=7546)
# Assess the market value of the home at 6321 88th Street using prediction interval at 95% confidence level
predict(model, newdata = Home_Value_6321_88th_street, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 523239.3 430333.8 616144.8
fit lwr upr
523239.3 430333.8 616144.8
This suggests that I should be 95% confident that the market value of the home at 6321 88th Street falls within a range of $430334 and $616145 and with the best fit at $523240 according to my model. Currently it is assessed at $538409, which falls within the range of my model’s prediction.
Using the step by step process of Multiple Linear Regression, I was able to extract data from a public website, hypothesize a model for predicting market values for homes, make assumptions, test out the assumptions, and progressively came up with a model that I think is a very practical one for assessing market values. I was a little surprised in the development of my model that I could predict the market value of the homes in the neighborhood under study without using the size of garages. The data I pulled did not assign a significant p-value to them. I also was leaning on the side of practicality and ease of use. With predictor variables of Land and Total Main Area giving me significant model R and P values, it’s a good proposal for a model construct that makes it easy to obtain predictor variables.