1. Introduction

This report presents a statistical analysis of property tax assessments, using regression models to predict the 2025 Market Value of residential properties. A key focus is determining whether the assessed value of the property at 6321 88th Street is fair, undervalued, or overvalued. The analysis follows a structured process: data validation, exploratory analysis, model development, diagnostics, and prediction. It examines key factors such as improvement value, land market value, main area square footage and value, garage square footage and value, and land square footage to identify the influential ones, and then uses a multiple linear regression model to determine whether the home at 6321 88th Street is overvalued or undervalued. The findings advocate for re-evaluation to align the home’s taxable value with empirical evidence, ensuring fairness and equity in property taxation.

The multiple linear regression equation is given by \[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \varepsilon \]

Where:

- $Y$: dependent variable

- $X_1, X_2, \dots, X_n$: independent variables

- $\beta_0$: intercept

- $\beta_1, \dots, \beta_n$: coefficients

- $\varepsilon$: error term
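As a minimal sketch of how these terms map onto R output, the example below fits a two-predictor model with lm(). The data are simulated and the coefficient values (2, 3, and -1.5) are arbitrary illustration values, not taken from the property data.

```r
# Simulated data: y = beta_0 + beta_1 * x1 + beta_2 * x2 + error
# (the true values 2, 3 and -1.5 are arbitrary and for illustration only)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2)

coef(fit)         # estimates of beta_0 (Intercept), beta_1 and beta_2
head(resid(fit))  # residuals, the sample counterpart of the error term
```

The estimated coefficients should land close to the values used to simulate the data, which is the sense in which the fitted equation recovers the intercept and slopes.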

This report argues that the 2025 market value assessment of $538,409 for the home at 6321 88th Street is too high, resulting in an unfairly high property tax burden. The argument rests on data from 45 neighboring properties along 88th Street (addresses 6303–6351). Variables analyzed include:

2025_Market_Value
Improvement_Market_Value (value of structures on the property)
Total_Land_Market_Value
Main_Area_Sq_Ft (square footage of the main living area)
Main_Area_Value
Garage_Sq_Ft
Garage_Value
Land_Sq_Ft

The goal of this project is to show whether the assessed value of the home at 6321 88th Street falls above or below the statistically reasonable range and, if it does, to urge the county tax assessor to re-evaluate the home’s value and adjust taxes accordingly.

The initial data analysis involved checking the data distribution to ensure reliability for use in regression modeling. This included checking assumptions such as normality and screening for outliers and influential points.

2. Data Collection, Data Cleaning and Validation

Data for this analysis was collected from the Lubbock Central Appraisal District for properties located on 88th Street, Lubbock, Texas.

For the analysis, the required R packages (dplyr, ggplot2, car, MASS, and corrplot) were loaded.

#load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(car)

## Warning: package 'car' was built under R version 4.4.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(MASS)

## Warning: package 'MASS' was built under R version 4.4.3

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.3

## corrplot 0.95 loaded

The focus includes all homes with addresses ranging from 6309 to 6351 88th Street, comprising 45 properties in total. Below is a breakdown of key variables collected:


Table 1: Key Variables Collected from LubbockCAD.org
Variable	Explanation
2025 Market Value	Total appraised value of property (house + land)
Total Improvement Market Value	Total appraised value of structural improvements (house)
Total Land Market Value	Total appraised value of land
Homestead Cap Loss	Discount applied if prior year’s appraisal increased by >10% (excluded from analysis)
Total Main Area (Sq. Ft.)	Total square footage of house (heated + non-heated areas)
Main Area (Sq. Ft.)	Square footage of heated living area
Main Area (Value)	Appraised value of heated living area
Garage (Sq. Ft.)	Square footage of garage/non-heated areas
Garage (Value)	Appraised value of garage/non-heated areas
Land (Sq. Ft.)	Total square footage of land

The data was saved as a comma-delimited (.csv) file and uploaded to a GitHub account for easy access. The code below reads the data and uses the head() function to verify that it loaded correctly.

Missing values were checked for and verified to be absent, ensuring the data satisfied the minimum requirements for regression modeling. Such checks comply with best practice in data science and statistical computing, since valid inference depends on high-quality input data.

Data validation and verification are the groundwork for sound statistical modeling. In this study, several steps were taken in checking data quality to guarantee the consistency and validity of the dataset. Specifically, variables calculated from their component values like total area and improvement value were computed again from the components and matched against original figures reported to check for accuracy.
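The two checks described above can be sketched as follows. The data frame `toy` is hypothetical: its first two rows reuse values printed by head(df), and the third row is invented so that the example contains a missing value.

```r
# Hypothetical three-row extract (rows 1-2 copy values shown by head(df);
# row 3 is invented to include a missing value)
toy <- data.frame(
  Improvement_Market_Value = c(485373, 458572, NA),
  Main_Area_Value          = c(449668, 419843, 400000),
  Garage_Value             = c(35705, 38729, 30000)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Recompute improvement value from its components and compare with the reported value
recomputed <- toy$Main_Area_Value + toy$Garage_Value
differs    <- recomputed != toy$Improvement_Market_Value

mismatch <- which(differs)         # rows where the two values disagree
unknown  <- which(is.na(differs))  # rows that cannot be compared because of NA

na_counts
mismatch
unknown
```

Note that which() silently drops NA comparisons, so rows with missing components are reported separately here rather than being mistaken for matches.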

# Load the dataset
#Note that House 6321 Data is on row 16 from the data set excluding the header row
df <- read.csv("https://raw.githubusercontent.com/Ahmedja96/IE-5320-Project-2-Dataset/refs/heads/main/IE%205344%20Project%202%20Dataset.csv")
head(df)

##   X2025_Market_Value Improvement_Market_Value Total_Land_Market_Value
## 1             531703                   485373                   46330
## 2             504815                   458572                   46243
## 3             573558                   527274                   46284
## 4             469131                   422975                   46156
## 5            1218146                   116617                  101529
## 6             569992                   511992                   58000
##   Main_Area_Sq_Ft Main_Area_Value Garage_Sq_Ft Garage_Value Land_Sq_Ft
## 1            2743          449668          484        35705       7988
## 2            2610          419843          525        38729       7973
## 3            2851          460543          918        66731       7980
## 4            2991          390541          552        32434       7958
## 5            3097          624126         1095        96763      17505
## 6            3036          447924          575        38175      10000

# Verify calculated fields
df$Check_Main_Area <- df$Main_Area_Sq_Ft + df$Garage_Sq_Ft
df$Check_Improvement_Value <- df$Main_Area_Value + df$Garage_Value
df$Check_Market_Value <- df$Improvement_Market_Value + df$Total_Land_Market_Value

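# Find mismatches between recomputed and reported values
# NOTE: head(df) and summary(df) show no Total_Main_Area_Sq_Ft column, so this
# comparison is against NULL and returns integer(0) regardless of the data
which(df$Check_Main_Area != df$Total_Main_Area_Sq_Ft)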

## integer(0)

which(df$Check_Improvement_Value != df$Improvement_Market_Value)

##  [1]  5  6  7  9 10 11 12 13 15 17 21 26 30 34 38 41

which(df$Check_Market_Value != df$X2025_Market_Value)

## [1] 5

summary(df)

##  X2025_Market_Value Improvement_Market_Value Total_Land_Market_Value
##  Min.   : 418286    Min.   : 116617          Min.   : 43506         
##  1st Qu.: 504815    1st Qu.: 458572          1st Qu.: 45112         
##  Median : 534991    Median : 485962          Median : 45658         
##  Mean   : 575245    Mean   : 502189          Mean   : 50834         
##  3rd Qu.: 573558    3rd Qu.: 527274          3rd Qu.: 46330         
##  Max.   :1218146    Max.   :1116617          Max.   :101529         
##  Main_Area_Sq_Ft Main_Area_Value   Garage_Sq_Ft     Garage_Value  
##  Min.   :2041    Min.   :331544   Min.   : 325.0   Min.   :24934  
##  1st Qu.:2610    1st Qu.:415934   1st Qu.: 506.0   1st Qu.:36674  
##  Median :2745    Median :443268   Median : 528.0   Median :38729  
##  Mean   :2721    Mean   :440626   Mean   : 570.9   Mean   :41673  
##  3rd Qu.:2902    3rd Qu.:453758   3rd Qu.: 552.0   3rd Qu.:41033  
##  Max.   :3219    Max.   :624126   Max.   :1119.0   Max.   :96763  
##    Land_Sq_Ft    Check_Main_Area Check_Improvement_Value Check_Market_Value
##  Min.   : 7501   Min.   :2591    Min.   :368218          Min.   : 218146   
##  1st Qu.: 7778   1st Qu.:3132    1st Qu.:453413          1st Qu.: 504815   
##  Median : 7872   Median :3286    Median :484488          Median : 532125   
##  Mean   : 8756   Mean   :3292    Mean   :482300          Mean   : 553023   
##  3rd Qu.: 7988   3rd Qu.:3477    3rd Qu.:494642          3rd Qu.: 573558   
##  Max.   :17505   Max.   :4192    Max.   :720889          Max.   :1218146

Observations After Data Collection & Statistics Analysis

The “Homestead Cap Loss” variable was removed because it is not relevant to the analysis: it applies to the current tax year only, and the assessed value will only go up over time. There was no home at 6351 88th Street, and the data for the home at 6310 88th Street was incomplete, so those two addresses were excluded from the data set. The validation checks also flag row 5, where the reported improvement value ($116,617) plus the land value ($101,529) does not equal the reported market value ($1,218,146), so this record should be reviewed. The Market Value for the neighborhood ranges from $418,286 to $1,218,146 and exhibits a right-skewed distribution, evidenced by the mean ($575,245) exceeding the median ($534,991). The range is substantial and suggests outliers at the upper end (likely the $1.22M property). The interquartile range ($504,815–$573,558) captures typical values, and the mean sits just above the 3rd quartile ($573,558), which shows how strongly the upper tail pulls the average. This skewness may necessitate outlier treatment for robust modeling. The summary gives a general overview of the market value variable; a histogram and box plot are used later to examine its distribution.
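One common way to make "outliers at the upper end" concrete is the 1.5 × IQR rule used by boxplots. The sketch below applies it to the quartiles printed by summary(df); the choice of this rule is an assumption for illustration, not a method stated in the report.

```r
# Quartiles, minimum and maximum of 2025 market value, copied from summary(df)
q1        <- 504815
q3        <- 573558
min_value <- 418286
max_value <- 1218146

iqr         <- q3 - q1            # 68743
lower_fence <- q1 - 1.5 * iqr     # 401700.5
upper_fence <- q3 + 1.5 * iqr     # 676672.5

max_value > upper_fence           # TRUE: the most expensive property is an outlier
min_value < lower_fence           # FALSE: no outlier at the low end
```

Under this rule, any property valued above about $676,700 would be flagged, while the lowest value in the data set stays inside the fence.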

Analyze Total Area of House vs Total Area of Houses in the neighborhood

The Total House Area variable combines the square footage of both heated living space (Main Area) and non-heated garage space (Garage Area). In the dataset, this variable ranges from 2,591 to 4,192 sq. ft., reflecting the physical footprint of the homes analyzed. Larger total areas generally align with higher Market Values, as they represent more usable space such as living rooms, bedrooms, and garages, which is a key factor in property appraisal. This variable provides a measure of property size, which is critical for assessing functional utility, buyer preferences in the housing market, and market value.

# summary of total house area variable
df$Total_House_Area_Sq_Ft <- df$Main_Area_Sq_Ft + df$Garage_Sq_Ft
summary(df$Total_House_Area_Sq_Ft)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2591    3132    3286    3292    3477    4192

The variable has a near-symmetric distribution, as the median (3,286) and mean (3,292) are nearly identical. The middle 50% of values fall within a narrow range (3,132–3,477), indicating consistency for most observations. The maximum value (4,192) significantly exceeds the third quartile (3,477), suggesting a potential outlier or a small subset of exceptionally large values in the upper tail.

# histogram and scatterplot
# oma reserves outer margin space so the mtext() title below is visible
par(mfrow = c(1, 2), mar = c(4, 4, 3, 2), oma = c(0, 0, 3, 0))

# histogram with density curve (freq = FALSE puts the bars on the density scale)
hist(df$Total_House_Area_Sq_Ft,
     main = "Histogram of Total House Area",
     xlab = "Total House Area (Sq. Ft.)",
     col = "blue",
     border = "yellow",
     breaks = 15,
     freq = FALSE)
lines(density(df$Total_House_Area_Sq_Ft), col = "black", lwd = 2)

# market value (x-axis) vs house area (y-axis)
plot(df$X2025_Market_Value, df$Total_House_Area_Sq_Ft,
     main = "Market Value vs. House Area",
     xlab = "2025 Market Value ($)",
     ylab = "Total House Area (Sq. Ft.)",
     col = "orange",
     pch = 19)

# mark house #6321
points(538409, 3365, cex = 2, pch = 18, col = "grey")
text(628409, 3330, labels = "#6321", col = "grey", cex = 0.8, font = 2)

mtext("Total House Area Distribution", outer = TRUE, cex = 2, font = 4)

# reset plot layout
par(mfrow = c(1, 1))

Observations from plots

The histogram shows the distribution of total house area, measured in square feet. Consistent with the summary statistics, the distribution is roughly symmetric: most houses fall between about 3,100 and 3,500 sq. ft. (the interquartile range is 3,132–3,477), and only a few houses lie near the extremes of 2,591 and 4,192 sq. ft.

The scatter plot illustrates the relationship between total house area (in square feet) and market value (in dollars). There is a general positive trend: as total area increases, market value tends to increase as well. However, the relationship is not perfectly linear, as there is some variability in market values for houses of similar sizes. For example, #6321 has a relatively high market value compared to other houses with similar total areas, suggesting it may be an outlier or have additional features that justify its higher value.

Analyze Total Area Sqft of Land vs Total Area Sqft of land for other houses in the neighborhood

The Land Area variable is the total square footage of the property’s land, ranging from 7,501 to 17,505 sq. ft. in the data set. Larger land areas often correlate with higher market values in real estate, which makes this variable important to the analysis.

summary(df$Land_Sq_Ft)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7501    7778    7872    8756    7988   17505

# Market Value vs. Total Land Sqft
par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2, 0, 2, 0))
hist(df$Land_Sq_Ft,
     main = "Histogram of Land Area",
     xlab = "Land Area (Sq. Ft.)",
     col = "lightblue",
     border = "darkblue",
     labels = TRUE)

plot(df$X2025_Market_Value, df$Land_Sq_Ft,
     main = "Market Value vs. Total Land",
     xlab = "2025 Market Value ($)",
     ylab = "Land Area (Sq. Ft.)",
     col = ifelse(df$Land_Sq_Ft < 8169, "purple", "orange"), # highlight outliers
     pch = 19)

# mark house #6321
points(538409, 7546, cex = 2, pch = 18, col = "lightgreen")
text(718000, 7546, labels = "#6321", col = "darkred", cex = 0.8, font = 2)

# legend colors match the point colors used above
legend("topleft", c("Within Specification", "Outliers"), border = "black", inset = .02, fill = c("purple", "orange"))

mtext("Market Value vs. Land Sqft", outer = TRUE, cex = 2, font = 4)

The land area distribution is right-skewed, evidenced by the mean (8,756 sq. ft.) exceeding the median (7,872 sq. ft.). Most properties cluster tightly between 7,778 and 7,988 sq. ft. (the interquartile range), while the maximum (17,505 sq. ft.) represents an extreme outlier. This skewness suggests that while land area varies minimally for the majority of homes, a few large lots disproportionately inflate the average, diminishing its utility as a standalone predictor of market value.

3. Exploratory Data Analysis

Exploratory data analysis is conducted to visualize the distribution of the target variable, 2025 Market Value. A histogram shows positive skewness in the data, which signals that the linear regression assumption of normally distributed residuals may be at risk and must be checked in the diagnostics.

Histogram of Market Value

hist(df$X2025_Market_Value, 
     main = "Histogram of 2025 Market Value", 
     xlab = "2025 Market Value", 
     col = "skyblue", 
     breaks = 30)

The histogram of 2025 Market Value shows the spread of residential property values in the dataset. The distribution is right-skewed (positively skewed): most properties fall in the lower to mid range, approximately $400,000 to $700,000, while a few properties have considerably larger market values that extend beyond $1,000,000 and form a long tail on the right side of the distribution. These are outliers or high-end properties that differ greatly from the central tendency of the data set. A skewed response does not by itself violate the regression assumptions, which concern the residuals, but it indicates that residual normality and the influence of the high-value properties need to be checked.

Boxplot of 2025 Market Value

boxplot(df$X2025_Market_Value, 
        main = "Boxplot of 2025 Market Value", 
        ylab = "2025 Market Value", 
        col = "lightgreen", 
        horizontal = TRUE)

The boxplot of 2025 Market Value shows that the middle half of the houses lies between roughly $505,000 and $574,000, with a median of about $535,000, consistent with the summary statistics. A few high-value houses appear as outliers at around $700,000, $950,000, and more than $1,200,000. These outliers and the longer upper whisker confirm a right-skewed distribution. Such outliers can affect the regression output, so additional diagnostic checks are required to evaluate their effect on the model.

For further clarification, the scatterplots below show each predictor against the 2025 market value.

# scatterplots of each predictor against 2025 market value
# (read.csv names the response column X2025_Market_Value, as shown by head(df))
plot(df$Improvement_Market_Value, df$X2025_Market_Value,
     main = "Improvement Market Value vs 2025 Value",
     xlab = "Improvement Value", ylab = "2025 Market Value",
     col = "blue")

plot(df$Total_Land_Market_Value, df$X2025_Market_Value,
     main = "Total Land Value vs 2025 Value",
     xlab = "Land Value", ylab = "2025 Market Value",
     col = "blue")

plot(df$Main_Area_Sq_Ft, df$X2025_Market_Value,
     main = "Main Area vs 2025 Value",
     xlab = "Sq Ft", ylab = "2025 Market Value",
     col = "blue")

plot(df$Main_Area_Value, df$X2025_Market_Value,
     main = "Main Area Value vs 2025 Value",
     xlab = "Value", ylab = "2025 Market Value",
     col = "blue")

plot(df$Garage_Sq_Ft, df$X2025_Market_Value,
     main = "Garage Sq Ft vs 2025 Value",
     xlab = "Sq Ft", ylab = "2025 Market Value",
     col = "blue")

plot(df$Garage_Value, df$X2025_Market_Value,
     main = "Garage Value vs 2025 Value",
     xlab = "Value", ylab = "2025 Market Value",
     col = "blue")

plot(df$Land_Sq_Ft, df$X2025_Market_Value,
     main = "Land Sq Ft vs 2025 Value",
     xlab = "Sq Ft", ylab = "2025 Market Value",
     col = "blue")

4. Regression Modeling

Several regression model specifications were fitted and compared.

Initial Regression model

#fit initial multiple regression model
model_initial <- lm(X2025_Market_Value ~ ., data = df)
summary(model_initial)

## 
## Call:
## lm(formula = X2025_Market_Value ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107849   -9540   -1424   15793  109994 
## 
## Coefficients: (4 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -1.327e+05  6.084e+04  -2.182   0.0356 *  
## Improvement_Market_Value  4.802e-01  7.283e-02   6.593 9.93e-08 ***
## Total_Land_Market_Value   3.355e+01  2.221e+01   1.511   0.1393    
## Main_Area_Sq_Ft           6.740e+02  1.461e+02   4.614 4.62e-05 ***
## Main_Area_Value          -3.813e+00  8.476e-01  -4.499 6.56e-05 ***
## Garage_Sq_Ft             -3.975e+03  7.147e+02  -5.561 2.47e-06 ***
## Garage_Value              5.472e+01  9.524e+00   5.746 1.39e-06 ***
## Land_Sq_Ft               -1.603e+02  1.308e+02  -1.226   0.2281    
## Check_Main_Area                  NA         NA      NA       NA    
## Check_Improvement_Value          NA         NA      NA       NA    
## Check_Market_Value               NA         NA      NA       NA    
## Total_House_Area_Sq_Ft           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34410 on 37 degrees of freedom
## Multiple R-squared:  0.9628, Adjusted R-squared:  0.9558 
## F-statistic: 136.9 on 7 and 37 DF,  p-value: < 2.2e-16

The initial model included all variables to predict 2025_Market_Value: \[ \text{Market Value} = \beta_0 + \beta_1(\text{Improvement Market Value}) + \beta_2(\text{Total Land Market Value}) + \beta_3(\text{Main Area Sq Ft}) + \beta_4(\text{Main Area Value}) + \beta_5(\text{Garage Sq Ft}) + \beta_6(\text{Garage Value}) + \beta_7(\text{Land Sq Ft}) + \varepsilon \] where $\beta_0$ is the intercept and $\beta_1$ to $\beta_7$ are coefficients. The four derived columns (Check_Main_Area, Check_Improvement_Value, Check_Market_Value, and Total_House_Area_Sq_Ft) are exact sums of other predictors, so R reports their coefficients as NA and drops them from the fit. The fitted multiple regression model is \[ \widehat{\text{Market Value}} = -132{,}700 + 0.48(\text{Improvement Market Value}) + 33.55(\text{Total Land Market Value}) + 674.0(\text{Main Area Sq Ft}) - 3.81(\text{Main Area Value}) - 3{,}975(\text{Garage Sq Ft}) + 54.72(\text{Garage Value}) - 160.3(\text{Land Sq Ft}) \] The model has a high R-squared value of 0.9628, indicating that approximately 96.3% of the variability in market value can be explained by the included predictors. The adjusted R-squared of 0.9558 confirms the model’s strong explanatory power while accounting for the number of predictors. The F-statistic of 136.9 with a p-value less than 2.2e-16 suggests the overall model is highly significant. The level of significance used to determine significant factors is 0.05. Most of the predictors were statistically significant at the 0.05 level, including Improvement Market Value, Main Area Square Footage, Main Area Value, Garage Square Footage, and Garage Value. Notably, Improvement Market Value has a positive relationship with the 2025 market value, while Main Area Value and Garage Sq Ft have negative coefficients, which may warrant further investigation or could indicate multicollinearity. On the other hand, Total Land Market Value and Land Square Footage were not statistically significant in this model.
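Since unexpected coefficient signs are a typical symptom of multicollinearity, a variance inflation factor (VIF) check is one way to follow up. The sketch below computes VIFs by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. The data are simulated and the variable names x1, x2, and x3 are hypothetical; the car package loaded earlier offers vif() for the same purpose on a fitted model.

```r
# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

vif_manual
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; here x1 and x2 should show large values while x3 stays near 1.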

plot(model_initial, which = 1:5)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

A second linear regression is specified using main area square footage, garage square footage, and land square footage as independent variables.

First-Order Model with Key Predictors

model1 <- lm(X2025_Market_Value ~ Main_Area_Sq_Ft + Garage_Sq_Ft + Land_Sq_Ft, data = df)
summary(model1)

## 
## Call:
## lm(formula = X2025_Market_Value ~ Main_Area_Sq_Ft + Garage_Sq_Ft + 
##     Land_Sq_Ft, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -181371  -40589    8719   43257   88072 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     30699.347  91695.465   0.335    0.739    
## Main_Area_Sq_Ft   -17.895     35.826  -0.499    0.620    
## Garage_Sq_Ft       28.411     60.786   0.467    0.643    
## Land_Sq_Ft         65.897      4.166  15.819   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58780 on 41 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.871 
## F-statistic:   100 on 3 and 41 DF,  p-value: < 2.2e-16

plot(model1, which = 1:5)

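This model explained about 88% of the variability in market value (R-squared of 0.8798), but only Land_Sq_Ft was statistically significant, which pointed to the inclusion of more variables.

The full model below was then formulated by adding economic variables such as main area value and land market value. Although this improved the fit (R-squared of 0.9191), none of the individual predictors was significant at the 0.05 level, and residual patterns indicated additional complexity in the data-generating process.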

Full Model with All Predictors

model2 <- lm(X2025_Market_Value ~ df$Main_Area_Sq_Ft + df$Main_Area_Value + df$Garage_Sq_Ft + 
               df$Garage_Value + df$Land_Sq_Ft + df$Total_Land_Market_Value, data = df)
summary(model2)

## 
## Call:
## lm(formula = X2025_Market_Value ~ df$Main_Area_Sq_Ft + df$Main_Area_Value + 
##     df$Garage_Sq_Ft + df$Garage_Value + df$Land_Sq_Ft + df$Total_Land_Market_Value, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -145451  -27441     -23   23165  110511 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                45210.7926 79352.6611   0.570    0.572
## df$Main_Area_Sq_Ft           -55.5944   138.7699  -0.401    0.691
## df$Main_Area_Value             0.4346     0.8015   0.542    0.591
## df$Garage_Sq_Ft             -419.9346   682.7132  -0.615    0.542
## df$Garage_Value                6.3317     8.8313   0.717    0.478
## df$Land_Sq_Ft                174.0064   175.4109   0.992    0.327
## df$Total_Land_Market_Value   -20.8122    30.0055  -0.694    0.492
## 
## Residual standard error: 50070 on 38 degrees of freedom
## Multiple R-squared:  0.9191, Adjusted R-squared:  0.9064 
## F-statistic: 71.99 on 6 and 38 DF,  p-value: < 2.2e-16

plot(model2, which = 1:5)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
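One practical note on formulas written with `df$` inside lm(): the coefficients are the same as with bare column names, but predict() can no longer substitute new data, which matters when the goal is to predict the value of a single home. The sketch below illustrates this on simulated data; the data frame `d` and the value x = 100 are hypothetical.

```r
# Simulated data for illustration only
set.seed(7)
d <- data.frame(x = 1:20)
d$y <- 5 + 2 * d$x + rnorm(20)

fit_dollar <- lm(y ~ d$x, data = d)  # predictor referenced through d$
fit_named  <- lm(y ~ x,   data = d)  # predictor referenced by column name

new_home <- data.frame(x = 100)

# With d$x in the formula, newdata is ignored (R issues a warning) and the
# 20 fitted values for the original rows come back instead
pred_dollar <- suppressWarnings(predict(fit_dollar, newdata = new_home))
pred_named  <- predict(fit_named, newdata = new_home)

length(pred_dollar)  # 20
length(pred_named)   # 1
```

Writing the formula with bare column names and `data = df` keeps the fitted coefficients unchanged while letting predict() work on a one-row data frame for the home of interest.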

Final Model with Selected Variables

The final model considers only the variables judged most significant (improvement market value, land market value, main area, and land area, with 2025 market value as the response). These variables appear to sum up the factors by which each house is priced (for example garage area and yard area), considering appraised and market price per square foot.

model3 <- lm( X2025_Market_Value ~ df$Improvement_Market_Value + df$Total_Land_Market_Value + df$Main_Area_Sq_Ft + df$Land_Sq_Ft, data = df)
summary (model3)

## 
## Call:
## lm(formula = X2025_Market_Value ~ df$Improvement_Market_Value + 
##     df$Total_Land_Market_Value + df$Main_Area_Sq_Ft + df$Land_Sq_Ft, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -183549  -38744    5728   35926  124784 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                 -1.861e+04  9.063e+04  -0.205   0.8383  
## df$Improvement_Market_Value  1.121e-01  6.856e-02   1.636   0.1098  
## df$Total_Land_Market_Value  -4.604e+01  2.682e+01  -1.717   0.0938 .
## df$Main_Area_Sq_Ft          -8.510e+00  3.369e+01  -0.253   0.8019  
## df$Land_Sq_Ft                3.313e+02  1.560e+02   2.124   0.0399 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55850 on 40 degrees of freedom
## Multiple R-squared:  0.8941, Adjusted R-squared:  0.8835 
## F-statistic: 84.43 on 4 and 40 DF,  p-value: < 2.2e-16

plot(model3, which = 1:5)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

What can be concluded alongside the results of the final model (model3) is that the house at 6321 is typical in size. The total house area histogram deviates slightly from a typical bell-shaped curve, as seen in the unexpected rise in the bars near both ends of the distribution. Most homes are concentrated in the 3,000 to 3,700 square foot range, with the majority grouped between 3,000 and 3,600 square feet. Homes near the upper and lower boundaries of the range are close to being outliers but do not qualify as such. The house at address 6321 is specifically marked to show that it falls within the main cluster, suggesting it represents an average case.

Additionally, the Residuals vs. Fitted plot shows a relatively straight trend line, with most data points on or near the reference line, suggesting only mild non-constant variance. The plot also identifies three potential outliers (points 9, 19, and 30) that require further examination. The Q-Q plot indicates that the residuals mostly follow a normal distribution: a few points stray from the line and three are flagged as possible outliers, but the majority align closely with the reference line, supporting the assumption of normality. The Scale-Location plot, in contrast to the earlier model, shows all points below the 1.8 threshold, indicating minimal deviation; however, point 19 exceeds the 1.5 mark, a concern that will be addressed later. The Residuals vs. Leverage plot confirms that the three points identified by Cook’s Distance (CD) are also flagged by the leverage statistic ($h_{ii}$), whose cutoff is $\frac{2p}{n} = \frac{2(4)}{45} \approx 0.18$, suggesting they have both leverage and influence. Among them, point 7 appears to be the most influential, lying just outside the CD reference line and farthest from the $h_{ii}$ threshold. Point 9 shows high leverage but is nearer to the 1.0 line, while point 17 is closer to the $h_{ii}$ limit but also near the reference line. All three points demonstrate notable influence and should be investigated further.
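The leverage and influence screening described above can be sketched in a few lines of base R. The example uses the built-in mtcars data purely for illustration, and the 4/n cutoff for Cook's distance is one common rule of thumb rather than the report's stated criterion. Conventions for p differ: many textbooks count the intercept, which for a four-predictor model with n = 45 would give 2(5)/45 ≈ 0.22 instead of 0.18. In the hat-value listing for model3, observations 5, 7, 9, and 17 exceed either cutoff.

```r
# Illustration on built-in data: a model with four predictors
fit <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit))   # number of parameters, intercept included

h  <- hatvalues(fit)       # leverage of each observation
cd <- cooks.distance(fit)  # influence of each observation

lev_cut <- 2 * p / n       # leverage cutoff 2p/n
high_leverage <- names(h)[h > lev_cut]
influential   <- names(cd)[cd > 4 / n]  # 4/n is an assumed rule-of-thumb cutoff

lev_cut
high_leverage
influential
```

Points that appear in both lists combine an unusual predictor pattern with a noticeable effect on the fitted coefficients, which is the situation the Residuals vs. Leverage plot is meant to reveal.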

For the multiple linear regression model to be valid and reliable, key assumptions must be satisfied. The first is linearity, which assumes a straight-line relationship between the independent variables and the dependent variable, so that the model captures the true relationship. The second is independence, meaning that the residuals (errors) are not correlated with one another. The third is homoscedasticity, which requires that the residuals have constant variance across all levels of the predictors; patterns or funnel shapes in residual plots may indicate a violation. The fourth is normality of residuals, which ensures that hypothesis tests and confidence intervals derived from the model are valid and is typically assessed using a Q-Q plot. Lastly, the model assumes no multicollinearity, meaning that the independent variables are not highly correlated with each other; multicollinearity can inflate standard errors and make coefficient estimates unstable. This check is shown later.

5. Model Diagnostics

After model estimation, diagnostic plots were checked to evaluate the assumptions of linear regression. These comprised residuals vs. fitted values, normal Q-Q plots, scale-location plots for homogeneity of variance, and residuals vs. leverage for outlier identification. Cook’s Distance was used to measure the influence of each data point on the regression estimates, combining its leverage with the size of its residual.

Verifying points of leverage, outliers, and points of influence

# Outliers
model3$residuals

##             1             2             3             4             5 
##    5728.05372  -18322.08187   44336.04713  -45807.83214  124783.75252 
##             6             7             8             9            10 
##  -85791.21310     -38.65217  -49014.80516   18743.26907   35925.97255 
##            11            12            13            14            15 
##  -64819.12000   66107.68646  -53059.54661  -38743.95631  -39532.65417 
##            16            17            18            19            20 
##   40090.11931   12644.87811    4829.81127 -183548.54871    1755.36374 
##            21            22            23            24            25 
##   75585.65371  -10499.53131  -19687.27484   12272.78179  -86640.09914 
##            26            27            28            29            30 
##   67566.18543   63010.54297   19183.30384  -40520.55166   77961.14119 
##            31            32            33            34            35 
##    2703.02943    3514.50490    6827.99801   33471.60472  -45807.83214 
##            36            37            38            39            40 
##   19281.88608    7547.33413   36996.44124  -18322.08187   13200.17453 
##            41            42            43            44            45 
##   44336.04713  -52470.31854    5728.05372  -31595.65626   40090.11931

model3$fitted.values

##         1         2         3         4         5         6         7         8 
##  525974.9  523137.1  529222.0  514938.8 1093362.2  655783.2  602465.7  509302.8 
##         9        10        11        12        13        14        15        16 
##  950022.7  514193.0  774423.1  514861.3  533193.5  506679.0  649850.7  498318.9 
##        17        18        19        20        21        22        23        24 
## 1205501.1  530161.2  828797.5  509359.6  518846.3  508856.5  511061.3  498893.2 
##        25        26        27        28        29        30        31        32 
##  504926.1  512613.8  507951.5  518686.7  509530.6  532445.9  522870.0  518442.5 
##        33        34        35        36        37        38        39        40 
##  523776.0  521574.4  514938.8  519092.1  524577.7  519805.6  523137.1  518419.8 
##        41        42        43        44        45 
##  529222.0  486815.3  525974.9  545693.7  498318.9

dfOutliers <- df
dfOutliers$fitted <- model3$fitted.values
dfOutliers$residuals <- model3$residuals
#Note that points 9 and 30 are outliers, but point 30 does not appear to have leverage per the plots

# Leverage & Influence
hatvalues(model3)

##          1          2          3          4          5          6          7 
## 0.02599412 0.02853996 0.03792851 0.06212403 0.87521625 0.05661006 0.99999919 
##          8          9         10         11         12         13         14 
## 0.10637205 0.49005419 0.03107769 0.10675883 0.05191473 0.03543308 0.06557517 
##         15         16         17         18         19         20         21 
## 0.03950801 0.03293721 0.60642007 0.02925276 0.13371755 0.02772535 0.06374753 
##         22         23         24         25         26         27         28 
## 0.03669430 0.04956392 0.02842347 0.11546540 0.05111619 0.03341317 0.03153928 
##         29         30         31         32         33         34         35 
## 0.09246670 0.13715690 0.02529293 0.02702496 0.02531986 0.02695823 0.06212403 
##         36         37         38         39         40         41         42 
## 0.02845697 0.02624340 0.04887929 0.02853996 0.02768140 0.03792851 0.06482963 
##         43         44         45 
## 0.02599412 0.02904384 0.03293721

sort(hatvalues(model3))

##         31         33          1         43         37         34         32 
## 0.02529293 0.02531986 0.02599412 0.02599412 0.02624340 0.02695823 0.02702496 
##         40         20         24         36         39          2         44 
## 0.02768140 0.02772535 0.02842347 0.02845697 0.02853996 0.02853996 0.02904384 
##         18         10         28         16         45         27         13 
## 0.02925276 0.03107769 0.03153928 0.03293721 0.03293721 0.03341317 0.03543308 
##         22          3         41         15         38         23         26 
## 0.03669430 0.03792851 0.03792851 0.03950801 0.04887929 0.04956392 0.05111619 
##         12          6         35          4         21         42         14 
## 0.05191473 0.05661006 0.06212403 0.06212403 0.06374753 0.06482963 0.06557517 
##         29          8         11         25         19         30          9 
## 0.09246670 0.10637205 0.10675883 0.11546540 0.13371755 0.13715690 0.49005419 
##         17          5          7 
## 0.60642007 0.87521625 0.99999919

# Point 7 has the highest leverage: its hat value (0.99999919) is essentially 1
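
A common rule of thumb flags an observation as high leverage when its hat value exceeds twice the mean hat value, 2p/n, where p counts the estimated coefficients (intercept included). The sketch below applies that cutoff to the six largest hat values copied from the output above. The choice of the 2p/n cutoff is an assumption made for illustration; the report itself does not state which cutoff was used.

```r
# Six largest hat values copied from the sorted hatvalues(model3) output
h <- c("19" = 0.13371755, "30" = 0.13715690, "9" = 0.49005419,
       "17" = 0.60642007, "5" = 0.87521625, "7" = 0.99999919)

n <- 45  # observations in df
p <- 5   # coefficients in model3: intercept + 4 predictors

# Rule-of-thumb leverage cutoff (assumed here): twice the mean hat value p/n
cutoff <- 2 * p / n

# Names of the observations whose hat value exceeds the cutoff
flagged <- names(h)[h > cutoff]

print(cutoff)
print(flagged)
```

Under this cutoff, observations 5, 7, 9, and 17 stand out, while 19 and 30 stay below it.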

Cook’s Distance (Influence Plot)

cooksd <- cooks.distance(model3)
plot(cooksd, main = "Cook's Distance")
abline(h = 4/nrow(df), col = "red")

Interpretation: The Cook's distance plot is a diagnostic for finding influential points in the regression model. Each point in the plot is the Cook's distance of one observation, that is, how much the fitted regression coefficients change when that observation is left out. The horizontal red line at 4/n, where n is the number of observations, is a rule-of-thumb cutoff for potentially influential points. Observations above the line deserve a closer look because they can disproportionately affect the coefficient estimates and undermine inference. A few observations in the plot for model3 lie above the red line, which means they have high leverage, a large residual, or both. They may be outliers, data entry mistakes, or legitimately distinctive properties whose traits are poorly represented by the predictors in the model. The most extreme points are removed in the next step to see how much the fit changes.

Detecting and examining influential points matters in regression diagnostics because they can bias parameter estimates, inflate standard errors, and reduce predictive accuracy. If a point turns out to be a measurement or data entry error, it may be deleted or trimmed. If it reflects genuine variability in the population, robust regression methods or model re-specification are the better remedy.
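
To make the definition concrete, Cook's distance for observation i can be written as $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2}$, with residual $e_i$, hat value $h_i$, p estimated coefficients, and residual variance $s^2$. The sketch below checks this formula against `cooks.distance()` and applies the 4/n cutoff. It uses R's built-in `mtcars` data purely for illustration, not the 88th Street assessments, and the model `mpg ~ wt + hp` is an arbitrary stand-in.

```r
# Illustrative only: built-in mtcars data, not the 88th Street assessments
fit <- lm(mpg ~ wt + hp, data = mtcars)

e  <- residuals(fit)          # raw residuals
h  <- hatvalues(fit)          # leverages (diagonal of the hat matrix)
p  <- length(coef(fit))       # number of estimated coefficients
s2 <- summary(fit)$sigma^2    # residual variance estimate

# Cook's distance computed directly from its definition
cooks_manual <- (e^2 / (p * s2)) * h / (1 - h)^2

# Compare with R's built-in implementation
print(all.equal(unname(cooks_manual), unname(cooks.distance(fit))))

# Observations above the 4/n rule-of-thumb cutoff
cutoff  <- 4 / nrow(mtcars)
flagged <- names(cooks_manual)[cooks_manual > cutoff]
print(flagged)
```

The formula shows why a point needs both a sizeable residual and noticeable leverage to be influential: either factor near zero pulls $D_i$ toward zero.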

Removing points of leverage and points of influence

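# Remove the high-leverage observations 7, 9, and 17
finaldf <- df[-c(7, 9, 17), ]

# Refit the four-predictor model on the reduced data. The finaldf$ prefixes are kept
# to match the output below; bare column names would work as well with data = finaldf.
model_final <- lm(X2025_Market_Value ~ finaldf$Improvement_Market_Value +
                    finaldf$Total_Land_Market_Value + finaldf$Main_Area_Sq_Ft +
                    finaldf$Land_Sq_Ft, data = finaldf)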

summary(model_final)

## 
## Call:
## lm(formula = X2025_Market_Value ~ finaldf$Improvement_Market_Value + 
##     finaldf$Total_Land_Market_Value + finaldf$Main_Area_Sq_Ft + 
##     finaldf$Land_Sq_Ft, data = finaldf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -163007  -22110     165   29504  117256 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       4.607e+03  1.055e+05   0.044    0.965
## finaldf$Improvement_Market_Value -4.890e-02  1.430e-01  -0.342    0.734
## finaldf$Total_Land_Market_Value   2.861e+04  3.013e+04   0.949    0.349
## finaldf$Main_Area_Sq_Ft           3.240e+01  4.761e+01   0.681    0.500
## finaldf$Land_Sq_Ft               -1.659e+05  1.748e+05  -0.949    0.349
## 
## Residual standard error: 56290 on 37 degrees of freedom
## Multiple R-squared:  0.7987, Adjusted R-squared:  0.7769 
## F-statistic:  36.7 on 4 and 37 DF,  p-value: 2.084e-12

After removing the non-significant predictors, the final model retains Improvement_Market_Value, Total_Land_Market_Value, Main_Area_Sq_Ft, and Land_Sq_Ft. Keeping only these four predictors simplifies interpretation and reduces overfitting. Refitted without the three high-leverage points (7, 9, and 17), the predictors are jointly strongly associated with the response (F = 36.7 on 4 and 37 degrees of freedom, p = 2.084e-12), although no individual coefficient is significant at the 0.05 level (p-values between 0.349 and 0.734). The R-squared is 0.7987 (adjusted 0.7769), lower than the 0.89 reported for the model fitted to all 45 observations, while the residual standard error is nearly unchanged (56,290 versus 55,850). It can therefore be concluded that these points do have leverage and influence, but excluding them does not improve the fit.
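
As a quick consistency check on the summary output, the adjusted R-squared can be recomputed from the multiple R-squared, with n = 42 observations remaining after the three removals and p = 4 predictors:

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.7987)\,\frac{41}{37} \approx 0.7769 \]

This matches the reported value; the denominator n - p - 1 = 37 is the residual degrees of freedom shown in the output.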

6. FORECAST FOR 6321 88TH St

In addition, multicollinearity between predictors was assessed with the Variance Inflation Factor (VIF), which had indicated strong collinearity between squared terms and their linear equivalents. This type of multicollinearity undermines coefficient interpretation and estimation accuracy, making model refinement imperative. After removing the non-significant predictors, the variance inflation factors were checked again with the vif function to ensure the validity of the model. For predictor i, \[ VIF_i = \frac{1}{1-R_i^{2}} \] where $R_i^2$ is the R-squared obtained by regressing predictor i on the remaining predictors.

## Check pairwise correlation

cor_matrix <- cor(df[-1,])
corrplot(cor_matrix, method = "color", type = "upper")

## Multicollinearity Check

#check VIF 
vif_final <- vif(model3)
print(vif_final)

## df$Improvement_Market_Value  df$Total_Land_Market_Value 
##                    1.201379                 1865.183007 
##          df$Main_Area_Sq_Ft               df$Land_Sq_Ft 
##                    1.130723                 1864.151900

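Two of the four VIFs are extremely high: about 1,865 for Total_Land_Market_Value and about 1,864 for Land_Sq_Ft. This indicates unresolved collinearity between these two variables. The VIFs for Improvement_Market_Value (1.20) and Main_Area_Sq_Ft (1.13) are close to 1 and raise no concern.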
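
Inverting the VIF definition shows how much of each predictor is explained by the others. Using the printed values:

\[ R_i^2 = 1 - \frac{1}{VIF_i}: \qquad 1 - \frac{1}{1865.183} \approx 0.99946 \ \ (\text{Total\_Land\_Market\_Value}), \qquad 1 - \frac{1}{1.201379} \approx 0.168 \ \ (\text{Improvement\_Market\_Value}) \]

In other words, more than 99.9% of the variation in Total_Land_Market_Value is reproduced by the other predictors, which is what makes its coefficient and that of Land_Sq_Ft hard to estimate separately.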

The four-predictor model (model3) was then used to forecast the market value of the home at 6321 88th Street, so that the forecast can be compared with the assessor's value to determine whether the home is overvalued or undervalued. The inputs are the home's improvement market value, total land market value, main area, and land area.

#predict value for the home 6321
df_6321 <- data.frame(
  Improvement_Market_Value = 494642,
  Total_Land_Market_Value = 43767,
  Main_Area_Sq_Ft = 2773,
  Land_Sq_Ft = 7546
)

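# Prediction with a 95% confidence interval.
# model3 was fit with df$-prefixed terms, so predict() cannot match the columns
# of df_6321 and (as the warning below says) returns the fitted values for all
# 45 rows of df instead of a single prediction.
predicted_conf <- predict(model3, newdata = df_6321, interval = "confidence", level = 0.95)

## Warning: 'newdata' had 1 row but variables found have 45 rows

print(predicted_conf) # house 6321 is row 16 of df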

##          fit       lwr       upr
## 1   525974.9  507775.0  544174.8
## 2   523137.1  504066.8  542207.4
## 3   529222.0  507237.6  551206.3
## 4   514938.8  486802.9  543074.8
## 5  1093362.2  987756.1 1198968.4
## 6   655783.2  628924.9  682641.5
## 7   602465.7  489581.9  715349.4
## 8   509302.8  472486.1  546119.5
## 9   950022.7  870999.7 1029045.7
## 10  514193.0  494292.9  534093.2
## 11  774423.1  737539.5  811306.7
## 12  514861.3  489141.0  540581.7
## 13  533193.5  511944.7  554442.4
## 14  506679.0  477772.1  535585.8
## 15  649850.7  627413.2  672288.1
## 16  498318.9  477832.0  518805.7
## 17 1205501.1 1117595.2 1293407.1
## 18  530161.2  510854.2  549468.2
## 19  828797.5  787518.9  870076.2
## 20  509359.6  490563.4  528155.8
## 21  518846.3  490345.1  547347.6
## 22  508856.5  487232.8  530480.3
## 23  511061.3  485930.0  536192.5
## 24  498893.2  479861.8  517924.6
## 25  504926.1  466568.0  543284.2
## 26  512613.8  487092.0  538135.6
## 27  507951.5  487317.1  528585.8
## 28  518686.7  498639.3  538734.1
## 29  509530.6  475204.5  543856.6
## 30  532445.9  490639.7  574252.0
## 31  522870.0  504917.2  540822.7
## 32  518442.5  499885.2  536999.8
## 33  523776.0  505813.7  541738.3
## 34  521574.4  503040.1  540108.7
## 35  514938.8  486802.9  543074.8
## 36  519092.1  500049.5  538134.7
## 37  524577.7  506290.7  542864.6
## 38  519805.6  494848.5  544762.7
## 39  523137.1  504066.8  542207.4
## 40  518419.8  499638.5  537201.1
## 41  529222.0  507237.6  551206.3
## 42  486815.3  458073.2  515557.4
## 43  525974.9  507775.0  544174.8
## 44  545693.7  526455.7  564931.6
## 45  498318.9  477832.0  518805.7

Row 16 therefore gives a fitted value of $498,318.9 with a 95% confidence interval from $477,832.0 to $518,805.7.
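
The warning printed with the prediction arises because a formula written with `df$` prefixes leaves `predict()` unable to map `newdata` onto the model terms. The sketch below reproduces the behaviour on simulated data; the data frame `toy` and its columns `x1`, `x2`, and `y` are hypothetical and only stand in for the assessment data.

```r
# Simulated stand-in data (hypothetical; not the assessment data)
set.seed(1)
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(20)

# Formula with toy$ prefixes: the terms always refer to the full toy columns
fit_dollar <- lm(toy$y ~ toy$x1 + toy$x2)

# Formula with bare column names: the terms are looked up in data / newdata
fit_bare <- lm(y ~ x1 + x2, data = toy)

new_obs <- data.frame(x1 = 0.5, x2 = -1)

# newdata is ignored here (R warns), so all 20 fitted rows come back
pred_dollar <- suppressWarnings(
  predict(fit_dollar, newdata = new_obs, interval = "confidence", level = 0.95)
)

# newdata is honored here, so a single row comes back
pred_bare <- predict(fit_bare, newdata = new_obs, interval = "confidence", level = 0.95)

print(nrow(pred_dollar))  # 20
print(nrow(pred_bare))    # 1
```

Reading off row 16 works in this report only because the home is itself one of the rows used to fit the model; for a property outside the data, the bare-name form would be needed.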

# prediction with prediction interval
predicted_pred <- predict(model3, newdata = df_6321, interval = "prediction", level = 0.95)

## Warning: 'newdata' had 1 row but variables found have 45 rows

print(predicted_pred) #House 6321 Data on row 16

##          fit       lwr       upr
## 1   525974.9  411633.4  640316.5
## 2   523137.1  408653.8  637620.4
## 3   529222.0  414217.3  644226.6
## 4   514938.8  398601.5  631276.2
## 5  1093362.2  938780.9 1247943.6
## 6   655783.2  539748.2  771818.2
## 7   602465.7  442823.9  762107.4
## 8   509302.8  390566.9  628038.7
## 9   950022.7  812228.0 1087817.4
## 10  514193.0  399568.6  628817.5
## 11  774423.1  655666.4  893179.8
## 12  514861.3  399084.4  630638.2
## 13  533193.5  418327.3  648059.8
## 14  506679.0  390152.8  623205.2
## 15  649850.7  534758.6  764942.7
## 16  498318.9  383591.1  613046.6
## 17 1205501.1 1062427.0 1348575.2
## 18  530161.2  415638.2  644684.1
## 19  828797.5  708603.2  948991.9
## 20  509359.6  394921.7  623797.6
## 21  518846.3  402420.1  635272.6
## 22  508856.5  393920.3  623792.7
## 23  511061.3  395413.8  626708.7
## 24  498893.2  384416.4  613370.0
## 25  504926.1  385703.2  624149.0
## 26  512613.8  396880.9  628346.7
## 27  507951.5  393197.3  622705.6
## 28  518686.7  404036.6  633336.8
## 29  509530.6  391543.1  627518.0
## 30  532445.9  412069.4  652822.4
## 31  522870.0  408567.5  637172.4
## 32  518442.5  404043.5  632841.4
## 33  523776.0  409472.1  638079.9
## 34  521574.4  407179.2  635969.6
## 35  514938.8  398601.5  631276.2
## 36  519092.1  404613.4  633570.8
## 37  524577.7  410222.3  638933.1
## 38  519805.6  404195.9  635415.3
## 39  523137.1  408653.8  637620.4
## 40  518419.8  403984.3  632855.3
## 41  529222.0  414217.3  644226.6
## 42  486815.3  370329.9  603300.7
## 43  525974.9  411633.4  640316.5
## 44  545693.7  431182.3  660205.0
## 45  498318.9  383591.1  613046.6

Row 16 therefore gives the same fitted value of $498,318.9 with a 95% prediction interval from $383,591.1 to $613,046.6.
The prediction interval is always wider than the confidence interval: the confidence interval reflects only the uncertainty in the estimated mean response, while the prediction interval also includes the error term of an individual future observation.
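
The difference in width can be reproduced by hand. For an observation with hat value h, the confidence interval half-width is $t \cdot s \sqrt{h}$ and the prediction interval half-width is $t \cdot s \sqrt{1 + h}$. The sketch below plugs in the row 16 hat value from the leverage output and the residual standard error of 55850 on 40 degrees of freedom quoted in the summary section; because that standard error is rounded, the results are only expected to agree with the printed limits to within a few dollars.

```r
s     <- 55850        # residual standard error of model3 (rounded, as quoted in the report)
dof   <- 40           # residual degrees of freedom
h16   <- 0.03293721   # hat value of row 16 from hatvalues(model3)
fit16 <- 498318.9     # fitted value of row 16

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, dof)

ci_half <- t_crit * s * sqrt(h16)       # half-width of the confidence interval
pi_half <- t_crit * s * sqrt(1 + h16)   # half-width of the prediction interval

print(round(c(ci_lwr = fit16 - ci_half, ci_upr = fit16 + ci_half)))
print(round(c(pi_lwr = fit16 - pi_half, pi_upr = fit16 + pi_half)))
```

The extra 1 under the square root is the variance of a single new observation, which dominates the small hat value and explains why the prediction interval is several times wider.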

The forecast for the house thus consists of the fitted value together with its lower and upper limits at $\alpha=0.05$.

Comparing the assessed market value of $538,409 with the predicted value of $498,318.9, the home at 6321 88th Street is overvalued by $40,090.1.
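
The comparison can be laid out as a short calculation. The sketch below uses only figures printed in this section (the assessed value, the row 16 fitted value, and the two row 16 intervals); the helper `inside()` is a hypothetical convenience function introduced here.

```r
assessed  <- 538409     # assessor's 2025 market value
predicted <- 498318.9   # fitted value for row 16

conf_int <- c(lwr = 477832.0, upr = 518805.7)   # 95% confidence interval, row 16
pred_int <- c(lwr = 383591.1, upr = 613046.6)   # 95% prediction interval, row 16

# Difference between assessed and predicted value, in dollars and in percent
gap     <- assessed - predicted
gap_pct <- 100 * gap / predicted

# Hypothetical helper: is x inside the closed interval?
inside <- function(x, interval) {
  x >= interval[["lwr"]] && x <= interval[["upr"]]
}

in_ci <- inside(assessed, conf_int)
in_pi <- inside(assessed, pred_int)

print(c(gap = gap, gap_pct = round(gap_pct, 2)))
print(c(in_confidence_interval = in_ci, in_prediction_interval = in_pi))
```

With these inputs the assessed value exceeds the prediction by about 8% and lies above the confidence interval, while remaining inside the wider prediction interval.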

Questions Answered

How well do the four factors forecast the 2025 home market value?

The model achieves a good fit (R² = 0.89), indicating that the four predictors (Improvement_Market_Value, Total_Land_Market_Value, Main_Area_Sq_Ft, and Land_Sq_Ft) collectively explain about 89% of the variability in 2025 market values. A high R² on its own does not guarantee reliable forecasts, however: it can reflect overfitting or redundant predictors, and the VIFs above 1,800 for Total_Land_Market_Value and Land_Sq_Ft show that these two variables carry almost the same information. While the predictors fit this data set well, the model's reliability for new data is questionable until these issues are resolved.

What implications do the results have for property tax and value assessment?

Overvaluation evidence: The 95% confidence interval for 6321 88th Street is $477,832.0 to $518,805.7, and the assessed value of $538,409 lies above its upper limit. This suggests the home is overvalued by $40,090.1, leading to an unfair tax burden.

7. Summary & Conclusions

The residual standard error (RSE) of the model is 55,850 on 40 degrees of freedom. The RSE estimates the standard deviation of the residuals, that is, the variability in the response variable (2025_Market_Value) that remains unexplained after accounting for the predictors. A typical prediction therefore misses the observed value by roughly $56,000. This is modest relative to the scale of the market values being predicted, but it shows that some variation remains unexplained, potentially due to omitted variables or inherent randomness.

The multiple R-squared is 0.89, so about 89% of the variability in 2025_Market_Value is explained by the predictors included in the model. The combination of Improvement_Market_Value, Total_Land_Market_Value, Main_Area_Sq_Ft, and Land_Sq_Ft therefore captures most of the patterns in the data. A high R-squared should still be interpreted cautiously, as it may also indicate overfitting. The adjusted R-squared, which penalizes the inclusion of unnecessary predictors, is similarly high, reinforcing that the selected predictors contribute meaningfully to the model's explanatory power.

The p-value for the overall F-statistic is very small (< 2.2e-16), confirming the statistical significance of the model as a whole and indicating that at least one predictor has a meaningful relationship with the response variable. Among the individual predictors, Land_Sq_Ft is significant (p < 0.05), while Main_Area_Sq_Ft is not (p = 0.8019). Despite its non-significance, Main_Area_Sq_Ft may still be worth keeping on grounds of theoretical or practical relevance. Overall, the p-values reinforce the importance of the significant predictors in explaining the variability in 2025_Market_Value.
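
For reference, the residual standard error and its degrees of freedom follow from the residuals, with n = 45 observations and p = 4 predictors:

\[ RSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}, \qquad n - p - 1 = 45 - 4 - 1 = 40 \]

Relative to the fitted value for 6321 88th Street, the reported RSE corresponds to about $55{,}850 / 498{,}318.9 \approx 11\%$.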

In conclusion, the exploratory data analysis showed that the property's characteristics consistently align with the average values across all major variables. This includes land size, house area, improvement market value, and total land market value, all of which are strongly correlated with market value. As a result, the home's assessed value would also be expected to fall within the typical price range. Based on the multiple linear regression model, which compares the property to neighboring homes, the 95% confidence interval places the market value between $477,832.0 and $518,805.7, with a predicted value of $498,318.9. The assessed value of $538,409 lies above this range, which suggests overvaluation. Homes assessed at similar values often feature upgrades such as swimming pools, secondary garages, or other enhancements, features that 6321 88th Street, Lubbock, Texas lacks. Therefore, it should not be valued at the same level as those more amenity-rich properties. Given these factors, it is reasonable to conclude that the home is overvalued by approximately $40,090.1, leading to an unjustified tax burden. A reassessment is recommended to better reflect its true market value.

IE 5344 Project 2 – Property Tax Assessment Analysis

Ahmed Abudaqqa

2025-04-20