This report presents a statistical analysis of property tax assessments, using regression models to predict the 2025 Market Value of residential properties. A key focus is determining whether the assessed value of the property at 6321 88th Street is fair, undervalued, or overvalued. The analysis follows a structured process: data validation, exploratory analysis, model development, diagnostics, and prediction. It examines key factors such as improvement value, land market value, main area square footage and value, garage square footage and value, and land square footage to identify the influential ones, and uses a multiple linear regression model to judge whether the home at 6321 88th Street is overvalued or undervalued. The findings advocate for re-evaluation to align the home’s taxable value with empirical evidence, ensuring fairness and equity in property taxation.
The multiple linear regression equation is given by \[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \varepsilon \]
Where:
-\(Y: Dependent~variable\)
-\(X_1, X_2, \dots, X_n: Independent~variables\)
-\(\beta_0: Intercept\)
-\(\beta_1, \dots, \beta_n: Coefficients\)
-\(\varepsilon: Error~term\)
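As a minimal sketch of how the coefficients in this equation are estimated, the example below simulates data from a known linear model and recovers the coefficients twice: once with `lm()` and once from the normal equations. The variable names and the true coefficient values are illustrative assumptions, not part of the appraisal data.

```r
# Illustrative only: simulated data, not the appraisal dataset
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# True model: beta0 = 2, beta1 = 3, beta2 = -1.5, plus a random error term
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# Estimate the coefficients by ordinary least squares with lm()
fit <- lm(y ~ x1 + x2)

# Same estimates from the normal equations (X'X) b = X'y
X <- cbind(1, x1, x2)                      # column of 1s carries the intercept
beta_hat <- solve(crossprod(X), crossprod(X, y))

print(coef(fit))
print(drop(beta_hat))
```

Both printouts should agree up to numerical precision, and both should land near the true values used in the simulation.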
The 2025 market value assessment of $538,409 for the home at 6321 88th Street is overvalued, resulting in an unfairly high property tax burden. This report supports that claim using data from 45 neighboring properties along 88th Street (addresses 6303–6351). Variables analyzed include:
- 2025_Market_Value
- Improvement_Market_Value (value of structures on the property)
- Total_Land_Market_Value
- Main_Area_Sq_Ft (square footage of the main living area)
- Main_Area_Value
- Garage_Sq_Ft
- Garage_Value
- Land_Sq_Ft

The goal of this project is to show whether the assessed value of the home at 6321 88th Street falls above or below the statistically reasonable range, and to urge the county tax assessor to re-evaluate the home’s value and adjust taxes accordingly.
The initial data analysis involved checking the data distribution to ensure reliability for use in regression modeling. This included checking assumptions such as normality and screening for outliers and influential points.
Data for this analysis was collected from the Lubbock Central Appraisal District for properties located on 88th Street, Lubbock, Texas.
For the analysis, the required R packages, including dplyr, ggplot2, car, MASS, and corrplot, were loaded.
#load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
The focus includes all homes with addresses ranging from 6309 to 6351 88th Street, comprising 45 properties in total. Below is a breakdown of key variables collected:
| Variable | Explanation |
|---|---|
| 2025 Market Value | Total appraised value of property (house + land) |
| Total Improvement Market Value | Total appraised value of structural improvements (house) |
| Total Land Market Value | Total appraised value of land |
| Homestead Cap Loss | Discount applied if prior year’s appraisal increased by >10% (excluded from analysis) |
| Total Main Area (Sq. Ft.) | Total square footage of house (heated + non-heated areas) |
| Main Area (Sq. Ft.) | Square footage of heated living area |
| Main Area (Value) | Appraised value of heated living area |
| Garage (Sq. Ft.) | Square footage of garage/non-heated areas |
| Garage (Value) | Appraised value of garage/non-heated areas |
| Land (Sq. Ft.) | Total square footage of land |
The data was saved as a comma-delimited (.csv) file and uploaded to a GitHub repository for easy access. The code below loads the data and uses the head() function to verify it.
Missing values were checked and verified to be absent, ensuring the data satisfied the minimum assumptions for regression modeling. Such operations comply with best practice in data science and statistical computing, reinforcing the principle that valid inference depends on high-quality input data.
Data validation and verification are the groundwork for sound statistical modeling. In this study, several steps were taken in checking data quality to guarantee the consistency and validity of the dataset. Specifically, variables calculated from their component values like total area and improvement value were computed again from the components and matched against original figures reported to check for accuracy.
# Load the dataset
#Note that House 6321 Data is on row 16 from the data set excluding the header row
df <- read.csv("https://raw.githubusercontent.com/Ahmedja96/IE-5320-Project-2-Dataset/refs/heads/main/IE%205344%20Project%202%20Dataset.csv")
head(df)
## X2025_Market_Value Improvement_Market_Value Total_Land_Market_Value
## 1 531703 485373 46330
## 2 504815 458572 46243
## 3 573558 527274 46284
## 4 469131 422975 46156
## 5 1218146 116617 101529
## 6 569992 511992 58000
## Main_Area_Sq_Ft Main_Area_Value Garage_Sq_Ft Garage_Value Land_Sq_Ft
## 1 2743 449668 484 35705 7988
## 2 2610 419843 525 38729 7973
## 3 2851 460543 918 66731 7980
## 4 2991 390541 552 32434 7958
## 5 3097 624126 1095 96763 17505
## 6 3036 447924 575 38175 10000
# Verify calculated fields
df$Check_Main_Area <- df$Main_Area_Sq_Ft + df$Garage_Sq_Ft
df$Check_Improvement_Value <- df$Main_Area_Value + df$Garage_Value
df$Check_Market_Value <- df$Improvement_Market_Value + df$Total_Land_Market_Value
# Find mismatches & Removing Irrelevant Data
which(df$Check_Main_Area != df$Total_Main_Area_Sq_Ft)
## integer(0)
which(df$Check_Improvement_Value != df$Improvement_Market_Value)
## [1] 5 6 7 9 10 11 12 13 15 17 21 26 30 34 38 41
which(df$Check_Market_Value != df$X2025_Market_Value)
## [1] 5
summary(df)
## X2025_Market_Value Improvement_Market_Value Total_Land_Market_Value
## Min. : 418286 Min. : 116617 Min. : 43506
## 1st Qu.: 504815 1st Qu.: 458572 1st Qu.: 45112
## Median : 534991 Median : 485962 Median : 45658
## Mean : 575245 Mean : 502189 Mean : 50834
## 3rd Qu.: 573558 3rd Qu.: 527274 3rd Qu.: 46330
## Max. :1218146 Max. :1116617 Max. :101529
## Main_Area_Sq_Ft Main_Area_Value Garage_Sq_Ft Garage_Value
## Min. :2041 Min. :331544 Min. : 325.0 Min. :24934
## 1st Qu.:2610 1st Qu.:415934 1st Qu.: 506.0 1st Qu.:36674
## Median :2745 Median :443268 Median : 528.0 Median :38729
## Mean :2721 Mean :440626 Mean : 570.9 Mean :41673
## 3rd Qu.:2902 3rd Qu.:453758 3rd Qu.: 552.0 3rd Qu.:41033
## Max. :3219 Max. :624126 Max. :1119.0 Max. :96763
## Land_Sq_Ft Check_Main_Area Check_Improvement_Value Check_Market_Value
## Min. : 7501 Min. :2591 Min. :368218 Min. : 218146
## 1st Qu.: 7778 1st Qu.:3132 1st Qu.:453413 1st Qu.: 504815
## Median : 7872 Median :3286 Median :484488 Median : 532125
## Mean : 8756 Mean :3292 Mean :482300 Mean : 553023
## 3rd Qu.: 7988 3rd Qu.:3477 3rd Qu.:494642 3rd Qu.: 573558
## Max. :17505 Max. :4192 Max. :720889 Max. :1218146
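The validation steps above can be sketched with an explicit guard. The `head(df)` output does not list a `Total_Main_Area_Sq_Ft` column, so the first `which()` call may be returning `integer(0)` only because it compares against a missing column rather than because every row matched. The sketch below uses the first three rows shown by `head(df)` as toy data; the helper `check_sum()` is a hypothetical name introduced here for illustration.

```r
# Toy data: first three rows from the head(df) output above
toy <- data.frame(
  Main_Area_Sq_Ft          = c(2743, 2610, 2851),
  Garage_Sq_Ft             = c(484, 525, 918),
  Main_Area_Value          = c(449668, 419843, 460543),
  Garage_Value             = c(35705, 38729, 66731),
  Improvement_Market_Value = c(485373, 458572, 527274)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Hypothetical helper: compare the row-wise sum of `parts` with a reported
# `total` column, but only if that column exists. Comparing against a missing
# column (NULL) would silently give integer(0), which looks like "no mismatches".
check_sum <- function(data, parts, total) {
  if (!total %in% names(data)) {
    return(NA_integer_)  # signals "could not check" rather than "all matched"
  }
  which(rowSums(data[parts]) != data[[total]])
}

improvement_mismatch <- check_sum(toy, c("Main_Area_Value", "Garage_Value"),
                                  "Improvement_Market_Value")
area_mismatch <- check_sum(toy, c("Main_Area_Sq_Ft", "Garage_Sq_Ft"),
                           "Total_Main_Area_Sq_Ft")

print(improvement_mismatch)  # rows whose components do not add up
print(area_mismatch)         # NA here, because the reported column is absent
```

Returning `NA` for an absent column is one possible design choice; stopping with an error would be a stricter alternative.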
The “Homestead Cap Loss” variable was removed because it is not relevant to the analysis: it applies to the current tax year only, and the assessed value will only go up over time. There was no home at 6351 88th Street, and the data for the home at 6310 88th Street was incomplete, so those two were excluded from the data set. The Market Value for the neighborhood ranges from $418,286 to $1,218,146 and exhibits a right-skewed distribution, evidenced by the mean ($575,245) exceeding the median ($534,991). The range is substantial, suggesting outliers at the upper end (likely the $1.22M property). The interquartile range ($504,815–$573,558) captures typical values, and the 3rd quartile ($573,558) lies close to the median, indicating that most values cluster in a narrow band before the extreme upper tail. This skewness may necessitate outlier treatment for robust modeling. A histogram and box plot are used later to examine the distribution of this variable.
The Total House Area variable combines the square footage of both heated living space (Main Area) and non-heated garage space (Garage Area). In the dataset, this variable ranges from about 2,600 to about 4,200 sq. ft., reflecting the physical footprint of the homes analyzed. Larger total areas generally align with higher Market Values, as they represent more usable space such as living rooms, bedrooms, and garages, which is a key factor in property appraisal. This variable provides a measure of property size, which is critical for assessing functional utility, buyer preferences in the housing market, and market value.
# summary of total house area variable
df$Total_House_Area_Sq_Ft <- df$Main_Area_Sq_Ft + df$Garage_Sq_Ft
summary(df$Total_House_Area_Sq_Ft)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2591 3132 3286 3292 3477 4192
The variable has a near-symmetric distribution, as the median (3,286) and mean (3,292) are nearly identical. The middle 50% of values fall within a narrow range (3,132–3,477), indicating consistency for most observations. The maximum value (4,192) significantly exceeds the third quartile (3,477), suggesting a potential outlier or a small subset of exceptionally large values in the upper tail.
# histogram and scatterplot
par(mfrow = c(1, 2), mar = c(4, 4, 3, 2))
# histogram with density curve
hist(df$Total_House_Area_Sq_Ft,
main = "Histogram of Total House Area",
xlab = "Total House Area (Sq. Ft.)",
col = "blue",
border = "yellow",
breaks = 15)
lines(density(df$Total_House_Area_Sq_Ft), col = "black", lwd = 2)
# market value vs house area
plot(df$X2025_Market_Value, df$Total_House_Area_Sq_Ft,
main = "Market Value vs. House Area",
xlab = "2025 Market Value ($)",
ylab = "Total Main Area (Sq. Ft.)",
col ="orange",
pch=19)
points(538409, 3365, cex = 2, pch = 18, col ="grey")
text(628409, 3330,labels="#6321", col="grey", cex=0.8, font=2)
mtext("Total House Area Distribution", outer = TRUE, cex = 2, font = 4)
# reset plot layout
par(mfrow = c(1, 1))
The histogram shows the distribution of total house areas, measured in square feet. Most houses cluster between roughly 3,100 and 3,500 sq. ft., which matches the interquartile range in the summary above, and the peak in frequency falls in that band. Fewer houses are observed at the largest sizes, so the upper tail is thin, and the largest house (4,192 sq. ft.) stands apart from the rest.
The scatter plot illustrates the relationship between the total main area (in square feet) and the market value (in dollars). There is a general positive trend, indicating that as the total main area increases, the market value tends to increase as well. However, the relationship is not perfectly linear, as there is some variability in market values for houses of similar sizes. For example, the house marked “#6321” has a relatively high market value compared to other houses with similar total main areas, suggesting it may be an outlier or have additional features that justify its higher price.
The Land Area variable is the total square footage of the property’s land, ranging from 7,501 to 17,505 sq. ft. in the data set. Larger land areas often correlate with higher market values in real estate, which makes this variable important to the analysis.
summary(df$Land_Sq_Ft)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7501 7778 7872 8756 7988 17505
# Market Value vs. Total Land Sq Ft
par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2, 0, 2, 0))
hist(df$Land_Sq_Ft,
     main = "Histogram of Land Area",
     xlab = "Land Area (Sq. Ft.)",
     col = "lightblue",
     border = "darkblue",
     labels = TRUE)
plot(df$X2025_Market_Value, df$Land_Sq_Ft,
     main = "Market Value vs. Total Land",
     xlab = "2025 Market Value ($)",
     ylab = "Land Area (Sq. Ft.)",
     col = ifelse(df$Land_Sq_Ft < 8169, "purple", "orange"), # orange marks outliers
     pch = 19)
points(538409, 7546, cex = 2, pch = 18, col = "lightgreen")
text(718000, 7546, labels = "#6321", col = "darkred", cex = 0.8, font = 2)
# legend colors match the point colors used in plot()
legend("topleft", c("Within Specification", "Outliers"), border = "black", inset = .02, fill = c("purple", "orange"))
mtext("Market Value vs. Land Sqft", outer = TRUE, cex = 2, font = 4)
The land area distribution is right-skewed, evidenced by the mean (8,756 sq. ft.) exceeding the median (7,872 sq. ft.). Most properties cluster tightly between 7,778 and 7,988 sq. ft. (the IQR), while the maximum (17,505 sq. ft.) is an extreme outlier. This skewness suggests that while land area varies minimally for the majority of homes, a few large lots disproportionately inflate the average, diminishing its utility as a standalone predictor of market value.
Exploratory data analysis was conducted to visualize the distribution of the target variable, 2025 Market Value. A histogram shows positive skewness in the data, which can lead to residuals that depart from the normality assumed by linear regression.
hist(df$X2025_Market_Value,
main = "Histogram of 2025 Market Value",
xlab = "2025 Market Value",
col = "skyblue",
breaks = 30)
The histogram of 2025 Market Value shows the spread of residential property prices in the dataset. The distribution is right-skewed (positively skewed): most properties fall in the lower to mid-value range, approximately $400,000 to $700,000, while a few properties have considerably larger market values beyond $1,000,000 and form a long tail on the right side of the distribution. This indicates outliers or high-end properties that differ greatly from the central tendency of the data set. Skewness in the response does not by itself violate the regression assumptions, which concern the residuals, but it signals that the normality of the residuals should be checked carefully.
boxplot(df$X2025_Market_Value,
main = "Boxplot of 2025 Market Value",
ylab = "2025 Market Value",
col = "lightgreen",
horizontal = TRUE)
The boxplot of the 2025 Market Value shows that the middle half of the houses fall between roughly $505,000 and $574,000, with a median of about $535,000, in line with the summary statistics reported earlier. There are a few high-value houses flagged as outliers at around $700,000, $950,000, and more than $1,200,000. These outliers and the longer upper whisker confirm a right-skewed distribution. Such outliers can affect regression output, so additional diagnostic checks are required to evaluate their effect on the model.
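The outlier rule behind a boxplot can be worked through by hand. The sketch below applies the usual 1.5 × IQR fences to the quartiles reported by `summary(df)` for the 2025 Market Value. It assumes the boxplot hinges are close to those quartiles, which need not hold exactly.

```r
# Quartiles and extremes of X2025_Market_Value, copied from summary(df)
q1        <- 504815
q3        <- 573558
min_value <- 418286
max_value <- 1218146

iqr         <- q3 - q1            # interquartile range
lower_fence <- q1 - 1.5 * iqr     # values below this are flagged
upper_fence <- q3 + 1.5 * iqr     # values above this are flagged

cat("IQR:", iqr, "\n")
cat("Fences:", lower_fence, "to", upper_fence, "\n")
cat("Minimum below lower fence:", min_value < lower_fence, "\n")
cat("Maximum above upper fence:", max_value > upper_fence, "\n")
```

Under these figures the upper fence is about $676,673, so the houses near $700,000 and above are flagged while the minimum is not.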
For further clarification, the scatterplots below plot each predictor against the 2025 market value.
# scatterplots of each predictor against the 2025 market value
# (read.csv names the response column X2025_Market_Value)
plot(df$Improvement_Market_Value, df$X2025_Market_Value,
     main = "Improvement Market Value vs 2025 Value",
     xlab = "Improvement Value", ylab = "2025 Market Value",
     col = "blue")
plot(df$Total_Land_Market_Value, df$X2025_Market_Value,
     main = "Total Land Value vs 2025 Value",
     xlab = "Land Value", ylab = "2025 Market Value",
     col = "blue")
plot(df$Main_Area_Sq_Ft, df$X2025_Market_Value,
     main = "Main Area vs 2025 Value",
     xlab = "Sq Ft", ylab = "2025 Market Value",
     col = "blue")
plot(df$Main_Area_Value, df$X2025_Market_Value,
     main = "Main Area Value vs 2025 Value",
     xlab = "Value", ylab = "2025 Market Value",
     col = "blue")
plot(df$Garage_Sq_Ft, df$X2025_Market_Value,
     main = "Garage Sq Ft vs 2025 Value",
     xlab = "Sq Ft", ylab = "2025 Market Value",
     col = "blue")
plot(df$Garage_Value, df$X2025_Market_Value,
     main = "Garage Value vs 2025 Value",
     xlab = "Value", ylab = "2025 Market Value",
     col = "blue")
plot(df$Land_Sq_Ft, df$X2025_Market_Value,
     main = "Land Sq Ft vs 2025 Value",
     xlab = "Sq Ft", ylab = "2025 Market Value",
     col = "blue")
The research tried out a number of regression model specifications.
#fit initial multiple regression model
model_initial <- lm(X2025_Market_Value ~ ., data = df)
summary(model_initial)
##
## Call:
## lm(formula = X2025_Market_Value ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107849 -9540 -1424 15793 109994
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.327e+05 6.084e+04 -2.182 0.0356 *
## Improvement_Market_Value 4.802e-01 7.283e-02 6.593 9.93e-08 ***
## Total_Land_Market_Value 3.355e+01 2.221e+01 1.511 0.1393
## Main_Area_Sq_Ft 6.740e+02 1.461e+02 4.614 4.62e-05 ***
## Main_Area_Value -3.813e+00 8.476e-01 -4.499 6.56e-05 ***
## Garage_Sq_Ft -3.975e+03 7.147e+02 -5.561 2.47e-06 ***
## Garage_Value 5.472e+01 9.524e+00 5.746 1.39e-06 ***
## Land_Sq_Ft -1.603e+02 1.308e+02 -1.226 0.2281
## Check_Main_Area NA NA NA NA
## Check_Improvement_Value NA NA NA NA
## Check_Market_Value NA NA NA NA
## Total_House_Area_Sq_Ft NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34410 on 37 degrees of freedom
## Multiple R-squared: 0.9628, Adjusted R-squared: 0.9558
## F-statistic: 136.9 on 7 and 37 DF, p-value: < 2.2e-16
The initial model included all variables to predict 2025_Market_Value:
\[ \text{2025\_Market\_Value} = \beta_0 + \beta_1\,\text{Improvement\_Market\_Value} + \beta_2\,\text{Total\_Land\_Market\_Value} + \beta_3\,\text{Main\_Area\_Sq\_Ft} + \beta_4\,\text{Main\_Area\_Value} + \beta_5\,\text{Garage\_Sq\_Ft} + \beta_6\,\text{Garage\_Value} + \beta_7\,\text{Land\_Sq\_Ft} + \varepsilon \]
where \(\beta_1\) to \(\beta_7\) are coefficients. The fitted multiple regression model is
\[ \widehat{\text{Market Value}} = -132{,}700 + 0.48(\text{Improvement Market Value}) + 33.55(\text{Land Market Value}) + 674.0(\text{Main Area Sq Ft}) - 3.81(\text{Main Area Value}) - 3{,}975(\text{Garage Sq Ft}) + 54.72(\text{Garage Value}) - 160.3(\text{Land Sq Ft}) \]
The model has a high R-squared value of 0.9628, indicating that approximately 96.3% of the variability in market value can be explained by the included predictors. The adjusted R-squared of 0.9558 confirms the model’s strong explanatory power while accounting for the number of predictors. The F-statistic of 136.9 with a p-value less than 2.2e-16 suggests the overall model is highly significant. The level of significance used to determine significant factors is 0.05. Most of the predictors were statistically significant at the 0.05 level, including Improvement Market Value, Main Area Square Footage, Main Area Value, Garage Square Footage, and Garage Value. Notably, Improvement Market Value has a positive relationship with the 2025 market value, while Main Area Value and Garage Sq Ft show a negative impact, which may warrant further investigation or could indicate multicollinearity. On the other hand, Total Land Market Value and Land Square Footage were not statistically significant in this model.
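The "4 not defined because of singularities" note in the summary above is what `lm()` reports when a predictor is an exact linear combination of others. Here the `NA` rows are the derived `Check_*` and `Total_House_Area_Sq_Ft` columns, which `~ .` pulls into the model. The sketch below reproduces the effect on simulated data; the column names and coefficients are illustrative assumptions.

```r
# Illustrative only: simulated data with one derived (redundant) column
set.seed(42)
n <- 30
toy <- data.frame(
  main   = rnorm(n, mean = 2700, sd = 250),
  garage = rnorm(n, mean = 570,  sd = 120)
)
toy$total <- toy$main + toy$garage   # exact linear combination of two predictors
toy$value <- 150 * toy$main + 60 * toy$garage + rnorm(n, sd = 5000)

# "~ ." uses every other column, including the redundant one
fit_all <- lm(value ~ ., data = toy)
print(coef(fit_all))                 # the coefficient for 'total' is NA

# Excluding the derived column removes the singularity
fit_clean <- lm(value ~ . - total, data = toy)
print(coef(fit_clean))
```

Dropping helper columns before fitting, or naming the predictors explicitly in the formula, are two ways to avoid the aliased terms.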
plot(model_initial, which = 1:5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
A second linear regression is specified using main area square footage, garage square footage, and land square footage as independent variables.
model1 <- lm(X2025_Market_Value ~ Main_Area_Sq_Ft + Garage_Sq_Ft + Land_Sq_Ft, data = df)
summary(model1)
##
## Call:
## lm(formula = X2025_Market_Value ~ Main_Area_Sq_Ft + Garage_Sq_Ft +
## Land_Sq_Ft, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -181371 -40589 8719 43257 88072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30699.347 91695.465 0.335 0.739
## Main_Area_Sq_Ft -17.895 35.826 -0.499 0.620
## Garage_Sq_Ft 28.411 60.786 0.467 0.643
## Land_Sq_Ft 65.897 4.166 15.819 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58780 on 41 degrees of freedom
## Multiple R-squared: 0.8798, Adjusted R-squared: 0.871
## F-statistic: 100 on 3 and 41 DF, p-value: < 2.2e-16
plot(model1, which = 1:5)
This model produced a high R-squared value (0.8798), though lower than that of the initial model, and Land_Sq_Ft was its only significant predictor, which motivated the inclusion of more variables.
The fuller model below was then formulated by adding economic variables such as main area value and land market value. Although this improved the fit, residual patterns indicated additional complexity in the data-generating process.
model2 <- lm(X2025_Market_Value ~ df$Main_Area_Sq_Ft + df$Main_Area_Value + df$Garage_Sq_Ft +
df$Garage_Value + df$Land_Sq_Ft + df$Total_Land_Market_Value, data = df)
summary(model2)
##
## Call:
## lm(formula = X2025_Market_Value ~ df$Main_Area_Sq_Ft + df$Main_Area_Value +
## df$Garage_Sq_Ft + df$Garage_Value + df$Land_Sq_Ft + df$Total_Land_Market_Value,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145451 -27441 -23 23165 110511
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45210.7926 79352.6611 0.570 0.572
## df$Main_Area_Sq_Ft -55.5944 138.7699 -0.401 0.691
## df$Main_Area_Value 0.4346 0.8015 0.542 0.591
## df$Garage_Sq_Ft -419.9346 682.7132 -0.615 0.542
## df$Garage_Value 6.3317 8.8313 0.717 0.478
## df$Land_Sq_Ft 174.0064 175.4109 0.992 0.327
## df$Total_Land_Market_Value -20.8122 30.0055 -0.694 0.492
##
## Residual standard error: 50070 on 38 degrees of freedom
## Multiple R-squared: 0.9191, Adjusted R-squared: 0.9064
## F-statistic: 71.99 on 6 and 38 DF, p-value: < 2.2e-16
plot(model2, which = 1:5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Final Model With Only Significant Variables Considered (Market Value, Appraised Value, Total House Area, and Total Land Area)
Those variables are judged to be the most significant because they appear to sum up all the factors by which each house is priced (for example garage area, yard area, etc.), considering appraised and market price per square foot.
model3 <- lm( X2025_Market_Value ~ df$Improvement_Market_Value + df$Total_Land_Market_Value + df$Main_Area_Sq_Ft + df$Land_Sq_Ft, data = df)
summary (model3)
##
## Call:
## lm(formula = X2025_Market_Value ~ df$Improvement_Market_Value +
## df$Total_Land_Market_Value + df$Main_Area_Sq_Ft + df$Land_Sq_Ft,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -183549 -38744 5728 35926 124784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.861e+04 9.063e+04 -0.205 0.8383
## df$Improvement_Market_Value 1.121e-01 6.856e-02 1.636 0.1098
## df$Total_Land_Market_Value -4.604e+01 2.682e+01 -1.717 0.0938 .
## df$Main_Area_Sq_Ft -8.510e+00 3.369e+01 -0.253 0.8019
## df$Land_Sq_Ft 3.313e+02 1.560e+02 2.124 0.0399 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55850 on 40 degrees of freedom
## Multiple R-squared: 0.8941, Adjusted R-squared: 0.8835
## F-statistic: 84.43 on 4 and 40 DF, p-value: < 2.2e-16
plot(model3, which = 1:5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
What can be concluded from the results of the final model (model3) is that it explains most of the variation in market value (multiple R-squared 0.8941, adjusted R-squared 0.8835), although at the 0.05 level only Land_Sq_Ft is individually significant (p = 0.0399). Turning back to the total house area histogram, the plot shows a slight deviation from a typical bell-shaped curve, evident from the unexpected rise in the bars near both ends of the distribution. The plot’s bin labels still show rough symmetry, with most homes concentrated in the 3,000 to 3,700 square foot range and the majority of data points grouped between 3,000 and 3,600 square feet. The plot also identifies homes near the upper and lower boundaries of the range that are close to being outliers but do not qualify as such. Furthermore, the house at 6321 is specifically marked to demonstrate that its value falls within the main cluster, suggesting it represents an average case.
Additionally, the Residuals vs. Fitted plot shows a relatively straight trend line, with most data points on or near the reference line, suggesting only mild non-constant variance. The plot also identifies three potential outliers (points 9, 19, and 30) that require further examination. The Q-Q plot indicates that the residuals mostly follow a normal distribution: a few points stray from the line and three are flagged as possible outliers, but the majority align closely with the reference line, supporting the assumption of normality. The Scale-Location plot, in contrast to the earlier model, shows all points below the 1.8 threshold, indicating minimal deviation; however, point 19 exceeds the 1.5 mark, a concern that is addressed later. The Residuals vs. Leverage plot confirms that the three points identified by Cook’s Distance (CD) are also flagged by the leverage statistic \(h_{ii}\) (HII), whose cutoff is \(\frac{2p}{n} = \frac{2(4)}{45} \approx 0.18\), suggesting they have both leverage and influence. Among them, point 7 appears to be the most influential, lying just outside the CD reference line and farthest from the HII threshold. Point 9 shows high leverage but is nearer to the 1.0 line, while point 17 is closer to the HII limit but also near the reference line. All three points demonstrate notable influence and should be investigated further.
For the multiple linear regression model to be valid and reliable, key assumptions must be satisfied. First, linearity assumes a straight-line relationship between the independent variables and the dependent variable, which ensures that the model captures the true relationship. Second, independence means that the residuals (errors) are not correlated with one another. Third, homoscedasticity requires that the residuals have constant variance across all levels of the predictors; patterns or funnel shapes in residual plots may indicate a violation. Fourth, normality of residuals ensures that hypothesis tests and confidence intervals derived from the model are valid, and is typically assessed using a Q-Q plot. Lastly, as checked later, the model assumes no multicollinearity, meaning that the independent variables are not highly correlated with each other, since multicollinearity can inflate standard errors and make coefficient estimates unstable.
After model estimation, diagnostic plots were checked to evaluate the assumptions of linear regression. These comprised residuals vs. fitted values, normal Q-Q plots, scale-location plots for homogeneity of variance, and residuals vs. leverage for outlier identification. Cook’s Distance was used to measure the influence of each data point on the regression estimates, combining its leverage and the size of its residual.
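The visual Q-Q check for normal residuals can be complemented by a formal test. The sketch below runs a Shapiro-Wilk test on the residuals of a small model fitted to R's built-in `mtcars` data. Both the stand-in dataset and the use of this test are assumptions for illustration; the report itself relies on the diagnostic plots.

```r
# Illustrative only: built-in mtcars data stands in for the appraisal data
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality
sw <- shapiro.test(res)
print(sw)

# Q-Q plot of the same residuals for a visual comparison
qqnorm(res, main = "Normal Q-Q Plot of Residuals")
qqline(res, col = "red")
```

With only a few dozen observations the test has limited power, so it is best read alongside the Q-Q plot rather than instead of it.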
Verifying points of leverage, outliers, and points of influence
# Outliers
model3$residuals
## 1 2 3 4 5
## 5728.05372 -18322.08187 44336.04713 -45807.83214 124783.75252
## 6 7 8 9 10
## -85791.21310 -38.65217 -49014.80516 18743.26907 35925.97255
## 11 12 13 14 15
## -64819.12000 66107.68646 -53059.54661 -38743.95631 -39532.65417
## 16 17 18 19 20
## 40090.11931 12644.87811 4829.81127 -183548.54871 1755.36374
## 21 22 23 24 25
## 75585.65371 -10499.53131 -19687.27484 12272.78179 -86640.09914
## 26 27 28 29 30
## 67566.18543 63010.54297 19183.30384 -40520.55166 77961.14119
## 31 32 33 34 35
## 2703.02943 3514.50490 6827.99801 33471.60472 -45807.83214
## 36 37 38 39 40
## 19281.88608 7547.33413 36996.44124 -18322.08187 13200.17453
## 41 42 43 44 45
## 44336.04713 -52470.31854 5728.05372 -31595.65626 40090.11931
model3$fitted.values
## 1 2 3 4 5 6 7 8
## 525974.9 523137.1 529222.0 514938.8 1093362.2 655783.2 602465.7 509302.8
## 9 10 11 12 13 14 15 16
## 950022.7 514193.0 774423.1 514861.3 533193.5 506679.0 649850.7 498318.9
## 17 18 19 20 21 22 23 24
## 1205501.1 530161.2 828797.5 509359.6 518846.3 508856.5 511061.3 498893.2
## 25 26 27 28 29 30 31 32
## 504926.1 512613.8 507951.5 518686.7 509530.6 532445.9 522870.0 518442.5
## 33 34 35 36 37 38 39 40
## 523776.0 521574.4 514938.8 519092.1 524577.7 519805.6 523137.1 518419.8
## 41 42 43 44 45
## 529222.0 486815.3 525974.9 545693.7 498318.9
dfOutliers <- df
dfOutliers$fitted <- model3$fitted.values
dfOutliers$residuals <- model3$residuals
#Note that points 9 and 30 are outliers, but point 30 does not appear to have leverage per the plots
# Leverage & Influence
hatvalues(model3)
## 1 2 3 4 5 6 7
## 0.02599412 0.02853996 0.03792851 0.06212403 0.87521625 0.05661006 0.99999919
## 8 9 10 11 12 13 14
## 0.10637205 0.49005419 0.03107769 0.10675883 0.05191473 0.03543308 0.06557517
## 15 16 17 18 19 20 21
## 0.03950801 0.03293721 0.60642007 0.02925276 0.13371755 0.02772535 0.06374753
## 22 23 24 25 26 27 28
## 0.03669430 0.04956392 0.02842347 0.11546540 0.05111619 0.03341317 0.03153928
## 29 30 31 32 33 34 35
## 0.09246670 0.13715690 0.02529293 0.02702496 0.02531986 0.02695823 0.06212403
## 36 37 38 39 40 41 42
## 0.02845697 0.02624340 0.04887929 0.02853996 0.02768140 0.03792851 0.06482963
## 43 44 45
## 0.02599412 0.02904384 0.03293721
sort(hatvalues(model3))
## 31 33 1 43 37 34 32
## 0.02529293 0.02531986 0.02599412 0.02599412 0.02624340 0.02695823 0.02702496
## 40 20 24 36 39 2 44
## 0.02768140 0.02772535 0.02842347 0.02845697 0.02853996 0.02853996 0.02904384
## 18 10 28 16 45 27 13
## 0.02925276 0.03107769 0.03153928 0.03293721 0.03293721 0.03341317 0.03543308
## 22 3 41 15 38 23 26
## 0.03669430 0.03792851 0.03792851 0.03950801 0.04887929 0.04956392 0.05111619
## 12 6 35 4 21 42 14
## 0.05191473 0.05661006 0.06212403 0.06212403 0.06374753 0.06482963 0.06557517
## 29 8 11 25 19 30 9
## 0.09246670 0.10637205 0.10675883 0.11546540 0.13371755 0.13715690 0.49005419
## 17 5 7
## 0.60642007 0.87521625 0.99999919
# point 7 seems to have the most leverage, as its hat value is closest to 1
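The leverage cutoff quoted in the text can be applied directly to the hat values printed above. The sketch below copies the six largest hat values from that output and flags those above \(2p/n\) with \(p = 4\) predictors and \(n = 45\) observations. Some texts use \(2(p+1)/n\) instead, so that variant is computed as an assumption for comparison.

```r
# Six largest hat values, copied from the sort(hatvalues(model3)) output
hat <- c("5"  = 0.87521625, "7"  = 0.99999919, "9"  = 0.49005419,
         "17" = 0.60642007, "19" = 0.13371755, "30" = 0.13715690)

p <- 4    # number of predictors in model3
n <- 45   # number of observations

threshold     <- 2 * p / n         # cutoff used in the text
threshold_alt <- 2 * (p + 1) / n   # alternative convention counting the intercept

flagged     <- names(hat)[hat > threshold]
flagged_alt <- names(hat)[hat > threshold_alt]

cat("Cutoff 2p/n:", round(threshold, 2), "flags:", flagged, "\n")
cat("Cutoff 2(p+1)/n:", round(threshold_alt, 2), "flags:", flagged_alt, "\n")
```

Under either cutoff, observation 5 is flagged together with observations 7, 9, and 17.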
cooksd <- cooks.distance(model3)
plot(cooksd, main = "Cook's Distance")
abline(h = 4/nrow(df), col = "red")
Interpretation: The Cook’s Distance plot offers a diagnostic measure for finding influential points in the regression model. Each point in the plot is the Cook’s distance for a single observation, indicating how much that observation affects the fitted regression coefficients. The horizontal red line, placed at the 4/n threshold where n is the number of observations, is a rule-of-thumb cutoff for identifying potentially influential points. Points whose Cook’s distance exceeds this value should be examined more closely because they can disproportionately affect the model’s estimates and undermine inference. A few observations in the plot for the final model (model3) lie above the red line, indicating that those points have considerable leverage, large residuals, or both, and may be affecting the fit of the model. These influential observations may be outliers, data entry mistakes, or legitimately distinctive properties with traits poorly represented by the predictors in the model. Because such points can affect the model disproportionately, they are removed in the next step to see the impact. From a theoretical point of view, detecting and testing influential points is crucial in regression diagnostics because they can bias parameter estimates, inflate standard errors, and reduce predictive accuracy. Where they are found to be errors in measurement, they may be deleted or trimmed. Otherwise, if they reflect genuine variability within the population, robust regression methods or model re-specification would be needed to reduce their effect.
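The 4/n rule described above can be turned into an explicit list of observations instead of being read off the plot. The sketch below uses R's built-in `mtcars` data as a stand-in, which is an assumption for illustration; the same steps should carry over to `model3` and `df`.

```r
# Illustrative only: built-in mtcars data stands in for the appraisal data
fit <- lm(mpg ~ wt + hp, data = mtcars)

cooksd <- cooks.distance(fit)
cutoff <- 4 / nrow(mtcars)                     # rule-of-thumb threshold 4/n

# Names of the observations whose Cook's distance exceeds the cutoff
influential <- names(cooksd)[cooksd > cutoff]
print(influential)

# Refit without the flagged observations and compare the coefficients
keep  <- !rownames(mtcars) %in% influential
refit <- lm(mpg ~ wt + hp, data = mtcars[keep, ])
print(rbind(full = coef(fit), trimmed = coef(refit)))
```

Comparing the two coefficient rows side by side shows how much the flagged points pull the estimates.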
# remove 7, 9, and 17
finaldf <- df[-c(7,9,17), ]
model_final <- lm( X2025_Market_Value ~ finaldf$Improvement_Market_Value + finaldf$Total_Land_Market_Value + finaldf$Main_Area_Sq_Ft + finaldf$Land_Sq_Ft, data = finaldf)
summary(model_final)
##
## Call:
## lm(formula = X2025_Market_Value ~ finaldf$Improvement_Market_Value +
##     finaldf$Total_Land_Market_Value + finaldf$Main_Area_Sq_Ft +
##     finaldf$Land_Sq_Ft, data = finaldf)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -163007  -22110     165   29504  117256
##
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       4.607e+03  1.055e+05   0.044    0.965
## finaldf$Improvement_Market_Value -4.890e-02  1.430e-01  -0.342    0.734
## finaldf$Total_Land_Market_Value   2.861e+04  3.013e+04   0.949    0.349
## finaldf$Main_Area_Sq_Ft           3.240e+01  4.761e+01   0.681    0.500
## finaldf$Land_Sq_Ft               -1.659e+05  1.748e+05  -0.949    0.349
##
## Residual standard error: 56290 on 37 degrees of freedom
## Multiple R-squared:  0.7987, Adjusted R-squared:  0.7769
## F-statistic:  36.7 on 4 and 37 DF,  p-value: 2.084e-12
After removing the non-significant predictors, the final model retains Improvement_Market_Value, Total_Land_Market_Value, Main_Area_Sq_Ft, and Land_Sq_Ft. The overall F-test shows that these predictors are jointly strongly associated with the response variable (F = 36.7 on 4 and 37 DF, p = 2.084e-12), although none of the individual coefficients is significant at the 5% level in this fit (every predictor p-value is 0.349 or larger), a pattern that is typical when predictors are collinear. Keeping only these predictors simplifies interpretation and reduces overfitting. Refitting without the three points with the most leverage and influence lowers R-squared only slightly, in the thousandths place, which is a smaller change than expected. These points therefore do have leverage and influence, but excluding them does not materially change the fit.
In addition, multicollinearity between predictors was assessed with the Variance Inflation Factor (VIF), which indicated strong collinearity between the squared terms and their linear equivalents. This type of multicollinearity undermines coefficient interpretation and estimation accuracy, which makes model refinement necessary. After removing the non-significant predictors, the VIFs were checked again with the vif function to ensure the validity of the model. \[ VIF_i = \frac{1}{1-R_i^2} \] where \(R_i^2\) is the R-squared obtained by regressing predictor \(i\) on the remaining predictors.
## Check pairwise correlation
# note: df[-1, ] drops the first row; df[, -1] would drop the first column instead
cor_matrix <- cor(df[-1,])
corrplot(cor_matrix, method = "color", type = "upper")
## Multicollinearity Check
#check VIF
vif_final <- vif(model3)
print(vif_final)
## df$Improvement_Market_Value  df$Total_Land_Market_Value
##                    1.201379                 1865.183007
##          df$Main_Area_Sq_Ft               df$Land_Sq_Ft
##                    1.130723                 1864.151900
Two of the VIFs exceed 1,000 (1865.18 for Total_Land_Market_Value and 1864.15 for Land_Sq_Ft), which indicates unresolved collinearity between these two variables. The VIFs for Improvement_Market_Value (1.20) and Main_Area_Sq_Ft (1.13) are close to 1 and raise no concern.
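To make the VIF definition concrete, the following is a minimal base R sketch that computes it directly from the auxiliary regressions, without relying on any package. The data are simulated and hypothetical: `land_value` is constructed as a near-exact multiple of `land_sq_ft` (the factor 5.8 is only an illustrative assumption), and `vif_manual` is a hypothetical helper, not the function used in the report.

```r
# Hypothetical data: land value is almost an exact multiple of land area
set.seed(2)
n <- 45
sim <- data.frame(
  land_sq_ft = runif(n, 6000, 12000),
  main_sq_ft = runif(n, 1800, 3200)
)
sim$land_value <- 5.8 * sim$land_sq_ft + rnorm(n, sd = 50)

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
# predictor i on all of the other predictors
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(sim)
print(round(vifs, 1))
```

In this sketch the two land variables carry almost the same information, so their VIFs are very large while the unrelated predictor stays near 1. Dropping one of the two collinear variables is one common way to bring the remaining VIFs down.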
The final model was used to forecast the market value for the home at address 6321 for comparison with the assessor value and determine if it is overvalued or undervalued. This involved using the data points including improvement market value, total land market value, main area and land area.
#predict value for the home 6321
df_6321 <- data.frame(
Improvement_Market_Value = 494642,
Total_Land_Market_Value = 43767,
Main_Area_Sq_Ft = 2773,
Land_Sq_Ft = 7546
)
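# prediction with confidence interval
# note: model3's terms are written as df$..., so predict() evaluates them on df itself,
# ignores newdata (hence the warning below), and returns fitted values for all 45 rows
predicted_conf <- predict(model3, newdata = df_6321, interval = "confidence", level = 0.95)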
## Warning: 'newdata' had 1 row but variables found have 45 rows
print(predicted_conf) #House 6321 Data on row 16
## fit lwr upr
## 1 525974.9 507775.0 544174.8
## 2 523137.1 504066.8 542207.4
## 3 529222.0 507237.6 551206.3
## 4 514938.8 486802.9 543074.8
## 5 1093362.2 987756.1 1198968.4
## 6 655783.2 628924.9 682641.5
## 7 602465.7 489581.9 715349.4
## 8 509302.8 472486.1 546119.5
## 9 950022.7 870999.7 1029045.7
## 10 514193.0 494292.9 534093.2
## 11 774423.1 737539.5 811306.7
## 12 514861.3 489141.0 540581.7
## 13 533193.5 511944.7 554442.4
## 14 506679.0 477772.1 535585.8
## 15 649850.7 627413.2 672288.1
## 16 498318.9 477832.0 518805.7
## 17 1205501.1 1117595.2 1293407.1
## 18 530161.2 510854.2 549468.2
## 19 828797.5 787518.9 870076.2
## 20 509359.6 490563.4 528155.8
## 21 518846.3 490345.1 547347.6
## 22 508856.5 487232.8 530480.3
## 23 511061.3 485930.0 536192.5
## 24 498893.2 479861.8 517924.6
## 25 504926.1 466568.0 543284.2
## 26 512613.8 487092.0 538135.6
## 27 507951.5 487317.1 528585.8
## 28 518686.7 498639.3 538734.1
## 29 509530.6 475204.5 543856.6
## 30 532445.9 490639.7 574252.0
## 31 522870.0 504917.2 540822.7
## 32 518442.5 499885.2 536999.8
## 33 523776.0 505813.7 541738.3
## 34 521574.4 503040.1 540108.7
## 35 514938.8 486802.9 543074.8
## 36 519092.1 500049.5 538134.7
## 37 524577.7 506290.7 542864.6
## 38 519805.6 494848.5 544762.7
## 39 523137.1 504066.8 542207.4
## 40 518419.8 499638.5 537201.1
## 41 529222.0 507237.6 551206.3
## 42 486815.3 458073.2 515557.4
## 43 525974.9 507775.0 544174.8
## 44 545693.7 526455.7 564931.6
## 45 498318.9 477832.0 518805.7
Reading row 16, the fitted value for 6321 88th Street is $498,318.90, with a 95% confidence interval from $477,832.00 to $518,805.70.
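The warning in the output above arises because the model terms carry a `df$` prefix, so `predict()` cannot match them to the columns of the new data frame. The sketch below, on simulated and hypothetical data (the `homes` data frame and its coefficients are invented for illustration), contrasts that pattern with a fit that uses bare column names and `data =`, which returns a single row for a single new home.

```r
# Hypothetical data standing in for the 45 homes
set.seed(3)
n <- 45
homes <- data.frame(
  Main_Area_Sq_Ft = runif(n, 1800, 3200),
  Land_Sq_Ft = runif(n, 6000, 12000)
)
homes$Market_Value <- 150000 + 120 * homes$Main_Area_Sq_Ft +
  5 * homes$Land_Sq_Ft + rnorm(n, sd = 25000)

# Bare column names plus data = let predict() match the columns of newdata
fit_ok <- lm(Market_Value ~ Main_Area_Sq_Ft + Land_Sq_Ft, data = homes)

# homes$-prefixed terms are looked up directly in 'homes', not in newdata,
# so every training row comes back (R also emits a warning)
fit_bad <- lm(homes$Market_Value ~ homes$Main_Area_Sq_Ft + homes$Land_Sq_Ft)

new_home <- data.frame(Main_Area_Sq_Ft = 2773, Land_Sq_Ft = 7546)

pred_ok <- predict(fit_ok, newdata = new_home,
                   interval = "confidence", level = 0.95)
pred_bad <- suppressWarnings(predict(fit_bad, newdata = new_home))

print(dim(pred_ok))      # rows and columns (fit, lwr, upr) for the new home
print(length(pred_bad))  # one fitted value per training row
```

With the bare-name form there is no need to look up the home's row number in the output, which removes one opportunity for a transcription error.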
# prediction with prediction interval
predicted_pred <- predict(model3, newdata = df_6321, interval = "prediction", level = 0.95)
## Warning: 'newdata' had 1 row but variables found have 45 rows
print(predicted_pred) #House 6321 Data on row 16
## fit lwr upr
## 1 525974.9 411633.4 640316.5
## 2 523137.1 408653.8 637620.4
## 3 529222.0 414217.3 644226.6
## 4 514938.8 398601.5 631276.2
## 5 1093362.2 938780.9 1247943.6
## 6 655783.2 539748.2 771818.2
## 7 602465.7 442823.9 762107.4
## 8 509302.8 390566.9 628038.7
## 9 950022.7 812228.0 1087817.4
## 10 514193.0 399568.6 628817.5
## 11 774423.1 655666.4 893179.8
## 12 514861.3 399084.4 630638.2
## 13 533193.5 418327.3 648059.8
## 14 506679.0 390152.8 623205.2
## 15 649850.7 534758.6 764942.7
## 16 498318.9 383591.1 613046.6
## 17 1205501.1 1062427.0 1348575.2
## 18 530161.2 415638.2 644684.1
## 19 828797.5 708603.2 948991.9
## 20 509359.6 394921.7 623797.6
## 21 518846.3 402420.1 635272.6
## 22 508856.5 393920.3 623792.7
## 23 511061.3 395413.8 626708.7
## 24 498893.2 384416.4 613370.0
## 25 504926.1 385703.2 624149.0
## 26 512613.8 396880.9 628346.7
## 27 507951.5 393197.3 622705.6
## 28 518686.7 404036.6 633336.8
## 29 509530.6 391543.1 627518.0
## 30 532445.9 412069.4 652822.4
## 31 522870.0 408567.5 637172.4
## 32 518442.5 404043.5 632841.4
## 33 523776.0 409472.1 638079.9
## 34 521574.4 407179.2 635969.6
## 35 514938.8 398601.5 631276.2
## 36 519092.1 404613.4 633570.8
## 37 524577.7 410222.3 638933.1
## 38 519805.6 404195.9 635415.3
## 39 523137.1 408653.8 637620.4
## 40 518419.8 403984.3 632855.3
## 41 529222.0 414217.3 644226.6
## 42 486815.3 370329.9 603300.7
## 43 525974.9 411633.4 640316.5
## 44 545693.7 431182.3 660205.0
## 45 498318.9 383591.1 613046.6
Reading row 16 again, the 95% prediction interval runs from $383,591.10 to $613,046.60 around the same fitted value of $498,318.90. The prediction interval is always wider than the confidence interval because it accounts for both the uncertainty in the estimated mean response and the error term of an individual future observation.
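The difference between the two intervals can be written out explicitly. The notation below is an assumption added for illustration and does not appear elsewhere in the report: \(x_0\) is the predictor vector for the home (including the intercept term), \(X\) is the design matrix, \(\hat{\sigma}\) is the residual standard error, and \(p\) is the number of estimated coefficients.

```latex
\[ \text{Confidence interval: } \hat{y}_0 \pm t_{\alpha/2,\,n-p}\,\hat{\sigma}\sqrt{x_0^{\top}(X^{\top}X)^{-1}x_0} \]
\[ \text{Prediction interval: } \hat{y}_0 \pm t_{\alpha/2,\,n-p}\,\hat{\sigma}\sqrt{1 + x_0^{\top}(X^{\top}X)^{-1}x_0} \]
```

The extra 1 under the square root is the contribution of the individual error term. As a rough check with figures quoted in this report: the confidence interval has a half-width of about 20,487, which with \(t_{0.025,\,40} \approx 2.021\) implies a standard error of about 10,137 for the fitted mean. Combining that with the residual standard error of 55850 gives \(\sqrt{55850^2 + 10137^2} \approx 56{,}762\), and multiplying by 2.021 gives a half-width of roughly 114,700, in line with the reported prediction interval half-width of about 114,728.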
The forecast for the house therefore consists of the fitted value together with its lower and upper limits at \(\alpha=0.05\).
Comparing the assessed market value of $538,409 with the predicted value of $498,318.90, the home at 6321 88th Street is overvalued by $40,090.10.
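The comparison can be reproduced with a few lines of arithmetic. The figures below are copied from the output in this report; the helper `inside` is a hypothetical convenience function, and the sketch also reports where the assessed value sits relative to each interval.

```r
# Figures copied from the report for 6321 88th Street
assessed <- 538409
fit <- 498318.9
conf_int <- c(lwr = 477832.0, upr = 518805.7)
pred_int <- c(lwr = 383591.1, upr = 613046.6)

# Gap between the assessed value and the model's fitted value
gap <- assessed - fit
gap_pct <- 100 * gap / assessed

# Hypothetical helper: is a value inside a closed interval?
inside <- function(x, int) x >= int[["lwr"]] && x <= int[["upr"]]

cat(sprintf("Gap: $%.1f (%.1f%% of the assessed value)\n", gap, gap_pct))
cat("Inside 95% confidence interval:", inside(assessed, conf_int), "\n")
cat("Inside 95% prediction interval:", inside(assessed, pred_int), "\n")
```

Which interval is used as the benchmark affects how strongly the result can be stated, so it is worth reporting both checks alongside the dollar gap.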
Questions Answered
How well do the four factors forecast the 2025 home market value?
The model achieves a good fit (R² = 0.89), indicating that the four predictors (Improvement_Market_Value, Total_Land_Market_Value, Main_Area_Sq_Ft, and Land_Sq_Ft) collectively explain most of the variability in 2025 market values. However, a fit this high should be read with caution because it can reflect overfitting (e.g., duplicated variables or circular dependencies in the data), a concern supported by the very high VIFs for Total_Land_Market_Value and Land_Sq_Ft. While the predictors appear to forecast values well in this data set, the model’s reliability for new data is questionable without resolving these issues.
What implications do the results have for property tax and value assessment?
Overvaluation evidence: The 95% confidence interval for 6321 88th Street is $477,832.00–$518,805.70, and the assessed value of $538,409 lies above its upper limit. Relative to the predicted value of $498,318.90, this suggests the home is overvalued by $40,090.10, leading to an unfair tax burden.
The residual standard error (RSE) of the model is reported as 55850, which measures the typical deviation of the predicted values from the observed values in the dataset. It estimates the standard deviation of the residuals, that is, the variability in the response variable (2025_Market_Value) that remains unexplained after accounting for the predictors. With 40 degrees of freedom, the RSE suggests that an individual prediction may be off by roughly $56,000. While this is small relative to the scale of the market values being predicted, it shows that some variation remains unexplained, potentially due to omitted variables or inherent randomness.
The multiple R-squared value for the model is high, 0.89, indicating that most of the variability in the response variable (2025_Market_Value) is explained by the predictors included in the model. This suggests that the combination of Improvement_Market_Value, Total_Land_Market_Value, Main_Area_Sq_Ft, and Land_Sq_Ft captures most of the patterns in the data. However, while a high R-squared is desirable, it should be interpreted cautiously because it may also indicate overfitting. The adjusted R-squared, which penalizes the inclusion of unnecessary predictors, is also high, which supports the view that the selected predictors contribute meaningfully to the model’s explanatory power.
The p-value for the overall F-statistic of the model is very small (< 2.2e-16), confirming the statistical significance of the model as a whole and indicating that at least one predictor has a meaningful relationship with the response variable. Additionally, the individual p-values show that Land_Sq_Ft is significant (p < 0.05), while Main_Area_Sq_Ft is not significant (p = 0.8019). Despite the non-significance of Main_Area_Sq_Ft, its inclusion may still be justified on theoretical or practical grounds. Overall, the p-values reinforce the importance of the significant predictors in explaining the variability in 2025_Market_Value.
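The statistics discussed above (RSE, degrees of freedom, R-squared, adjusted R-squared, the overall F-test, and the coefficient p-values) can all be read off a fitted model's summary object instead of being copied by hand. The sketch below uses simulated, hypothetical data (`homes`, `x1`, `x2`, and `y` are invented names), so its numbers are unrelated to the report's.

```r
# Hypothetical data: one informative predictor and one pure-noise predictor
set.seed(4)
n <- 45
homes <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
homes$y <- 500000 + 60000 * homes$x1 + rnorm(n, sd = 50000)

fit <- lm(y ~ x1 + x2, data = homes)
s <- summary(fit)

rse <- s$sigma               # residual standard error
df_resid <- s$df[2]          # residual degrees of freedom
r2 <- s$r.squared            # multiple R-squared
adj_r2 <- s$adj.r.squared    # adjusted R-squared

# p-value of the overall F-test, computed from the stored F statistic
f <- s$fstatistic
f_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# p-values of the individual coefficients
coef_p <- coef(s)[, "Pr(>|t|)"]

print(c(rse = rse, df = df_resid, r2 = r2, adj_r2 = adj_r2, f_p = f_p))
print(round(coef_p, 4))
```

Pulling the values programmatically keeps the narrative consistent with the model that was actually fitted, which helps when several candidate models are reported side by side.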
In conclusion, the exploratory data analysis showed that the property’s characteristics are consistently close to the average across all major variables: land size, house area, improvement market value, and total land market value, all of which are strongly correlated with market value. The home’s assessed value would therefore be expected to fall within the typical price range. Based on the multiple linear regression model, which compares the property to neighboring homes, the 95% confidence interval places the market value between $477,832.00 and $518,805.70, with a predicted value of $498,318.90. The assessed value of $538,409 lies above this range, which suggests overvaluation. Homes assessed at similar values often feature upgrades such as swimming pools, secondary garages, or other enhancements, features that 6321 88th Street, Lubbock, Texas lacks. Therefore, it should not be valued at the same level as those more amenity-rich properties. Given these factors, it is reasonable to conclude that the home is overvalued by approximately $40,090.10, leading to an unjustified tax burden. A reassessment is recommended to better reflect its true market value.