Property tax is a form of revenue that the county uses to fund municipal operations. A tax assessor estimates the value of the property (land plus house), and tax is then levied as a percentage of the assessed value, less any exemptions. The problem is that the assessed value is not determined by an exact formula; the assessor adjusts it based on his or her own judgment. Because of this, homeowners can appeal their assessment to the assessor, and can take it to court if necessary.
This project is concerned with the assessed value of the property at
6321 88th Street in Lubbock, Texas, specifically
whether the assessment is justified given the assessed values of other
homes in the neighborhood.
We will use the multiple linear regression equation \(y=b_0+b_1X_1+b_2X_2+b_3X_3+\dots+b_nX_n\), with coefficients estimated by least squares, to build and evaluate our model,
with:
\(b_0\) = intercept
\(b_1, b_2, b_3, \dots, b_n\) = slope coefficients
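To make the least squares step concrete, here is a minimal sketch that estimates \(b_0, b_1, b_2\) by solving the normal equations and compares the result with `lm()`. The data are simulated toy values (areas in thousands of sq ft, prices in thousands of dollars), not the Lubbock dataset, and the names `x1`, `x2`, `b_manual`, and `b_lm` are illustrative assumptions.

```r
# Toy data (hypothetical values, not the Lubbock dataset)
set.seed(1)
n  <- 30
x1 <- runif(n, 2.0, 3.5)   # e.g. a floor area, in thousands of sq ft
x2 <- runif(n, 0.4, 1.1)   # e.g. a garage area, in thousands of sq ft
y  <- 20 + 120 * x1 + 75 * x2 + rnorm(n, sd = 30)   # price, in thousands of dollars

# Design matrix with a leading column of ones for the intercept b_0
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Least squares estimates: solve the normal equations (X'X) b = X'y
b_manual <- solve(crossprod(X), crossprod(X, y))

# lm() fits the same model
b_lm <- coef(lm(y ~ x1 + x2))

# Side-by-side comparison of the two sets of estimates
cbind(manual = drop(b_manual), lm = b_lm)
```

Both columns should agree up to numerical precision, since `lm()` minimizes the same sum of squared residuals (it uses a QR decomposition internally rather than the normal equations).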
The data used for this project can be collected by following this link: DataSet_Property_Assessment
The initial breakdown of the key variables is given in the table below, following the information on the website ( https://lubbockcad.org/ ).
Variable | Explanation |
2025 Market Value | Total value of property appraisal (house+land) |
Total Improvement Market Value | Total value of house appraisal |
Total Land Market Value | Total value of land appraisal |
Homestead Cap Loss | Represents a discount only in the current tax year if the appraised value from the previous year went up by more than 10% |
Total Main Area (Sq. Ft.) | Total square footage of house |
Main Area (Sq. Ft.) | Total square footage of heated house area |
Main Area (Value) | Total value of heated house area |
Garage (Sq. Ft.) | Total square footage of non-heated house area |
Garage (Value) | Total value of non-heated house area |
Land (Sq. Ft.) | Total square footage of land |
We made the following assumptions when collecting the data:
Second assumption: We excluded “Total Land Market Value” as a predictor variable because it is directly derived from “Land (Sq. Ft.)”. Since land value depends on parcel size and not vice versa, we retained “Land (Sq. Ft.)” as the more fundamental variable for the analysis. This avoids a multicollinearity issue while preserving interpretability, as land area serves as a root driver of market value.
Third assumption: We did not use “Total Improvement Market Value” because we took it to be meant for predicting future prices (such as what a property might be worth in 2027 or 2030). Since we are only looking at what the property at 6321 88th Street is worth right now, by comparing it to a model fitted on the other homes on the same street in the same year, this future-value number was not helpful for our analysis.
We end up with the following breakdown of our variables:
Predicted variable | Predictor variables |
“2025 Market Value” | “Total Main Area (Sq. Ft.)” |
 | “Main Area (Sq. Ft.)” |
 | “Garage (Sq. Ft.)” |
 | “Land (Sq. Ft.)” |
Data loading and initial analysis: import and summarize the dataset to assess completeness and structure.
Exploratory data analysis (EDA): visualize the overall structure of the dataset, examine the distribution of the response variable, and explore correlations among the predictor variables.
Initial model: build an initial regression model and examine both the estimated coefficients and the Variance Inflation Factors (VIFs) to check for multicollinearity.
Model refinement: test various combinations of predictor variables and refine the models to find the most accurate and reliable one.
Prediction: use the final model to predict the market value of the property at 6321 88th Street and compare it to the actual value.
In this part we carry out the initial analysis of our dataset: we examine its basic structure, visually assess the distribution of the response variable, and identify relationships between the predictors. This will guide us and help us better understand our data.
# Load our dataset in R
dataproperty <- read.csv("https://raw.githubusercontent.com/weglo/Data-Analytics-course/refs/heads/main/Dataset_Property_assesment.csv")
colnames(dataproperty) <- c("Market_value_2025", "Total_main_area", "Main_area", "Garage", "Land")
# Structure and summary
head(dataproperty)
## Market_value_2025 Total_main_area Main_area Garage Land
## 1 735026 3462 3192 1063 10000
## 2 663907 3226 3226 1078 10463
## 3 569992 3036 3036 965 10000
## 4 602427 3277 2877 909 10625
## 5 550119 2582 2582 942 7749
## 6 709604 3147 3147 875 11633
str(dataproperty)
## 'data.frame': 34 obs. of 5 variables:
## $ Market_value_2025: int 735026 663907 569992 602427 550119 709604 480134 467935 610318 534991 ...
## $ Total_main_area : int 3462 3226 3036 3277 2582 3147 2842 2382 2925 2806 ...
## $ Main_area : int 3192 3226 3036 2877 2582 3147 2842 2382 2925 2806 ...
## $ Garage : int 1063 1078 965 909 942 875 970 520 957 552 ...
## $ Land : int 10000 10463 10000 10625 7749 11633 8206 7749 9821 8057 ...
colSums(is.na(dataproperty))
## Market_value_2025 Total_main_area Main_area Garage
## 0 0 0 0
## Land
## 0
The dataset contains 34 observations of five integer variables: the response Market_value_2025 and four candidate predictors (Main_area, Garage, Total_main_area, and Land). No missing values were detected, and the data types are appropriate for analysis.
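One quick structural check, sketched here under the assumption that Total_main_area includes the heated Main_area, is to compare the two columns directly. The values below are copied from the six rows printed by `head(dataproperty)`; the names `first_rows` and `Extra_area` are illustrative.

```r
# First six rows as printed by head(dataproperty) above
first_rows <- data.frame(
  Total_main_area = c(3462, 3226, 3036, 3277, 2582, 3147),
  Main_area       = c(3192, 3226, 3036, 2877, 2582, 3147)
)

# Difference between total and heated main area, in sq ft
first_rows$Extra_area <- first_rows$Total_main_area - first_rows$Main_area

# Under the stated assumption, the total is never smaller than the heated area
stopifnot(all(first_rows$Extra_area >= 0))

first_rows
```

In four of these six rows the two areas are identical, which is one reason to expect the two predictors to carry overlapping information.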
Let’s explore the distribution of the response variable.
# Histogram of market value
hist(dataproperty$Market_value_2025,
main = "Distribution of Market Value",
xlab = "Market Value",
col = "skyblue", breaks = 20)
The market value distribution shows a slight right skew, possibly due to high-value outliers.
# Correlation matrix
library(tidyverse)
library(corrplot)
data <- dataproperty[-1]
corrplot(cor(data %>% select_if(is.numeric)), method = "circle")
# Pairwise scatterplots (optional)
pairs(~ Market_value_2025 + Total_main_area + Garage + Main_area, data = dataproperty)
The first figure shows the correlation matrix of the predictors. We notice high correlations between Total_main_area, Main_area, and Garage, which suggests potential multicollinearity.
The pairwise scatterplots suggest a positive, roughly linear relationship between the size-related variables and market value.
Let’s fit a full regression model with all the predictor variables and evaluate multicollinearity.
# Fit the initial model
library(car)
model_initial <- lm(Market_value_2025 ~ Total_main_area + Main_area + Garage + Land, data = dataproperty)
summary(model_initial)
##
## Call:
## lm(formula = Market_value_2025 ~ Total_main_area + Main_area +
## Garage + Land, data = dataproperty)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89907 2088 9016 12713 59906
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3465.85 74238.09 0.047 0.963084
## Total_main_area 121.66 27.45 4.431 0.000123 ***
## Main_area 14.92 36.36 0.410 0.684517
## Garage 73.01 34.80 2.098 0.044759 *
## Land 13.11 7.23 1.814 0.080080 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33570 on 29 degrees of freedom
## Multiple R-squared: 0.7967, Adjusted R-squared: 0.7687
## F-statistic: 28.42 on 4 and 29 DF, p-value: 1.163e-09
# Check Variance Inflation Factors
vif(model_initial)
## Total_main_area Main_area Garage Land
## 1.975669 2.583788 1.697125 2.465179
The equation for the initial model is:
\[ y=3465.85 + 121.66X_1 + 14.92X_2 + 73.01X_3 + 13.11X_4 \]
with \(X_1\) = Total_main_area, \(X_2\) = Main_area, \(X_3\) = Garage, \(X_4\) = Land, and:
\(b_0 = 3465.85\)
\(b_1 = 121.66\)
\(b_2 = 14.92\)
\(b_3 = 73.01\)
\(b_4 = 13.11\)
In this part we evaluate whether any properties have unusually high or low predictor values (Total_main_area, Main_area, Garage, Land), or an unusually large influence on the fit, given the initial model.
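# Load necessary package
library(car)
# Plot 1: Influence Plot (Outliers + Leverage + Cook's Distance)
# id = TRUE labels the noteworthy points automatically (car >= 3.0);
# the older id.method = "identify" relied on interactive clicking
influencePlot(model_initial,
id = TRUE,
main = "Influence Plot",
sub = "Circle size ∝ Cook’s Distance")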
## StudRes Hat CookD
## 6 2.1682997 0.2361676 0.2578225
## 7 -3.2803401 0.1090277 0.1970369
## 11 -0.8242488 0.4277808 0.1027149
## 20 0.3416426 0.9134231 0.2540254
## 25 -3.4792559 0.2239673 0.5052477
Total_main_area is highly significant (p ≈ 0.0001): holding the other predictors constant, each additional sq ft of total main area is associated with an increase of about $122 in market value.
Main_area is not significant (p ≈ 0.68); it adds little to the model and may overlap with Total_main_area (multicollinearity).
Garage is significant (p ≈ 0.045): each additional sq ft of garage adds about $73 to the market value.
Land is not quite statistically significant (p ≈ 0.08), but the estimate suggests that land size may contribute about $13 per sq ft to the value.
Multiple R-squared (\(R^2\)): the model explains 79.7% of the variability in market value, which is a strong fit.
Adjusted R-squared (0.7687) remains high, meaning that even after accounting for its 4 predictors the model performs well.
F-statistic p-value (1.16e-09): the model as a whole is highly statistically significant.
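The relationship between the two R-squared figures can be checked by hand. The sketch below uses the standard adjustment formula with the values reported in the summary (\(R^2 = 0.7967\), 34 observations, 4 predictors); the function name `adj_r2` is an illustrative assumption.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Initial model: R-squared = 0.7967, n = 34 observations, p = 4 predictors
adj_initial <- adj_r2(0.7967, n = 34, p = 4)

round(adj_initial, 4)   # agrees with the 0.7687 reported by summary()
```

The penalty factor \((n-1)/(n-p-1) = 33/29\) is what pulls the adjusted value below the raw \(R^2\).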
Predictor | VIF | Interpretation |
Total_main_area | 1.98 | No multicollinearity concern |
Main_area | 2.58 | Acceptable, but relatively high; may overlap with Total_main_area |
Garage | 1.70 | Low multicollinearity |
Land | 2.47 | Acceptable |
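For readers who want to see where a VIF comes from, the following sketch computes it from its definition, \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) is obtained by regressing predictor \(j\) on the other predictors. It uses simulated toy predictors rather than the Lubbock data, and the names `predictors` and `vif_manual` are illustrative assumptions.

```r
# Toy predictors (hypothetical values, not the Lubbock dataset)
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)   # deliberately correlated with x1
x3 <- rnorm(n)                        # generated independently of x1 and x2
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; the correlated pair `x1` and `x2` should show larger values than `x3`.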
Observation 25 is the most problematic data point: it has a very large negative studentized residual (its value is far below what the model predicts), moderate leverage, and the largest Cook’s distance, so it pulls the fitted model down.
Observation 7 is also a strong outlier, with moderate influence.
Observation 20 is not an outlier but has very high leverage, meaning its predictor values are extreme, perhaps an unusually large house or land area.
Observation 11 has high leverage but does not appear to distort the model much (low Cook’s distance).
Observation 6 is a moderate outlier.
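The reading above can be made more systematic by comparing each diagnostic with a cutoff. The sketch below reuses the numbers printed by `influencePlot()` and applies three common rules of thumb (|studentized residual| > 2, hat value > 2p/n, Cook’s distance > 4/n). These cutoffs are assumptions chosen for illustration, not hard limits, and the name `infl` is illustrative.

```r
# Values copied from the influencePlot() output above
infl <- data.frame(
  obs     = c(6, 7, 11, 20, 25),
  StudRes = c(2.1682997, -3.2803401, -0.8242488, 0.3416426, -3.4792559),
  Hat     = c(0.2361676, 0.1090277, 0.4277808, 0.9134231, 0.2239673),
  CookD   = c(0.2578225, 0.1970369, 0.1027149, 0.2540254, 0.5052477)
)

n <- 34   # observations in the dataset
p <- 5    # estimated coefficients in the initial model (intercept + 4 slopes)

# Rule-of-thumb cutoffs (assumptions, not hard limits)
infl$outlier       <- abs(infl$StudRes) > 2
infl$high_leverage <- infl$Hat > 2 * p / n
infl$influential   <- infl$CookD > 4 / n

infl
```

With these cutoffs, observations 6, 7, and 25 are flagged as outliers and observations 11 and 20 as high-leverage points, in line with the discussion above.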
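All VIFs are under 5, which means there is no severe multicollinearity, but Main_area could be somewhat redundant with Total_main_area.
For our refined model 1, we remove the Main_area variable, since we observed above that it could be somewhat redundant with Total_main_area. We also intended to remove observation 25; note, however, that the summary below reports 30 residual degrees of freedom, which corresponds to all 34 observations, so observation 25 is still included in this fit.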
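# Fit model 1
# NOTE: dataproperty[, -25] indexes columns, not rows, so no observation is dropped;
# the summary below (30 residual degrees of freedom) reflects all 34 observations.
# Dropping row 25 would be dataproperty[-25, ], which would change the results shown.
datarefine <- dataproperty[, -25]
model1 <- lm(Market_value_2025 ~ Total_main_area + Garage + Land, data = datarefine)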
summary(model1)
##
## Call:
## lm(formula = Market_value_2025 ~ Total_main_area + Garage + Land,
## data = datarefine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -89558 -1758 9995 13683 59866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23210.230 55751.978 0.416 0.6801
## Total_main_area 123.627 26.655 4.638 6.46e-05 ***
## Garage 78.290 31.885 2.455 0.0201 *
## Land 14.525 6.269 2.317 0.0275 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33100 on 30 degrees of freedom
## Multiple R-squared: 0.7956, Adjusted R-squared: 0.7751
## F-statistic: 38.91 on 3 and 30 DF, p-value: 1.835e-10
# Check Variance Inflation Factors
vif(model1)
## Total_main_area Garage Land
## 1.915278 1.465023 1.906577
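A note on the subsetting step `dataproperty[,-25]` used for this fit: a negative index after the comma refers to columns, while a negative index before the comma refers to rows. The minimal sketch below illustrates the difference on a toy data frame with hypothetical values; the names `toy`, `cols_form`, and `rows_form` are illustrative.

```r
# Toy data frame (hypothetical values) with 3 rows and 2 columns
toy <- data.frame(value = c(100, 200, 300), area = c(10, 20, 30))

# Negative index in the column position: there is no column 25,
# so nothing is removed and all rows are kept
cols_form <- toy[, -25]

# Negative index in the row position: drops the second observation
rows_form <- toy[-2, ]

nrow(cols_form)  # 3
nrow(rows_form)  # 2
```

Refitting with the row form (`dataproperty[-25, ]`) would leave 33 observations and 29 residual degrees of freedom, and the coefficients would differ from those reported here.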
The equation for model 1 is:
\[ y=23210.230 + 123.627X_1 + 78.290X_2 + 14.525X_3 \]
with \(X_1\) = Total_main_area, \(X_2\) = Garage, \(X_3\) = Land, and:
\(b_0 = 23210.230\)
\(b_1 = 123.627\)
\(b_2 = 78.290\)
\(b_3 = 14.525\)
Total_main_area is highly significant: holding the other predictors constant, each additional sq ft of total main area increases market value by about $123.63.
Garage is significant: each additional sq ft of garage increases market value by about $78.29.
Land is significant: each additional sq ft of land adds about $14.53 to market value.
Multiple R-squared (\(R^2\)): the model explains 79.6% of the variation in market value, a strong fit.
Adjusted R-squared (0.7751) adjusts for the number of predictors and is still strong, slightly higher than the initial model’s 0.7687 despite using one fewer predictor.
F-statistic p-value (1.84e-10): the model is highly significant overall.
All predictors are statistically significant (p < 0.05), meaning they meaningfully contribute to predicting market value.
Predictor | VIF | Interpretation |
Total_main_area | 1.92 | low multicollinearity |
Garage | 1.47 | low multicollinearity |
Land | 1.91 | low multicollinearity |
All VIFs are below 2, so there is no multicollinearity concern: each predictor contributes its own information to the model.
Model 1 is a strong, clean, and well-behaved model:
All predictors (Total_main_area, Garage, Land) are statistically significant.
The model explains 79.6% of the variability in home values.
The residual standard error (about $33,100) is reasonable given that the home values are in the range of several hundred thousand dollars.
There is no multicollinearity concern, meaning our variables are not redundant.
We will now use model 1 ( \(y=23210.230 + 123.627X_1 + 78.290X_2 + 14.525X_3\) ) and a 95% prediction interval to predict the 2025 market value of 6321 88th Street, based on its property characteristics (“Total_main_area”, “Garage”, “Land”).
# Create new data frame for prediction
home_6321 <- data.frame(
Total_main_area = 2773,
Garage = 592,
Land = 7546
)
# Make prediction with prediction interval
predict(model1, newdata = home_6321, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 521981.7 452551.3 591412.1
The predicted market value for the home at 6321 88th Street is about $521,982.
The 95% prediction interval ranges from about $452,551 to $591,412.
The actual assessed value of $538,409 is within the prediction interval: it is about $16,400 (3.1%) above the point prediction and well below the upper bound.
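The point prediction can be reproduced by hand from the model 1 equation, which is a useful check on how the fitted coefficients combine. The sketch below uses the rounded coefficients from the summary, so it matches `predict()` only up to rounding; the names `b`, `x`, and `fit_by_hand` are illustrative.

```r
# Coefficients of model 1 as reported in its summary (rounded)
b <- c(Intercept = 23210.230, Total_main_area = 123.627,
       Garage = 78.290, Land = 14.525)

# Characteristics of 6321 88th Street; the leading 1 multiplies the intercept
x <- c(1, 2773, 592, 7546)

# Point prediction: b_0 + b_1*X_1 + b_2*X_2 + b_3*X_3
fit_by_hand <- sum(b * x)

# Assessed value and the interval bounds reported by predict()
assessed <- 538409
lwr <- 452551.3
upr <- 591412.1

fit_by_hand                          # close to 521981.7, up to coefficient rounding
assessed - fit_by_hand               # gap between assessment and prediction
assessed >= lwr && assessed <= upr   # TRUE if inside the 95% interval
```

The gap between the assessed value and the prediction is roughly half of the model’s residual standard error of about $33,100.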
The predicted market value for 6321 88th Street based on the refined regression model is slightly lower than the actual assessed value of $538,409. However, since the actual value falls well within the model’s 95% prediction interval, the actual assessment is statistically reasonable. It is not significantly overvalued, although it sits somewhat above the model’s central estimate for a home with its characteristics.