1. Introduction

This report compares that 2025 market value for 6321 88th Street in Lubbock, TX, to similarly sized and close homes within two miles of the subject domicile, using a regression analysis. Using techniques such as statistical modeling, prediction intervals, and visual diagnostics, we will explore whether the assessed value is in line with the general trends in the market for comparable properties.

The first step we need to do is understand the data and its structure. This introduces a more insightful analysis that parses out the relationship between some core features of the property and its impact on market value

2. Exploratory Data Analysis (EDA)

It is very important to perform an EDA (Exploratory Data Analysis) on the dataset before we get into regression modeling, as EDA helps us to understand the structure of the dataset, patterns in the dataset, and the underlying distributions. Now, EDA assists in spotting outliers, possible data issues, and important variables in terms of market estimation value. This analysis will guide us in refining our model and ensuring it accurately reflects the data.

2.1 Data Loading and Initial Analysis

In this section, we import and divide the dataset. The data was entered by hand into a CSV and saved in a Github repository for easy download. Once we uploaded the data in R, we carried out some initial tests to make sure that everything was correctly formatted and that all the variables of interest were included for further analysis.

Next we describe each of the key variables collected by using the Lubbock Central Appraisal District (CAD) website (which contains most of the meaningful property information).

2.1.1 Data Collection and Key Variables

All data for this project was pulled from the Lubbock Central Appraisal District (CAD) website. A particularly valuable resource with details on every property in the city is the public property tax database, which has general property information as well as ownership information. We used a specific subset of the data in this analysis to compute and evaluate the market value of 6321 88th Street.

Below is a breakdown of the key variables we used:

Variable Name Explanation
2025 Market Value Total value of property appraisal = Total Improvement Market Value + Total Land Market Value
Total Improvement Market Value Total value of house appraisal = Main Area (Value) + Garage (Value)
Total Land Market Value Total value of land appraisal
Homestead Cap Loss Represents a discount only in the current tax year if the appraised value from the previous year went up by more than 10%
Total Main Area Total square footage of house = Main Area (Sq. Ft.) + Garage (Sq. Ft.)
Main Area Total square footage of heated house area
Main Area (Value) Total value of heated house area
Garage (Sq. Ft.) Total square footage of non-heated house area
Garage (Value) Total value of non-heated house area
Land (Sq. Ft.) Total square footage of land
Additions Combined value of second garages and additional living areas

Also included in the CAD were features such as second garages and extra living areas, which were folded into a new variable called Additions. It is important to mention that due to the small number of the properties in the dataset that also included a pool feature, we removed properties with pools from the analysis because this feature was not relevant to the overall analysis.

url <- "https://raw.githubusercontent.com/Roymorales806/PT_Additonal-Space/refs/heads/main/PT_ADDITIONALSPACE.csv"
df <- read_csv(url) %>% clean_names()
df[] <- lapply(df, as.character)
df$value <- df$value

# Reshape wide to long
df_long <- df %>%
  pivot_longer(-value, names_to = "address", values_to = "val") %>%
  pivot_wider(names_from = value, values_from = val) %>%
  mutate(across(-address, as.numeric)) %>%
  mutate(
    address_number = as.numeric(gsub("^x", "", address)),
    target_6321 = ifelse(address_number == 6321, 1, 0)
  )

# Clean column names
df_long <- df_long %>% rename_with(~ stringi::stri_trim_both(.x)) %>%
  rename(
    market_value_2025 = "2025 Market Value",
    improvement_value = "Total Improvement Market Value",
    land_value = "Total Land Market Value",
    total_main_area = "Total Main Area (Sq. Ft.)",
    main_area = "Main Area (Sq. Ft.)",
    main_area_value = "Main Area (Value)",
    garage_area = "Garage (Sq. Ft.)",
    garage_value = "Garage (Value)",
    land_sqft = "Land (Sq. Ft.)",
    additions = "Additions (Sq. Ft.)"
  )

Observations:

  • The dataset consists of 43 homes and 11 variables.

  • There are no missing values, ensuring the dataset is complete for analysis.

2.2 Dristribuition Visualization

Now we know what data we have along with important variables now we need to visualize the distribution of these variables. This is to catch any patterns or outliers that could be problematic before we continue the regression analysis. Boxplots are important visual tools that summarize data distribution and identify outliers, which become central in determining how much an property’s market value Is fair.

2.2.1 BoxPlots

The boxplots below display the distribution of two important variables: Market Value 2025 and Main Area (Sq. Ft.). By examining the boxplots, we can better understand the spread, central tendency, and presence of any outliers.

# Diagnostic print
print(summary(df_long$market_value_2025))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  418286  511115  537870  572871  580969 1218146       2
print(summary(df_long$main_area))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2041    2585    2744    2721    2908    3226       1
str(df_long$market_value_2025)
##  num [1:43] 735026 663907 569992 602427 NA ...
str(df_long$main_area)
##  num [1:43] 3192 3226 3036 2877 2241 ...
# Safe base plotting to confirm
par(mfrow = c(1, 2), mar = c(5, 3, 4, 0.5) + 0.1, oma = c(2, 0, 2, 0)) 

# No xlim — confirm boxplots show
boxplot(df_long$market_value_2025,
        main = "Boxplot of Market Value 2025",
        xlab = "Market Value 2025",
        col = "lightblue", border = "black",
        horizontal = TRUE)

boxplot(df_long$main_area,
        main = "Boxplot of Main Area",
        xlab = "Main Area (Sq. Ft.)",
        col = "lightgreen", border = "black",
        horizontal = TRUE)

mtext("Comparison of Market Value and Main Area", 
      outer = TRUE, cex = 1.5, font = 2)

Observations:

Observations from the boxplots:

  • Skewness in Market Value: The Market Value 2025 boxplot shows a slight positive skew, indicating that a few properties are significantly higher in value than the majority, suggesting potential outliers that need further examination.

  • Symmetry in Main Area: The Main Area (Sq. Ft.) boxplot is fairly symmetric, indicating that property sizes in the neighborhood are fairly consistent, with no extreme outliers.

These visual insights set the stage for the next step: examining the relationship between Main Area and Market Value to understand how strongly property size influences market value.

2.3 Relationship Analysis

We will investigate the effect of key property features on market value using scatterplots in this section and a correlation matrix. What you want to do here is explore how your features relate to each other and how they help you explain the market value of the properties. Understanding these relationships both provides guidance for the regression model and which variables are impactful.

2.3.1 Scatterplot and Correlation matrix

By plotting the Main Area against the Market Value 2025, we aim to visualize the relationship between property size and value. Additionally, the correlation matrix provides a quantitative measure of the strength of relationships between several key variables, including Land Area, Additions, and Market Value.

# Install ggcorrplot if not already installed
if (!requireNamespace("ggcorrplot", quietly = TRUE)) {
  install.packages("ggcorrplot")
}

# Load necessary libraries
library(ggcorrplot)


plot(df_long$main_area, df_long$market_value_2025,
     main = "Scatter Plot: Main Area vs Market Value 2025",
     xlab = "Main Area (Sq. Ft.)",
     ylab = "Market Value 2025",
     pch = 20, col = "blue")

# Compute the correlation matrix for numeric columns
cor_matrix <- cor(df_long %>% select(main_area, market_value_2025, land_sqft, additions))

# Visualize the correlation matrix using ggcorrplot
ggcorrplot(cor_matrix, 
           lab = TRUE,               
           colors = c("red", "white", "#4A90E2"), 
           outline.color = "black",   
           show.legend = TRUE)

Observations: - Positive Relationship Between Main Area and Market Value: The scatter plot clearly shows a positive linear relationship between Main Area (Sq. Ft.) and Market Value 2025. As the size of the property increases, so does its market value, which aligns with common expectations in property assessments.

  • Strong Correlation Among Key Variables: The correlation matrix shows a perfect correlation (1.0) between Main Area, Land Area, and Market Value 2025, indicating that these variables are highly related. Larger homes (with more square footage) typically correspond to higher property values.

  • Additions’ Correlation with Other Variables: Additions (Sq. Ft.) has strong correlations with both Main Area and Land Area, suggesting that properties with more square footage in terms of main area and land also tend to have larger additions. However, Additions is less correlated with Market Value 2025, which may indicate that it plays a secondary role in determining the market value compared to size.

  • No Negative Correlations: There are no negative correlations between the variables, which suggests that increasing one property feature (like main area or land area) generally leads to increases in market value and related dimensions, reflecting a consistent trend of larger properties being valued higher.

These findings suggest a strong relationship between property size and market value, with Main Area and Land Area playing crucial roles in determining the assessed value of properties.

3. Regression model Fitting and Assumption Checking

With an understanding of the data, its distribution, and the relationships between variables, we proceed to build a Multiple Linear Regression (MLR) model to predict the 2025 market value. This model will allow us to quantify the impact of various property features on the assessed value.

In this section, we fit the model and check the assumptions of linearity, normality, and homoscedasticity to ensure that the model is well-specified and that the results are reliable. In this section, we fit a multiple linear regression model to predict the 2025 Market Value based on key property characteristics. The general formula for the regression model is:

\[ \text{Market Value 2025} = \beta_0 + \beta_1 \times \text{Main Area} + \beta_2 \times \text{Garage Size} + \beta_3 \times \text{Land Size} + \epsilon \]

Where: - \(\text{Market Value 2025}\) is the dependent variable, - \(\text{Main Area}\), \(\text{Garage Size}\), and \(\text{Land Size}\) are the independent variables, - \(\beta_0\) is the intercept (the predicted market value when all predictors are zero), - \(\epsilon\) is the error term (the difference between the observed and predicted values).

The coefficients (\(\beta_1, \beta_2, \beta_3\)) are estimated by the model and represent the change in the market value for a one-unit change in each predictor (holding other predictors constant).

\[ \text{Market Value}_{2025} = \beta_0 + \beta_1(\text{Main Area}) + \beta_2(\text{Garage Area}) + \beta_3(\text{Land SqFt}) + \beta_4(\text{Additions}) + \epsilon \]

Where: - \(\beta_0\): Intercept (baseline value) - \(\beta_i\): Coefficients (unit effect of predictors) - \(\epsilon\): Error term assumed \(N(0, \sigma^2)\)

model <- lm(market_value_2025 ~ main_area + garage_area + land_sqft + additions, data = df_long)
summary(model)
## 
## Call:
## lm(formula = market_value_2025 ~ main_area + garage_area + land_sqft + 
##     additions, data = df_long)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -96968 -16880   8356  19849 126537 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2202.220  78356.188  -0.028   0.9777    
## main_area      71.295     34.141   2.088   0.0439 *  
## garage_area    -6.502     59.671  -0.109   0.9138    
## land_sqft      40.693      6.276   6.484 1.57e-07 ***
## additions     110.316     23.864   4.623 4.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47990 on 36 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8925, Adjusted R-squared:  0.8806 
## F-statistic: 74.72 on 4 and 36 DF,  p-value: < 2.2e-16
vif(model)
##   main_area garage_area   land_sqft   additions 
##    1.591044    1.056230    2.791303    2.574771

Observation:

Significant Predictors: - Main Area (Sq. Ft.) and Additions (Sq. Ft.) are statistically significant predictors for Market Value 2025 as their p-values are both less than 0.05 (0.0439 for main_area and 4.73e-05 for additions). -Garage Area (Sq. Ft.) has a non-significant p-value of 0.9138, indicating it does not contribute significantly to explaining the variance in market value.

R-squared Value:

The R-squared value is calculated to assess how well the model fits the data. It represents the proportion of the variance in the dependent variable (Market Value 2025) that is explained by the independent variables. The formula for R-squared is:

\[ R^2 = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} \]

Where: - \(Y_i\) are the observed values of Market Value 2025, - \(\hat{Y}_i\) are the predicted values from the regression model, - \(\bar{Y}\) is the mean of the observed values.

A higher R-squared value indicates a better model fit, with values closer to 1 indicating that the model explains most of the variance in the data.

Residuals: - The residuals show a wide range, with the minimum residual being -96,968 and the maximum residual being 126,537, suggesting some large deviations from the predicted values. This might indicate outliers or areas where the model doesn’t fit well.

Variance Inflation Factor (VIF): Multicollinearity occurs when two or more predictor variables are highly correlated, which can distort the regression model. To detect multicollinearity, we use the Variance Inflation Factor (VIF), which quantifies how much the variance of the estimated regression coefficients is inflated due to collinearity with other predictors.

The formula for VIF is:

\[ VIF = \frac{1}{1 - R^2} \]

Where \(R^2\) is the R-squared value obtained by regressing each predictor on all the other predictors. A VIF greater than 10 indicates problematic multicollinearity, meaning that the predictor is highly correlated with other predictors in the model.

These observations highlight the strengths and weaknesses of the regression model, indicating that Main Area and Additions are crucial for determining the Market Value 2025, but Garage Area is not a significant predictor in this case.

Interpretation: - Coefficients represent the marginal dollar contribution per square foot. - main_area, land_sqft, and additions are statistically significant (p < 0.05). - garage_area is not significant and may be omitted in future modeling. - Adjusted \(R^2 = 0.88\): Model explains 88% of variance in market value.

4. Prediction for 6321

Using the model, we calculate: - Confidence Interval (CI): expected mean value of homes like 6321. - Prediction Interval (PI): range a new individual observation might fall.

To understand the precision of the estimated regression coefficients, we calculate confidence intervals (CIs). A 95% confidence interval provides a range of values within which we are 95% confident that the true coefficient lies.

The formula for the confidence interval of each regression coefficient is:

\[ CI = \hat{\beta} \pm t_{\alpha/2, n-p} \cdot SE(\hat{\beta}) \]

Where: - \(\hat{\beta}\) is the estimated coefficient, - \(t_{\alpha/2, n-p}\) is the critical value from the t-distribution, - \(SE(\hat{\beta})\) is the standard error of the coefficient.

ci <- predict(model, newdata = df_long[df_long$target_6321 == 1, ], interval = "confidence")
pi <- predict(model, newdata = df_long[df_long$target_6321 == 1, ], interval = "prediction")
ci
##        fit      lwr      upr
## 1 498719.6 479949.6 517489.6
pi
##        fit      lwr      upr
## 1 498719.6 399593.3 597845.9

Interpretation: - Predicted Value: ~$498,720 - 95% CI: [$479,950 – $517,490] → uncertainty around the mean - 95% PI: [$399,593 – $597,846] → uncertainty around a single home

If 6321’s assessed value lies outside the PI, it likely indicates unfair valuation.

5. Visualization

The following plot visualizes the actual vs predicted market values, highlighting 6321 88th Street and showing the prediction bands. This allows us to assess how well the model fits the data and whether the assessed value of the property is reasonable compared to the prediction.

pred_df <- df_long %>%
  mutate(
    predicted = predict(model, newdata = df_long),
    pred_interval = predict(model, newdata = df_long, interval = "prediction"),
    ci_lower = pred_interval[, "lwr"],
    ci_upper = pred_interval[, "upr"]
  )

ggplot(pred_df, aes(x = address_number)) +
  geom_point(aes(y = market_value_2025, color = target_6321 == 1), size = 3) +
  geom_line(aes(y = predicted), linetype = "dashed") +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2, fill = "blue") +
  scale_color_manual(values = c("black", "red")) +
  labs(title = "Actual vs Predicted Market Values with Prediction Bands",
       x = "Address Number", y = "Market Value", color = "6321 Highlight") +
  theme_minimal()

6. Comparison Table: Actual vs Predicted Values

A table comparing the actual market value with the predicted value and the confidence intervals provides a detailed comparison of each property in the dataset. This helps further confirm the reliability of the model and its predictions.

pred_df %>%
  select(address_number, market_value_2025, predicted, ci_lower, ci_upper) %>%
  mutate(diff = market_value_2025 - predicted) %>%
  arrange(address_number)

7. Model Diagnostics

In this section, we assess the diagnostics of the regression model. We evaluate the residuals for linearity, normality, and homoscedasticity. This step is crucial for validating the assumptions of the regression model and ensuring the accuracy of the predictions.

par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

Evaluation of Assumptions: - Linearity: Mostly linear patterns observed. - Normality: Residuals nearly normal. - Homoscedasticity: Slight variance imbalance. - Influence: No points show excessive leverage or Cook’s Distance.

8. Conclusion

The estimated market value of 6321 88th Street is based on the regression results for neighboring houses and is approximately $498,720±$99,127 (95% prediction interval: $399,593–597,846). The tax authorities assessed the market value at $538,409, top end of the prediction interval, which indicates that this property could be overvalued compared to similar homes.

Even though the assessed value technically falls in the prediction interval, it is so close to the upper end that this suggests that the property is overpriced. Since the predicted value takes into consideration important property attributes like size, land area, and additions, a closer amount to the predicted value of $498,720 would be more aligned with what we see on the broader market with similar properties.

Dockblock Note: The assessed value is, on paper, within the prediction interval — but just barely, and closer to the upper bound (in fact, at the top of our interval), indicating to us that 6321 88th Street is relatively high-priced compared to what similar homes have sold for. Because the predicted value is based around key property features (e.g. size, land area, additions), the assessed market value of $538,409 is higher than what would be predicted based on the aforementioned features.

If you believe your taxes are way too high, you should keep in mind the amount the property is worth may cause the tax assessment to go over the current assessment and therefore, be over the current value of $498,720. With a wide prediction interval, we have enough uncertainty in our pipeline to say the home’s price is unsupported by the current model in the market, and a downward adjustment closer to the predicted sales price would be more consistent with members of the housing market broader market.

This involves weaving between statistical theory and applied assessment to make a strong case in favor of the fair property tax appeal.

Full Code

# Setup chunk
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(readr)
library(dplyr)
library(janitor)
library(ggplot2)
library(car)
library(broom)
library(tidyr)
library(stringi)

# Data Cleaning
url <- "https://raw.githubusercontent.com/Roymorales806/PT_Additonal-Space/refs/heads/main/PT_ADDITIONALSPACE.csv"
df <- read_csv(url) %>% clean_names()
df[] <- lapply(df, as.character)
df$value <- df$value

# Reshape wide to long
df_long <- df %>%
  pivot_longer(-value, names_to = "address", values_to = "val") %>%
  pivot_wider(names_from = value, values_from = val) %>%
  mutate(across(-address, as.numeric)) %>%
  mutate(
    address_number = as.numeric(gsub("^x", "", address)),
    target_6321 = ifelse(address_number == 6321, 1, 0)
  )

# Clean column names
df_long <- df_long %>% rename_with(~ stringi::stri_trim_both(.x)) %>%
  rename(
    market_value_2025 = "2025 Market Value",
    improvement_value = "Total Improvement Market Value",
    land_value = "Total Land Market Value",
    total_main_area = "Total Main Area (Sq. Ft.)",
    main_area = "Main Area (Sq. Ft.)",
    main_area_value = "Main Area (Value)",
    garage_area = "Garage (Sq. Ft.)",
    garage_value = "Garage (Value)",
    land_sqft = "Land (Sq. Ft.)",
    additions = "Additions (Sq. Ft.)"
  )

# EDA: Boxplot visualization
print(summary(df_long$market_value_2025))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  418286  511115  537870  572871  580969 1218146       2
print(summary(df_long$main_area))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2041    2585    2744    2721    2908    3226       1
str(df_long$market_value_2025)
##  num [1:43] 735026 663907 569992 602427 NA ...
str(df_long$main_area)
##  num [1:43] 3192 3226 3036 2877 2241 ...
par(mfrow = c(1, 2), mar = c(5, 3, 4, 0.5) + 0.1, oma = c(2, 0, 2, 0)) 
boxplot(df_long$market_value_2025,
        main = "Boxplot of Market Value 2025",
        xlab = "Market Value 2025",
        col = "lightblue", border = "black",
        horizontal = TRUE)

boxplot(df_long$main_area,
        main = "Boxplot of Main Area",
        xlab = "Main Area (Sq. Ft.)",
        col = "lightgreen", border = "black",
        horizontal = TRUE)

mtext("Comparison of Market Value and Main Area", 
      outer = TRUE, cex = 1.5, font = 2)

# Correlation and Scatter plot
if (!requireNamespace("ggcorrplot", quietly = TRUE)) {
  install.packages("ggcorrplot")
}

library(ggcorrplot)

plot(df_long$main_area, df_long$market_value_2025,
     main = "Scatter Plot: Main Area vs Market Value 2025",
     xlab = "Main Area (Sq. Ft.)",
     ylab = "Market Value 2025",
     pch = 20, col = "blue")

# Correlation matrix
cor_matrix <- cor(df_long %>% select(main_area, market_value_2025, land_sqft, additions))

# Visualize correlation matrix
ggcorrplot(cor_matrix, 
           lab = TRUE,               
           colors = c("red", "white", "#4A90E2"), 
           outline.color = "black",   
           show.legend = TRUE)

# Regression Model
model <- lm(market_value_2025 ~ main_area + garage_area + land_sqft + additions, data = df_long)
summary(model)
## 
## Call:
## lm(formula = market_value_2025 ~ main_area + garage_area + land_sqft + 
##     additions, data = df_long)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -96968 -16880   8356  19849 126537 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2202.220  78356.188  -0.028   0.9777    
## main_area      71.295     34.141   2.088   0.0439 *  
## garage_area    -6.502     59.671  -0.109   0.9138    
## land_sqft      40.693      6.276   6.484 1.57e-07 ***
## additions     110.316     23.864   4.623 4.73e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47990 on 36 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8925, Adjusted R-squared:  0.8806 
## F-statistic: 74.72 on 4 and 36 DF,  p-value: < 2.2e-16
vif(model)
##   main_area garage_area   land_sqft   additions 
##    1.591044    1.056230    2.791303    2.574771
# Prediction intervals for 6321
ci <- predict(model, newdata = df_long[df_long$target_6321 == 1, ], interval = "confidence")
pi <- predict(model, newdata = df_long[df_long$target_6321 == 1, ], interval = "prediction")
ci
##        fit      lwr      upr
## 1 498719.6 479949.6 517489.6
pi
##        fit      lwr      upr
## 1 498719.6 399593.3 597845.9
# Prediction Plot
pred_df <- df_long %>%
  mutate(
    predicted = predict(model, newdata = df_long),
    pred_interval = predict(model, newdata = df_long, interval = "prediction"),
    ci_lower = pred_interval[, "lwr"],
    ci_upper = pred_interval[, "upr"]
  )

ggplot(pred_df, aes(x = address_number)) +
  geom_point(aes(y = market_value_2025, color = target_6321 == 1), size = 3) +
  geom_line(aes(y = predicted), linetype = "dashed") +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2, fill = "blue") +
  scale_color_manual(values = c("black", "red")) +
  labs(title = "Actual vs Predicted Market Values with Prediction Bands",
       x = "Address Number", y = "Market Value", color = "6321 Highlight") +
  theme_minimal()

# Comparison Table
pred_df %>%
  select(address_number, market_value_2025, predicted, ci_lower, ci_upper) %>%
  mutate(diff = market_value_2025 - predicted) %>%
  arrange(address_number)
# Model Diagnostics
par(mfrow = c(2, 2))

plot(model)

par(mfrow = c(1, 1))

End of report.