1 Introduction

It can be said that it is our civil duty to ensure that we are contributing fairly to society; this can be done by voting, going to jury duty, and paying our taxes. Taxes is x-amount of money we pay so that our government can help maintain public services such as roads, schools, emergency responders, and to fund the government itself. There are various kinds of taxes however for this case we will focus on property taxes, which is a tax imposed on property paid by the property owner. The calculation of property tax rates seems simpler than what it is, fundamentally property tax may be found by multiplying the value of the property by a tax rate, which is set by the local government by appraising a home’s value.

The issue with appraisals done by the county appraisal district (CAD) is that they may be subjective and can lead to discrepancies in property taxes between similar homes. This can affect how much is paid in taxes by the homeowner which can lead for an appeal, which is what this is. Specifically this is concerned with the assessed 2025 market value of 6321 88th Street, Lubbock, Texas. The goal is to determine if the appraisal done on this property is fair when comparing it to other homes in the surrounding neighborhood (6309-6351 88th Street). To make sure this appeal is accurate and fair, statistical modeling will be used to evaluate if the assessment made can be supported with real-world data.

Statistical modeling is essentially using mathematical models and statistical assumptions to make predictions about the real world. For this case we will apply a Multiple Linear Regression (MLR) model, which is a technique in statistics that helps model the relationship between dependent and independent variables assuming they’re in a linear relationship. Using nearby home data we’ll get variables such as main living area, garage size, and land area. This will help predict what the 2025 Market Value should be for 6321 and we will evaluate if the appraisal is within statistically reasonable range. If it falls significantly above or below, then we will argue that the property is overvalued or undervalued, respectively.

2 Methodology

2.1 Dataset

For us to evaluate if the Market Value assessment for 2025 was done in a fair manner, we used 6321’s neighboring properties’ data. The data we used for our evaluation contains each property’s square footage, garage size, total land area, and the market value assessed by our county (essentially the homes from 6309 through 6351). We needed to pull our data from the Lubbock Appraisal District website, however we did need to manually sort through the data to ensure we were only using the appropriate values to evaluate the appraisal for 6321.

We had to export a large amount of data, from the website, into excel and from there sorted through the data until we were left with 6321’s neighboring properties and their information. We then uploaded our excel table into an online software for code storage/development called Github. This is so that we did not have to “paste” all our values one-by-one when creating our model. The beginning of our code does include our Github link and looks like this:

# Load libraries
library(ggplot2)
library(dplyr)
library(gridExtra)
library(car)
library(tibble)

#Load the dataset from Github
url <- "https://raw.githubusercontent.com/cri07llo/IE5344-P2/refs/heads/main/property_data.csv"
raw_data <- read.csv(url, stringsAsFactors = FALSE)

When looking at the data set, each row represents a home and also included the following variables:

  • 2025 Market Value – the county’s assessed market value of the home (land + structure)
  • Main Area (Sq. Ft.) – the living area
  • Garage (Sq. Ft.) – non-heated garage space
  • Land (Sq. Ft.) – total lot area
  • Address – used to identify homes and target home (stored as “x6321”)

2.2 Data Preparation

When the neighborhood data was uploaded to Github, it was uploaded as a CSV file which stands for “Comma Separated Values”. Essentially this is just a plain text file that has rows representing a record and each comma-separated value represents a field within that record. This makes the data in the Github link be very raw which can cause some hiccups when we want to evaluate 6321’s market value correctly. To make sure we can interpret the data smoothly we needed to clean and reformat it.

We first needed to transpose our data, which just means swapping the rows and columns of the data, because our data was appearing rotated and this can ensure each row represents a home. We then needed to format the data cleanly by removing dollar signs and commas from numeric fields as to not have any extra characters that may cause our code to have errors. We then needed to trim the white-space from the column names and address fields to ensure the code is able to correctly identify the homes. Lastly we converted columns to numeric values to facilitate the analysis and filtered out any empty or invalid column that the excel table may have exported into our Github CSV.

# Transpose and clean the dataset
transposed <- as.data.frame(t(raw_data[-1]))
colnames(transposed) <- raw_data$VALUE
transposed$Address <- rownames(transposed)

# Clean money formatting and trim whitespace
clean_money <- function(x) {
  as.numeric(gsub("[\\$,]", "", x))
}
colnames(transposed) <- trimws(colnames(transposed))
cols_to_clean <- grep("Main Area|Garage|Land|Market Value", colnames(transposed), value = TRUE)
transposed[cols_to_clean] <- lapply(transposed[cols_to_clean], clean_money)

# Remove empty columns and confirm structure
transposed <- transposed[, colnames(transposed) != ""]
str(transposed)
## 'data.frame':    43 obs. of  11 variables:
##  $ 2025 Market Value             : num  735026 663907 569992 602427 46028 ...
##  $ Total Improvement Market Value: num  677026 603222 511992 538677 415135 ...
##  $ Total Land Market Value       : num  58000 60685 58000 63750 45153 ...
##  $ Homestead Cap Loss            : chr  "0" "0" "0" "0" ...
##  $ Total Main Area (Sq. Ft.)     : num  3905 3898 3611 3786 2747 ...
##  $ Main Area (Sq. Ft.)           : num  3192 3226 3036 2877 2241 ...
##  $ Main Area (Value)             : num  558004 NA 447924 432168 376845 ...
##  $ Garage (Sq. Ft.)              : num  713 672 575 909 506 550 550 325 550 550 ...
##  $ Garage (Value)                : num  56089 NA 38175 61445 38290 ...
##  $ Land (Sq. Ft.)                : num  10000 10463 10000 10625 7785 ...
##  $ Address                       : chr  "X6309" "X6310" "X6311" "X6312" ...

Once the neighborhood data is cleaned up, the evaluation of 6321’s market value can begin.

2.3 Exploratory Data Analysis (EDA)

Before we can begin to model and evaluate the CAD’s appraisal we performed an exploratory data analysis (EDA) on our data. Performing an EDA helps us identify existing patterns, understand data-characteristics, and discover any sorts of anomalies in the appraisal data we collected. Specifically this will help us further understand and validate any assumptions made about the properties’ values and identify if there are any existing relationships that can impact appraisals. To visualize this we will plot our collected data into histograms, boxplots, scatterplots, and residual plots.

p1_hist <- ggplot(transposed, aes(x = `2025 Market Value`)) +
  geom_histogram(binwidth = 50000, fill = "lightblue", color = "darkblue") +
  labs(title = "Histogram: 2025 Market Value", x = "Market Value ($)", y = "Count") +
  theme_minimal()

p1_box <- ggplot(transposed, aes(y = `2025 Market Value`)) +
  geom_boxplot(fill = "cyan", outlier.colour = "orange", outlier.shape = 19) +
  labs(title = "Boxplot: 2025 Market Value", y = "Market Value ($)") +
  theme_minimal()

grid.arrange(p1_hist, p1_box, ncol = 2)

When looking at the histogram we can see that it has a rough bell-shaped distribution centered at around $550,000. This means that there is a near-normal distribution and the average market value is at around $550,000. The majority of the homes appear to fall in a range of $450k - $650k which does shows a relatively consistent appraisal range.

There are outliers however, both in the higher and lower range of things. On the upper end there are multiple homes that have a market value above $750k with a few close to if not exceeding $1M. On the opposite end of the spectrum there is one home that is extremely undervalued in comparison to the rest of the neighborhood.

Our boxplot supports these observations as well. Our interquartile range (IQR), which is the width of the box itself representing the middle 50% of the data used, is narrow and centered which supports that the data distribution is fairly symmetric. We can also see various points above and below which indicate our outliers in appraisal that we highlighted. The recognition of these outliers is crucial when we fit our model as these extreme values can skew our analysis and have a negative impact.

Now that there is an understanding of the distribution of market values,we can follow up by analyzing the physical characteristics of the properties. The first analysis will be done on the Main area (in sq. ft.) which is the total heated living space in each home. Due to this being a key factor in appraisal, examining how it may vary across the neighborhood will help identify if there are any irregularities or outliers.

p2_hist <- ggplot(transposed, aes(x = `Main Area (Sq. Ft.)`)) +
  geom_histogram(binwidth = 100, fill = "lightpink", color = "darkred") +
  labs(title = "Histogram: Main Area", x = "Main Area (Sq. Ft.)", y = "Count") +
  theme_minimal()

p2_box <- ggplot(transposed, aes(y = `Main Area (Sq. Ft.)`)) +
  geom_boxplot(fill = "lightpink", outlier.colour = "red", outlier.shape = 19) +
  labs(title = "Boxplot: Main Area", y = "Main Area (Sq. Ft.)") +
  theme_minimal()

grid.arrange(p2_hist, p2_box, ncol = 2)

Looking at the histogram above, we can see that the distribution is skewed slightly to the left, also referred to as negatively skewed, indicating that a majority of our data is to the right. In reference to the data, it means that a majority of the homes have around 2,500 to 3,000 square feet of heated living space. There is a cluster at around 2,750 square feet, which means this can be the most common size for the main area in homes.

There is an outlier on the lower end with a home having a significantly smaller main area of near 2,000 square feet. The boxplot helps support this, since we do have a single data point well below our IQR. Also, referring to the IQR, in this boxplot, it spans from around 2,550 to 2,950 square feet, and the median is slightly left-centered, which does support our histogram’s left skew. This could impact the regression model we’ll create later if the corresponding market value is not scaled proportionally to the smaller square footage. Next we’ll be exploring the Garage Area data.

p3_hist <- ggplot(transposed, aes(x = `Garage (Sq. Ft.)`)) +
  geom_histogram(binwidth = 50, fill = "lightgreen", color = "darkgreen") +
  labs(title = "Histogram: Garage Size", x = "Garage (Sq. Ft.)", y = "Count") +
  theme_minimal()

p3_box <- ggplot(transposed, aes(y = `Garage (Sq. Ft.)`)) +
  geom_boxplot(fill = "lightgreen", outlier.colour = "orange", outlier.shape = 19) +
  labs(title = "Boxplot: Garage Size", y = "Garage (Sq. Ft.)") +
  theme_minimal()

grid.arrange(p3_hist, p3_box, ncol = 2)

Looking at the histogram for garage size, it is doing the opposite of the main area one, it has a very strong right skew. Most of the homes seem to have a garage with an area of about 500 to 550 square feet, based on the histogram’s cluster at these values. This is most likely the average size of the standard two-car garages available in the neighborhood. There are, however, multiple homes that have relatively larger garage areas, with some exceeding 1,000 square feet, although it is less common. We also see that there are few homes below the cluster, indicating that some homes do have a smaller garage than the average home in our neighborhood.

The boxplot also backs this up, with the IQR being fairly narrow and being just above 500 square feet, which is where we saw the clusters in the histogram. The high-end outliers are clearly present as well in the boxplot, with us being able to see the homes exceeding the average garage area, and our extreme outlier exceeding 1,000 square feet. Additionally, our lower-end outliers are also being supported by our boxplot, with there being two points drastically below the IQR.

The uneven spread of data and multiple outliers could have an impact on the model if the relationship between garage space and market value is not consistent or linear. There is a possibility that larger garages may contribute to disproportionate appraisals, or not at all, and we will verify this further in the model; however, let’s move on to exploring our land area data.

p4_hist <- ggplot(transposed, aes(x = `Land (Sq. Ft.)`)) +
  geom_histogram(binwidth = 500, fill = "plum", color = "purple") +
  labs(title = "Histogram: Land Area", x = "Land Area (Sq. Ft.)", y = "Count") +
  theme_minimal()

p4_box <- ggplot(transposed, aes(y = `Land (Sq. Ft.)`)) +
  geom_boxplot(fill = "plum", outlier.colour = "orange", outlier.shape = 19) +
  labs(title = "Boxplot: Land Area", y = "Land Area (Sq. Ft.)") +
  theme_minimal()

grid.arrange(p4_hist, p4_box, ncol = 2)

Based on the histogram for land area data, we see a similar visual as the garage size histogram, with the plot being heavily right-skewed. There is a cluster at around 8,000 to 9,000 square feet, indicating that the majority of properties have land at around this size, making this the average lot size for our neighborhood. Additionally, there are outliers, as with the garage size, with some homes having much larger land areas, and even a rare case of almost 18,000 square feet of land area.

The boxplot helps support the histogram’s visualization of the right skew as well. The IQR is very close-knit and centered towards the lower end, with multiple high-end outliers extending far above the IQR, and there is even one at around 17,500 square feet. Something interesting to note is that these deviations could reflect homes that are placed on street corners, cul-de-sacs, or even custom-built homes that deviate from the “original” zoning area.

With these outliers being very extreme, their influence on the regression model we create can be very impactful if we do not account for them correctly. The consistency that land size contributes to the property’s market value will be something worth looking into in the model we create. To visualize and explore how each physical characteristic may impact property market value, we will examine scatter plots that compare market value against each of our three predictor variables we have highlighted above (main area, garage size, and land area).

ggplot(transposed, aes(x = `Main Area (Sq. Ft.)`, y = `2025 Market Value`)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Market Value vs Main Area")

ggplot(transposed, aes(x = `Garage (Sq. Ft.)`, y = `2025 Market Value`)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "Market Value vs Garage Size")

ggplot(transposed, aes(x = `Land (Sq. Ft.)`, y = `2025 Market Value`)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "purple") +
  labs(title = "Market Value vs Land Area")

The first set of data we are going to compare is the properties’ market value versus the main area square footage, which is the first scatter plot. In this first scatter plot, we can see that there is a moderately positive relationship between the properties’ main area (square feet) and the 2025 market value. We can see that the relationship between the size of the heated living area increases and 2025 the market value does as well, however it is not perfectly linear. Some extreme data points are present, such as the home with the highest market value but not the largest main area, which could make this an outlier. The trend does indicate that homes with larger living areas are appraised higher, for the most part, which does match expectations.

Contrary to the relationship between market value and main area, the relationship between market value and garage size seems to be much weaker. The majority of the garages are clustered closely at a size of around 500 to 550 square feet, and the trend line is almost horizontal. This does imply that a garage’s size alone will not have a large impact on a home’s market value. There are outliers on the market value axis, which means that most likely other factors have a heavier influence on the high-value assessments.

The strongest trend happens to be in the last scatter plot, which compares the relationship between market value and land area. We can see that there is clearly a positive relationship between these two factors, as when land area increases, so does market value. Numerous homes are densely clustered at around 8,000 to 9,000 square feet, however the homes with much larger plots of land also tend to have a much higher appraisal value. This trend could be driven by outliers, so it will be beneficial to validate if land area has a consistent contribution to market value when we evaluate the regression model.

With the neighborhood’s key data variables explored and their relationship to market value evaluated, we can move on to quantifying these relationships with a linear regression model (LMR). Doing so will enable us to assess the actual combined effect that main area, garage size, and land area have on the 2025 appraised market value. Additionally, we can validate if the assessment done on 6321 88th Street is statistically justified or not.

2.4 Modeling

  • With the EDA completed, we will evaluate how the three main physical characteristics of the homes influence the market value appraisal, which will be done using a multiple linear regression (MLR) model. To reiterate, a MLR predicts a single outcome (dependent variable) based on the combined influence of multiple factors (independent variables). In this case, our dependent variable is the 2025 market value, and the model will assume a linear relationship between each of the three predictors and the market value, which appears to be supported by the scatterplots shown above.

In mathematical terms the equation for multiple linear regression is \[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \varepsilon \]

  • Y is the dependent variable (Market Value)

  • \(X_1, X_2, ... X_n\) are the independent variables (predictors)

  • \(\beta_0, \beta_1, \beta_n\) are coefficients that represent intercepts and influence of each variable

The coefficients are estimated by the model using the observed data. This means that the predictors we choose will have a direct influence on the output, and they should be selected carefully to ensure that the model is both valid and useful. Since we did establish the three predictors in the model (main area, garage size, and land area), we can analyze the MLR to see how each predictor impacts the model.

# Fit a multiple linear regression model
model <- lm(`2025 Market Value` ~ 
              `Main Area (Sq. Ft.)` +
              `Garage (Sq. Ft.)` +
              `Land (Sq. Ft.)`, 
            data = transposed)

# View regression summary
summary(model)
## 
## Call:
## lm(formula = `2025 Market Value` ~ `Main Area (Sq. Ft.)` + `Garage (Sq. Ft.)` + 
##     `Land (Sq. Ft.)`, data = transposed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -431086  -28176   14084   46337  133804 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.256e+05  1.452e+05  -0.865    0.392    
## `Main Area (Sq. Ft.)`  5.768e+01  5.538e+01   1.041    0.304    
## `Garage (Sq. Ft.)`    -2.685e+01  1.166e+02  -0.230    0.819    
## `Land (Sq. Ft.)`       6.257e+01  7.781e+00   8.041 1.01e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94150 on 38 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6768, Adjusted R-squared:  0.6513 
## F-statistic: 26.52 on 3 and 38 DF,  p-value: 1.998e-09

The breakdown for the MLR model is as follows:

  • Residuals

    • These are the differences between predicted and actual values

    • The residuals are wider than expected, with some homes being off by more than $400k, a fairly wide residual means that the model’s predictions are further from the actual observed values.

    • The median residual is fairly low, actually, being around $14k, showing that most of the predictions are not too far off

    • The large residual does suggest some outliers, which we had already flagged

  • Coefficients Table

    • Our land area is extremely significant with a p-value of < 0.001

      • The p-value is a representation of the statistical significance a variable’ has on the outcome, a low p-value suggests the relationship is likely real and not a random variation

      • Looking back at our results, with every extra square foot of land the model predicts market value increases by $62.57

    • Since main area and garage have p-values > 0.05, this could mean they are not statistically significant, meaning we cannot be certain that their effects are not due to random variations based on the data

    • The garage has a negative coefficient, but yet is considered not significant at all, so we should not over-interpret that result

  • Model Fit

    • R-squared value: 0.6768

      • In MLR, the R-squared value indicates how well a model predicts the outcome variable based on the predictor variable(s), sort of like a percentage score for the model’s performance

      • In this model, about 67.7% of variability in market value can be explained by the model itself

    • Adjusted R-squared: 0.6513

      • The adjusted R-squared value is a modified version of R-squared that measures how well a model fits the data, adjusting for the number of predictor variables

      • With the adjusted value of 65.13%, it is still a strong value indicating that our predictors are overall meaningful

    • F-statistic (26.52) and p-value (1.998e-09)

      • The F-statistic and p-value help determine the overall significance of the model

      • The F-statistic measures the ratio of variance to unexplained variance, basically indicating how well the model fits the data

      • The p-value shows how likely we observe the data we have if there is no relationship between the predictors and the response

      • The F-statistic of 26.52 and associated p-value of 1.998e-09 prove that the model is statistically significant

      • This means that three predictors combined explain a significant portion of the variability in appraised market value, and the model performs better than one with no predictors at all

To summarize the information above, the MLR output indicates that land area has the most statistical impact on the market value from 2025, with a coefficient of $62.57 / square foot and a p-value below 0.001. This shows the consistency associated with increasing lot size and higher appraisal values. On the other hand, main area and garage do not have a statistical significance, indicating that their estimated impacts could be due to random variation. Although the garage has a negative coefficient, its lack of significance means it is best not to over-interpret this result. The model is overall strong, with an R-squared value of 0.6768, which means it explains almost 68% of the variability in market value. The F-statistic of 26.52 and the associated p-value (1.998e-09) confirm that the model is statistically significant overall and performs significantly better than a model with no predictors.

Relating these results back to our neighborhood, this helps show that land area is the most impactful factor in influencing a home’s appraised market value. Even though living area and garage size may still matter, their impact is less clear, based on this model, and could vary from house to house. To validate the impact each predictor has, we ran a 95% confidence interval. A 95% confidence interval gives a range of values that is believed to likely contain the true population value (like the true effect of a predictor on market value) with 95% confidence.

# Confidence intervals for model coefficients
confint(model)
##                               2.5 %       97.5 %
## (Intercept)           -419608.01401 168343.66659
## `Main Area (Sq. Ft.)`     -54.43809    169.79687
## `Garage (Sq. Ft.)`       -262.96055    209.26087
## `Land (Sq. Ft.)`           46.81384     78.31728

The confidence interval for land area is from $46.81 to $78.32, confirming that land area has a statistical significance and consistently positive influence on market value. However, main area (−$54.44 to $169.80) and garage size (−$262.96 to $209.26) both cross zero. This further validates that their effects on market value may be caused by random variation, and we cannot be confident that their impact on the model is meaningful. In turn, this does support that land size plays the most critical role in justifying or contradicting the 2025 appraisal value for home 6321.

2.5 Model Evaluation

With the MLR model completed, we can assess if the model’s assumptions are met and there is any data influencing results. We will further go into the model’s residuals, check if any independent variables are highly correlated with each other (multicollinearity), and identify outliers that may influence interpretation or predictions.

We will examine a residuals versus fitted values plot, this is to validate if the model meets assumptions about linearity and error distribution. In a proper MLR model, we should expect the relationship between the predictors and the outcome (market value) to be linear; market value should have a constant increase as square footage increases. Also the residuals should be spread out with no patterns suggesting variance (homoscedasticity). If these assumptions are not met then the model’s predictions loose credibility.

ggplot(model, aes(.fitted, .resid)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs. Fitted Values", 
       x = "Fitted Values", 
       y = "Residuals") +
  theme_minimal()

In the plot above, each point represents a home in our neighborhood, the horizontal axis represents the predicted market value (fitted values) from the model, and the vertical axis shows the residuals. In an ideal model the points should be somewhat parallel to zero with no clear shape or pattern, meaning errors are spread evenly supporting constance variance. In this plot however, various homes are clustered in the middle range with small residuals but some points do become more spread out as predicted values increase. This implies that the model is not too accurate for homes with high predicted values, meaning they could be over or under estimated more frequently. These homes could have unusual characteristics or extreme land sizes which will be explored later on. This type of spread is known as heteroscedasticity, while it does not break the model it is something to keep in mind when evaluating the reliability of higher-valued homes’ predictions. We can further explore these outliers, which homes they are and why they matter.

# Identify potential outliers, leverage, and influential points
influence_metrics <- data.frame(
  Address = rownames(model$model),  # pulled from the actual model
  Fitted = fitted(model),
  Residuals = resid(model),
  Std_Residuals = rstandard(model),
  Student_Residuals = rstudent(model),
  Leverage = hatvalues(model),
  CooksD = cooks.distance(model)
)

# Flagging rule-of-thumb thresholds
n <- nrow(model$model) #number of observations
p <- length(coef(model))  # includes intercept

influence_metrics$Flagged <- with(influence_metrics, 
                                  abs(Student_Residuals) > 2 | 
                                    Leverage > (2 * p / n) | 
                                    CooksD > (4 / n)
)

# View flagged points
subset(influence_metrics, Flagged == TRUE)
##       Address    Fitted  Residuals Std_Residuals Student_Residuals   Leverage
## X6312   X6312  680664.0  -78237.02    -0.9395229        -0.9380372 0.21778614
## X6313   X6313  477114.2 -431086.22    -4.8057097        -7.5716435 0.09232828
## X6314   X6314  839978.0  128787.98     1.8722787         1.9390811 0.46626557
## X6316   X6316  774983.9  -65379.85    -0.7878561        -0.7838488 0.22319925
## X6322   X6322 1135242.6   82903.41     1.2719533         1.2827084 0.52079869
## X6324   X6324  827787.0 -182538.01    -2.1668157        -2.2838559 0.19946875
## X6332   X6332  480288.1   90673.95     1.3799228         1.3970991 0.51295279
##           CooksD Flagged
## X6312 0.06144117    TRUE
## X6313 0.58730140    TRUE
## X6314 0.76557745    TRUE
## X6316 0.04458789    TRUE
## X6322 0.43957628    TRUE
## X6324 0.29246946    TRUE
## X6332 0.50136718    TRUE

In order to understand why some homes may be skewing/distorting the regression model, we analyzed three key diagnostic values:

  • Studentized Residuals: An adjusted version of regular residuals. They highlight homes that have appraised values differing drastically from the model expectation. Normally any value above +2 or below -2 is considered to be unusually high. These could represent incorrectly priced homes or unusual cases that are not captured by the model

  • Leverage: This measures how “far” a home’s characteristics (land area) may be from the rest of the data. Homes that have an unusually sized land area may have more “pull” on the regression line. If the leverage is high (typically above 0.2 or 2×(p/n)), the model may rely heavily on property values to fit the trend

  • Cook’s Distance: This showcases change in the regression model if a particular home was removed. A high Cook’s Distance (typically above 0.1 or 4/n) means the home has a disproportionate influence on the results and could be distorting the regression

In total seven homes were flagged from the neighborhood. This may be due to their appraisal value being far from the model’s prediction (outliers), or they have characteristics that disproportionately affect the model (leverage/influence).

(Note: the homes have an x in front of their number due to the way the data was formatted in the CSV file)

Notably, X6313 had a very large negative residual (−$431,086) and an extreme Studentized residual (−7.57), prompting it as a strong outlier. Homes likes X6314, X6322, and X6332 showed high leverage, meaning they have more influence on the regression line.

While some flagged properties may have valid variation, they do suggest some homes disproportionately impact the model’s behavior. This can reduce how well the regression represents more typical homes in our neighborhood.

Identifying outliers, leverage, and observation points are great ways to see if our analysis can be trusted. However we also need check if the predictors are highly correlated with each other. If they are it can distort how we understand each variable’s individual impact. This is called multicollinearity, and it leads to unreliable regression coefficients. To check this, we used a measure known as Variance Inflation Factor (VIF).

vif(model)
## `Main Area (Sq. Ft.)`    `Garage (Sq. Ft.)`      `Land (Sq. Ft.)` 
##              1.169536              1.052759              1.120601

Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is impacted by the presence of other predictor variables in the model. Essentially it tells you how much a variable’s coefficient estimate is unreliable due to multicollinearity. If two predictors have overlapping data, then it becomes difficult to identify which ones is actually having an influence on the outcome. Traditionally, VIF < 5 means no cause for concern, > 5 indicates some multicollinearity, and > 10 is considered problematic. For our predictors all VIF values are well below 5, which helps us see that none of them have strong correlation with each other. This helps validate that each predictor’s impact on 2025 market value is being measured independently.

Through residual analysis, outlier diagnostics, and multicollinearity checks, we validated that the model is generally well-behaved, with land area being the most reliable predictor. We can now move forward to evaluate how well the model performs for the specific case of 6321 88th Street.

3 Results

With the model built and validated, we can apply it to our main point of identifying if the 2025 market value of 6321 88th Street is over, under, or at an appropriate value. Using our cleaned data set from our CSV file, we can identify our home (shown as “x6321”) and its appraised 2025 market value. From there we can use the MLR model to predict what the home’s value should be based on the 3 independent variables we identified.

We will also calculate a confidence interval (tells us the likely range for the average value of similar homes) and a prediction interval (tells us the range where a new home with the same features might fall). Comparing these to the actual CAD appraisal will help us assess whether the assigned value appears justified or statistically questionable.

# Clean up address format and confirm 6321 is present
transposed$Address <- trimws(transposed$Address)
transposed$Address <- tolower(transposed$Address)
unique(transposed$Address)  # should show "x6321"
##  [1] "x6309" "x6310" "x6311" "x6312" "x6313" "x6314" "x6315" "x6316" "x6317"
## [10] "x6318" "x6319" "x6320" "x6321" "x6322" "x6323" "x6324" "x6325" "x6326"
## [19] "x6327" "x6328" "x6329" "x6330" "x6331" "x6332" "x6333" "x6334" "x6335"
## [28] "x6336" "x6337" "x6338" "x6339" "x6340" "x6341" "x6342" "x6343" "x6344"
## [37] "x6345" "x6346" "x6347" "x6348" "x6349" "x6350" "x6351"
# Extract actual assessed market value
actual_value <- transposed[transposed$Address == "x6321", "2025 Market Value"]
actual_value
## [1] 538409
# Create a new data frame for prediction
new_data <- transposed %>%
  filter(Address == "x6321") %>%
  dplyr::select(`Main Area (Sq. Ft.)`, `Garage (Sq. Ft.)`, `Land (Sq. Ft.)`)

# Predict using your regression model
predicted_value <- predict(model, newdata = new_data)
predicted_value
##    X6321 
## 490537.4

The code block above enables us to identify the predicted 2025 market value of x6321 based on the MLR model, indicating that based on our model it is $490,537.40 , NOT the CAD’s appraised value of $538,409

# Get prediction interval for X6321
interval_result <- predict(model, newdata = new_data, interval = "prediction", level = 0.95)
interval_result
##            fit      lwr      upr
## X6321 490537.4 296444.2 684630.6
# Get confidence interval for X6321
ci_result <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)
ci_result
##            fit      lwr      upr
## X6321 490537.4 453912.4 527162.4

To verify if CAD’s appraisal for home 6321’s 2025 market value is fair, we used the MLR model to calculate a confidence interval and a prediction interval for its value. The model predicts the home’s market value at $490,537.40 based on its physical characteristics (main area, garage size, and land area). The 95% confidence interval ranges from $453,912.40 to $527,162.40, which tells us the average appraised value for similar homes is expected to fall within this range. On the other hand, the 95% prediction interval is wider, from $296,444.20 to $684,630.60, representing the likely range for a home like 6321.

Looking at CAD’s appraised value, it does not fall within the 95% confidence interval but it does fall within the 95% prediction interval. This means the home is appraised at a higher value than similar homes but it is not impossible to be at this value. This indicates that the home may be overvalued in comparison to similar homes in our neighborhood.

4 Conclusion

Based on the MLR analysis of similar homes in our neighborhood, the evidence is strong enough to suggest CAD overvalued 6321 88th Street in their 2025 market value. Although the appraised value of $538,409 falls within the prediction interval; it is above the confidence interval indicating that it is higher than what is typically expected for homes with similar characteristics.

We are asking that the appraisal be re-evaluated and potentially adjusted downward for 6321 88th Street. A value close to the predicted estimate of $490,537.40, or at least within the confidence interval range, would be consistent with the market values of similar properties based on their data. This would help ensure a fair and equitable property tax assessment for the homeowner.

5 Completed Code

# # Load libraries
# library(ggplot2)
# library(dplyr)
# library(gridExtra)
# library(car)
# library(tibble)
# 
# #Load the dataset from Github
# url <- "https://raw.githubusercontent.com/cri07llo/IE5344-P2/refs/heads/main/property_data.csv"
# raw_data <- read.csv(url, stringsAsFactors = FALSE)
# 
# # Transpose the dataset so each row equals one house (our data was looking flipped in prev iterations)
# transposed <- as.data.frame(t(raw_data[-1]))  # remove 'VALUE' label row
# colnames(transposed) <- raw_data$VALUE        # Set column names
# transposed$Address <- rownames(transposed)    # Add address column
# 
# # Clean formatting (remove $, commas, etc.)
# clean_money <- function(x) {
#   as.numeric(gsub("[\\$,]", "", x))
# }
# 
# # Erase white-space from column names (fixes Excel formatting)
# colnames(transposed) <- trimws(colnames(transposed))
# 
# # Auto-detect and clean monetary/square footage columns
# cols_to_clean <- grep("Main Area|Garage|Land|Total Main|Market Value", colnames(transposed), value = TRUE)
# transposed[cols_to_clean] <- lapply(transposed[cols_to_clean], clean_money)
# 
# # Confirm structure
# str(transposed)
# 
# # Drop columns with empty names from excel (this will help data look cleaner)
# transposed <- transposed[, colnames(transposed) != ""]
# str(transposed)
# 
# 
# #Now we'll plot boxplots and histograms for our data
# # 1. Market Value
# p1_hist <- ggplot(transposed, aes(x = `2025 Market Value`)) +
#   geom_histogram(binwidth = 50000, fill = "lightblue", color = "darkblue") +
#   labs(title = "Histogram: 2025 Market Value", x = "Market Value ($)", y = "Count") +
#   theme_minimal()
# 
# p1_box <- ggplot(transposed, aes(y = `2025 Market Value`)) +
#   geom_boxplot(fill = "cyan", outlier.colour = "orange", outlier.shape = 19) +
#   labs(title = "Boxplot: 2025 Market Value", y = "Market Value ($)") +
#   theme_minimal()
# 
# # 2. Main Area
# p2_hist <- ggplot(transposed, aes(x = `Main Area (Sq. Ft.)`)) +
#   geom_histogram(binwidth = 100, fill = "lightpink", color = "darkred") +
#   labs(title = "Histogram: Main Area", x = "Main Area (Sq. Ft.)", y = "Count") +
#   theme_minimal()
# 
# p2_box <- ggplot(transposed, aes(y = `Main Area (Sq. Ft.)`)) +
#   geom_boxplot(fill = "lightpink", outlier.colour = "red", outlier.shape = 19) +
#   labs(title = "Boxplot: Main Area", y = "Main Area (Sq. Ft.)") +
#   theme_minimal()
# 
# # 3. Garage Size
# p3_hist <- ggplot(transposed, aes(x = `Garage (Sq. Ft.)`)) +
#   geom_histogram(binwidth = 50, fill = "lightgreen", color = "darkgreen") +
#   labs(title = "Histogram: Garage Size", x = "Garage (Sq. Ft.)", y = "Count") +
#   theme_minimal()
# 
# p3_box <- ggplot(transposed, aes(y = `Garage (Sq. Ft.)`)) +
#   geom_boxplot(fill = "lightgreen", outlier.colour = "orange", outlier.shape = 19) +
#   labs(title = "Boxplot: Garage Size", y = "Garage (Sq. Ft.)") +
#   theme_minimal()
# 
# # 4. Land Area
# p4_hist <- ggplot(transposed, aes(x = `Land (Sq. Ft.)`)) +
#   geom_histogram(binwidth = 500, fill = "plum", color = "purple") +
#   labs(title = "Histogram: Land Area", x = "Land Area (Sq. Ft.)", y = "Count") +
#   theme_minimal()
# 
# p4_box <- ggplot(transposed, aes(y = `Land (Sq. Ft.)`)) +
#   geom_boxplot(fill = "plum", outlier.colour = "orange", outlier.shape = 19) +
#   labs(title = "Boxplot: Land Area", y = "Land Area (Sq. Ft.)") +
#   theme_minimal()
# 
# # This will help show each plot side-by-side
# grid.arrange(p1_hist, p1_box, ncol = 2)
# grid.arrange(p2_hist, p2_box, ncol = 2)
# grid.arrange(p3_hist, p3_box, ncol = 2)
# grid.arrange(p4_hist, p4_box, ncol = 2)
# 
# #Market Value vs Living Area (Main Area)
# ggplot(transposed, aes(x = `Main Area (Sq. Ft.)`, y = `2025 Market Value`)) +
#   geom_point(alpha = 0.6) +
#   geom_smooth(method = "lm", se = FALSE, color = "darkred") +
#   labs(title = "Market Value vs. Main Living Area",
#        x = "Main Living Area (sq ft)", y = "Market Value ($)") +
#   theme_minimal()
# 
# #Market Value vs Garage Size
# ggplot(transposed, aes(x = `Garage (Sq. Ft.)`, y = `2025 Market Value`)) +
#   geom_point(alpha = 0.6, color = "darkgreen") +
#   geom_smooth(method = "lm", se = FALSE, color = "black") +
#   labs(title = "Market Value vs. Garage Size",
#        x = "Garage Size (sq ft)", y = "Market Value ($)") +
#   theme_minimal()
# 
# #Market Value vs Land Size
# ggplot(transposed, aes(x = `Land (Sq. Ft.)`, y = `2025 Market Value`)) +
#   geom_point(alpha = 0.6, color = "purple") +
#   geom_smooth(method = "lm", se = FALSE, color = "black") +
#   labs(title = "Market Value vs. Lot Size",
#        x = "Land Area (sq ft)", y = "Market Value ($)") +
#   theme_minimal()
# 
# # Fit a multiple linear regression model
# model <- lm(`2025 Market Value` ~ 
#               `Main Area (Sq. Ft.)` +
#               `Garage (Sq. Ft.)` +
#               `Land (Sq. Ft.)`, 
#             data = transposed)
# 
# # View regression summary
# summary(model)
# # Confidence intervals for coefficients
# confint(model) 
# 
# # Residuals vs Fitted Plot to check if assumptions are reasonable
# ggplot(model, aes(.fitted, .resid)) +
#   geom_point(alpha = 0.6, color = "steelblue") +
#   geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
#   labs(title = "Residuals vs. Fitted Values", 
#        x = "Fitted Values", 
#        y = "Residuals") +
#   theme_minimal()
# 
# # Identify potential outliers, leverage, and influential points
# influence_metrics <- data.frame(
#   Address = rownames(model$model),  # pulled from the actual model
#   Fitted = fitted(model),
#   Residuals = resid(model),
#   Std_Residuals = rstandard(model),
#   Student_Residuals = rstudent(model),
#   Leverage = hatvalues(model),
#   CooksD = cooks.distance(model)
# )
# 
# # Flagging rule-of-thumb thresholds
# n <- nrow(model$model) #number of observations
# p <- length(coef(model))  # includes intercept
# 
# influence_metrics$Flagged <- with(influence_metrics, 
#                                   abs(Student_Residuals) > 2 | 
#                                     Leverage > (2 * p / n) | 
#                                     CooksD > (4 / n)
# )
# 
# # View flagged points
# subset(influence_metrics, Flagged == TRUE)
# 
# # Compute and display variance inflation factors
# vif_values <- vif(model)
# vif_values
# vif_df <- data.frame(Variable = names(vif_values), VIF = vif_values)
# print(vif_df)
# 
# # Now we will evaluate the 2025 Market Value of the house on 6321 88th street
# # We need to see how the address is stored
# unique(transposed$Address)
# transposed$Address <- trimws(transposed$Address)            # remove leading/trailing whitespace
# transposed$Address <- tolower(transposed$Address)           # make lowercase for consistent matching
# unique(transposed$Address)
# # Home 6321 is being represented as "X6321"
# 
# # Get the actual assessed value for x6321
# actual_value <- transposed[transposed$Address == "x6321", "2025 Market Value"]
# 
# # Create a new data frame for prediction
# new_data <- transposed %>%
#   filter(Address == "x6321") %>%
#   dplyr::select(`Main Area (Sq. Ft.)`, `Garage (Sq. Ft.)`, `Land (Sq. Ft.)`)
# 
# # Get prediction interval for X6321
# interval_result <- predict(model, newdata = new_data, interval = "prediction", level = 0.95)
# interval_result
# 
# # Get confidence interval for X6321
# ci_result <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)
# ci_result
# 
# # Predict using your regression model
# predicted_value <- predict(model, newdata = new_data)
# 
# # Print the comparison
# cat("Actual Assessed Value: $", round(actual_value, 2), "\n")
# cat("Model Predicted Value: $", round(predicted_value, 2), "\n")