1 INTRODUCTION

In property taxation, accurate market value assessment is essential to ensure that homeowners contribute fairly to municipal revenue without being overcharged. However, in practice, the valuation process is often based on the discretion of the county tax assessor rather than strict statistical procedures. This raises the possibility of discrepancies between a property’s true market value and its assessed value.

This study focuses on the residential property located at 6321 88th Street, Lubbock, Texas, and aims to determine whether its 2025 assessed market value is statistically justified when compared to similar homes on the same street (address range 6309–6351). Given the availability of detailed property data—including land size, heated area, garage size, and their respective values—multiple regression modeling provides a suitable framework to assess this fairness.

This report applies Exploratory Data Analysis (EDA) and Multiple Linear Regression Modeling to estimate property market values based on physical and structural characteristics. The analysis is structured as follows:

  1. Data Collection and Initial Preparation: Manual extraction of property data from the Lubbock Central Appraisal District website, followed by standardization and cleaning of variables.

  2. Exploratory Data Analysis (EDA): Statistical and visual examination of distributions, detection of missing values, and correlation analysis to understand variable relationships and potential model risks (e.g., multicollinearity).

  3. Model Construction and Diagnostics: Fitting an initial full regression model with main effects, interaction, and quadratic terms, followed by checks for significance, residual behavior, and influential points.

  4. Model Refinement and Optimization: Stepwise reduction of predictors based on p-values and multicollinearity (GVIF) to improve interpretability while maintaining predictive accuracy.

  5. Prediction and Interval Estimation: Using the final model to predict the market value of 6321 88th Street, with 95% confidence and prediction intervals to assess valuation reliability.


2 DATA COLLECTION

The dataset was manually collected from the official website of the Lubbock Central Appraisal District at https://lubbockcad.org. The goal was to gather appraisal information on properties located along 88th Street in Lubbock, Texas, specifically within the street number range 6309 to 6351, which includes the target property: 6321 88th Street.

The data was obtained using the Advanced Search tool. By entering the street name (“88th St”), selecting the 2025 tax year

A list of relevant residential properties was generated. Each property record was accessed individually, and relevant details were manually recorded. The resulting dataset includes appraisal information for each property, with the following variables:

  • Property ID: Unique property identifier.

  • Situs Address: Property’s physical location.

  • Property Type: Classification (e.g., Residential Single Family).

  • 2025 Assessed Value: Taxable value used to calculate property taxes.

  • 2025 Market Value: Total market value (land + improvements).

  • Total Improvement Market Value: Appraised value of all structures.

  • Total Land Market Value: Appraised value of the land portion only.

  • Homestead Cap Loss: Tax discount based on the Texas 10% cap rule.

  • Total Main Area (Sq. Ft.): Enclosed area (heated + unheated).

  • Main Area (Sq. Ft.): Heated, livable square footage.

  • Main Area (Value): Value attributed to the livable area.

  • Garage (Sq. Ft.): Unheated/non-livable space (e.g., garage).

  • Garage (Value): Value of the garage or similar structures.

  • Pool (Sq. Ft.): Surface area of pool (if applicable).

  • Pool (Value): Appraised value of the pool.

  • Land (Sq. Ft.): Lot size in square feet.

All collected data was organized and stored in a CSV file, which was uploaded to a public GitHub repository. The dataset can be accessed directly via the following link:

https://raw.githubusercontent.com/JairoRodriguezB/Datasets/refs/heads/main/Property_Assessment_88thStreet_2025.csv



3 EXPLORATORY DATA ANALYSIS (EDA)

This section provides an exploration of the dataset to prepare it for regression modeling. The process begins with importing and inspecting the raw data, followed by renaming variables, converting data types, and handling missing values. The dataset is then filtered to retain only relevant residential properties and core numerical predictors.

Subsequent analyses focus on understanding the distribution of each variable through visual tools like boxplots and evaluating the relationships between features using a correlation matrix. These exploratory steps are essential for identifying patterns, detecting outliers, and diagnosing potential multicollinearity, informing later decisions on model construction and variable selection.


3.1 DATA LOADING AND INITIAL ANALYSIS

In this section, the dataset is imported directly from a public GitHub repository and prepared for analysis. Variable names are standardized for clarity, relevant columns are converted to numeric types, and missing values are addressed. Summary statistics and data structure are then reviewed to gain an initial understanding of the properties’ characteristics and to identify any potential data quality issues before proceeding with deeper analysis.

DATA LOADING

# Create the dataframe
# Raw GitHub URL
url <- "https://raw.githubusercontent.com/JairoRodriguezB/Datasets/refs/heads/main/Property_Assessment_88thStreet_2025.csv"

# Read the CSV file
data <- read.csv(url)
rmarkdown::paged_table(data)


INITIAL DATA ANALYSIS

To prepare the dataset for analysis, variable names were standardized for clarity and consistency, and key columns were converted from character strings to integers. Missing values originally labeled as “N/A” were appropriately recoded as NA to ensure proper handling in subsequent statistical procedures.

Variable Renaming and Type Conversion

# Rename columns to facilitate analysis
colnames(data) <- c(
  "property_id",
  "address",
  "property_type",
  "assessed_value_2025",
  "market_value_2025",
  "improvement_value_total",
  "land_value_total",
  "homestead_cap_loss",
  "area_total_sqft",
  "main_sqft",
  "main_value",
  "garage_sqft",
  "garage_value",
  "pool_sqft",
  "pool_value",
  "land_sqft"
)

# Convert selected columns from character to integer, replacing "N/A" with NA
data <- data %>%
  mutate(
    improvement_value_total = as.integer(na_if(as.character(improvement_value_total), "N/A")),
    land_value_total = as.integer(na_if(as.character(land_value_total), "N/A")),
    homestead_cap_loss = as.integer(na_if(as.character(homestead_cap_loss), "N/A")),
    area_total_sqft = as.integer(na_if(as.character(area_total_sqft), "N/A")),
    main_sqft = as.integer(na_if(as.character(main_sqft), "N/A")),
    main_value = as.integer(na_if(as.character(main_value), "N/A")),
    garage_sqft = as.integer(na_if(as.character(garage_sqft), "N/A")),
    garage_value = as.integer(na_if(as.character(garage_value), "N/A")),
    pool_sqft = as.integer(na_if(as.character(pool_sqft), "N/A")),
    pool_value = as.integer(na_if(as.character(pool_value), "N/A")),
    land_sqft = as.integer(na_if(as.character(land_sqft), "N/A"))
  )

Initial Data Inspection

# Summary statistics for numerical variables
summary(data)  
##  property_id          address          property_type      assessed_value_2025
##  Length:45          Length:45          Length:45          Min.   :   3000    
##  Class :character   Class :character   Class :character   1st Qu.: 498357    
##  Mode  :character   Mode  :character   Mode  :character   Median : 532125    
##                                                           Mean   : 533709    
##                                                           3rd Qu.: 580180    
##                                                           Max.   :1218146    
##                                                                              
##  market_value_2025 improvement_value_total land_value_total homestead_cap_loss
##  Min.   :   3000   Min.   : 373092         Min.   : 43506   Min.   :0         
##  1st Qu.: 498357   1st Qu.: 460406         1st Qu.: 44986   1st Qu.:0         
##  Median : 532125   Median : 490236         Median : 45658   Median :0         
##  Mean   : 533709   Mean   : 519710         Mean   : 50480   Mean   :0         
##  3rd Qu.: 580180   3rd Qu.: 535879         3rd Qu.: 47379   3rd Qu.:0         
##  Max.   :1218146   Max.   :1116617         Max.   :101529   Max.   :0         
##                    NA's   :3               NA's   :3        NA's   :3         
##  area_total_sqft   main_sqft      main_value      garage_sqft    
##  Min.   :2747    Min.   :2241   Min.   :338833   Min.   : 479.0  
##  1st Qu.:3160    1st Qu.:2614   1st Qu.:419843   1st Qu.: 528.0  
##  Median :3362    Median :2763   Median :444984   Median : 551.0  
##  Mean   :3637    Mean   :2857   Mean   :461270   Mean   : 695.0  
##  3rd Qu.:3913    3rd Qu.:2953   3rd Qu.:468781   3rd Qu.: 917.8  
##  Max.   :7117    Max.   :4582   Max.   :899451   Max.   :1119.0  
##  NA's   :3       NA's   :3      NA's   :4        NA's   :3       
##   garage_value     pool_sqft         pool_value       land_sqft    
##  Min.   :32308   Min.   :   0.00   Min.   :     0   Min.   : 7501  
##  1st Qu.:38290   1st Qu.:   0.00   1st Qu.:     0   1st Qu.: 7756  
##  Median :40721   Median :   0.00   Median :     0   Median : 7872  
##  Mean   :50217   Mean   :  85.69   Mean   :  6039   Mean   : 8695  
##  3rd Qu.:66658   3rd Qu.:   0.00   3rd Qu.:     0   3rd Qu.: 8169  
##  Max.   :96763   Max.   :1468.00   Max.   :120403   Max.   :17505  
##  NA's   :4       NA's   :3         NA's   :3        NA's   :3
# Overview of the dataset structure
glimpse(data)   
## Rows: 45
## Columns: 16
## $ property_id             <chr> "R322646", "R322647", "R322648", "R322649", "R…
## $ address                 <chr> "6310 88TH ST, LUBBOCK, TX  79424", "6312 88TH…
## $ property_type           <chr> "Real", "Real", "Real", "Real", "Real", "Real"…
## $ assessed_value_2025     <int> 663907, 602427, 569992, 735026, 514098, 434345…
## $ market_value_2025       <int> 663907, 602427, 569992, 735026, 514098, 434345…
## $ improvement_value_total <int> 603222, 538677, 511992, 677026, 465906, 390839…
## $ land_value_total        <int> 60685, 63750, 58000, 58000, 48192, 43506, 4565…
## $ homestead_cap_loss      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ area_total_sqft         <int> 4304, 4186, 4001, 4525, 3121, 2901, 3286, 3923…
## $ main_sqft               <int> 3226, 3277, 3036, 3462, 2593, 2417, 2758, 2927…
## $ main_value              <int> NA, 477232, 447924, 593404, 426798, 358531, 44…
## $ garage_sqft             <int> 1078, 909, 965, 1063, 528, 484, 528, 996, 528,…
## $ garage_value            <int> NA, 61445, 64068, 83622, 39108, 32308, 38545, …
## $ pool_sqft               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pool_value              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ land_sqft               <int> 10463, 10625, 10000, 10000, 8309, 7501, 7872, …
# Count the number of missing values (NA) in each column
colSums(is.na(data))
##             property_id                 address           property_type 
##                       0                       0                       0 
##     assessed_value_2025       market_value_2025 improvement_value_total 
##                       0                       0                       3 
##        land_value_total      homestead_cap_loss         area_total_sqft 
##                       3                       3                       3 
##               main_sqft              main_value             garage_sqft 
##                       3                       4                       3 
##            garage_value               pool_sqft              pool_value 
##                       4                       3                       3 
##               land_sqft 
##                       3

Observations

  • The dataset contains 45 residential properties on 88th Street, each described by 16 variables including identifiers, assessed and market values, structural dimensions, and improvement components.

  • Property ID, address, and property type are character variables used for reference and classification; the rest are numeric and suitable for regression modeling.

  • Key variables show the following central tendencies:

    • Assessed and Market Values (2025): Median values are approximately $532,125 and $533,709 respectively, with a maximum property value of over $1.2 million.

    • Improvement and Land Values: Mean improvement value is $519,710; land value averages around $50,480.

    • Main and Garage Areas: Heated living space (main_sqft) ranges from 2,241 to 4,582 sq. ft., while garage space varies between 479 and 1,119 sq. ft.

  • Missing values are present in several fields:

    • Main Area Value and Garage Value have 4 missing entries each.

    • Other variables like improvement_value_total, land_value_total, area_total_sqft, and land_sqft have 3 missing values.

  • Pool data (sq. ft. and value) are zero or NA for most properties, indicating that pools are not a common feature in this sample


3.2 DATA FILTERING AND PREPARATION

In this section, we refine the dataset by selecting only relevant property types and variables necessary for the regression analysis. Specifically, we filter out non-residential property types, extract useful identifiers, and retain only numeric variables that contribute to the predictive modeling of property values. Basic data cleaning is also performed to remove rows with missing values in key features.

Data Filtering

To ensure accuracy and relevance, only properties classified as “Real”—representing land and buildings—were retained for analysis. “Personal” properties, such as movable assets, were excluded. Additionally, street numbers were extracted to support identification.

unique(data$property_type)
## [1] "Real"     "Personal"
data_real <- data %>%
  filter(property_type == "Real")
data_real <- data_real %>%
  mutate(street_number = substr(address, 1, 4)) %>%
  relocate(street_number, .after = property_id)

glimpse(data_real)
## Rows: 42
## Columns: 17
## $ property_id             <chr> "R322646", "R322647", "R322648", "R322649", "R…
## $ street_number           <chr> "6310", "6312", "6311", "6309", "6349", "6347"…
## $ address                 <chr> "6310 88TH ST, LUBBOCK, TX  79424", "6312 88TH…
## $ property_type           <chr> "Real", "Real", "Real", "Real", "Real", "Real"…
## $ assessed_value_2025     <int> 663907, 602427, 569992, 735026, 514098, 434345…
## $ market_value_2025       <int> 663907, 602427, 569992, 735026, 514098, 434345…
## $ improvement_value_total <int> 603222, 538677, 511992, 677026, 465906, 390839…
## $ land_value_total        <int> 60685, 63750, 58000, 58000, 48192, 43506, 4565…
## $ homestead_cap_loss      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ area_total_sqft         <int> 4304, 4186, 4001, 4525, 3121, 2901, 3286, 3923…
## $ main_sqft               <int> 3226, 3277, 3036, 3462, 2593, 2417, 2758, 2927…
## $ main_value              <int> NA, 477232, 447924, 593404, 426798, 358531, 44…
## $ garage_sqft             <int> 1078, 909, 965, 1063, 528, 484, 528, 996, 528,…
## $ garage_value            <int> NA, 61445, 64068, 83622, 39108, 32308, 38545, …
## $ pool_sqft               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pool_value              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ land_sqft               <int> 10463, 10625, 10000, 10000, 8309, 7501, 7872, …

Data Cleaning – Removal of Irrelevant or Redundant Variables

To streamline the dataset and retain only variables relevant for the implementation of the regression model, the following columns were removed:

  • address: This column is a textual identifier used solely for property location. It provides no predictive information and is not suitable for regression modeling.
  • property_type: All observations are of type “Real”, offering no variability and therefore no contribution to explaining market value.
  • assessed_value_2025: This variable is redundant with the response variable market_value_2025.
  • homestead_cap_loss: All values in this column are zero, indicating no variability or explanatory power.
  • pool_sqft and pool_value: These were excluded due to the limited number of homes with pools in the dataset, which results in minimal variability and low predictive relevance.
# List of columns to retain for analysis
cols_keep <- c(
  "property_id",
  "street_number",
  "market_value_2025",
  "improvement_value_total",
  "land_value_total",
  "area_total_sqft",
  "main_sqft",
  "main_value",
  "garage_sqft",
  "garage_value",
  "land_sqft"
)

# Filter the dataframe to keep only the selected columns
data_real <- data_real[, names(data_real) %in% cols_keep]

# Count the number of missing values (NA) in each column
colSums(is.na(data_real))
##             property_id           street_number       market_value_2025 
##                       0                       0                       0 
## improvement_value_total        land_value_total         area_total_sqft 
##                       0                       0                       0 
##               main_sqft              main_value             garage_sqft 
##                       0                       1                       0 
##            garage_value               land_sqft 
##                       1                       0
# Remove rows where key variables are NA
data_real <- data_real %>%
  filter(!is.na(main_value) & !is.na(garage_value))

# Verify missing values
colSums(is.na(data_real))
##             property_id           street_number       market_value_2025 
##                       0                       0                       0 
## improvement_value_total        land_value_total         area_total_sqft 
##                       0                       0                       0 
##               main_sqft              main_value             garage_sqft 
##                       0                       0                       0 
##            garage_value               land_sqft 
##                       0                       0

Observations

  • After filtering out irrelevant entries and handling missing values, the dataset now includes 40 valid residential properties located on 88th Street.

  • Only properties classified as “Real” were retained to ensure the analysis focuses on land and structural assets.

  • The dataset was cleaned to include only variables relevant for the regression model, resulting in 11 core variables:

    • property_id: Unique identifier of each property

    • street_number: Extracted number for property identification

    • market_value_2025: Target variable, representing the full appraised value

    • improvement_value_total: Total value of built structures

    • land_value_total: Appraised value of the land component

    • area_total_sqft: Total enclosed square footage (heated and non-heated)

    • main_sqft: Heated living area in square feet

    • main_value: Value attributed to the main (heated) area

    • garage_sqft: Non-heated area (e.g., garage) in square feet

    • garage_value: Appraised value of the garage or similar areas

    • land_sqft: Total land area in square feet


3.3 DISTRIBUTION AND RELATIONSHIP ANALYSIS

Understanding the distribution of variables is a crucial step in preparing for multivariable regression modeling. This section explores the spread, central tendency, and potential outliers of each numerical variable using boxplots, ensuring that the dataset is well-behaved for statistical modeling.

In addition, a correlation matrix is presented to examine the strength and direction of linear relationships between predictors. This helps identify possible multicollinearity, guide variable selection, and inform model refinement. These exploratory visualizations provide essential context for evaluating variable behavior and improving the robustness of the regression model.

3.3.1 DISTRIBUTION ANALYSIS

# Boxplot Visualization
options(scipen = 999)
numeric_vars <- data_real[, sapply(data_real, is.numeric)]
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1),
    cex.main = 1.2, cex.axis = 0.9)

for (var in names(numeric_vars)) {
  boxplot(numeric_vars[[var]],
          main = var,
          col = "lightblue",
          horizontal = TRUE)
}

Observations

  • market_value_2025, improvement_value_total, and main_value show a right-skewed distribution with several extreme outliers on the higher end, indicating that a few high-value properties may distort regression estimates if not addressed.

  • land_value_total and land_sqft exhibit a very narrow interquartile range (IQR) but have numerous outliers above the upper whisker, suggesting limited variability for most homes with a few properties having unusually high land values or lot sizes.

  • area_total_sqft and main_sqft are moderately skewed with some influential observations, though most properties fall within a consistent central range.

  • garage_sqft has a highly skewed distribution, with a large spread and several high-end outliers, indicating that some properties have significantly larger garage spaces than typical.

  • garage_value follows a similar pattern to garage_sqft, reinforcing the direct relationship between garage size and its appraised value.

3.3.2 RELATIONSHIP ANALYSIS

# Correlation Matrix Visualization
numeric_vars <- data_real[, sapply(data_real, is.numeric)]
cor_matrix <- cor(numeric_vars)
ggcorrplot(cor_matrix, 
           lab = TRUE,                          # Mostrar coeficientes
           colors = c("red", "white", "#4A90E2"),  # Rango de colores
           outline.color = "black",             # Borde de cada celda
           show.legend = TRUE,                  # Mostrar leyenda
           title = "Correlation Matrix of Numeric Variables",
)

Observations:

  • The target variable market_value_2025 exhibits near-perfect correlations with:

    • improvement_value_total (r = 1.00)

    • main_value (r = 0.99)

    • area_total_sqft (r = 0.96)

    • main_sqft (r = 0.93)

    • land_sqft (r = 0.91)

    • land_value_total (r = 0.90)

    • garage_value (r = 0.72)

  • improvement_value_total and main_value should be excluded from the regression model, despite their strong correlation with the target, because they are mathematically embedded within the target:

    • market_value_2025 = improvement_value_total + land_value_total

    • improvement_value_total = main_value + garage_value

    • Including these would violate the assumption of predictor independence and overfitting.

  • The features main_sqft, garage_sqft, and land_sqft remain in the model as they are direct physical measurements, not derived values. Their correlations with market_value_2025 (r = 0.93, 0.57, 0.91 respectively) confirm their predictive relevance while preserving model validity.

  • High intercorrelations among predictor variables are present:

    • main_sqft vs. area_total_sqft: r = 0.96

    • garage_sqft vs. garage_value: r = 0.96

    • main_sqft vs. main_value: r = 0.94

    • main_sqft vs. improvement_value_total: r = 0.92

  • These indicate potential multicollinearity, which should be addressed in later model diagnostics.



4 MODEL BUILDING

This section presents the process of constructing a multiple linear regression model to estimate the 2025 market value of residential properties using relevant physical and structural features. The model aims to quantify how various characteristics, such as land size, heated area, and garage attributes, jointly influence property valuation.

We begin by specifying an initial regression model that incorporates main effects, interaction terms, and quadratic terms to capture both linear and non-linear relationships. Careful consideration is given to the inclusion of predictors to avoid redundancy or data leakage.


4.1 INITIAL MODEL

Multiple linear regression is a fundamental statistical technique used to model the relationship between a continuous dependent variable and two or more independent variables. It enables us to quantify how changes in multiple property attributes jointly influence the outcome of interest.

In this case, the objective is to estimate the 2025 market value of residential properties on 88th Street in Lubbock, Texas, based on physical and structural characteristics such as land value, heated area, garage size, and lot size. By fitting a regression model to this data, we aim to assess the contribution of each feature, individually and through interaction effects, to the overall property value.

The general form of the model is:

\[ \text{market_value}_{2025} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \]

  • \(\text{market_value}_{2025}\): is the dependent variable (response), representing the assessed market value of the property in 2025.

  • \(\beta_0\): is the intercept, indicating the estimated value of the property when all predictors are zero.

  • \(\beta_1, \beta_2, \dots, \beta_p\): are the regression coefficients, estimating the effect of each predictor (e.g., land size, garage area) on the market value.

  • \(X_1, X_2, \dots, X_p\): are the independent variables (predictors) used in the model, such as:

    • \({land\_value\_total}\)
    • \({main\_sqft}\)
    • \({garage\_sqft}\)
    • \({garage\_value}\)
    • \({land\_sqft}\)
    • \({area\_total\_sqft}\)
    • plus interaction terms and squared terms (e.g., \(X_i^2\), \(X_i \cdot X_j\))
  • \(\varepsilon\): is the random error term, accounting for variability in market value not captured by the predictors.


Initial Regression Model

To construct the initial regression model for predicting the 2025 market value of residential properties, the general linear equation is expanded using the actual predictors selected from the dataset. This formulation incorporates:

  • Main effects (original features),

  • Pairwise interaction terms (to capture possible synergies between variables), and

  • Quadratic terms (to account for potential non-linear patterns).

Although some variables such as improvement_value_total and main_value show a very high correlation with the target variable market_value_2025, they were intentionally excluded from the model. This is because they are mathematically embedded within the target variable, and including them would violate the assumption of predictor independence and lead to overfitting.

The expanded expression of the initial model is as follows:

\[ \begin{aligned} \text{market_value_2025 } =\ & \beta_0 + \beta_1 \cdot \text{land_value_total} + \beta_2 \cdot \text{main_sqft} + \beta_3 \cdot \text{garage_sqft} + \beta_4 \cdot \text{garage_value} + \beta_5 \cdot \text{land_sqft} + \beta_6 \cdot \text{area_total_sqft} \\ &+ \beta_7 \cdot (\text{land_value_total} \cdot \text{main_sqft}) + \beta_8 \cdot (\text{land_value_total} \cdot \text{garage_sqft}) + \beta_9 \cdot (\text{land_value_total} \cdot \text{garage_value}) \\ &+ \beta_{10} \cdot (\text{land_value_total} \cdot \text{land_sqft}) + \beta_{11} \cdot (\text{land_value\_total} \cdot \text{area_total_sqft}) + \beta_{12} \cdot (\text{main_sqft} \cdot \text{garage_sqft}) \\ &+ \beta_{13} \cdot (\text{main_sqft} \cdot \text{garage_value}) + \beta_{14} \cdot (\text{main_sqft} \cdot \text{land_sqft}) + \beta_{15} \cdot (\text{main_sqft} \cdot \text{area_total_sqft}) \\ &+ \beta_{16} \cdot (\text{garage_sqft} \cdot \text{garage_value}) + \beta_{17} \cdot (\text{garage_sqft} \cdot \text{land_sqft}) + \beta_{18} \cdot (\text{garage_sqft} \cdot \text{area_total_sqft}) \\ &+ \beta_{19} \cdot (\text{garage_value} \cdot \text{land_sqft}) + \beta_{20} \cdot (\text{garage_value} \cdot \text{area_total_sqft}) + \beta_{21} \cdot (\text{land_sqft} \cdot \text{area_total_sqft}) \\ &+ \beta_{22} \cdot \text{land_value_total}^2 + \beta_{23} \cdot \text{main_sqft}^2 + \beta_{24} \cdot \text{garage_sqft}^2 + \beta_{25} \cdot \text{garage_value}^2 \\ &+ \beta_{26} \cdot \text{land_sqft}^2 + \beta_{27} \cdot \text{area_total_sqft}^2 + \varepsilon \end{aligned} \]

  • \(\text{market_value}_{2025}\): Target variable, the appraised market value of the property in 2025
  • \(\beta_0\): Intercept
  • \(\beta_1\) to \(\beta_6\): Coefficients for main predictors
  • \(\beta_7\) to \(\beta_{21}\)`: Coefficients for interaction terms
  • \(\beta_{22}\) to \(\beta_{27}\)`: Coefficients for quadratic terms
  • \(\varepsilon\):: Random error term

In the first step, we isolate the target property located at street number 6321 by storing it in a separate data frame named data_6321. This allows us to make an out-of-sample prediction. To avoid introducing bias into the model, we then remove this property from the training dataset (data_real) before fitting the model.

# Extract the property with street_number 6321 and store it in data_6321
data_6321 <- data_real[data_real$street_number == 6321, ]

# Remove that row from the training dataset
data_real <- data_real[data_real$street_number != 6321, ]
rownames(data_real) <- NULL


A full regression model is fitted using all predictors, their pairwise interactions, and squared terms.

model_full <- lm(
  market_value_2025 ~ 
    (land_value_total + main_sqft +
     garage_sqft + garage_value + land_sqft + area_total_sqft)^2 +
    I(land_value_total^2) + I(main_sqft^2) +
    I(garage_sqft^2) + I(garage_value^2) + I(land_sqft^2) +
    I(area_total_sqft^2),
  data = data_real
)
summary(model_full)
## 
## Call:
## lm(formula = market_value_2025 ~ (land_value_total + main_sqft + 
##     garage_sqft + garage_value + land_sqft + area_total_sqft)^2 + 
##     I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) + 
##     I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2), 
##     data = data_real)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6625.4  -133.2    11.9   407.3  2266.4 
## 
## Coefficients: (4 not defined because of singularities)
##                                          Estimate       Std. Error t value
## (Intercept)                       -85943.73425868  162015.89026558  -0.530
## land_value_total                   18808.53041293   23108.66694784   0.814
## main_sqft                            208.08228722     619.27299609   0.336
## garage_sqft                        -1057.47336410     768.61434331  -1.376
## garage_value                          14.19966474       3.93340643   3.610
## land_sqft                        -109083.75206096  134028.25421000  -0.814
## area_total_sqft                       34.33949201     593.70946479   0.058
## I(land_value_total^2)                  0.13879630       0.27657727   0.502
## I(main_sqft^2)                        -0.04215281       0.57776840  -0.073
## I(garage_sqft^2)                       0.84597907       0.34997850   2.417
## I(garage_value^2)                     -0.00000482       0.00008139  -0.059
## I(land_sqft^2)                        -4.66892112       9.30518331  -0.502
## I(area_total_sqft^2)                  -0.03681714       0.22025905  -0.167
## land_value_total:main_sqft            -8.63116334      12.00096550  -0.719
## land_value_total:garage_sqft           4.90735665      72.22682546   0.068
## land_value_total:garage_value         -0.20184358       0.89152744  -0.226
## land_value_total:land_sqft                     NA               NA      NA
## land_value_total:area_total_sqft       0.00268557       0.00358760   0.749
## main_sqft:garage_sqft                 -0.05738249       0.39983222  -0.144
## main_sqft:garage_value                 0.00132662       0.00123555   1.074
## main_sqft:land_sqft                   50.04506636      69.60166533   0.719
## main_sqft:area_total_sqft              0.06027726       0.79682355   0.076
## garage_sqft:garage_value              -0.01062309       0.00958458  -1.108
## garage_sqft:land_sqft                -28.48522680     418.90384959  -0.068
## garage_sqft:area_total_sqft                    NA               NA      NA
## garage_value:land_sqft                 1.17075009       5.17049868   0.226
## garage_value:area_total_sqft                   NA               NA      NA
## land_sqft:area_total_sqft                      NA               NA      NA
##                                  Pr(>|t|)   
## (Intercept)                       0.60307   
## land_value_total                  0.42764   
## main_sqft                         0.74123   
## garage_sqft                       0.18783   
## garage_value                      0.00235 **
## land_sqft                         0.42766   
## area_total_sqft                   0.95459   
## I(land_value_total^2)             0.62262   
## I(main_sqft^2)                    0.94274   
## I(garage_sqft^2)                  0.02794 * 
## I(garage_value^2)                 0.95351   
## I(land_sqft^2)                    0.62267   
## I(area_total_sqft^2)              0.86934   
## land_value_total:main_sqft        0.48239   
## land_value_total:garage_sqft      0.94667   
## land_value_total:garage_value     0.82376   
## land_value_total:land_sqft             NA   
## land_value_total:area_total_sqft  0.46498   
## main_sqft:garage_sqft             0.88767   
## main_sqft:garage_value            0.29888   
## main_sqft:land_sqft               0.48250   
## main_sqft:area_total_sqft         0.94064   
## garage_sqft:garage_value          0.28409   
## garage_sqft:land_sqft             0.94663   
## garage_sqft:area_total_sqft            NA   
## garage_value:land_sqft            0.82374   
## garage_value:area_total_sqft           NA   
## land_sqft:area_total_sqft              NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1956 on 16 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9998 
## F-statistic:  8792 on 23 and 16 DF,  p-value: < 0.00000000000000022


Initial Predictor Evaluation

The following terms are recommended for retention:

  • garage_value (p = 0.002) and I(garage_sqft²) (p = 0.028) show clear statistical significance and should be retained as key predictors of market value.

  • While main_sqft, land_value_total, area_total_sqft, and their interaction terms were not statistically significant individually (p > 0.1), they are conceptually relevant and may support the structure of the model when considered together.

  • Interaction terms and quadratic components such as main_sqft:garage_value, land_value_total:garage_value, and garage_sqft:garage_value do not show significance and may be removed to simplify the model.

  • Several terms resulted in singularities (NA coefficients)—specifically:

    • land_value_total:land_sqft
    • garage_sqft:area_total_sqft
    • garage_value:area_total_sqft
    • land_sqft:area_total_sqft
model2<- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
    garage_sqft:garage_value + garage_sqft:land_sqft,
  data = data_real
)

summary(model2)
## 
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft + 
##     garage_sqft + garage_value + land_sqft + area_total_sqft + 
##     I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) + 
##     I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) + 
##     land_value_total:main_sqft + land_value_total:garage_sqft + 
##     land_value_total:garage_value + land_value_total:area_total_sqft + 
##     main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + 
##     main_sqft:area_total_sqft + garage_sqft:garage_value + garage_sqft:land_sqft, 
##     data = data_real)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6602.0  -167.8    12.5   416.5  2374.7 
## 
## Coefficients:
##                                         Estimate      Std. Error t value
## (Intercept)                      -73517.15783031 148123.05611120  -0.496
## land_value_total                  14520.54638818  12868.40740859   1.128
## main_sqft                           259.35307972    560.07736186   0.463
## garage_sqft                        -936.46600839    536.78318171  -1.745
## garage_value                         13.50713797      2.40333599   5.620
## land_sqft                        -84214.32287119  74638.73673539  -1.128
## area_total_sqft                     -25.83567017    515.88150733  -0.050
## I(land_value_total^2)                 0.16777345      0.23824874   0.704
## I(main_sqft^2)                       -0.10429218      0.49404133  -0.211
## I(garage_sqft^2)                      0.84860500      0.33988582   2.497
## I(garage_value^2)                    -0.00001351      0.00006975  -0.194
## I(land_sqft^2)                       -5.64385714      8.01559561  -0.704
## I(area_total_sqft^2)                 -0.05727685      0.19518616  -0.293
## land_value_total:main_sqft           -7.74459322     11.02317682  -0.703
## land_value_total:garage_sqft        -11.21049526     11.88951055  -0.943
## land_value_total:garage_value         0.00002396      0.00007889   0.304
## land_value_total:area_total_sqft      0.00199765      0.00185408   1.077
## main_sqft:garage_sqft                -0.10941276      0.31794502  -0.344
## main_sqft:garage_value                0.00140998      0.00114604   1.230
## main_sqft:land_sqft                  44.90814899     63.93806341   0.702
## main_sqft:area_total_sqft             0.14312407      0.68781794   0.208
## garage_sqft:garage_value             -0.00974084      0.00850900  -1.145
## garage_sqft:land_sqft                64.99644411     68.93852004   0.943
##                                   Pr(>|t|)    
## (Intercept)                         0.6260    
## land_value_total                    0.2748    
## main_sqft                           0.6492    
## garage_sqft                         0.0991 .  
## garage_value                     0.0000306 ***
## land_sqft                           0.2749    
## area_total_sqft                     0.9606    
## I(land_value_total^2)               0.4908    
## I(main_sqft^2)                      0.8353    
## I(garage_sqft^2)                    0.0231 *  
## I(garage_value^2)                   0.8487    
## I(land_sqft^2)                      0.4909    
## I(area_total_sqft^2)                0.7727    
## land_value_total:main_sqft          0.4918    
## land_value_total:garage_sqft        0.3589    
## land_value_total:garage_value       0.7651    
## land_value_total:area_total_sqft    0.2963    
## main_sqft:garage_sqft               0.7350    
## main_sqft:garage_value              0.2353    
## main_sqft:land_sqft                 0.4920    
## main_sqft:area_total_sqft           0.8376    
## garage_sqft:garage_value            0.2682    
## garage_sqft:land_sqft               0.3590    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1901 on 17 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9998 
## F-statistic:  9735 on 22 and 17 DF,  p-value: < 0.00000000000000022


Observations

  • The updated model exhibits excellent overall fit, with a residual standard error of approximately 1901 and an R² value of 0.9999, indicating that the model explains nearly all the variability in the response.
  • Only a few predictors are statistically significant (p < 0.05), particularly:
    • garage_value and the squared term I(garage_sqft²).
    • The linear term garage_sqft also approaches significance (p ≈ 0.099), warranting further consideration.
  • Most interaction terms—including combinations such as land_value_total:main_sqft, main_sqft:garage_value, and garage_sqft:land_sqft—exhibit high p-values (p > 0.1), suggesting they do not contribute meaningful interaction effects in the current formulation.
  • Similarly, variables like I(main_sqft²), I(area_total_sqft²), and land_sqft display non-significant effects (p > 0.1), indicating limited marginal impact and potential candidates for removal in a simplified model.



4.2 OUTLIER AND INFLUENCE DIAGNOSTICS

This section conducts a diagnostic assessment to identify influential observations that may disproportionately affect the fitted regression model. Influential points can distort parameter estimates, compromise model assumptions, and reduce predictive accuracy

Observations within a dataset can be labeled by the amount in which they are an outlier in their response, have leverage in their predictors, and influence the fit of the model.

  • Outliers are observations in a dataset whose response variable is distant from other observations.

  • Leverage points are observations in a dataset whose predictor variables are distant from the other observations.

  • Influence points are observations in a dataset that are both outliers in their response variable and have leverage in their predictor variables.

Points that have influence need to be carefully examined to determine whether they are mistakes (e.g., due to measurement errors) or extreme observations (which are possible but unlikely). Once identified, a decision must be made on how to handle these points—whether to leave them in the model or remove them.

4.2.1 MATHEMATICS BEHIND INFLUENCE DIAGNOSTICS

The diagnostics for influential observations are based on three main components:

1. Studentized Residuals (rstudent): Measures the residual of an observation scaled by its estimated standard deviation, providing an assessment of whether an observation is an outlier in terms of its response variable.

\[ \mathbf{e} = (\mathbf{I} - \mathbf{H}) \mathbf{y} \]

Where:

  • \(\mathbf{e}\) is the residual vector.
  • \(\mathbf{I}\) is the identity matrix.
  • \(\mathbf{H}\) is the hat matrix.

2. Leverage (hatvalues): Measures how much influence a particular predictor value has on the fitted values. The leverage for an observation \(i\) is calculated as:

\[ h_{ii} = \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i \]

Where:

  • \(\mathbf{x}_i\) is the vector of predictors for observation \(i\).
  • \(\mathbf{X}\) is the matrix of all predictors.
  • \(h_{ii}\) is the leverage of the \(i\)-th observation on its fitted value.

A common rule of thumb to identify high-leverage points is:

\[ h_{ii} > \frac{2p}{n} \]

3. Cook’s Distance (cooks.distance): Combines the residuals and leverage information to assess the overall influence of each observation on the model. The Cook’s Distance for the \(\text{i-th}\) observation is calculated as:

\[ D_i = \frac{1}{p \cdot \text{MSE}} (\hat{\boldsymbol{\beta}}^{(i)} - \hat{\boldsymbol{\beta}})^\top \mathbf{X}^\top \mathbf{X} (\hat{\boldsymbol{\beta}}^{(i)} - \hat{\boldsymbol{\beta}}) \]

Where:

  • \(\hat{\boldsymbol{\beta}}^{(i)}\) is the estimated regression coefficient vector excluding the \(i\)-th observation.
  • \(\hat{\boldsymbol{\beta}}\) is the estimated regression coefficient vector using all observations.
  • \(p\) is the number of parameters in the model.
  • \(\text{MSE}\) is the mean squared error of the model.

4.2.2 DIAGNOSTICS IMPLEMENTATION

The following R code generates diagnostic plots and a summary table of key influence statistics:

plot(model2, 1)

plot(model2, 4)

The plots suggest possible influential observations at indexes 2, 28, and 38.


# Creating combined diagnostics dataframe
diagnostics_df <- data.frame(
  Obs = 1:nrow(data_real),
  Residual = rstudent(model2),
  Leverage = hatvalues(model2),
  CooksD = cooks.distance(model2)
)

# Thresholds for leverage and Cook's Distance
n <- nrow(data_real)
p <- length(coef(model2))
leverage_thresh <- 2 * p / n
cook_thresh <- 4 / n

# Flagging high influence points
diagnostics_df$High_Residual <- abs(diagnostics_df$Residual) > 2
diagnostics_df$High_Leverage <- diagnostics_df$Leverage > leverage_thresh
diagnostics_df$High_CooksD <- diagnostics_df$CooksD > cook_thresh

# Sorting by Cook's Distance
diagnostics_df <- diagnostics_df[order(-diagnostics_df$CooksD), ]

# Display top influential points
rownames(diagnostics_df) <- NULL
rmarkdown::paged_table(diagnostics_df)
plot(model2,5)

plot(model2,6)


Observations

  • Data Points 1 and 11 are the strongest candidates for exclusion. Observation 1 displays a Cook’s Distance of approximately 1408, with a leverage value of 1.00, the theoretical maximum. Though its studentized residual is not extreme (0.03), its magnitude of influence on the regression is exceptionally high. Similarly, Data Point 11 has a Cook’s Distance of 9.68 and leverage near 0.99, making it highly influential on the fitted model.

  • Data Point 37 also stands out with a studentized residual of −11.46, well beyond the ±2 threshold, and a Cook’s Distance of 0.17. While its leverage is moderate (≈ 0.20), the extreme residual marks it as a clear outlier in terms of model fit.

  • Data Point 3 shows a Cook’s Distance of 1.43, leverage of 0.99, and a moderately large residual (−0.67). This combination indicates a potential distortion of coefficient estimates and model stability.

  • Other points such as 32, 9, and 39 exceed the Cook’s Distance threshold of 4/n (≈ 0.1), though their studentized residuals remain within acceptable bounds. For instance, Data Point 39 has a residual of 1.90 and moderate leverage (≈ 0.50), suggesting a non-negligible but less severe influence.

  • Based on these results, Data Points 1, 11, 3, and 37 are strong candidates for exclusion in a revised model. Data Points 9, 32, and 39 will be temporarily retained but flagged for further inspection. Subsequent modeling will involve re-estimating the regression with and without these observations to assess impacts on adjusted R-squared, residual behavior, and the stability of coefficient estimates.



4.3 RE-ESTIMATE THE MODEL WITHOUT INFLUENTIAL DATA POINTS

In this section, we aim to improve the reliability and interpretability of the regression model by identifying and removing influential data points. These observations, typically characterized by high leverage, large residuals, or disproportionate influence on model coefficients, can distort estimates and compromise model performance.

By re-estimating the model without such points, we seek to achieve a more stable fit, enhance statistical significance of relevant predictors, and ensure the model better reflects the underlying patterns in the data.

4.3.1 MODEL RE-ESTIMATION

To improve model stability and reduce the effect of extreme values, we re-estimated the model after removing four previously identified influential observations (Data Points 1, 3, 11, and 37). These points were flagged based on high leverage, outlier status, and influence metrics.

# Remove influential observations
data_no_points <- data_real[-c(1, 11, 3, 37), ]

# Fit the new model
model_no_points <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
    garage_sqft:garage_value + garage_sqft:land_sqft,
  data = data_no_points)

summary(model_no_points)
## 
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft + 
##     garage_sqft + garage_value + land_sqft + area_total_sqft + 
##     I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) + 
##     I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) + 
##     land_value_total:main_sqft + land_value_total:garage_sqft + 
##     land_value_total:garage_value + land_value_total:area_total_sqft + 
##     main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + 
##     main_sqft:area_total_sqft + garage_sqft:garage_value + garage_sqft:land_sqft, 
##     data = data_no_points)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -231.516  -35.200   -0.174   44.186  186.723 
## 
## Coefficients:
##                                          Estimate       Std. Error t value
## (Intercept)                      -32532.655322552  23386.404808748  -1.391
## land_value_total                  -3314.749529057   2662.441382959  -1.245
## main_sqft                            65.896705356     63.043193006   1.045
## garage_sqft                        -751.515100888     57.588030943 -13.050
## garage_value                         10.011740905      0.251663117  39.782
## land_sqft                         19235.636345617  15441.581708770   1.246
## area_total_sqft                     104.869067722     53.194883065   1.971
## I(land_value_total^2)                 0.012594306      0.034824014   0.362
## I(main_sqft^2)                        0.045388164      0.046824511   0.969
## I(garage_sqft^2)                      0.867328814      0.030777077  28.181
## I(garage_value^2)                    -0.000005365      0.000007354  -0.730
## I(land_sqft^2)                       -0.423281744      1.171676752  -0.361
## I(area_total_sqft^2)                 -0.007849387      0.017810212  -0.441
## land_value_total:main_sqft            0.676598824      0.993448810   0.681
## land_value_total:garage_sqft         -0.040440798      1.113434660  -0.036
## land_value_total:garage_value        -0.000032808      0.000007526  -4.360
## land_value_total:area_total_sqft      0.002118823      0.000209319  10.122
## main_sqft:garage_sqft                -0.238098904      0.032219711  -7.390
## main_sqft:garage_value                0.003786818      0.000133024  28.467
## main_sqft:land_sqft                  -3.940827623      5.762604131  -0.684
## main_sqft:area_total_sqft            -0.031679026      0.064624283  -0.490
## garage_sqft:garage_value             -0.011362737      0.000903749 -12.573
## garage_sqft:land_sqft                 0.237610431      6.456075615   0.037
##                                             Pr(>|t|)    
## (Intercept)                                 0.187548    
## land_value_total                            0.235106    
## main_sqft                                   0.314949    
## garage_sqft                      0.00000000759469477 ***
## garage_value                     0.00000000000000575 ***
## land_sqft                                   0.234857    
## area_total_sqft                             0.070344 .  
## I(land_value_total^2)                       0.723420    
## I(main_sqft^2)                              0.350073    
## I(garage_sqft^2)                 0.00000000000048422 ***
## I(garage_value^2)                           0.478568    
## I(land_sqft^2)                              0.723707    
## I(area_total_sqft^2)                        0.666653    
## land_value_total:main_sqft                  0.507784    
## land_value_total:garage_sqft                0.971578    
## land_value_total:garage_value               0.000773 ***
## land_value_total:area_total_sqft 0.00000015619830432 ***
## main_sqft:garage_sqft            0.00000527093644677 ***
## main_sqft:garage_value           0.00000000000042543 ***
## main_sqft:land_sqft                         0.506070    
## main_sqft:area_total_sqft                   0.632159    
## garage_sqft:garage_value         0.00000001192089443 ***
## garage_sqft:land_sqft                       0.971200    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 160.5 on 13 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.302e+06 on 22 and 13 DF,  p-value: < 0.00000000000000022

Observations

  • The residual standard error decreased significantly from 1901 to 160.5, indicating improved fit.

  • The adjusted R² remained at 1.00, confirming that the model continues to explain nearly all variability in market_value_2025 even after removing outliers.

  • The following variables are highly significant (p < 0.01):

    • garage_sqft, garage_value, and I(garage_sqft²)
    • Interaction terms: land_value_total:garage_value, land_value_total:area_total_sqft, main_sqft:garage_sqft,
    • main_sqft:garage_value, and garage_sqft:garage_value.
  • Most other predictors, including land_value_total, main_sqft, and several squared and interaction terms, are not statistically significant (p > 0.1)`

  • The results confirm that garage-related variables and their interactions are the most influential in predicting market value in the cleaned model.

  • The elimination of the four influential observations led to a more stable, interpretable, and better statistically model.

4.3.2 TESTING ADDITIONAL OBSERVATIONS

To further evaluate the robustness of the model, we conducted a sensitivity analysis by testing the impact of individually removing three additional observations (9, 32, and 39). These were previously identified as having potential outlier behavior but were not removed in the initial round.

# List of observations to evaluate
outliers_to_test <- c(9, 32, 39)

for (obs in outliers_to_test) {
  cat("\n================ Data Point Removed:", obs, "================\n")
  
   # Remove observation
  temp_data <- data_no_points[-obs, ]
  
  # Refit the model
  temp_model <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
    garage_sqft:garage_value + garage_sqft:land_sqft,
  data = temp_data)
  
  # Model summary
  s <- summary(temp_model)
  f <- s$fstatistic
  f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE)

  # Display global metrics
  cat("Residual standard error:", format(s$sigma, scientific = TRUE), "on", s$df[2], "degrees of freedom\n")
  cat("Multiple R-squared:", round(s$r.squared, 4), "\tAdjusted R-squared:", round(s$adj.r.squared, 4), "\n")
  cat("F-statistic:", format(f[1], scientific = TRUE), "on", f[2], "and", f[3], "DF, p-value:", format.pval(f_pval, digits = 4, eps = .0001), "\n\n")
  
  # Display rounded p-values
  pvals <- round(s$coefficients[, "Pr(>|t|)"], 4)
  for (i in seq_along(pvals)) {
    cat(names(pvals)[i], ":", pvals[i], "\n")
  }
}
## 
## ================ Data Point Removed: 9 ================
## Residual standard error: 1.503073e+02 on 12 degrees of freedom
## Multiple R-squared: 1    Adjusted R-squared: 1 
## F-statistic: 1.484054e+06 on 22 and 12 DF, p-value: < 0.0001 
## 
## (Intercept) : 0.1106 
## land_value_total : 0.1936 
## main_sqft : 0.1863 
## garage_sqft : 0 
## garage_value : 0 
## land_sqft : 0.1934 
## area_total_sqft : 0.1049 
## I(land_value_total^2) : 0.9028 
## I(main_sqft^2) : 0.4422 
## I(garage_sqft^2) : 0 
## I(garage_value^2) : 0.3897 
## I(land_sqft^2) : 0.9024 
## I(area_total_sqft^2) : 0.4997 
## land_value_total:main_sqft : 0.2192 
## land_value_total:garage_sqft : 0.9483 
## land_value_total:garage_value : 0.0004 
## land_value_total:area_total_sqft : 0 
## main_sqft:garage_sqft : 0 
## main_sqft:garage_value : 0 
## main_sqft:land_sqft : 0.2182 
## main_sqft:area_total_sqft : 0.7958 
## garage_sqft:garage_value : 0 
## garage_sqft:land_sqft : 0.9476 
## 
## ================ Data Point Removed: 32 ================
## Residual standard error: 1.557109e+02 on 12 degrees of freedom
## Multiple R-squared: 1    Adjusted R-squared: 1 
## F-statistic: 1.366883e+06 on 22 and 12 DF, p-value: < 0.0001 
## 
## (Intercept) : 0.2972 
## land_value_total : 0.1001 
## main_sqft : 0.2433 
## garage_sqft : 0 
## garage_value : 0 
## land_sqft : 0.1 
## area_total_sqft : 0.1095 
## I(land_value_total^2) : 0.6813 
## I(main_sqft^2) : 0.5653 
## I(garage_sqft^2) : 0 
## I(garage_value^2) : 0.1929 
## I(land_sqft^2) : 0.6815 
## I(area_total_sqft^2) : 0.4507 
## land_value_total:main_sqft : 0.2074 
## land_value_total:garage_sqft : 0.4676 
## land_value_total:garage_value : 0.0006 
## land_value_total:area_total_sqft : 0 
## main_sqft:garage_sqft : 0 
## main_sqft:garage_value : 0 
## main_sqft:land_sqft : 0.2067 
## main_sqft:area_total_sqft : 0.9049 
## garage_sqft:garage_value : 0 
## garage_sqft:land_sqft : 0.4671 
## 
## ================ Data Point Removed: 39 ================
## Residual standard error: 1.605191e+02 on 13 degrees of freedom
## Multiple R-squared: 1    Adjusted R-squared: 1 
## F-statistic: 1.302447e+06 on 22 and 13 DF, p-value: < 0.0001 
## 
## (Intercept) : 0.1875 
## land_value_total : 0.2351 
## main_sqft : 0.3149 
## garage_sqft : 0 
## garage_value : 0 
## land_sqft : 0.2349 
## area_total_sqft : 0.0703 
## I(land_value_total^2) : 0.7234 
## I(main_sqft^2) : 0.3501 
## I(garage_sqft^2) : 0 
## I(garage_value^2) : 0.4786 
## I(land_sqft^2) : 0.7237 
## I(area_total_sqft^2) : 0.6667 
## land_value_total:main_sqft : 0.5078 
## land_value_total:garage_sqft : 0.9716 
## land_value_total:garage_value : 0.0008 
## land_value_total:area_total_sqft : 0 
## main_sqft:garage_sqft : 0 
## main_sqft:garage_value : 0 
## main_sqft:land_sqft : 0.5061 
## main_sqft:area_total_sqft : 0.6322 
## garage_sqft:garage_value : 0 
## garage_sqft:land_sqft : 0.9712

Observations:

After previously excluding Data Points 1, 3, 11, and 37, additional outliers were individually removed to assess their influence on the model’s performance. The goal was to determine whether their presence distorted coefficient estimation, statistical significance, or model fit.

  • Data Point 9:
    • Residual Standard Error: 150.31
    • Slight improvement in overall fit; key predictors maintained or improved significance, including area_total_sqft, which approached the 0.05 threshold..
    • Strong candidate for exclusion.
  • Data Point 32:
    • Residual Standard Error: 155.71
    • No substantial change in predictor significance or model structure; RSE slightly lower than baseline.
    • Retain; low impact on model performance.
  • Data Point 39:
    • Residual Standard Error: 160.52
    • Model remained nearly identical to the baseline; no change in predictor behavior or significance.
    • Retain; minimal influence detected.

4.3.3 FINAL CLEANED MODEL

Based on the above sensitivity analysis, we updated the model one final time by additionally removing Data Point 9, resulting in a total of five excluded observations (1, 3, 9, 11, and 37). This version is expected to be more stable and better represent the central trend of the data.

# Remove influential observations
data_cleaned <- data_real[-c(1, 11, 3, 37, 9), ]

# Refit the regression model using the cleaned dataset
model_cleaned <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
    garage_sqft:garage_value + garage_sqft:land_sqft,
  data =  data_cleaned
)

# Display key model results
summary(model_cleaned)
## 
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft + 
##     garage_sqft + garage_value + land_sqft + area_total_sqft + 
##     I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) + 
##     I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) + 
##     land_value_total:main_sqft + land_value_total:garage_sqft + 
##     land_value_total:garage_value + land_value_total:area_total_sqft + 
##     main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + 
##     main_sqft:area_total_sqft + garage_sqft:garage_value + garage_sqft:land_sqft, 
##     data = data_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -210.50  -40.01    0.00   40.52  215.30 
## 
## Coefficients:
##                                           Estimate        Std. Error t value
## (Intercept)                      -33301.0900866070  23988.2074520500  -1.388
## land_value_total                  -2920.3869524207   2799.7509292910  -1.043
## main_sqft                            55.8378979579     66.5652949434   0.839
## garage_sqft                        -765.5590030669     63.1471307680 -12.123
## garage_value                         10.0627898331      0.2704914353  37.202
## land_sqft                         16948.4185780307  16237.9474274157   1.044
## area_total_sqft                     115.2403937887     56.9747827963   2.023
## I(land_value_total^2)                -0.0063510047      0.0468630405  -0.136
## I(main_sqft^2)                        0.0560352900      0.0509159870   1.101
## I(garage_sqft^2)                      0.8821515313      0.0394884740  22.339
## I(garage_value^2)                     0.0000003396      0.0000118530   0.029
## I(land_sqft^2)                        0.2140794403      1.5766688364   0.136
## I(area_total_sqft^2)                 -0.0043050630      0.0191097567  -0.225
## land_value_total:main_sqft            0.8338762350      1.0484737628   0.795
## land_value_total:garage_sqft          1.6438229844      2.9326487547   0.561
## land_value_total:garage_value        -0.0000402454      0.0000142038  -2.833
## land_value_total:area_total_sqft      0.0022064611      0.0002563984   8.606
## main_sqft:garage_sqft                -0.2367018874      0.0330811502  -7.155
## main_sqft:garage_value                0.0038761209      0.0001977114  19.605
## main_sqft:land_sqft                  -4.8538465700      6.0820581652  -0.798
## main_sqft:area_total_sqft            -0.0455712390      0.0698499061  -0.652
## garage_sqft:garage_value             -0.0120579547      0.0014494027  -8.319
## garage_sqft:land_sqft                -9.5285074916     17.0047674499  -0.560
##                                            Pr(>|t|)    
## (Intercept)                                  0.1903    
## land_value_total                             0.3175    
## main_sqft                                    0.4180    
## garage_sqft                      0.0000000431449722 ***
## garage_value                     0.0000000000000914 ***
## land_sqft                                    0.3172    
## area_total_sqft                              0.0660 .  
## I(land_value_total^2)                        0.8944    
## I(main_sqft^2)                               0.2927    
## I(garage_sqft^2)                 0.0000000000381967 ***
## I(garage_value^2)                            0.9776    
## I(land_sqft^2)                               0.8942    
## I(area_total_sqft^2)                         0.8256    
## land_value_total:main_sqft                   0.4419    
## land_value_total:garage_sqft                 0.5854    
## land_value_total:garage_value                0.0151 *  
## land_value_total:area_total_sqft 0.0000017676926881 ***
## main_sqft:garage_sqft            0.0000115592776827 ***
## main_sqft:garage_value           0.0000000001760412 ***
## main_sqft:land_sqft                          0.4403    
## main_sqft:area_total_sqft                    0.5264    
## garage_sqft:garage_value         0.0000025131479113 ***
## garage_sqft:land_sqft                        0.5856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 164.4 on 12 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.241e+06 on 22 and 12 DF,  p-value: < 0.00000000000000022



4.4 MODEL OPTIMIZATION

To improve the interpretability and robustness of the regression model, we perform step-by-step model refinement based on feature significance (p-values) and multicollinearity diagnostics (GVIF). The goal is to simplify the model while preserving predictive power and respecting hierarchical modeling principles.

4.4.1 STEP 1: INITIAL MULTICOLLINEARITY ASSESSMENT

We begin by assessing the cleaned model using vif() to detect potential multicollinearity issues and guide feature selection.

vif(model_cleaned, type = "predictor")
##                                                                     GVIF Df
## land_value_total            25036649388093432618668060480264664400280864 18
## main_sqft                                                              1 22
## garage_sqft                                            27566856153543144 18
## garage_value        3361717872367879852604402486620006628868448264606868 14
## land_sqft        2983421214152967732042488448860600064662086462046244044  9
## area_total_sqft                 4590214097808697878824662660488828020422  9
##                  GVIF^(1/(2*Df))
## land_value_total       16.051514
## main_sqft               1.000000
## garage_sqft             2.862051
## garage_value           69.220436
## land_sqft            1062.607879
## area_total_sqft       159.747902
##                                                                                           Interacts With
## land_value_total            I(land_value_total^2), main_sqft, garage_sqft, garage_value, area_total_sqft
## main_sqft        I(main_sqft^2), land_value_total, garage_sqft, garage_value, land_sqft, area_total_sqft
## garage_sqft                       I(garage_sqft^2), land_value_total, main_sqft, garage_value, land_sqft
## garage_value                                 I(garage_value^2), land_value_total, main_sqft, garage_sqft
## land_sqft                                                         I(land_sqft^2), main_sqft, garage_sqft
## area_total_sqft                                        I(area_total_sqft^2), land_value_total, main_sqft
##                                                 Other Predictors
## land_value_total                                       land_sqft
## main_sqft                                                   --  
## garage_sqft                                      area_total_sqft
## garage_value                          land_sqft, area_total_sqft
## land_sqft        land_value_total, garage_value, area_total_sqft
## area_total_sqft             garage_sqft, garage_value, land_sqft

Observations:

  • Among the quadratic terms, only I(garage_sqft^2) is statistically significant (p < 0.001), indicating a strong nonlinear effect. Other quadratic terms—I(land_value_total^2), I(main_sqft^2), I(garage_value^2), and I(area_total_sqft^2)—are not statistically significant (p > 0.29), which weakens the justification for their inclusion despite their role in maintaining model hierarchy.
  • Several interaction terms show clear statistical significance, including:
    • land_value_total:garage_value (p < 0.05),
    • land_value_total:area_total_sqft (p < 0.001),
    • main_sqft:garage_sqft, main_sqft:garage_value, and garage_sqft:garage_value (all with p < 0.001)
  • Based on previous multicollinearity diagnostics, high GVIF values remain a concern, particularly for:
    • garage_value: GVIF ≈ 69.2
    • area_total_sqft: GVIF ≈ 159.7
    • land_value_total: GVIF ≈ 16.1
    • land_sqft: GVIF ≈ 1062.6
  • Despite multicollinearity, the first three variables are retained due to their involvement in multiple statistically significant interactions, supporting their inclusion under hierarchical modeling principles.
  • The variable land_sqft, along with I(land_sqft^2) and the associated interactions main_sqft:land_sqft and garage_sqft:land_sqft, should be removed from the model. They exhibit very high multicollinearity, lack statistical significance, and do not contribute meaningfully to the model’s predictive power.

4.4.2 STEP 2: REDUCED MODEL

After removing land_sqft, we re-estimate the model and reassess both statistical significance and collinearity metrics.

model_reduced <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) +  I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
    garage_sqft:garage_value,
  data =  data_cleaned
)

summary(model_reduced)
## 
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft + 
##     garage_sqft + garage_value + area_total_sqft + I(land_value_total^2) + 
##     I(main_sqft^2) + I(garage_sqft^2) + I(garage_value^2) + I(area_total_sqft^2) + 
##     land_value_total:main_sqft + land_value_total:garage_sqft + 
##     land_value_total:garage_value + land_value_total:area_total_sqft + 
##     main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft + 
##     garage_sqft:garage_value, data = data_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -308.84  -61.61    0.00   78.05  241.54 
## 
## Coefficients:
##                                          Estimate       Std. Error t value
## (Intercept)                      -42563.625366632  15508.741695771  -2.744
## land_value_total                      1.806761414      0.216719726   8.337
## main_sqft                            69.761883902     56.808643123   1.228
## garage_sqft                        -792.891203994     52.010135912 -15.245
## garage_value                         10.455571957      0.184198124  56.763
## area_total_sqft                     108.502398874     50.758240565   2.138
## I(land_value_total^2)                 0.000012278      0.000004675   2.627
## I(main_sqft^2)                        0.048351136      0.046676989   1.036
## I(garage_sqft^2)                      0.876504263      0.026691211  32.839
## I(garage_value^2)                    -0.000001470      0.000004972  -0.296
## I(area_total_sqft^2)                 -0.007460778      0.017893347  -0.417
## land_value_total:main_sqft           -0.003076362      0.000242480 -12.687
## land_value_total:garage_sqft          0.000352183      0.000411437   0.856
## land_value_total:garage_value        -0.000034838      0.000004862  -7.165
## land_value_total:area_total_sqft      0.002328307      0.000187760  12.400
## main_sqft:garage_sqft                -0.218463330      0.029943718  -7.296
## main_sqft:garage_value                0.003633922      0.000090667  40.080
## main_sqft:area_total_sqft            -0.036854201      0.064480738  -0.572
## garage_sqft:garage_value             -0.011783771      0.000624665 -18.864
##                                              Pr(>|t|)    
## (Intercept)                                    0.0144 *  
## land_value_total                 0.000000323432301037 ***
## main_sqft                                      0.2372    
## garage_sqft                      0.000000000059915986 ***
## garage_value                     < 0.0000000000000002 ***
## area_total_sqft                                0.0483 *  
## I(land_value_total^2)                          0.0183 *  
## I(main_sqft^2)                                 0.3157    
## I(garage_sqft^2)                 0.000000000000000413 ***
## I(garage_value^2)                              0.7714    
## I(area_total_sqft^2)                           0.6822    
## land_value_total:main_sqft       0.000000000914576672 ***
## land_value_total:garage_sqft                   0.4046    
## land_value_total:garage_value    0.000002246608368307 ***
## land_value_total:area_total_sqft 0.000000001276921815 ***
## main_sqft:garage_sqft            0.000001794065466160 ***
## main_sqft:garage_value           < 0.0000000000000002 ***
## main_sqft:area_total_sqft                      0.5756    
## garage_sqft:garage_value         0.000000000002352542 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 174 on 16 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.355e+06 on 18 and 16 DF,  p-value: < 0.00000000000000022
vif(model_reduced, type = "predictor")
##                              GVIF Df GVIF^(1/(2*Df))
## land_value_total                1 18        1.000000
## main_sqft                       1 18        1.000000
## garage_sqft      1598310432474622 14        3.491306
## garage_value     1598310432474622 14        3.491306
## area_total_sqft      374583515959  9        4.395162
##                                                                                Interacts With
## land_value_total I(land_value_total^2), main_sqft, garage_sqft, garage_value, area_total_sqft
## main_sqft        I(main_sqft^2), land_value_total, garage_sqft, garage_value, area_total_sqft
## garage_sqft                       I(garage_sqft^2), land_value_total, main_sqft, garage_value
## garage_value                      I(garage_value^2), land_value_total, main_sqft, garage_sqft
## area_total_sqft                             I(area_total_sqft^2), land_value_total, main_sqft
##                           Other Predictors
## land_value_total                      --  
## main_sqft                             --  
## garage_sqft                area_total_sqft
## garage_value               area_total_sqft
## area_total_sqft  garage_sqft, garage_value

Observations:

  • Among the quadratic terms, only I(garage_sqft^2) (p < 0.001) and I(land_value_total^2) (p = 0.0183) are statistically significant, indicating meaningful nonlinear effects. In contrast, I(main_sqft^2) is not statistically significant (p > 0.31), but is retained to preserve model hierarchy.
  • Several interaction terms are clearly significant, including:
    • land_value_total:main_sqft, land_value_total:garage_value, and land_value_total:area_total_sqft (all with p < 0.001),
    • main_sqft:garage_sqft, main_sqft:garage_value, and garage_sqft:garage_value (all with p < 0.001)
  • Based on GVIF diagnostics, no severe multicollinearity is detected:
    • land_value_total and main_sqft: GVIF = 1.0 (ideal)
    • garage_sqft and garage_value: GVIF ≈ 3.49
    • area_total_sqft: GVIF ≈ 4.40
  • These GVIF values are within acceptable ranges (GVIF^(1/2Df) < 5), especially in models with interactions and polynomial terms.
  • The terms I(garage_value^2), I(area_total_sqft^2), and land_value_total:garage_sqft should be removed from the model due to lack of statistical significance and contribution. The resulting model remains robust, interpretable, and free from multicollinearity concerns.

4.4.3 STEP 3: FINAL MODEL ESTIMATION

We now fit the final model, keeping only the most relevant predictors and interactions while ensuring that model hierarchy is respected and multicollinearity is minimized.

model_final <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
    garage_sqft:garage_value,
  data =  data_cleaned
)

summary(model_final)
## 
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft + 
##     garage_sqft + garage_value + area_total_sqft + I(land_value_total^2) + 
##     I(main_sqft^2) + I(garage_sqft^2) + land_value_total:main_sqft + 
##     land_value_total:garage_value + land_value_total:area_total_sqft + 
##     main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft + 
##     garage_sqft:garage_value, data = data_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -305.23  -65.61   -9.06   95.84  201.65 
## 
## Coefficients:
##                                          Estimate       Std. Error  t value
## (Intercept)                      -39523.460319117  13845.603658233   -2.855
## land_value_total                      1.828339001      0.201110709    9.091
## main_sqft                            36.917779982     18.941632266    1.949
## garage_sqft                        -826.150091785     14.535967614  -56.835
## garage_value                         10.535229332      0.152653509   69.014
## area_total_sqft                     137.691125245     10.246939174   13.437
## I(land_value_total^2)                 0.000008312      0.000003081    2.698
## I(main_sqft^2)                        0.070732169      0.004803734   14.724
## I(garage_sqft^2)                      0.871028622      0.005609366  155.281
## land_value_total:main_sqft           -0.003050124      0.000194208  -15.705
## land_value_total:garage_value        -0.000034740      0.000001575  -22.051
## land_value_total:area_total_sqft      0.002489153      0.000082333   30.233
## main_sqft:garage_sqft                -0.196230925      0.005212665  -37.645
## main_sqft:garage_value                0.003586134      0.000071119   50.425
## main_sqft:area_total_sqft            -0.067937221      0.004113243  -16.517
## garage_sqft:garage_value             -0.011921708      0.000071596 -166.514
##                                              Pr(>|t|)    
## (Intercept)                                    0.0101 *  
## land_value_total                  0.00000002386595202 ***
## main_sqft                                      0.0662 .  
## garage_sqft                      < 0.0000000000000002 ***
## garage_value                     < 0.0000000000000002 ***
## area_total_sqft                   0.00000000003748596 ***
## I(land_value_total^2)                          0.0143 *  
## I(main_sqft^2)                    0.00000000000762333 ***
## I(garage_sqft^2)                 < 0.0000000000000002 ***
## land_value_total:main_sqft        0.00000000000244574 ***
## land_value_total:garage_value     0.00000000000000536 ***
## land_value_total:area_total_sqft < 0.0000000000000002 ***
## main_sqft:garage_sqft            < 0.0000000000000002 ***
## main_sqft:garage_value           < 0.0000000000000002 ***
## main_sqft:area_total_sqft         0.00000000000099970 ***
## garage_sqft:garage_value         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 169.2 on 19 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.72e+06 on 15 and 19 DF,  p-value: < 0.00000000000000022
vif(model_final, type = "predictor")
##                          GVIF Df GVIF^(1/(2*Df))
## land_value_total     33474318 11        2.198024
## main_sqft                   1 15        1.000000
## garage_sqft      398369889288  8        5.309063
## garage_value       3020696856 12        2.483159
## area_total_sqft      94187384  8        3.150464
##                                                                                Interacts With
## land_value_total              I(land_value_total^2), main_sqft, garage_value, area_total_sqft
## main_sqft        I(main_sqft^2), land_value_total, garage_sqft, garage_value, area_total_sqft
## garage_sqft                                         I(garage_sqft^2), main_sqft, garage_value
## garage_value                                         land_value_total, main_sqft, garage_sqft
## area_total_sqft                                                   land_value_total, main_sqft
##                                   Other Predictors
## land_value_total                       garage_sqft
## main_sqft                                     --  
## garage_sqft      land_value_total, area_total_sqft
## garage_value                       area_total_sqft
## area_total_sqft          garage_sqft, garage_value

Observations:

  • Among the quadratic terms, all three—I(land_value_total^2), I(main_sqft^2), and I(garage_sqft^2)—are statistically significant (p < 0.05), supporting the presence of nonlinear effects and justifying the inclusion of their corresponding linear terms to maintain model hierarchy.
  • Several interaction terms are highly significant, particularly:
    • land_value_total:main_sqft, land_value_total:garage_value, and land_value_total:area_total_sqft (all with p < 0.001),
    • main_sqft:garage_sqft, main_sqft:garage_value, main_sqft:area_total_sqft, and garage_sqft:garage_value (all with p < 0.001)
  • Based on GVIF diagnostics, no problematic multicollinearity is present:
    • main_sqft: GVIF = 1.0
    • land_value_total: GVIF^(1/2Df) ≈ 2.20
    • garage_value: GVIF^(1/2Df) ≈ 2.48
    • area_total_sqft: GVIF^(1/2Df) ≈ 3.15
    • garage_sqft: GVIF^(1/2Df) ≈ 5.31
  • Although garage_sqft reaches a moderate GVIF, it is involved in multiple highly significant interactions and a strong nonlinear effect (I(garage_sqft^2)), supporting its retention.
  • The final model shows a good fit (Adjusted R² = 1), strong predictive terms, and acceptable collinearity levels. All retained variables are either statistically significant or justified through hierarchical structure and interactions. No further reduction is recommended.



5 PREDICTION FOR TARGET PROPERTY

With the final regression model established and validated, we now use it to generate a predicted 2025 market value for the target property located at 6321 88th Street. This estimate is derived from the same regression equation used to model the neighborhood, applying the physical and structural features specific to the target property.

To assess the reliability and precision of this prediction, both a 95% confidence interval and a 95% prediction interval are computed:

The confidence interval provides a range for the expected mean market value of homes with similar characteristics.

The prediction interval offers a wider range that captures the likely value of the individual property at 6321 88th Street, accounting for both model uncertainty and natural variability.

test_data <- data_6321

# Confidence and prediction intervals for 6321
ci_6321 <- predict(model_final, newdata = test_data, interval = "confidence", level = 0.95)
pi_6321 <- predict(model_final, newdata = test_data, interval = "prediction", level = 0.95)

# Display results
cat("Prediction and Interval Estimates for 6321 88th Street:\n",
    "-> Predicted Market Value (2025): $", round(ci_6321[1], 2), "\n",
    "-> Confidence Interval (95%): $", round(ci_6321[2], 2), "– $", round(ci_6321[3], 2), "\n",
    "-> Prediction Interval (95%): $", round(pi_6321[2], 2), "– $", round(pi_6321[3], 2), "\n"
)
## Prediction and Interval Estimates for 6321 88th Street:
##  -> Predicted Market Value (2025): $ 538535.6 
##  -> Confidence Interval (95%): $ 538362.1 – $ 538709 
##  -> Prediction Interval (95%): $ 538141.3 – $ 538929.8

Observations

  • The predicted 2025 market value for the property at 6321 88th Street is approximately \(\text{\$ 538,535.60}\), based on the final regression model.
  • The 95% confidence interval, which ranges from \(\text{\$ 538,362.10}\) to \(\text{\$ 538,709.00}\), represents the range in which we expect the average market value of all properties with similar characteristics to fall.
  • The 95% prediction interval, ranging from \(\text{\$ 538,141.30}\) to \(\text{\$ 538,929.80}\), is wider and reflects the range in which we expect the individual market value of this specific property to fall.
  • The actual assessed value of the property, \(\text{\$ 538,409}\), falls within both the confidence and prediction intervals. This suggests that:
    • The assigned value is statistically consistent with the model’s estimate.
    • There is no strong evidence to claim the property has been over- or under-valued.



6 CONCLUSIONS

Data Collection and Preparation

  • The dataset included 45 residential properties, narrowed to 40 valid records after filtering and cleaning.
  • Redundant and derived variables were excluded to maintain independence among predictors and prevent overfitting.
  • Only numeric and structurally meaningful variables were retained to support regression modeling.

Exploratory Data Analysis (EDA)

  • Descriptive statistics and boxplots revealed skewed distributions and outliers in variables like market_value_2025 and garage_sqft.
  • Strong correlations with the target variable were observed for main_sqft, land_value_total, and garage_value.
  • Variables directly embedded in the response (e.g., main_value, improvement_value_total) were excluded to avoid multicollinearity and model leakage.

Model Building

  • An initial model with main, interaction, and quadratic terms showed high explanatory power (R² ≈ 0.9999) but included several statistically insignificant or redundant terms.
  • Diagnostics indicated possible distortion from influential observations (e.g., Data Points 1, 3, 11, 37), requiring iterative refinement.

Model Diagnostics and Re-estimation

  • Removing the most influential observations led to a notable reduction in residual standard error (from 1901 to 160.5), while maintaining Adjusted R² = 1.00.
  • Garage-related variables consistently emerged as the strongest predictors of market value.
  • Sensitivity analysis identified Data Point 9 as an additional influencer, which was also excluded in the final model.

Model Optimization

  • The model was optimized using p-value thresholds and GVIF multicollinearity diagnostics.
  • Irrelevant and collinear terms (e.g., land_sqft, certain interactions) were removed.
  • The final model retained only significant main effects, quadratic terms, and interactions, ensuring interpretability and structural validity.
  • However, it is important to note that the final model’s R² is exactly 1, which, while indicative of excellent fit, also raises a potential red flag for overfitting. Future analyses should consider:
    • Evaluating the model on new or holdout data.
    • Reducing model complexity if performance degrades out-of-sample.

Prediction for Target Property (6321 88th Street)

  • The predicted 2025 market value is \(\text{\$ 538,535.60}\), with:
    • 95% Confidence Interval: \(\text{\$ 538,362.10}\) - \(\text{\$ 538,709.00}\)
    • 95% Prediction Interval: \(\text{\$ 538,141.30}\) - \(\text{\$ 538,929.80}\)
  • The actual assessed value (\(\text{\$ 538,409}\)) falls within both intervals, indicating that it is statistically reasonable given the property’s characteristics and the neighborhood model.

The model supports the conclusion that the assessed value of 6321 88th Street is fair and consistent with the valuation of similar nearby properties. The final model is statistically robust, yet the perfect R² calls for further validation to rule out overfitting. At present, there is no strong evidence to suggest that the property has been misvalued



7 COMPLETE R-CODE

# Libraries
library(dplyr)
library(tidyr)
library(MASS)
library(psych)
library(ggcorrplot)
library(ggplot2)
library(car)

# SECTION 1: EXPLORATORY DATA ANALYSIS (EDA) ----------------------------------------------------------
# Create the dataframe
# Raw GitHub URL
url <- "https://raw.githubusercontent.com/JairoRodriguezB/Datasets/refs/heads/main/Property_Assessment_88thStreet_2025.csv"
# Read the CSV file
data <- read.csv(url)
head(data)

# INITIAL DATA ANALYSIS
# Rename columns to facilitate analysis
colnames(data) <- c(
  "property_id",
  "address",
  "property_type",
  "assessed_value_2025",
  "market_value_2025",
  "improvement_value_total",
  "land_value_total",
  "homestead_cap_loss",
  "area_total_sqft",
  "main_sqft",
  "main_value",
  "garage_sqft",
  "garage_value",
  "pool_sqft",
  "pool_value",
  "land_sqft"
)
# Convert selected columns from character to integer, replacing "N/A" with NA
data <- data %>%
  mutate(
    improvement_value_total = as.integer(na_if(as.character(improvement_value_total), "N/A")),
    land_value_total = as.integer(na_if(as.character(land_value_total), "N/A")),
    homestead_cap_loss = as.integer(na_if(as.character(homestead_cap_loss), "N/A")),
    area_total_sqft = as.integer(na_if(as.character(area_total_sqft), "N/A")),
    main_sqft = as.integer(na_if(as.character(main_sqft), "N/A")),
    main_value = as.integer(na_if(as.character(main_value), "N/A")),
    garage_sqft = as.integer(na_if(as.character(garage_sqft), "N/A")),
    garage_value = as.integer(na_if(as.character(garage_value), "N/A")),
    pool_sqft = as.integer(na_if(as.character(pool_sqft), "N/A")),
    pool_value = as.integer(na_if(as.character(pool_value), "N/A")),
    land_sqft = as.integer(na_if(as.character(land_sqft), "N/A"))
  )

# Initial Data Inspection
# Summary statistics for numerical variables
summary(data)  
# Overview of the dataset structure
glimpse(data)   
# Count the number of missing values (NA) in each column
colSums(is.na(data))

# DATA FILTERING AND PREPARATION
unique(data$property_type)
data_real <- data %>%
  filter(property_type == "Real")
data_real <- data_real %>%
  mutate(street_number = substr(address, 1, 4)) %>%
  relocate(street_number, .after = property_id)
glimpse(data_real)

# Data Cleaning – Removal of Irrelevant or Redundant Variables
# List of columns to retain for analysis
cols_keep <- c(
  "property_id",
  "street_number",
  "market_value_2025",
  "improvement_value_total",
  "land_value_total",
  "area_total_sqft",
  "main_sqft",
  "main_value",
  "garage_sqft",
  "garage_value",
  "land_sqft"
)
# Filter the dataframe to keep only the selected columns
data_real <- data_real[, names(data_real) %in% cols_keep]
# Count the number of missing values (NA) in each column
colSums(is.na(data_real))
# Remove rows where key variables are NA
data_real <- data_real %>%
  filter(!is.na(main_value) & !is.na(garage_value))
# Verify missing values
colSums(is.na(data_real))

# DISTRIBUTION AND RELATIONSHIP ANALYSIS
# DISTRIBUTION ANALYSIS
# Boxplot Visualization
options(scipen = 999)
numeric_vars <- data_real[, sapply(data_real, is.numeric)]
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1),
    cex.main = 1.2, cex.axis = 0.9)

for (var in names(numeric_vars)) {
  boxplot(numeric_vars[[var]],
          main = var,
          col = "lightblue",
          horizontal = TRUE)
}

# RELATIONSHIP ANALYSIS
# Correlation Matrix Visualization
numeric_vars <- data_real[, sapply(data_real, is.numeric)]
cor_matrix <- cor(numeric_vars)
ggcorrplot(cor_matrix, 
           lab = TRUE,                          # Mostrar coeficientes
           colors = c("red", "white", "#4A90E2"),  # Rango de colores
           outline.color = "black",             # Borde de cada celda
           show.legend = TRUE,                  # Mostrar leyenda
           title = "Correlation Matrix of Numeric Variables",
)
# -----------------------------------------------------------------------------------------------------


# SECTION 2: MODEL BUILDING ---------------------------------------------------------------------------
# Extract the property with street_number 6321 and store it in data_6321
data_6321 <- data_real[data_real$street_number == 6321, ]

# Remove that row from the training dataset
data_real <- data_real[data_real$street_number != 6321, ]
rownames(data_real) <- NULL

model_full <- lm(
  market_value_2025 ~ 
    (land_value_total + main_sqft +
     garage_sqft + garage_value + land_sqft + area_total_sqft)^2 +
    I(land_value_total^2) + I(main_sqft^2) +
    I(garage_sqft^2) + I(garage_value^2) + I(land_sqft^2) +
    I(area_total_sqft^2),
  data = data_real
)
summary(model_full)

# Initial Predictor Evaluation
model2<- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
    garage_sqft:garage_value + garage_sqft:land_sqft,
  data = data_real
)

summary(model2)


# OUTLIER AND INFLUENCE DIAGNOSTICS
# DIAGNOSTICS IMPLEMENTATION
# Creating combined diagnostics dataframe
diagnostics_df <- data.frame(
  Obs = 1:nrow(data_real),
  Residual = rstudent(model2),
  Leverage = hatvalues(model2),
  CooksD = cooks.distance(model2)
)

# Thresholds for leverage and Cook's Distance
n <- nrow(data_real)
p <- length(coef(model2))
leverage_thresh <- 2 * p / n
cook_thresh <- 4 / n

# Flagging high influence points
diagnostics_df$High_Residual <- abs(diagnostics_df$Residual) > 2
diagnostics_df$High_Leverage <- diagnostics_df$Leverage > leverage_thresh
diagnostics_df$High_CooksD <- diagnostics_df$CooksD > cook_thresh

# Sorting by Cook's Distance
diagnostics_df <- diagnostics_df[order(-diagnostics_df$CooksD), ]

# Display top influential points
rownames(diagnostics_df) <- NULL
rmarkdown::paged_table(diagnostics_df)

plot(model2,5)
plot(model2,6)

# RE-ESTIMATE THE MODEL WITHOUT INFLUENTIAL DATA POINTS
# MODEL RE-ESTIMATION
# Remove influential observations
data_no_points <- data_real[-c(1, 11, 3, 37), ]

# Fit the new model
model_no_points <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
    garage_sqft:garage_value + garage_sqft:land_sqft,
  data = data_no_points)

summary(model_no_points)

# TESTING ADDITIONAL OBSERVATIONS
# List of observations to evaluate
outliers_to_test <- c(9, 32, 39)

for (obs in outliers_to_test) {
  cat("\n================ Data Point Removed:", obs, "================\n")
  
   # Remove observation
  temp_data <- data_no_points[-obs, ]
  
  # Refit the model
  temp_model <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
    garage_sqft:garage_value + garage_sqft:land_sqft,
  data = temp_data)
  
  # Model summary
  s <- summary(temp_model)
  f <- s$fstatistic
  f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE)

  # Display global metrics
  cat("Residual standard error:", format(s$sigma, scientific = TRUE), "on", s$df[2], "degrees of freedom\n")
  cat("Multiple R-squared:", round(s$r.squared, 4), "\tAdjusted R-squared:", round(s$adj.r.squared, 4), "\n")
  cat("F-statistic:", format(f[1], scientific = TRUE), "on", f[2], "and", f[3], "DF, p-value:", format.pval(f_pval, digits = 4, eps = .0001), "\n\n")
  
  # Display rounded p-values
  pvals <- round(s$coefficients[, "Pr(>|t|)"], 4)
  for (i in seq_along(pvals)) {
    cat(names(pvals)[i], ":", pvals[i], "\n")
  }
}

# FINAL CLEANED MODEL

# Remove influential observations
data_cleaned <- data_real[-c(1, 11, 3, 37, 9), ]

# Refit the regression model using the cleaned dataset
model_cleaned <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
    garage_sqft:garage_value + garage_sqft:land_sqft,
  data =  data_cleaned
)

# Display key model results
summary(model_cleaned)

# MODEL OPTIMIZATION
# STEP 1: INITIAL MULTICOLLINEARITY ASSESSMENT

vif(model_cleaned, type = "predictor")

# STEP 2: REDUCED MODEL
model_reduced <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    I(garage_value^2) +  I(area_total_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
    garage_sqft:garage_value,
  data =  data_cleaned
)
summary(model_reduced)
vif(model_reduced, type = "predictor")

# STEP 3: FINAL MODEL ESTIMATION
model_final <- lm(
  market_value_2025 ~ 
    land_value_total + main_sqft + garage_sqft + garage_value + area_total_sqft +
    I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
    land_value_total:main_sqft + land_value_total:garage_value +
    land_value_total:area_total_sqft +
    main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
    garage_sqft:garage_value,
  data =  data_cleaned
)
summary(model_final)
vif(model_final, type = "predictor")
# -----------------------------------------------------------------------------------------------------


# SECTION 3: PREDICTION FOR TARGET PROPERTY -----------------------------------------------------------
test_data <- data_6321
# Confidence and prediction intervals for 6321
ci_6321 <- predict(model_final, newdata = test_data, interval = "confidence", level = 0.95)
pi_6321 <- predict(model_final, newdata = test_data, interval = "prediction", level = 0.95)

# Display results
cat("Prediction and Interval Estimates for 6321 88th Street:\n",
    "-> Predicted Market Value (2025): $", round(ci_6321[1], 2), "\n",
    "-> Confidence Interval (95%): $", round(ci_6321[2], 2), "– $", round(ci_6321[3], 2), "\n",
    "-> Prediction Interval (95%): $", round(pi_6321[2], 2), "– $", round(pi_6321[3], 2), "\n"
)
# -----------------------------------------------------------------------------------------------------