In property taxation, accurate market value assessment is essential to ensure that homeowners contribute fairly to municipal revenue without being overcharged. However, in practice, the valuation process is often based on the discretion of the county tax assessor rather than strict statistical procedures. This raises the possibility of discrepancies between a property’s true market value and its assessed value.
This study focuses on the residential property located at 6321 88th Street, Lubbock, Texas, and aims to determine whether its 2025 assessed market value is statistically justified when compared to similar homes on the same street (address range 6309–6351). Given the availability of detailed property data—including land size, heated area, garage size, and their respective values—multiple regression modeling provides a suitable framework to assess this fairness.
This report applies Exploratory Data Analysis (EDA) and Multiple Linear Regression Modeling to estimate property market values based on physical and structural characteristics. The analysis is structured as follows:
Data Collection and Initial Preparation: Manual extraction of property data from the Lubbock Central Appraisal District website, followed by standardization and cleaning of variables.
Exploratory Data Analysis (EDA): Statistical and visual examination of distributions, detection of missing values, and correlation analysis to understand variable relationships and potential model risks (e.g., multicollinearity).
Model Construction and Diagnostics: Fitting an initial full regression model with main effects, interaction, and quadratic terms, followed by checks for significance, residual behavior, and influential points.
Model Refinement and Optimization: Stepwise reduction of predictors based on p-values and multicollinearity (GVIF) to improve interpretability while maintaining predictive accuracy.
Prediction and Interval Estimation: Using the final model to predict the market value of 6321 88th Street, with 95% confidence and prediction intervals to assess valuation reliability.
The dataset was manually collected from the official website of the Lubbock Central Appraisal District at https://lubbockcad.org. The goal was to gather appraisal information on properties located along 88th Street in Lubbock, Texas, specifically within the street number range 6309 to 6351, which includes the target property: 6321 88th Street.
The data was obtained using the Advanced Search tool. By entering the street name (“88th St”), selecting the 2025 tax year
A list of relevant residential properties was generated. Each property record was accessed individually, and relevant details were manually recorded. The resulting dataset includes appraisal information for each property, with the following variables:
Property ID: Unique property identifier.
Situs Address: Property’s physical location.
Property Type: Classification (e.g., Residential Single Family).
2025 Assessed Value: Taxable value used to calculate property taxes.
2025 Market Value: Total market value (land + improvements).
Total Improvement Market Value: Appraised value of all structures.
Total Land Market Value: Appraised value of the land portion only.
Homestead Cap Loss: Tax discount based on the Texas 10% cap rule.
Total Main Area (Sq. Ft.): Enclosed area (heated + unheated).
Main Area (Sq. Ft.): Heated, livable square footage.
Main Area (Value): Value attributed to the livable area.
Garage (Sq. Ft.): Unheated/non-livable space (e.g., garage).
Garage (Value): Value of the garage or similar structures.
Pool (Sq. Ft.): Surface area of pool (if applicable).
Pool (Value): Appraised value of the pool.
Land (Sq. Ft.): Lot size in square feet.
All collected data was organized and stored in a CSV file, which was uploaded to a public GitHub repository. The dataset can be accessed directly via the following link:
This section provides an exploration of the dataset to prepare it for regression modeling. The process begins with importing and inspecting the raw data, followed by renaming variables, converting data types, and handling missing values. The dataset is then filtered to retain only relevant residential properties and core numerical predictors.
Subsequent analyses focus on understanding the distribution of each variable through visual tools like boxplots and evaluating the relationships between features using a correlation matrix. These exploratory steps are essential for identifying patterns, detecting outliers, and diagnosing potential multicollinearity, informing later decisions on model construction and variable selection.
In this section, the dataset is imported directly from a public GitHub repository and prepared for analysis. Variable names are standardized for clarity, relevant columns are converted to numeric types, and missing values are addressed. Summary statistics and data structure are then reviewed to gain an initial understanding of the properties’ characteristics and to identify any potential data quality issues before proceeding with deeper analysis.
DATA LOADING
# Create the dataframe
# Raw GitHub URL
url <- "https://raw.githubusercontent.com/JairoRodriguezB/Datasets/refs/heads/main/Property_Assessment_88thStreet_2025.csv"
# Read the CSV file
data <- read.csv(url)
rmarkdown::paged_table(data)
INITIAL DATA ANALYSIS
To prepare the dataset for analysis, variable names were standardized for clarity and consistency, and key columns were converted from character strings to integers. Missing values originally labeled as “N/A” were appropriately recoded as NA to ensure proper handling in subsequent statistical procedures.
Variable Renaming and Type Conversion
# Rename columns to facilitate analysis
colnames(data) <- c(
"property_id",
"address",
"property_type",
"assessed_value_2025",
"market_value_2025",
"improvement_value_total",
"land_value_total",
"homestead_cap_loss",
"area_total_sqft",
"main_sqft",
"main_value",
"garage_sqft",
"garage_value",
"pool_sqft",
"pool_value",
"land_sqft"
)
# Convert selected columns from character to integer, replacing "N/A" with NA
data <- data %>%
mutate(
improvement_value_total = as.integer(na_if(as.character(improvement_value_total), "N/A")),
land_value_total = as.integer(na_if(as.character(land_value_total), "N/A")),
homestead_cap_loss = as.integer(na_if(as.character(homestead_cap_loss), "N/A")),
area_total_sqft = as.integer(na_if(as.character(area_total_sqft), "N/A")),
main_sqft = as.integer(na_if(as.character(main_sqft), "N/A")),
main_value = as.integer(na_if(as.character(main_value), "N/A")),
garage_sqft = as.integer(na_if(as.character(garage_sqft), "N/A")),
garage_value = as.integer(na_if(as.character(garage_value), "N/A")),
pool_sqft = as.integer(na_if(as.character(pool_sqft), "N/A")),
pool_value = as.integer(na_if(as.character(pool_value), "N/A")),
land_sqft = as.integer(na_if(as.character(land_sqft), "N/A"))
)
Initial Data Inspection
# Summary statistics for numerical variables
summary(data)
## property_id address property_type assessed_value_2025
## Length:45 Length:45 Length:45 Min. : 3000
## Class :character Class :character Class :character 1st Qu.: 498357
## Mode :character Mode :character Mode :character Median : 532125
## Mean : 533709
## 3rd Qu.: 580180
## Max. :1218146
##
## market_value_2025 improvement_value_total land_value_total homestead_cap_loss
## Min. : 3000 Min. : 373092 Min. : 43506 Min. :0
## 1st Qu.: 498357 1st Qu.: 460406 1st Qu.: 44986 1st Qu.:0
## Median : 532125 Median : 490236 Median : 45658 Median :0
## Mean : 533709 Mean : 519710 Mean : 50480 Mean :0
## 3rd Qu.: 580180 3rd Qu.: 535879 3rd Qu.: 47379 3rd Qu.:0
## Max. :1218146 Max. :1116617 Max. :101529 Max. :0
## NA's :3 NA's :3 NA's :3
## area_total_sqft main_sqft main_value garage_sqft
## Min. :2747 Min. :2241 Min. :338833 Min. : 479.0
## 1st Qu.:3160 1st Qu.:2614 1st Qu.:419843 1st Qu.: 528.0
## Median :3362 Median :2763 Median :444984 Median : 551.0
## Mean :3637 Mean :2857 Mean :461270 Mean : 695.0
## 3rd Qu.:3913 3rd Qu.:2953 3rd Qu.:468781 3rd Qu.: 917.8
## Max. :7117 Max. :4582 Max. :899451 Max. :1119.0
## NA's :3 NA's :3 NA's :4 NA's :3
## garage_value pool_sqft pool_value land_sqft
## Min. :32308 Min. : 0.00 Min. : 0 Min. : 7501
## 1st Qu.:38290 1st Qu.: 0.00 1st Qu.: 0 1st Qu.: 7756
## Median :40721 Median : 0.00 Median : 0 Median : 7872
## Mean :50217 Mean : 85.69 Mean : 6039 Mean : 8695
## 3rd Qu.:66658 3rd Qu.: 0.00 3rd Qu.: 0 3rd Qu.: 8169
## Max. :96763 Max. :1468.00 Max. :120403 Max. :17505
## NA's :4 NA's :3 NA's :3 NA's :3
# Overview of the dataset structure
glimpse(data)
## Rows: 45
## Columns: 16
## $ property_id <chr> "R322646", "R322647", "R322648", "R322649", "R…
## $ address <chr> "6310 88TH ST, LUBBOCK, TX 79424", "6312 88TH…
## $ property_type <chr> "Real", "Real", "Real", "Real", "Real", "Real"…
## $ assessed_value_2025 <int> 663907, 602427, 569992, 735026, 514098, 434345…
## $ market_value_2025 <int> 663907, 602427, 569992, 735026, 514098, 434345…
## $ improvement_value_total <int> 603222, 538677, 511992, 677026, 465906, 390839…
## $ land_value_total <int> 60685, 63750, 58000, 58000, 48192, 43506, 4565…
## $ homestead_cap_loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ area_total_sqft <int> 4304, 4186, 4001, 4525, 3121, 2901, 3286, 3923…
## $ main_sqft <int> 3226, 3277, 3036, 3462, 2593, 2417, 2758, 2927…
## $ main_value <int> NA, 477232, 447924, 593404, 426798, 358531, 44…
## $ garage_sqft <int> 1078, 909, 965, 1063, 528, 484, 528, 996, 528,…
## $ garage_value <int> NA, 61445, 64068, 83622, 39108, 32308, 38545, …
## $ pool_sqft <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pool_value <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ land_sqft <int> 10463, 10625, 10000, 10000, 8309, 7501, 7872, …
# Count the number of missing values (NA) in each column
colSums(is.na(data))
## property_id address property_type
## 0 0 0
## assessed_value_2025 market_value_2025 improvement_value_total
## 0 0 3
## land_value_total homestead_cap_loss area_total_sqft
## 3 3 3
## main_sqft main_value garage_sqft
## 3 4 3
## garage_value pool_sqft pool_value
## 4 3 3
## land_sqft
## 3
Observations
The dataset contains 45 residential properties on 88th Street, each described by 16 variables including identifiers, assessed and market values, structural dimensions, and improvement components.
Property ID, address, and property type are character variables used for reference and classification; the rest are numeric and suitable for regression modeling.
Key variables show the following central tendencies:
Assessed and Market Values (2025): Median values are approximately $532,125 and $533,709 respectively, with a maximum property value of over $1.2 million.
Improvement and Land Values: Mean improvement value is $519,710; land value averages around $50,480.
Main and Garage Areas: Heated living space (main_sqft) ranges from 2,241 to 4,582 sq. ft., while garage space varies between 479 and 1,119 sq. ft.
Missing values are present in several fields:
Main Area Value and Garage Value have 4 missing entries each.
Other variables like improvement_value_total, land_value_total, area_total_sqft, and land_sqft have 3 missing values.
Pool data (sq. ft. and value) are zero or NA for most properties, indicating that pools are not a common feature in this sample
In this section, we refine the dataset by selecting only relevant property types and variables necessary for the regression analysis. Specifically, we filter out non-residential property types, extract useful identifiers, and retain only numeric variables that contribute to the predictive modeling of property values. Basic data cleaning is also performed to remove rows with missing values in key features.
Data Filtering
To ensure accuracy and relevance, only properties classified as “Real”—representing land and buildings—were retained for analysis. “Personal” properties, such as movable assets, were excluded. Additionally, street numbers were extracted to support identification.
unique(data$property_type)
## [1] "Real" "Personal"
data_real <- data %>%
filter(property_type == "Real")
data_real <- data_real %>%
mutate(street_number = substr(address, 1, 4)) %>%
relocate(street_number, .after = property_id)
glimpse(data_real)
## Rows: 42
## Columns: 17
## $ property_id <chr> "R322646", "R322647", "R322648", "R322649", "R…
## $ street_number <chr> "6310", "6312", "6311", "6309", "6349", "6347"…
## $ address <chr> "6310 88TH ST, LUBBOCK, TX 79424", "6312 88TH…
## $ property_type <chr> "Real", "Real", "Real", "Real", "Real", "Real"…
## $ assessed_value_2025 <int> 663907, 602427, 569992, 735026, 514098, 434345…
## $ market_value_2025 <int> 663907, 602427, 569992, 735026, 514098, 434345…
## $ improvement_value_total <int> 603222, 538677, 511992, 677026, 465906, 390839…
## $ land_value_total <int> 60685, 63750, 58000, 58000, 48192, 43506, 4565…
## $ homestead_cap_loss <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ area_total_sqft <int> 4304, 4186, 4001, 4525, 3121, 2901, 3286, 3923…
## $ main_sqft <int> 3226, 3277, 3036, 3462, 2593, 2417, 2758, 2927…
## $ main_value <int> NA, 477232, 447924, 593404, 426798, 358531, 44…
## $ garage_sqft <int> 1078, 909, 965, 1063, 528, 484, 528, 996, 528,…
## $ garage_value <int> NA, 61445, 64068, 83622, 39108, 32308, 38545, …
## $ pool_sqft <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pool_value <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ land_sqft <int> 10463, 10625, 10000, 10000, 8309, 7501, 7872, …
Data Cleaning – Removal of Irrelevant or Redundant Variables
To streamline the dataset and retain only variables relevant for the implementation of the regression model, the following columns were removed:
address
: This column is a textual
identifier used solely for property location. It provides no predictive
information and is not suitable for regression modeling.property_type
: All observations are of
type “Real”, offering no variability and therefore no contribution to
explaining market value.assessed_value_2025
: This variable is
redundant with the response variable
market_value_2025
.homestead_cap_loss
: All values in this
column are zero, indicating no variability or explanatory power.pool_sqft
and
pool_value
: These were excluded due to the
limited number of homes with pools in the dataset, which results in
minimal variability and low predictive relevance.# List of columns to retain for analysis
cols_keep <- c(
"property_id",
"street_number",
"market_value_2025",
"improvement_value_total",
"land_value_total",
"area_total_sqft",
"main_sqft",
"main_value",
"garage_sqft",
"garage_value",
"land_sqft"
)
# Filter the dataframe to keep only the selected columns
data_real <- data_real[, names(data_real) %in% cols_keep]
# Count the number of missing values (NA) in each column
colSums(is.na(data_real))
## property_id street_number market_value_2025
## 0 0 0
## improvement_value_total land_value_total area_total_sqft
## 0 0 0
## main_sqft main_value garage_sqft
## 0 1 0
## garage_value land_sqft
## 1 0
# Remove rows where key variables are NA
data_real <- data_real %>%
filter(!is.na(main_value) & !is.na(garage_value))
# Verify missing values
colSums(is.na(data_real))
## property_id street_number market_value_2025
## 0 0 0
## improvement_value_total land_value_total area_total_sqft
## 0 0 0
## main_sqft main_value garage_sqft
## 0 0 0
## garage_value land_sqft
## 0 0
Observations
After filtering out irrelevant entries and handling missing values, the dataset now includes 40 valid residential properties located on 88th Street.
Only properties classified as “Real” were retained to ensure the analysis focuses on land and structural assets.
The dataset was cleaned to include only variables relevant for the regression model, resulting in 11 core variables:
property_id: Unique identifier of each property
street_number: Extracted number for property identification
market_value_2025: Target variable, representing the full appraised value
improvement_value_total: Total value of built structures
land_value_total: Appraised value of the land component
area_total_sqft: Total enclosed square footage (heated and non-heated)
main_sqft: Heated living area in square feet
main_value: Value attributed to the main (heated) area
garage_sqft: Non-heated area (e.g., garage) in square feet
garage_value: Appraised value of the garage or similar areas
land_sqft: Total land area in square feet
Understanding the distribution of variables is a crucial step in preparing for multivariable regression modeling. This section explores the spread, central tendency, and potential outliers of each numerical variable using boxplots, ensuring that the dataset is well-behaved for statistical modeling.
In addition, a correlation matrix is presented to examine the strength and direction of linear relationships between predictors. This helps identify possible multicollinearity, guide variable selection, and inform model refinement. These exploratory visualizations provide essential context for evaluating variable behavior and improving the robustness of the regression model.
# Boxplot Visualization
options(scipen = 999)
numeric_vars <- data_real[, sapply(data_real, is.numeric)]
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1),
cex.main = 1.2, cex.axis = 0.9)
for (var in names(numeric_vars)) {
boxplot(numeric_vars[[var]],
main = var,
col = "lightblue",
horizontal = TRUE)
}
Observations
market_value_2025, improvement_value_total, and main_value show a right-skewed distribution with several extreme outliers on the higher end, indicating that a few high-value properties may distort regression estimates if not addressed.
land_value_total and land_sqft exhibit a very narrow interquartile range (IQR) but have numerous outliers above the upper whisker, suggesting limited variability for most homes with a few properties having unusually high land values or lot sizes.
area_total_sqft and main_sqft are moderately skewed with some influential observations, though most properties fall within a consistent central range.
garage_sqft has a highly skewed distribution, with a large spread and several high-end outliers, indicating that some properties have significantly larger garage spaces than typical.
garage_value follows a similar pattern to garage_sqft, reinforcing the direct relationship between garage size and its appraised value.
# Correlation Matrix Visualization
numeric_vars <- data_real[, sapply(data_real, is.numeric)]
cor_matrix <- cor(numeric_vars)
ggcorrplot(cor_matrix,
lab = TRUE, # Mostrar coeficientes
colors = c("red", "white", "#4A90E2"), # Rango de colores
outline.color = "black", # Borde de cada celda
show.legend = TRUE, # Mostrar leyenda
title = "Correlation Matrix of Numeric Variables",
)
Observations:
The target variable market_value_2025 exhibits near-perfect correlations with:
improvement_value_total (r = 1.00)
main_value (r = 0.99)
area_total_sqft (r = 0.96)
main_sqft (r = 0.93)
land_sqft (r = 0.91)
land_value_total (r = 0.90)
garage_value (r = 0.72)
improvement_value_total and main_value should be excluded from the regression model, despite their strong correlation with the target, because they are mathematically embedded within the target:
market_value_2025 = improvement_value_total + land_value_total
improvement_value_total = main_value + garage_value
Including these would violate the assumption of predictor independence and overfitting.
The features main_sqft, garage_sqft, and land_sqft remain in the model as they are direct physical measurements, not derived values. Their correlations with market_value_2025 (r = 0.93, 0.57, 0.91 respectively) confirm their predictive relevance while preserving model validity.
High intercorrelations among predictor variables are present:
main_sqft vs. area_total_sqft: r = 0.96
garage_sqft vs. garage_value: r = 0.96
main_sqft vs. main_value: r = 0.94
main_sqft vs. improvement_value_total: r = 0.92
These indicate potential multicollinearity, which should be addressed in later model diagnostics.
This section presents the process of constructing a multiple linear regression model to estimate the 2025 market value of residential properties using relevant physical and structural features. The model aims to quantify how various characteristics, such as land size, heated area, and garage attributes, jointly influence property valuation.
We begin by specifying an initial regression model that incorporates main effects, interaction terms, and quadratic terms to capture both linear and non-linear relationships. Careful consideration is given to the inclusion of predictors to avoid redundancy or data leakage.
Multiple linear regression is a fundamental statistical technique used to model the relationship between a continuous dependent variable and two or more independent variables. It enables us to quantify how changes in multiple property attributes jointly influence the outcome of interest.
In this case, the objective is to estimate the 2025 market value of residential properties on 88th Street in Lubbock, Texas, based on physical and structural characteristics such as land value, heated area, garage size, and lot size. By fitting a regression model to this data, we aim to assess the contribution of each feature, individually and through interaction effects, to the overall property value.
The general form of the model is:
\[ \text{market_value}_{2025} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \]
\(\text{market_value}_{2025}\): is the dependent variable (response), representing the assessed market value of the property in 2025.
\(\beta_0\): is the intercept, indicating the estimated value of the property when all predictors are zero.
\(\beta_1, \beta_2, \dots, \beta_p\): are the regression coefficients, estimating the effect of each predictor (e.g., land size, garage area) on the market value.
\(X_1, X_2, \dots, X_p\): are the independent variables (predictors) used in the model, such as:
\(\varepsilon\): is the random error term, accounting for variability in market value not captured by the predictors.
Initial Regression Model
To construct the initial regression model for predicting the 2025 market value of residential properties, the general linear equation is expanded using the actual predictors selected from the dataset. This formulation incorporates:
Main effects (original features),
Pairwise interaction terms (to capture possible synergies between variables), and
Quadratic terms (to account for potential non-linear patterns).
Although some variables such as improvement_value_total and main_value show a very high correlation with the target variable market_value_2025, they were intentionally excluded from the model. This is because they are mathematically embedded within the target variable, and including them would violate the assumption of predictor independence and lead to overfitting.
The expanded expression of the initial model is as follows:
\[ \begin{aligned} \text{market_value_2025 } =\ & \beta_0 + \beta_1 \cdot \text{land_value_total} + \beta_2 \cdot \text{main_sqft} + \beta_3 \cdot \text{garage_sqft} + \beta_4 \cdot \text{garage_value} + \beta_5 \cdot \text{land_sqft} + \beta_6 \cdot \text{area_total_sqft} \\ &+ \beta_7 \cdot (\text{land_value_total} \cdot \text{main_sqft}) + \beta_8 \cdot (\text{land_value_total} \cdot \text{garage_sqft}) + \beta_9 \cdot (\text{land_value_total} \cdot \text{garage_value}) \\ &+ \beta_{10} \cdot (\text{land_value_total} \cdot \text{land_sqft}) + \beta_{11} \cdot (\text{land_value\_total} \cdot \text{area_total_sqft}) + \beta_{12} \cdot (\text{main_sqft} \cdot \text{garage_sqft}) \\ &+ \beta_{13} \cdot (\text{main_sqft} \cdot \text{garage_value}) + \beta_{14} \cdot (\text{main_sqft} \cdot \text{land_sqft}) + \beta_{15} \cdot (\text{main_sqft} \cdot \text{area_total_sqft}) \\ &+ \beta_{16} \cdot (\text{garage_sqft} \cdot \text{garage_value}) + \beta_{17} \cdot (\text{garage_sqft} \cdot \text{land_sqft}) + \beta_{18} \cdot (\text{garage_sqft} \cdot \text{area_total_sqft}) \\ &+ \beta_{19} \cdot (\text{garage_value} \cdot \text{land_sqft}) + \beta_{20} \cdot (\text{garage_value} \cdot \text{area_total_sqft}) + \beta_{21} \cdot (\text{land_sqft} \cdot \text{area_total_sqft}) \\ &+ \beta_{22} \cdot \text{land_value_total}^2 + \beta_{23} \cdot \text{main_sqft}^2 + \beta_{24} \cdot \text{garage_sqft}^2 + \beta_{25} \cdot \text{garage_value}^2 \\ &+ \beta_{26} \cdot \text{land_sqft}^2 + \beta_{27} \cdot \text{area_total_sqft}^2 + \varepsilon \end{aligned} \]
In the first step, we isolate the target property located at street number 6321 by storing it in a separate data frame named data_6321. This allows us to make an out-of-sample prediction. To avoid introducing bias into the model, we then remove this property from the training dataset (data_real) before fitting the model.
# Extract the property with street_number 6321 and store it in data_6321
data_6321 <- data_real[data_real$street_number == 6321, ]
# Remove that row from the training dataset
data_real <- data_real[data_real$street_number != 6321, ]
rownames(data_real) <- NULL
A full regression model is fitted using all predictors, their pairwise interactions, and squared terms.
model_full <- lm(
market_value_2025 ~
(land_value_total + main_sqft +
garage_sqft + garage_value + land_sqft + area_total_sqft)^2 +
I(land_value_total^2) + I(main_sqft^2) +
I(garage_sqft^2) + I(garage_value^2) + I(land_sqft^2) +
I(area_total_sqft^2),
data = data_real
)
summary(model_full)
##
## Call:
## lm(formula = market_value_2025 ~ (land_value_total + main_sqft +
## garage_sqft + garage_value + land_sqft + area_total_sqft)^2 +
## I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
## I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2),
## data = data_real)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6625.4 -133.2 11.9 407.3 2266.4
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) -85943.73425868 162015.89026558 -0.530
## land_value_total 18808.53041293 23108.66694784 0.814
## main_sqft 208.08228722 619.27299609 0.336
## garage_sqft -1057.47336410 768.61434331 -1.376
## garage_value 14.19966474 3.93340643 3.610
## land_sqft -109083.75206096 134028.25421000 -0.814
## area_total_sqft 34.33949201 593.70946479 0.058
## I(land_value_total^2) 0.13879630 0.27657727 0.502
## I(main_sqft^2) -0.04215281 0.57776840 -0.073
## I(garage_sqft^2) 0.84597907 0.34997850 2.417
## I(garage_value^2) -0.00000482 0.00008139 -0.059
## I(land_sqft^2) -4.66892112 9.30518331 -0.502
## I(area_total_sqft^2) -0.03681714 0.22025905 -0.167
## land_value_total:main_sqft -8.63116334 12.00096550 -0.719
## land_value_total:garage_sqft 4.90735665 72.22682546 0.068
## land_value_total:garage_value -0.20184358 0.89152744 -0.226
## land_value_total:land_sqft NA NA NA
## land_value_total:area_total_sqft 0.00268557 0.00358760 0.749
## main_sqft:garage_sqft -0.05738249 0.39983222 -0.144
## main_sqft:garage_value 0.00132662 0.00123555 1.074
## main_sqft:land_sqft 50.04506636 69.60166533 0.719
## main_sqft:area_total_sqft 0.06027726 0.79682355 0.076
## garage_sqft:garage_value -0.01062309 0.00958458 -1.108
## garage_sqft:land_sqft -28.48522680 418.90384959 -0.068
## garage_sqft:area_total_sqft NA NA NA
## garage_value:land_sqft 1.17075009 5.17049868 0.226
## garage_value:area_total_sqft NA NA NA
## land_sqft:area_total_sqft NA NA NA
## Pr(>|t|)
## (Intercept) 0.60307
## land_value_total 0.42764
## main_sqft 0.74123
## garage_sqft 0.18783
## garage_value 0.00235 **
## land_sqft 0.42766
## area_total_sqft 0.95459
## I(land_value_total^2) 0.62262
## I(main_sqft^2) 0.94274
## I(garage_sqft^2) 0.02794 *
## I(garage_value^2) 0.95351
## I(land_sqft^2) 0.62267
## I(area_total_sqft^2) 0.86934
## land_value_total:main_sqft 0.48239
## land_value_total:garage_sqft 0.94667
## land_value_total:garage_value 0.82376
## land_value_total:land_sqft NA
## land_value_total:area_total_sqft 0.46498
## main_sqft:garage_sqft 0.88767
## main_sqft:garage_value 0.29888
## main_sqft:land_sqft 0.48250
## main_sqft:area_total_sqft 0.94064
## garage_sqft:garage_value 0.28409
## garage_sqft:land_sqft 0.94663
## garage_sqft:area_total_sqft NA
## garage_value:land_sqft 0.82374
## garage_value:area_total_sqft NA
## land_sqft:area_total_sqft NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1956 on 16 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998
## F-statistic: 8792 on 23 and 16 DF, p-value: < 0.00000000000000022
Initial Predictor Evaluation
The following terms are recommended for retention:
garage_value (p = 0.002) and I(garage_sqft²) (p = 0.028) show clear statistical significance and should be retained as key predictors of market value.
While main_sqft, land_value_total, area_total_sqft, and their interaction terms were not statistically significant individually (p > 0.1), they are conceptually relevant and may support the structure of the model when considered together.
Interaction terms and quadratic components such as main_sqft:garage_value, land_value_total:garage_value, and garage_sqft:garage_value do not show significance and may be removed to simplify the model.
Several terms resulted in singularities (NA coefficients)—specifically:
model2<- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
garage_sqft:garage_value + garage_sqft:land_sqft,
data = data_real
)
summary(model2)
##
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft +
## garage_sqft + garage_value + land_sqft + area_total_sqft +
## I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
## I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
## land_value_total:main_sqft + land_value_total:garage_sqft +
## land_value_total:garage_value + land_value_total:area_total_sqft +
## main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft +
## main_sqft:area_total_sqft + garage_sqft:garage_value + garage_sqft:land_sqft,
## data = data_real)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6602.0 -167.8 12.5 416.5 2374.7
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -73517.15783031 148123.05611120 -0.496
## land_value_total 14520.54638818 12868.40740859 1.128
## main_sqft 259.35307972 560.07736186 0.463
## garage_sqft -936.46600839 536.78318171 -1.745
## garage_value 13.50713797 2.40333599 5.620
## land_sqft -84214.32287119 74638.73673539 -1.128
## area_total_sqft -25.83567017 515.88150733 -0.050
## I(land_value_total^2) 0.16777345 0.23824874 0.704
## I(main_sqft^2) -0.10429218 0.49404133 -0.211
## I(garage_sqft^2) 0.84860500 0.33988582 2.497
## I(garage_value^2) -0.00001351 0.00006975 -0.194
## I(land_sqft^2) -5.64385714 8.01559561 -0.704
## I(area_total_sqft^2) -0.05727685 0.19518616 -0.293
## land_value_total:main_sqft -7.74459322 11.02317682 -0.703
## land_value_total:garage_sqft -11.21049526 11.88951055 -0.943
## land_value_total:garage_value 0.00002396 0.00007889 0.304
## land_value_total:area_total_sqft 0.00199765 0.00185408 1.077
## main_sqft:garage_sqft -0.10941276 0.31794502 -0.344
## main_sqft:garage_value 0.00140998 0.00114604 1.230
## main_sqft:land_sqft 44.90814899 63.93806341 0.702
## main_sqft:area_total_sqft 0.14312407 0.68781794 0.208
## garage_sqft:garage_value -0.00974084 0.00850900 -1.145
## garage_sqft:land_sqft 64.99644411 68.93852004 0.943
## Pr(>|t|)
## (Intercept) 0.6260
## land_value_total 0.2748
## main_sqft 0.6492
## garage_sqft 0.0991 .
## garage_value 0.0000306 ***
## land_sqft 0.2749
## area_total_sqft 0.9606
## I(land_value_total^2) 0.4908
## I(main_sqft^2) 0.8353
## I(garage_sqft^2) 0.0231 *
## I(garage_value^2) 0.8487
## I(land_sqft^2) 0.4909
## I(area_total_sqft^2) 0.7727
## land_value_total:main_sqft 0.4918
## land_value_total:garage_sqft 0.3589
## land_value_total:garage_value 0.7651
## land_value_total:area_total_sqft 0.2963
## main_sqft:garage_sqft 0.7350
## main_sqft:garage_value 0.2353
## main_sqft:land_sqft 0.4920
## main_sqft:area_total_sqft 0.8376
## garage_sqft:garage_value 0.2682
## garage_sqft:land_sqft 0.3590
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1901 on 17 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9998
## F-statistic: 9735 on 22 and 17 DF, p-value: < 0.00000000000000022
Observations
This section conducts a diagnostic assessment to identify influential observations that may disproportionately affect the fitted regression model. Influential points can distort parameter estimates, compromise model assumptions, and reduce predictive accuracy
Observations within a dataset can be labeled by the amount in which they are an outlier in their response, have leverage in their predictors, and influence the fit of the model.
Outliers are observations in a dataset whose response variable is distant from other observations.
Leverage points are observations in a dataset whose predictor variables are distant from the other observations.
Influence points are observations in a dataset that are both outliers in their response variable and have leverage in their predictor variables.
Points that have influence need to be carefully examined to determine whether they are mistakes (e.g., due to measurement errors) or extreme observations (which are possible but unlikely). Once identified, a decision must be made on how to handle these points—whether to leave them in the model or remove them.
The diagnostics for influential observations are based on three main components:
1. Studentized Residuals (rstudent): Measures the residual of an observation scaled by its estimated standard deviation, providing an assessment of whether an observation is an outlier in terms of its response variable.
\[ \mathbf{e} = (\mathbf{I} - \mathbf{H}) \mathbf{y} \]
Where:
2. Leverage (hatvalues): Measures how much influence a particular predictor value has on the fitted values. The leverage for an observation \(i\) is calculated as:
\[ h_{ii} = \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i \]
Where:
A common rule of thumb to identify high-leverage points is:
\[ h_{ii} > \frac{2p}{n} \]
3. Cook’s Distance (cooks.distance): Combines the residuals and leverage information to assess the overall influence of each observation on the model. The Cook’s Distance for the \(\text{i-th}\) observation is calculated as:
\[ D_i = \frac{1}{p \cdot \text{MSE}} (\hat{\boldsymbol{\beta}}^{(i)} - \hat{\boldsymbol{\beta}})^\top \mathbf{X}^\top \mathbf{X} (\hat{\boldsymbol{\beta}}^{(i)} - \hat{\boldsymbol{\beta}}) \]
Where:
The following R code generates diagnostic plots and a summary table of key influence statistics:
plot(model2, 1)
plot(model2, 4)
The plots suggest possible influential observations at indexes 2, 28, and 38.
# Creating combined diagnostics dataframe
diagnostics_df <- data.frame(
Obs = 1:nrow(data_real),
Residual = rstudent(model2),
Leverage = hatvalues(model2),
CooksD = cooks.distance(model2)
)
# Thresholds for leverage and Cook's Distance
n <- nrow(data_real)
p <- length(coef(model2))
leverage_thresh <- 2 * p / n
cook_thresh <- 4 / n
# Flagging high influence points
diagnostics_df$High_Residual <- abs(diagnostics_df$Residual) > 2
diagnostics_df$High_Leverage <- diagnostics_df$Leverage > leverage_thresh
diagnostics_df$High_CooksD <- diagnostics_df$CooksD > cook_thresh
# Sorting by Cook's Distance
diagnostics_df <- diagnostics_df[order(-diagnostics_df$CooksD), ]
# Display top influential points
rownames(diagnostics_df) <- NULL
rmarkdown::paged_table(diagnostics_df)
plot(model2,5)
plot(model2,6)
Observations
Data Points 1 and 11 are the strongest candidates for exclusion. Observation 1 displays a Cook’s Distance of approximately 1408, with a leverage value of 1.00, the theoretical maximum. Though its studentized residual is not extreme (0.03), its magnitude of influence on the regression is exceptionally high. Similarly, Data Point 11 has a Cook’s Distance of 9.68 and leverage near 0.99, making it highly influential on the fitted model.
Data Point 37 also stands out with a studentized residual of −11.46, well beyond the ±2 threshold, and a Cook’s Distance of 0.17. While its leverage is moderate (≈ 0.20), the extreme residual marks it as a clear outlier in terms of model fit.
Data Point 3 shows a Cook’s Distance of 1.43, leverage of 0.99, and a moderately large residual (−0.67). This combination indicates a potential distortion of coefficient estimates and model stability.
Other points such as 32, 9, and 39 exceed the Cook’s Distance threshold of 4/n (≈ 0.1), though their studentized residuals remain within acceptable bounds. For instance, Data Point 39 has a residual of 1.90 and moderate leverage (≈ 0.50), suggesting a non-negligible but less severe influence.
Based on these results, Data Points 1, 11, 3, and 37 are strong candidates for exclusion in a revised model. Data Points 9, 32, and 39 will be temporarily retained but flagged for further inspection. Subsequent modeling will involve re-estimating the regression with and without these observations to assess impacts on adjusted R-squared, residual behavior, and the stability of coefficient estimates.
In this section, we aim to improve the reliability and interpretability of the regression model by identifying and removing influential data points. These observations, typically characterized by high leverage, large residuals, or disproportionate influence on model coefficients, can distort estimates and compromise model performance.
By re-estimating the model without such points, we seek to achieve a more stable fit, enhance statistical significance of relevant predictors, and ensure the model better reflects the underlying patterns in the data.
To improve model stability and reduce the effect of extreme values, we re-estimated the model after removing four previously identified influential observations (Data Points 1, 3, 11, and 37). These points were flagged based on high leverage, outlier status, and influence metrics.
# Remove influential observations
data_no_points <- data_real[-c(1, 11, 3, 37), ]
# Fit the new model
model_no_points <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
garage_sqft:garage_value + garage_sqft:land_sqft,
data = data_no_points)
summary(model_no_points)
##
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft +
## garage_sqft + garage_value + land_sqft + area_total_sqft +
## I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
## I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
## land_value_total:main_sqft + land_value_total:garage_sqft +
## land_value_total:garage_value + land_value_total:area_total_sqft +
## main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft +
## main_sqft:area_total_sqft + garage_sqft:garage_value + garage_sqft:land_sqft,
## data = data_no_points)
##
## Residuals:
## Min 1Q Median 3Q Max
## -231.516 -35.200 -0.174 44.186 186.723
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -32532.655322552 23386.404808748 -1.391
## land_value_total -3314.749529057 2662.441382959 -1.245
## main_sqft 65.896705356 63.043193006 1.045
## garage_sqft -751.515100888 57.588030943 -13.050
## garage_value 10.011740905 0.251663117 39.782
## land_sqft 19235.636345617 15441.581708770 1.246
## area_total_sqft 104.869067722 53.194883065 1.971
## I(land_value_total^2) 0.012594306 0.034824014 0.362
## I(main_sqft^2) 0.045388164 0.046824511 0.969
## I(garage_sqft^2) 0.867328814 0.030777077 28.181
## I(garage_value^2) -0.000005365 0.000007354 -0.730
## I(land_sqft^2) -0.423281744 1.171676752 -0.361
## I(area_total_sqft^2) -0.007849387 0.017810212 -0.441
## land_value_total:main_sqft 0.676598824 0.993448810 0.681
## land_value_total:garage_sqft -0.040440798 1.113434660 -0.036
## land_value_total:garage_value -0.000032808 0.000007526 -4.360
## land_value_total:area_total_sqft 0.002118823 0.000209319 10.122
## main_sqft:garage_sqft -0.238098904 0.032219711 -7.390
## main_sqft:garage_value 0.003786818 0.000133024 28.467
## main_sqft:land_sqft -3.940827623 5.762604131 -0.684
## main_sqft:area_total_sqft -0.031679026 0.064624283 -0.490
## garage_sqft:garage_value -0.011362737 0.000903749 -12.573
## garage_sqft:land_sqft 0.237610431 6.456075615 0.037
## Pr(>|t|)
## (Intercept) 0.187548
## land_value_total 0.235106
## main_sqft 0.314949
## garage_sqft 0.00000000759469477 ***
## garage_value 0.00000000000000575 ***
## land_sqft 0.234857
## area_total_sqft 0.070344 .
## I(land_value_total^2) 0.723420
## I(main_sqft^2) 0.350073
## I(garage_sqft^2) 0.00000000000048422 ***
## I(garage_value^2) 0.478568
## I(land_sqft^2) 0.723707
## I(area_total_sqft^2) 0.666653
## land_value_total:main_sqft 0.507784
## land_value_total:garage_sqft 0.971578
## land_value_total:garage_value 0.000773 ***
## land_value_total:area_total_sqft 0.00000015619830432 ***
## main_sqft:garage_sqft 0.00000527093644677 ***
## main_sqft:garage_value 0.00000000000042543 ***
## main_sqft:land_sqft 0.506070
## main_sqft:area_total_sqft 0.632159
## garage_sqft:garage_value 0.00000001192089443 ***
## garage_sqft:land_sqft 0.971200
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 160.5 on 13 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.302e+06 on 22 and 13 DF, p-value: < 0.00000000000000022
Observations
The residual standard error decreased significantly from 1901 to 160.5, indicating improved fit.
The adjusted R² remained at 1.00, confirming that the model continues to explain nearly all variability in market_value_2025 even after removing outliers.
The following variables are highly significant (p < 0.01):
Most other predictors, including land_value_total, main_sqft, and several squared and interaction terms, are not statistically significant (p > 0.1)`
The results confirm that garage-related variables and their interactions are the most influential in predicting market value in the cleaned model.
The elimination of the four influential observations led to a more stable, interpretable, and better statistically model.
To further evaluate the robustness of the model, we conducted a sensitivity analysis by testing the impact of individually removing three additional observations (9, 32, and 39). These were previously identified as having potential outlier behavior but were not removed in the initial round.
# List of observations to evaluate
outliers_to_test <- c(9, 32, 39)
for (obs in outliers_to_test) {
cat("\n================ Data Point Removed:", obs, "================\n")
# Remove observation
temp_data <- data_no_points[-obs, ]
# Refit the model
temp_model <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
garage_sqft:garage_value + garage_sqft:land_sqft,
data = temp_data)
# Model summary
s <- summary(temp_model)
f <- s$fstatistic
f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE)
# Display global metrics
cat("Residual standard error:", format(s$sigma, scientific = TRUE), "on", s$df[2], "degrees of freedom\n")
cat("Multiple R-squared:", round(s$r.squared, 4), "\tAdjusted R-squared:", round(s$adj.r.squared, 4), "\n")
cat("F-statistic:", format(f[1], scientific = TRUE), "on", f[2], "and", f[3], "DF, p-value:", format.pval(f_pval, digits = 4, eps = .0001), "\n\n")
# Display rounded p-values
pvals <- round(s$coefficients[, "Pr(>|t|)"], 4)
for (i in seq_along(pvals)) {
cat(names(pvals)[i], ":", pvals[i], "\n")
}
}
##
## ================ Data Point Removed: 9 ================
## Residual standard error: 1.503073e+02 on 12 degrees of freedom
## Multiple R-squared: 1 Adjusted R-squared: 1
## F-statistic: 1.484054e+06 on 22 and 12 DF, p-value: < 0.0001
##
## (Intercept) : 0.1106
## land_value_total : 0.1936
## main_sqft : 0.1863
## garage_sqft : 0
## garage_value : 0
## land_sqft : 0.1934
## area_total_sqft : 0.1049
## I(land_value_total^2) : 0.9028
## I(main_sqft^2) : 0.4422
## I(garage_sqft^2) : 0
## I(garage_value^2) : 0.3897
## I(land_sqft^2) : 0.9024
## I(area_total_sqft^2) : 0.4997
## land_value_total:main_sqft : 0.2192
## land_value_total:garage_sqft : 0.9483
## land_value_total:garage_value : 0.0004
## land_value_total:area_total_sqft : 0
## main_sqft:garage_sqft : 0
## main_sqft:garage_value : 0
## main_sqft:land_sqft : 0.2182
## main_sqft:area_total_sqft : 0.7958
## garage_sqft:garage_value : 0
## garage_sqft:land_sqft : 0.9476
##
## ================ Data Point Removed: 32 ================
## Residual standard error: 1.557109e+02 on 12 degrees of freedom
## Multiple R-squared: 1 Adjusted R-squared: 1
## F-statistic: 1.366883e+06 on 22 and 12 DF, p-value: < 0.0001
##
## (Intercept) : 0.2972
## land_value_total : 0.1001
## main_sqft : 0.2433
## garage_sqft : 0
## garage_value : 0
## land_sqft : 0.1
## area_total_sqft : 0.1095
## I(land_value_total^2) : 0.6813
## I(main_sqft^2) : 0.5653
## I(garage_sqft^2) : 0
## I(garage_value^2) : 0.1929
## I(land_sqft^2) : 0.6815
## I(area_total_sqft^2) : 0.4507
## land_value_total:main_sqft : 0.2074
## land_value_total:garage_sqft : 0.4676
## land_value_total:garage_value : 0.0006
## land_value_total:area_total_sqft : 0
## main_sqft:garage_sqft : 0
## main_sqft:garage_value : 0
## main_sqft:land_sqft : 0.2067
## main_sqft:area_total_sqft : 0.9049
## garage_sqft:garage_value : 0
## garage_sqft:land_sqft : 0.4671
##
## ================ Data Point Removed: 39 ================
## Residual standard error: 1.605191e+02 on 13 degrees of freedom
## Multiple R-squared: 1 Adjusted R-squared: 1
## F-statistic: 1.302447e+06 on 22 and 13 DF, p-value: < 0.0001
##
## (Intercept) : 0.1875
## land_value_total : 0.2351
## main_sqft : 0.3149
## garage_sqft : 0
## garage_value : 0
## land_sqft : 0.2349
## area_total_sqft : 0.0703
## I(land_value_total^2) : 0.7234
## I(main_sqft^2) : 0.3501
## I(garage_sqft^2) : 0
## I(garage_value^2) : 0.4786
## I(land_sqft^2) : 0.7237
## I(area_total_sqft^2) : 0.6667
## land_value_total:main_sqft : 0.5078
## land_value_total:garage_sqft : 0.9716
## land_value_total:garage_value : 0.0008
## land_value_total:area_total_sqft : 0
## main_sqft:garage_sqft : 0
## main_sqft:garage_value : 0
## main_sqft:land_sqft : 0.5061
## main_sqft:area_total_sqft : 0.6322
## garage_sqft:garage_value : 0
## garage_sqft:land_sqft : 0.9712
Observations:
After previously excluding Data Points 1, 3, 11, and 37, additional outliers were individually removed to assess their influence on the model’s performance. The goal was to determine whether their presence distorted coefficient estimation, statistical significance, or model fit.
Based on the above sensitivity analysis, we updated the model one final time by additionally removing Data Point 9, resulting in a total of five excluded observations (1, 3, 9, 11, and 37). This version is expected to be more stable and better represent the central trend of the data.
# Remove influential observations
data_cleaned <- data_real[-c(1, 11, 3, 37, 9), ]
# Refit the regression model using the cleaned dataset
model_cleaned <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
garage_sqft:garage_value + garage_sqft:land_sqft,
data = data_cleaned
)
# Display key model results
summary(model_cleaned)
##
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft +
## garage_sqft + garage_value + land_sqft + area_total_sqft +
## I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
## I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
## land_value_total:main_sqft + land_value_total:garage_sqft +
## land_value_total:garage_value + land_value_total:area_total_sqft +
## main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft +
## main_sqft:area_total_sqft + garage_sqft:garage_value + garage_sqft:land_sqft,
## data = data_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -210.50 -40.01 0.00 40.52 215.30
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -33301.0900866070 23988.2074520500 -1.388
## land_value_total -2920.3869524207 2799.7509292910 -1.043
## main_sqft 55.8378979579 66.5652949434 0.839
## garage_sqft -765.5590030669 63.1471307680 -12.123
## garage_value 10.0627898331 0.2704914353 37.202
## land_sqft 16948.4185780307 16237.9474274157 1.044
## area_total_sqft 115.2403937887 56.9747827963 2.023
## I(land_value_total^2) -0.0063510047 0.0468630405 -0.136
## I(main_sqft^2) 0.0560352900 0.0509159870 1.101
## I(garage_sqft^2) 0.8821515313 0.0394884740 22.339
## I(garage_value^2) 0.0000003396 0.0000118530 0.029
## I(land_sqft^2) 0.2140794403 1.5766688364 0.136
## I(area_total_sqft^2) -0.0043050630 0.0191097567 -0.225
## land_value_total:main_sqft 0.8338762350 1.0484737628 0.795
## land_value_total:garage_sqft 1.6438229844 2.9326487547 0.561
## land_value_total:garage_value -0.0000402454 0.0000142038 -2.833
## land_value_total:area_total_sqft 0.0022064611 0.0002563984 8.606
## main_sqft:garage_sqft -0.2367018874 0.0330811502 -7.155
## main_sqft:garage_value 0.0038761209 0.0001977114 19.605
## main_sqft:land_sqft -4.8538465700 6.0820581652 -0.798
## main_sqft:area_total_sqft -0.0455712390 0.0698499061 -0.652
## garage_sqft:garage_value -0.0120579547 0.0014494027 -8.319
## garage_sqft:land_sqft -9.5285074916 17.0047674499 -0.560
## Pr(>|t|)
## (Intercept) 0.1903
## land_value_total 0.3175
## main_sqft 0.4180
## garage_sqft 0.0000000431449722 ***
## garage_value 0.0000000000000914 ***
## land_sqft 0.3172
## area_total_sqft 0.0660 .
## I(land_value_total^2) 0.8944
## I(main_sqft^2) 0.2927
## I(garage_sqft^2) 0.0000000000381967 ***
## I(garage_value^2) 0.9776
## I(land_sqft^2) 0.8942
## I(area_total_sqft^2) 0.8256
## land_value_total:main_sqft 0.4419
## land_value_total:garage_sqft 0.5854
## land_value_total:garage_value 0.0151 *
## land_value_total:area_total_sqft 0.0000017676926881 ***
## main_sqft:garage_sqft 0.0000115592776827 ***
## main_sqft:garage_value 0.0000000001760412 ***
## main_sqft:land_sqft 0.4403
## main_sqft:area_total_sqft 0.5264
## garage_sqft:garage_value 0.0000025131479113 ***
## garage_sqft:land_sqft 0.5856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 164.4 on 12 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.241e+06 on 22 and 12 DF, p-value: < 0.00000000000000022
To improve the interpretability and robustness of the regression model, we perform step-by-step model refinement based on feature significance (p-values) and multicollinearity diagnostics (GVIF). The goal is to simplify the model while preserving predictive power and respecting hierarchical modeling principles.
We begin by assessing the cleaned model using vif() to detect potential multicollinearity issues and guide feature selection.
vif(model_cleaned, type = "predictor")
## GVIF Df
## land_value_total 25036649388093432618668060480264664400280864 18
## main_sqft 1 22
## garage_sqft 27566856153543144 18
## garage_value 3361717872367879852604402486620006628868448264606868 14
## land_sqft 2983421214152967732042488448860600064662086462046244044 9
## area_total_sqft 4590214097808697878824662660488828020422 9
## GVIF^(1/(2*Df))
## land_value_total 16.051514
## main_sqft 1.000000
## garage_sqft 2.862051
## garage_value 69.220436
## land_sqft 1062.607879
## area_total_sqft 159.747902
## Interacts With
## land_value_total I(land_value_total^2), main_sqft, garage_sqft, garage_value, area_total_sqft
## main_sqft I(main_sqft^2), land_value_total, garage_sqft, garage_value, land_sqft, area_total_sqft
## garage_sqft I(garage_sqft^2), land_value_total, main_sqft, garage_value, land_sqft
## garage_value I(garage_value^2), land_value_total, main_sqft, garage_sqft
## land_sqft I(land_sqft^2), main_sqft, garage_sqft
## area_total_sqft I(area_total_sqft^2), land_value_total, main_sqft
## Other Predictors
## land_value_total land_sqft
## main_sqft --
## garage_sqft area_total_sqft
## garage_value land_sqft, area_total_sqft
## land_sqft land_value_total, garage_value, area_total_sqft
## area_total_sqft garage_sqft, garage_value, land_sqft
Observations:
After removing land_sqft, we re-estimate the model and reassess both statistical significance and collinearity metrics.
model_reduced <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
garage_sqft:garage_value,
data = data_cleaned
)
summary(model_reduced)
##
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft +
## garage_sqft + garage_value + area_total_sqft + I(land_value_total^2) +
## I(main_sqft^2) + I(garage_sqft^2) + I(garage_value^2) + I(area_total_sqft^2) +
## land_value_total:main_sqft + land_value_total:garage_sqft +
## land_value_total:garage_value + land_value_total:area_total_sqft +
## main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
## garage_sqft:garage_value, data = data_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308.84 -61.61 0.00 78.05 241.54
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -42563.625366632 15508.741695771 -2.744
## land_value_total 1.806761414 0.216719726 8.337
## main_sqft 69.761883902 56.808643123 1.228
## garage_sqft -792.891203994 52.010135912 -15.245
## garage_value 10.455571957 0.184198124 56.763
## area_total_sqft 108.502398874 50.758240565 2.138
## I(land_value_total^2) 0.000012278 0.000004675 2.627
## I(main_sqft^2) 0.048351136 0.046676989 1.036
## I(garage_sqft^2) 0.876504263 0.026691211 32.839
## I(garage_value^2) -0.000001470 0.000004972 -0.296
## I(area_total_sqft^2) -0.007460778 0.017893347 -0.417
## land_value_total:main_sqft -0.003076362 0.000242480 -12.687
## land_value_total:garage_sqft 0.000352183 0.000411437 0.856
## land_value_total:garage_value -0.000034838 0.000004862 -7.165
## land_value_total:area_total_sqft 0.002328307 0.000187760 12.400
## main_sqft:garage_sqft -0.218463330 0.029943718 -7.296
## main_sqft:garage_value 0.003633922 0.000090667 40.080
## main_sqft:area_total_sqft -0.036854201 0.064480738 -0.572
## garage_sqft:garage_value -0.011783771 0.000624665 -18.864
## Pr(>|t|)
## (Intercept) 0.0144 *
## land_value_total 0.000000323432301037 ***
## main_sqft 0.2372
## garage_sqft 0.000000000059915986 ***
## garage_value < 0.0000000000000002 ***
## area_total_sqft 0.0483 *
## I(land_value_total^2) 0.0183 *
## I(main_sqft^2) 0.3157
## I(garage_sqft^2) 0.000000000000000413 ***
## I(garage_value^2) 0.7714
## I(area_total_sqft^2) 0.6822
## land_value_total:main_sqft 0.000000000914576672 ***
## land_value_total:garage_sqft 0.4046
## land_value_total:garage_value 0.000002246608368307 ***
## land_value_total:area_total_sqft 0.000000001276921815 ***
## main_sqft:garage_sqft 0.000001794065466160 ***
## main_sqft:garage_value < 0.0000000000000002 ***
## main_sqft:area_total_sqft 0.5756
## garage_sqft:garage_value 0.000000000002352542 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 174 on 16 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.355e+06 on 18 and 16 DF, p-value: < 0.00000000000000022
vif(model_reduced, type = "predictor")
## GVIF Df GVIF^(1/(2*Df))
## land_value_total 1 18 1.000000
## main_sqft 1 18 1.000000
## garage_sqft 1598310432474622 14 3.491306
## garage_value 1598310432474622 14 3.491306
## area_total_sqft 374583515959 9 4.395162
## Interacts With
## land_value_total I(land_value_total^2), main_sqft, garage_sqft, garage_value, area_total_sqft
## main_sqft I(main_sqft^2), land_value_total, garage_sqft, garage_value, area_total_sqft
## garage_sqft I(garage_sqft^2), land_value_total, main_sqft, garage_value
## garage_value I(garage_value^2), land_value_total, main_sqft, garage_sqft
## area_total_sqft I(area_total_sqft^2), land_value_total, main_sqft
## Other Predictors
## land_value_total --
## main_sqft --
## garage_sqft area_total_sqft
## garage_value area_total_sqft
## area_total_sqft garage_sqft, garage_value
Observations:
We now fit the final model, keeping only the most relevant predictors and interactions while ensuring that model hierarchy is respected and multicollinearity is minimized.
model_final <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
garage_sqft:garage_value,
data = data_cleaned
)
summary(model_final)
##
## Call:
## lm(formula = market_value_2025 ~ land_value_total + main_sqft +
## garage_sqft + garage_value + area_total_sqft + I(land_value_total^2) +
## I(main_sqft^2) + I(garage_sqft^2) + land_value_total:main_sqft +
## land_value_total:garage_value + land_value_total:area_total_sqft +
## main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
## garage_sqft:garage_value, data = data_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -305.23 -65.61 -9.06 95.84 201.65
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -39523.460319117 13845.603658233 -2.855
## land_value_total 1.828339001 0.201110709 9.091
## main_sqft 36.917779982 18.941632266 1.949
## garage_sqft -826.150091785 14.535967614 -56.835
## garage_value 10.535229332 0.152653509 69.014
## area_total_sqft 137.691125245 10.246939174 13.437
## I(land_value_total^2) 0.000008312 0.000003081 2.698
## I(main_sqft^2) 0.070732169 0.004803734 14.724
## I(garage_sqft^2) 0.871028622 0.005609366 155.281
## land_value_total:main_sqft -0.003050124 0.000194208 -15.705
## land_value_total:garage_value -0.000034740 0.000001575 -22.051
## land_value_total:area_total_sqft 0.002489153 0.000082333 30.233
## main_sqft:garage_sqft -0.196230925 0.005212665 -37.645
## main_sqft:garage_value 0.003586134 0.000071119 50.425
## main_sqft:area_total_sqft -0.067937221 0.004113243 -16.517
## garage_sqft:garage_value -0.011921708 0.000071596 -166.514
## Pr(>|t|)
## (Intercept) 0.0101 *
## land_value_total 0.00000002386595202 ***
## main_sqft 0.0662 .
## garage_sqft < 0.0000000000000002 ***
## garage_value < 0.0000000000000002 ***
## area_total_sqft 0.00000000003748596 ***
## I(land_value_total^2) 0.0143 *
## I(main_sqft^2) 0.00000000000762333 ***
## I(garage_sqft^2) < 0.0000000000000002 ***
## land_value_total:main_sqft 0.00000000000244574 ***
## land_value_total:garage_value 0.00000000000000536 ***
## land_value_total:area_total_sqft < 0.0000000000000002 ***
## main_sqft:garage_sqft < 0.0000000000000002 ***
## main_sqft:garage_value < 0.0000000000000002 ***
## main_sqft:area_total_sqft 0.00000000000099970 ***
## garage_sqft:garage_value < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 169.2 on 19 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.72e+06 on 15 and 19 DF, p-value: < 0.00000000000000022
vif(model_final, type = "predictor")
## GVIF Df GVIF^(1/(2*Df))
## land_value_total 33474318 11 2.198024
## main_sqft 1 15 1.000000
## garage_sqft 398369889288 8 5.309063
## garage_value 3020696856 12 2.483159
## area_total_sqft 94187384 8 3.150464
## Interacts With
## land_value_total I(land_value_total^2), main_sqft, garage_value, area_total_sqft
## main_sqft I(main_sqft^2), land_value_total, garage_sqft, garage_value, area_total_sqft
## garage_sqft I(garage_sqft^2), main_sqft, garage_value
## garage_value land_value_total, main_sqft, garage_sqft
## area_total_sqft land_value_total, main_sqft
## Other Predictors
## land_value_total garage_sqft
## main_sqft --
## garage_sqft land_value_total, area_total_sqft
## garage_value area_total_sqft
## area_total_sqft garage_sqft, garage_value
Observations:
With the final regression model established and validated, we now use it to generate a predicted 2025 market value for the target property located at 6321 88th Street. This estimate is derived from the same regression equation used to model the neighborhood, applying the physical and structural features specific to the target property.
To assess the reliability and precision of this prediction, both a 95% confidence interval and a 95% prediction interval are computed:
The confidence interval provides a range for the expected mean market value of homes with similar characteristics.
The prediction interval offers a wider range that captures the likely value of the individual property at 6321 88th Street, accounting for both model uncertainty and natural variability.
test_data <- data_6321
# Confidence and prediction intervals for 6321
ci_6321 <- predict(model_final, newdata = test_data, interval = "confidence", level = 0.95)
pi_6321 <- predict(model_final, newdata = test_data, interval = "prediction", level = 0.95)
# Display results
cat("Prediction and Interval Estimates for 6321 88th Street:\n",
"-> Predicted Market Value (2025): $", round(ci_6321[1], 2), "\n",
"-> Confidence Interval (95%): $", round(ci_6321[2], 2), "– $", round(ci_6321[3], 2), "\n",
"-> Prediction Interval (95%): $", round(pi_6321[2], 2), "– $", round(pi_6321[3], 2), "\n"
)
## Prediction and Interval Estimates for 6321 88th Street:
## -> Predicted Market Value (2025): $ 538535.6
## -> Confidence Interval (95%): $ 538362.1 – $ 538709
## -> Prediction Interval (95%): $ 538141.3 – $ 538929.8
Observations
Data Collection and Preparation
Exploratory Data Analysis (EDA)
Model Building
Model Diagnostics and Re-estimation
Model Optimization
Prediction for Target Property (6321 88th Street)
The model supports the conclusion that the assessed value of 6321 88th Street is fair and consistent with the valuation of similar nearby properties. The final model is statistically robust, yet the perfect R² calls for further validation to rule out overfitting. At present, there is no strong evidence to suggest that the property has been misvalued
# Libraries
library(dplyr)
library(tidyr)
library(MASS)
library(psych)
library(ggcorrplot)
library(ggplot2)
library(car)
# SECTION 1: EXPLORATORY DATA ANALYSIS (EDA) ----------------------------------------------------------
# Create the dataframe
# Raw GitHub URL
url <- "https://raw.githubusercontent.com/JairoRodriguezB/Datasets/refs/heads/main/Property_Assessment_88thStreet_2025.csv"
# Read the CSV file
data <- read.csv(url)
head(data)
# INITIAL DATA ANALYSIS
# Rename columns to facilitate analysis
colnames(data) <- c(
"property_id",
"address",
"property_type",
"assessed_value_2025",
"market_value_2025",
"improvement_value_total",
"land_value_total",
"homestead_cap_loss",
"area_total_sqft",
"main_sqft",
"main_value",
"garage_sqft",
"garage_value",
"pool_sqft",
"pool_value",
"land_sqft"
)
# Convert selected columns from character to integer, replacing "N/A" with NA
data <- data %>%
mutate(
improvement_value_total = as.integer(na_if(as.character(improvement_value_total), "N/A")),
land_value_total = as.integer(na_if(as.character(land_value_total), "N/A")),
homestead_cap_loss = as.integer(na_if(as.character(homestead_cap_loss), "N/A")),
area_total_sqft = as.integer(na_if(as.character(area_total_sqft), "N/A")),
main_sqft = as.integer(na_if(as.character(main_sqft), "N/A")),
main_value = as.integer(na_if(as.character(main_value), "N/A")),
garage_sqft = as.integer(na_if(as.character(garage_sqft), "N/A")),
garage_value = as.integer(na_if(as.character(garage_value), "N/A")),
pool_sqft = as.integer(na_if(as.character(pool_sqft), "N/A")),
pool_value = as.integer(na_if(as.character(pool_value), "N/A")),
land_sqft = as.integer(na_if(as.character(land_sqft), "N/A"))
)
# Initial Data Inspection
# Summary statistics for numerical variables
summary(data)
# Overview of the dataset structure
glimpse(data)
# Count the number of missing values (NA) in each column
colSums(is.na(data))
# DATA FILTERING AND PREPARATION
unique(data$property_type)
data_real <- data %>%
filter(property_type == "Real")
data_real <- data_real %>%
mutate(street_number = substr(address, 1, 4)) %>%
relocate(street_number, .after = property_id)
glimpse(data_real)
# Data Cleaning – Removal of Irrelevant or Redundant Variables
# List of columns to retain for analysis
cols_keep <- c(
"property_id",
"street_number",
"market_value_2025",
"improvement_value_total",
"land_value_total",
"area_total_sqft",
"main_sqft",
"main_value",
"garage_sqft",
"garage_value",
"land_sqft"
)
# Filter the dataframe to keep only the selected columns
data_real <- data_real[, names(data_real) %in% cols_keep]
# Count the number of missing values (NA) in each column
colSums(is.na(data_real))
# Remove rows where key variables are NA
data_real <- data_real %>%
filter(!is.na(main_value) & !is.na(garage_value))
# Verify missing values
colSums(is.na(data_real))
# DISTRIBUTION AND RELATIONSHIP ANALYSIS
# DISTRIBUTION ANALYSIS
# Boxplot Visualization
options(scipen = 999)
numeric_vars <- data_real[, sapply(data_real, is.numeric)]
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1),
cex.main = 1.2, cex.axis = 0.9)
for (var in names(numeric_vars)) {
boxplot(numeric_vars[[var]],
main = var,
col = "lightblue",
horizontal = TRUE)
}
# RELATIONSHIP ANALYSIS
# Correlation Matrix Visualization
numeric_vars <- data_real[, sapply(data_real, is.numeric)]
cor_matrix <- cor(numeric_vars)
ggcorrplot(cor_matrix,
lab = TRUE, # Mostrar coeficientes
colors = c("red", "white", "#4A90E2"), # Rango de colores
outline.color = "black", # Borde de cada celda
show.legend = TRUE, # Mostrar leyenda
title = "Correlation Matrix of Numeric Variables",
)
# -----------------------------------------------------------------------------------------------------
# SECTION 2: MODEL BUILDING ---------------------------------------------------------------------------
# Extract the property with street_number 6321 and store it in data_6321
data_6321 <- data_real[data_real$street_number == 6321, ]
# Remove that row from the training dataset
data_real <- data_real[data_real$street_number != 6321, ]
rownames(data_real) <- NULL
model_full <- lm(
market_value_2025 ~
(land_value_total + main_sqft +
garage_sqft + garage_value + land_sqft + area_total_sqft)^2 +
I(land_value_total^2) + I(main_sqft^2) +
I(garage_sqft^2) + I(garage_value^2) + I(land_sqft^2) +
I(area_total_sqft^2),
data = data_real
)
summary(model_full)
# Initial Predictor Evaluation
model2<- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
garage_sqft:garage_value + garage_sqft:land_sqft,
data = data_real
)
summary(model2)
# OUTLIER AND INFLUENCE DIAGNOSTICS
# DIAGNOSTICS IMPLEMENTATION
# Creating combined diagnostics dataframe
diagnostics_df <- data.frame(
Obs = 1:nrow(data_real),
Residual = rstudent(model2),
Leverage = hatvalues(model2),
CooksD = cooks.distance(model2)
)
# Thresholds for leverage and Cook's Distance
n <- nrow(data_real)
p <- length(coef(model2))
leverage_thresh <- 2 * p / n
cook_thresh <- 4 / n
# Flagging high influence points
diagnostics_df$High_Residual <- abs(diagnostics_df$Residual) > 2
diagnostics_df$High_Leverage <- diagnostics_df$Leverage > leverage_thresh
diagnostics_df$High_CooksD <- diagnostics_df$CooksD > cook_thresh
# Sorting by Cook's Distance
diagnostics_df <- diagnostics_df[order(-diagnostics_df$CooksD), ]
# Display top influential points
rownames(diagnostics_df) <- NULL
rmarkdown::paged_table(diagnostics_df)
plot(model2,5)
plot(model2,6)
# RE-ESTIMATE THE MODEL WITHOUT INFLUENTIAL DATA POINTS
# MODEL RE-ESTIMATION
# Remove influential observations
data_no_points <- data_real[-c(1, 11, 3, 37), ]
# Fit the new model
model_no_points <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
garage_sqft:garage_value + garage_sqft:land_sqft,
data = data_no_points)
summary(model_no_points)
# TESTING ADDITIONAL OBSERVATIONS
# List of observations to evaluate
outliers_to_test <- c(9, 32, 39)
for (obs in outliers_to_test) {
cat("\n================ Data Point Removed:", obs, "================\n")
# Remove observation
temp_data <- data_no_points[-obs, ]
# Refit the model
temp_model <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
garage_sqft:garage_value + garage_sqft:land_sqft,
data = temp_data)
# Model summary
s <- summary(temp_model)
f <- s$fstatistic
f_pval <- pf(f[1], f[2], f[3], lower.tail = FALSE)
# Display global metrics
cat("Residual standard error:", format(s$sigma, scientific = TRUE), "on", s$df[2], "degrees of freedom\n")
cat("Multiple R-squared:", round(s$r.squared, 4), "\tAdjusted R-squared:", round(s$adj.r.squared, 4), "\n")
cat("F-statistic:", format(f[1], scientific = TRUE), "on", f[2], "and", f[3], "DF, p-value:", format.pval(f_pval, digits = 4, eps = .0001), "\n\n")
# Display rounded p-values
pvals <- round(s$coefficients[, "Pr(>|t|)"], 4)
for (i in seq_along(pvals)) {
cat(names(pvals)[i], ":", pvals[i], "\n")
}
}
# FINAL CLEANED MODEL
# Remove influential observations
data_cleaned <- data_real[-c(1, 11, 3, 37, 9), ]
# Refit the regression model using the cleaned dataset
model_cleaned <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + land_sqft + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(land_sqft^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:land_sqft + main_sqft:area_total_sqft +
garage_sqft:garage_value + garage_sqft:land_sqft,
data = data_cleaned
)
# Display key model results
summary(model_cleaned)
# MODEL OPTIMIZATION
# STEP 1: INITIAL MULTICOLLINEARITY ASSESSMENT
vif(model_cleaned, type = "predictor")
# STEP 2: REDUCED MODEL
model_reduced <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
I(garage_value^2) + I(area_total_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
garage_sqft:garage_value,
data = data_cleaned
)
summary(model_reduced)
vif(model_reduced, type = "predictor")
# STEP 3: FINAL MODEL ESTIMATION
model_final <- lm(
market_value_2025 ~
land_value_total + main_sqft + garage_sqft + garage_value + area_total_sqft +
I(land_value_total^2) + I(main_sqft^2) + I(garage_sqft^2) +
land_value_total:main_sqft + land_value_total:garage_value +
land_value_total:area_total_sqft +
main_sqft:garage_sqft + main_sqft:garage_value + main_sqft:area_total_sqft +
garage_sqft:garage_value,
data = data_cleaned
)
summary(model_final)
vif(model_final, type = "predictor")
# -----------------------------------------------------------------------------------------------------
# SECTION 3: PREDICTION FOR TARGET PROPERTY -----------------------------------------------------------
test_data <- data_6321
# Confidence and prediction intervals for 6321
ci_6321 <- predict(model_final, newdata = test_data, interval = "confidence", level = 0.95)
pi_6321 <- predict(model_final, newdata = test_data, interval = "prediction", level = 0.95)
# Display results
cat("Prediction and Interval Estimates for 6321 88th Street:\n",
"-> Predicted Market Value (2025): $", round(ci_6321[1], 2), "\n",
"-> Confidence Interval (95%): $", round(ci_6321[2], 2), "– $", round(ci_6321[3], 2), "\n",
"-> Prediction Interval (95%): $", round(pi_6321[2], 2), "– $", round(pi_6321[3], 2), "\n"
)
# -----------------------------------------------------------------------------------------------------