The average citizen contributes to society in many ways, e.g. voting, jury duty, humanitarian aid work, and paying taxes. The last of these is among the most important, as taxes are essential to the functioning of any government and society. They provide the primary source of revenue that allows governments to fund public services and infrastructure such as education, healthcare, transportation, public safety, and social welfare programs. Taxes also play a critical role in promoting economic stability and reducing income inequality through redistribution policies. By collecting taxes, governments can invest in long-term development, respond to emergencies, and support vulnerable populations. In essence, taxes are a way for citizens to collectively contribute to the well-being and advancement of their communities and nation as a whole.
There are various kinds of taxes; this case study focuses on property tax, which is a tax imposed on property and paid by the property owner based on an assessment conducted by an appointed appraiser. This assessment evaluates the property’s characteristics, such as its size, location, condition, improvements (like renovations or additions), and the value of similar properties in the area. Once the property’s assessed value is determined, it is used as the basis for calculating property taxes. Unfortunately, the assessment is subjective and can lead to discrepancies in assessed value between similar homes, which in turn affects how much each homeowner pays in taxes. An appeal can be made if the owner believes the property has been overvalued or undervalued.
In this case study, the assessed value of the property at 6321 88th Street in Lubbock, Texas is reviewed to evaluate whether the assessment is justified given the assessed values of other homes in the neighborhood. Key factors such as land market value, house and land square footage, and total improvement market value are analyzed to determine which are influential and whether the home in question is overvalued or undervalued. The findings from multiple linear regression (MLR) will be used to advocate for a re-evaluation so that the home’s taxable value aligns with the empirical evidence.
Before regression modeling is applied to the data, its distribution and reliability need to be assessed and validated. Understanding the variability, normality, accuracy, and potential deviations in the data set helps confirm key statistical assumptions.
The data used for this project was collected from the Lubbock Central Appraisal District (CAD) website. This site contains an abundance of information on every house in the city, from general property details to owner information. Only a specific set of variables was necessary for the calculations, analysis, and final decision. A breakdown of the key variables gathered is as follows:
| Variable Name | Explanation |
| --- | --- |
| 2025 Market Value | Total value of property appraisal = Total Improvement Market Value + Total Land Market Value |
| Total Improvement Market Value | Total value of house appraisal = Main Area (Value) + Garage (Value) |
| Total Land Market Value | Total value of land appraisal |
| Homestead Cap Loss | Discount applied in the current tax year only, when the appraised value rose by more than 10% over the previous year |
| Total Main Area | Total square footage of house = Main Area (Sq. Ft.) + Garage (Sq. Ft.) |
| Main Area (Sq. Ft.) | Total square footage of heated house area |
| Main Area (Value) | Total value of heated house area |
| Garage (Sq. Ft.) | Total square footage of non-heated house area |
| Garage (Value) | Total value of non-heated house area |
| Land (Sq. Ft.) | Total square footage of land |
Note: Out of the forty-two homes within the dataset, seventeen contain additional features such as a second garage/house, pool, pool house, etc. This information can help explain the pricing of the more extravagant houses, but neither the value nor the square footage of the additional features directly influences the main variables. Therefore, these values are not included in the primary data set. An addendum after the conclusion explains how the outliers seen throughout the process are directly linked to the existence of the additional features.
This data was converted into a “.csv” file and stored on GitHub. All subsequent steps rely on a URL to the data on GitHub. Below, the data is loaded and briefly inspected using the head() function.
df <- read.table("https://raw.githubusercontent.com/Isabella-Ortiz/IE-5344/refs/heads/main/Properties_Info.csv", header = TRUE, sep=',')
head(df)
## House_Num Market_Value Total_Improvement_Market_Value Total_Land_Market_Value
## 1 6309 735026 677026 58000
## 2 6310 663907 603222 60685
## 3 6311 569992 511992 58000
## 4 6312 602427 538677 63750
## 5 6313 460288 415135 45153
## 6 6314 968766 888796 79970
## Homestead_Cap_Loss Total_House_Area Main_Area Main_Area_Value. Garage_Area
## 1 0 3905 3192 558004 713
## 2 0 3898 3226 0 672
## 3 0 3611 3036 447924 575
## 4 0 3786 2877 432168 909
## 5 0 2747 2241 376845 506
## 6 0 2591 2041 367600 550
## Garage_Value Land_Area
## 1 56089 10000
## 2 0 10463
## 3 38175 10000
## 4 61445 10625
## 5 38290 7785
## 6 44577 13788
Observations:
Data Range - For purposes of this analysis, we were asked to consider all homes in the neighborhood from “6309 - 6351 88th Street, Lubbock, Texas 79424” to be included. It is important to note that house 6351 does not exist. The numbering skips from 6349 to 6352, but since 6352 is outside of the requirements, it is also not included.
Irrelevant Data - The “Homestead Cap Loss” variable was equal to zero for every data point, so it carries no information and is excluded from the dataset and further analysis. A value of zero is plausible, since the cap loss applies in the current tax year only and only when the appraised value rose by more than 10% over the previous year.
Missing Data - Apart from 6351, which as previously mentioned does not exist, house 6310 does not contain any information on the value of the main and garage area. It does have information on all other variables, but the inclusion of a zero will greatly affect the accuracy of the data, therefore it will need to be removed.
#Removing irrelevant and missing data
df <- subset(df, select = -Homestead_Cap_Loss)
df <- subset(df,House_Num != 6310 )
head(df)
## House_Num Market_Value Total_Improvement_Market_Value Total_Land_Market_Value
## 1 6309 735026 677026 58000
## 3 6311 569992 511992 58000
## 4 6312 602427 538677 63750
## 5 6313 460288 415135 45153
## 6 6314 968766 888796 79970
## 7 6315 550119 505175 44944
## Total_House_Area Main_Area Main_Area_Value. Garage_Area Garage_Value
## 1 3905 3192 558004 713 56089
## 3 3611 3036 447924 575 38175
## 4 3786 2877 432168 909 61445
## 5 2747 2241 376845 506 38290
## 6 2591 2041 367600 550 44577
## 7 3132 2582 433934 550 41595
## Land_Area
## 1 10000
## 3 10000
## 4 10625
## 5 7785
## 6 13788
## 7 7749
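Zero-valued fields like the ones found for house 6310 could also be flagged programmatically rather than by inspection. The following is a minimal sketch on a small made-up data frame; `toy`, its values, and the choice of `value_cols` are assumptions for illustration and are not the CAD data.

```r
# Hypothetical toy data: one row has zero-valued appraisal fields
toy <- data.frame(
  House_Num       = c(101, 102, 103),
  Main_Area_Value = c(400000, 0, 350000),
  Garage_Value    = c(40000, 0, 30000)
)

# Columns in which a zero is treated as "missing" rather than a real value
value_cols <- c("Main_Area_Value", "Garage_Value")

# TRUE for every row where at least one of those columns equals zero
has_zero <- rowSums(toy[value_cols] == 0) > 0

flagged <- toy$House_Num[has_zero]  # house numbers to review by hand
cleaned <- toy[!has_zero, ]         # rows kept for the analysis

print(flagged)
```

Flagging first and removing second keeps a record of which houses were dropped and why.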
One of the most important values in the dataset is the Market Value. To judge the accuracy of the appraised value, we must consider the value of the houses surrounding 6321, which includes understanding the market value trend for the houses in the neighborhood.
# Statistical Summary of Market Value Dataset
summary(df$Market_Value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 418286 504815 534991 567905 580180 1218146
Observations: The market value for the neighborhood ranges from $418,286 to $1,218,146, which is quite a large gap. However, the mean is $567,905, which is much closer to the minimum than to the maximum, so most values are concentrated at the low end of the range with a long tail toward the maximum (a right skew). The mean sitting above the median of $534,991 further supports that statement.
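The direction of skew can be checked numerically by comparing the mean with the median and computing a moment-based skewness. This is a minimal sketch on made-up numbers (the vector `x` is hypothetical and only mimics a cluster of low values with one large value); a positive skewness indicates a longer right tail.

```r
# Hypothetical values: most are clustered low, with one much larger value
x <- c(420, 500, 510, 530, 540, 560, 580, 1200)

mean_x   <- mean(x)
median_x <- median(x)

# Moment-based sample skewness: positive means a longer right tail
skew <- mean((x - mean_x)^3) / (mean((x - mean_x)^2))^(3 / 2)

cat("mean:", mean_x, " median:", median_x, " skewness:", round(skew, 2), "\n")
```

In this toy vector the single large value pulls the mean above the median, which is the same pattern seen in the Market Value summary.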
While the statistical summary alone can give a general idea of the Market Value dataset, it is critical to understand how the data is distributed and to find trends that will assist in the analysis. This requires plotting graphs that highlight different trends in the dataset. To get an accurate representation, the maximum and minimum are calculated so that the plots zoom into the appropriate range.
# Finding the corresponding range
X_Max = max(df$Market_Value)
X_Min = min(df$Market_Value)
# 2025 Market Value Distribution
par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2,0,2,0))
hist(df$Market_Value,
main = "Histogram of Market Value",
xlab = "2025 Market Value ($)",
col ="lightblue",
border = 'darkblue',
ylim=c(0, 25),
xlim = c(400000, 1450000))
boxplot(df$Market_Value,
main = "Boxplot of Market Value",
xlab = "Market Value",
ylab = "Value ($)",
col ="cyan",
border = "darkblue",
pch = 19,
outcol="orange", # Will highlight the outliers in a different color
ylim=c(X_Min, X_Max))
legend("topright", c("In Spec", "Outliers"), border="black",inset=.02, fill = c("cyan", "orange"))
mtext("Market Value Distribution of 2025", outer = TRUE, cex = 2, font = 4)
Observations:
Histogram - This plot shows, just as the statistical summary did, that most values fall within the $500,000s range. Out of the forty-one houses in the neighborhood, twenty-three (about 56%) fall in that subset. House 6321 88th Street is also a part of that subset, which is important to consider when exploring further analysis points.
Boxplot - As the legend distinguishes, this plot highlights four outliers in the dataset. This observation supports the statistical summary above, as most values lie between the first and third quartiles. The outliers are significantly larger than the majority of the data, as the scale of the graph indicates.
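The points a boxplot draws as outliers can also be listed directly. The sketch below uses base R's `boxplot.stats()`, which applies the same rule as `boxplot()` (values more than 1.5 times the hinge spread beyond the hinges); the vector `x` is made up for illustration and is not the CAD data.

```r
# Hypothetical values: a tight cluster plus one much larger value
x <- c(420, 500, 510, 530, 540, 560, 580, 1200)

stats <- boxplot.stats(x)   # same statistics boxplot() uses internally

hinges <- stats$stats[c(2, 4)]  # lower and upper hinge (approx. quartiles)
out    <- stats$out             # values drawn as individual outlier points

cat("hinges:", hinges, "\noutliers:", out, "\n")
```

Listing the flagged values this way makes it easy to map each outlier back to a specific house.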
Understanding the Market Value is important, but its relationship with other key factors, such as the area of the house, is also critical to understanding the dataset. The total area of the house is the sum of the measured main and garage areas. The size of the house is used to calculate the “Total Improvement Market Value”, which is the total appraised value of the house and is in turn a part of the total assessed value, which is equivalent to the market value. It is therefore crucial to understand this variable and its influence on the Market Value.
# Statistical Summary of T_H_A dataset
summary(df$Total_House_Area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2591 3064 3272 3265 3472 4011
# Market Value vs. Total House Area
par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2,0,2,0))
hist(df$Total_House_Area,
main = "Distribution of House Area",
xlab = "Total Main Area (Sq. Ft.)",
col ="lightpink",
border = 'maroon',
ylim= c(0, 12),
xlim=c(2000,4500),
labels = TRUE)
plot(df$Market_Value, df$Total_House_Area,
main = "Market Value vs. House Area",
xlab = "2025 Market Value ($)",
ylab = "Total Main Area (Sq. Ft.)",
col ="hotpink",
pch=19)
points(538409, 3365, cex = 2, pch = 18, col ="maroon4")
text(628409, 3330,labels="#6321", col="maroon4", cex=0.8, font=2)
mtext("2025 Market Value vs. Total House Area", outer = TRUE, cex = 2, font = 4)
Observations:
Statistical Summary - Unlike “Market Value,” the range of this subset is much tighter. Furthermore, the proximity of the median and mean indicates that the data distribution is relatively symmetric or not heavily skewed.
Histogram - This plot further illustrates the conclusion drawn from the summary() output while also showing a slight deviation from a bell-curve shape, since there is an uncharacteristic increase in the second-to-last bar on the right side. The rough symmetry of the distribution is shown by the plot’s bin labels, with the highest concentration of houses falling between 3,000 and 3,700 square feet.
ScatterPlot - The cluster of points on this plot lies between 3,000 and 3,600 square feet, as in the histogram. This graph also shows the houses at the upper and lower ends of the range, which fall short of being outliers. Additionally, the designated house, #6321, is highlighted to show that its value falls within the cluster, indicating that it is about average.
Another critical variable is the land size, defined as the land parcel belonging to the property, which includes the house footprint, front and back yards, garage, etc. This measurement influences the “Land Homesite Value” that is used to calculate the total assessed value, i.e. the Market Value. It is therefore crucial to understand this variable and its influence on the Market Value.
# Statistical Summary of T_L dataset
summary(df$Land_Area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7501 7749 7872 8652 8057 17505
# Market Value vs.Total Land
par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2,0,2,0))
hist(df$Land_Area,
main = "Histogram of Land Area",
xlab = "Land Area (Sq. Ft.)",
col ="lightgreen",
border = 'green4',
labels = TRUE)
plot(df$Market_Value, df$Land_Area,
main = "Market Value vs. Total Land",
xlab = "2025 Market Value ($)",
ylab = "Land Area (Sq. Ft.)",
col = ifelse(df$Land_Area < 8169, "seagreen", "salmon"), #highlights outliers
pch=19)
points(538409, 7546, cex = 2, pch = 18, col ="darkgreen")
text(718000, 7546,labels="#6321", col="darkgreen", cex=0.8, font=2)
legend("topleft", c("Within Spec", "Outliers"), border="black",inset=.02, fill = c("seagreen", "salmon"))
mtext("Market Value vs. Area of Land", outer = TRUE, cex = 2, font = 4)
Observations:
Statistical Summary - The range of this subset is quite large despite the compactness of the interquartile range. Furthermore, the gap between the median (7,872) and the mean (8,652) indicates that the outliers are significantly larger than the typical value, since they pull the mean upward. These observations point to a considerable right skew.
Histogram - This plot further illustrates the conclusion drawn from the summary() output that the data is heavily skewed to the right. The distribution shows that thirty of the forty-one values lie within the 6,000 to 8,000 square-foot bin.
ScatterPlot - The cluster of points on this plot lies between 6,000 and 8,000 square feet, as depicted in the histogram. This graph also shows which points are considered outliers: the points highlighted in salmon are the same points that create the right skew, owing to their significant difference from the median. Additionally, the designated house, #6321, is highlighted to show that its value falls within the cluster, indicating that it is about average.
Having gotten rid of any data that was considered irrelevant or had missing values, visualized critical subsets that influence the Market Value, and justified the observations, the data frame is now ready to be manipulated for future analysis.
The statistical analysis of the data frame will be conducted using the Multiple Regression technique, which is used to examine the relationship between a dependent variable and two or more independent variables. This helps to predict the value of the dependent variable based on the combined influence of multiple predictors, aka the independent variables.
The multiple linear regression equation is given by \[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_nX_n + \varepsilon \]
Where \(Y\) (the dependent variable) denotes the response, \(X_1, X_2, \dots, X_n\) denote the regressors (the independent variables), \(\beta_0\) is the model’s intercept, \(\beta_1, \dots, \beta_n\) are the unknown regression coefficients, and \(\varepsilon\) is the random error term. The predictors chosen therefore have a direct influence on the output and should be chosen wisely.
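To make the regression equation concrete, the following minimal sketch simulates a response from known coefficients and recovers them with `lm()`. Everything here is hypothetical: the variables `x1`, `x2`, the true coefficients (5, 2, -3), and the noise level are chosen only for illustration.

```r
# Simulate a response from known coefficients (all values are hypothetical)
set.seed(123)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 5 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.5)  # beta0 = 5, beta1 = 2, beta2 = -3

# Fit Y = beta0 + beta1*X1 + beta2*X2 + error by least squares
fit <- lm(y ~ x1 + x2)

# The estimates should land close to the true coefficients
est <- coef(fit)
print(round(est, 2))
```

Writing the formula with bare column names (rather than `df$column`) keeps the fitted model usable with `predict()` on new data.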
Before any predictors are chosen, however, the behavior of the dependent variable, the Market Value, should be assessed by regressing it on all available variables.
# Initial check
model_v1 <- lm(df$Market_Value ~ ., data = df)
summary(model_v1)
## Warning in summary.lm(model_v1): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = df$Market_Value ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.988e-11 -4.943e-11 -1.472e-11 1.082e-11 4.430e-10
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.474e-09 9.276e-09 1.021e+00 0.3148
## House_Num -1.531e-12 1.458e-12 -1.050e+00 0.3016
## Total_Improvement_Market_Value 1.000e+00 3.257e-16 3.071e+15 <2e-16 ***
## Total_Land_Market_Value 1.000e+00 6.755e-14 1.480e+13 <2e-16 ***
## Total_House_Area -6.127e-12 2.726e-12 -2.248e+00 0.0316 *
## Main_Area 7.241e-12 3.229e-12 2.243e+00 0.0320 *
## Main_Area_Value. -6.046e-15 2.993e-15 -2.020e+00 0.0518 .
## Garage_Area NA NA NA NA
## Garage_Value 8.451e-14 3.660e-14 2.309e+00 0.0276 *
## Land_Area -1.466e-13 3.928e-13 -3.730e-01 0.7114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.815e-11 on 32 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.005e+31 on 8 and 32 DF, p-value: < 2.2e-16
Observations:
Statistical Value Interpretation - The residual standard error (RSE) of 9.815e-11 is effectively zero, meaning the predicted values match the actual values almost exactly. A multiple \(R^2\) of 1 likewise indicates that the model accounts for all of the variance in Market Value, and the F-test p-value below 2.2e-16 marks the overall model as highly significant. However, the warning states that with an “essentially perfect fit” the summary may be unreliable. The perfect fit is expected here, because Market Value is defined as Total Improvement Market Value plus Total Land Market Value, so including both as predictors reproduces the response exactly.
Level of Significance - The level of significance used to identify critical factors is 0.05. By that criterion, “Total Improvement Market Value” (the monetary value of the house area) and “Total Land Market Value” (the monetary value of the land) are clearly significant, and each has a positive coefficient of 1 with respect to the 2025 Market Value. The areas underlying both of these values were examined in the previous section for this reason and will be used to analyze the data frame further and reach a conclusion. Several other terms have p-values below 0.05 but coefficients on the order of 1e-12 or smaller, and some of those coefficients are negative, which may suggest multicollinearity and requires further evaluation.
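The "essentially perfect fit" can be reproduced on made-up numbers: whenever the response is an exact sum of two predictors, least squares returns coefficients of 1 for both and an \(R^2\) of 1. The sketch below is illustrative only; `improvement`, `land`, and `total` are hypothetical stand-ins, not the CAD columns.

```r
# Hypothetical values standing in for the two value components
set.seed(1)
n           <- 30
improvement <- runif(n, 300000, 900000)
land        <- runif(n, 40000, 80000)
total       <- improvement + land   # response is an exact sum of the predictors

fit <- lm(total ~ improvement + land)

# summary() warns about an essentially perfect fit, so the warning is silenced here
r2 <- suppressWarnings(summary(fit))$r.squared

print(round(coef(fit), 6))  # intercept near 0, both slopes near 1
print(r2)
```

A model like this carries no predictive information about how a property should be valued, which is why predictors that are components of the response are best left out.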
plot(model_v1)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Observations:
Residuals vs. Fitted - This first plot reveals a choppy, curved pattern, whereas the desired trend is a straight line along the horizontal axis, indicating non-constant variance. Additionally, the plot highlights three potential outliers (1, 22, 39) that should be examined further.
Q-Q Residuals - Most points lie along the reference line, but the deviation becomes drastic toward the upper end. Because only a few points veer away, normality is considered to be approximately met.
Scale-Location - Despite the warning RStudio displays, the most important takeaway is that one point is greater than 1.73, which indicates an extremely large deviation, and another point approaches 1.5, which is of concern but to a lesser degree.
Residuals vs. Leverage - This plot shows that several points have high leverage according to the hat diagonal (\(h_{ii}\)): using p = 10, the cutoff is \(\frac{2p}{n} = \frac{2(10)}{41} \approx 0.49\), and six points exceed that value. In addition, three points (1, 4, 24) appear to carry both leverage and influence according to Cook’s Distance (CD). Point 1 falls farther outside the CD contour but has a lower leverage value than point 4, which lies slightly outside the CD contour with greater leverage. Point 24 does not lie outside the CD contour but still has a high \(h_{ii}\) and may therefore still have influence.
Since the output of this regression model displays probable issues, the model needs to be adjusted. The variables included in the revised multiple regression, Total House Area and Land Area, were selected because of their direct relationship with the two determinants (Total Improvement Market Value and Total Land Market Value) used to calculate the property’s market value.
model <- lm(df$Market_Value ~ df$Total_House_Area + df$Land_Area, data = df)
summary(model)
##
## Call:
## lm(formula = df$Market_Value ~ df$Total_House_Area + df$Land_Area,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168460 -45206 5361 39847 91655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41601.632 95079.212 0.438 0.664
## df$Total_House_Area -6.615 29.876 -0.221 0.826
## df$Land_Area 63.329 4.915 12.885 1.92e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59750 on 38 degrees of freedom
## Multiple R-squared: 0.8249, Adjusted R-squared: 0.8157
## F-statistic: 89.52 on 2 and 38 DF, p-value: 4.184e-15
Observations:
Multiple R-Squared - The multiple \(R^2\) of 0.8249 (adjusted 0.8157) indicates that roughly 82% of the variation in Market Value is explained by Total House Area and Land Area together. This is a far more realistic result than the essentially perfect fit of the first model.
P-Value - The overall F-test p-value of 4.184e-15 shows that the model as a whole is significant at the 0.05 level. Among the individual predictors, Land_Area is highly significant (p = 1.92e-15), with each additional square foot of land associated with an increase of about $63 in Market Value, while Total_House_Area is not significant (p = 0.826).
plot(model)
Observations:
Residuals vs. Fitted - Though this plot is also choppy, the trend line is straighter, with many points on or near the reference line, indicating only slight non-constant variance. Additionally, the plot highlights three potential outliers (14, 16, 22) that should be examined further.
Q-Q Residuals - Overall, the residuals follow a normal distribution with little deviation. A few points veer away and three are called out as possible outliers, but most points lie along the reference line, suggesting that normality is met.
Scale-Location - Unlike the previous model, all points lie below the 1.73 limit, indicating no drastic deviation. Point 16 is above 1.5, however, which is of concern and will be discussed later on.
Residuals vs. Leverage - This plot shows that the same three points flagged by Cook’s Distance (CD) are also flagged by the hat diagonal (\(h_{ii}\)), whose cutoff is \(\frac{2p}{n} = \frac{2(3)}{41} \approx 0.15\), meaning that these three points have both leverage and influence. Of the trio, point 14 appears to be the most influential, as it lies outside the CD reference line and farthest from the \(h_{ii}\) cutoff. Points 6 and 16 carry some leverage and influence for opposing reasons: point 6 is farther from the \(h_{ii}\) cutoff but closer to the 0.5 reference line, while point 16 is closer to the 1 reference line yet closer to the \(h_{ii}\) cutoff. Regardless, all three carry a level of influence and should be reviewed further.
The two measures of leverage and influence that we use to determine what points, if any, have any influence on the data set are Cook’s Distance and the Hat Matrix. This is because remote points potentially have a disproportionate impact on the parameter estimates, standard errors, predicted values, and model summary statistics.
The Hat Matrix uses the location of points in x space to determine their potential importance to the properties of the regression model. Its elements \(h_{ij}\) may be interpreted as the amount of leverage exerted by the \(i\)th observation on the \(j\)th fitted value, while the diagonal elements \(h_{ii}\) measure the distance of the \(i\)th observation from the centroid of the x space. Thus, large hat diagonals reveal observations that are potentially influential because they are remote in x space from the rest of the sample. An observation is flagged as a leverage point when its hat diagonal exceeds twice the average value, i.e. 2p/n, where p equals the number of coefficients and n equals the number of observations.
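As a rough illustration of the 2p/n rule, the sketch below computes hat diagonals with base R's `hatvalues()` and also directly from \(H = X(X'X)^{-1}X'\). The data are simulated and hypothetical; the last observation is deliberately placed far from the others in x space.

```r
# Simulated data (hypothetical): the last point is remote in x space
set.seed(42)
n  <- 20
x1 <- c(rnorm(n - 1), 8)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

h      <- hatvalues(fit)     # hat diagonals h_ii
p      <- length(coef(fit))  # number of coefficients, intercept included
cutoff <- 2 * p / n          # twice the average hat diagonal
high_leverage <- which(h > cutoff)

# The same diagonals computed from H = X (X'X)^{-1} X'
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)

print(cutoff)
print(high_leverage)
```

The hat diagonals always sum to p, which is why their average is p/n and the cutoff is twice that value.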
Cook’s Distance works in tandem with the Hat Matrix, but instead of considering only the x space, it uses both the location of the point in the x space and the response variable to measure influence. It does so through a measure of the squared distance between the least-squares estimate based on all n points, \(\hat{\beta}\), and the estimate obtained by deleting the \(i\)th point, \(\hat{\beta}_{(i)}\). This distance measure can be expressed in a general form, with the usual choices being \(M = X'X\) and \(c = p\,MS_{Res}\), as:
\[ D_i(M, c) = \frac{(\hat{\beta}_{(i)}-\hat{\beta})' M (\hat{\beta}_{(i)}-\hat{\beta})}{c}, \quad i = 1, 2, \dots, n \]
Where points with large values of \(D_i\) have considerable influence on the least-squares estimates. Note, however, that not all leverage points are influential on the regression coefficients. It is therefore important to check each flagged point's actual effect on the fit, which is done by comparing the model with and without the flagged points, as seen below.
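# Removing the rows at positions 6, 14, and 16, which were flagged as influential
# Note: negative indices drop rows by position, not by row name; after house 6310 was
# removed, row names (used as plot labels) no longer match positions, so verify the match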
newdf <- df[-c(6,14,16),]
model_v2 <- lm(newdf$Market_Value ~ newdf$Total_House_Area + newdf$Land_Area, data =newdf)
summary(model)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41601.631808 95079.211971 0.4375471 6.641915e-01
## df$Total_House_Area -6.614682 29.875695 -0.2214068 8.259612e-01
## df$Land_Area 63.328783 4.914995 12.8848120 1.921619e-15
summary(model_v2)$coeff
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36172.944321 99167.096629 0.3647676 7.174798e-01
## newdf$Total_House_Area -5.894093 31.050966 -0.1898199 8.505468e-01
## newdf$Land_Area 63.549886 5.120675 12.4104507 2.243746e-14
Observations: The intercept shows the largest change once the trio of points is removed, decreasing by about 5,429 (from 41,602 to 36,173), but this shift is small relative to the intercept's standard error of roughly 95,000. The coefficients for Total_House_Area and Land_Area stay very similar, which shows that deleting these points does not substantially change the fitted model.
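One detail that may be worth verifying in a removal step like this: R's diagnostic plots label points by row name by default, while a negative index such as `df[-c(6, 14, 16), ]` drops rows by position. Once a row has been removed with `subset()`, the two no longer coincide. The sketch below shows the difference on a tiny made-up data frame (`toy` and its `id` column are hypothetical).

```r
# Hypothetical five-row data frame; removing "b" leaves row names 1, 3, 4, 5
toy <- data.frame(id = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
toy <- subset(toy, id != "b")
print(rownames(toy))

# Negative index: drops the third remaining row, which is "d"
by_position <- toy[-3, , drop = FALSE]

# Row-name match: drops the row labelled "3", which is "c"
by_label <- toy[rownames(toy) != "3", , drop = FALSE]

print(by_position$id)
print(by_label$id)
```

Selecting by row name, or by a key column such as the house number, avoids this ambiguity.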
summary(model)$r.square
## [1] 0.8249179
summary(model_v2)$r.square
## [1] 0.8258408
Observations: The new model without the three flagged points has a slightly higher R-squared, but the difference appears only in the thousandths place (0.8258 vs. 0.8249), so the points are not as influential as first thought. Based on these diagnostics, it can be concluded that while these points do carry some leverage and influence, excluding them would not make the model meaningfully stronger.
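The leave-one-out idea behind Cook's Distance can be sketched directly: refit without a point, measure how far the coefficients move, and compare with base R's `cooks.distance()`. The data below are simulated and hypothetical, and the check uses the common choices \(M = X'X\) and \(c = p\,MS_{Res}\), which are assumed here rather than taken from the analysis above.

```r
# Simulated data (hypothetical): the last point is remote in x and off-trend in y
set.seed(7)
n <- 25
x <- c(rnorm(n - 1), 6)
y <- 2 + 3 * x + rnorm(n)
y[n] <- y[n] - 15
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)
d   <- cooks.distance(fit)
i   <- unname(which.max(d))   # position of the most influential point

# Refit without that point and measure the shift in the coefficients
fit_drop <- lm(y ~ x, data = dat[-i, ])
delta    <- coef(fit_drop) - coef(fit)

# Cook's Distance from its definition, with M = X'X and c = p * MS_Res
X        <- model.matrix(fit)
p        <- length(coef(fit))
ms_res   <- summary(fit)$sigma^2
d_manual <- as.numeric(t(delta) %*% crossprod(X) %*% delta) / (p * ms_res)

print(rbind(all_points = coef(fit), point_dropped = coef(fit_drop)))
print(c(builtin = d[[i]], manual = d_manual))
```

Comparing coefficient tables with and without a point, as done above for the housing model, is the same check carried out by hand.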
Besides leverage, another diagnostic applied to the data set is the Variance Inflation Factor (VIF), which checks for multicollinearity. Multicollinearity refers to a situation in multiple regression analysis where two or more predictor variables exhibit a high degree of linear correlation. This violates the assumption of independence among explanatory variables and can inflate the standard errors of the estimated coefficients, leading to unreliable statistical inferences. Key consequences include unstable coefficient estimates and reduced interpretability of individual predictor effects. VIF quantifies the extent to which multicollinearity increases the variance of a regression coefficient. It is calculated for each predictor as:
\[ VIF_i = \frac{1}{1-R_{i}^{2}} \]
Where \(R_{i}^{2}\) is the coefficient of determination obtained when the \(i\)th predictor is regressed on all the other predictors, otherwise known as regressor variables. The VIF indicates how much the variance of a regression coefficient is inflated due to multicollinearity with other predictors. If the VIF is too high, correlated variables can be removed or combined, or the model specification can be reconsidered via Principal Component Analysis (PCA) or Ridge Regression.
According to D.W. Marquardt [1970], if:
VIF = 1 → No correlation with other variables.
VIF between 1 and 5 → Moderate correlation, usually acceptable.
VIF > 5 or 10 → High correlation, multicollinearity may be problematic.
Therefore, the car library, which contains the necessary vif() function, is used to quantify the multicollinearity among the model’s variables.
#check for multicollinearity using VIF
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
vif_values <- vif(model)
vif_values
## df$Total_House_Area df$Land_Area
## 1.088806 1.088806
vif_df <- data.frame(Variable = names(vif_values), VIF = vif_values)
print(vif_df)
## Variable VIF
## df$Total_House_Area df$Total_House_Area 1.088806
## df$Land_Area df$Land_Area 1.088806
Observations: Both VIF values are approximately 1.09, well within the acceptable range and close to 1, which indicates only weak correlation between the two predictors (a VIF of 1.09 corresponds to an \(R^2\) of about 0.08 when one predictor is regressed on the other). Some correlation is expected, since the house sits on the land and a larger lot can accommodate a larger house. The low VIF values align with expectations, allowing the analysis to proceed.
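The VIF can also be computed by hand from its definition, without the car package, by regressing each predictor on the other. The sketch below uses simulated, hypothetical predictors `x1` and `x2`; it also shows why a model with exactly two predictors reports the same VIF for both, since each auxiliary \(R^2\) equals their squared correlation.

```r
# Simulated predictors (hypothetical) with a moderate linear relationship
set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)

# R^2 from regressing each predictor on the other
r2_x1 <- summary(lm(x1 ~ x2))$r.squared
r2_x2 <- summary(lm(x2 ~ x1))$r.squared

# VIF_i = 1 / (1 - R_i^2)
vif_x1 <- 1 / (1 - r2_x1)
vif_x2 <- 1 / (1 - r2_x2)

print(c(vif_x1 = vif_x1, vif_x2 = vif_x2))
print(cor(x1, x2)^2)   # equals both auxiliary R^2 values
```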
Now that the data set has been cleaned, manipulated, and analyzed, the final step is to assess whether the 2025 Market Value of the home at 6321 88th Street is greater or less than it should be. This begins by creating a new data frame containing that home's information from the data set, which is also printed out.
# Then confidence intervals @ House Number
model <- lm(Market_Value ~ Total_House_Area + Land_Area, data = df)
# Review House Information
print(df[df$House_Num == 6321, c("House_Num", "Market_Value", "Total_House_Area", "Land_Area")])
## House_Num Market_Value Total_House_Area Land_Area
## 13 6321 538409 3365 7546
# Create a new data frame with the specified numbers
House_Info <- df[df$House_Num == 6321, c("Total_House_Area", "Land_Area")]
CI_6321 <- predict(model, newdata = House_Info, interval = "confidence", level = 0.95)
PI_6321 <- predict(model, newdata = House_Info, interval = "prediction", level = 0.95)
The fitted model and the new data frame are used to construct confidence and prediction intervals at this home's predictor values. The difference between an observed value \(y_i\) and the corresponding fitted value \(\hat{y}_i\) is a residual. If the errors are normally and independently distributed, which is assumed here, the standardized slope estimate follows a t distribution with n − 2 degrees of freedom in simple linear regression (n − p in general, which is 41 − 3 = 38 for the two-predictor model fitted here). For the simple linear case, a 100(1 − α) percent confidence interval (CI) on the slope \(\beta_{1}\) is given by
\[ \hat{\beta}_{1} - t_{\alpha /2, n-2}\,se(\hat{\beta}_{1}) \leq \beta_{1} \leq \hat{\beta}_{1} + t_{\alpha /2, n-2}\,se(\hat{\beta}_{1}) \]
where the width of these confidence intervals is a measure of the overall quality of the regression line. The same framework is used to derive the prediction interval for a future observation \(y_{0}\) at \(x_{0}\), starting from the variance of the prediction error \(\psi\):
\[ \psi = y_{0} - \hat{y}_{0}, \qquad Var(\psi) = Var(y_{0} - \hat{y}_{0}) = \sigma^2\left[1+\frac{1}{n}+\frac{(x_{0} - \bar{x})^2}{S_{xx}}\right] \]
The prediction error \(y_{0} - \hat{y}_{0}\) is normally distributed with mean zero and the variance given above, because the future observation \(y_{0}\) is independent of the fitted value \(\hat{y}_{0}\). Thus, the 100(1 − α) percent prediction interval on a future observation at \(x_{0}\) is:
\[ \hat{y_{0}}-t_{\alpha /2, n-2}\sqrt{MS_{Res}(1+\frac{1}{n}+\frac{(x_{0} - \bar{x})^2}{S_{xx}})}\leq y_{0} \leq \hat{y_{0}}+t_{\alpha /2, n-2}\sqrt{MS_{Res}(1+\frac{1}{n}+\frac{(x_{0} - \bar{x})^2}{S_{xx}})} \]
where the prediction interval is always wider than the confidence interval because the prediction interval depends on both the error from the fitted model and the error associated with future observations. This is demonstrated below with the confidence and prediction intervals for the market value of the home at 6321 88th Street.
cat("Confidence Interval (95%):", CI_6321[2], "-", CI_6321[3],
"\nPrediction Interval (95%):", PI_6321[2], "-", PI_6321[3])
## Confidence Interval (95%): 473715.1 - 520729.4
## Prediction Interval (95%): 374011.4 - 620433
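As a cross-check on the interval formulas above, the sketch below rebuilds the slope confidence interval and the prediction interval by hand and compares them with R’s own `confint()` and `predict()`. It is only an illustration: it uses R’s built-in `mtcars` data with a single predictor (so the n − 2 degrees of freedom in the formulas apply directly), not the property data, and the value `x0 = 3` is an arbitrary assumption. The confidence interval on the mean response uses the same expression as the prediction interval without the leading 1 inside the square root. For the two-predictor model used in this case study, the residual degrees of freedom would be n − 3 instead.

```r
# Illustrative check on R's built-in mtcars data (NOT the property data).
fit    <- lm(mpg ~ wt, data = mtcars)      # simple linear regression
n      <- nrow(mtcars)
x      <- mtcars$wt
x0     <- 3                                # hypothetical new predictor value
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 2)    # t_{alpha/2, n-2}

# CI on the slope: beta1_hat -/+ t * se(beta1_hat)
b1       <- coef(fit)[["wt"]]
se_b1    <- summary(fit)$coefficients["wt", "Std. Error"]
slope_ci <- c(b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Pieces shared by the interval formulas
ms_res <- sum(residuals(fit)^2) / (n - 2)  # MS_Res
s_xx   <- sum((x - mean(x))^2)             # S_xx
y0_hat <- sum(coef(fit) * c(1, x0))        # fitted value at x0
h      <- 1 / n + (x0 - mean(x))^2 / s_xx

# CI on the mean response and PI on a future observation at x0
ci_manual <- y0_hat + c(-1, 1) * t_crit * sqrt(ms_res * h)
pi_manual <- y0_hat + c(-1, 1) * t_crit * sqrt(ms_res * (1 + h))

# The same intervals from R's built-in functions
new_x    <- data.frame(wt = x0)
slope_r  <- confint(fit, "wt", level = 0.95)[1, ]
ci_r     <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
pi_r     <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

cat("Slope CI (manual): ", slope_ci, "\nSlope CI (confint):", slope_r,
    "\nCI (manual): ", ci_manual, "\nCI (predict):", ci_r[1, c("lwr", "upr")],
    "\nPI (manual): ", pi_manual, "\nPI (predict):", pi_r[1, c("lwr", "upr")], "\n")
```

The manual and built-in results should agree to numerical precision, and the prediction interval should come out wider than the confidence interval at the same `x0`.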
cat("Predicted Market Value:", CI_6321[1],
"\nAssessed Market Value:", df[12,2][1])
## Predicted Market Value: 497222.2
## Assessed Market Value: 538409
Difference = (df[12,2][1]) - (CI_6321[1])
cat("Overvalued By:", Difference)
## Overvalued By: 41186.78
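Observations: The predicted market value for 6321 88th Street is $497,222, with a 95% confidence interval of $473,715 to $520,729 and a 95% prediction interval of $374,011 to $620,433. The assessed value of $538,409 lies above the upper confidence bound, although it remains inside the wider prediction interval. This suggests the home is overvalued by about $41,187, which leads to an unfair tax burden.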
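The comparison of the assessed value against the two intervals can be sketched as a small check. The figures are copied from the output printed above; the helper `in_interval()` and the variable names are illustrative assumptions and are not part of the original analysis.

```r
# Values copied from the printed output above
predicted <- 497222.2
assessed  <- 538409
conf_int  <- c(lwr = 473715.1, upr = 520729.4)   # 95% confidence interval
pred_int  <- c(lwr = 374011.4, upr = 620433)     # 95% prediction interval

# Hypothetical helper: TRUE if value lies within [lwr, upr]
in_interval <- function(value, interval) {
  value >= interval[["lwr"]] && value <= interval[["upr"]]
}

inside_ci <- in_interval(assessed, conf_int)
inside_pi <- in_interval(assessed, pred_int)
gap       <- assessed - predicted          # assessed minus predicted value
gap_pct   <- 100 * gap / predicted         # gap relative to the prediction

cat("Assessed value inside 95% CI:", inside_ci,
    "\nAssessed value inside 95% PI:", inside_pi,
    "\nAssessed minus predicted:", round(gap, 2),
    "\nGap as % of predicted:", round(gap_pct, 1), "\n")
```

A value outside the confidence interval but inside the prediction interval is higher than the model expects on average for a home with these characteristics, yet still within the range an individual home could plausibly take.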
Governments need to collect taxes to function. Federal, state, and local governments impose taxes on real property, personal property, income, and so on. All of these, including property tax, are forms of revenue that the county uses to fund municipal operations. The process involves the valuation of a property by local tax authorities to establish its fair market value. A tax assessor estimates the property’s value (land plus house), also known as the assessed value, through a tax assessment. The property tax is then levied as a percentage of the assessed value, less any exemptions.
These assessments determine the amount of property tax that homeowners are obligated to pay. The problem is that the assessment is not very scientific; rather, the tax assessor adjusts the value of the property based on what they feel is right. Because the assessed value is directly linked to the amount of property tax that must be paid, it is essential to have an accurate estimate of a home’s market value. Inaccuracies in market value assessments can lead to overvaluation, resulting in disproportionately high tax burdens. For this reason, homeowners have the chance to appeal their assessment to the assessor, even taking it to court if necessary.
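The link between assessed value and tax owed can be illustrated with a short calculation. The two property values come from the model output reported earlier; the tax rate and exemption below are purely hypothetical placeholders (the actual Lubbock rates are not given in this study), and the function `property_tax()` is an illustrative assumption that applies exemptions to the value before the rate.

```r
# Values from the model output reported earlier
assessed_value  <- 538409
predicted_value <- 497222.2

# HYPOTHETICAL inputs, for illustration only (not actual Lubbock figures)
tax_rate  <- 0.02   # assumed combined rate of 2%
exemption <- 0      # assumed exemption amount in dollars

# Tax = rate * (value - exemption), never below zero
property_tax <- function(value, rate, exemption = 0) {
  max(value - exemption, 0) * rate
}

tax_assessed  <- property_tax(assessed_value, tax_rate, exemption)
tax_predicted <- property_tax(predicted_value, tax_rate, exemption)
extra_tax     <- tax_assessed - tax_predicted

cat("Tax on assessed value: ", round(tax_assessed, 2),
    "\nTax on predicted value:", round(tax_predicted, 2),
    "\nExtra tax per year:    ", round(extra_tax, 2), "\n")
```

Under any fixed rate, the extra tax is simply the rate multiplied by the gap between the assessed and predicted values, so the result scales directly with whatever rate actually applies.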
In this case, the appraisal of 6321 88th Street, Lubbock, Texas appears to overvalue the home when compared with the other homes in the surrounding neighborhood (6309-6351 88th Street). The assessed market value exceeds what would be expected relative to comparable homes of similar size.
For example:
Throughout the EDA process, the house’s features fall within the average range for all key variables. This includes its total land area and house area, both of which have a significant relationship and strong correlation with market value. The assessed value should therefore also fall within the average price range.
The 95% prediction interval from the multiple linear regression model, which compares the house’s properties with those of its neighbors, is $374,011 to $620,433, with a predicted market value of $497,222. The assessed value of $538,409 sits above that prediction, so the property may be overvalued.
The houses that are normally priced within the range of the assessed value contain additional features such as a pool or a second garage. 6321 88th Street has no such features and therefore should not be priced at the same level.
For these reasons, the home appears to be overvalued by about $41,187, which leads to an unfair tax burden, and it should be reassessed so that its market value better fits the evidence.
As previously explained, seventeen of the forty-two homes in the data set contain additional features such as a second garage or house, a pool, or a pool house. This information can help explain the pricing of the more extravagant houses, but neither the value nor the square footage of the additional features directly influences the main variables.
The subset of homes with additional features is split into two parts: those with only one additional feature and those with several. The split is made because homes with only one additional feature fall within a slightly elevated yet specific range of market value. The land area may or may not show the same pattern, but most houses with additions also sit at the upper end of this variable. The houses with one additional feature are shown below by feature:
House #2:
Garage #2:
Pool:
Apart from those above, three houses have more than one additional feature and, as expected, their total land area and market value are significantly higher than both the average and the homes with one feature. These houses are the outliers seen in the EDA process.
It is important to remember, however, that the value and size of these features do not directly influence the final land value or assessed price, even though they help explain those variables.
Below is the entirety of my code for this project. It has been commented out so it will not execute the commands again, but is here for your perusal. This concludes the work for my case study.
# # Read in the data from the URL given
# df <- read.table("https://raw.githubusercontent.com/Isabella-Ortiz/IE-5344/refs/heads/main/Properties_Info.csv", header = TRUE, sep=',')
#
# head(df)
#
# #Removing irrelevant and missing data
# df <- subset(df, select = -Homestead_Cap_Loss)
# df <- subset(df,House_Num != 6310 )
# head(df)
#
# X_Max = max(df$Market_Value)
# X_Min = min(df$Market_Value)
#
#
# summary(df$Market_Value)
#
# # 2025 Market Value Distribution
# par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2,0,2,0))
# hist(df$Market_Value,
# main = "Histogram of Market Value",
# xlab = "2025 Market Value ($)",
# col ="lightblue",
# border = 'darkblue',
# ylim=c(0, 25),
# xlim = c(400000, 1450000))
#
# boxplot(df$Market_Value,
# main = "Boxplot of Market Value",
# xlab = "Market Value",
# ylab = "Value ($)",
# col ="cyan",
# border = "darkblue",
# pch = 19,
# outcol="orange",
# ylim=c(X_Min, X_Max))
#
# legend("topright", c("Within Spec", "Outliers"), border="black",inset=.02, fill = c("cyan", "orange"))
# mtext("Market Value Distribution", outer = TRUE, cex = 2, font = 4)
#
# # Market Value vs. Total House Area
# par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2,0,2,0))
# hist(df$Total_House_Area,
# main = "Distribution of House Area",
# xlab = "Total Main Area (Sq. Ft.)",
# col ="lightpink",
# border = 'maroon',
# ylim= c(0, 12),
# xlim=c(2000,4500))
#
# plot(df$Market_Value, df$Total_House_Area,
# main = "Market Value vs. House Area",
# xlab = "2025 Market Value ($)",
# ylab = "Total Main Area (Sq. Ft.)",
# col ="hotpink",
# pch=19)
#
# points(538409, 3365, cex = 2, pch = 18, col ="maroon4")
# text(628409, 3330,labels="#6321", col="maroon4", cex=0.8, font=2)
#
#
# mtext("2025 Market Value vs. Total House Area", outer = TRUE, cex = 2, font = 4)
#
#
# # Market Value vs.Total Land
# summary(df$Land_Area)
#
# par(mfrow = c(1, 2), mar = c(4, 4, 3, 0.2) + 0.1, oma = c(2,0,2,0))
# hist(df$Land_Area,
# main = "Histogram of Land Area",
# xlab = "Land Area (Sq. Ft.)",
# col ="lightgreen",
# border = 'darkgreen')
#
#
# plot(df$Market_Value, df$Land_Area,
# main = "Market Value vs. Total Land",
# xlab = "2025 Market Value ($)",
# ylab = "Land Area (Sq. Ft.)",
# col = ifelse(df$Land_Area < 8169, "seagreen", "salmon"),
# pch=19)
# legend("topleft", c("Within Spec", "Outliers"), border="black",inset=.02, fill = c("seagreen", "salmon"))
#
# mtext("Market Value vs. Area of Land", outer = TRUE, cex = 2, font = 4)
#
# # Multiple Regression
#
# # Use the bare column name so that `.` does not pull Market_Value in as a predictor
# model_v1 <- lm(Market_Value ~ ., data = df)
# summary(model_v1)
# plot(model_v1)
#
# # Bare column names let predict() accept newdata later on
# model <- lm(Market_Value ~ Total_House_Area + Land_Area, data = df)
# summary(model)
#
# # Outliers, Leverage, and Influence
# newdf <- df[-c(6,14,16),]
# model_v2 <- lm(Market_Value ~ Total_House_Area + Land_Area, data = newdf)
# summary(model_v2)
#
# # Compare coefficients and R-squared with and without the outliers
# summary(model)$coefficients
# summary(model_v2)$coefficients
#
# summary(model)$r.squared
# summary(model_v2)$r.squared
#
# plot(model_v2) # Just for fun to see if it helps.. but it doesn't look like it does.
#
# #check for multicollinearity using Variance Inflation Factor (VIF)
# library(car)
#
# vif_values <- vif(model)
# vif_values
#
# vif_df <- data.frame(Variable = names(vif_values), VIF = vif_values)
# print(vif_df)
#
#
# # Then confidence intervals @ House Number
# model <- lm(Market_Value ~ Total_House_Area + Land_Area, data = df)
#
# # Review House Information
# print(df[12,c(1,2,5,10)])
#
# # Create a new data frame with the specified numbers
# House_Info <- data.frame(Total_House_Area = df$Total_House_Area[12], Land_Area = df$Land_Area[12])
#
# CI_6321 <- predict(model, newdata = House_Info, interval = "confidence", level = 0.95)
# PI_6321 <- predict(model, newdata = House_Info, interval = "prediction", level = 0.95)
#
# # Print intervals and difference between values
# cat("Confidence Interval (95%):", CI_6321[2], "-", CI_6321[3], "\nPrediction Interval (95%):", PI_6321[2], "-", PI_6321[3])
#
# cat("Predicted Market Value:", CI_6321[1], "\nAssessed Market Value:", df[12,2][1])
#
# Difference = (df[12,2][1]) - (CI_6321[1])
# cat("Overvalued By:", Difference)
#
# THE END!