Property tax serves as a critical revenue stream for county operations, calculated as a percentage of a property’s assessed value. This value—determined by combining land and improvement (house) appraisals—relies heavily on subjective adjustments by tax assessors. Homeowners may appeal these assessments if they appear inconsistent with comparable properties.
This report evaluates whether the 2025 Market Value assessment of 6321 88th Street, Lubbock, Texas, aligns fairly with neighboring homes (6309–6351 88th Street). Using regression analysis and statistical diagnostics (e.g., prediction intervals, outlier detection), we objectively determine if the property is over- or under-assessed. The findings are presented for both the county tax assessor and presiding judge, with technical concepts explained in accessible terms.
Data were manually compiled from lubbockcad.org and uploaded to GitHub. The dataset includes all properties on 88th Street between 6309 and 6351, with these key variables:
Physical characteristics: Total Main Area (Sq. Ft.), Garage (Sq. Ft.), Land (Sq. Ft.)
Financial metrics: 2025 Market Value
The candidate variables in this analysis are the physical characteristics (independent variables) and the market value (dependent variable).
The goal of this section is to examine the relationship between the dependent and independent variables. This gives a basic understanding of how the data are distributed and how the variables relate to each other. Throughout the analysis we also track the position of property 6321 to judge whether its assessment is consistent with that of the other properties. Because multiple linear regression is sensitive to multicollinearity and overfitting, some data preparation is needed before fitting it. To guard against overfitting, we avoid adding independent variables that nominally account for more variance but add nothing of substance to the model. The preparation includes:
Correlations
Scatter plots
Simple regressions
library(ggplot2)
library(dplyr)
library(corrplot)
library(car)
#fetching data from url link
url<- "https://raw.githubusercontent.com/tafadzwabanga/Project-Tax/refs/heads/main/property_evaluation.csv"
#downloading data from url
download.file(url, destfile = "property_evaluation.csv")
#load the datasets
property_evaluation <- read.csv("property_evaluation.csv")
#view data
head(property_evaluation)
## Property..ID market.value Main.area Garage Land.size Year
## 1 6309 735026 3192 1063 10000 2015
## 2 6310 663907 3226 1078 10463 2017
## 3 6311 569992 3036 965 10000 2013
## 4 6312 602427 3277 909 10625 2015
## 5 6313 460288 2241 506 7785 2018
## 6 6314 968766 4188 985 13788 2021
In this section we look for a linear relationship between each pair of variables, which supports the assumption that the independent variable has an effect on the dependent variable. In terms of correlation, a high correlation indicates a strong linear relationship between the two variables. If that relationship does not exist or is very weak, the chosen independent variable probably has little effect on the dependent variable, so there is little reason to include it in the model.
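As a small illustration of the idea above, the sketch below computes the Pearson correlation between main area and market value using only the six rows printed by `head()` earlier in this report. The data frame name `first_six` is introduced here purely for illustration; the full analysis would use `property_evaluation`, and six rows are far too few to draw conclusions from.

```r
# Six rows copied from the head() output above (illustration only)
first_six <- data.frame(
  Main.area    = c(3192, 3226, 3036, 3277, 2241, 4188),
  market.value = c(735026, 663907, 569992, 602427, 460288, 968766)
)

# Pearson correlation: values near +1 or -1 indicate a strong linear relationship
r <- cor(first_six$Main.area, first_six$market.value)

# cor.test() adds a significance test and a confidence interval
ct <- cor.test(first_six$Main.area, first_six$market.value)

print(r)
print(ct$p.value)
```

A correlation close to zero here would argue against including the variable; a correlation close to one says only that the relationship is linear, not that it is causal.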
# Adding a column to flag the special data point (6321 88th Street)
property_evaluation$highlight <- ifelse(property_evaluation$Property..ID == "6321", "House_6321", "Other")
# Now plot and change color based on that flag
ggplot(property_evaluation, aes(x = Main.area, y = market.value, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Market value vs Main Area", color = "Property") +
stat_smooth(method = "lm", col = "green")
Observation
The data show a roughly linear pattern: main area has a positive relationship with market value. This matches intuition, since market value should rise as the main area increases.
Market value appears highly correlated with main area.
There are two outliers with a main area above 4,000 sq. ft. that have high leverage on the predicted values.
These two points may inflate the apparent strength of the regression relationship, both in statistical significance (a smaller p-value, making a significant result more likely) and in practical significance (a larger R-squared).
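The leverage concern above can be checked numerically rather than by eye. The sketch below uses synthetic data (not the Lubbock dataset) with two deliberately large homes, and flags high-leverage points with the common rule of thumb of 2p/n. The variable names and the cutoff are assumptions made for illustration; other cutoffs are also in use.

```r
# Synthetic data (NOT the Lubbock dataset): 20 typical homes plus two large ones
main_area <- c(seq(2000, 3500, length.out = 20), 4200, 4500)
set.seed(1)
value <- -200000 + 280 * main_area + rnorm(length(main_area), sd = 40000)

fit <- lm(value ~ main_area)

# Leverage (hat values) depends only on the predictor values
lev <- hatvalues(fit)

# Rule of thumb: flag leverage above 2 * p / n, where p is the number of
# coefficients (here 2) and n is the sample size
cutoff <- 2 * length(coef(fit)) / length(main_area)
high_leverage <- which(lev > cutoff)

# Cook's distance combines leverage and residual size into one influence measure
cooks <- cooks.distance(fit)

print(high_leverage)
print(round(cooks[high_leverage], 3))
```

Applied to `model_3`, the same two calls (`hatvalues()` and `cooks.distance()`) would show whether the two large homes are merely unusual in size or are actually pulling the fitted line.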
ggplot(property_evaluation, aes(x = Garage, y = market.value, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Market value vs Garage Area", color = "Property") +
stat_smooth(method = "lm", col = "green")
Observation
The data form clusters, but the overall pattern is still roughly linear.
There are major outliers among properties with a market value greater than $800,000, which are also shown in the appendix.
Because these outliers are influential, they can be expected to affect all of the regression statistics, including the p-value, R-squared, coefficients, and intercept.
ggplot(property_evaluation, aes(x = Land.size, y = market.value, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Market value vs Land Size", color = "Property") +
stat_smooth(method = "lm", col = "green")+
geom_vline(xintercept = 9000, linetype = "dashed", linewidth = 1.2, color = "blue")
Observation
Although the relationship is positive, the 8 properties with a land size above 9,000 sq. ft. (to the right of the blue dashed line) appear to have a strong effect on the fitted relationship between land size and market value.
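One way to probe this observation is to refit the simple regression without the large lots and compare slopes. The sketch below does this on synthetic data (not the Lubbock dataset) constructed to mimic the pattern described: 34 smaller lots with no trend and 8 large lots with higher values. The data frame `lots` and the 9,000 sq. ft. split are assumptions for illustration only.

```r
# Synthetic data (NOT the Lubbock dataset), built to mimic the pattern above:
# 34 lots below 9,000 sq. ft. with no trend, and 8 large lots with higher values
set.seed(42)
small_lots <- data.frame(Land.size = runif(34, 7000, 9000))
small_lots$market.value <- 500000 + rnorm(34, sd = 30000)
large_lots <- data.frame(Land.size = runif(8, 10000, 14000))
large_lots$market.value <- 63 * large_lots$Land.size + rnorm(8, sd = 30000)
lots <- rbind(small_lots, large_lots)

# Fit once on all lots and once on lots at or below the 9,000 sq. ft. marker
fit_all   <- lm(market.value ~ Land.size, data = lots)
fit_small <- lm(market.value ~ Land.size, data = subset(lots, Land.size <= 9000))

# Compare the slopes: a large change suggests the big lots drive the trend
slopes <- c(all        = unname(coef(fit_all)["Land.size"]),
            small_only = unname(coef(fit_small)["Land.size"]))
print(slopes)
```

On the real data the equivalent comparison would be `lm(market.value ~ Land.size, data = subset(property_evaluation, Land.size <= 9000))` against `model_1`.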
In this part of the analysis we check for multicollinearity among the independent variables. Ideally the independent variables do not depend on one another, because that makes it easier to identify which variable actually has an impact on the dependent variable.
ggplot(property_evaluation, aes(x = Land.size, y = Garage, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Garage vs Land Size", color = "Property")
Observation
ggplot(property_evaluation, aes(x = Main.area, y = Garage, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Garage vs Main Area", color = "Property")
Observation
ggplot(property_evaluation, aes(x = Main.area, y = Land.size, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Land Size vs Main Area", color = "Property") +
stat_smooth(method = "lm", col = "blue")+
geom_hline(yintercept = 7800, linetype = "dashed", linewidth = 1.2, color = "darkgreen")
Observation
There is a linear relationship between land size and main area, but properties within a certain range of main area also appear to follow a pattern of their own.
When properties with a land size above 9,000 sq. ft. are omitted, the fitted relationship changes from positive to essentially flat, as illustrated by the green dashed line. This shows how property 6321 compares with non-outlier properties of similar main area and land size.
Only 8 properties with a land size above 9,000 sq. ft. drive the fitted trend for the other 34 properties.
To summarize and confirm the observations from the scatter plots, the correlation plot below shows the pairwise relationships among the variables.
numeric_data <- property_evaluation %>% select(where(is.numeric))
corrplot(cor(numeric_data), method = "number", type = "upper")
Evaluation of the simple regression models of market value on each independent variable
Market Value and Land Size
model_1 <- lm(market.value ~ Land.size, data = property_evaluation)
summary(model_1)
##
## Call:
## lm(formula = market.value ~ Land.size, data = property_evaluation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170391 -41030 5992 41510 94237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18715.415 41828.130 0.447 0.657
## Land.size 63.193 4.696 13.458 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60330 on 40 degrees of freedom
## Multiple R-squared: 0.8191, Adjusted R-squared: 0.8146
## F-statistic: 181.1 on 1 and 40 DF, p-value: < 2.2e-16
Market Value and Garage
model_2 <- lm(market.value ~ Garage, data = property_evaluation)
summary(model_2)
##
## Call:
## lm(formula = market.value ~ Garage, data = property_evaluation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -188200 -55712 -788 31218 514392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 314139.75 58640.49 5.357 3.77e-06 ***
## Garage 365.15 80.53 4.535 5.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115300 on 40 degrees of freedom
## Multiple R-squared: 0.3395, Adjusted R-squared: 0.323
## F-statistic: 20.56 on 1 and 40 DF, p-value: 5.148e-05
Market Value and Main Area
model_3 <- lm(market.value ~ Main.area, data = property_evaluation)
summary(model_3)
##
## Call:
## lm(formula = market.value ~ Main.area, data = property_evaluation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -156977 -16705 -3260 22014 154869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -227945.26 47426.18 -4.806 2.19e-05 ***
## Main.area 281.80 16.58 16.994 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49480 on 40 degrees of freedom
## Multiple R-squared: 0.8783, Adjusted R-squared: 0.8753
## F-statistic: 288.8 on 1 and 40 DF, p-value: < 2.2e-16
Summary on the paired models
All slope p-values are below the 0.05 threshold, which indicates that each independent variable, taken on its own, has a statistically significant relationship with market value.
The garage model has a much higher residual standard error (115,300) and a low R-squared (0.3395) and adjusted R-squared (0.323), so garage area alone is a weak predictor.
The main area model has the highest F-statistic (288.8), R-squared (0.8783) and adjusted R-squared (0.8753), and the lowest residual standard error (49,480), making it the best of the three simple models.
These results serve as a baseline for judging whether a multiple linear regression model is a better tool for explaining market value and, in turn, for deciding whether the valuation of property 6321 is justified.
confint(model_1)
## 2.5 % 97.5 %
## (Intercept) -65822.38981 103253.21936
## Land.size 53.70278 72.68293
confint(model_2)
## 2.5 % 97.5 %
## (Intercept) 195622.9121 432656.5975
## Garage 202.4003 527.8973
confint(model_3)
## 2.5 % 97.5 %
## (Intercept) -323797.1536 -132093.3672
## Main.area 248.2893 315.3172
Observation
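Since none of the slope confidence intervals contains zero (all lie entirely above zero), we reject the hypothesis that any slope is zero and conclude that land size, garage and main area each explain variation in market value when considered on their own. The intercept interval for the land size model does contain zero, but the intercept is not the quantity of interest here.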
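To make the interval calculation concrete, the sketch below reproduces the 95% confidence interval for the land size slope by hand, using only the estimate, standard error and residual degrees of freedom reported in the `summary(model_1)` output above. The variable names are chosen for illustration; small differences from `confint(model_1)` are expected because the inputs are rounded.

```r
# Values copied from the summary(model_1) output above
estimate  <- 63.193   # slope for Land.size
std_error <- 4.696    # its standard error
df_resid  <- 40       # residual degrees of freedom

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)
print(round(ci, 2))
```

Because the lower bound is well above zero, the data are not compatible with a zero slope at the 5% level, which is the same conclusion the t-test gives.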
In this analysis we fit models with two independent variables each to observe their combined effects. The combinations are:
(garage , main area)
(garage, land size)
(main area, land size)
G_M <- lm(market.value ~ Garage + Main.area , data = property_evaluation)
summary(G_M)
##
## Call:
## lm(formula = market.value ~ Garage + Main.area, data = property_evaluation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144113 -14666 -1513 19255 159044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -224613.34 47782.71 -4.701 3.20e-05 ***
## Garage 35.35 42.66 0.829 0.412
## Main.area 271.93 20.47 13.284 4.67e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49670 on 39 degrees of freedom
## Multiple R-squared: 0.8805, Adjusted R-squared: 0.8743
## F-statistic: 143.6 on 2 and 39 DF, p-value: < 2.2e-16
G_LS <- lm(market.value ~ Garage + Land.size , data = property_evaluation)
summary(G_LS)
##
## Call:
## lm(formula = market.value ~ Garage + Land.size, data = property_evaluation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -159514 -39435 12388 30517 106016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6365.632 41906.985 0.152 0.880
## Garage 76.339 49.617 1.539 0.132
## Land.size 58.515 5.528 10.585 5e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59330 on 39 degrees of freedom
## Multiple R-squared: 0.8295, Adjusted R-squared: 0.8207
## F-statistic: 94.84 on 2 and 39 DF, p-value: 1.049e-15
M_LS <- lm(market.value ~ Main.area + Land.size , data = property_evaluation)
summary(M_LS)
##
## Call:
## lm(formula = market.value ~ Main.area + Land.size, data = property_evaluation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102641 -7149 3280 20474 87902
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.842e+05 3.922e+04 -4.697 3.24e-05 ***
## Main.area 1.813e+02 2.490e+01 7.279 8.89e-09 ***
## Land.size 2.765e+01 5.782e+00 4.781 2.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39790 on 39 degrees of freedom
## Multiple R-squared: 0.9233, Adjusted R-squared: 0.9194
## F-statistic: 234.8 on 2 and 39 DF, p-value: < 2.2e-16
Observation
The model with main area and land size has the highest R-squared (92.33%) and the highest adjusted R-squared (0.9194) of the three, meaning this combination explains about 92% of the variation in market value.
This model also has the lowest residual standard error (39,790) of the three, and both predictors have p-values below the 0.05 threshold, so we are confident that the relationship is significant. In the other two models, garage is not significant once main area or land size is included (p = 0.412 and p = 0.132).
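Adjusted R-squared is the fairer yardstick when models with different numbers of predictors are compared, because it penalizes each added variable. The sketch below applies the standard formula to the R-squared values reported in this report, with n = 42 properties. The function name `adj_r2` is introduced here for illustration.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# r2 = multiple R-squared, n = number of observations, p = number of predictors
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared values copied from the summaries in this report (n = 42 properties)
adj_two_predictors   <- adj_r2(0.9233, n = 42, p = 2)  # Main.area + Land.size
adj_three_predictors <- adj_r2(0.9236, n = 42, p = 3)  # Main.area + Garage + Land.size

print(round(c(adj_two_predictors, adj_three_predictors), 4))
```

Adding garage raises R-squared slightly but lowers adjusted R-squared, which is the numerical signature of a variable that does not earn its place in the model.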
We now conduct a multiple linear regression analysis to determine which of the independent variables are significant predictors of the response variable. All three variables associated with a property are considered as possible predictors of market value. The analysis helps answer questions such as:
Are all variables needed?
Does each independent variable help explain some variation in the response variable after accounting for the effects of the other independent variables in the model?
The full model is represented by this relationship:
\[
y = \beta_{0} +\beta_{1}x_{1} +
\beta_{2}x_{2}+\beta_{3}x_{3}+\varepsilon
\]
where:
\(y = Market.Value\), \(x_{1}=Main.Area\), \(x_{2}=Garage\), \(x_{3}=Land.Size\)
\(\beta_{0}\) = constant (intercept)
\(\beta_{1}, \beta_{2}, \beta_{3}\) = coefficients of \(x_{1}, x_{2}, x_{3}\)
\(\varepsilon\) = error (residual)
F-Test
final_model<-lm(formula = market.value ~ Main.area + Garage + Land.size, data = property_evaluation)
summary(final_model)
##
## Call:
## lm(formula = market.value ~ Main.area + Garage + Land.size, data = property_evaluation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102186 -10360 4950 20790 90266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.834e+05 3.970e+04 -4.620 4.32e-05 ***
## Main.area 1.786e+02 2.609e+01 6.844 4.00e-08 ***
## Garage 1.366e+01 3.486e+01 0.392 0.697
## Land.size 2.734e+01 5.899e+00 4.634 4.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40220 on 38 degrees of freedom
## Multiple R-squared: 0.9236, Adjusted R-squared: 0.9176
## F-statistic: 153.2 on 3 and 38 DF, p-value: < 2.2e-16
Observation
The regression summary shows that the F-statistic for the entire model is 153.2, with a p-value well below the 0.05 threshold.
There is evidence to reject the null hypothesis \((H_{0}: \beta_{1} = \beta_{2} = \beta_{3} = 0)\): the p-value implies that at least one of the independent variables is a significant predictor of the response variable.
Since the overall F-statistic tests the three independent variables jointly, we use the ANOVA table to see the contribution of each independent variable.
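The overall F-statistic can be recovered from R-squared, which shows that it is a joint test of all slopes rather than a summary of individual ones. The sketch below uses the rounded values reported in the full-model summary (R-squared 0.9236, n = 42, three predictors). The function name `f_from_r2` is introduced here for illustration, and the result differs slightly from 153.2 because R-squared is rounded.

```r
# Overall F statistic from R-squared:
# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_from_r2 <- function(r2, n, p) {
  (r2 / p) / ((1 - r2) / (n - p - 1))
}

# Values copied from the summary(final_model) output above
f_overall <- f_from_r2(0.9236, n = 42, p = 3)

# p-value of the joint test H0: beta_1 = beta_2 = beta_3 = 0
p_overall <- pf(f_overall, df1 = 3, df2 = 38, lower.tail = FALSE)

print(f_overall)
print(p_overall)
```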
anova(final_model)
## Analysis of Variance Table
##
## Response: market.value
## Df Sum Sq Mean Sq F value Pr(>F)
## Main.area 1 7.0700e+11 7.0700e+11 436.9529 < 2.2e-16 ***
## Garage 1 1.6937e+09 1.6937e+09 1.0468 0.3127
## Land.size 1 3.4742e+10 3.4742e+10 21.4717 4.139e-05 ***
## Residuals 38 6.1485e+10 1.6180e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation
The F value for Garage (1.05, p = 0.313) shows that it is not a significant predictor of the response variable once main area is in the model. This agrees with the t-test in the regression summary, where its p-value (0.697) is far above the threshold.
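The two p-values for Garage differ because they answer slightly different questions. `anova()` on a single `lm` fit reports sequential tests, so the Garage row tests Garage after Main.area only, while the t-test in `summary()` tests Garage after both other predictors. The sketch below shows that the t-test is equivalent to a partial F-test with one numerator degree of freedom, using the rounded t value from the summary above. Variable names are illustrative.

```r
# Values copied from the summary(final_model) output above
t_garage <- 0.392  # t value for Garage
df_resid <- 38     # residual degrees of freedom

# For a single coefficient, the partial F statistic equals the squared t value
f_partial <- t_garage^2
p_partial <- pf(f_partial, df1 = 1, df2 = df_resid, lower.tail = FALSE)

# The same p-value from the two-sided t-test
p_t <- 2 * pt(-abs(t_garage), df = df_resid)

print(c(f_partial = f_partial, p_partial = p_partial, p_t = p_t))
```

Either way the conclusion is the same: Garage does not add significant explanatory power to a model that already contains main area.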
vif(final_model)
## Main.area Garage Land.size
## 3.745678 1.539787 3.551213
#creating vector of VIF values
vif_values <- vif(final_model)
#creating horizontal bar chart to display each VIF value
barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "blue")
Observation
The variance inflation factors (VIF) are all below 5, so multicollinearity should not be a concern for the model.
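For readers who want to see what the VIF measures, the sketch below computes it from its definition in base R: each predictor is regressed on the others, and VIF = 1 / (1 - R²) of that regression. It runs on synthetic data (not the Lubbock dataset), and the helper `vif_manual` is a hypothetical name introduced here; for an ordinary linear model with numeric predictors it should agree with `car::vif()`.

```r
# Synthetic data (NOT the Lubbock dataset) with correlated predictors
set.seed(7)
n <- 42
Main.area <- rnorm(n, mean = 3000, sd = 500)
Land.size <- 3 * Main.area + rnorm(n, sd = 1000)  # correlated with Main.area
Garage    <- rnorm(n, mean = 800, sd = 150)       # roughly independent
homes <- data.frame(Main.area, Garage, Land.size)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif_by_hand <- vif_manual(homes, c("Main.area", "Garage", "Land.size"))
print(round(vif_by_hand, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; the reported value of about 3.7 for main area means roughly 73% of its variation is shared with garage and land size.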
The evaluation of property assessments along 88th Street in Lubbock, Texas, particularly the 2025 Market Value of 6321 88th Street, uncovers key statistical insights into fairness and accuracy in property tax valuation. Using regression diagnostics and a rigorous analysis of property attributes, the following conclusions are drawn:
Primary Drivers of Market Value
The main area and land size of a property are confirmed as the strongest predictors of market value. The regression model combining these two variables explains 92.33% of the variation in market value (R-squared), supported by:
A high F-statistic of 234.8
The lowest residual standard error of the models compared (39,790)
Statistical significance with p-values well below the 0.05 threshold
These results align with both logical and market expectations (larger properties tend to command higher market values), which supports this model as the most robust and reliable of those considered for valuation purposes.
Impact of Outliers
The analysis identified major outliers: two properties exceeding 4,000 sq. ft. in main area, and properties valued over $800,000. These outliers exert high leverage on the regression model, pushing R-squared values upward and p-values downward, which can create an artificial sense of strength and precision in the model.
Such influence may distort assessments for average properties like 6321 88th Street. The fairness of its valuation therefore depends on whether it aligns with non-outlier properties. If its assessed value departs significantly from the trend of those comparable properties, a reassessment is warranted.
Garage Size: An Insignificant Factor
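Garage size is not a meaningful predictor once main area and land size are taken into account. On its own it is statistically significant but explains only about 34% of the variation in market value (R-squared 0.3395). In every multiple regression its coefficient has a high p-value (0.412, 0.132, and 0.697 in the full model), and its ANOVA F value is not significant. Including it therefore adds statistical noise rather than insight, and this variable should not be prioritized in future assessments.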
Model Validity and Assessment Recommendations
The selected model (main area and land size) demonstrates high explanatory power, low multicollinearity (VIF < 5), and statistical robustness, making it the most appropriate tool for property valuation in this context.
However, to preserve equity and transparency, the following steps are recommended:
Re-examine 6321 88th Street’s valuation by comparing it directly to non-outlier neighbors (6309–6351 88th Street), particularly in main area and land size.
Exclude high-leverage outliers in comparative analyses to avoid distortion in assessment benchmarks.
Use prediction intervals to determine whether the assessed value of 6321 falls within a statistically reasonable range for its attributes.
Adjust the valuation if it falls outside that range, ensuring fairness and preventing over- or under-taxation.
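The prediction-interval check recommended above could look like the sketch below. It runs on synthetic data (not the Lubbock dataset) that reuses the report's column names, and the subject property's attributes and assessed value are hypothetical placeholders, not the actual figures for 6321 88th Street.

```r
# Synthetic data (NOT the Lubbock dataset) with the same column names as the report
set.seed(123)
n <- 42
homes <- data.frame(
  Main.area = runif(n, 2200, 4200),
  Land.size = runif(n, 7500, 14000)
)
homes$market.value <- -184000 + 181 * homes$Main.area +
  28 * homes$Land.size + rnorm(n, sd = 40000)

fit <- lm(market.value ~ Main.area + Land.size, data = homes)

# Hypothetical attributes and assessed value for the property under review
subject  <- data.frame(Main.area = 3000, Land.size = 8500)
assessed <- 600000

# 95% prediction interval for a single new property with these attributes
pi_95 <- predict(fit, newdata = subject, interval = "prediction", level = 0.95)
print(round(pi_95))

# TRUE if the assessed value lies inside the interval
inside <- assessed >= pi_95[1, "lwr"] && assessed <= pi_95[1, "upr"]
print(inside)
```

A prediction interval is used rather than a confidence interval because the question concerns one individual property, not the average value of all properties with those attributes. With the real data, `fit` would be `M_LS` and `subject` the row for property 6321.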
In conclusion, while the data validate the critical role of main area and land size in determining market value, the presence of extreme outliers calls for a cautious, context-sensitive approach. The histograms in the appendix show that garage and land size are not normally distributed. This does not by itself disqualify them as predictors, because the regression assumptions concern the residuals rather than the predictors, but it is consistent with the outliers noted above. The assessment of 6321 88th Street should be considered fair only if its market value aligns with the trend established by comparable, non-outlier properties. Where it does not, a reassessment or adjustment should be made to uphold the principles of equity and transparency in taxation. Setting the regression model aside, property 6321 consistently sits close to other properties with a similar range of attributes in the scatter plots, which supports its current valuation.
library(patchwork)
plot1 <- ggplot(property_evaluation, aes(x = market.value)) +
geom_histogram(binwidth = 50000, fill = "blue", color = "black") +
labs(title = "2025 Market Value", x = "value", y = "Frequency")
plot2 <- ggplot(property_evaluation, aes(x = Main.area)) +
geom_histogram(binwidth = 200, fill = "blue", color = "black") +
labs(title = "Main Area", x = "Area", y = "Frequency")
plot3 <- ggplot(property_evaluation, aes(x = Garage)) +
geom_histogram(binwidth = 100, fill = "blue", color = "black") +
labs(title = "Garage", x = "Area", y = "Frequency")
plot4 <- ggplot(property_evaluation, aes(x = Land.size)) +
geom_histogram(binwidth = 200, fill = "blue", color = "black") +
labs(title = "Land Size", x = "size", y = "Frequency")
# Combine plots side by side
(plot1 + plot2)/(plot3 + plot4)
plot5 <- ggplot(property_evaluation, aes(y = market.value)) +geom_boxplot(fill = "orange") + labs(title = "Box Plot of 2025 Market Value", y = "Value")
plot6 <- ggplot(property_evaluation, aes(y = Main.area)) +geom_boxplot(fill = "orange") + labs(title = "Box Plot of Main Area", y = "Area")
plot7 <- ggplot(property_evaluation, aes(y = Garage)) +geom_boxplot(fill = "orange") + labs(title = "Box Plot of Garage", y = "Area")
plot8 <- ggplot(property_evaluation, aes(y = Land.size)) +geom_boxplot(fill = "orange") + labs(title = "Box Plot of Land Size", y = "size")
(plot5 + plot6)/(plot7 + plot8)
qqnorm(resid(final_model))
qqline(resid(final_model))
The normal Q-Q plot of the residuals deviates from the straight line at both the small and the large quantiles. This S-shape indicates that the most extreme residuals (on the vertical axis) are larger in absolute value than the corresponding theoretical quantiles of the normal distribution, meaning the residual distribution has heavier tails than a normal distribution.
plot(final_model, which = 1)
# Load necessary libraries
library(ggplot2) # For plotting
library(dplyr) # For data manipulation
library(corrplot) # For correlation plot
library(car) # For VIF calculation and regression diagnostics
# Define the URL to fetch dataset
url <- "https://raw.githubusercontent.com/tafadzwabanga/Project-Tax/refs/heads/main/property_evaluation.csv"
# Download the CSV file from the URL and save it locally
download.file(url, destfile = "property_evaluation.csv")
# Read the downloaded dataset into R
property_evaluation <- read.csv("property_evaluation.csv")
# Display the first few rows of the dataset
head(property_evaluation)
# Flag the special property of interest (6321 88th Street) for highlighting in plots
property_evaluation$highlight <- ifelse(property_evaluation$Property..ID == "6321", "House_6321", "Other")
# Plot: Market Value vs Main Area with linear regression line
ggplot(property_evaluation, aes(x = Main.area, y = market.value, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Market value vs Main Area", color = "Property") +
stat_smooth(method = "lm", col = "green")
# Plot: Market Value vs Garage Area
ggplot(property_evaluation, aes(x = Garage, y = market.value, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Market value vs Garage Area", color = "Property") +
stat_smooth(method = "lm", col = "green")
# Plot: Market Value vs Land Size with a vertical marker line
ggplot(property_evaluation, aes(x = Land.size, y = market.value, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Market value vs Land Size", color = "Property") +
stat_smooth(method = "lm", col = "green") +
geom_vline(xintercept = 9000, linetype = "dashed", linewidth = 1.2, color = "blue")
# Plot: Garage vs Land Size
ggplot(property_evaluation, aes(x = Land.size, y = Garage, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Garage vs Land Size", color = "Property")
# Plot: Garage vs Main Area
ggplot(property_evaluation, aes(x = Main.area, y = Garage, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Garage vs Main Area", color = "Property")
# Plot: Land Size vs Main Area with a horizontal marker line
ggplot(property_evaluation, aes(x = Main.area, y = Land.size, color = highlight)) +
geom_point() +
scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
labs(title = "Scatterplot of Land Size vs Main Area", color = "Property") +
stat_smooth(method = "lm", col = "blue") +
geom_hline(yintercept = 7800, linetype = "dashed", linewidth = 1.2, color = "darkgreen")
# Extract only numeric variables for correlation analysis
numeric_data <- property_evaluation %>% select(where(is.numeric))
# Plot correlation matrix of numeric variables
corrplot(cor(numeric_data), method = "number", type = "upper")
# Build and summarize simple linear regression models
model_1 <- lm(market.value ~ Land.size, data = property_evaluation)
summary(model_1)
model_2 <- lm(market.value ~ Garage, data = property_evaluation)
summary(model_2)
model_3 <- lm(market.value ~ Main.area, data = property_evaluation)
summary(model_3)
# Compute 95% confidence intervals for regression coefficients
confint(model_1)
confint(model_2)
confint(model_3)
# Build and summarize multiple regression models with different combinations
G_M <- lm(market.value ~ Garage + Main.area, data = property_evaluation)
summary(G_M)
G_LS <- lm(market.value ~ Garage + Land.size, data = property_evaluation)
summary(G_LS)
M_LS <- lm(market.value ~ Main.area + Land.size, data = property_evaluation)
summary(M_LS)
# Build the full multiple regression model with all three predictors
final_model <- lm(market.value ~ Main.area + Garage + Land.size, data = property_evaluation)
# Summarize the full multiple regression model
summary(final_model)
# Perform ANOVA on the final model
anova(final_model)
# Check multicollinearity using Variance Inflation Factor (VIF)
vif(final_model)
# Create a vector of VIF values
vif_values <- vif(final_model)
# Visualize VIF values using a horizontal bar plot
barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "blue")
library(patchwork) # Load patchwork for multi-plot layouts
# Histograms for each numeric feature
plot1 <- ggplot(property_evaluation, aes(x = market.value)) +
geom_histogram(binwidth = 50000, fill = "blue", color = "black") +
labs(title = "2025 Market Value", x = "value", y = "Frequency")
plot2 <- ggplot(property_evaluation, aes(x = Main.area)) +
geom_histogram(binwidth = 200, fill = "blue", color = "black") +
labs(title = "Main Area", x = "Area", y = "Frequency")
plot3 <- ggplot(property_evaluation, aes(x = Garage)) +
geom_histogram(binwidth = 100, fill = "blue", color = "black") +
labs(title = "Garage", x = "Area", y = "Frequency")
plot4 <- ggplot(property_evaluation, aes(x = Land.size)) +
geom_histogram(binwidth = 200, fill = "blue", color = "black") +
labs(title = "Land Size", x = "size", y = "Frequency")
# Display histograms in a 2x2 grid
(plot1 + plot2)/(plot3 + plot4)
# Boxplots for each numeric feature
plot5 <- ggplot(property_evaluation, aes(y = market.value)) +
geom_boxplot(fill = "orange") +
labs(title = "Box Plot of 2025 Market Value", y = "Value")
plot6 <- ggplot(property_evaluation, aes(y = Main.area)) +
geom_boxplot(fill = "orange") +
labs(title = "Box Plot of Main Area", y = "Area")
plot7 <- ggplot(property_evaluation, aes(y = Garage)) +
geom_boxplot(fill = "orange") +
labs(title = "Box Plot of Garage", y = "Area")
plot8 <- ggplot(property_evaluation, aes(y = Land.size)) +
geom_boxplot(fill = "orange") +
labs(title = "Box Plot of Land Size", y = "Size")
# Display boxplots in a 2x2 grid
(plot5 + plot6)/(plot7 + plot8)
# Q-Q plot for checking normality of residuals in final model
qqnorm(resid(final_model))
qqline(resid(final_model))
# Residuals vs Fitted plot for checking homoscedasticity and model fit
plot(final_model, which = 1)