Multiple Linear Regression

1.0 Introduction

Property tax serves as a critical revenue stream for county operations, calculated as a percentage of a property’s assessed value. This value—determined by combining land and improvement (house) appraisals—relies heavily on subjective adjustments by tax assessors. Homeowners may appeal these assessments if they appear inconsistent with comparable properties.

This report evaluates whether the 2025 Market Value assessment of 6321 88th Street, Lubbock, Texas, aligns fairly with neighboring homes (6309–6351 88th Street). Using regression analysis and statistical diagnostics (e.g., prediction intervals, outlier detection), we objectively determine if the property is over- or under-assessed. The findings are presented for both the county tax assessor and presiding judge, with technical concepts explained in accessible terms.

1.1 Data Collection

Data was manually compiled from lubbockcad.org and uploaded to GitHub. The dataset includes all properties on 88th Street between 6309-6351 with these key variables:

Physical characteristics: Total Main Area (Sq. Ft.), Garage (Sq. Ft.), Land (Sq. Ft.)
Financial metrics: 2025 Market Value

The potential variables used in this analysis are the physical characteristics as the independent variables and the market value as the dependent variable.

1.2 Exploratory Analysis

This section examines the relationship between the dependent and independent variables, to give a basic understanding of how the data are distributed and how the variables relate to each other. Throughout the analysis we also track the position of property 6321 to see whether its assessment is in line with the other properties. Because multicollinearity and overfitting can distort a multiple linear regression, some data preparation is needed before fitting it. Overfitting here means adding too many independent variables, which account for more variance but add nothing useful to the model. The preparation includes:

Correlations
Scatter plots
Simple regressions

1.2.1 Load the dataset

library(ggplot2)
library(dplyr)
library(corrplot)
library(car)

#fetching data from url link
url<- "https://raw.githubusercontent.com/tafadzwabanga/Project-Tax/refs/heads/main/property_evaluation.csv"

#downloading data from url

download.file(url, destfile = "property_evaluation.csv")

#load the datasets

property_evaluation <- read.csv("property_evaluation.csv")

#view data
head(property_evaluation)

##   Property..ID market.value Main.area Garage Land.size Year
## 1         6309       735026      3192   1063     10000 2015
## 2         6310       663907      3226   1078     10463 2017
## 3         6311       569992      3036    965     10000 2013
## 4         6312       602427      3277    909     10625 2015
## 5         6313       460288      2241    506      7785 2018
## 6         6314       968766      4188    985     13788 2021

1.3 Checking Relevancy of variables

1.3.1 Independent Variable with dependent variable

In this section we look for a linear relationship between each independent variable and the dependent variable, which would support the assumption that the independent variable has an effect on the dependent variable. In terms of correlation, a high correlation indicates a strong linear relationship between the two variables. If the relationship does not exist or is very weak, the chosen independent variable probably has little effect on the dependent variable and need not be included in the model.

# Adding a column to flag the special data point (6321 88th Street)
property_evaluation$highlight <- ifelse(property_evaluation$Property..ID == "6321", "House_6321", "Other")

# Now plot and change color based on that flag
ggplot(property_evaluation, aes(x = Main.area, y = market.value, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Market value vs Main Area", color = "Property") +
  stat_smooth(method = "lm", col = "green")

## `geom_smooth()` using formula = 'y ~ x'

Observation

The data are distributed roughly linearly, showing that main area has a positive relationship with market value. This also matches simple logic: market value increases as main area increases
Market value appears highly correlated with main area
There are two outliers with a main area above 4000 sq ft that have high leverage on the predicted values
These two points will inflate the strength of the regression relationship in terms of both statistical significance (a smaller p-value, which raises the chance of declaring a significant relationship) and practical significance (a larger R-squared)
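The leverage claim in the observations above can be checked numerically. The sketch below is illustrative only: it uses the six rows printed by head(property_evaluation) in section 1.2.1 rather than the full 42-property dataset, and the 2p/n cutoff is a common rule of thumb, not something this report specifies. The names sample_rows and diagnostics are hypothetical.

```r
# Six rows printed by head(property_evaluation); a subset, not the full data
sample_rows <- data.frame(
  Property..ID = c(6309, 6310, 6311, 6312, 6313, 6314),
  market.value = c(735026, 663907, 569992, 602427, 460288, 968766),
  Main.area    = c(3192, 3226, 3036, 3277, 2241, 4188)
)

# Simple regression of market value on main area
fit <- lm(market.value ~ Main.area, data = sample_rows)

# Leverage (hat values) and Cook's distance for every property
leverage <- hatvalues(fit)
cooks    <- cooks.distance(fit)

# Rule-of-thumb cutoff (assumption): leverage above 2p/n is "high"
cutoff <- 2 * length(coef(fit)) / nrow(sample_rows)

diagnostics <- data.frame(
  Property..ID  = sample_rows$Property..ID,
  Main.area     = sample_rows$Main.area,
  leverage      = leverage,
  cooks         = cooks,
  high_leverage = leverage > cutoff
)
print(diagnostics)
```

Running the same steps on the full property_evaluation data frame would show whether the two properties above 4000 sq ft are the ones flagged.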

ggplot(property_evaluation, aes(x = Garage, y = market.value, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Market value vs Garage Area", color = "Property") +
  stat_smooth(method = "lm", col = "green")

## `geom_smooth()` using formula = 'y ~ x'

Observation

The data fall into clustered groups, but the overall distribution is still roughly linear
There are major outliers among the properties with a market value greater than $800,000, which are also shown in the appendix
Because these outliers are influential, we can expect them to affect all the statistics, including the p-value, R-squared, coefficients, and intercept

ggplot(property_evaluation, aes(x = Land.size, y = market.value, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Market value vs Land Size", color = "Property") +
  stat_smooth(method = "lm", col = "green")+
   geom_vline(xintercept = 9000, linetype = "dashed", linewidth = 1.2, color = "blue")

## `geom_smooth()` using formula = 'y ~ x'

Observation

Although there is a positive relationship, the 8 properties with a land size above 9000 sq ft appear to have a significant effect on the relationship between the independent variable and the response variable

1.3.2 Independent Variable with Independent Variable

In this part we check for multicollinearity between the independent variables. A good result is a pair of independent variables that do not depend on each other, because this makes it easier to identify which variable has an impact on the dependent variable.

ggplot(property_evaluation, aes(x = Land.size, y = Garage, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Garage vs Land Size", color = "Property")

Observation

There is no discernible linear pattern or linear relationship between the garage and the land size

ggplot(property_evaluation, aes(x = Main.area, y = Garage, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Garage vs Main Area", color = "Property")

Observation

There is no discernible linear pattern or linear relationship between the garage and the main area

ggplot(property_evaluation, aes(x = Main.area, y = Land.size, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Land Size vs Main Area", color = "Property") +
  stat_smooth(method = "lm", col = "blue")+
  geom_hline(yintercept = 7800, linetype = "dashed", linewidth = 1.2, color = "darkgreen")

## `geom_smooth()` using formula = 'y ~ x'

Observation

There is a linear relationship between land size and main area, but some properties within a certain range of main area appear to follow a pattern of their own
Omitting the properties with values above 9000 sq ft completely changes the fitted model from a positive relationship to no relationship, as illustrated by the green dashed line. This is done to observe how property 6321 compares with non-outlier properties of similar main area and land size
Only 8 properties above 9000 sq ft are responsible for affecting the fit for the other 34 properties
To summarize and confirm the observations from the scatter plots, the correlation plot below shows the pairwise relationships between the variables

numeric_data <- property_evaluation %>% select(where(is.numeric))

corrplot(cor(numeric_data), method = "number", type = "upper")
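Because select(where(is.numeric)) keeps every numeric column, the correlation plot above presumably also includes Property..ID and Year, which appear as numbers in the head() output. A minimal sketch of restricting the matrix to the four model variables follows. It uses base R only and the six rows shown by head(property_evaluation), so the printed correlations are illustrative and will differ from those of the full dataset; sample_rows and model_vars are hypothetical names.

```r
# Six rows printed by head(property_evaluation); a subset, not the full data
sample_rows <- data.frame(
  Property..ID = c(6309, 6310, 6311, 6312, 6313, 6314),
  market.value = c(735026, 663907, 569992, 602427, 460288, 968766),
  Main.area    = c(3192, 3226, 3036, 3277, 2241, 4188),
  Garage       = c(1063, 1078, 965, 909, 506, 985),
  Land.size    = c(10000, 10463, 10000, 10625, 7785, 13788),
  Year         = c(2015, 2017, 2013, 2015, 2018, 2021)
)

# Keep only the response and the three candidate predictors
model_vars <- c("market.value", "Main.area", "Garage", "Land.size")

# Pairwise Pearson correlations among the model variables
cor_matrix <- cor(sample_rows[, model_vars])
print(round(cor_matrix, 2))
```

The resulting matrix can be passed to corrplot() in the same way as cor(numeric_data).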

1.4 Regression Analysis

1.4.1 Single Independent Variable

This section evaluates a simple regression model for each pairing of an independent variable with the dependent variable.

Market Value and Land Size

model_1 <- lm(market.value ~ Land.size, data = property_evaluation) 
summary(model_1)

## 
## Call:
## lm(formula = market.value ~ Land.size, data = property_evaluation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -170391  -41030    5992   41510   94237 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18715.415  41828.130   0.447    0.657    
## Land.size      63.193      4.696  13.458   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60330 on 40 degrees of freedom
## Multiple R-squared:  0.8191, Adjusted R-squared:  0.8146 
## F-statistic: 181.1 on 1 and 40 DF,  p-value: < 2.2e-16

Market Value and Garage

model_2 <- lm(market.value ~ Garage, data = property_evaluation) 
summary(model_2)

## 
## Call:
## lm(formula = market.value ~ Garage, data = property_evaluation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -188200  -55712    -788   31218  514392 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 314139.75   58640.49   5.357 3.77e-06 ***
## Garage         365.15      80.53   4.535 5.15e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115300 on 40 degrees of freedom
## Multiple R-squared:  0.3395, Adjusted R-squared:  0.323 
## F-statistic: 20.56 on 1 and 40 DF,  p-value: 5.148e-05

Market Value and Main Area

model_3 <- lm(market.value ~ Main.area, data = property_evaluation) 
summary(model_3)

## 
## Call:
## lm(formula = market.value ~ Main.area, data = property_evaluation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -156977  -16705   -3260   22014  154869 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -227945.26   47426.18  -4.806 2.19e-05 ***
## Main.area       281.80      16.58  16.994  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49480 on 40 degrees of freedom
## Multiple R-squared:  0.8783, Adjusted R-squared:  0.8753 
## F-statistic: 288.8 on 1 and 40 DF,  p-value: < 2.2e-16

Summary on the paired models

All the p-values for the slope coefficients are below the 0.05 threshold, which indicates that the relationship between each independent variable and the dependent variable is highly significant.
The market value and garage model has a very high residual standard error (115,300) and a low R-squared (0.3395) and adjusted R-squared (0.323), so this pairing is unlikely to be the best model
The market value and main area model has the highest F statistic (288.8), R-squared (0.8783) and adjusted R-squared (0.8753). It also has the lowest residual standard error (49,480), making it the best of the three simple models.
This comparison is pivotal for deciding whether a multiple linear regression is the best model for the dependent variable, and in this case for reaching a firm conclusion on whether the valuation of property 6321 is justified

confint(model_1)

##                    2.5 %       97.5 %
## (Intercept) -65822.38981 103253.21936
## Land.size       53.70278     72.68293

confint(model_2)

##                   2.5 %      97.5 %
## (Intercept) 195622.9121 432656.5975
## Garage         202.4003    527.8973

confint(model_3)

##                    2.5 %       97.5 %
## (Intercept) -323797.1536 -132093.3672
## Main.area       248.2893     315.3172

Observation
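Since the confidence intervals for all three slope coefficients lie entirely above zero, we can reject the hypothesis that any of these slopes is zero, and we conclude that land size, garage and main area each help explain the variation in market value when considered on their own. The intercept intervals are a separate matter: the interval for model_1 includes zero, and the interval for model_3 is entirely negative.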
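As a check on how confint() arrives at these limits, the interval for a coefficient is the estimate plus or minus a t critical value times the standard error. The sketch below reproduces the Land.size interval from the rounded values printed in the model_1 summary (estimate 63.193, standard error 4.696, 40 residual degrees of freedom); because the inputs are rounded, the result agrees with the confint() output only to about two decimal places.

```r
# Values copied from the model_1 summary output (rounded as printed)
estimate  <- 63.193   # Land.size slope
std_error <- 4.696    # its standard error
df_resid  <- 40       # residual degrees of freedom

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

# Confidence interval: estimate -/+ t_crit * standard error
ci <- estimate + c(-1, 1) * t_crit * std_error
print(ci)
```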

1.4.2 Two Independent Variables

In this analysis we fit models with two independent variables to observe their combined effects. The combinations are as follows:

(garage , main area)
(garage, land size)
(main area, land size)

G_M <- lm(market.value ~ Garage + Main.area , data = property_evaluation) 
summary(G_M)

## 
## Call:
## lm(formula = market.value ~ Garage + Main.area, data = property_evaluation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -144113  -14666   -1513   19255  159044 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -224613.34   47782.71  -4.701 3.20e-05 ***
## Garage           35.35      42.66   0.829    0.412    
## Main.area       271.93      20.47  13.284 4.67e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49670 on 39 degrees of freedom
## Multiple R-squared:  0.8805, Adjusted R-squared:  0.8743 
## F-statistic: 143.6 on 2 and 39 DF,  p-value: < 2.2e-16

G_LS  <- lm(market.value ~ Garage + Land.size , data = property_evaluation) 
summary(G_LS)

## 
## Call:
## lm(formula = market.value ~ Garage + Land.size, data = property_evaluation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -159514  -39435   12388   30517  106016 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6365.632  41906.985   0.152    0.880    
## Garage         76.339     49.617   1.539    0.132    
## Land.size      58.515      5.528  10.585    5e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59330 on 39 degrees of freedom
## Multiple R-squared:  0.8295, Adjusted R-squared:  0.8207 
## F-statistic: 94.84 on 2 and 39 DF,  p-value: 1.049e-15

M_LS  <- lm(market.value ~ Main.area + Land.size , data = property_evaluation) 
summary(M_LS)

## 
## Call:
## lm(formula = market.value ~ Main.area + Land.size, data = property_evaluation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -102641   -7149    3280   20474   87902 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.842e+05  3.922e+04  -4.697 3.24e-05 ***
## Main.area    1.813e+02  2.490e+01   7.279 8.89e-09 ***
## Land.size    2.765e+01  5.782e+00   4.781 2.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39790 on 39 degrees of freedom
## Multiple R-squared:  0.9233, Adjusted R-squared:  0.9194 
## F-statistic: 234.8 on 2 and 39 DF,  p-value: < 2.2e-16

Observation

Of the three models, the one with main area and land size has the highest R-squared (92.33%) and the highest adjusted R-squared (91.94%), which means this combination explains about 92% of the variation in market value
This model also has the lowest residual standard error (39,790) of the three, and both coefficients have p-values below the 0.05 threshold, so we are confident that the relationship is significant
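The adjusted R-squared values quoted above follow from R-squared, the number of observations and the number of predictors. The sketch below recomputes them from the R-squared values printed in the three summaries, assuming n = 42 (39 residual degrees of freedom plus 3 estimated parameters). The helper adjusted_r2 is a hypothetical name, and small differences in the fourth decimal place come from using rounded R-squared values.

```r
# Adjusted R-squared: penalizes R-squared for the number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 42  # 39 residual df + 3 estimated parameters

# R-squared values as printed in the summaries of the two-predictor models
reported_r2 <- c(G_M = 0.8805, G_LS = 0.8295, M_LS = 0.9233)

adj <- adjusted_r2(reported_r2, n = n_obs, p = 2)
print(round(adj, 4))
```

With the same number of predictors in each model the ranking is the same as for plain R-squared; the penalty only matters when models of different size are compared.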

1.5 Final Model

We now conduct a multiple linear regression analysis to determine which of the independent variables are significant predictors of the response variable. All three variables associated with a property are considered as possible predictors of the market value. The analysis will help answer questions that include:

Are all variables needed?
Does each independent variable help explain some variation in the response variable after accounting for the effects of the other independent variables in the model?

The full model is represented by this relationship
\[ y = \beta_{0} +\beta_{1}x_{1} + \beta_{2}x_{2}+\beta_{3}x_{3}+\varepsilon \]

where :

$y = Market.Value$, $x_{1}=Main.Area$, $x_{2}=Garage$, $x_{3}=Land.Size$

$\beta_{0} = Constant$

$\beta_{1}, \beta_{2}, \beta_{3} = Coefficients$

$\varepsilon$ = $error (residual)$

F-Test

final_model<-lm(formula = market.value ~ Main.area + Garage + Land.size, data = property_evaluation)
summary(final_model)

## 
## Call:
## lm(formula = market.value ~ Main.area + Garage + Land.size, data = property_evaluation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -102186  -10360    4950   20790   90266 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.834e+05  3.970e+04  -4.620 4.32e-05 ***
## Main.area    1.786e+02  2.609e+01   6.844 4.00e-08 ***
## Garage       1.366e+01  3.486e+01   0.392    0.697    
## Land.size    2.734e+01  5.899e+00   4.634 4.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40220 on 38 degrees of freedom
## Multiple R-squared:  0.9236, Adjusted R-squared:  0.9176 
## F-statistic: 153.2 on 3 and 38 DF,  p-value: < 2.2e-16
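The coefficient table above can be read as a fitted equation. As a worked illustration, the sketch below plugs the attributes of property 6309 (taken from the head() output) into the rounded coefficients as printed. The helper predict_value is a hypothetical name, and because the coefficients are rounded to four significant figures the result only approximates what predict(final_model) would return.

```r
# Coefficients as printed in summary(final_model), rounded to 4 significant figures
coefs <- c(intercept = -1.834e5, Main.area = 178.6,
           Garage = 13.66, Land.size = 27.34)

# Fitted value: b0 + b1 * main area + b2 * garage + b3 * land size
predict_value <- function(main_area, garage, land_size) {
  unname(coefs["intercept"] +
           coefs["Main.area"] * main_area +
           coefs["Garage"]    * garage +
           coefs["Land.size"] * land_size)
}

# Property 6309: Main.area 3192, Garage 1063, Land.size 10000, assessed at 735026
fitted_6309   <- predict_value(3192, 1063, 10000)
residual_6309 <- 735026 - fitted_6309   # assessed minus fitted

print(c(fitted = fitted_6309, residual = residual_6309))
```

A positive residual means the assessed value lies above what the model predicts for a property with those attributes.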

Observation

The regression summary shows that the F statistic for the entire model is 153.2, with a p-value below the 0.05 threshold

There is evidence to reject the null hypothesis $(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} = 0)$, as the p-value implies that at least one of the independent variables is a significant predictor of the response variable

Since the F statistic tests the three independent variables jointly, we use the ANOVA table to break the result down and see the effect of each independent variable.

anova(final_model)

## Analysis of Variance Table
## 
## Response: market.value
##           Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Main.area  1 7.0700e+11 7.0700e+11 436.9529 < 2.2e-16 ***
## Garage     1 1.6937e+09 1.6937e+09   1.0468    0.3127    
## Land.size  1 3.4742e+10 3.4742e+10  21.4717 4.139e-05 ***
## Residuals 38 6.1485e+10 1.6180e+09                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation

The F value for the predictor Garage (1.05, p = 0.3127) shows that it is not a significant predictor of the response variable once main area is in the model. The same conclusion follows from the t-test in the regression summary, where its p-value (0.697) is well above the threshold
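The table from anova(final_model) is sequential, so each row depends on the order in which the predictors were entered. One way to test Garage after both other predictors is a nested-model comparison, sketched below. This is an illustration on the six rows printed by head(property_evaluation), not a result for the full dataset; sample_rows, reduced and full are hypothetical names.

```r
# Six rows printed by head(property_evaluation); a subset, not the full data
sample_rows <- data.frame(
  market.value = c(735026, 663907, 569992, 602427, 460288, 968766),
  Main.area    = c(3192, 3226, 3036, 3277, 2241, 4188),
  Garage       = c(1063, 1078, 965, 909, 506, 985),
  Land.size    = c(10000, 10463, 10000, 10625, 7785, 13788)
)

# Model without Garage, and the same model with Garage added
reduced <- lm(market.value ~ Main.area + Land.size, data = sample_rows)
full    <- lm(market.value ~ Main.area + Land.size + Garage, data = sample_rows)

# Partial F-test: does Garage reduce the residual sum of squares significantly?
comparison <- anova(reduced, full)
print(comparison)

# For a single added predictor, the partial F equals the squared t statistic
partial_F <- comparison$F[2]
garage_t  <- coef(summary(full))["Garage", "t value"]
```

On the full data the same comparison would be anova(M_LS, final_model), whose p-value should match the Garage row of summary(final_model).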

vif(final_model)

## Main.area    Garage Land.size 
##  3.745678  1.539787  3.551213

#creating vector of VIF values
vif_values <- vif(final_model)

#creating horizontal bar chart to display each VIF value
barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "blue")

Observation

The variance inflation factors (VIF) are all below 5, so multicollinearity should not be a concern for the model.
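To make the VIF numbers less of a black box: the VIF of a predictor is 1 / (1 - R-squared), where R-squared comes from regressing that predictor on the other predictors. A reported VIF of 3.75 for Main.area therefore implies that about 73% of its variation is explained by Garage and Land.size. The sketch below computes VIFs this way in base R on the six rows printed by head(property_evaluation); manual_vif is a hypothetical helper, and the values it prints will not match the full-data output of car::vif().

```r
# Six rows printed by head(property_evaluation); a subset, not the full data
sample_rows <- data.frame(
  Main.area = c(3192, 3226, 3036, 3277, 2241, 4188),
  Garage    = c(1063, 1078, 965, 909, 506, 985),
  Land.size = c(10000, 10463, 10000, 10625, 7785, 13788)
)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
manual_vif <- function(data, predictors) {
  sapply(predictors, function(target) {
    others <- setdiff(predictors, target)
    aux    <- lm(reformulate(others, response = target), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vif_sample <- manual_vif(sample_rows, c("Main.area", "Garage", "Land.size"))
print(round(vif_sample, 2))
```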

1.6 Conclusion

The evaluation of property assessments along 88th Street in Lubbock, Texas, particularly the 2025 Market Value of 6321 88th Street, uncovers key statistical insights into fairness and accuracy in property tax valuation. Based on regression diagnostics and an analysis of property attributes, the following conclusions are drawn:

Primary Drivers of Market Value

The main area and land size of a property are confirmed as the strongest predictors of market value. The regression model combining these two variables explains 92.33% of the variation in market value (R-squared), supported by:

A high F-statistic of 234.8
The lowest residual standard error of the models compared (39,790)
Statistical significance with p-values well below the 0.05 threshold

These results align with both logical and market expectations (larger properties tend to command higher market values), which supports this model as the most robust and reliable for valuation purposes.

Impact of Outliers

The analysis identified major outliers: two properties exceeding 4,000 sq. ft. in main area, and the properties valued over $800,000. These outliers exert high leverage on the regression model, pushing R-squared values upward and p-values downward, which can create an artificial sense of strength and precision in the model.

Such influence may distort assessments for average properties like 6321 88th Street. Therefore, the fairness of its valuation depends on whether it aligns with non-outlier properties. If its characteristics deviate significantly due to proximity to these high-leverage points, a reassessment is warranted.

Garage Size: An Insignificant Factor

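Garage size adds little once the other attributes are accounted for. On its own it is significantly related to market value (p = 5.15e-05), but with a low R-squared of 0.34. When main area or land size is also in the model, its coefficient is no longer significant (p = 0.412, 0.132 and 0.697 in the models above), so including it introduces statistical noise rather than insight. This variable should not be prioritized in future assessments as it adds minimal explanatory power.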

Model Validity and Assessment Recommendations

The selected model (main area and land size) demonstrates high accuracy, low multicollinearity (VIF < 5 in the full model), and statistical robustness, making it the most appropriate tool for property valuation in this context.

However, to preserve equity and transparency, the following steps are recommended:

Re-examine 6321 88th Street’s valuation by comparing it directly to non-outlier neighbors (6309–6351 88th Street), particularly in main area and land size.
Exclude high-leverage outliers in comparative analyses to avoid distortion in assessment benchmarks.
Use prediction intervals to determine whether the assessed value of 6321 falls within a statistically reasonable range for its attributes.
Adjust the valuation if it falls outside that range, ensuring fairness and preventing over- or under-taxation.
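The prediction-interval step recommended above could be sketched as follows. This is a hedged illustration, not the report's own computation: it uses the six rows printed by head(property_evaluation), which do not include 6321, so property 6311 stands in as the target. Fitting the model on the other properties and leaving the target out is a design choice assumed here, and check_assessment is a hypothetical helper.

```r
# Six rows printed by head(property_evaluation); a subset, not the full data
sample_rows <- data.frame(
  Property..ID = c(6309, 6310, 6311, 6312, 6313, 6314),
  market.value = c(735026, 663907, 569992, 602427, 460288, 968766),
  Main.area    = c(3192, 3226, 3036, 3277, 2241, 4188),
  Land.size    = c(10000, 10463, 10000, 10625, 7785, 13788)
)

# Fit on all other properties, then ask whether the target's assessed value
# falls inside the prediction interval for a property with its attributes
check_assessment <- function(data, target_id, level = 0.95) {
  target <- data[data$Property..ID == target_id, ]
  others <- data[data$Property..ID != target_id, ]
  fit    <- lm(market.value ~ Main.area + Land.size, data = others)
  pred   <- predict(fit, newdata = target, interval = "prediction", level = level)
  data.frame(
    Property..ID    = target_id,
    assessed        = target$market.value,
    fit             = unname(pred[1, "fit"]),
    lwr             = unname(pred[1, "lwr"]),
    upr             = unname(pred[1, "upr"]),
    within_interval = target$market.value >= pred[1, "lwr"] &
                      target$market.value <= pred[1, "upr"]
  )
}

result <- check_assessment(sample_rows, 6311)
print(result)
```

With the full dataset the call would be check_assessment(property_evaluation, 6321); an assessed value outside the interval would support a reassessment. With only five fitting rows the interval here is very wide and should not be interpreted.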

In conclusion, while the data validate the critical role of main area and land size in determining market value, the presence of extreme outliers calls for a cautious, context-sensitive approach. The histograms in the appendix show that garage and land size do not follow a normal distribution. Linear regression assumes normally distributed residuals rather than normally distributed predictors, so this is a reason for caution rather than grounds for excluding these variables. The assessment of 6321 88th Street should only be considered fair if its market value aligns with the trend established by comparable, non-outlier properties. Based on these observations, a reassessment or adjustment should be considered to uphold the principles of equity and transparency in taxation. Setting the regression model aside, however, property 6321 consistently plots close to other properties with a similar range of attributes, which can justify its valuation.

Appendix

Histogram Plots

library(patchwork)


plot1 <- ggplot(property_evaluation, aes(x = market.value)) +
  geom_histogram(binwidth = 50000, fill = "blue", color = "black") +
  labs(title = "2025 Market Value", x = "value", y = "Frequency")


plot2 <- ggplot(property_evaluation, aes(x = Main.area)) +
  geom_histogram(binwidth = 200, fill = "blue", color = "black") +
  labs(title = "Main Area", x = "Area", y = "Frequency")


plot3 <- ggplot(property_evaluation, aes(x = Garage)) +
  geom_histogram(binwidth = 100, fill = "blue", color = "black") +
  labs(title = "Garage", x = "Area", y = "Frequency")


plot4 <- ggplot(property_evaluation, aes(x = Land.size)) +
  geom_histogram(binwidth = 200, fill = "blue", color = "black") +
  labs(title = "Land Size", x = "size", y = "Frequency")


# Combine plots side by side
(plot1 + plot2)/(plot3 + plot4)

Box Plots

plot5 <- ggplot(property_evaluation, aes(y = market.value)) +geom_boxplot(fill = "orange") + labs(title = "Box Plot of 2025 Market Value", y = "Value")

plot6 <- ggplot(property_evaluation, aes(y = Main.area)) +geom_boxplot(fill = "orange") + labs(title = "Box Plot of Main Area", y = "Area")

plot7 <- ggplot(property_evaluation, aes(y = Garage)) +geom_boxplot(fill = "orange") + labs(title = "Box Plot of Garage", y = "Area")

plot8 <- ggplot(property_evaluation, aes(y = Land.size)) +geom_boxplot(fill = "orange") + labs(title = "Box Plot of Land Size", y = "size")

(plot5 + plot6)/(plot7 + plot8)

Normality of Residuals

qqnorm(resid(final_model)) 
qqline(resid(final_model))

The normal Q-Q plot deviates from the straight line at both the small and the large quantiles of the normal distribution. This S-shape tells us that the extremely small and extremely large empirical quantiles (on the vertical axis) are larger in absolute value than the corresponding theoretical quantiles of the normal distribution, meaning the residuals have heavier tails than a normal distribution.
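A visual reading of the Q-Q plot can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test to regression residuals. The choice of test is an assumption, since the report itself relies on the plot alone. The fit uses the six rows printed by head(property_evaluation), so the p-value is illustrative only; sample_rows and sw are hypothetical names.

```r
# Six rows printed by head(property_evaluation); a subset, not the full data
sample_rows <- data.frame(
  market.value = c(735026, 663907, 569992, 602427, 460288, 968766),
  Main.area    = c(3192, 3226, 3036, 3277, 2241, 4188),
  Garage       = c(1063, 1078, 965, 909, 506, 985),
  Land.size    = c(10000, 10463, 10000, 10625, 7785, 13788)
)

# Same formula as final_model, fitted on the subset
fit <- lm(market.value ~ Main.area + Garage + Land.size, data = sample_rows)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(resid(fit))
print(sw)
```

On the full data the equivalent call is shapiro.test(resid(final_model)).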

Residuals vs Fitted Plot

plot(final_model, which = 1)

Code

# Load necessary libraries
library(ggplot2)    # For plotting
library(dplyr)      # For data manipulation
library(corrplot)   # For correlation plot
library(car)        # For VIF calculation and regression diagnostics



# Define the URL to fetch dataset
url <- "https://raw.githubusercontent.com/tafadzwabanga/Project-Tax/refs/heads/main/property_evaluation.csv"

# Download the CSV file from the URL and save it locally
download.file(url, destfile = "property_evaluation.csv")

# Read the downloaded dataset into R
property_evaluation <- read.csv("property_evaluation.csv")

# Display the first few rows of the dataset
head(property_evaluation)

# Flag the special property of interest (6321 88th Street) for highlighting in plots
property_evaluation$highlight <- ifelse(property_evaluation$Property..ID == "6321", "House_6321", "Other")

# Plot: Market Value vs Main Area with linear regression line
ggplot(property_evaluation, aes(x = Main.area, y = market.value, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Market value vs Main Area", color = "Property") +
  stat_smooth(method = "lm", col = "green")

# Plot: Market Value vs Garage Area
ggplot(property_evaluation, aes(x = Garage, y = market.value, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Market value vs Garage Area", color = "Property") +
  stat_smooth(method = "lm", col = "green")

# Plot: Market Value vs Land Size with a vertical marker line
ggplot(property_evaluation, aes(x = Land.size, y = market.value, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Market value vs Land Size", color = "Property") +
  stat_smooth(method = "lm", col = "green") +
  geom_vline(xintercept = 9000, linetype = "dashed", linewidth = 1.2, color = "blue")

# Plot: Garage vs Land Size
ggplot(property_evaluation, aes(x = Land.size, y = Garage, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Garage vs Land Size", color = "Property")

# Plot: Garage vs Main Area
ggplot(property_evaluation, aes(x = Main.area, y = Garage, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Garage vs Main Area", color = "Property")

# Plot: Land Size vs Main Area with a horizontal marker line
ggplot(property_evaluation, aes(x = Main.area, y = Land.size, color = highlight)) +
  geom_point() +
  scale_color_manual(values = c("House_6321" = "red", "Other" = "black")) +
  labs(title = "Scatterplot of Land Size vs Main Area", color = "Property") +
  stat_smooth(method = "lm", col = "blue") +
  geom_hline(yintercept = 7800, linetype = "dashed", linewidth = 1.2, color = "darkgreen")

# Extract only numeric variables for correlation analysis
numeric_data <- property_evaluation %>% select(where(is.numeric))

# Plot correlation matrix of numeric variables
corrplot(cor(numeric_data), method = "number", type = "upper")

# Build and summarize simple linear regression models
model_1 <- lm(market.value ~ Land.size, data = property_evaluation) 
summary(model_1)

model_2 <- lm(market.value ~ Garage, data = property_evaluation) 
summary(model_2)

model_3 <- lm(market.value ~ Main.area, data = property_evaluation) 
summary(model_3)

# Compute 95% confidence intervals for regression coefficients
confint(model_1)
confint(model_2)
confint(model_3)

# Build and summarize multiple regression models with different combinations
G_M <- lm(market.value ~ Garage + Main.area, data = property_evaluation) 
summary(G_M)

G_LS <- lm(market.value ~ Garage + Land.size, data = property_evaluation) 
summary(G_LS)

M_LS <- lm(market.value ~ Main.area + Land.size, data = property_evaluation) 
summary(M_LS)

final_model<-lm(formula = market.value ~ Main.area + Garage + Land.size, data = property_evaluation)

# Summarize the final multiple regression model defined above
summary(final_model)

# Perform ANOVA on the final model
anova(final_model)

# Check multicollinearity using Variance Inflation Factor (VIF)
vif(final_model)

# Create a vector of VIF values
vif_values <- vif(final_model)

# Visualize VIF values using a horizontal bar plot
barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "blue")

library(patchwork)  # Load patchwork for multi-plot layouts


# Histograms for each numeric feature
plot1 <- ggplot(property_evaluation, aes(x = market.value)) +
  geom_histogram(binwidth = 50000, fill = "blue", color = "black") +
  labs(title = "2025 Market Value", x = "value", y = "Frequency")

plot2 <- ggplot(property_evaluation, aes(x = Main.area)) +
  geom_histogram(binwidth = 200, fill = "blue", color = "black") +
  labs(title = "Main Area", x = "Area", y = "Frequency")

plot3 <- ggplot(property_evaluation, aes(x = Garage)) +
  geom_histogram(binwidth = 100, fill = "blue", color = "black") +
  labs(title = "Garage", x = "Area", y = "Frequency")

plot4 <- ggplot(property_evaluation, aes(x = Land.size)) +
  geom_histogram(binwidth = 200, fill = "blue", color = "black") +
  labs(title = "Land Size", x = "size", y = "Frequency")

# Display histograms in a 2x2 grid
(plot1 + plot2)/(plot3 + plot4)

# Boxplots for each numeric feature
plot5 <- ggplot(property_evaluation, aes(y = market.value)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Box Plot of 2025 Market Value", y = "Value")

plot6 <- ggplot(property_evaluation, aes(y = Main.area)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Box Plot of Main Area", y = "Area")

plot7 <- ggplot(property_evaluation, aes(y = Garage)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Box Plot of Garage", y = "Area")

plot8 <- ggplot(property_evaluation, aes(y = Land.size)) +
  geom_boxplot(fill = "orange") +
  labs(title = "Box Plot of Land Size", y = "Size")

# Display boxplots in a 2x2 grid
(plot5 + plot6)/(plot7 + plot8)

# Q-Q plot for checking normality of residuals in final model
qqnorm(resid(final_model)) 
qqline(resid(final_model))

# Residuals vs Fitted plot for checking homoscedasticity and model fit
plot(final_model, which = 1)

Multiple Linear Regression

Property Evaluation

Tafadzwa Banga

2025-04-15
