To assess the property tax, we first load the data from the Lubbock Central Appraisal District.
dat <- read.csv("https://raw.githubusercontent.com/Hola-Xu/SA-Project-2/refs/heads/main/project2%20dataset.csv")
head(dat)
## No X2025.Market.Value Total.Improvement.Market.Value
## 1 6309 735026 677026
## 2 6310 663907 603222
## 3 6311 569992 511992
## 4 6312 602427 538677
## 5 6313 460288 415135
## 6 6314 968766 888796
## Total.Land.Market.Value Homestead.Cap.Loss Total.Main.Area..Sq..Ft..
## 1 58000 0 3462
## 2 60685 0 3226
## 3 58000 0 3036
## 4 63750 0 3277
## 5 45153 0 2241
## 6 79970 0 4188
## Main.Area..Sq..Ft.. Main.Area..Value. Garage..Sq..Ft.. Garage..Value.
## 1 3192 558004 1063 83622
## 2 3226 N/A 672 N/A
## 3 3036 447924 965 64068
## 4 2877 432168 1309 106509
## 5 2241 376845 506 38290
## 6 2041 367600 985 79833
## Land..Sq..Ft.. X
## 1 10000 NA
## 2 10463 NA
## 3 10000 NA
## 4 10625 NA
## 5 7785 NA
## 6 13788 NA
str(dat)
## 'data.frame': 42 obs. of 12 variables:
## $ No : num 6309 6310 6311 6312 6313 ...
## $ X2025.Market.Value : num 735026 663907 569992 602427 460288 ...
## $ Total.Improvement.Market.Value: num 677026 603222 511992 538677 415135 ...
## $ Total.Land.Market.Value : num 58000 60685 58000 63750 45153 ...
## $ Homestead.Cap.Loss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Total.Main.Area..Sq..Ft.. : chr "3462 " "3226 " "3036 " "3277 " ...
## $ Main.Area..Sq..Ft.. : num 3192 3226 3036 2877 2241 ...
## $ Main.Area..Value. : chr "558004 " "N/A" "447924 " "432168 " ...
## $ Garage..Sq..Ft.. : num 1063 672 965 1309 506 ...
## $ Garage..Value. : chr "83622 " "N/A" "64068 " "106509 " ...
## $ Land..Sq..Ft.. : num 10000 10463 10000 10625 7785 ...
## $ X : logi NA NA NA NA NA NA ...
summary(dat)
## No X2025.Market.Value Total.Improvement.Market.Value
## Min. :6309 Min. : 418286 Min. : 373092
## 1st Qu.:6319 1st Qu.: 506390 1st Qu.: 460406
## Median :6330 Median : 536431 Median : 490236
## Mean :6330 Mean : 570191 Mean : 519710
## 3rd Qu.:6340 3rd Qu.: 580772 3rd Qu.: 535879
## Max. :6350 Max. :1218146 Max. :1116617
## Total.Land.Market.Value Homestead.Cap.Loss Total.Main.Area..Sq..Ft..
## Min. : 43506 Min. :0 Length:42
## 1st Qu.: 44986 1st Qu.:0 Class :character
## Median : 45658 Median :0 Mode :character
## Mean : 50480 Mean :0
## 3rd Qu.: 47379 3rd Qu.:0
## Max. :101529 Max. :0
## Main.Area..Sq..Ft.. Main.Area..Value. Garage..Sq..Ft.. Garage..Value.
## Min. :2041 Length:42 Min. : 479.0 Length:42
## 1st Qu.:2597 Class :character 1st Qu.: 528.0 Class :character
## Median :2752 Mode :character Median : 551.0 Mode :character
## Mean :2754 Mean : 694.9
## 3rd Qu.:2921 3rd Qu.: 917.8
## Max. :3532 Max. :1309.0
## Land..Sq..Ft.. X
## Min. : 7501 Mode:logical
## 1st Qu.: 7756 NA's:42
## Median : 7872
## Mean : 8695
## 3rd Qu.: 8169
## Max. :17505
Since the data provided to us is not in the most usable form for regression, we first perform some data preprocessing. Here, we convert the character columns to numeric variables.
After converting all the variables to the appropriate form, we decide which variables will serve as predictors. We set "X2025.Market.Value" as the response variable ($$y$$) and all other variables except "No" (the record number) and the empty "X" column as predictor variables ($$x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9$$). We then bind these variables into a data frame (dat_clean).
clean_numeric <- function(x) {
  x <- gsub("[$,]", "", x)                            # strip dollar signs and comma separators
  x[x %in% c("", "NA", "N/A", "Not Available")] <- NA # standardize missing-value codes
  as.numeric(x)                                       # coerce to numeric
}
dat$X2025.Market.Value <- clean_numeric(dat$X2025.Market.Value)
dat$Total.Improvement.Market.Value <- clean_numeric(dat$Total.Improvement.Market.Value)
dat$Total.Land.Market.Value <- clean_numeric(dat$Total.Land.Market.Value)
dat$Total.Main.Area..Sq..Ft.. <- clean_numeric(dat$Total.Main.Area..Sq..Ft..)
dat$Main.Area..Sq..Ft.. <- clean_numeric(dat$Main.Area..Sq..Ft..)
dat$Main.Area..Value. <- clean_numeric(dat$Main.Area..Value.)
dat$Garage..Sq..Ft.. <- clean_numeric(dat$Garage..Sq..Ft..)
dat$Garage..Value. <- clean_numeric(dat$Garage..Value.)
dat$Land..Sq..Ft.. <- clean_numeric(dat$Land..Sq..Ft..)
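The nine assignments above could also be written more compactly by looping over the affected columns; a minimal sketch, assuming the same column names as in dat:
# Compact alternative: clean all value/area columns in one pass
num_cols <- c("X2025.Market.Value", "Total.Improvement.Market.Value",
              "Total.Land.Market.Value", "Total.Main.Area..Sq..Ft..",
              "Main.Area..Sq..Ft..", "Main.Area..Value.",
              "Garage..Sq..Ft..", "Garage..Value.", "Land..Sq..Ft..")
dat[num_cols] <- lapply(dat[num_cols], clean_numeric)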
str(dat)
## 'data.frame': 42 obs. of 12 variables:
## $ No : num 6309 6310 6311 6312 6313 ...
## $ X2025.Market.Value : num 735026 663907 569992 602427 460288 ...
## $ Total.Improvement.Market.Value: num 677026 603222 511992 538677 415135 ...
## $ Total.Land.Market.Value : num 58000 60685 58000 63750 45153 ...
## $ Homestead.Cap.Loss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Total.Main.Area..Sq..Ft.. : num 3462 3226 3036 3277 2241 ...
## $ Main.Area..Sq..Ft.. : num 3192 3226 3036 2877 2241 ...
## $ Main.Area..Value. : num 558004 NA 447924 432168 376845 ...
## $ Garage..Sq..Ft.. : num 1063 672 965 1309 506 ...
## $ Garage..Value. : num 83622 NA 64068 106509 38290 ...
## $ Land..Sq..Ft.. : num 10000 10463 10000 10625 7785 ...
## $ X : logi NA NA NA NA NA NA ...
# Split the data into the response y (X2025.Market.Value) and predictors x1-x9
x <- dat[, 3:11]
dat_clean <- cbind(dat$X2025.Market.Value, x)
colnames(dat_clean) <- c("y","x1","x2","x3","x4","x5","x6","x7","x8","x9")
Having converted the data and gathered the variables into a data frame, we are ready to build our basic model.
In this part, we assess the distribution of the response variable and investigate whether there are relationships among the predictors; where such relationships exist, we try to characterize them.
First, we analyse the distribution of the 2025 market value.
# Histogram of the response variable
hist(dat_clean$y,
     main = "Histogram of 2025 Market Value",
     xlab = "2025 Market Value")
This histogram shows that the 2025 market value is right-skewed, suggesting that there may be some outliers (a few unusually expensive properties).
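To make the outlier claim concrete, we can list the observations that fall beyond the usual boxplot whiskers; a quick sketch in base R:
# Potential outliers: values beyond 1.5 * IQR from the quartiles
boxplot.stats(dat_clean$y)$out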
First, we check the pairwise correlations of the variables. Before doing so, we drop x3 (Homestead.Cap.Loss), which is zero for every observation and therefore carries no information, and we drop observation No 6310 (row 2), whose Main Area Value and Garage Value are missing.
# remove x3: constant at zero, so its correlations are undefined
dat_clean <- dat_clean[,-4]
# remove No 6310 (row 2): contains missing values in x6 and x8
dat_clean <- dat_clean[-2,]
cor(dat_clean) # look at the correlations among the variables
## y x1 x2 x4 x5 x6 x7
## y 1.0000000 0.9992520 0.9048352 0.9259925 0.2941591 0.6491497 0.5358176
## x1 0.9992520 1.0000000 0.8876934 0.9247665 0.2959295 0.6535773 0.5300639
## x2 0.9048352 0.8876934 1.0000000 0.8437421 0.2442516 0.5332810 0.5437600
## x4 0.9259925 0.9247665 0.8437421 1.0000000 0.4889257 0.6332790 0.5600016
## x5 0.2941591 0.2959295 0.2442516 0.4889257 1.0000000 0.7946873 0.3268482
## x6 0.6491497 0.6535773 0.5332810 0.6332790 0.7946873 1.0000000 0.3497951
## x7 0.5358176 0.5300639 0.5437600 0.5600016 0.3268482 0.3497951 1.0000000
## x8 0.6623152 0.6567621 0.6549693 0.6364813 0.2923794 0.4422172 0.9708413
## x9 0.9081255 0.8912874 0.9996069 0.8435918 0.2433478 0.5368376 0.5345184
## x8 x9
## y 0.6623152 0.9081255
## x1 0.6567621 0.8912874
## x2 0.6549693 0.9996069
## x4 0.6364813 0.8435918
## x5 0.2923794 0.2433478
## x6 0.4422172 0.5368376
## x7 0.9708413 0.5345184
## x8 1.0000000 0.6451273
## x9 0.6451273 1.0000000
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
corrplot(cor(dat_clean),method="number",type="upper")
From this corrplot, we find that the pairs (x1, x2), (x1, x4), (x1, x9), (x2, x4), (x2, x9), (x4, x9), and (x7, x8) are highly correlated, so it is sensible to drop some of these predictors to reduce multicollinearity. Taking into account how strongly each predictor correlates with y, we retain x1 and x8 (along with x5 and x6) and drop x2, x4, x7, and x9.
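Rather than reading the pairs off the plot by eye, we can also list them programmatically; a small sketch, using 0.8 as an (assumed) cutoff for "high" correlation:
# Flag predictor pairs with |correlation| above 0.8
cm <- cor(dat_clean[,-1])            # drop the response y
cm[upper.tri(cm, diag = TRUE)] <- NA # keep each pair only once
which(abs(cm) > 0.8, arr.ind = TRUE)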
Before model selection, we first build our first-order basic model: \[\hat{y}=\beta_0+\beta_1 x_1+\beta_5 x_5+\beta_6 x_6+\beta_8 x_8\]
# Fit basic model
fit1 <- lm(dat_clean$y~dat_clean$x1+dat_clean$x5+dat_clean$x6+dat_clean$x8)
summary(fit1)
##
## Call:
## lm(formula = dat_clean$y ~ dat_clean$x1 + dat_clean$x5 + dat_clean$x6 +
## dat_clean$x8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8796.5 -3868.1 -404.3 2900.9 19073.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.137e+04 8.295e+03 1.370 0.179
## dat_clean$x1 1.081e+00 1.230e-02 87.888 <2e-16 ***
## dat_clean$x5 2.958e+00 5.696e+00 0.519 0.607
## dat_clean$x6 -3.371e-02 3.708e-02 -0.909 0.369
## dat_clean$x8 7.105e-02 6.076e-02 1.169 0.250
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5473 on 36 degrees of freedom
## Multiple R-squared: 0.9986, Adjusted R-squared: 0.9985
## F-statistic: 6457 on 4 and 36 DF, p-value: < 2.2e-16
plot(fit1,1) # Residuals vs Fitted
plot(fit1,2) # Q-Q Residuals
plot(fit1,3) # Scale-Location
plot(fit1,5) # Residuals vs Leverage
## Check VIFs ##
#install.packages("car")
library(car)
vif(fit1) # check whether any VIF is concerning
## dat_clean$x1 dat_clean$x5 dat_clean$x6 dat_clean$x8
## 3.347536 3.720703 5.819666 1.841477
So the first fitted model is:
\[\hat{y} = 1.137\times 10^{4} + 1.081\,x_1 + 2.958\,x_5 - 3.371\times 10^{-2}\,x_6 + 7.105\times 10^{-2}\,x_8\]
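These estimates can be pulled directly from the fitted object rather than transcribed by hand:
# Coefficients of fit1, rounded to four significant figures
signif(coef(fit1), 4)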
As shown in the Residuals vs Fitted plot, there is no outward-opening funnel pattern, so there is no strong sign of non-constant variance; at the same time, the fit is nearly perfect, which raises some concern about overfitting.
As shown in the Q-Q plot of the residuals, there is positive skew, which implies non-normality.
As shown in the Scale-Location plot, there are some outliers whose standardized residuals exceed 1.5.
As shown in the Residuals vs Leverage plot, observation 4 has high leverage, approaching 1; observations 14 and 6 fall outside the Cook's distance contours, meaning they are influential points.
Therefore, our basic model does not satisfy the model adequacy checks.
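The leverage and influence claims can also be checked numerically; a short sketch, using the conventional cutoffs 2p/n for leverage and 4/n for Cook's distance (assumed thresholds, not part of the original analysis):
# Numeric check of high-leverage and influential observations
lev <- hatvalues(fit1)
cd  <- cooks.distance(fit1)
p <- length(coef(fit1)); n <- nrow(dat_clean)
which(lev > 2 * p / n) # high-leverage points
which(cd > 4 / n)      # influential points by Cook's distance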
Here, the VIF of x6 exceeds 5, which means x6 has significant multicollinearity with the other predictor variables. So we can build another model without x6 and examine the effect.
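As a sanity check, the VIF of x6 can be reproduced by hand from its definition \[\mathrm{VIF}_j = \frac{1}{1-R_j^2},\] where the R-squared comes from regressing x6 on the other predictors in the model:
# Recompute the VIF of x6 from its definition
r2_x6 <- summary(lm(x6 ~ x1 + x5 + x8, data = dat_clean))$r.squared
1 / (1 - r2_x6) # should match the value reported by vif(fit1) for x6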
In this section, we build another model without x6 (Main Area Value), which showed significant multicollinearity with the other predictors.
fit2 <- lm(dat_clean$y~dat_clean$x1+dat_clean$x5+dat_clean$x8)
summary(fit2)
##
## Call:
## lm(formula = dat_clean$y ~ dat_clean$x1 + dat_clean$x5 + dat_clean$x8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9038.8 -4088.7 -617.6 2774.3 19100.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.176e+04 8.264e+03 1.423 0.163
## dat_clean$x1 1.074e+00 8.990e-03 119.431 <2e-16 ***
## dat_clean$x5 -1.374e+00 3.113e+00 -0.441 0.662
## dat_clean$x8 8.013e-02 5.979e-02 1.340 0.188
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5460 on 37 degrees of freedom
## Multiple R-squared: 0.9986, Adjusted R-squared: 0.9985
## F-statistic: 8650 on 3 and 37 DF, p-value: < 2.2e-16
vif(fit2)
## dat_clean$x1 dat_clean$x5 dat_clean$x8
## 1.795790 1.116659 1.791689
So the refined model is:
\[\hat{y} = 1.176\times 10^{4} + 1.074\,x_1 - 1.374\,x_5 + 8.013\times 10^{-2}\,x_8\]
From the summary, the multiple R-squared of fit2 is 0.9986, meaning the model explains 99.86% of the variation in market value, a very strong fit. The adjusted R-squared is 0.9985, close to the multiple R-squared, suggesting the model is not overfitted. The overall F-test p-value is below 0.05, meaning the predictors jointly make a meaningful contribution to predicting the 2025 market value. Moreover, all VIFs are now below 2, so the multicollinearity problem has been resolved.
Having obtained a meaningful model (fit2), we now use it with a 95% prediction interval to predict the 2025 market value of 6321 88th Street.
# Create new data frame for prediction
predict_value_6321 <- data.frame(
x1 = 43767,
x5 = 451287,
x8 = 7546
)
# Make prediction with prediction interval
predict(fit2, newdata = predict_value_6321, interval = "prediction", level = 0.95)
## Warning: 'newdata' had 1 row but variables found have 41 rows
## fit lwr upr
## 1 740965.5 729236.2 752694.7
## 2 562424.0 550992.9 573855.0
## 3 594693.7 581804.9 607582.4
## 4 457460.0 445857.1 469062.9
## 5 969610.4 955963.9 983256.8
## 6 556303.4 544746.3 567860.6
## 7 702585.6 691066.3 714105.0
## 8 586503.3 575137.6 597869.0
## 9 476866.4 465448.5 488284.4
## 10 465664.5 454239.3 477089.7
## 11 607543.7 596145.7 618941.8
## 12 542497.7 531268.9 553726.5
## 13 1214116.4 1200175.9 1228056.8
## 14 535308.1 524035.5 546580.7
## 15 626148.8 614591.1 637706.4
## 16 511732.6 500471.0 522994.2
## 17 600470.3 589009.3 611931.3
## 18 498099.3 486791.0 509407.5
## 19 490776.4 479427.3 502125.5
## 20 512644.8 501368.4 523921.3
## 21 411982.4 400351.5 423613.2
## 22 587989.5 576649.8 599329.3
## 23 580000.8 568245.5 591756.1
## 24 539446.5 528153.9 550739.1
## 25 466683.0 455134.7 478231.3
## 26 616190.0 603655.6 628724.4
## 27 526232.2 515003.2 537461.3
## 28 522813.9 511568.9 534058.9
## 29 531475.5 520234.8 542716.1
## 30 560364.3 548971.2 571757.5
## 31 464377.7 452851.2 475904.2
## 32 540083.7 528810.5 551357.0
## 33 532578.6 521271.6 543885.7
## 34 561968.4 550546.2 573390.6
## 35 503624.4 492359.2 514889.5
## 36 532813.6 521541.9 544085.3
## 37 579299.1 567960.2 590637.9
## 38 430653.4 419210.6 442096.2
## 39 531974.2 520664.5 543283.9
## 40 511552.3 500284.4 522820.1
## 41 499579.7 488283.7 510875.7
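The warning above shows that predict() did not actually use newdata: because fit2 was specified with dat_clean$ variables instead of a formula with a data argument, the new observation could not be matched to the model terms, and the 41 rows printed are simply the in-sample prediction intervals. A minimal sketch of the corrected workflow (the refit has identical coefficients, so the single-row prediction for 6321 88th Street now works as intended):
# Refit with a formula and data= so predict() can match the columns of newdata
fit2b <- lm(y ~ x1 + x5 + x8, data = dat_clean)
predict(fit2b, newdata = predict_value_6321,
        interval = "prediction", level = 0.95)
The values in predict_value_6321 should also be double-checked against the variable definitions, since x1 is the total improvement market value, x5 the main-area square footage, and x8 the garage value, and the supplied numbers fall well outside the observed ranges of those variables.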