1 Introduction

2. DATA Description and Pre-treatment

In order to assess the property tax, we firstly add the data from the Lubbock Central Appraisal District.

dat <- read.csv("https://raw.githubusercontent.com/Hola-Xu/SA-Project-2/refs/heads/main/project2%20dataset.csv")
head(dat)
##     No X2025.Market.Value Total.Improvement.Market.Value
## 1 6309             735026                         677026
## 2 6310             663907                         603222
## 3 6311             569992                         511992
## 4 6312             602427                         538677
## 5 6313             460288                         415135
## 6 6314             968766                         888796
##   Total.Land.Market.Value Homestead.Cap.Loss Total.Main.Area..Sq..Ft..
## 1                   58000                  0                     3462 
## 2                   60685                  0                     3226 
## 3                   58000                  0                     3036 
## 4                   63750                  0                     3277 
## 5                   45153                  0                     2241 
## 6                   79970                  0                      4188
##   Main.Area..Sq..Ft.. Main.Area..Value. Garage..Sq..Ft.. Garage..Value.
## 1                3192           558004              1063         83622 
## 2                3226               N/A              672            N/A
## 3                3036           447924               965         64068 
## 4                2877           432168              1309        106509 
## 5                2241           376845               506         38290 
## 6                2041           367600               985         79833 
##   Land..Sq..Ft..  X
## 1          10000 NA
## 2          10463 NA
## 3          10000 NA
## 4          10625 NA
## 5           7785 NA
## 6          13788 NA
str(dat)
## 'data.frame':    42 obs. of  12 variables:
##  $ No                            : num  6309 6310 6311 6312 6313 ...
##  $ X2025.Market.Value            : num  735026 663907 569992 602427 460288 ...
##  $ Total.Improvement.Market.Value: num  677026 603222 511992 538677 415135 ...
##  $ Total.Land.Market.Value       : num  58000 60685 58000 63750 45153 ...
##  $ Homestead.Cap.Loss            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Total.Main.Area..Sq..Ft..     : chr  "3462 " "3226 " "3036 " "3277 " ...
##  $ Main.Area..Sq..Ft..           : num  3192 3226 3036 2877 2241 ...
##  $ Main.Area..Value.             : chr  "558004 " "N/A" "447924 " "432168 " ...
##  $ Garage..Sq..Ft..              : num  1063 672 965 1309 506 ...
##  $ Garage..Value.                : chr  "83622 " "N/A" "64068 " "106509 " ...
##  $ Land..Sq..Ft..                : num  10000 10463 10000 10625 7785 ...
##  $ X                             : logi  NA NA NA NA NA NA ...
summary(dat)
##        No       X2025.Market.Value Total.Improvement.Market.Value
##  Min.   :6309   Min.   : 418286    Min.   : 373092               
##  1st Qu.:6319   1st Qu.: 506390    1st Qu.: 460406               
##  Median :6330   Median : 536431    Median : 490236               
##  Mean   :6330   Mean   : 570191    Mean   : 519710               
##  3rd Qu.:6340   3rd Qu.: 580772    3rd Qu.: 535879               
##  Max.   :6350   Max.   :1218146    Max.   :1116617               
##  Total.Land.Market.Value Homestead.Cap.Loss Total.Main.Area..Sq..Ft..
##  Min.   : 43506          Min.   :0          Length:42                
##  1st Qu.: 44986          1st Qu.:0          Class :character         
##  Median : 45658          Median :0          Mode  :character         
##  Mean   : 50480          Mean   :0                                   
##  3rd Qu.: 47379          3rd Qu.:0                                   
##  Max.   :101529          Max.   :0                                   
##  Main.Area..Sq..Ft.. Main.Area..Value.  Garage..Sq..Ft.. Garage..Value.    
##  Min.   :2041        Length:42          Min.   : 479.0   Length:42         
##  1st Qu.:2597        Class :character   1st Qu.: 528.0   Class :character  
##  Median :2752        Mode  :character   Median : 551.0   Mode  :character  
##  Mean   :2754                           Mean   : 694.9                     
##  3rd Qu.:2921                           3rd Qu.: 917.8                     
##  Max.   :3532                           Max.   :1309.0                     
##  Land..Sq..Ft..     X          
##  Min.   : 7501   Mode:logical  
##  1st Qu.: 7756   NA's:42       
##  Median : 7872                 
##  Mean   : 8695                 
##  3rd Qu.: 8169                 
##  Max.   :17505

Since the data provided to us is not in the most usable form for regression, we perform data pre-treatment. Here, we convert the data types from character variables to numeric variables.

After converting all the variables to the appropriate form, we shall now decide which of the variables are going to be predictor variables. Setting the “Ordered.to.Complete…Mins” as response variable (𝑦) and all other variables except “Unique.Identifier” shall be set as predictor variables($$x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9$$). We shall then bind these variables into a data frame(dat_clean).

clean_numeric <- function(x) {
  x <- gsub("[$,]", "", x)                     
  x[x %in% c("", "NA", "N/A", "Not Available")] <- NA  
  as.numeric(x)
}

dat$X2025.Market.Value <- clean_numeric(dat$X2025.Market.Value)
dat$Total.Improvement.Market.Value <- clean_numeric(dat$Total.Improvement.Market.Value)
dat$Total.Land.Market.Value <- clean_numeric(dat$Total.Land.Market.Value)
dat$Total.Main.Area..Sq..Ft.. <- clean_numeric(dat$Total.Main.Area..Sq..Ft..)
dat$Main.Area..Sq..Ft.. <- clean_numeric(dat$Main.Area..Sq..Ft..)
dat$Main.Area..Value. <- clean_numeric(dat$Main.Area..Value.)
dat$Garage..Sq..Ft.. <- clean_numeric(dat$Garage..Sq..Ft..)
dat$Garage..Value. <- clean_numeric(dat$Garage..Value.)
dat$Land..Sq..Ft.. <- clean_numeric(dat$Land..Sq..Ft..)
str(dat)
## 'data.frame':    42 obs. of  12 variables:
##  $ No                            : num  6309 6310 6311 6312 6313 ...
##  $ X2025.Market.Value            : num  735026 663907 569992 602427 460288 ...
##  $ Total.Improvement.Market.Value: num  677026 603222 511992 538677 415135 ...
##  $ Total.Land.Market.Value       : num  58000 60685 58000 63750 45153 ...
##  $ Homestead.Cap.Loss            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Total.Main.Area..Sq..Ft..     : num  3462 3226 3036 3277 2241 ...
##  $ Main.Area..Sq..Ft..           : num  3192 3226 3036 2877 2241 ...
##  $ Main.Area..Value.             : num  558004 NA 447924 432168 376845 ...
##  $ Garage..Sq..Ft..              : num  1063 672 965 1309 506 ...
##  $ Garage..Value.                : num  83622 NA 64068 106509 38290 ...
##  $ Land..Sq..Ft..                : num  10000 10463 10000 10625 7785 ...
##  $ X                             : logi  NA NA NA NA NA NA ...
# Converting the data frame into variables of x's and y
x <- dat[,3:11]
dat_clean <- cbind(dat$X2025.Market.Value,as.data.frame(x))
colnames(dat_clean) <- c("y","x1","x2","x3","x4","x5","x6","x7","x8","x9")

Having converted the data and having set the variables into a data frame, we are ready to build our basic model.

3.Exploratory Data Analysis(EDA)

In this part, we try to assess the distribution of the response variables and find out whether there are some relationship between different predictors. If there are some relationship, try to find out the function between them.

Firstly, analysing the distribution of Electrical Resistance data.

#Histogram
hist(dat_clean$y,
     main= "Histogram of 2025 Market Value",
     xlab = "2025 Market Value",
     ) 

From this distribution plot, we can know that there is a right skew in the 2025 market value, meaning that there might be some outliers.

4. Model Adequacy for Basic Model

Firstly, check pairwise correlation of the predictor variables.

#remove x3
dat_clean <- dat_clean[,-4]
#remove No 6310
dat_clean <- dat_clean[-2,]
cor(dat_clean) #look at correlation between predictor variables
##            y        x1        x2        x4        x5        x6        x7
## y  1.0000000 0.9992520 0.9048352 0.9259925 0.2941591 0.6491497 0.5358176
## x1 0.9992520 1.0000000 0.8876934 0.9247665 0.2959295 0.6535773 0.5300639
## x2 0.9048352 0.8876934 1.0000000 0.8437421 0.2442516 0.5332810 0.5437600
## x4 0.9259925 0.9247665 0.8437421 1.0000000 0.4889257 0.6332790 0.5600016
## x5 0.2941591 0.2959295 0.2442516 0.4889257 1.0000000 0.7946873 0.3268482
## x6 0.6491497 0.6535773 0.5332810 0.6332790 0.7946873 1.0000000 0.3497951
## x7 0.5358176 0.5300639 0.5437600 0.5600016 0.3268482 0.3497951 1.0000000
## x8 0.6623152 0.6567621 0.6549693 0.6364813 0.2923794 0.4422172 0.9708413
## x9 0.9081255 0.8912874 0.9996069 0.8435918 0.2433478 0.5368376 0.5345184
##           x8        x9
## y  0.6623152 0.9081255
## x1 0.6567621 0.8912874
## x2 0.6549693 0.9996069
## x4 0.6364813 0.8435918
## x5 0.2923794 0.2433478
## x6 0.4422172 0.5368376
## x7 0.9708413 0.5345184
## x8 1.0000000 0.6451273
## x9 0.6451273 1.0000000
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.3
## corrplot 0.95 loaded
corrplot(cor(dat_clean),method="number",type="upper")

From this corrplot, we can find that x1 and x2, x1 and x4,x1 and x9, x2 and x4, x2 and x9,x4 and x9,x7 and x8 have high correlation, so it’s suitable to delete some of them to avoid overfitting. Also, relating to the high correlation between predictor variables and y, we try to serve x1 and x8, and delete x2, x4, x7 and x9.

Before model selection, we first build our first-order basic model: \[\hat{y}=\beta_0+\beta_1 x_1+\beta_5 x_5+\beta_6 x_6+\beta_8 x_8\]

# Fit basic model
fit1 <- lm(dat_clean$y~dat_clean$x1+dat_clean$x5+dat_clean$x6+dat_clean$x8)
summary(fit1)
## 
## Call:
## lm(formula = dat_clean$y ~ dat_clean$x1 + dat_clean$x5 + dat_clean$x6 + 
##     dat_clean$x8)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8796.5 -3868.1  -404.3  2900.9 19073.4 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.137e+04  8.295e+03   1.370    0.179    
## dat_clean$x1  1.081e+00  1.230e-02  87.888   <2e-16 ***
## dat_clean$x5  2.958e+00  5.696e+00   0.519    0.607    
## dat_clean$x6 -3.371e-02  3.708e-02  -0.909    0.369    
## dat_clean$x8  7.105e-02  6.076e-02   1.169    0.250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5473 on 36 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9985 
## F-statistic:  6457 on 4 and 36 DF,  p-value: < 2.2e-16
plot(fit1,1)

plot(fit1,2)

plot(fit1,3) 

##Check VIFs ##
#install.package("car")
vif(fit1) # lots of the vifs are concerning
## dat_clean$x1 dat_clean$x5 dat_clean$x6 dat_clean$x8 
##     3.347536     3.720703     5.819666     1.841477

So the first model is:

\[ y = 1.137e+04 + 1.08e x1 + 2.958e x5 -3.371e x6-02 +7.105e-02x8 \]

As shown in Residuals vs Fitted , there is no outward - opening funnel pattern which implies that the model seems extreamly perfect, so there might be overfitting among them.

As shown in Q - Q Residuals, we can observe a positive skew, which implies the non-normality.

As shown in Scale- Location, there are some outliers whose standardized residuals are greater than 1.5.

As shown in Scale- Location, the node 4 is high leverage which come close to leverage 1; node 14 and node 6 are out of the Cook’s Distance, meaning that they are influential points.

Therefore, our basic model doesn’t have model adequacy.

Here, the VIF of x6 is more than 5, that means x6 has significance multicolinearity with other predictor variables. So we can build another model without x6, and see its influence.

5. Refine model

In this section, we try to build another model without x6(Main_Area_Value), which has significance multicolinearity with other predictor variables.

fit2 <- lm(dat_clean$y~dat_clean$x1+dat_clean$x5+dat_clean$x8)
summary(fit2)
## 
## Call:
## lm(formula = dat_clean$y ~ dat_clean$x1 + dat_clean$x5 + dat_clean$x8)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9038.8 -4088.7  -617.6  2774.3 19100.2 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.176e+04  8.264e+03   1.423    0.163    
## dat_clean$x1  1.074e+00  8.990e-03 119.431   <2e-16 ***
## dat_clean$x5 -1.374e+00  3.113e+00  -0.441    0.662    
## dat_clean$x8  8.013e-02  5.979e-02   1.340    0.188    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5460 on 37 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9985 
## F-statistic:  8650 on 3 and 37 DF,  p-value: < 2.2e-16
vif(fit2)
## dat_clean$x1 dat_clean$x5 dat_clean$x8 
##     1.795790     1.116659     1.791689
dat_clean2 <- dat_clean[,-15]
fit2 <- lm(dat_clean2$y~dat_clean2$x1+dat_clean2$x5+dat_clean$x8)
summary(fit2)
## 
## Call:
## lm(formula = dat_clean2$y ~ dat_clean2$x1 + dat_clean2$x5 + dat_clean$x8)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9038.8 -4088.7  -617.6  2774.3 19100.2 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.176e+04  8.264e+03   1.423    0.163    
## dat_clean2$x1  1.074e+00  8.990e-03 119.431   <2e-16 ***
## dat_clean2$x5 -1.374e+00  3.113e+00  -0.441    0.662    
## dat_clean$x8   8.013e-02  5.979e-02   1.340    0.188    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5460 on 37 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9985 
## F-statistic:  8650 on 3 and 37 DF,  p-value: < 2.2e-16

So the refine model is:

\[ y = 1.176e^4 + 1.074e x1 - 1.374e x5 +8.013e^{-02}x8 \]

From the summary, we can know that the multiple R-squared of fit2 is 0.9986, meaning that the model explains 99.86% of variation in market value very strong fit. The Adjusted R-squared is 0.9985, meaning this model is not overfitted. And the p-value <0.05, meaning that these response variables have a meaningfully contribution in predicting the 2025 market value.

vif(fit2)
## dat_clean2$x1 dat_clean2$x5  dat_clean$x8 
##      1.795790      1.116659      1.791689

6 Prediction

After having a meaningful model fit2, we try to use fit2 and the 95% prediction interval to predict the 2025 market value of 6321 88th Street.

# Create new data frame for prediction
predict_value_6321 <- data.frame(
  x1 = 43767,
  x5 = 451287,
  x8 = 7546
  )

# Make prediction with prediction interval
predict(fit2, newdata = predict_value_6321, interval = "prediction", level = 0.95)
## Warning: 'newdata' had 1 row but variables found have 41 rows
##          fit       lwr       upr
## 1   740965.5  729236.2  752694.7
## 2   562424.0  550992.9  573855.0
## 3   594693.7  581804.9  607582.4
## 4   457460.0  445857.1  469062.9
## 5   969610.4  955963.9  983256.8
## 6   556303.4  544746.3  567860.6
## 7   702585.6  691066.3  714105.0
## 8   586503.3  575137.6  597869.0
## 9   476866.4  465448.5  488284.4
## 10  465664.5  454239.3  477089.7
## 11  607543.7  596145.7  618941.8
## 12  542497.7  531268.9  553726.5
## 13 1214116.4 1200175.9 1228056.8
## 14  535308.1  524035.5  546580.7
## 15  626148.8  614591.1  637706.4
## 16  511732.6  500471.0  522994.2
## 17  600470.3  589009.3  611931.3
## 18  498099.3  486791.0  509407.5
## 19  490776.4  479427.3  502125.5
## 20  512644.8  501368.4  523921.3
## 21  411982.4  400351.5  423613.2
## 22  587989.5  576649.8  599329.3
## 23  580000.8  568245.5  591756.1
## 24  539446.5  528153.9  550739.1
## 25  466683.0  455134.7  478231.3
## 26  616190.0  603655.6  628724.4
## 27  526232.2  515003.2  537461.3
## 28  522813.9  511568.9  534058.9
## 29  531475.5  520234.8  542716.1
## 30  560364.3  548971.2  571757.5
## 31  464377.7  452851.2  475904.2
## 32  540083.7  528810.5  551357.0
## 33  532578.6  521271.6  543885.7
## 34  561968.4  550546.2  573390.6
## 35  503624.4  492359.2  514889.5
## 36  532813.6  521541.9  544085.3
## 37  579299.1  567960.2  590637.9
## 38  430653.4  419210.6  442096.2
## 39  531974.2  520664.5  543283.9
## 40  511552.3  500284.4  522820.1
## 41  499579.7  488283.7  510875.7