Data Loading

library(xlsx)
data <- read.xlsx("./data/Leslie_Salt.xlsx",1)
str(data)
## 'data.frame':    31 obs. of  8 variables:
##  $ Price    : num  4.5 10.6 1.7 5 5 3.3 5.7 6.2 19.4 3.2 ...
##  $ County   : num  1 1 0 0 0 1 1 1 1 1 ...
##  $ Size     : num  138.4 52 16.1 1695.2 845 ...
##  $ Elevation: num  10 4 0 1 1 2 4 4 20 0 ...
##  $ Sewer    : num  3000 0 2640 3500 1000 10000 0 0 1300 6000 ...
##  $ Date     : num  -103 -103 -98 -93 -92 -86 -68 -64 -63 -62 ...
##  $ Flood    : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ Distance : num  0.3 2.5 10.3 14 14 0 0 0 1.2 0 ...

Data Processing

Convert Country and Flood as factor variables.

data$County <- factor(data$County,  levels=c("0","1"), labels=c("San Mateo", "Santa Clara" ))
data$Flood <- factor(data$Flood,  levels=c("0","1"), labels=c("No", "Yes" ))

Data Exploration

library(GGally)
ggpairs(data)

We notice that date has the maximum corelation with the price i.e. the properties sold closer to the current date have fetched higher price. Multicollinearity between the variables does not seem to be a problem.

Regression Analysis

First, we fit the full model as the following.

fit1 <- lm(Price ~ ., data= data)
summary(fit1)
## 
## Call:
## lm(formula = Price ~ ., data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.169 -2.957 -0.256  2.070 13.031 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.364e+01  3.829e+00   6.174 2.68e-06 ***
## CountySanta Clara -8.789e+00  3.652e+00  -2.407 0.024532 *  
## Size              -6.043e-03  3.501e-03  -1.726 0.097702 .  
## Elevation          5.193e-01  2.386e-01   2.177 0.040030 *  
## Sewer             -9.573e-04  4.169e-04  -2.296 0.031126 *  
## Date               8.508e-02  4.865e-02   1.749 0.093646 .  
## FloodYes          -1.202e+01  2.989e+00  -4.020 0.000536 ***
## Distance           1.858e-01  3.395e-01   0.547 0.589386    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.431 on 23 degrees of freedom
## Multiple R-squared:  0.747,  Adjusted R-squared:   0.67 
## F-statistic: 9.703 on 7 and 23 DF,  p-value: 1.351e-05

The p-value is <0.05, thus this is a valid model. All the variables are significant minus the Distance. R-squared is 0.75, thus the model explains 75% of the variance. The Adjusted R-squared is 0.67. Lets build a model without the “Distance” variable.

fit2 <- lm(Price ~ .-Distance, data= data)
summary(fit2)
## 
## Call:
## lm(formula = Price ~ . - Distance, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6249 -2.8459 -0.5266  1.9702 12.6368 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.521e+01  2.514e+00  10.025 4.68e-10 ***
## CountySanta Clara -1.005e+01  2.789e+00  -3.604 0.001423 ** 
## Size              -5.425e-03  3.265e-03  -1.662 0.109585    
## Elevation          4.995e-01  2.324e-01   2.150 0.041869 *  
## Sewer             -1.054e-03  3.724e-04  -2.829 0.009272 ** 
## Date               7.842e-02  4.641e-02   1.690 0.104029    
## FloodYes          -1.219e+01  2.927e+00  -4.166 0.000347 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.366 on 24 degrees of freedom
## Multiple R-squared:  0.7437, Adjusted R-squared:  0.6797 
## F-statistic: 11.61 on 6 and 24 DF,  p-value: 4.257e-06

The Multiple R-squred and Adjusted R-Squared have similar values at 0.75 and 0.67 respectively when compared to model 1. The Size and Date are insignificant variables in this model. Let’s build a 3rd model without the Size and Date variables.

fit3 <- lm(Price ~ .-Distance-Size-Date, data= data)
summary(fit3)
## 
## Call:
## lm(formula = Price ~ . - Distance - Size - Date, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9477  -3.1059   0.1916   2.1658  13.8554 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.063e+01  2.605e+00   7.921 2.13e-08 ***
## CountySanta Clara -1.090e+01  2.492e+00  -4.375 0.000175 ***
## Elevation          5.428e-01  2.803e-01   1.937 0.063704 .  
## Sewer             -1.207e-03  4.347e-04  -2.777 0.010032 *  
## FloodYes          -1.318e+01  3.230e+00  -4.081 0.000379 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.283 on 26 degrees of freedom
## Multiple R-squared:  0.5936, Adjusted R-squared:  0.5311 
## F-statistic: 9.495 on 4 and 26 DF,  p-value: 7.183e-05

The Multiple R-squared and Adjusted R-Squared have fallen to 0.59 and 0.53 respectively. We will reject this model and go with the second model.

Residual Analysis and Diagnostics

par(mfrow = c(2, 2))
plot(fit2)

By and large the plots look fine minus the data point#26 which is an errant in each of the diagostic plots. Let’s re-build model #2 without the row number 26.

fit4 <- lm(Price ~ .-Distance, data= data[-26,])
summary(fit4)
## 
## Call:
## lm(formula = Price ~ . - Distance, data = data[-26, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1098 -2.0090 -0.4809  2.2081  5.1694 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.2741624  1.9916393  11.184 8.89e-11 ***
## CountySanta Clara -5.7200830  2.3006652  -2.486  0.02061 *  
## Size              -0.0022859  0.0025410  -0.900  0.37766    
## Elevation          0.4972991  0.1737789   2.862  0.00882 ** 
## Sewer             -0.0007418  0.0002871  -2.584  0.01661 *  
## Date               0.1110019  0.0354663   3.130  0.00470 ** 
## FloodYes          -8.4899819  2.3413474  -3.626  0.00142 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.265 on 23 degrees of freedom
## Multiple R-squared:  0.7824, Adjusted R-squared:  0.7256 
## F-statistic: 13.78 on 6 and 23 DF,  p-value: 1.307e-06

Size is insignificant variable in this model, let’s buid a model without it.

fit5 <- lm(Price ~ .-Distance-Size, data= data[-26,])
summary(fit5)
## 
## Call:
## lm(formula = Price ~ . - Distance - Size, data = data[-26, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0186 -2.2651 -0.3114  2.1549  5.1596 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.0187525  1.9634490  11.214 5.01e-11 ***
## CountySanta Clara -4.4613706  1.8189990  -2.453  0.02183 *  
## Elevation          0.5086667  0.1726287   2.947  0.00704 ** 
## Sewer             -0.0006846  0.0002789  -2.455  0.02173 *  
## Date               0.1308357  0.0276699   4.728 8.28e-05 ***
## FloodYes          -7.6795702  2.1524916  -3.568  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.252 on 24 degrees of freedom
## Multiple R-squared:  0.7747, Adjusted R-squared:  0.7278 
## F-statistic: 16.51 on 5 and 24 DF,  p-value: 4.372e-07

Predict the price of the Leslie_Salt property

County = Santa Clara, Size = 246.8, Elevation = 0 (property at sea level),Sewer = 0(no data provided), Date = 6 (assuming property will be sold in next 6 months), Flood = 0 (property diked), Distance = 0 (as distance is relative to Leslie Salt property)

leslie_salt <- data.frame(0,"Santa Clara",246.8,0,0,6,"No",0)
colnames(leslie_salt) <- c("Price", "County", "Size", "Elevation", "Sewer", "Date", "Flood", "Distance")
data<-rbind(data,leslie_salt)
leslie_salt_price <- predict(fit5, newdata = data[32,])
leslie_salt_price
##      32 
## 18.3424

The price of Leslie Salt property is $18342/acre.