library(xlsx)
data <- read.xlsx("./data/Leslie_Salt.xlsx",1)
str(data)
## 'data.frame': 31 obs. of 8 variables:
## $ Price : num 4.5 10.6 1.7 5 5 3.3 5.7 6.2 19.4 3.2 ...
## $ County : num 1 1 0 0 0 1 1 1 1 1 ...
## $ Size : num 138.4 52 16.1 1695.2 845 ...
## $ Elevation: num 10 4 0 1 1 2 4 4 20 0 ...
## $ Sewer : num 3000 0 2640 3500 1000 10000 0 0 1300 6000 ...
## $ Date : num -103 -103 -98 -93 -92 -86 -68 -64 -63 -62 ...
## $ Flood : num 0 0 1 0 1 0 0 0 0 0 ...
## $ Distance : num 0.3 2.5 10.3 14 14 0 0 0 1.2 0 ...
Convert Country and Flood as factor variables.
data$County <- factor(data$County, levels=c("0","1"), labels=c("San Mateo", "Santa Clara" ))
data$Flood <- factor(data$Flood, levels=c("0","1"), labels=c("No", "Yes" ))
library(GGally)
ggpairs(data)
We notice that date has the maximum corelation with the price i.e. the properties sold closer to the current date have fetched higher price. Multicollinearity between the variables does not seem to be a problem.
First, we fit the full model as the following.
fit1 <- lm(Price ~ ., data= data)
summary(fit1)
##
## Call:
## lm(formula = Price ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.169 -2.957 -0.256 2.070 13.031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.364e+01 3.829e+00 6.174 2.68e-06 ***
## CountySanta Clara -8.789e+00 3.652e+00 -2.407 0.024532 *
## Size -6.043e-03 3.501e-03 -1.726 0.097702 .
## Elevation 5.193e-01 2.386e-01 2.177 0.040030 *
## Sewer -9.573e-04 4.169e-04 -2.296 0.031126 *
## Date 8.508e-02 4.865e-02 1.749 0.093646 .
## FloodYes -1.202e+01 2.989e+00 -4.020 0.000536 ***
## Distance 1.858e-01 3.395e-01 0.547 0.589386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.431 on 23 degrees of freedom
## Multiple R-squared: 0.747, Adjusted R-squared: 0.67
## F-statistic: 9.703 on 7 and 23 DF, p-value: 1.351e-05
The p-value is <0.05, thus this is a valid model. All the variables are significant minus the Distance. R-squared is 0.75, thus the model explains 75% of the variance. The Adjusted R-squared is 0.67. Lets build a model without the “Distance” variable.
fit2 <- lm(Price ~ .-Distance, data= data)
summary(fit2)
##
## Call:
## lm(formula = Price ~ . - Distance, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6249 -2.8459 -0.5266 1.9702 12.6368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.521e+01 2.514e+00 10.025 4.68e-10 ***
## CountySanta Clara -1.005e+01 2.789e+00 -3.604 0.001423 **
## Size -5.425e-03 3.265e-03 -1.662 0.109585
## Elevation 4.995e-01 2.324e-01 2.150 0.041869 *
## Sewer -1.054e-03 3.724e-04 -2.829 0.009272 **
## Date 7.842e-02 4.641e-02 1.690 0.104029
## FloodYes -1.219e+01 2.927e+00 -4.166 0.000347 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.366 on 24 degrees of freedom
## Multiple R-squared: 0.7437, Adjusted R-squared: 0.6797
## F-statistic: 11.61 on 6 and 24 DF, p-value: 4.257e-06
The Multiple R-squred and Adjusted R-Squared have similar values at 0.75 and 0.67 respectively when compared to model 1. The Size and Date are insignificant variables in this model. Let’s build a 3rd model without the Size and Date variables.
fit3 <- lm(Price ~ .-Distance-Size-Date, data= data)
summary(fit3)
##
## Call:
## lm(formula = Price ~ . - Distance - Size - Date, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9477 -3.1059 0.1916 2.1658 13.8554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.063e+01 2.605e+00 7.921 2.13e-08 ***
## CountySanta Clara -1.090e+01 2.492e+00 -4.375 0.000175 ***
## Elevation 5.428e-01 2.803e-01 1.937 0.063704 .
## Sewer -1.207e-03 4.347e-04 -2.777 0.010032 *
## FloodYes -1.318e+01 3.230e+00 -4.081 0.000379 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.283 on 26 degrees of freedom
## Multiple R-squared: 0.5936, Adjusted R-squared: 0.5311
## F-statistic: 9.495 on 4 and 26 DF, p-value: 7.183e-05
The Multiple R-squared and Adjusted R-Squared have fallen to 0.59 and 0.53 respectively. We will reject this model and go with the second model.
par(mfrow = c(2, 2))
plot(fit2)
By and large the plots look fine minus the data point#26 which is an errant in each of the diagostic plots. Let’s re-build model #2 without the row number 26.
fit4 <- lm(Price ~ .-Distance, data= data[-26,])
summary(fit4)
##
## Call:
## lm(formula = Price ~ . - Distance, data = data[-26, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1098 -2.0090 -0.4809 2.2081 5.1694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.2741624 1.9916393 11.184 8.89e-11 ***
## CountySanta Clara -5.7200830 2.3006652 -2.486 0.02061 *
## Size -0.0022859 0.0025410 -0.900 0.37766
## Elevation 0.4972991 0.1737789 2.862 0.00882 **
## Sewer -0.0007418 0.0002871 -2.584 0.01661 *
## Date 0.1110019 0.0354663 3.130 0.00470 **
## FloodYes -8.4899819 2.3413474 -3.626 0.00142 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.265 on 23 degrees of freedom
## Multiple R-squared: 0.7824, Adjusted R-squared: 0.7256
## F-statistic: 13.78 on 6 and 23 DF, p-value: 1.307e-06
Size is insignificant variable in this model, let’s buid a model without it.
fit5 <- lm(Price ~ .-Distance-Size, data= data[-26,])
summary(fit5)
##
## Call:
## lm(formula = Price ~ . - Distance - Size, data = data[-26, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0186 -2.2651 -0.3114 2.1549 5.1596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.0187525 1.9634490 11.214 5.01e-11 ***
## CountySanta Clara -4.4613706 1.8189990 -2.453 0.02183 *
## Elevation 0.5086667 0.1726287 2.947 0.00704 **
## Sewer -0.0006846 0.0002789 -2.455 0.02173 *
## Date 0.1308357 0.0276699 4.728 8.28e-05 ***
## FloodYes -7.6795702 2.1524916 -3.568 0.00156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.252 on 24 degrees of freedom
## Multiple R-squared: 0.7747, Adjusted R-squared: 0.7278
## F-statistic: 16.51 on 5 and 24 DF, p-value: 4.372e-07
County = Santa Clara, Size = 246.8, Elevation = 0 (property at sea level),Sewer = 0(no data provided), Date = 6 (assuming property will be sold in next 6 months), Flood = 0 (property diked), Distance = 0 (as distance is relative to Leslie Salt property)
leslie_salt <- data.frame(0,"Santa Clara",246.8,0,0,6,"No",0)
colnames(leslie_salt) <- c("Price", "County", "Size", "Elevation", "Sewer", "Date", "Flood", "Distance")
data<-rbind(data,leslie_salt)
leslie_salt_price <- predict(fit5, newdata = data[32,])
leslie_salt_price
## 32
## 18.3424
The price of Leslie Salt property is $18342/acre.