The output shows that the three numeric variables have been standardized: each now has a mean of zero and, because of the "scale" step, a standard deviation of one.
preproc1 <- preProcess(data_complete[,c(7,8,14)], method=c("center", "scale"))
data_complete_scaled <- predict(preproc1, data_complete[,c(7,8,14)])
summary(data_complete_scaled)
## median_sale_price total_new_listings median_days_on_market
## Min. :-1.5193 Min. :-0.9273 Min. :-1.4734
## 1st Qu.:-0.5716 1st Qu.:-0.5789 1st Qu.:-0.7290
## Median :-0.2458 Median :-0.2629 Median :-0.1598
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2548 3rd Qu.: 0.2039 3rd Qu.: 0.5408
## Max. :13.4952 Max. : 8.0932 Max. :14.2904
# Alternatively, the base R scale() function gives the same result:
#data_complete_scaled <- as.data.frame(scale(data_complete[,c(7,8,14)]))
#summary(data_complete_scaled)
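As a rough illustration of what centering and scaling do, the sketch below standardizes a small made-up data frame by hand and compares the result with scale(). The `toy` values are hypothetical and only borrow the column names used above; they are not taken from `data_complete`.

```r
# Toy data standing in for the three numeric columns (hypothetical values).
toy <- data.frame(
  median_sale_price     = c(250000, 310000, 420000, 198000, 560000),
  total_new_listings    = c(120, 85, 300, 40, 210),
  median_days_on_market = c(30, 45, 21, 60, 18)
)

# Manual z-score: subtract the column mean, then divide by the column sd.
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same centering and scaling in one call.
z_scale <- as.data.frame(scale(toy))

# Column means are zero up to floating-point error; column sds are one.
print(round(colMeans(z_manual), 10))
print(sapply(z_manual, sd))

# Largest absolute difference between the two approaches.
max_diff <- max(abs(as.matrix(z_manual) - as.matrix(z_scale)))
print(max_diff)
```

Standardizing changes only the units of each column, not the shape of its distribution, which is why the large maxima in the summary above remain visible as values far above zero.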
The multiple linear regression model explores the relationship between the median sale price (Y) and median days on market (X1) and total new listings in that area (X2).
linear_model <- lm(data_complete_scaled$median_sale_price ~ data_complete_scaled$median_days_on_market + data_complete_scaled$total_new_listings, data=data_complete_scaled)
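The summary shows that both median days on market (X1) and total new listings in that area (X2) are statistically significant predictors of the median sale price (Y), as both p-values (Pr(>|t|)) are less than 0.05. The effects are small, however: the R-squared of about 0.01 means the two variables together explain roughly 1% of the variance in Y.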
summary(linear_model)
##
## Call:
## lm(formula = data_complete_scaled$median_sale_price ~ data_complete_scaled$median_days_on_market +
## data_complete_scaled$total_new_listings, data = data_complete_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5697 -0.5671 -0.2375 0.2390 14.8289
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -3.326e-17 1.076e-02 0.000
## data_complete_scaled$median_days_on_market -9.413e-02 1.076e-02 -8.746
## data_complete_scaled$total_new_listings 3.720e-02 1.076e-02 3.457
## Pr(>|t|)
## (Intercept) 1.00000
## data_complete_scaled$median_days_on_market < 2e-16 ***
## data_complete_scaled$total_new_listings 0.00055 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.995 on 8545 degrees of freedom
## Multiple R-squared: 0.01021, Adjusted R-squared: 0.009978
## F-statistic: 44.07 on 2 and 8545 DF, p-value: < 2.2e-16
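The p-values and R-squared read off the summary above can also be extracted programmatically. The sketch below is a minimal example on simulated data; the data frame `sim` and its data-generating process are assumptions made for illustration, not the housing data. It also writes the formula with bare column names, which is enough once `data =` is supplied.

```r
set.seed(1)
n <- 200

# Simulated predictors (hypothetical; only the column names mirror the report).
sim <- data.frame(
  median_days_on_market = rnorm(n),
  total_new_listings    = rnorm(n)
)

# Hypothetical data-generating process, for illustration only.
sim$median_sale_price <- -0.1 * sim$median_days_on_market +
  0.05 * sim$total_new_listings + rnorm(n)

# With data = sim, bare column names suffice in the formula.
fit <- lm(median_sale_price ~ median_days_on_market + total_new_listings,
          data = sim)

s <- summary(fit)
coef_table <- coef(s)                    # estimates, std. errors, t values, p-values
p_values   <- coef_table[, "Pr(>|t|)"]   # one p-value per coefficient
r_squared  <- s$r.squared                # share of variance in Y explained

print(round(coef_table, 3))
cat("R-squared:", round(r_squared, 3), "\n")
```

Reading the p-values next to the R-squared helps separate statistical significance from practical importance: with thousands of rows, even a very weak effect can have a small p-value.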
explanatory <- c("median_days_on_market", "total_new_listings")
dependent <- "median_sale_price"
data_complete_scaled %>%
  finalfit(dependent, explanatory)
## Dependent: median_sale_price unit value
## median_days_on_market [-1.5,14.3] Mean (sd) -0.0 (1.0)
## total_new_listings [-0.9,8.1] Mean (sd) -0.0 (1.0)
## Coefficient (univariable) Coefficient (multivariable)
## -0.09 (-0.12 to -0.07, p<0.001) -0.09 (-0.12 to -0.07, p<0.001)
## 0.04 (0.02 to 0.06, p=0.001) 0.04 (0.02 to 0.06, p=0.001)
plot(linear_model)
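I performed a logistic regression of the median sale price (Y) on the region state (X3), a categorical variable. Y was first min-max scaled to the range [0, 1]; because it is still continuous rather than binary, glm() warns about non-integer successes. None of the state coefficients is significant (all p-values exceed 0.3), indicating no strong relationship between Y and X3 in this model. The linear model performs better.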
data_complete2 <- data_complete
#data_complete2$median_sale_price <- log(data_complete2$median_sale_price)
preproc2 <- preProcess(data_complete[, c(3,7)], method=c("range"))
data_complete_scaled2 <- predict(preproc2, data_complete2)
summary(data_complete_scaled2)
## region_id region_name region_state region_type
## Min. : 2 Length:8548 Length:8548 Length:8548
## 1st Qu.: 495 Class :character Class :character Class :character
## Median :1512 Mode :character Mode :character Mode :character
## Mean :1604
## 3rd Qu.:2450
## Max. :3230
## period_begin total_homes_sold median_sale_price
## Min. :2020-01-06 00:00:00 Min. : 1.0 Min. :0.00000
## 1st Qu.:2020-03-23 00:00:00 1st Qu.: 81.0 1st Qu.:0.06312
## Median :2020-06-08 00:00:00 Median : 152.0 Median :0.08482
## Mean :2020-06-05 12:26:57 Mean : 216.3 Mean :0.10119
## 3rd Qu.:2020-08-17 00:00:00 3rd Qu.: 270.0 3rd Qu.:0.11816
## Max. :2020-10-26 00:00:00 Max. :2363.0 Max. :1.00000
## total_new_listings median_new_listing_price active_listings
## Min. : 1.0 Min. : 89000 Min. : 24.0
## 1st Qu.: 98.0 1st Qu.: 265000 1st Qu.: 855.8
## Median : 186.0 Median : 329900 Median : 1545.0
## Mean : 259.2 Mean : 379438 Mean : 2484.8
## 3rd Qu.: 316.0 3rd Qu.: 429900 3rd Qu.: 2850.0
## Max. :2513.0 Max. :1650000 Max. :21084.0
## median_active_list_price average_of_median_list_price
## Min. : 100000 Min. : 15000
## 1st Qu.: 289900 1st Qu.: 307450
## Median : 360000 Median : 399900
## Mean : 411947 Mean : 452090
## 3rd Qu.: 472875 3rd Qu.: 529815
## Max. :1599999 Max. :3498000
## average_of_median_offer_price median_days_on_market
## Min. : 9000 Min. : 3.00
## 1st Qu.: 300000 1st Qu.: 20.00
## Median : 399000 Median : 33.00
## Mean : 453531 Mean : 36.65
## 3rd Qu.: 530000 3rd Qu.: 49.00
## Max. :16825370 Max. :363.00
logit_model = glm(formula = data_complete_scaled2$median_sale_price ~ data_complete_scaled2$region_state, data = data_complete_scaled2,
family = binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(logit_model)
##
## Call:
## glm(formula = data_complete_scaled2$median_sale_price ~ data_complete_scaled2$region_state,
## family = binomial, data = data_complete_scaled2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.44092 -0.06679 -0.01098 0.04760 1.87351
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.328458 2.485915 -0.937 0.349
## data_complete_scaled2$region_stateAL -0.480333 2.557036 -0.188 0.851
## data_complete_scaled2$region_stateAR -0.877865 2.821646 -0.311 0.756
## data_complete_scaled2$region_stateAZ -0.190592 2.509134 -0.076 0.939
## data_complete_scaled2$region_stateCA 1.043279 2.487423 0.419 0.675
## data_complete_scaled2$region_stateCO 0.405049 2.490892 0.163 0.871
## data_complete_scaled2$region_stateCT 0.004749 2.512979 0.002 0.998
## data_complete_scaled2$region_stateDC 0.867426 2.516331 0.345 0.730
## data_complete_scaled2$region_stateDE -0.342118 2.615430 -0.131 0.896
## data_complete_scaled2$region_stateFL -0.228935 2.490982 -0.092 0.927
## data_complete_scaled2$region_stateGA -0.206442 2.495340 -0.083 0.934
## data_complete_scaled2$region_stateHI 0.845426 2.516751 0.336 0.737
## data_complete_scaled2$region_stateIA -0.558026 2.827840 -0.197 0.844
## data_complete_scaled2$region_stateID 0.159142 2.522079 0.063 0.950
## data_complete_scaled2$region_stateIL -0.348811 2.497950 -0.140 0.889
## data_complete_scaled2$region_stateIN -0.467608 2.525910 -0.185 0.853
## data_complete_scaled2$region_stateKS -0.153340 2.611177 -0.059 0.953
## data_complete_scaled2$region_stateKY -0.627398 2.551886 -0.246 0.806
## data_complete_scaled2$region_stateLA -0.371053 2.534121 -0.146 0.884
## data_complete_scaled2$region_stateMA 0.518737 2.491666 0.208 0.835
## data_complete_scaled2$region_stateMD 0.031990 2.491041 0.013 0.990
## data_complete_scaled2$region_stateME -0.027056 2.585597 -0.010 0.992
## data_complete_scaled2$region_stateMI -0.569998 2.505281 -0.228 0.820
## data_complete_scaled2$region_stateMN -0.099276 2.501652 -0.040 0.968
## data_complete_scaled2$region_stateMO -0.618999 2.550401 -0.243 0.808
## data_complete_scaled2$region_stateNC -0.198758 2.496417 -0.080 0.937
## data_complete_scaled2$region_stateNE -0.485422 2.601005 -0.187 0.852
## data_complete_scaled2$region_stateNH 0.019726 2.511005 0.008 0.994
## data_complete_scaled2$region_stateNJ 0.125727 2.491167 0.050 0.960
## data_complete_scaled2$region_stateNM -0.224021 2.542958 -0.088 0.930
## data_complete_scaled2$region_stateNV 0.098844 2.513073 0.039 0.969
## data_complete_scaled2$region_stateNY 0.763282 2.496235 0.306 0.760
## data_complete_scaled2$region_stateOH -0.682701 2.506941 -0.272 0.785
## data_complete_scaled2$region_stateOK -0.737949 2.535204 -0.291 0.771
## data_complete_scaled2$region_stateOR 0.331066 2.494513 0.133 0.894
## data_complete_scaled2$region_statePA -0.298989 2.497610 -0.120 0.905
## data_complete_scaled2$region_stateRI -0.145015 2.523541 -0.057 0.954
## data_complete_scaled2$region_stateSC -0.294939 2.504714 -0.118 0.906
## data_complete_scaled2$region_stateTN -0.115858 2.496503 -0.046 0.963
## data_complete_scaled2$region_stateTX -0.206583 2.490530 -0.083 0.934
## data_complete_scaled2$region_stateUT 0.244043 2.502481 0.098 0.922
## data_complete_scaled2$region_stateVA 0.319310 2.489414 0.128 0.898
## data_complete_scaled2$region_stateWA 0.367052 2.490516 0.147 0.883
## data_complete_scaled2$region_stateWI -0.340596 2.509502 -0.136 0.892
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 324.40 on 8547 degrees of freedom
## Residual deviance: 143.69 on 8504 degrees of freedom
## AIC: 1989.2
##
## Number of Fisher Scoring iterations: 6
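The "non-integer #successes" warning above arises because the binomial family expects counts or 0/1 outcomes. One possible alternative for a response scaled to [0, 1] is the quasibinomial family, which accepts fractional values and estimates the dispersion instead of fixing it at 1. The sketch below uses made-up data (`toy`, three states, invented centres) and is not a re-analysis of the housing data; it also tests the factor as a whole rather than one state at a time.

```r
set.seed(2)

# Hypothetical data: three states, 50 rows each.
toy <- data.frame(
  region_state = factor(rep(c("AK", "CA", "TX"), each = 50))
)

# Made-up state-level centres for a price already scaled to [0, 1].
centre <- unname(c(AK = 0.10, CA = 0.20, TX = 0.08)[as.character(toy$region_state)])
toy$median_sale_price <- pmin(pmax(centre + rnorm(150, sd = 0.03), 0), 1)

# quasibinomial accepts a fractional response and estimates the dispersion.
frac_fit <- glm(median_sale_price ~ region_state,
                data = toy, family = quasibinomial)

# A single F test for region_state as a whole, instead of one
# z test per state against the reference level.
overall <- anova(frac_fit, test = "F")
print(overall)

# Fitted values stay inside [0, 1] because of the logit link.
print(range(fitted(frac_fit)))
```

Testing the factor as a whole avoids drawing conclusions from dozens of individual state-versus-reference comparisons, each of which depends on how precisely the reference state is estimated.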
# Print the table
explanatory2 <- c("region_state")
dependent2 <- "median_sale_price"
data_complete_scaled2 %>%
  finalfit(dependent2, explanatory2)
## Dependent: median_sale_price unit value
## region_state AK Mean (sd) 0.1 (0.0)
## AL Mean (sd) 0.1 (0.0)
## AR Mean (sd) 0.0 (0.0)
## AZ Mean (sd) 0.1 (0.0)
## CA Mean (sd) 0.2 (0.1)
## CO Mean (sd) 0.1 (0.0)
## CT Mean (sd) 0.1 (0.0)
## DC Mean (sd) 0.2 (0.0)
## DE Mean (sd) 0.1 (0.0)
## FL Mean (sd) 0.1 (0.0)
## GA Mean (sd) 0.1 (0.0)
## HI Mean (sd) 0.2 (0.0)
## IA Mean (sd) 0.1 (0.0)
## ID Mean (sd) 0.1 (0.0)
## IL Mean (sd) 0.1 (0.0)
## IN Mean (sd) 0.1 (0.0)
## KS Mean (sd) 0.1 (0.0)
## KY Mean (sd) 0.0 (0.0)
## LA Mean (sd) 0.1 (0.0)
## MA Mean (sd) 0.1 (0.0)
## MD Mean (sd) 0.1 (0.0)
## ME Mean (sd) 0.1 (0.0)
## MI Mean (sd) 0.1 (0.0)
## MN Mean (sd) 0.1 (0.0)
## MO Mean (sd) 0.0 (0.0)
## NC Mean (sd) 0.1 (0.0)
## NE Mean (sd) 0.1 (0.0)
## NH Mean (sd) 0.1 (0.0)
## NJ Mean (sd) 0.1 (0.0)
## NM Mean (sd) 0.1 (0.0)
## NV Mean (sd) 0.1 (0.0)
## NY Mean (sd) 0.2 (0.1)
## OH Mean (sd) 0.0 (0.0)
## OK Mean (sd) 0.0 (0.0)
## OR Mean (sd) 0.1 (0.0)
## PA Mean (sd) 0.1 (0.0)
## RI Mean (sd) 0.1 (0.0)
## SC Mean (sd) 0.1 (0.0)
## TN Mean (sd) 0.1 (0.0)
## TX Mean (sd) 0.1 (0.0)
## UT Mean (sd) 0.1 (0.1)
## VA Mean (sd) 0.1 (0.1)
## WA Mean (sd) 0.1 (0.0)
## WI Mean (sd) 0.1 (0.0)
## Coefficient (univariable) Coefficient (multivariable)
## - -
## -0.03 (-0.10 to 0.04, p=0.357) -0.03 (-0.10 to 0.04, p=0.357)
## -0.05 (-0.12 to 0.02, p=0.168) -0.05 (-0.12 to 0.02, p=0.168)
## -0.01 (-0.08 to 0.05, p=0.677) -0.01 (-0.08 to 0.05, p=0.677)
## 0.13 (0.06 to 0.19, p<0.001) 0.13 (0.06 to 0.19, p<0.001)
## 0.04 (-0.03 to 0.11, p=0.256) 0.04 (-0.03 to 0.11, p=0.256)
## 0.00 (-0.07 to 0.07, p=0.991) 0.00 (-0.07 to 0.07, p=0.991)
## 0.10 (0.03 to 0.17, p=0.004) 0.10 (0.03 to 0.17, p=0.004)
## -0.02 (-0.09 to 0.05, p=0.496) -0.02 (-0.09 to 0.05, p=0.496)
## -0.02 (-0.08 to 0.05, p=0.621) -0.02 (-0.08 to 0.05, p=0.621)
## -0.02 (-0.08 to 0.05, p=0.653) -0.02 (-0.08 to 0.05, p=0.653)
## 0.10 (0.03 to 0.16, p=0.006) 0.10 (0.03 to 0.16, p=0.006)
## -0.04 (-0.11 to 0.04, p=0.331) -0.04 (-0.11 to 0.04, p=0.331)
## 0.01 (-0.05 to 0.08, p=0.691) 0.01 (-0.05 to 0.08, p=0.691)
## -0.02 (-0.09 to 0.04, p=0.473) -0.02 (-0.09 to 0.04, p=0.473)
## -0.03 (-0.10 to 0.04, p=0.363) -0.03 (-0.10 to 0.04, p=0.363)
## -0.01 (-0.08 to 0.06, p=0.743) -0.01 (-0.08 to 0.06, p=0.743)
## -0.04 (-0.11 to 0.03, p=0.255) -0.04 (-0.11 to 0.03, p=0.255)
## -0.03 (-0.09 to 0.04, p=0.455) -0.03 (-0.09 to 0.04, p=0.455)
## 0.05 (-0.01 to 0.12, p=0.128) 0.05 (-0.01 to 0.12, p=0.128)
## 0.00 (-0.06 to 0.07, p=0.939) 0.00 (-0.06 to 0.07, p=0.939)
## -0.00 (-0.07 to 0.07, p=0.951) -0.00 (-0.07 to 0.07, p=0.951)
## -0.04 (-0.10 to 0.03, p=0.285) -0.04 (-0.10 to 0.03, p=0.285)
## -0.01 (-0.07 to 0.06, p=0.822) -0.01 (-0.07 to 0.06, p=0.822)
## -0.04 (-0.11 to 0.03, p=0.259) -0.04 (-0.11 to 0.03, p=0.259)
## -0.01 (-0.08 to 0.05, p=0.664) -0.01 (-0.08 to 0.05, p=0.664)
## -0.03 (-0.10 to 0.04, p=0.358) -0.03 (-0.10 to 0.04, p=0.358)
## 0.00 (-0.07 to 0.07, p=0.963) 0.00 (-0.07 to 0.07, p=0.963)
## 0.01 (-0.06 to 0.08, p=0.753) 0.01 (-0.06 to 0.08, p=0.753)
## -0.02 (-0.08 to 0.05, p=0.633) -0.02 (-0.08 to 0.05, p=0.633)
## 0.01 (-0.06 to 0.08, p=0.809) 0.01 (-0.06 to 0.08, p=0.809)
## 0.08 (0.02 to 0.15, p=0.014) 0.08 (0.02 to 0.15, p=0.014)
## -0.04 (-0.11 to 0.03, p=0.220) -0.04 (-0.11 to 0.03, p=0.220)
## -0.04 (-0.11 to 0.02, p=0.197) -0.04 (-0.11 to 0.02, p=0.197)
## 0.03 (-0.04 to 0.10, p=0.369) 0.03 (-0.04 to 0.10, p=0.369)
## -0.02 (-0.09 to 0.05, p=0.531) -0.02 (-0.09 to 0.05, p=0.531)
## -0.01 (-0.08 to 0.06, p=0.748) -0.01 (-0.08 to 0.06, p=0.748)
## -0.02 (-0.09 to 0.05, p=0.536) -0.02 (-0.09 to 0.05, p=0.536)
## -0.01 (-0.08 to 0.06, p=0.793) -0.01 (-0.08 to 0.06, p=0.793)
## -0.02 (-0.08 to 0.05, p=0.652) -0.02 (-0.08 to 0.05, p=0.652)
## 0.02 (-0.05 to 0.09, p=0.524) 0.02 (-0.05 to 0.09, p=0.524)
## 0.03 (-0.04 to 0.10, p=0.387) 0.03 (-0.04 to 0.10, p=0.387)
## 0.03 (-0.03 to 0.10, p=0.311) 0.03 (-0.03 to 0.10, p=0.311)
## -0.02 (-0.09 to 0.04, p=0.484) -0.02 (-0.09 to 0.04, p=0.484)
The correlation plot below shows almost no correlation between median days on market (X1) and total new listings in that area (X2), which is good because it means multicollinearity is not a concern. It also makes sense that the price of a house would be lower the longer it is listed on the market, because a long listing time suggests that fewer people showed interest in it.
corr <- round(cor(data_complete_scaled),1)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, outline.col = "white",
ggtheme = ggplot2::theme_grey,
colors = c("#6D9EC1", "white", "#E46726"))
ggcorrplot(cor(data_complete_scaled), p.mat = cor_pmat(data_complete_scaled), hc.order=TRUE, type='lower')
#install.packages("corrplot")
#library(corrplot)
#corrplot(corr, is.corr = TRUE, win.asp = 1, method = "color", type='lower')
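Multicollinearity can also be checked numerically with a variance inflation factor (VIF), which equals 1 / (1 - R^2) when one predictor is regressed on the others. The sketch below computes it by hand in base R on simulated, independently generated predictors; the data frame `x` is hypothetical and only reuses the column names from the report.

```r
set.seed(3)
n <- 300

# Two predictors generated independently of each other (hypothetical data).
x <- data.frame(
  median_days_on_market = rnorm(n),
  total_new_listings    = rnorm(n)
)

# Pairwise correlation between the two predictors.
r <- cor(x$median_days_on_market, x$total_new_listings)

# R^2 from regressing one predictor on the other; with only two
# predictors this equals the squared pairwise correlation.
r2 <- summary(lm(median_days_on_market ~ total_new_listings, data = x))$r.squared

# VIF close to 1 means the predictor is barely explained by the other one.
vif <- 1 / (1 - r2)

cat("correlation:", round(r, 3), " VIF:", round(vif, 3), "\n")
```

A VIF near 1 corresponds to the near-zero correlation described above; values well above 1 would signal that the coefficient standard errors are being inflated by overlap between predictors.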