Step 1. Scale or normalize your data. Make sure to apply imputation if needed

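The output shows that the three selected numerical variables (median sale price, total new listings and median days on market) have been standardized: each is centered to a mean of zero and scaled to a standard deviation of one.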

preproc1 <- preProcess(data_complete[,c(7,8,14)], method=c("center", "scale"))
data_complete_scaled <- predict(preproc1, data_complete[,c(7,8,14)])
summary(data_complete_scaled)
##  median_sale_price total_new_listings median_days_on_market
##  Min.   :-1.5193   Min.   :-0.9273    Min.   :-1.4734      
##  1st Qu.:-0.5716   1st Qu.:-0.5789    1st Qu.:-0.7290      
##  Median :-0.2458   Median :-0.2629    Median :-0.1598      
##  Mean   : 0.0000   Mean   : 0.0000    Mean   : 0.0000      
##  3rd Qu.: 0.2548   3rd Qu.: 0.2039    3rd Qu.: 0.5408      
##  Max.   :13.4952   Max.   : 8.0932    Max.   :14.2904
# Alternatively, base R's scale() function gives the same centering and scaling
#data_complete_scaled <- as.data.frame(scale(data_complete[,c(7,8,14)]))
#summary(data_complete_scaled)
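
To make the transformation concrete, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are illustrative assumptions, not taken from the dataset. It fills a missing value with the column median (one simple imputation choice among several) and then standardizes each column by hand as (x - mean) / sd, which is the arithmetic behind the "center" and "scale" methods.

```r
# Toy data with one missing value (illustrative values only)
toy <- data.frame(
  price = c(200, 250, NA, 400, 150),
  days  = c(30, 45, 20, 60, 25)
)

# Median imputation: replace each NA with the median of its column
toy_imputed <- as.data.frame(lapply(toy, function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}))

# Standardize by hand: subtract the mean, divide by the standard deviation
toy_scaled <- as.data.frame(
  lapply(toy_imputed, function(x) (x - mean(x)) / sd(x))
)

# Each column should now have mean 0 and standard deviation 1
col_means <- colMeans(toy_scaled)
col_sds   <- sapply(toy_scaled, sd)
print(round(col_means, 10))
print(col_sds)

# Largest difference from base R's scale(); should be numerically zero
max_diff <- max(abs(as.matrix(toy_scaled) - scale(toy_imputed)))
print(max_diff)
```

Imputing before scaling matters here because the mean and standard deviation used for scaling are then computed on the completed column.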

Step 2. Build a multiple linear regression model or logistic regression (based on your Y)

The multiple linear regression model regresses the median sale price (Y) on two predictors: median days on market (X1) and total new listings in that area (X2). All three variables are the standardized versions from Step 1.

linear_model <- lm(data_complete_scaled$median_sale_price ~ data_complete_scaled$median_days_on_market + data_complete_scaled$total_new_listings, data=data_complete_scaled)
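
As a side note on style, the same kind of model can be written with bare column names in the formula, letting `lm()` look them up through the `data` argument. The sketch below uses simulated data; `toy`, `x1`, `x2` and `y` are hypothetical names. One practical benefit is that `predict()` can then score new rows, which does not work as expected when the formula terms are written as `df$column`.

```r
set.seed(1)

# Simulated data (hypothetical names and coefficients)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 0.5 - 0.3 * toy$x1 + 0.1 * toy$x2 + rnorm(50, sd = 0.2)

# Bare column names in the formula; lm() finds them in `data`
fit <- lm(y ~ x1 + x2, data = toy)

# Term names are short and readable: "(Intercept)" "x1" "x2"
print(names(coef(fit)))

# Because the terms are plain names, predict() can score new rows
new_rows <- data.frame(x1 = c(0, 1), x2 = c(0, 0))
preds <- predict(fit, newdata = new_rows)
print(preds)
```

The fitted coefficients are the same either way; only the term labels and the behaviour of `predict()` with `newdata` differ.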

Step 3. Print summary and interpret table (see lecture slides). Describe the summary

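The summary shows that both median days on market (X1) and total new listings in that area (X2) are statistically significant predictors of the median sale price (Y), as both p-values, Pr(>|t|), are below 0.05. The effects are small, however: the standardized coefficients are about -0.094 for X1 and 0.037 for X2, and the model explains only about 1% of the variance in Y (multiple R-squared = 0.0102). With 8,548 observations even weak relationships reach significance, so a small p-value here should not be read as a strong influence.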

summary(linear_model)
## 
## Call:
## lm(formula = data_complete_scaled$median_sale_price ~ data_complete_scaled$median_days_on_market + 
##     data_complete_scaled$total_new_listings, data = data_complete_scaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5697 -0.5671 -0.2375  0.2390 14.8289 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## (Intercept)                                -3.326e-17  1.076e-02   0.000
## data_complete_scaled$median_days_on_market -9.413e-02  1.076e-02  -8.746
## data_complete_scaled$total_new_listings     3.720e-02  1.076e-02   3.457
##                                            Pr(>|t|)    
## (Intercept)                                 1.00000    
## data_complete_scaled$median_days_on_market  < 2e-16 ***
## data_complete_scaled$total_new_listings     0.00055 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.995 on 8545 degrees of freedom
## Multiple R-squared:  0.01021,    Adjusted R-squared:  0.009978 
## F-statistic: 44.07 on 2 and 8545 DF,  p-value: < 2.2e-16
explanatory <- c("median_days_on_market", "total_new_listings")
dependent <- "median_sale_price"
data_complete_scaled %>%
  finalfit(dependent, explanatory)
##  Dependent: median_sale_price                  unit      value
##         median_days_on_market [-1.5,14.3] Mean (sd) -0.0 (1.0)
##            total_new_listings  [-0.9,8.1] Mean (sd) -0.0 (1.0)
##        Coefficient (univariable)     Coefficient (multivariable)
##  -0.09 (-0.12 to -0.07, p<0.001) -0.09 (-0.12 to -0.07, p<0.001)
##     0.04 (0.02 to 0.06, p=0.001)    0.04 (0.02 to 0.06, p=0.001)
plot(linear_model)
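
When reading a summary table it helps to look at the p-value and R-squared side by side. The sketch below uses simulated data with a deliberately weak effect (the slope of 0.1 and the sample size are assumptions chosen for illustration) to show how a large sample can produce a very small p-value while the model still explains little variance. It also shows how to pull these numbers out of the summary object.

```r
set.seed(42)

# Large sample, weak true effect (illustrative values)
n <- 5000
x <- rnorm(n)
y <- 0.1 * x + rnorm(n)

fit <- lm(y ~ x)
s <- summary(fit)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- s$coefficients
print(coef_table)

# p-value for the slope and the share of variance explained
p_value   <- coef_table["x", "Pr(>|t|)"]
r_squared <- s$r.squared
print(p_value)
print(r_squared)

# 95% confidence intervals for the coefficients
ci <- confint(fit)
print(ci)
```

Reporting the coefficient with its confidence interval and R-squared, rather than the p-value alone, gives a fuller picture of how much a predictor matters.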

Step 4. Perform another model and evaluate which model performs better.

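I fitted a logistic regression of the median sale price (Y), rescaled to the range [0, 1], on region state (X3), which is a categorical variable. None of the state coefficients in the glm summary is significant (all p-values are above 0.3), so this model gives no evidence of a relationship between Y and X3. Two caveats apply. First, Y is continuous rather than binary, which is why R warns about non-integer successes, so the binomial standard errors and AIC should be read with caution. Second, the finalfit table, which appears to fit a linear model to the same variables, does flag CA, DC, HI and NY as significantly different from the reference state AK. On the evidence of the glm summary, the linear model from Step 2 performs better, although the two models use different response scales and are not directly comparable.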

data_complete2 <- data_complete
#data_complete2$median_sale_price <- log(data_complete2$median_sale_price)
preproc2 <- preProcess(data_complete[, c(3,7)], method=c("range"))
data_complete_scaled2 <- predict(preproc2, data_complete2)
summary(data_complete_scaled2)
##    region_id    region_name        region_state       region_type       
##  Min.   :   2   Length:8548        Length:8548        Length:8548       
##  1st Qu.: 495   Class :character   Class :character   Class :character  
##  Median :1512   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1604                                                           
##  3rd Qu.:2450                                                           
##  Max.   :3230                                                           
##   period_begin                 total_homes_sold median_sale_price
##  Min.   :2020-01-06 00:00:00   Min.   :   1.0   Min.   :0.00000  
##  1st Qu.:2020-03-23 00:00:00   1st Qu.:  81.0   1st Qu.:0.06312  
##  Median :2020-06-08 00:00:00   Median : 152.0   Median :0.08482  
##  Mean   :2020-06-05 12:26:57   Mean   : 216.3   Mean   :0.10119  
##  3rd Qu.:2020-08-17 00:00:00   3rd Qu.: 270.0   3rd Qu.:0.11816  
##  Max.   :2020-10-26 00:00:00   Max.   :2363.0   Max.   :1.00000  
##  total_new_listings median_new_listing_price active_listings  
##  Min.   :   1.0     Min.   :  89000          Min.   :   24.0  
##  1st Qu.:  98.0     1st Qu.: 265000          1st Qu.:  855.8  
##  Median : 186.0     Median : 329900          Median : 1545.0  
##  Mean   : 259.2     Mean   : 379438          Mean   : 2484.8  
##  3rd Qu.: 316.0     3rd Qu.: 429900          3rd Qu.: 2850.0  
##  Max.   :2513.0     Max.   :1650000          Max.   :21084.0  
##  median_active_list_price average_of_median_list_price
##  Min.   : 100000          Min.   :  15000             
##  1st Qu.: 289900          1st Qu.: 307450             
##  Median : 360000          Median : 399900             
##  Mean   : 411947          Mean   : 452090             
##  3rd Qu.: 472875          3rd Qu.: 529815             
##  Max.   :1599999          Max.   :3498000             
##  average_of_median_offer_price median_days_on_market
##  Min.   :    9000              Min.   :  3.00       
##  1st Qu.:  300000              1st Qu.: 20.00       
##  Median :  399000              Median : 33.00       
##  Mean   :  453531              Mean   : 36.65       
##  3rd Qu.:  530000              3rd Qu.: 49.00       
##  Max.   :16825370              Max.   :363.00
logit_model = glm(formula = data_complete_scaled2$median_sale_price ~ data_complete_scaled2$region_state, data = data_complete_scaled2, 
              family = binomial)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(logit_model)
## 
## Call:
## glm(formula = data_complete_scaled2$median_sale_price ~ data_complete_scaled2$region_state, 
##     family = binomial, data = data_complete_scaled2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.44092  -0.06679  -0.01098   0.04760   1.87351  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          -2.328458   2.485915  -0.937    0.349
## data_complete_scaled2$region_stateAL -0.480333   2.557036  -0.188    0.851
## data_complete_scaled2$region_stateAR -0.877865   2.821646  -0.311    0.756
## data_complete_scaled2$region_stateAZ -0.190592   2.509134  -0.076    0.939
## data_complete_scaled2$region_stateCA  1.043279   2.487423   0.419    0.675
## data_complete_scaled2$region_stateCO  0.405049   2.490892   0.163    0.871
## data_complete_scaled2$region_stateCT  0.004749   2.512979   0.002    0.998
## data_complete_scaled2$region_stateDC  0.867426   2.516331   0.345    0.730
## data_complete_scaled2$region_stateDE -0.342118   2.615430  -0.131    0.896
## data_complete_scaled2$region_stateFL -0.228935   2.490982  -0.092    0.927
## data_complete_scaled2$region_stateGA -0.206442   2.495340  -0.083    0.934
## data_complete_scaled2$region_stateHI  0.845426   2.516751   0.336    0.737
## data_complete_scaled2$region_stateIA -0.558026   2.827840  -0.197    0.844
## data_complete_scaled2$region_stateID  0.159142   2.522079   0.063    0.950
## data_complete_scaled2$region_stateIL -0.348811   2.497950  -0.140    0.889
## data_complete_scaled2$region_stateIN -0.467608   2.525910  -0.185    0.853
## data_complete_scaled2$region_stateKS -0.153340   2.611177  -0.059    0.953
## data_complete_scaled2$region_stateKY -0.627398   2.551886  -0.246    0.806
## data_complete_scaled2$region_stateLA -0.371053   2.534121  -0.146    0.884
## data_complete_scaled2$region_stateMA  0.518737   2.491666   0.208    0.835
## data_complete_scaled2$region_stateMD  0.031990   2.491041   0.013    0.990
## data_complete_scaled2$region_stateME -0.027056   2.585597  -0.010    0.992
## data_complete_scaled2$region_stateMI -0.569998   2.505281  -0.228    0.820
## data_complete_scaled2$region_stateMN -0.099276   2.501652  -0.040    0.968
## data_complete_scaled2$region_stateMO -0.618999   2.550401  -0.243    0.808
## data_complete_scaled2$region_stateNC -0.198758   2.496417  -0.080    0.937
## data_complete_scaled2$region_stateNE -0.485422   2.601005  -0.187    0.852
## data_complete_scaled2$region_stateNH  0.019726   2.511005   0.008    0.994
## data_complete_scaled2$region_stateNJ  0.125727   2.491167   0.050    0.960
## data_complete_scaled2$region_stateNM -0.224021   2.542958  -0.088    0.930
## data_complete_scaled2$region_stateNV  0.098844   2.513073   0.039    0.969
## data_complete_scaled2$region_stateNY  0.763282   2.496235   0.306    0.760
## data_complete_scaled2$region_stateOH -0.682701   2.506941  -0.272    0.785
## data_complete_scaled2$region_stateOK -0.737949   2.535204  -0.291    0.771
## data_complete_scaled2$region_stateOR  0.331066   2.494513   0.133    0.894
## data_complete_scaled2$region_statePA -0.298989   2.497610  -0.120    0.905
## data_complete_scaled2$region_stateRI -0.145015   2.523541  -0.057    0.954
## data_complete_scaled2$region_stateSC -0.294939   2.504714  -0.118    0.906
## data_complete_scaled2$region_stateTN -0.115858   2.496503  -0.046    0.963
## data_complete_scaled2$region_stateTX -0.206583   2.490530  -0.083    0.934
## data_complete_scaled2$region_stateUT  0.244043   2.502481   0.098    0.922
## data_complete_scaled2$region_stateVA  0.319310   2.489414   0.128    0.898
## data_complete_scaled2$region_stateWA  0.367052   2.490516   0.147    0.883
## data_complete_scaled2$region_stateWI -0.340596   2.509502  -0.136    0.892
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 324.40  on 8547  degrees of freedom
## Residual deviance: 143.69  on 8504  degrees of freedom
## AIC: 1989.2
## 
## Number of Fisher Scoring iterations: 6
#Print the table
explanatory2 = c("region_state")
dependent2 = 'median_sale_price'
data_complete_scaled2 %>%
  finalfit(dependent2, explanatory2)
##  Dependent: median_sale_price         unit     value
##                  region_state AK Mean (sd) 0.1 (0.0)
##                               AL Mean (sd) 0.1 (0.0)
##                               AR Mean (sd) 0.0 (0.0)
##                               AZ Mean (sd) 0.1 (0.0)
##                               CA Mean (sd) 0.2 (0.1)
##                               CO Mean (sd) 0.1 (0.0)
##                               CT Mean (sd) 0.1 (0.0)
##                               DC Mean (sd) 0.2 (0.0)
##                               DE Mean (sd) 0.1 (0.0)
##                               FL Mean (sd) 0.1 (0.0)
##                               GA Mean (sd) 0.1 (0.0)
##                               HI Mean (sd) 0.2 (0.0)
##                               IA Mean (sd) 0.1 (0.0)
##                               ID Mean (sd) 0.1 (0.0)
##                               IL Mean (sd) 0.1 (0.0)
##                               IN Mean (sd) 0.1 (0.0)
##                               KS Mean (sd) 0.1 (0.0)
##                               KY Mean (sd) 0.0 (0.0)
##                               LA Mean (sd) 0.1 (0.0)
##                               MA Mean (sd) 0.1 (0.0)
##                               MD Mean (sd) 0.1 (0.0)
##                               ME Mean (sd) 0.1 (0.0)
##                               MI Mean (sd) 0.1 (0.0)
##                               MN Mean (sd) 0.1 (0.0)
##                               MO Mean (sd) 0.0 (0.0)
##                               NC Mean (sd) 0.1 (0.0)
##                               NE Mean (sd) 0.1 (0.0)
##                               NH Mean (sd) 0.1 (0.0)
##                               NJ Mean (sd) 0.1 (0.0)
##                               NM Mean (sd) 0.1 (0.0)
##                               NV Mean (sd) 0.1 (0.0)
##                               NY Mean (sd) 0.2 (0.1)
##                               OH Mean (sd) 0.0 (0.0)
##                               OK Mean (sd) 0.0 (0.0)
##                               OR Mean (sd) 0.1 (0.0)
##                               PA Mean (sd) 0.1 (0.0)
##                               RI Mean (sd) 0.1 (0.0)
##                               SC Mean (sd) 0.1 (0.0)
##                               TN Mean (sd) 0.1 (0.0)
##                               TX Mean (sd) 0.1 (0.0)
##                               UT Mean (sd) 0.1 (0.1)
##                               VA Mean (sd) 0.1 (0.1)
##                               WA Mean (sd) 0.1 (0.0)
##                               WI Mean (sd) 0.1 (0.0)
##       Coefficient (univariable)    Coefficient (multivariable)
##                               -                              -
##  -0.03 (-0.10 to 0.04, p=0.357) -0.03 (-0.10 to 0.04, p=0.357)
##  -0.05 (-0.12 to 0.02, p=0.168) -0.05 (-0.12 to 0.02, p=0.168)
##  -0.01 (-0.08 to 0.05, p=0.677) -0.01 (-0.08 to 0.05, p=0.677)
##    0.13 (0.06 to 0.19, p<0.001)   0.13 (0.06 to 0.19, p<0.001)
##   0.04 (-0.03 to 0.11, p=0.256)  0.04 (-0.03 to 0.11, p=0.256)
##   0.00 (-0.07 to 0.07, p=0.991)  0.00 (-0.07 to 0.07, p=0.991)
##    0.10 (0.03 to 0.17, p=0.004)   0.10 (0.03 to 0.17, p=0.004)
##  -0.02 (-0.09 to 0.05, p=0.496) -0.02 (-0.09 to 0.05, p=0.496)
##  -0.02 (-0.08 to 0.05, p=0.621) -0.02 (-0.08 to 0.05, p=0.621)
##  -0.02 (-0.08 to 0.05, p=0.653) -0.02 (-0.08 to 0.05, p=0.653)
##    0.10 (0.03 to 0.16, p=0.006)   0.10 (0.03 to 0.16, p=0.006)
##  -0.04 (-0.11 to 0.04, p=0.331) -0.04 (-0.11 to 0.04, p=0.331)
##   0.01 (-0.05 to 0.08, p=0.691)  0.01 (-0.05 to 0.08, p=0.691)
##  -0.02 (-0.09 to 0.04, p=0.473) -0.02 (-0.09 to 0.04, p=0.473)
##  -0.03 (-0.10 to 0.04, p=0.363) -0.03 (-0.10 to 0.04, p=0.363)
##  -0.01 (-0.08 to 0.06, p=0.743) -0.01 (-0.08 to 0.06, p=0.743)
##  -0.04 (-0.11 to 0.03, p=0.255) -0.04 (-0.11 to 0.03, p=0.255)
##  -0.03 (-0.09 to 0.04, p=0.455) -0.03 (-0.09 to 0.04, p=0.455)
##   0.05 (-0.01 to 0.12, p=0.128)  0.05 (-0.01 to 0.12, p=0.128)
##   0.00 (-0.06 to 0.07, p=0.939)  0.00 (-0.06 to 0.07, p=0.939)
##  -0.00 (-0.07 to 0.07, p=0.951) -0.00 (-0.07 to 0.07, p=0.951)
##  -0.04 (-0.10 to 0.03, p=0.285) -0.04 (-0.10 to 0.03, p=0.285)
##  -0.01 (-0.07 to 0.06, p=0.822) -0.01 (-0.07 to 0.06, p=0.822)
##  -0.04 (-0.11 to 0.03, p=0.259) -0.04 (-0.11 to 0.03, p=0.259)
##  -0.01 (-0.08 to 0.05, p=0.664) -0.01 (-0.08 to 0.05, p=0.664)
##  -0.03 (-0.10 to 0.04, p=0.358) -0.03 (-0.10 to 0.04, p=0.358)
##   0.00 (-0.07 to 0.07, p=0.963)  0.00 (-0.07 to 0.07, p=0.963)
##   0.01 (-0.06 to 0.08, p=0.753)  0.01 (-0.06 to 0.08, p=0.753)
##  -0.02 (-0.08 to 0.05, p=0.633) -0.02 (-0.08 to 0.05, p=0.633)
##   0.01 (-0.06 to 0.08, p=0.809)  0.01 (-0.06 to 0.08, p=0.809)
##    0.08 (0.02 to 0.15, p=0.014)   0.08 (0.02 to 0.15, p=0.014)
##  -0.04 (-0.11 to 0.03, p=0.220) -0.04 (-0.11 to 0.03, p=0.220)
##  -0.04 (-0.11 to 0.02, p=0.197) -0.04 (-0.11 to 0.02, p=0.197)
##   0.03 (-0.04 to 0.10, p=0.369)  0.03 (-0.04 to 0.10, p=0.369)
##  -0.02 (-0.09 to 0.05, p=0.531) -0.02 (-0.09 to 0.05, p=0.531)
##  -0.01 (-0.08 to 0.06, p=0.748) -0.01 (-0.08 to 0.06, p=0.748)
##  -0.02 (-0.09 to 0.05, p=0.536) -0.02 (-0.09 to 0.05, p=0.536)
##  -0.01 (-0.08 to 0.06, p=0.793) -0.01 (-0.08 to 0.06, p=0.793)
##  -0.02 (-0.08 to 0.05, p=0.652) -0.02 (-0.08 to 0.05, p=0.652)
##   0.02 (-0.05 to 0.09, p=0.524)  0.02 (-0.05 to 0.09, p=0.524)
##   0.03 (-0.04 to 0.10, p=0.387)  0.03 (-0.04 to 0.10, p=0.387)
##   0.03 (-0.03 to 0.10, p=0.311)  0.03 (-0.03 to 0.10, p=0.311)
##  -0.02 (-0.09 to 0.04, p=0.484) -0.02 (-0.09 to 0.04, p=0.484)
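
For comparison, logistic regression is usually fitted to a binary outcome. The following is a minimal sketch of that alternative setup, not the analysis above: it uses simulated prices for two hypothetical states ("A" and "B") and defines a 0/1 outcome as "price above the overall median". All names and values are assumptions for illustration.

```r
set.seed(7)

# Simulated prices for two hypothetical states
toy <- data.frame(
  state = factor(rep(c("A", "B"), each = 200)),
  price = c(rnorm(200, mean = 300, sd = 50),
            rnorm(200, mean = 400, sd = 50))
)

# Binary outcome: 1 if the price is above the overall median, else 0
toy$high_price <- as.integer(toy$price > median(toy$price))

# Logistic regression on a true 0/1 response
logit_fit <- glm(high_price ~ state, data = toy, family = binomial)
print(summary(logit_fit)$coefficients)

# Predicted probability of an above-median price in each state
new_states <- data.frame(
  state = factor(c("A", "B"), levels = levels(toy$state))
)
probs <- predict(logit_fit, newdata = new_states, type = "response")
print(probs)
```

With a 0/1 response the binomial family is appropriate, and the coefficients can be read as log-odds relative to the reference state.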

Step 5. I would also like to explore the correlation between the variables

The graph below shows almost no correlation between median days on market (X1) and total new listings in that area (X2), which is good because it means multicollinearity between the two predictors is not a concern. The negative relationship between days on market and sale price also makes sense: a house that stays on the market for a long time has attracted less buyer interest, so it tends to sell for less.

corr <- round(cor(data_complete_scaled),1)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, outline.col = "white",
           ggtheme = ggplot2::theme_grey,
           colors = c("#6D9EC1", "white", "#E46726"))

ggcorrplot(cor(data_complete_scaled), p.mat = cor_pmat(data_complete_scaled), hc.order=TRUE, type='lower')

#install.packages("corrplot")
#library(corrplot)
#corrplot(corr, is.corr = TRUE, win.asp = 1, method = "color", type='lower')
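
The multicollinearity check can also be expressed as a number. With exactly two predictors, the variance inflation factor reduces to VIF = 1 / (1 - r^2), where r is their correlation. The sketch below computes it on simulated, independent predictors; `toy`, `x1` and `x2` are hypothetical names, and the data are not from the dataset.

```r
set.seed(3)

# Two independent simulated predictors (illustrative only)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))

# Pearson correlation between the two predictors
r <- cor(toy$x1, toy$x2)
print(r)

# Variance inflation factor for the two-predictor case
vif <- 1 / (1 - r^2)
print(vif)

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(toy$x1, toy$x2)
print(ct$p.value)
```

A VIF close to 1 means the standard errors of the regression coefficients are barely inflated by correlation between the predictors.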