
library(magrittr) #pipe operator
library(car) # Anova()
library(GGally) #ggpairs()
library(lmtest) # dwtest()
library(olsrr) # ols_step_best_subset()
library(dplyr) # mutate()

General note:

A significance level \(\alpha=5\%\) is used.

Introduction

The public records of home sales from May 2014 through May 2015 in King County, Washington State, USA are posted on Kaggle (2016) (also posted on GeoDa (2020)) with the intention of predicting house prices using regression.

Data Description/ Background information

The dataset holds 21 columns and 21,613 records/rows, of which a filtered subset of 6 columns and 4,385 records will be used. Only 1 response variable exists, price. Out of many possible regressor variables, only 5 are chosen for simplicity.

The 6 columns used are price, bedrooms, bathrooms, sqft_living, sqft_lot, and floors where price is our response variable and the remaining 5 are regressors.

How was data filtered?

Only records with 1, 2 or 3 bedrooms, bathrooms and floors are retained (see the filter in the Read Data section below), thus reducing the data count from 21,613 to 4,385 rows.

Column descriptions:

For all 21 columns, refer to GeoDa (2020). For the chosen 6 columns,

price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where 0.5 accounts for a room with a toilet but no shower
sqft_living - Square footage of the apartment's interior living space
sqft_lot - Square footage of the land space
floors - Number of floors

Collection method described

No collection method has been described other than for the zip codes of King County, which were retrieved from King County GIS Open Data, GIS (2023).

The House Sales data is provided as direct csv downloads, available from,

Option 1: Direct download from Kaggle (2016)

Option 2: Direct download from the GeoDa (2020) webpage prepared by the Center for Spatial Data Science, CSDS (2020). Alternatively, navigate the CSDS (2020) website: open the data tab, filter by the Housing category, and choose 2014-2015 Home Sales in King County, WA.

Objective

Our aim for the House Sales data is to determine a relevant regression model that predicts the price of house sales using at least one of the 5 regressors. This is to be achieved by fitting different significant regression models through model selection and variable selection methods. The relevance of the finalized models in the context of the housing data will be stated.

Read Data

kc_house_raw <- read.csv("C:/Users/admin/Downloads/aa/kaggle/kc_house_data - Copy.csv") %>% dplyr::select(3:8) 
print(paste("Row count of raw dataset -", nrow(kc_house_raw)))
## [1] "Row count of raw dataset - 21613"
# Filter data. Only include 1, 2 and 3 for the number of bedrooms, bathrooms and floors.
kc_house <- kc_house_raw %>% dplyr::filter(bedrooms %in% c(1,2,3), bathrooms %in% c(1,2,3), floors %in% c(1,2,3)) 
print(paste("Row count of finalized dataset -", nrow(kc_house)))
## [1] "Row count of finalized dataset - 4385"
head(kc_house)
##    price bedrooms bathrooms sqft_living sqft_lot floors
## 1 221900        3         1        1180     5650      1
## 2 180000        2         1         770    10000      1
## 3 510000        3         2        1680     8080      1
## 4 229500        3         1        1780     7470      1
## 5 468000        2         1        1160     6000      1
## 6 395000        3         2        1890    14040      2

Identification of the response and the regressor variables.

For fitting the linear regression model, the response is price and the regressor variables are bedrooms, bathrooms, sqft_living, sqft_lot, and floors, each described as follows,

y = price = Price of each home sold
x1 = bedrooms = Number of bedrooms
x2 = bathrooms = Number of bathrooms
x3 = sqft_living = Square footage of the apartment's interior living space
x4 = sqft_lot = Square footage of the land space
x5 = floors = Number of floors

Note: bedrooms, bathrooms, and floors are categorical variables, each holding the values 1, 2, and 3. sqft_living and sqft_lot are continuous variables. The response, price, is continuous as well.
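Because these three variables take only the values 1, 2 and 3, they will be wrapped in factor() when fitting models so that R estimates a separate effect per level rather than a single slope. A minimal sketch of the difference (illustrative only):

coef(lm(price ~ bedrooms, data = kc_house))         # numeric coding: one shared slope across levels
coef(lm(price ~ factor(bedrooms), data = kc_house)) # factor coding: separate effects for 2 and 3 bedrooms vs 1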

Fit linear model

Let's first fit a model for price (y) with all 5 variables using lm().

full= lm(price ~ factor(bedrooms) + factor(bathrooms) + sqft_living + sqft_lot + factor(floors), data=kc_house)

Discussion of possible relationships between variables and the relevance of the relationship in the context of the data.

Let's see if the 5 regressor variables have significant relationships among themselves. If present, we say the variables are correlated and multicollinearity is present. Multicollinearity impacts the estimates of the individual regression coefficients (it inflates the variance/covariance of the least squares estimates). Multicollinearity in the data can be checked using the correlation coefficients or using the Variance Inflation Factor (VIF) statistic.

In the context of the kc_house data, we expect correlations between all 5 variables. For example, we generally expect houses with more floors to have more bedrooms and bathrooms. When the lot size is larger, we generally expect the living area to be larger, and vice versa. Similar statements can be made for every other variable pair. If the relationships are found to be significant or relevant, then interactions between the variables can be introduced as additional parameters of the fitted model. Let's now check for the presence of multicollinearity.

Let's plot the correlation coefficients between each pair of variables,

ggpairs(data = kc_house, columns = c(1,2,3,4,5,6), progress = FALSE) # library(GGally)

As shown in the matrix above,
- bedrooms and bathrooms are correlated by 0.316 (correlation coefficient). bedrooms and sqft_living by 0.422.
- bathrooms and sqft_living by 0.645. bathrooms and floors by 0.479.
- sqft_living and floors by 0.305.
- All other correlations are negligible.
Overall, there is indication of multicollinearity.

To test multicollinearity statistically, we can use the Variance Inflation Factor (VIF). The VIF can be viewed as the factor by which the variance of a coefficient is increased due to multicollinearity.

car::vif(full) 
##                       GVIF Df GVIF^(1/(2*Df))
## factor(bedrooms)  1.253801  2        1.058174
## factor(bathrooms) 2.172852  2        1.214109
## sqft_living       2.029398  1        1.424569
## sqft_lot          1.029082  1        1.014437
## factor(floors)    1.442439  2        1.095909

Since we have categorical variables, Generalized Variance Inflation Factors (GVIF) are given, along with the standardized GVIF (GVIF^(1/(2×Df))). Since each GVIF is < 5, multicollinearity is not significant (a VIF between 5 and 10 is considered significant).
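As a cross-check, the VIF of a single-Df term can be computed by hand as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the other predictors; a minimal sketch for sqft_living, which should reproduce its GVIF of about 2.03:

r2_j <- summary(lm(sqft_living ~ factor(bedrooms) + factor(bathrooms) + sqft_lot + factor(floors), data = kc_house))$r.squared
1/(1 - r2_j) # manual VIF for sqft_living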

Descriptive statistics and Relative graphs, justifications and comments

Let's get the summary statistics of the kc_house dataset,

summary(kc_house)
##      price            bedrooms       bathrooms     sqft_living  
##  Min.   :  78000   Min.   :1.000   Min.   :1.00   Min.   : 390  
##  1st Qu.: 242500   1st Qu.:2.000   1st Qu.:1.00   1st Qu.: 970  
##  Median : 333000   Median :3.000   Median :1.00   Median :1200  
##  Mean   : 371431   Mean   :2.563   Mean   :1.34   Mean   :1309  
##  3rd Qu.: 450000   3rd Qu.:3.000   3rd Qu.:2.00   3rd Qu.:1510  
##  Max.   :3100000   Max.   :3.000   Max.   :3.00   Max.   :4910  
##     sqft_lot           floors     
##  Min.   :    600   Min.   :1.000  
##  1st Qu.:   5000   1st Qu.:1.000  
##  Median :   7208   Median :1.000  
##  Mean   :  12195   Mean   :1.128  
##  3rd Qu.:   9568   3rd Qu.:1.000  
##  Max.   :1164794   Max.   :3.000
sd(kc_house$price) # Standard deviation
## [1] 193926

For response, price :

Since we are generating a regression model which estimates the response, price, let's focus on price's statistics.

mean and median :
Since the mean and median are not close, the distribution is likely to be skewed.
The mean price of a house sold is 371,431 USD. The median is 333,000 USD.

max and min :
Since the max (3,100,000) is farther from the mean (371,431) than the min (78,000), the distribution will be right-skewed.

standard deviation :
Standard deviation of 193,926 USD suggests the data is spread quite a lot in relation to the mean price.
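The direction and degree of skew suggested by these statistics can also be quantified with the sample skewness coefficient \(m_3/s^3\); a minimal base-R sketch (a positive value indicates right skew):

x <- kc_house$price
mean((x - mean(x))^3) / sd(x)^3 # sample skewness of price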

Now let's plot the histogram of the prices of the houses sold and see if the summary statistics are justified,

hist(kc_house$price, xlab = 'Price of house sold (USD)', main = "Histogram of sold house's price" )

As we expected, the distribution of the price response is right-skewed.

Comment: A transformation (Box-Cox) of the response variable might be necessary to satisfy the normality assumption required by the least squares parameter estimation for linear model fitting.

For regressor variables :

Let's look at the scatter plots between the response (price) and the continuous variables, sqft_living and sqft_lot.

par(mfrow=c(1,2))
plot(kc_house$sqft_lot, kc_house$price, 
     xlab = "Size of lot", ylab = "Price of house sold (USD)",
     main="Price vs lot size")

plot(kc_house$sqft_living, kc_house$price, 
     xlab = "Size of living area", ylab = "Price of house sold (USD)",
     main="Price vs living area size")

print("Correlation coefficients between, ")
## [1] "Correlation coefficients between, "
print(paste("Price and lot size is :", cor(kc_house$price,kc_house$sqft_lot)))
## [1] "Price and lot size is : 0.0459611411386067"
print(paste("Price and living area is :", cor(kc_house$price,kc_house$sqft_living)))
## [1] "Price and living area is : 0.542327732890229"
print(paste("Price and number of bedrooms is :", cor(kc_house$price,kc_house$bedrooms)))
## [1] "Price and number of bedrooms is : 0.0935197418746122"
print(paste("Price and number of bathrooms is :", cor(kc_house$price,kc_house$bathrooms)))
## [1] "Price and number of bathrooms is : 0.388467122389912"
print(paste("Price and number of floors is :", cor(kc_house$price,kc_house$floors)))
## [1] "Price and number of floors is : 0.264817238298091"

From the 2 scatter plots, as the size of the lot or the living area increases, the price increases, suggesting a linear relationship.

Summarizing the outputs above,
- A moderate positive linear relationship (0.54) exists between the response, price (y), and the regressor, sqft_living (x3).
- A low positive linear relationship (0.05) exists between the response, price (y), and the regressor, sqft_lot (x4).
- Positive correlation coefficients of 0.09, 0.39 and 0.26 for the categorical regressors, number of bedrooms (x1), bathrooms (x2) and floors (x5), suggest a linear relationship with the response as well. (Scatter plots not shown for simplicity)

Thus, since all the regressors exhibit a linear relationship with the response variable, fitting a linear regression model seems apt.

Model 1 (Linear model)

Fit \(full\) model

Recall the fitted linear regression model, full, where price is the continuous response and sqft_living and sqft_lot are continuous predictors. bedrooms, bathrooms and floors are categorical variables and hence factorized when fed into the model.

full= lm(price ~ factor(bedrooms) + factor(bathrooms) + sqft_living + sqft_lot + factor(floors), data=kc_house)
summary(full) 
## 
## Call:
## lm(formula = price ~ factor(bedrooms) + factor(bathrooms) + sqft_living + 
##     sqft_lot + factor(floors), data = kc_house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -492457 -101938  -21165   78696 2081165 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.261e+05  1.495e+04   8.435  < 2e-16 ***
## factor(bedrooms)2  -3.973e+03  1.456e+04  -0.273  0.78500    
## factor(bedrooms)3  -6.908e+04  1.476e+04  -4.680 2.95e-06 ***
## factor(bathrooms)2  3.482e+02  6.773e+03   0.051  0.95901    
## factor(bathrooms)3  7.083e+04  1.609e+04   4.402 1.10e-05 ***
## sqft_living         2.152e+02  6.807e+00  31.610  < 2e-16 ***
## sqft_lot           -1.989e-01  6.493e-02  -3.063  0.00220 ** 
## factor(floors)2     5.022e+04  9.419e+03   5.332 1.02e-07 ***
## factor(floors)3     5.573e+04  1.943e+04   2.868  0.00415 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 158400 on 4376 degrees of freedom
## Multiple R-squared:  0.3338, Adjusted R-squared:  0.3326 
## F-statistic: 274.1 on 8 and 4376 DF,  p-value: < 2.2e-16

summary() suggests the linear regression model full is significant at the 5% significance level (p-value < 2.2e-16). The R-squared suggests 33.38% of the variability in price is explained by the model.

The fitted model equation using all the 5 variables is,
\[\hat y= 126061.8 - 3973.2*bedrooms2 - 69077.7*bedrooms3 + 348.1*bathrooms2 + 70825.4*bathrooms3 + 215.1*sqft\_living - 0.2*sqft\_lot + 50222*floors2 + 55727.1*floors3\] \[or\] \[E(y)= 126061.8 - 3973.2*bedrooms2 - 69077.7*bedrooms3 + 348.1*bathrooms2 + 70825.4*bathrooms3 + 215.1*sqft\_living - 0.2*sqft\_lot + 50222*floors2 + 55727.1*floors3\]

Graphical and statistical tests of assumptions for \(full\) model (Residual Analysis)

Let's perform residual analysis to check if any model assumptions have been violated.

The estimation error (or residual) is defined by:

\(\hat{\epsilon_i} = Y_i - \hat{Y_i}\) (i.e., observed value minus fitted value)
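As a sanity check, the residuals reported by R for the \(full\) model are exactly the observed prices minus the fitted values; a minimal sketch:

all.equal(unname(residuals(full)), kc_house$price - unname(fitted(full))) # should return TRUE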

Residual checks are done by plotting residual plots, which reveal the following potential problems:

  1. Residuals vs Fitted to check Linearity and Randomness and zero mean
  2. Normal Q-Q to check normality
  3. Scale-Location to check Homoscedasticity
  4. Residuals vs Leverage to check for outliers

Residual plots:

par(mfrow=c(2,2))
plot(full)

1. Residuals vs Fitted

  • The horizontal red line is not exactly at the 0 level, thus the mean of the residuals/errors might not be 0. Let's quickly calculate the mean of the residuals,
mean(rstudent(full))
## [1] 0.0003090681

The mean is very close to zero. Hence, the zero mean assumption is not violated, that is, \(E(\epsilon_i) = 0\).

  • Residuals spread almost equally around the horizontal line without forming any distinct pattern indicating linear relationships.
  • No distinct pattern indicates randomness of the residuals.

2. Normal Q-Q

  • Residuals drift off near the tails, especially at the upper tail. All the other points are close to the normal line. Residuals may or may not be normally distributed. Need to perform statistical test.

3. Scale-Location

  • An increase in the spread of residuals is seen along the range of fitted values, indicating homoscedasticity might be violated.

It is difficult to judge the residual’s behavior with certainty from the plots. So, we need to test the assumptions statistically.

1- Non-constant error variance test

\(H_0\): Errors have a constant variance
\(H_1\): Errors have a non-constant variance

ncvTest(full) #library(car)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2846.35, Df = 1, p = < 2.22e-16

Since the p-value is \(<\) \(0.05\), we have enough evidence to reject \(H_0\). This implies that the constant error variance assumption is violated by the \(full\) model.

2- Test for Normally Distributed Errors

\(H_0\): Errors are normally distributed
\(H_1\): Errors are not normally distributed

shapiro.test(full$residuals) # library(stats)
## 
##  Shapiro-Wilk normality test
## 
## data:  full$residuals
## W = 0.88595, p-value < 2.2e-16

Since the p-value is < 0.05 we reject \(H_0\). This implies that normality error assumption is violated.

3- Test for Autocorrelated Errors

\(H_0\): Errors are uncorrelated
\(H_1\): Errors are correlated

durbinWatsonTest(full) # library(car)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.02738716      1.945155   0.086
##  Alternative hypothesis: rho != 0

Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\). This implies that the uncorrelated error assumption is not violated.

dwtest(formula = full,  alternative = "two.sided") #library(lmtest)
## 
##  Durbin-Watson test
## 
## data:  full
## DW = 1.9452, p-value = 0.06847
## alternative hypothesis: true autocorrelation is not 0

Since the p-value obtained from the Durbin-Watson test is not significant (d = 1.9452, p = 0.06847), we fail to reject the null hypothesis. This implies that the uncorrelated error assumption is not violated.

acf(full$residuals)

The correlation at lag 0 is always 1 (each error term is perfectly correlated with itself) and carries no information; we are interested in autocorrelation at lags 1, 2, 3, etc. Although a few bars exceed the significance level (dashed blue line), they are mostly late lags (lags 6, 19, 25) or are borderline significant. Thus, the uncorrelated error assumption is not violated.

4. Residuals vs Leverage

This plot helps us to find influential cases (i.e., observations), if any. We noticed that several observations stand out. But do they influence the regression results significantly and form outliers?

We notice no observations situated outside the 0.5 (or 1) dashed line, i.e., a Cook's distance of 0.5 (or 1). This means no observations have a significant influence on the regression results. Hence, no influential points are found for the full regression.

Summarizing residual analysis on \(full\) model:

Assumption 1: The error terms are randomly distributed and thus show linearity: Not violated
Assumption 2: The mean value of E is zero (zero mean residuals): Not violated
Assumption 3: The variance of E is constant, i.e. the errors are homoscedastic: Violated
Assumption 4: The error terms are independently distributed, i.e. they are not autocorrelated: Not violated
Assumption 5: The errors are normally distributed: Violated

Since the normality and homoscedasticity assumptions are violated, we need to apply appropriate transformations. Before that, let's see if any subset of the \(full\) model has a better fit.

Compare subsets of \(full\) model (Model selection)

Using ols_step_best_subset():

The best subsets of the regression model are obtained using ols_step_best_subset(full),

best_sub<-ols_step_best_subset(full) # library(olsrr)
best_sub
##                                Best Subsets Regression                               
## -------------------------------------------------------------------------------------
## Model Index    Predictors
## -------------------------------------------------------------------------------------
##      1         sqft_living                                                            
##      2         factor(bedrooms) sqft_living                                           
##      3         factor(bedrooms) sqft_living factor(floors)                            
##      4         factor(bedrooms) factor(bathrooms) sqft_living factor(floors)          
##      5         factor(bedrooms) factor(bathrooms) sqft_living sqft_lot factor(floors) 
## -------------------------------------------------------------------------------------
## 
##                                                                    Subsets Regression Summary                                                                   
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                                          
## Model    R-Square    R-Square    R-Square      C(p)          AIC           SBIC            SBC            MSEP              FPE               HSP          APC  
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.2941      0.2940      0.2923    255.8710    117698.5386    105254.2245    117717.6964    1.164319e+14    26564423277.2510    6059405.0110    0.7065 
##   2        0.3184      0.3179      0.3162     98.4660    117549.1402    105102.9245    117581.0699    1.124549e+14    25668759104.1741    5855103.8155    0.6825 
##   3        0.3287      0.3279      0.3257     32.9945    117486.5592    105038.4228    117531.2608    1.107856e+14    25299264756.3205    5770823.6946    0.6726 
##   4        0.3324      0.3313      0.3282     10.3825    117466.0181    105015.9263    117523.4916    1.101925e+14    25175300720.6660    5742550.1844    0.6691 
##   5        0.3338      0.3326      0.3292      3.0000    117458.6264    105008.5601    117522.4858    1.099818e+14    25132902808.3387    5732882.6913    0.6680 
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria
plot(best_sub) 

Using the plots, best model as per,

  • Adjusted R2 (highest) is model 5 (0.3326)
  • Cp (least) is model 5
  • AIC (least) is model 5
  • SBIC (least) is model 5
  • SBC (least) is model 5 (also known as BIC)
  • Other statistics such as MSEP, FPE, HSP, APC all suggest model 5 is best.

Thus, model 5 or the \(full\) model having all 5 predictors is the best model as per multiple statistical measures.

Using ols_step_all_possible():

All possible sets of models with different combinations of predictors are given by ols_step_all_possible(full),

all_models<-ols_step_all_possible(full)
plot(all_models)

Using the plots above, again, \(full\) model is best as per Adjusted R2, R2, Cp, AIC, SBIC and BIC statistics.

Transformation

As the response, price, is not normally distributed per the Shapiro-Wilk test (also seen from the right-skewed histogram), let's check if a Box-Cox transformation helps improve normality. Let's calculate lambda (\(\lambda\)) to find which transformation is best for our response variable,

bc <- MASS::boxcox(full)

trans <- bc$x[which.max(bc$y)]
lambda <- round(trans,1)
lambda
## [1] 0.1

As \(\lambda\) (0.1) is close to 0, a log transformation can be chosen. Let's fit the log-transformed model (\(model1\)) and check the assumptions,
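For reference, the Box-Cox family is \((y^\lambda - 1)/\lambda\) for \(\lambda \neq 0\) and \(\log(y)\) in the limit \(\lambda = 0\); a minimal sketch of this rule (an illustrative helper, not part of MASS):

box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda # log(y) at lambda = 0
}
head(box_cox(kc_house$price, 0)) # equals log(price) when lambda = 0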

Fit transformed model, \(model1\)

The linear model using all 5 predictors, with the response price log-transformed, is,

model1 = lm(log(price) ~ factor(bedrooms) + factor(bathrooms)   + sqft_living + sqft_lot + factor(floors), data=kc_house) #best model till now
summary(model1)
## 
## Call:
## lm(formula = log(price) ~ factor(bedrooms) + factor(bathrooms) + 
##     sqft_living + sqft_lot + factor(floors), data = kc_house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27182 -0.27520  0.00848  0.27168  1.22528 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.214e+01  3.629e-02 334.700  < 2e-16 ***
## factor(bedrooms)2   3.894e-02  3.536e-02   1.101  0.27084    
## factor(bedrooms)3  -1.156e-01  3.584e-02  -3.226  0.00127 ** 
## factor(bathrooms)2  5.283e-02  1.645e-02   3.212  0.00133 ** 
## factor(bathrooms)3  6.403e-02  3.906e-02   1.639  0.10126    
## sqft_living         4.608e-04  1.653e-05  27.885  < 2e-16 ***
## sqft_lot           -4.176e-07  1.577e-07  -2.649  0.00810 ** 
## factor(floors)2     1.025e-01  2.287e-02   4.480 7.64e-06 ***
## factor(floors)3     2.302e-01  4.718e-02   4.880 1.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3846 on 4376 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2806 
## F-statistic: 214.7 on 8 and 4376 DF,  p-value: < 2.2e-16

The log-transformed \(model1\) is significant (p-value < 2.2e-16). The R-squared suggests 28.19% of the variability in log(price) is explained by the model. The model equation is,

\[log(\hat y)= 12.14 + 0.0389*bedrooms2 - 0.1155*bedrooms3 + 0.0528*bathrooms2 + 0.064*bathrooms3 + 4.608e-04*sqft\_living - 4.176e-07*sqft\_lot + 0.1024*floors2 + 0.2302*floors3\] \[or\] \[E(log(y))= 12.14 + 0.0389*bedrooms2 - 0.1155*bedrooms3 + 0.0528*bathrooms2 + 0.064*bathrooms3 + 4.608e-04*sqft\_living - 4.176e-07*sqft\_lot + 0.1024*floors2 + 0.2302*floors3\]

Graphical and statistical tests of assumptions for transformed \(model1\) (Residual Analysis)

Let's perform residual analysis using residual plots for \(model1\) to check if any of the following model assumptions have been violated.

  1. Check Linearity and Randomness and zero mean using Residuals vs Fitted plot
  2. Check normality using Normal Q-Q plot
  3. Check Homoscedasticity using Scale-Location plot
  4. Check for outliers using Residuals vs Leverage

Residual plots:

par(mfrow=c(2,2))
plot(model1)

1. Residuals vs Fitted

  • The horizontal red line is basically at the 0 level, thus the mean of the residuals/errors is approximately 0.
  • Residuals spread almost equally around the horizontal line without forming any distinct pattern indicating linear relationships.
  • No distinct pattern indicates randomness of the residuals.

2. Normal Q-Q

  • Very few residuals drift off near the tails. All the other points are close to the normal line. Residuals might be normally distributed. Need to perform statistical test.

3. Scale-Location

  • Little change in the spread of residuals is seen along the range of fitted values, indicating homoscedasticity.

It is difficult to judge the residual’s behavior with certainty from the plots. So, we need to test the assumptions statistically.

1- Non-constant error variance test

\(H_0\): Errors have a constant variance
\(H_1\): Errors have a non-constant variance

ncvTest(model1) #library(car)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.09914357, Df = 1, p = 0.75286

Since the p-value is \(>\) \(0.05\), we fail to reject the null hypothesis \(H_0\). This implies that the constant error variance assumption is not violated by the transformed model.

2- Test for Normally Distributed Errors

\(H_0\): Errors are normally distributed
\(H_1\): Errors are not normally distributed

shapiro.test(model1$residuals) # library(stats)
## 
##  Shapiro-Wilk normality test
## 
## data:  model1$residuals
## W = 0.99842, p-value = 0.0002386

Since the p-value is < 0.05 we reject \(H_0\). This implies that normality error assumption is violated.

3- Test for Autocorrelated Errors

\(H_0\): Errors are uncorrelated
\(H_1\): Errors are correlated

durbinWatsonTest(model1) # library(car)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.02495699       1.94998   0.102
##  Alternative hypothesis: rho != 0

Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\). This implies that the uncorrelated error assumption is not violated.

dwtest(formula = model1,  alternative = "two.sided") #library(lmtest)
## 
##  Durbin-Watson test
## 
## data:  model1
## DW = 1.95, p-value = 0.09649
## alternative hypothesis: true autocorrelation is not 0

Since the p-value obtained from the Durbin-Watson test is not significant (d = 1.95, p = 0.09649), we fail to reject the null hypothesis. This implies that the uncorrelated error assumption is not violated.

acf(model1$residuals)

No significant correlated errors at lags are seen. Thus, uncorrelated error assumption is not violated.

Summarizing residual analysis on the transformed \(model1\):

Assumption 1: The error terms are randomly distributed and thus show linearity: Not violated
Assumption 2: The mean value of E is zero (zero mean residuals): Not violated
Assumption 3: The variance of E is constant, i.e. the errors are homoscedastic: Not violated
Assumption 4: The error terms are independently distributed, i.e. they are not autocorrelated: Not violated
Assumption 5: The errors are normally distributed: Violated

ANOVA table to assess models overall fit (Significance test for regression)

To assess the overall fit of the model, the significance of the regression (with all 5 predictors) is tested using the test statistic \(F_0\) given by,

\(F_0 = \frac{SS_R/k}{SS_{Res}/(n-k-1)} = \frac{MS_R}{MS_{Res}}\) (i.e., mean square of the regression vs mean square of the residuals)

A significance test using the F-statistic (or its p-value) from the ANOVA table will be performed on the regression. Let's state the hypotheses for the significance test,

Null hypothesis; \(H_0\): \(\beta_1 = \beta_2 = \dots = \beta_8 = 0\)
Alternate hypothesis; \(H_1\): at least one \(\beta_j \not = 0\).

Let's generate the ANOVA table using anova(),

ANOVA <- anova(model1)
ANOVA
## Analysis of Variance Table
## 
## Response: log(price)
##                     Df Sum Sq Mean Sq  F value    Pr(>F)    
## factor(bedrooms)     2   8.19   4.096  27.6862 1.126e-12 ***
## factor(bathrooms)    2 121.18  60.590 409.5278 < 2.2e-16 ***
## sqft_living          1 118.04 118.042 797.8428 < 2.2e-16 ***
## sqft_lot             1   1.16   1.160   7.8427  0.005125 ** 
## factor(floors)       2   5.58   2.790  18.8582 7.000e-09 ***
## Residuals         4376 647.43   0.148                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We cannot rely on this output directly, as we need to test all the regressors together to assess the significance of the regression. We need to pool the predictors' sums of squares and calculate the p-value for them, as shown below,

# Manually
regression<- ANOVA$`Sum Sq`[1]+ ANOVA$`Sum Sq`[2]+ ANOVA$`Sum Sq`[3] + ANOVA$`Sum Sq`[4]+ ANOVA$`Sum Sq`[5]
Mean_Sq<-regression/(ANOVA$Df[1]+ANOVA$Df[2]+ANOVA$Df[3]+ANOVA$Df[4]+ANOVA$Df[5])
F_value<-Mean_Sq/ANOVA$`Mean Sq`[6]
pf(F_value, 8, ANOVA$Df[6], lower.tail = FALSE) # gives p-value: 8.552035e-308, consistent with the < 2.2e-16 reported by summary(model1)
## [1] 8.552035e-308
# using anova()
null <- lm(log(price)~1,kc_house) # comparing model1 with the intercept-only (null) model gives the overall F-test for the regression
anova(null,model1) # p-value same as that from summary()
## Analysis of Variance Table
## 
## Model 1: log(price) ~ 1
## Model 2: log(price) ~ factor(bedrooms) + factor(bathrooms) + sqft_living + 
##     sqft_lot + factor(floors)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   4384 901.59                                  
## 2   4376 647.43  8    254.16 214.73 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value (< 2.2e-16) is < 0.05, the F-statistic is significant, implying that the transformed linear regression model (\(model1\)) is significant.

Additionally, Anova(type = 3) tests each coefficient's significance, similar to summary(), but it does not report the overall model significance.

car::Anova(model1, type="III")
## Anova Table (Type III tests)
## 
## Response: log(price)
##                    Sum Sq   Df    F value    Pr(>F)    
## (Intercept)       16574.1    1 1.1202e+05 < 2.2e-16 ***
## factor(bedrooms)     20.0    2 6.7556e+01 < 2.2e-16 ***
## factor(bathrooms)     1.5    2 5.1930e+00  0.005589 ** 
## sqft_living         115.0    1 7.7760e+02 < 2.2e-16 ***
## sqft_lot              1.0    1 7.0175e+00  0.008100 ** 
## factor(floors)        5.6    2 1.8858e+01     7e-09 ***
## Residuals           647.4 4376                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For model1, all 5 predictors and the intercept are significant at the 0.05 significance level.

Significance test for coefficients (Variable selection)

Let's test the significance of the predictor coefficients \(\beta_1\) to \(\beta_8\) of the variables factor(bedrooms)2, factor(bedrooms)3, etc. The significance of the intercept \(\beta_0\) will also be checked.

Recall the t-values for the coefficients,

summary(model1)
## 
## Call:
## lm(formula = log(price) ~ factor(bedrooms) + factor(bathrooms) + 
##     sqft_living + sqft_lot + factor(floors), data = kc_house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27182 -0.27520  0.00848  0.27168  1.22528 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.214e+01  3.629e-02 334.700  < 2e-16 ***
## factor(bedrooms)2   3.894e-02  3.536e-02   1.101  0.27084    
## factor(bedrooms)3  -1.156e-01  3.584e-02  -3.226  0.00127 ** 
## factor(bathrooms)2  5.283e-02  1.645e-02   3.212  0.00133 ** 
## factor(bathrooms)3  6.403e-02  3.906e-02   1.639  0.10126    
## sqft_living         4.608e-04  1.653e-05  27.885  < 2e-16 ***
## sqft_lot           -4.176e-07  1.577e-07  -2.649  0.00810 ** 
## factor(floors)2     1.025e-01  2.287e-02   4.480 7.64e-06 ***
## factor(floors)3     2.302e-01  4.718e-02   4.880 1.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3846 on 4376 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2806 
## F-statistic: 214.7 on 8 and 4376 DF,  p-value: < 2.2e-16

For \(\beta_1\):

For the coefficient \(\beta_1\) of bedrooms2, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_1\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_1 \not = 0\)

And the Test statistic is:

\(t = \frac {\hat{\beta_j} - \beta_j} {se\left(\hat{\beta_j}\right)}\)

We reject \(H_0\) if \(|t|\) > \(t_{\alpha/2, n-p}\) or by p-value.

The t-critical value for a two-tailed t-test with n-p (= n-k-1 = 4376) DF is ±1.960506; qt() returns the lower-tail value,

qt(p = .025, df = length(kc_house$price) - 8 - 1)
## [1] -1.960506

From summary(model1), the t-statistic (1.101) is smaller in absolute value than the t-critical value (1.960506), hence the null hypothesis is not rejected. Thus, the coefficient \(\beta_1\) is insignificant for the regression at the 5% significance level.

For \(\beta_2\) :

For the coefficient B2 of bedrooms3, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_2\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_2 \not = 0\)

The t-statistic (-3.226) is larger in absolute value than the t-critical value (1.960506), hence the null hypothesis is rejected and \(\beta_2\) is not zero. Thus, the coefficient \(\beta_2\) is significant for the regression.

For \(\beta_3\) :

For the coefficient B3 of bathrooms2, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_3\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_3 \not = 0\)

The t-statistic (3.212) is larger in absolute value than the t-critical value (1.960506). The coefficient \(\beta_3\) is significant for the regression.

For \(\beta_4\) :

For the coefficient B4 of bathrooms3, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_4\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_4 \not = 0\)

The t-statistic (1.639) is smaller in absolute value than the t-critical value (1.960506). The coefficient \(\beta_4\) is insignificant for the regression.

For \(\beta_5\) :

For the coefficient B5 of sqft_living, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_5\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_5 \not = 0\)

The t-statistic (27.885) is larger in absolute value than the t-critical value (1.960506). The coefficient \(\beta_5\) is significant for the regression.

For \(\beta_6\) :

For the coefficient B6 of sqft_lot, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_6\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_6 \not = 0\)

The t-statistic (-2.649) is larger in absolute value than the t-critical value (1.960506). The coefficient \(\beta_6\) is significant for the regression.

For \(\beta_7\) :

For the coefficient B7 of floors2, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_7\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_7 \not = 0\)

The t-statistic (4.480) is larger in absolute value than the t-critical value (1.960506). The coefficient \(\beta_7\) is significant for the regression.

For \(\beta_8\) :

For the coefficient B8 of floors3, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_8\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_8 \not = 0\)

The t-statistic (4.880) is larger in absolute value than the t-critical value (1.960506). The coefficient \(\beta_8\) is significant for the regression.

Significance test for Intercept

For the intercept B0, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_0\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_0 \not = 0\)

The t-statistic (334.700) is larger in absolute value than the t-critical value (1.960506), hence the intercept is significant at the 5% level.

Model equation and Interpretation

Fitted model equation is,

\(log(\hat y)\) = \(\hat\beta_0\) + \(\hat\beta_1\)x1 + \(\hat\beta_2\)x2 + \(\hat\beta_3\)x3 + \(\hat\beta_4\)x4 + \(\hat\beta_5\)x5 + \(\hat\beta_6\)x6 + \(\hat\beta_7\)x7 + \(\hat\beta_8\)x8, i.e,

\[log(\hat y)= 12.14 + 0.0389*bedrooms2 - 0.1155*bedrooms3 + 0.0528*bathrooms2 + 0.064*bathrooms3 + 4.608e-04*sqft\_living - 4.176e-07*sqft\_lot + 0.1024*floors2 + 0.2302*floors3\] \[or\] \[E(log(y))= 12.14 + 0.0389*bedrooms2 - 0.1155*bedrooms3 + 0.0528*bathrooms2 + 0.064*bathrooms3 + 4.608e-04*sqft\_living - 4.176e-07*sqft\_lot + 0.1024*floors2 + 0.2302*floors3\]

Where the coefficients are interpreted as,

\(B_0\) (12.14) :-

When sqft_living and sqft_lot are set to 0 and the factors are at their baseline levels (bedrooms2 & bedrooms3 set to 0, representing a 1-bedroom house; bathrooms2 & bathrooms3 set to 0, representing a 1-bathroom house; floors2 & floors3 set to 0, representing a 1-floor house), the mean of log(price) equals \(B_0\).

\(B_1\) (0.0389) :-

When sqft_living, sqft_lot, bathrooms and floors are held fixed, moving from a 1-bedroom house to a 2-bedroom house (bedrooms2 = 1, bedrooms3 = 0) increases the mean of log(price) by \(B_1\).

In similar fashion, \(B_2\) to \(B_8\) can be interpreted.

When we remove the insignificant predictors \(\beta_1\) (bedrooms2) and \(\beta_4\) (bathrooms3), the equation becomes,
\[log(\hat y)= 12.14 - 0.1155*bedrooms3 + 0.0528*bathrooms2 + 4.608e-04*sqft\_living - 4.176e-07*sqft\_lot + 0.1024*floors2 + 0.2302*floors3\]

Compare transformed and base model

  • The \(full\) model violates normality and homoscedasticity, while the transformed \(model1\) violates only normality. Hence, the coefficient estimates of the transformed model are more reliable.
  • 28.19% of the variation in the response is explained by \(model1\), compared to 33.38% by the \(full\) model (noting the two R-squared values are on different response scales, log(price) vs price).

Using the PRESS statistic:

DAAG::press(model1)
## [1] 649.989
DAAG::press(full)
## [1] 1.106e+14

Since the PRESS statistic is smaller for the log-transformed model1, its predictive ability is better than that of the full model. Thus, the fit is better for model1; model1 is the better model.
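For a linear model, the PRESS statistic can also be computed directly from the ordinary residuals and the leverages, \(PRESS = \sum_i \left(e_i/(1-h_{ii})\right)^2\); a minimal sketch that should match the DAAG::press() values above:

sum((residuals(model1) / (1 - hatvalues(model1)))^2) # manual PRESS for model1
sum((residuals(full)   / (1 - hatvalues(full)))^2)   # manual PRESS for full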

Model 2 (Logistic model)

A logistic regression model is fit when the response variable follows a binomial distribution of successes (1) and failures (0). The log-odds of success (the logit) is then modeled as a linear function of the predictors, making this a Generalized Linear Model (GLM). The data for the logistic model needs to be in success/failure (matrix) form.
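In other words, the model is linear on the log-odds scale, and a probability is recovered with the inverse-logit; a minimal sketch with a made-up linear predictor \(\eta\):

eta <- 0.5                     # hypothetical log-odds (linear predictor)
p <- exp(eta) / (1 + exp(eta)) # inverse-logit, same as plogis(eta)
log(p / (1 - p))               # back-transform recovers eta = 0.5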

In the context of our kc_house data, the response variable is defined as the odds of success, where success is getting a lower-priced house. We create a matrix for our response holding SUCCESSES (LowPrice) and FAILURES (HighPrice). The response is a categorical variable here. The predictors bedrooms, bathrooms, and floors remain categorical, while sqft_living and sqft_lot remain continuous.

The following set of codes derives the data in matrix form for logistic regression,

# Create threshold using median 
median(kc_house$price) # 333,000
## [1] 333000
# Flag high price and low price using median as threshold.
# Note: homes priced exactly at the median (333,000) fall into neither flag.
kc_house_flag <- kc_house %>% mutate(HighPrice = ifelse(price > 333000, 1, 0),
                                     LowPrice = ifelse(price < 333000, 1, 0))

# Count LowPrice and HighPrice at the group level of bedrooms x bathrooms x floors. 
a <- aggregate(cbind(LowPrice,HighPrice) ~ bedrooms + bathrooms + floors, data = kc_house_flag, FUN = sum, na.rm = TRUE)

# Take mean of sqft_living and sqft_lot 
b <- aggregate(cbind(sqft_living, sqft_lot) ~ bedrooms + bathrooms + floors, data = kc_house_flag, FUN = mean, na.rm = TRUE)
c <- merge(a, b)

# Reorder columns
kc_house_logistic <- c[,c(4,5,1,2,3,6,7)]
kc_house_logistic
##    LowPrice HighPrice bedrooms bathrooms floors sqft_living   sqft_lot
## 1        79        43        1         1      1    799.5082   9363.123
## 2         0         2        1         1      2    855.0000   3139.500
## 3         2         0        1         2      1   1150.0000   9812.000
## 4         2         1        1         2      2   1363.3333  64646.667
## 5       749       656        2         1      1   1014.9018  11366.468
## 6        18        25        2         1      2   1211.9302   4705.488
## 7         1         3        2         1      3   1015.2500   1061.750
## 8        31        96        2         2      1   1483.4961   8252.228
## 9        19        32        2         2      2   1391.4314  22270.431
## 10        4        11        2         2      3   1327.3333   1274.067
## 11        0         2        2         3      1   2290.0000 493639.000
## 12        0         8        2         3      2   1848.7500  10101.500
## 13        0         2        2         3      3   1610.0000   1120.500
## 14      916       535        3         1      1   1228.1945  10654.033
## 15       16        27        3         1      2   1488.8837   9288.558
## 16      299       451        3         2      1   1688.1640  14544.408
## 17       31       123        3         2      2   1850.2792  23678.948
## 18        2        26        3         2      3   1421.8214   1416.321
## 19       11        26        3         3      1   2373.2432  19927.541
## 20        6        97        3         3      2   2652.2330  13329.000
## 21        2        26        3         3      3   1621.0714   1668.964

Note- Sqft_living and sqft_lot are averages aggregated at bedrooms x bathrooms x floors level.

Fit logistic regression model, \(model2\)

The logistic regression model for the binomially distributed response (odds of getting a lower-priced house), generalized as a linear model, is fit using glm() by passing the successes and failures of the response, as shown below,

model2 <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms) + factor(bathrooms) + factor(floors) + sqft_living + sqft_lot, family=binomial, kc_house_logistic)
summary.glm(model2)
## 
## Call:
## glm(formula = cbind(LowPrice, HighPrice) ~ factor(bedrooms) + 
##     factor(bathrooms) + factor(floors) + sqft_living + sqft_lot, 
##     family = binomial, data = kc_house_logistic)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.291e+00  6.666e-01   3.436 0.000589 ***
## factor(bedrooms)2  -6.327e-02  2.591e-01  -0.244 0.807066    
## factor(bedrooms)3   7.619e-01  4.028e-01   1.892 0.058532 .  
## factor(bathrooms)2 -1.574e-02  3.690e-01  -0.043 0.965988    
## factor(bathrooms)3  2.725e-01  9.221e-01   0.295 0.767649    
## factor(floors)2    -4.080e-01  1.852e-01  -2.203 0.027579 *  
## factor(floors)3    -1.968e+00  4.987e-01  -3.947 7.93e-05 ***
## sqft_living        -2.058e-03  8.027e-04  -2.564 0.010342 *  
## sqft_lot            4.140e-07  4.363e-06   0.095 0.924409    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429.301  on 20  degrees of freedom
## Residual deviance:  29.989  on 12  degrees of freedom
## AIC: 117.16
## 
## Number of Fisher Scoring iterations: 5

summary() gives us the AIC score, which we can use to compare models. We notice many insignificant predictors. First, let's check whether this model is adequate.

Check Adequacy of logistic model (Tests of assumptions)

The adequacy of a logistic regression model can be checked by testing its goodness of fit using the deviance, a chi-square-based test. Other methods are the Pearson and Hosmer-Lemeshow tests.

Unlike simple linear models, residual analysis can't be performed because,
1. The error terms take on only two values, so they can't possibly be normally distributed
2. The variance of the observations is a function of the mean
3. A linear response function could result in predicted values that fall outside the 0, 1 range, which is impossible because \(0<E(y_i) = \mu_i \le 1\)

Instead, the adequacy of the logistic regression model is checked by testing the goodness of fit using the deviance.

The hypotheses when using the deviance are,

\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.

# Calculate deviance of logistic model
deviance(model2)
## [1] 29.98885
# Gives the p value for the goodness of fit test
pchisq(model2$deviance, df=model2$df.residual, lower.tail=FALSE) # model2$df.residual = 12
## [1] 0.002803238

The chi-square test statistic of 29.98885 with 12 degrees of freedom gives a p-value of 0.002803238, indicating that the null hypothesis is rejected. Hence, the \(model2\) logistic model does not fit the data well.

Check which predictors are significant (Variable selection)

We want to check if a reduced model is plausible. Individual tests on the predictor variables help us determine which variables can be removed.

Tests on individual model coefficients can also be done using Wald inference. The MLEs have an approximate normal distribution, so the distribution of \(Z_0\),
\[Z_0 = \frac{\hat \beta}{se(\hat \beta)}\]
is standard normal if the true value of the parameter is zero. The z values shown by summary() are these Wald statistics; equivalently, the square of Z is chi-square with 1 Df, and p-values can be calculated from either.
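A quick sketch of this Wald calculation from the fitted model, which should reproduce the z value and Pr(>|z|) columns of summary(model2):

cf <- summary(model2)$coefficients          # Estimate, Std. Error, z value, Pr(>|z|)
z <- cf[, "Estimate"] / cf[, "Std. Error"]  # Wald Z statistic per coefficient
p <- 2 * pnorm(-abs(z))                     # two-sided p-values
cbind(z, p)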

The Type III Anova() for the model gives a likelihood-ratio chi-square statistic for each predictor, which is assessed using its p-value.

These tests on the model predictors will indicate which of them have a significant effect on the odds of getting a lower-priced house (response). The analysis-of-deviance table is,

car::Anova(model2, type="III") # library(car) 
## Analysis of Deviance Table (Type III tests)
## 
## Response: cbind(LowPrice, HighPrice)
##                   LR Chisq Df Pr(>Chisq)    
## factor(bedrooms)    29.165  2  4.645e-07 ***
## factor(bathrooms)    1.028  2    0.59801    
## factor(floors)      39.776  2  2.306e-09 ***
## sqft_living          6.191  1    0.01284 *  
## sqft_lot             0.009  1    0.92606    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA table, bathrooms and sqft_lot could be removed, as they are not significant at the 5% level.

Fit reduced model, \(reducedmodel2\)

Let's fit the reduced model with the bedrooms and floors predictors (sqft_living is reintroduced below when exploring interactions),

reducedmodel2 <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms) + factor(floors) + floors , family=binomial, kc_house_logistic) # the numeric floors term duplicates factor(floors), so it is aliased (the NA row below)
summary.glm(reducedmodel2)
## 
## Call:
## glm(formula = cbind(LowPrice, HighPrice) ~ factor(bedrooms) + 
##     factor(floors) + floors, family = binomial, data = kc_house_logistic)
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6466     0.1855   3.485 0.000492 ***
## factor(bedrooms)2  -0.5663     0.1921  -2.949 0.003192 ** 
## factor(bedrooms)3  -0.4875     0.1899  -2.567 0.010272 *  
## factor(floors)2    -1.3773     0.1233 -11.170  < 2e-16 ***
## factor(floors)3    -2.1604     0.3563  -6.063 1.34e-09 ***
## floors                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429.30  on 20  degrees of freedom
## Residual deviance: 219.66  on 16  degrees of freedom
## AIC: 298.84
## 
## Number of Fisher Scoring iterations: 4

Note the AIC score for model comparison.

Check Adequacy of reduced logistic model (Tests of assumptions)

The hypotheses when using the deviance are,

\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.

deviance(reducedmodel2)
## [1] 219.6638
pchisq(reducedmodel2$deviance, df=reducedmodel2$df.residual, lower.tail=FALSE) # reducedmodel2$df.residual = 16
## [1] 8.158237e-38

The chi-square test statistic of 219.6638 with 16 degrees of freedom gives a p-value of 8.158e-38, indicating that the null hypothesis is rejected. Hence, the reduced logistic model \(reducedmodel2\) also does not fit the data well.

Explore Interactions (Model selection)

Although the multicollinearity check at the beginning didn't show any significant correlation between the predictors, in the context of our House Sales data we know that the number of bedrooms, the number of floors and the living size are related to each other. When predictors are expected to be related, interactions between these predictors can be added to the regression model.

Fit interaction based models

Since the reduced model set has 3 predictors (bedrooms, floors and sqft_living), adding a two-way interaction can be done in \(^3C_2\) = 3 combinations. Let's fit each of these interaction-based models.

# Fit models
reducedmodel2a <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms)  + factor(floors) + sqft_living + factor(bedrooms)*factor(floors), family=binomial, kc_house_logistic) # ONLY SIGNIFICANT MODEL

reducedmodel2b <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms)  + factor(floors) + sqft_living + factor(bedrooms)*sqft_living, family=binomial, kc_house_logistic)

reducedmodel2c <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms)  + factor(floors) + sqft_living + factor(floors)*sqft_living, family=binomial, kc_house_logistic)

Check adequacy of interaction based models

Again, the hypotheses when using the deviance are,

\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.

# Check model adequacies
deviance(reducedmodel2a)
## [1] 20.59735
print(paste(pchisq(reducedmodel2a$deviance, df=reducedmodel2a$df.residual, lower.tail=FALSE), "Significant!!"))# reducedmodel2a$df.residual = 12
## [1] "0.0565967308332755 Significant!!"
deviance(reducedmodel2b)
## [1] 27.52464
pchisq(reducedmodel2b$deviance, df=reducedmodel2b$df.residual, lower.tail=FALSE) # reducedmodel2b$df.residual = 13
## [1] 0.01053559
deviance(reducedmodel2c)
## [1] 27.26707
pchisq(reducedmodel2c$deviance, df=reducedmodel2c$df.residual, lower.tail=FALSE) # reducedmodel2c$df.residual = 13
## [1] 0.01143388

We found an adequate logistic regression model! The reduced model \(reducedmodel2a\), with the interaction term bedrooms × floors and the 3 predictors bedrooms, floors and sqft_living, has a chi-square test statistic of 20.59735 with 12 degrees of freedom, giving a p-value of 0.056; the null hypothesis is not rejected. Hence, the \(reducedmodel2a\) logistic model fits the data well.

Let's look at the summary(),

summary(reducedmodel2a)
## 
## Call:
## glm(formula = cbind(LowPrice, HighPrice) ~ factor(bedrooms) + 
##     factor(floors) + sqft_living + factor(bedrooms) * factor(floors), 
##     family = binomial, data = kc_house_logistic)
## 
## Coefficients: (1 not defined because of singularities)
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        2.2266966  0.2273662   9.793  < 2e-16 ***
## factor(bedrooms)2                 -0.1096435  0.1994530  -0.550 0.582511    
## factor(bedrooms)3                  0.7374060  0.2149977   3.430 0.000604 ***
## factor(floors)2                   -0.3585976  0.9606131  -0.373 0.708925    
## factor(floors)3                   -2.5361335  0.5216365  -4.862 1.16e-06 ***
## sqft_living                       -0.0019780  0.0001571 -12.591  < 2e-16 ***
## factor(bedrooms)2:factor(floors)2  0.3334882  0.9828605   0.339 0.734381    
## factor(bedrooms)3:factor(floors)2 -0.2653644  0.9731551  -0.273 0.785096    
## factor(bedrooms)2:factor(floors)3  1.7920471  0.7363305   2.434 0.014943 *  
## factor(bedrooms)3:factor(floors)3         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 429.301  on 20  degrees of freedom
## Residual deviance:  20.597  on 12  degrees of freedom
## AIC: 107.77
## 
## Number of Fisher Scoring iterations: 4

Note AIC score.

Check which predictors are significant (Variable selection)

The Type III ANOVA test for each predictor shows,

car::Anova(reducedmodel2a, type="III") # library(car) 
## Error in Anova.III.LR.glm(mod, singular.ok = singular.ok): there are aliased coefficients in the model

Let's use a manual approach: using anova(), get the deviance for each predictor and use pchisq() to get the p-values,

where,

\(H_0\): \(\beta_i = 0\) \(H_a\): \(\beta_i \not = 0\)

anova(reducedmodel2a)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: cbind(LowPrice, HighPrice)
## 
## Terms added sequentially (first to last)
## 
## 
##                                 Df Deviance Resid. Df Resid. Dev
## NULL                                               20     429.30
## factor(bedrooms)                 2   11.164        18     418.14
## factor(floors)                   2  198.473        16     219.66
## sqft_living                      1  188.640        15      31.02
## factor(bedrooms):factor(floors)  3   10.427        12      20.60

Using the deviance values above, the p-values for the significance of the predictors are,

pchisq(11.164, df=2, lower.tail=FALSE) # For bedrooms
## [1] 0.003765028
pchisq(198.473, df=2, lower.tail=FALSE) # For floors
## [1] 7.98244e-44
pchisq(188.640, df=1, lower.tail=FALSE) # For sqft_living
## [1] 6.297891e-43
pchisq(10.427, df=3, lower.tail=FALSE) # For bedrooms x floors
## [1] 0.01526436

All four p-values are below 0.05, so each term, including the bedrooms × floors interaction, is significant at the 5% level.

Equation of Reduced model with interaction:

Let \(p\) be the probability of success, i.e., the probability of getting a lower-priced house. Based on the output of R (keeping only the significant terms), the fitted model equation is:

\[p= \frac{e^{2.226 + 0.737 \hspace{0.1 cm} bedrooms3 - 2.53 \hspace{0.1 cm} floors3 - 0.00197 \hspace{0.1 cm} sqft\_living + 1.79 \hspace{0.1 cm} bedrooms2*floors3}}{1+e^{2.226 + 0.737 \hspace{0.1 cm} bedrooms3 - 2.53 \hspace{0.1 cm} floors3 - 0.00197 \hspace{0.1 cm} sqft\_living + 1.79 \hspace{0.1 cm} bedrooms2*floors3}}\]

Or can be written as,

\[p=\frac{1}{1+e^{-2.226 - 0.737 \hspace{0.1 cm} bedrooms3 + 2.53 \hspace{0.1 cm} floors3 + 0.00197 \hspace{0.1 cm} sqft\_living - 1.79 \hspace{0.1 cm} bedrooms2*floors3}}\]

Another way;

\[log(\frac{p}{1-p})= 2.226 + 0.737 \hspace{0.1 cm} bedrooms3 - 2.53 \hspace{0.1 cm} floors3 - 0.00197 \hspace{0.1 cm} sqft\_living + 1.79 \hspace{0.1 cm} bedrooms2*floors3\]
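Fitted probabilities can equivalently be obtained from the model object itself; a minimal sketch for a hypothetical house with 3 bedrooms, 1 floor and 1,200 sqft of living space (illustrative values; R may warn about the rank-deficient fit because of the aliased interaction term):

new_house <- data.frame(bedrooms = 3, floors = 1, sqft_living = 1200) # hypothetical house
predict(reducedmodel2a, newdata = new_house, type = "response")       # estimated P(lower priced)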

Where the coefficients are interpreted as,

\(B_0\) (2.227) :-

When sqft_living is set to 0 and the factors are at their baseline levels (bedrooms2 & bedrooms3 set to 0, representing a 1-bedroom house; floors2 & floors3 set to 0, representing a 1-floor house), the odds of success (getting a lower-priced house) equal exp(\(B_0\)), i.e., exp(2.227).

\(B_1\) (-0.110) :-

When sqft_living and floors are held fixed, moving from a 1-bedroom to a 2-bedroom house (bedrooms2 = 1, bedrooms3 = 0) multiplies the odds of getting a lower-priced house by exp(\(B_1\)), i.e., exp(-0.110).

In similar fashion, \(B_2\) to \(B_8\) can be interpreted.

Interpretation: A one-unit increase in sqft_living multiplies the odds of getting a lower-priced house by exp(-0.0019780) ≈ 0.998, i.e., decreases them slightly, when all the other variables are fixed.
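The same multiplicative reading applies to every coefficient; exponentiating the coefficient vector gives the odds ratios directly (the aliased interaction term stays NA):

exp(coef(reducedmodel2a)) # odds ratios; values below 1 reduce the odds of a lower-priced house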

Compare logistic models

  • \(model2\) and \(reducedmodel2\) fail the goodness-of-fit (deviance) tests. \(reducedmodel2a\) passes.
  • \(reducedmodel2a\) (107.77) has the best (lowest) AIC score compared to \(model2\) (117.16) and \(reducedmodel2\) (298.84).

Using the PRESS statistic

DAAG::press(model2)
## [1] 152.6449
DAAG::press(reducedmodel2)
## [1] 730.2107
DAAG::press(reducedmodel2a) 
## [1] 212.4817

Although \(reducedmodel2a\) fits the data well, it is not better at prediction, as the PRESS statistics for \(model2\) and \(reducedmodel2\) are lower. (A lower PRESS value indicates better prediction.)

Model 3 (Poisson model)

A Poisson regression model is used to model counts, with the intention of modeling the main parameter \(\lambda\), the average number of occurrences per unit of time or space, as a function of one or more covariates. A log link is used by default since \(\lambda_i\) can only take values from \(0\) to \(\infty\).
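Concretely, the log link means \(\log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \dots\), so \(\lambda_i = e^{\beta_0 + \beta_1 x_{i1} + \dots}\) is positive for any linear predictor; a minimal sketch with made-up coefficients:

b0 <- 1.2; b1 <- 0.03; x <- 10 # hypothetical coefficients and covariate value
lambda <- exp(b0 + b1 * x)     # implied mean count, always positive
lambda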

In the context of the kc_house data, the response variable is defined as the count of sales per price range. Averages of the 5 regressors, aggregated at the bucket level, are taken as the predictors.

The data for the Poisson regression is derived as shown below,

# Divide the response, price, into 6 buckets of width mean(kc_house$price)/3
# (the last bucket extends to the maximum price).
onesixth <- mean(kc_house$price)/3

# Create 6 price ranges 
kc_house_poisson_flag <- kc_house %>% mutate(range = ifelse(price >= min(kc_house$price) & price <= onesixth, 1, 
                                                           ifelse(price > onesixth & price <= onesixth*2, 2, 
                                                                  ifelse(price > onesixth*2 & price <= onesixth*3, 3, 
                                                                         ifelse(price > onesixth*3 & price <= onesixth*4, 4,
                                                                                ifelse(price > onesixth*4 & price <= onesixth*5, 5,
                                                                                       ifelse(price > onesixth*5 & price <= max(kc_house$price), 6, 0)))))))

# Aggregate the 5 predictors and merge into a final dataframe 
aa <- kc_house_poisson_flag %>% group_by(range) %>% summarise(range_count = n())
bb <- aggregate(cbind(bedrooms, bathrooms, floors) ~ range, data = kc_house_poisson_flag, FUN = mean, na.rm = TRUE) %>% round()
cc <- aggregate(cbind(sqft_living, sqft_lot) ~ range, data = kc_house_poisson_flag, FUN = mean, na.rm = TRUE) %>% round()
dd <- merge(merge(aa, bb), cc)

# Provide row names for the price ranges
rownames(dd) <- c("Range 78000 to 123810","Range 123810 to 247620","Range 247620 to 371430","Range 371430 to 495241","Range 495241 to 619051","Range 619051 to 3100000")

# Drop the range index column; keep the count and the 5 aggregated predictors
kc_house_poisson <- dd[,c(2,3,4,5,6,7)]
kc_house_poisson
##                         range_count bedrooms bathrooms floors sqft_living
## Range 78000 to 123810            63        2         1      1         848
## Range 123810 to 247620         1093        3         1      1        1099
## Range 247620 to 371430         1423        3         1      1        1225
## Range 371430 to 495241          980        3         1      1        1323
## Range 495241 to 619051          476        3         2      1        1563
## Range 619051 to 3100000         350        3         2      1        2003
##                         sqft_lot
## Range 78000 to 123810       9942
## Range 123810 to 247620     11382
## Range 247620 to 371430     11644
## Range 371430 to 495241     10599
## Range 495241 to 619051     15744
## Range 619051 to 3100000    17023
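
As an aside, the nested ifelse() chain used to create the buckets can be written more compactly with cut(); a sketch assuming the same right-closed break points (the last bucket runs to the maximum price):

# Sketch: equivalent bucketing via cut() with right-closed intervals
breaks <- c(min(kc_house$price), onesixth * (1:5), max(kc_house$price))
kc_house_poisson_flag <- kc_house %>%
  mutate(range = as.integer(cut(price, breaks = breaks, include.lowest = TRUE)))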

Fit Poisson model, \(model3\)

The Poisson regression model for the response, count per price range, is fit using glm() with a log link as,

model3 <- glm(range_count ~ bedrooms + bathrooms + floors + sqft_living + sqft_lot, family = poisson(link = "log"), kc_house_poisson)
summary.glm(model3)
## 
## Call:
## glm(formula = range_count ~ bedrooms + bathrooms + floors + sqft_living + 
##     sqft_lot, family = poisson(link = "log"), data = kc_house_poisson)
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.246e+00  3.915e-01  -3.182  0.00146 ** 
## bedrooms     2.942e+00  1.393e-01  21.116  < 2e-16 ***
## bathrooms   -1.461e+00  1.754e-01  -8.331  < 2e-16 ***
## floors              NA         NA      NA       NA    
## sqft_living -5.889e-04  1.217e-04  -4.840 1.30e-06 ***
## sqft_lot     1.475e-04  3.232e-05   4.563 5.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2118.86  on 5  degrees of freedom
## Residual deviance:   67.66  on 1  degrees of freedom
## AIC: 126
## 
## Number of Fisher Scoring iterations: 4

Note the AIC score. Also note that the floors coefficient is aliased (NA): after rounding, floors equals 1 in every bucket, so it is constant and cannot be estimated.
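
The aliasing can be confirmed programmatically; summary.glm() records which coefficients were dropped:

summary(model3)$aliased # floors is TRUE: constant (1) across the 6 buckets, so inestimable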

Check Adequacy of Poisson model (Tests of assumptions)

The adequacy of the Poisson regression model is checked by testing the goodness of fit using the deviance.

The hypothesis when using Deviance are,

\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.

deviance(model3)
## [1] 67.66024
pchisq(model3$deviance, df=model3$df.residual, lower.tail=FALSE) # model3$df.residual = 1
## [1] 1.942412e-16

The chi-square test statistic of 67.66024 with 1 degree of freedom gives a p-value of 1.942412e-16, so the null hypothesis is rejected. Hence, the Poisson model does not fit the data well.
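
A quick dispersion check, the residual deviance divided by the residual degrees of freedom, tells the same story; for an adequate Poisson fit this ratio should be near 1:

model3$deviance / model3$df.residual # 67.66 / 1, far above 1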

Let's fit a different model.

Model 4 (GLM model with log link and gamma distributed response)

A generalized linear model (GLM) with a gamma distributed response is used when the response variable is continuous and positively skewed. As such a response is not normally distributed, the mean of the response is related to the predictor variables through a log link function.

In the context of our kc_house data, the response, price, is right skewed, as shown by the histogram below,

hist(kc_house$price)

Hence, a log link will be used.
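
The skew can also be checked numerically: for a right-skewed variable the mean sits well above the median. A quick sketch,

mean(kc_house$price)   # pulled upward by the expensive tail
median(kc_house$price)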

Fit GLM model, \(model4\)

Let's fit the generalized linear model for the price response, assuming a gamma distribution, using glm() with a log link,

model4 <- glm(price~factor(bedrooms)+factor(bathrooms)+factor(floors)+sqft_living+sqft_lot,
            family=Gamma(link="log"),kc_house)
summary.glm(model4)
## 
## Call:
## glm(formula = price ~ factor(bedrooms) + factor(bathrooms) + 
##     factor(floors) + sqft_living + sqft_lot, family = Gamma(link = "log"), 
##     data = kc_house)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.222e+01  3.691e-02 330.975  < 2e-16 ***
## factor(bedrooms)2   2.848e-02  3.597e-02   0.792 0.428494    
## factor(bedrooms)3  -1.316e-01  3.646e-02  -3.609 0.000311 ***
## factor(bathrooms)2  3.151e-02  1.673e-02   1.884 0.059674 .  
## factor(bathrooms)3  5.481e-02  3.974e-02   1.379 0.167847    
## factor(floors)2     1.041e-01  2.326e-02   4.473 7.89e-06 ***
## factor(floors)3     1.874e-01  4.799e-02   3.904 9.61e-05 ***
## sqft_living         4.768e-04  1.681e-05  28.362  < 2e-16 ***
## sqft_lot           -4.642e-07  1.604e-07  -2.894 0.003822 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1531135)
## 
##     Null deviance: 932.60  on 4384  degrees of freedom
## Residual deviance: 640.31  on 4376  degrees of freedom
## AIC: 115678
## 
## Number of Fisher Scoring iterations: 5

The bathrooms predictor is not significant.

Variable selection

A test on each model parameter/predictor indicates which of the predictors have a significant effect on the response. The Type III analysis-of-deviance table (likelihood-ratio chi-square tests) is,

Anova(model4,type = 3)
## Analysis of Deviance Table (Type III tests)
## 
## Response: price
##                   LR Chisq Df Pr(>Chisq)    
## factor(bedrooms)    141.13  2  < 2.2e-16 ***
## factor(bathrooms)     3.84  2   0.146657    
## factor(floors)       30.73  2  2.124e-07 ***
## sqft_living         813.68  1  < 2.2e-16 ***
## sqft_lot              8.86  1   0.002919 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can remove the bathrooms predictor from the model as it is not significant at the 5% significance level.

Fit reduced model

Let's fit a reduced model by removing the bathrooms predictor,

reducedmodel4 <- glm(price~factor(bedrooms)+factor(floors)+sqft_living+sqft_lot,
                   family=Gamma(link="log"),kc_house)
summary.glm(reducedmodel4)
## 
## Call:
## glm(formula = price ~ factor(bedrooms) + factor(floors) + sqft_living + 
##     sqft_lot, family = Gamma(link = "log"), data = kc_house)
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.220e+01  3.618e-02 337.248  < 2e-16 ***
## factor(bedrooms)2  2.640e-02  3.590e-02   0.735 0.462232    
## factor(bedrooms)3 -1.311e-01  3.636e-02  -3.605 0.000316 ***
## factor(floors)2    1.154e-01  2.196e-02   5.256 1.54e-07 ***
## factor(floors)3    2.150e-01  4.512e-02   4.765 1.95e-06 ***
## sqft_living        4.950e-04  1.414e-05  34.994  < 2e-16 ***
## sqft_lot          -4.682e-07  1.601e-07  -2.924 0.003476 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.1528168)
## 
##     Null deviance: 932.6  on 4384  degrees of freedom
## Residual deviance: 640.9  on 4378  degrees of freedom
## AIC: 115678
## 
## Number of Fisher Scoring iterations: 5

Note the AIC score, which matches the full model's after rounding.

Model selection

Using p-value:

anova(model4,reducedmodel4, test="Chi")
## Analysis of Deviance Table
## 
## Model 1: price ~ factor(bedrooms) + factor(bathrooms) + factor(floors) + 
##     sqft_living + sqft_lot
## Model 2: price ~ factor(bedrooms) + factor(floors) + sqft_living + sqft_lot
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      4376     640.31                     
## 2      4378     640.90 -2 -0.58785   0.1467

p-value (0.1467) > 0.05, hence removing the bathrooms terms does not significantly worsen the fit, and the simpler \(reducedmodel4\) is adequate.

Using AIC scores: - Both \(reducedmodel4\) and \(model4\) print the same rounded AIC (115678), so AIC does not distinguish them.

As per the anova table's p-value test, the parsimonious \(reducedmodel4\) is preferred.
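
AIC can also be extracted to more digits than the rounded summaries print; a sketch assuming both fits are in scope:

AIC(model4, reducedmodel4) # both print as 115678 when rounded in summary()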

Check Adequacy of fitted GLM model (Tests of assumptions)

The adequacy of the fitted GLM is checked by testing the goodness of fit using the deviance.

The hypothesis when using Deviance are,

\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.

deviance(model4)
## [1] 640.3131
pchisq(model4$deviance, df=model4$df.residual, lower.tail=FALSE) # model4$df.residual = 4376
## [1] 1

The chi-square test statistic of 640.3131 with 4376 degrees of freedom gives a p-value of 1, indicating that the null hypothesis is plausible. Hence, this GLM with gamma distributed response fits the data well. (The same holds for \(reducedmodel4\): residual deviance 640.9 on 4378 degrees of freedom, also giving a p-value of 1.)
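
As an additional informal check, the deviance residuals can be plotted against the fitted values; the absence of a strong pattern supports the gamma/log-link specification (a sketch):

# Sketch: deviance residuals vs fitted values for the gamma GLM
plot(fitted(model4), residuals(model4, type = "deviance"),
     xlab = "Fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)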

Detailed conclusion on the most appropriate model including context

Let's summarize the 4 finalized models:

Model 1 - Linear Regression Model:

Model 2 - Logistic Regression Model:

Model 3 - Poisson Regression Model:

Model 4 - GLM Regression Model with gamma distributed response:

From this summary, the Poisson regression model \(model3\) fails its deviance goodness-of-fit test and does not fit our House Sales data, so it is rejected.

Turning to the remaining 3 models: the linear regression model \(Model1\) has a log-transformed response, the logistic regression model \(reducedmodel2a\) has a binary (dichotomized price) response, and the GLM \(model4\) has a gamma distributed response. Because the response types differ, no direct statistical test can compare these models.

Fitting \(reducedmodel2a\) felt forced, as our response (price) did not come in the natural form of successes and failures. Recall also that we had to take the mean of the sqft_living predictor, which is a crude way of creating data. For these reasons, the logistic model, although significant, does not feel like the best choice.

Choosing between \(model1\) and the gamma GLM is difficult, but with the context of the House Sales data in mind, a GLM with a gamma distributed response seems most apt, as the response (price) has a right-skewed distribution. Recall also that the normality assumption on the errors/residuals was violated by \(model1\).

Final Analysis Conclusion

Using multiple model and variable selection methods, 3 significant models were finalized, each with a different response type. A GLM with a log link for the gamma distributed response was chosen as the best fit in context with the data in hand (the response was right skewed), and it fits adequately based on the deviance goodness-of-fit statistic (p-value of 1). The bathrooms predictor is insignificant, and the test between the full and reduced models (p-value = 0.1467) shows no significant loss of fit from dropping it, so the parsimonious \(reducedmodel4\), without bathrooms, is retained as the final model.

References

Kaggle (2016) House Sales in King County, USA. Predict house price using regression, Kaggle website, accessed 25th May 2023. https://www.kaggle.com/datasets/harlfoxem/housesalesprediction

GeoDa (2020) 2014-15 Home Sales in King County, WA, Geodacenter github website, accessed 2nd June 2023. https://geodacenter.github.io/data-and-lab//KingCounty-HouseSales2015/

GIS (2023) Zipcodes for King County and Surrounding Area (Shorelines) / zipcode shore area, GIS website, accessed 2nd June 2023. https://gis-kingcounty.opendata.arcgis.com/datasets/zipcodes-for-king-county-and-surrounding-area-shorelines-zipcode-shore-area/explore

CSDS (2020) The Center for Spatial Data Science, The University of Chicago, University of Chicago website, accessed 2nd June 2023. https://spatial.uchicago.edu/