September 06, 2024

library(magrittr) # pipe operator
library(car) # Anova()
library(GGally) #ggpairs()
library(lmtest) # dwtest()
library(olsrr) # ols_step_best_subset()
library(dplyr) # mutate()
A significance level \(\alpha=5\%\) is used.
The public records of home sales from May 2014 through May 2015 in King County, Washington State, USA, are posted on Kaggle (2016) (also posted on GeoDa (2020)) with the intention of predicting house prices using regression.
The dataset holds 21 columns and 21,613 records/rows, of which a filtered subset of 6 columns and 4,385 records will be used. Only 1 response variable exists, price. Out of many possible regressor variables, only 5 are chosen for simplicity.
The 6 columns used are price, bedrooms, bathrooms, sqft_living, sqft_lot, and floors where price is our response variable and the remaining 5 are regressors.
How was the data filtered? Only records with 1, 2, or 3 bedrooms, bathrooms, and floors were kept, thus reducing the data count.
Column descriptions:
For all 21 columns, refer to GeoDa (2020). For the chosen 6 columns,
price - Price of each home sold
bedrooms - Number of bedrooms
bathrooms - Number of bathrooms, where 0.5 accounts for a room with a toilet but no shower
sqft_living - Square footage of the apartment's interior living space
sqft_lot - Square footage of the land space
floors - Number of floors
No collection method has been described other than for the zip codes of King County, which were retrieved from King County GIS Open Data, GIS (2023).
The House Sales datasets are provided as direct csv downloads and can be downloaded as follows,
Option 1: Direct download from Kaggle (2016)
Option 2: Direct download from the GeoDa (2020) webpage prepared by the Center for Spatial Data Science, CSDS (2020). Or, by navigating the CSDS (2020) website: open the website, click on the Data tab, filter by the Housing category, and choose 2014-2015 Home Sales in King County, WA.
Our aim for the House Sales data is to determine a relevant regression model that predicts the price of house sales using at least one of the 5 regressors. This is to be achieved by fitting different significant regression models through model selection and variable selection methods. The relevance of the finalized models in the context of the housing data will be stated.
kc_house_raw <- read.csv("C:/Users/admin/Downloads/aa/kaggle/kc_house_data - Copy.csv") %>% dplyr::select(3:8)
print(paste("Row count of raw dataset -", nrow(kc_house_raw)))
## [1] "Row count of raw dataset - 21613"
# Filter data: keep only homes with 1, 2 or 3 bedrooms, bathrooms and floors.
kc_house <- kc_house_raw %>% dplyr::filter(bedrooms %in% c(1,2,3), bathrooms %in% c(1,2,3), floors %in% c(1,2,3))
print(paste("Row count of finalized dataset -", nrow(kc_house)))
## [1] "Row count of finalized dataset - 4385"
head(kc_house)
## price bedrooms bathrooms sqft_living sqft_lot floors
## 1 221900 3 1 1180 5650 1
## 2 180000 2 1 770 10000 1
## 3 510000 3 2 1680 8080 1
## 4 229500 3 1 1780 7470 1
## 5 468000 2 1 1160 6000 1
## 6 395000 3 2 1890 14040 2
For fitting the linear regression model, the response is price and the regressor variables are bedrooms, bathrooms, sqft_living, sqft_lot, and floors, each described as follows,
y = price = Price of each home sold
x1 = bedrooms = Number of bedrooms
x2 = bathrooms = Number of bathrooms
x3 = sqft_living = Square footage of the apartment's interior living space
x4 = sqft_lot = Square footage of the land space
x5 = floors = Number of floors
Note, bedrooms, bathrooms, and floors are categorical variables, each holding values 1, 2 and 3. sqft_living and sqft_lot are continuous variables. The response, price, is continuous as well.
Let's first fit a model for price (y) with all 5 variables using lm().
full= lm(price ~ factor(bedrooms) + factor(bathrooms) + sqft_living + sqft_lot + factor(floors), data=kc_house)
Let's see if the 5 regressor variables have significant relationships between themselves. If present, we say the variables are correlated and there is presence of multicollinearity. Multicollinearity impacts the estimates of the individual regression coefficients (it increases the variance/covariance of the least squares estimates). Multicollinearity in the data can be checked using the correlation coefficients or using the Variance Inflation Factor (VIF) statistic.
In the context of the kc_house data, we expect correlations between all 5 variables. For example, we generally expect houses with more floors to have more bedrooms and bathrooms. When the lot size is larger, we generally expect the living area to be larger, and vice versa. Similar statements can be made for every other variable pair. If the relationships are found to be significant or relevant, then we can introduce interactions between the variables as additional parameters of the fitted model. Let's now check for the presence of multicollinearity.
Let's plot the correlation coefficients between each pair of predictors,
ggpairs(data = kc_house, columns = c(1,2,3,4,5,6), progress = FALSE) # library(GGally)
As shown in the matrix above,
- bedrooms and bathrooms are correlated by 0.316 (correlation coefficient); bedrooms and sqft_living by 0.422.
- bathrooms and sqft_living by 0.645; bathrooms and floors by 0.479.
- sqft_living and floors by 0.305.
- All other correlations are negligible.
Overall, there is indication of multicollinearity.
To test multicollinearity statistically, we can use the Variance Inflation Factor (VIF). VIF can be viewed as the factor by which the variance of a coefficient is increased due to multicollinearity.
car::vif(full)
## GVIF Df GVIF^(1/(2*Df))
## factor(bedrooms) 1.253801 2 1.058174
## factor(bathrooms) 2.172852 2 1.214109
## sqft_living 2.029398 1 1.424569
## sqft_lot 1.029082 1 1.014437
## factor(floors) 1.442439 2 1.095909
Since we have categorical variables, Generalized Variance Inflation Factors (GVIF) are given, along with the standardized GVIF (GVIF^(1/(2xDf))). Since all GVIF values are < 5, multicollinearity is not significant (a VIF between 5 and 10 is generally considered to indicate a problem).
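As a quick cross-check, the standardized GVIF can be squared to put it on the same scale as an ordinary VIF; a minimal sketch, assuming the usual 5-10 rule of thumb applies after squaring,
gvif <- car::vif(full)
gvif[, "GVIF^(1/(2*Df))"]^2 # squared standardized GVIF, comparable to VIF; all well below 5 here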
Let's get the summary statistics of the kc_house dataset,
summary(kc_house)
## price bedrooms bathrooms sqft_living
## Min. : 78000 Min. :1.000 Min. :1.00 Min. : 390
## 1st Qu.: 242500 1st Qu.:2.000 1st Qu.:1.00 1st Qu.: 970
## Median : 333000 Median :3.000 Median :1.00 Median :1200
## Mean : 371431 Mean :2.563 Mean :1.34 Mean :1309
## 3rd Qu.: 450000 3rd Qu.:3.000 3rd Qu.:2.00 3rd Qu.:1510
## Max. :3100000 Max. :3.000 Max. :3.00 Max. :4910
## sqft_lot floors
## Min. : 600 Min. :1.000
## 1st Qu.: 5000 1st Qu.:1.000
## Median : 7208 Median :1.000
## Mean : 12195 Mean :1.128
## 3rd Qu.: 9568 3rd Qu.:1.000
## Max. :1164794 Max. :3.000
sd(kc_house$price) # Standard deviation
## [1] 193926
For the response, price:
Since we are generating a regression model which estimates the response, price, let's focus on price's statistics.
mean and median:
Since the mean and median are not close, the distribution is likely to be skewed. The mean price of houses sold is 371,431 USD. The median is 333,000 USD.
max and min:
Since the max (3,100,000) is farther from the mean (371,431) than the min (78,000), the distribution will be right-skewed.
standard deviation:
A standard deviation of 193,926 USD suggests the data is spread quite a lot in relation to the mean price.
Now let's plot the histogram of the prices of houses sold and see if the summary statistics are justified,
hist(kc_house$price, xlab = 'Price of house sold (USD)', main = "Histogram of sold house's price" )
As we expected, the spread/distribution of the price response is right-skewed.
Comment: Transformation/s (Box-Cox) on the response variable might be necessary to satisfy the normality assumption required by least squares parameter estimation for linear model fitting.
For the regressor variables:
Let's look at the scatter plots between the response (price) and the continuous variables, sqft_living and sqft_lot.
par(mfrow=c(1,2))
plot(kc_house$sqft_lot, kc_house$price,
xlab = "Size of lot", ylab = "Price of house sold (USD)",
main="Price vs lot size")
plot(kc_house$sqft_living, kc_house$price,
xlab = "Size of living area", ylab = "Price of house sold (USD)",
main="Price vs living area size")
print("Correlation coefficients between, ")
## [1] "Correlation coefficients between, "
print(paste("Price and lot size is :", cor(kc_house$price,kc_house$sqft_lot)))
## [1] "Price and lot size is : 0.0459611411386067"
print(paste("Price and living area is :", cor(kc_house$price,kc_house$sqft_living)))
## [1] "Price and living area is : 0.542327732890229"
print(paste("Price and number of bedrooms is :", cor(kc_house$price,kc_house$bedrooms)))
## [1] "Price and number of bedrooms is : 0.0935197418746122"
print(paste("Price and number of bathrooms is :", cor(kc_house$price,kc_house$bathrooms)))
## [1] "Price and number of bathrooms is : 0.388467122389912"
print(paste("Price and number of floors is :", cor(kc_house$price,kc_house$floors)))
## [1] "Price and number of floors is : 0.264817238298091"
From the 2 scatter plots, as the size of the lot or living area increases, price increases, suggesting a linear relationship.
Summarizing the outputs above,
- A moderate positive linear relationship (0.54) exists between the response, price (y), and the regressor, sqft_living (x3).
- A low positive linear relationship (0.04) exists between the response, price (y), and the regressor, sqft_lot (x4).
- Positive correlation coefficients of 0.09, 0.38 and 0.26 for the categorical regressors, number of bedrooms (x1), bathrooms (x2) and floors (x5), suggest a linear relationship with the response as well. (Scatter plots not shown for simplicity.)
Thus, since all the regressors exhibit a linear relationship with the response variable, fitting a linear regression model seems apt.
Recall the fitted linear regression model, full, where price is the continuous response and sqft_living and sqft_lot are continuous predictors. bedrooms, bathrooms and floors are categorical variables and are hence factorized when fed into the model.
full= lm(price ~ factor(bedrooms) + factor(bathrooms) + sqft_living + sqft_lot + factor(floors), data=kc_house)
summary(full)
##
## Call:
## lm(formula = price ~ factor(bedrooms) + factor(bathrooms) + sqft_living +
## sqft_lot + factor(floors), data = kc_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -492457 -101938 -21165 78696 2081165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.261e+05 1.495e+04 8.435 < 2e-16 ***
## factor(bedrooms)2 -3.973e+03 1.456e+04 -0.273 0.78500
## factor(bedrooms)3 -6.908e+04 1.476e+04 -4.680 2.95e-06 ***
## factor(bathrooms)2 3.482e+02 6.773e+03 0.051 0.95901
## factor(bathrooms)3 7.083e+04 1.609e+04 4.402 1.10e-05 ***
## sqft_living 2.152e+02 6.807e+00 31.610 < 2e-16 ***
## sqft_lot -1.989e-01 6.493e-02 -3.063 0.00220 **
## factor(floors)2 5.022e+04 9.419e+03 5.332 1.02e-07 ***
## factor(floors)3 5.573e+04 1.943e+04 2.868 0.00415 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 158400 on 4376 degrees of freedom
## Multiple R-squared: 0.3338, Adjusted R-squared: 0.3326
## F-statistic: 274.1 on 8 and 4376 DF, p-value: < 2.2e-16
summary() suggests the linear regression model full is significant at the 5% significance level, as the p-value (< 2.2e-16) is below 0.05. The R-squared suggests 33.38% of the variability in price is explained by the model.
The fitted model equation using all 5 variables is,
\[y= 126061.8 - 3973.2*bedrooms2 - 69077.7*bedrooms3 + 348.1*bathrooms2 + 70825.4*bathrooms3 + 215.1*sqft_living - 0.2*sqft_lot + 50222*floors2 + 55727.1*floors3 + \varepsilon\] \[or\] \[\hat y= 126061.8 - 3973.2*bedrooms2 - 69077.7*bedrooms3 + 348.1*bathrooms2 + 70825.4*bathrooms3 + 215.1*sqft_living - 0.2*sqft_lot + 50222*floors2 + 55727.1*floors3\]
Next, let's perform residual analysis to check whether any model assumptions have been violated.
The estimated error (or residual) is defined by:
\(\hat{\epsilon}_i\) = \(Y_i\) - \(\hat{Y}_i\) (i.e., observed value minus fitted value)
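As a quick sanity check of this definition, the residuals can be computed by hand and compared with R's residuals() (a minimal sketch),
head(kc_house$price - fitted(full)) # observed minus fitted
head(residuals(full)) # should match the line above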
Residual checks are done by plotting error/residual plots, which can reveal violations of the model assumptions:
Residual plots:
par(mfrow=c(2,2))
plot(full)
1. Residuals vs Fitted
mean(rstudent(full))
## [1] 0.0003090681
The mean is very close to zero. Hence, the zero-mean assumption is not violated, that is, \(E(\epsilon_i) = 0\).
2. Normal Q-Q
3. Scale-Location
It is difficult to judge the residuals' behavior with certainty from the plots, so we need to test the assumptions statistically.
1- Non-constant error variance test
\(H_0\): Errors have a constant variance
\(H_1\): Errors have a non-constant variance
ncvTest(full) #library(car)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2846.35, Df = 1, p = < 2.22e-16
Since the p-value is \(<\) \(0.05\), we have enough evidence to reject \(H_0\). This implies that the constant error variance assumption is violated by the \(full\) model.
2- Test for Normally Distributed Errors
\(H_0\): Errors are normally distributed
\(H_1\): Errors are not normally distributed
shapiro.test(full$residuals) # library(stats)
##
## Shapiro-Wilk normality test
##
## data: full$residuals
## W = 0.88595, p-value < 2.2e-16
Since the p-value is < 0.05, we reject \(H_0\). This implies that the error normality assumption is violated.
3- Test for Autocorrelated Errors
\(H_0\): Errors are uncorrelated
\(H_1\): Errors are correlated
durbinWatsonTest(full) # library(car)
## lag Autocorrelation D-W Statistic p-value
## 1 0.02738716 1.945155 0.086
## Alternative hypothesis: rho != 0
Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\). This implies that the uncorrelated errors assumption is not violated.
dwtest(formula = full, alternative = "two.sided") #library(lmtest)
##
## Durbin-Watson test
##
## data: full
## DW = 1.9452, p-value = 0.06847
## alternative hypothesis: true autocorrelation is not 0
Since the p-value obtained from the Durbin-Watson test is not significant (d = 1.9452, p = 0.06847), we fail to reject the null hypothesis. This implies that the uncorrelated errors assumption is not violated.
acf(full$residuals)
The correlation at lag 0 is always 1 (each error term correlated with itself) and is not informative; we are interested in autocorrelation at lags 1, 2, 3, etc. Although a few bars exceed the significance level (dashed blue line), they are mostly late lags (lags 6, 19, 25) or are borderline significant. Thus, the uncorrelated errors assumption is not violated.
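As a supplementary check, a Ljung-Box test at lag 1 can be run on the residuals (a sketch using the base stats function Box.test()),
Box.test(full$residuals, lag = 1, type = "Ljung-Box") # H0: no autocorrelation at lag 1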
4. Residuals vs Leverage
This plot helps us find influential observations, if any. We have noticed several unusual observations. But do they influence the regression results significantly and form outliers?
We notice no observations situated outside the 0.5 (or 1) dashed line, i.e., a Cook's distance of 0.5 (or 1). This means no observations have significant influence on the regression results. Hence, no influential points are found for the full regression.
Summarizing residual analysis on the \(full\) model:
Assumption 1: The error terms are randomly distributed and thus show linearity: Not violated
Assumption 2: The mean value of E is zero (zero-mean residuals): Not violated
Assumption 3: The variance of E is constant, i.e. the errors are homoscedastic: Violated
Assumption 4: The error terms are independently distributed, i.e. they are not autocorrelated: Not violated
Assumption 5: The errors are normally distributed: Violated
Since normality and homoscedasticity are violated, we need to apply appropriate transformations. Before that, let's see if any subset of the full model gives a better fit.
Using ols_step_best_subset():
The best subsets of the regression model are obtained using ols_step_best_subset(full),
best_sub<-ols_step_best_subset(full) # library(olsrr)
best_sub
## Best Subsets Regression
## -------------------------------------------------------------------------------------
## Model Index Predictors
## -------------------------------------------------------------------------------------
## 1 sqft_living
## 2 factor(bedrooms) sqft_living
## 3 factor(bedrooms) sqft_living factor(floors)
## 4 factor(bedrooms) factor(bathrooms) sqft_living factor(floors)
## 5 factor(bedrooms) factor(bathrooms) sqft_living sqft_lot factor(floors)
## -------------------------------------------------------------------------------------
##
## Subsets Regression Summary
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.2941 0.2940 0.2923 255.8710 117698.5386 105254.2245 117717.6964 1.164319e+14 26564423277.2510 6059405.0110 0.7065
## 2 0.3184 0.3179 0.3162 98.4660 117549.1402 105102.9245 117581.0699 1.124549e+14 25668759104.1741 5855103.8155 0.6825
## 3 0.3287 0.3279 0.3257 32.9945 117486.5592 105038.4228 117531.2608 1.107856e+14 25299264756.3205 5770823.6946 0.6726
## 4 0.3324 0.3313 0.3282 10.3825 117466.0181 105015.9263 117523.4916 1.101925e+14 25175300720.6660 5742550.1844 0.6691
## 5 0.3338 0.3326 0.3292 3.0000 117458.6264 105008.5601 117522.4858 1.099818e+14 25132902808.3387 5732882.6913 0.6680
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
plot(best_sub)
Using the plots, model 5, i.e. the \(full\) model having all 5 predictors, is the best model as per multiple statistical measures.
Using ols_step_all_possible():
All possible sets of models having different combinations of predictors are given by ols_step_all_possible(full),
all_models<-ols_step_all_possible(full)
plot(all_models)
Using the plots above, again, the \(full\) model is best as per the Adjusted R2, R2, Cp, AIC, SBIC and SBC statistics.
As the response, price, is not normally distributed per the Shapiro-Wilk test (also seen from the right-skewed histogram), let's check if a Box-Cox transformation helps improve normality. Let's calculate lambda (\(\lambda\)) to find which transformation is best for our response variable,
bc <- MASS::boxcox(full)
trans <- bc$x[which.max(bc$y)]
lambda <- round(trans,1)
lambda
## [1] 0.1
As \(\lambda\) (0.1) is close to 0, a log transformation can be chosen. Let's fit the log-transformed model (\(model1\)) and check the assumptions.
The linear model using all 5 predictors, with the response, price, log-transformed, is,
model1 = lm(log(price) ~ factor(bedrooms) + factor(bathrooms) + sqft_living + sqft_lot + factor(floors), data=kc_house) #best model till now
summary(model1)
##
## Call:
## lm(formula = log(price) ~ factor(bedrooms) + factor(bathrooms) +
## sqft_living + sqft_lot + factor(floors), data = kc_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27182 -0.27520 0.00848 0.27168 1.22528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.214e+01 3.629e-02 334.700 < 2e-16 ***
## factor(bedrooms)2 3.894e-02 3.536e-02 1.101 0.27084
## factor(bedrooms)3 -1.156e-01 3.584e-02 -3.226 0.00127 **
## factor(bathrooms)2 5.283e-02 1.645e-02 3.212 0.00133 **
## factor(bathrooms)3 6.403e-02 3.906e-02 1.639 0.10126
## sqft_living 4.608e-04 1.653e-05 27.885 < 2e-16 ***
## sqft_lot -4.176e-07 1.577e-07 -2.649 0.00810 **
## factor(floors)2 1.025e-01 2.287e-02 4.480 7.64e-06 ***
## factor(floors)3 2.302e-01 4.718e-02 4.880 1.10e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3846 on 4376 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2806
## F-statistic: 214.7 on 8 and 4376 DF, p-value: < 2.2e-16
The log-transformed \(model1\) is significant, as the p-value (< 2.2e-16) is below 0.05. The R-squared suggests 28.19% of the variability in log(price) is explained by the model. The model equation is,
\[log(y)= 12.14 + 0.0389*bedrooms2 - 0.1156*bedrooms3 + 0.0528*bathrooms2 + 0.064*bathrooms3 + 4.608e-04*sqft_living - 4.176e-07*sqft_lot + 0.1025*floors2 + 0.2302*floors3 + \varepsilon\] \[or\] \[log(\hat y)= 12.14 + 0.0389*bedrooms2 - 0.1156*bedrooms3 + 0.0528*bathrooms2 + 0.064*bathrooms3 + 4.608e-04*sqft_living - 4.176e-07*sqft_lot + 0.1025*floors2 + 0.2302*floors3\]
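As an illustration, predictions from model1 can be mapped back to the dollar scale by exponentiating; a minimal sketch, where the example house below is hypothetical and exp() of a log-scale prediction gives a median-type estimate rather than the mean,
new_house <- data.frame(bedrooms = 3, bathrooms = 2, sqft_living = 1500, sqft_lot = 8000, floors = 1) # hypothetical house
exp(predict(model1, newdata = new_house)) # predicted price in USD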
Let's perform residual analysis using residual plots for \(model1\) to check if any of the model assumptions have been violated.
Residual plots:
par(mfrow=c(2,2))
plot(model1)
1. Residuals vs Fitted
2. Normal Q-Q
3. Scale-Location
It is difficult to judge the residuals' behavior with certainty from the plots, so we need to test the assumptions statistically.
1- Non-constant error variance test
\(H_0\): Errors have a constant variance
\(H_1\): Errors have a non-constant variance
ncvTest(model1) #library(car)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.09914357, Df = 1, p = 0.75286
Since the p-value is \(>\) \(0.05\), we fail to reject the null hypothesis \(H_0\). This implies that the constant error variance assumption is not violated by \(model1\).
2- Test for Normally Distributed Errors
\(H_0\): Errors are normally distributed
\(H_1\): Errors are not normally distributed
shapiro.test(model1$residuals) # library(stats)
##
## Shapiro-Wilk normality test
##
## data: model1$residuals
## W = 0.99842, p-value = 0.0002386
Since the p-value is < 0.05, we reject \(H_0\). This implies that the error normality assumption is violated.
3- Test for Autocorrelated Errors
\(H_0\): Errors are uncorrelated
\(H_1\): Errors are correlated
durbinWatsonTest(model1) # library(car)
## lag Autocorrelation D-W Statistic p-value
## 1 0.02495699 1.94998 0.102
## Alternative hypothesis: rho != 0
Since the p-value is \(>\) \(0.05\), we do not have enough evidence to reject \(H_0\). This implies that the uncorrelated errors assumption is not violated.
dwtest(formula = model1, alternative = "two.sided") #library(lmtest)
##
## Durbin-Watson test
##
## data: model1
## DW = 1.95, p-value = 0.09649
## alternative hypothesis: true autocorrelation is not 0
Since the p-value obtained from the Durbin-Watson test is not significant (d = 1.95, p = 0.09649), we fail to reject the null hypothesis. This implies that the uncorrelated errors assumption is not violated.
acf(model1$residuals)
No significantly correlated errors are seen at any lag. Thus, the uncorrelated errors assumption is not violated.
Summarizing residual analysis on the \(model1\) model:
Assumption 1: The error terms are randomly distributed and thus show linearity: Not violated
Assumption 2: The mean value of E is zero (zero-mean residuals): Not violated
Assumption 3: The variance of E is constant, i.e. the errors are homoscedastic: Not violated
Assumption 4: The error terms are independently distributed, i.e. they are not autocorrelated: Not violated
Assumption 5: The errors are normally distributed: Violated
To assess the overall fit of the model, lack of fit is checked by assessing the significance of the model (having all 5 predictors) using the test statistic \(F_0\) given by,
\(F_0 = \frac{SS_{LOF}/(m-2)}{SS_{PE}/(n-m)} = \frac{MS_{LOF}}{MS_{PE}}\)
(i.e., the mean square of lack of fit vs pure error)
A significance test using the F-statistic or the p-value from the ANOVA table will be performed on the regression. Let's state the hypotheses for the significance test,
Null hypothesis; \(H_0\): \(\beta_1 = \beta_2 = \dots = \beta_8 = 0\)
Alternate hypothesis; \(H_1\): at least one \(\beta_j \not = 0\).
Let's generate the ANOVA table using anova(),
ANOVA <- anova(model1)
ANOVA
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(bedrooms) 2 8.19 4.096 27.6862 1.126e-12 ***
## factor(bathrooms) 2 121.18 60.590 409.5278 < 2.2e-16 ***
## sqft_living 1 118.04 118.042 797.8428 < 2.2e-16 ***
## sqft_lot 1 1.16 1.160 7.8427 0.005125 **
## factor(floors) 2 5.58 2.790 18.8582 7.000e-09 ***
## Residuals 4376 647.43 0.148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We cannot rely on this output directly, as we need to test all the regressors together to test the significance of the regression. We need to sum up the predictors' sums of squares and calculate the p-value for them, as shown below,
# Manually
regression<- ANOVA$`Sum Sq`[1]+ ANOVA$`Sum Sq`[2]+ ANOVA$`Sum Sq`[3] + ANOVA$`Sum Sq`[4]+ ANOVA$`Sum Sq`[5]
Mean_Sq<-regression/(ANOVA$Df[1]+ANOVA$Df[2]+ANOVA$Df[3]+ANOVA$Df[4]+ANOVA$Df[5])
F_value<-Mean_Sq/ANOVA$`Mean Sq`[6]
pf(F_value, 8, ANOVA$Df[6], lower.tail = FALSE) # gives p-value 8.552035e-308, consistent with < 2.2e-16 from summary(model1)
## [1] 8.552035e-308
# using anova()
null <- lm(log(price)~1,kc_house) # comparing model1 against the intercept-only (null) model gives the overall F test
anova(null,model1) # p-value same as that from summary()
## Analysis of Variance Table
##
## Model 1: log(price) ~ 1
## Model 2: log(price) ~ factor(bedrooms) + factor(bathrooms) + sqft_living +
## sqft_lot + factor(floors)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4384 901.59
## 2 4376 647.43 8 254.16 214.73 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (< 2.2e-16) is < 0.05, the F-statistic is significant, implying that the transformed linear regression model (\(model1\)) is significant.
Additionally, Anova(type = 3) tests each predictor's significance (pooling the levels of each factor), similar to summary(), but it does not give the overall model significance here.
car::Anova(model1, type="III")
## Anova Table (Type III tests)
##
## Response: log(price)
## Sum Sq Df F value Pr(>F)
## (Intercept) 16574.1 1 1.1202e+05 < 2.2e-16 ***
## factor(bedrooms) 20.0 2 6.7556e+01 < 2.2e-16 ***
## factor(bathrooms) 1.5 2 5.1930e+00 0.005589 **
## sqft_living 115.0 1 7.7760e+02 < 2.2e-16 ***
## sqft_lot 1.0 1 7.0175e+00 0.008100 **
## factor(floors) 5.6 2 1.8858e+01 7e-09 ***
## Residuals 647.4 4376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For model1, all 5 predictors and the intercept are significant at 0.05 significance level.
Let's test the significance of the predictor coefficients, \(\beta_1\) to \(\beta_8\), of the variables factor(bedrooms)2, factor(bedrooms)3, etc. The significance of the intercept \(\beta_0\) will also be checked.
Recall t-values for the coefficients are,
summary(model1)
##
## Call:
## lm(formula = log(price) ~ factor(bedrooms) + factor(bathrooms) +
## sqft_living + sqft_lot + factor(floors), data = kc_house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27182 -0.27520 0.00848 0.27168 1.22528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.214e+01 3.629e-02 334.700 < 2e-16 ***
## factor(bedrooms)2 3.894e-02 3.536e-02 1.101 0.27084
## factor(bedrooms)3 -1.156e-01 3.584e-02 -3.226 0.00127 **
## factor(bathrooms)2 5.283e-02 1.645e-02 3.212 0.00133 **
## factor(bathrooms)3 6.403e-02 3.906e-02 1.639 0.10126
## sqft_living 4.608e-04 1.653e-05 27.885 < 2e-16 ***
## sqft_lot -4.176e-07 1.577e-07 -2.649 0.00810 **
## factor(floors)2 1.025e-01 2.287e-02 4.480 7.64e-06 ***
## factor(floors)3 2.302e-01 4.718e-02 4.880 1.10e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3846 on 4376 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2806
## F-statistic: 214.7 on 8 and 4376 DF, p-value: < 2.2e-16
For \(\beta_1\):
For the coefficient \(\beta_1\) of bedrooms2, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_1\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_1 \not = 0\)
And the test statistic is:
\(t = \frac {\hat{\beta_j} - \beta_j} {se\left(\hat{\beta_j}\right)}\)
We reject \(H_0\) if \(|t|\) > \(t_{\alpha/2, n-p}\) or by p-value.
The t-critical value for a two-tailed t-test with n-p (= n-k-1) degrees of freedom is ±1.960506,
qt(p = .025, df = length(kc_house$price) - 8 - 1)
## [1] -1.960506
From summary(model1), the t-statistic (1.101) is closer to 0 than the t-critical value (±1.960506), hence the null hypothesis is not rejected. Thus, the coefficient \(\beta_1\) is insignificant for the regression at the 5% significance level.
For \(\beta_2\) :
For the coefficient \(\beta_2\) of bedrooms3, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_2\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_2 \not = 0\)
The t-statistic (-3.226) is farther from 0 than the t-critical value (±1.960506), hence the null hypothesis is rejected and \(\beta_2\) is not zero. Thus, the coefficient \(\beta_2\) is significant for the regression.
For \(\beta_3\) :
For the coefficient \(\beta_3\) of bathrooms2, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_3\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_3 \not = 0\)
The t-statistic (3.212) is farther from 0 than the t-critical value (±1.960506). The coefficient \(\beta_3\) is significant for the regression.
For \(\beta_4\) :
For the coefficient \(\beta_4\) of bathrooms3, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_4\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_4 \not = 0\)
The t-statistic (1.639) is closer to 0 than the t-critical value (±1.960506). The coefficient \(\beta_4\) is insignificant for the regression.
For \(\beta_5\) :
For the coefficient \(\beta_5\) of sqft_living, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_5\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_5 \not = 0\)
The t-statistic (27.885) is farther from 0 than the t-critical value (±1.960506). The coefficient \(\beta_5\) is significant for the regression.
For \(\beta_6\) :
For the coefficient \(\beta_6\) of sqft_lot, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_6\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_6 \not = 0\)
The t-statistic (-2.649) is farther from 0 than the t-critical value (±1.960506). The coefficient \(\beta_6\) is significant for the regression.
For \(\beta_7\) :
For the coefficient \(\beta_7\) of floors2, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_7\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_7 \not = 0\)
The t-statistic (4.480) is farther from 0 than the t-critical value (±1.960506). The coefficient \(\beta_7\) is significant for the regression.
For \(\beta_8\) :
For the coefficient \(\beta_8\) of floors3, the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_8\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_8 \not = 0\)
The t-statistic (4.880) is farther from 0 than the t-critical value (±1.960506). The coefficient \(\beta_8\) is significant for the regression.
For the intercept \(\beta_0\), the hypotheses for the test are,
Null hypothesis: \(H_0\) : \(\beta_0\) = 0
Alternate hypothesis: \(H_a\) : \(\beta_0 \not = 0\)
The t-statistic (334.700) is farther from 0 than the t-critical value (±1.960506), hence the intercept is significant at the 5% level.
The fitted model equation is,
\(log(y)\) = \(\beta_0\) + \(\beta_1\)x1 + \(\beta_2\)x2 + \(\beta_3\)x3 + \(\beta_4\)x4 + \(\beta_5\)x5 + \(\beta_6\)x6 + \(\beta_7\)x7 + \(\beta_8\)x8 + \(\varepsilon\), i.e.,
\[log(y)= 12.14 + 0.0389*bedrooms2 - 0.1156*bedrooms3 + 0.0528*bathrooms2 + 0.064*bathrooms3 + 4.608e-04*sqft_living - 4.176e-07*sqft_lot + 0.1025*floors2 + 0.2302*floors3 + \varepsilon\] \[or\] \[log(\hat y)= 12.14 + 0.0389*bedrooms2 - 0.1156*bedrooms3 + 0.0528*bathrooms2 + 0.064*bathrooms3 + 4.608e-04*sqft_living - 4.176e-07*sqft_lot + 0.1025*floors2 + 0.2302*floors3\]
Where the coefficients are interpreted as,
\(B_0\) (12.14) :- When sqft_living & sqft_lot are set to 0 and the dummies bedrooms2 & bedrooms3 are set to 0 (representing a 1-bedroom house), bathrooms2 & bathrooms3 are set to 0 (representing a 1-bathroom house), and floors2 & floors3 are set to 0 (representing a 1-floor house), the mean of log(price) is equal to \(B_0\).
\(B_1\) (0.0389) :- When sqft_living, sqft_lot, bathrooms and floors are held fixed and bedrooms changes from 1 to 2 (bedrooms2 = 1, bedrooms3 = 0), the mean of log(price) increases by \(B_1\).
In similar fashion, \(B_2\) to \(B_8\) can be interpreted.
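Since model1 is a log-level model, each coefficient can also be read as an approximate percentage change in price; a sketch, assuming the usual 100*(exp(b) - 1) conversion,
round(100 * (exp(coef(model1)) - 1), 2) # approximate % change in price for each contrast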
When we remove the insignificant predictors \(\beta_1\) and \(\beta_4\), the equation becomes,
\[log(y)= 12.14 - 0.1156*bedrooms3 + 0.0528*bathrooms2 + 4.608e-04*sqft_living - 4.176e-07*sqft_lot + 0.1025*floors2 + 0.2302*floors3 + \varepsilon\]
Using the PRESS statistic:
DAAG::press(model1)
## [1] 649.989
DAAG::press(full)
## [1] 1.106e+14
Since the PRESS statistic is smaller for the log-transformed model1, its predictive ability appears better than that of the full model. (Note, however, that the two PRESS values are on different scales, log(price) vs raw price, so the raw numbers are not directly comparable.) Overall, model1 is preferred.
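For reference, PRESS can be computed by hand from the leave-one-out residuals; a minimal sketch that should agree with DAAG::press() for linear models,
press_manual <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press_manual(model1) # should match DAAG::press(model1)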
A logistic regression model is fit when the response variable follows a binomial distribution of successes (1) and failures (0). The log of the odds of success (the logit) is then modeled as a linear function of the predictors, making this a Generalized Linear Model (GLM). The data for the logistic model needs to be in matrix form (successes and failures per group).
In the context of our kc_house data, the response variable is defined as the odds of success, where success is getting a lower priced house. We create a matrix for our response holding SUCCESSES (LowPrice) and FAILURES (HighPrice). The response is a categorical variable here. The predictors bedrooms, bathrooms, and floors remain categorical, and sqft_living and sqft_lot remain continuous.
The following code derives the data in matrix form for logistic regression,
# Create threshold using median
median(kc_house$price) # 333,000
## [1] 333000
# Flag high price and low price using median as threshold
kc_house_flag <- kc_house %>% mutate(HighPrice = ifelse(price > 333000, 1, 0),
LowPrice = ifelse(price < 333000, 1, 0)) # sales at exactly 333,000 are counted in neither group
# Count LowPrice and HighPrice at the group level of bedrooms x bathrooms x floors.
a <- aggregate(cbind(LowPrice,HighPrice) ~ bedrooms + bathrooms + floors, data = kc_house_flag, FUN = sum, na.rm = TRUE)
# Take mean of sqft_living and sqft_lot
b <- aggregate(cbind(sqft_living, sqft_lot) ~ bedrooms + bathrooms + floors, data = kc_house_flag, FUN = mean, na.rm = TRUE)
c <- merge(a, b)
# Reorder columns
kc_house_logistic <- c[,c(4,5,1,2,3,6,7)]
kc_house_logistic
## LowPrice HighPrice bedrooms bathrooms floors sqft_living sqft_lot
## 1 79 43 1 1 1 799.5082 9363.123
## 2 0 2 1 1 2 855.0000 3139.500
## 3 2 0 1 2 1 1150.0000 9812.000
## 4 2 1 1 2 2 1363.3333 64646.667
## 5 749 656 2 1 1 1014.9018 11366.468
## 6 18 25 2 1 2 1211.9302 4705.488
## 7 1 3 2 1 3 1015.2500 1061.750
## 8 31 96 2 2 1 1483.4961 8252.228
## 9 19 32 2 2 2 1391.4314 22270.431
## 10 4 11 2 2 3 1327.3333 1274.067
## 11 0 2 2 3 1 2290.0000 493639.000
## 12 0 8 2 3 2 1848.7500 10101.500
## 13 0 2 2 3 3 1610.0000 1120.500
## 14 916 535 3 1 1 1228.1945 10654.033
## 15 16 27 3 1 2 1488.8837 9288.558
## 16 299 451 3 2 1 1688.1640 14544.408
## 17 31 123 3 2 2 1850.2792 23678.948
## 18 2 26 3 2 3 1421.8214 1416.321
## 19 11 26 3 3 1 2373.2432 19927.541
## 20 6 97 3 3 2 2652.2330 13329.000
## 21 2 26 3 3 3 1621.0714 1668.964
Note: sqft_living and sqft_lot are averages aggregated at the bedrooms x bathrooms x floors level.
The logistic regression model for the binomially distributed response (odds of getting a lower priced house), generalized as a linear model, is fit using glm() by passing the successes and failures of the response, as shown below,
model2 <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms) + factor(bathrooms) + factor(floors) + sqft_living + sqft_lot, family=binomial, kc_house_logistic)
summary.glm(model2)
##
## Call:
## glm(formula = cbind(LowPrice, HighPrice) ~ factor(bedrooms) +
## factor(bathrooms) + factor(floors) + sqft_living + sqft_lot,
## family = binomial, data = kc_house_logistic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.291e+00 6.666e-01 3.436 0.000589 ***
## factor(bedrooms)2 -6.327e-02 2.591e-01 -0.244 0.807066
## factor(bedrooms)3 7.619e-01 4.028e-01 1.892 0.058532 .
## factor(bathrooms)2 -1.574e-02 3.690e-01 -0.043 0.965988
## factor(bathrooms)3 2.725e-01 9.221e-01 0.295 0.767649
## factor(floors)2 -4.080e-01 1.852e-01 -2.203 0.027579 *
## factor(floors)3 -1.968e+00 4.987e-01 -3.947 7.93e-05 ***
## sqft_living -2.058e-03 8.027e-04 -2.564 0.010342 *
## sqft_lot 4.140e-07 4.363e-06 0.095 0.924409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429.301 on 20 degrees of freedom
## Residual deviance: 29.989 on 12 degrees of freedom
## AIC: 117.16
##
## Number of Fisher Scoring iterations: 5
summary() gives us the AIC score, which we can use to compare models. We notice many insignificant predictors. First, let's check whether this model is adequate.
The adequacy of a logistic regression model can be checked by testing the significance of the logistic model using the Deviance, which is a goodness-of-fit test based on a chi-square statistic. Other methods are the Pearson and Hosmer-Lemeshow tests.
Unlike simple linear models, residual analysis can't be performed because,
1. The error terms take on only two values, so they can't possibly be normally distributed
2. The variance of the observations is a function of the mean
3. A linear response function could result in predicted values that fall outside the 0, 1 range, and this is impossible because \(0<E(y_i) = \mu_i \le 1\)
The Adequacy of the logistic regression model is checked by testing the goodness of fit using the Deviance.
The hypotheses when using the Deviance are,
\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.
# Calculate deviance of logistic model
deviance(model2)
## [1] 29.98885
# Gives the p value for the goodness of fit test
pchisq(model2$deviance, df=model2$df.residual, lower.tail=FALSE) # model2$df.residual = 12
## [1] 0.002803238
The chi-square test statistic of 29.98885 with 12 degrees of freedom gives a p-value of 0.002803238, indicating that the null hypothesis is not plausible. Hence, the logistic model \(model2\) does not fit the data well.
We want to check if a reduced model is plausible. Individual tests on the predictor variables help us determine which variables can be removed.
Tests on individual model coefficients can also be done using Wald inference. The MLEs have an approximate normal distribution, so the distribution of \(Z_0\),
\[Z_0 = \frac{\hat \beta}{se(\hat \beta)}\]
is standard normal if the true value of the parameter is zero. We can either report the square of Z, which is chi-square, or compute a p-value directly from \(Z_0\) (summary() reports the z values and their p-values).
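As a sketch, these Wald statistics can be reproduced by hand from the coefficient table of the fitted GLM,
cf <- summary(model2)$coefficients
z0 <- cf[, "Estimate"] / cf[, "Std. Error"] # Wald z for each coefficient
2 * pnorm(-abs(z0)) # two-sided p-values, matching Pr(>|z|) from summary()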
The Type 3 Anova() for this model, with categorical and continuous predictors, gives a likelihood-ratio chi-square statistic for each predictor, which is assessed using its p-value. This indicates which of the predictors have a significant effect on the odds of getting a lower priced house (response). The test using the anova table is,
car::Anova(model2, type="III") # library(car)
## Analysis of Deviance Table (Type III tests)
##
## Response: cbind(LowPrice, HighPrice)
## LR Chisq Df Pr(>Chisq)
## factor(bedrooms) 29.165 2 4.645e-07 ***
## factor(bathrooms) 1.028 2 0.59801
## factor(floors) 39.776 2 2.306e-09 ***
## sqft_living 6.191 1 0.01284 *
## sqft_lot 0.009 1 0.92606
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA table, bathrooms and sqft_lot could be removed as they are not significant at the 5% level.
Let's fit the reduced model with the bedrooms and floors predictors (the duplicated numeric floors term in the call is aliased with factor(floors) and dropped by R),
reducedmodel2 <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms) + factor(floors) + floors , family=binomial, kc_house_logistic)
summary.glm(reducedmodel2)
##
## Call:
## glm(formula = cbind(LowPrice, HighPrice) ~ factor(bedrooms) +
## factor(floors) + floors, family = binomial, data = kc_house_logistic)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6466 0.1855 3.485 0.000492 ***
## factor(bedrooms)2 -0.5663 0.1921 -2.949 0.003192 **
## factor(bedrooms)3 -0.4875 0.1899 -2.567 0.010272 *
## factor(floors)2 -1.3773 0.1233 -11.170 < 2e-16 ***
## factor(floors)3 -2.1604 0.3563 -6.063 1.34e-09 ***
## floors NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429.30 on 20 degrees of freedom
## Residual deviance: 219.66 on 16 degrees of freedom
## AIC: 298.84
##
## Number of Fisher Scoring iterations: 4
Note the AIC score for model comparison.
The hypotheses when using the Deviance are,
\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.
deviance(reducedmodel2)
## [1] 219.6638
pchisq(reducedmodel2$deviance, df=reducedmodel2$df.residual, lower.tail=FALSE) # reducedmodel2$df.residual = 16
## [1] 8.158237e-38
The chi-square test statistic of 219.6638 with 16 degrees of freedom gives a p-value of 8.16e-38, indicating that the null hypothesis is not plausible. Hence, the reduced logistic model \(reducedmodel2\) does not fit the data well either.
Although the multicollinearity check at the beginning didn't show any significant correlation between the predictors, in the context of our House Sales data we know the number of bedrooms, the number of floors and the living area size are related to each other. When predictors are expected to be related, interactions between these predictors can be added to the regression model.
Since the reduced predictor set has 3 predictors (bedrooms, floors and sqft_living), adding one interaction can be done in \(^3C_2\) = 3 ways. Let's fit each of these interaction-based models.
# Fit models
reducedmodel2a <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms) + factor(floors) + sqft_living + factor(bedrooms)*factor(floors), family=binomial, kc_house_logistic) # ONLY SIGNIFICANT MODEL
reducedmodel2b <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms) + factor(floors) + sqft_living + factor(bedrooms)*sqft_living, family=binomial, kc_house_logistic)
reducedmodel2c <- glm(cbind(LowPrice, HighPrice) ~ factor(bedrooms) + factor(floors) + sqft_living + factor(floors)*sqft_living, family=binomial, kc_house_logistic)
Again, the hypotheses when using the Deviance are,
\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.
# Check model adequacies
deviance(reducedmodel2a)
## [1] 20.59735
print(paste(pchisq(reducedmodel2a$deviance, df=reducedmodel2a$df.residual, lower.tail=FALSE), "Significant!!"))# reducedmodel2a$df.residual = 12
## [1] "0.0565967308332755 Significant!!"
deviance(reducedmodel2b)
## [1] 27.52464
pchisq(reducedmodel2b$deviance, df=reducedmodel2b$df.residual, lower.tail=FALSE) # reducedmodel2b$df.residual = 13
## [1] 0.01053559
deviance(reducedmodel2c)
## [1] 27.26707
pchisq(reducedmodel2c$deviance, df=reducedmodel2c$df.residual, lower.tail=FALSE) # reducedmodel2c$df.residual = 13
## [1] 0.01143388
We found an adequate logistic regression model! The reduced model \(reducedmodel2a\), with the interaction term bedrooms x floors and the predictors bedrooms, floors and sqft_living, has a chi-square test statistic of 20.59735 with 12 degrees of freedom, giving a p-value of 0.056, so we fail to reject the null hypothesis. Hence, the logistic model \(reducedmodel2a\) fits the data well.
Let's look at the summary(),
summary(reducedmodel2a)
##
## Call:
## glm(formula = cbind(LowPrice, HighPrice) ~ factor(bedrooms) +
## factor(floors) + sqft_living + factor(bedrooms) * factor(floors),
## family = binomial, data = kc_house_logistic)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.2266966 0.2273662 9.793 < 2e-16 ***
## factor(bedrooms)2 -0.1096435 0.1994530 -0.550 0.582511
## factor(bedrooms)3 0.7374060 0.2149977 3.430 0.000604 ***
## factor(floors)2 -0.3585976 0.9606131 -0.373 0.708925
## factor(floors)3 -2.5361335 0.5216365 -4.862 1.16e-06 ***
## sqft_living -0.0019780 0.0001571 -12.591 < 2e-16 ***
## factor(bedrooms)2:factor(floors)2 0.3334882 0.9828605 0.339 0.734381
## factor(bedrooms)3:factor(floors)2 -0.2653644 0.9731551 -0.273 0.785096
## factor(bedrooms)2:factor(floors)3 1.7920471 0.7363305 2.434 0.014943 *
## factor(bedrooms)3:factor(floors)3 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 429.301 on 20 degrees of freedom
## Residual deviance: 20.597 on 12 degrees of freedom
## AIC: 107.77
##
## Number of Fisher Scoring iterations: 4
Note AIC score.
The Type 3 test using the anova table for each predictor shows,
car::Anova(reducedmodel2a, type="III") # library(car)
## Error in Anova.III.LR.glm(mod, singular.ok = singular.ok): there are aliased coefficients in the model
Let's use a manual approach. Using anova(), get the deviance for each predictor and use pchisq() to get p-values, where,
\(H_0\): \(\beta_i = 0\), \(H_a\): \(\beta_i \not = 0\)
anova(reducedmodel2a)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: cbind(LowPrice, HighPrice)
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 20 429.30
## factor(bedrooms) 2 11.164 18 418.14
## factor(floors) 2 198.473 16 219.66
## sqft_living 1 188.640 15 31.02
## factor(bedrooms):factor(floors) 3 10.427 12 20.60
Using the deviance values above, the p-values for the significance of the predictors are,
pchisq(11.164, df=2, lower.tail=FALSE) # For bedrooms
## [1] 0.003765028
pchisq(198.473, df=2, lower.tail=FALSE) # For floors
## [1] 7.98244e-44
pchisq(188.640, df=1, lower.tail=FALSE) # For sqft_living
## [1] 6.297891e-43
pchisq(10.427, df=3, lower.tail=FALSE) # For bedrooms x floors
## [1] 0.01526436
Equation of the reduced model with interaction:
Let \(p\) be the probability of success, i.e., the probability of getting a lower priced house. Based on the output of R, the fitted model equation is:
\[p= \frac{e^{2.226 + 0.737 \hspace{0.1 cm} bedrooms3 - 2.53 \hspace{0.1 cm} floors3 - 0.00197 \hspace{0.1 cm} sqft_living + 1.79 \hspace{0.1 cm} bedrooms2*floors3}}{1+e^{2.226 + 0.737 \hspace{0.1 cm} bedrooms3 - 2.53 \hspace{0.1 cm} floors3 - 0.00197 \hspace{0.1 cm} sqft_living + 1.79 \hspace{0.1 cm} bedrooms2*floors3}}\]
Or can be written as,
\[p=\frac{1}{1+e^{-2.226 - 0.737 \hspace{0.1 cm} bedrooms3 + 2.53 \hspace{0.1 cm} floors3 + 0.00197 \hspace{0.1 cm} sqft_living - 1.79 \hspace{0.1 cm} bedrooms2*floors3}}\]
Another way,
\[log(\frac{p}{1-p})= 2.226 + 0.737 \hspace{0.1 cm} bedrooms3 - 2.53 \hspace{0.1 cm} floors3 - 0.00197 \hspace{0.1 cm} sqft_living + 1.79 \hspace{0.1 cm} bedrooms2*floors3\]
Where the coefficients are interpreted as,
\(B_0\) (2.227) :- When sqft_living is set to 0 and the dummies bedrooms2 & bedrooms3 are set to 0 (representing a 1-bedroom house) and floors2 & floors3 are set to 0 (representing a 1-floor house), the odds of success (getting a lower priced house) equal exp(\(B_0\)), i.e., exp(2.227).
\(B_1\) (-0.110) :- When sqft_living and floors are held fixed and bedrooms changes from 1 to 2 (bedrooms2 = 1), the odds of getting a lower priced house are multiplied by exp(\(B_1\)), i.e., exp(-0.110).
In similar fashion, the remaining coefficients can be interpreted.
Interpretation: a one-unit increase in sqft_living multiplies the odds of getting a lower priced house by exp(-0.0019780) ≈ 0.998, i.e., decreases the odds slightly, when all the other variables are fixed.
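These odds-ratio interpretations can be pulled out in one step (a sketch; the aliased interaction term shows up as NA),
round(exp(coef(reducedmodel2a)), 4) # multiplicative effect of each term on the odds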
Using the PRESS statistic:
DAAG::press(model2)
## [1] 152.6449
DAAG::press(reducedmodel2)
## [1] 730.2107
DAAG::press(reducedmodel2a)
## [1] 212.4817
Although \(reducedmodel2a\) fits the data well, it is not better at prediction, as the PRESS statistic for \(model2\) is lower. (A lower PRESS value indicates better prediction.)
A Poisson regression model is used to model counts, with the intention of modeling the main parameter \(\lambda\), the average number of occurrences per unit of time or space, as a function of one or more covariates. A log link is used by default since \(\lambda_i\) can only take values from \(0\) to \(\infty\).
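Concretely, with the log link the model takes the standard form (here \(x_{i1},\dots,x_{ik}\) denote the covariates),
\[y_i \sim Poisson(\lambda_i), \quad log(\lambda_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}\]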
In the context of the kc_house data, the response variable is defined as the count per price range. Averages of the 5 regressors, aggregated at the bucket level, are taken as the predictors.
The data for the Poisson regression is derived as shown below,
# Divide the response, price, into 6 buckets. A bucket width of one-third of mean(kc_house$price) is used.
onesixth <- mean(kc_house$price)/3
# Create 6 price ranges
kc_house_poisson_flag <- kc_house %>% mutate(range = ifelse(price >= min(kc_house$price) & price <= onesixth, 1,
ifelse(price > onesixth & price <= onesixth*2, 2,
ifelse(price > onesixth*2 & price <= onesixth*3, 3,
ifelse(price > onesixth*3 & price <= onesixth*4, 4,
ifelse(price > onesixth*4 & price <= onesixth*5, 5,
ifelse(price > onesixth*5 & price <= max(kc_house$price), 6, 0)))))))
# Aggregate the 5 predictors and merge into a final dataframe
aa <- kc_house_poisson_flag %>% group_by(range) %>% summarise(range_count = n())
bb <- aggregate(cbind(bedrooms, bathrooms, floors) ~ range, data = kc_house_poisson_flag, FUN = mean, na.rm = TRUE) %>% round()
cc <- aggregate(cbind(sqft_living, sqft_lot) ~ range, data = kc_house_poisson_flag, FUN = mean, na.rm = TRUE) %>% round()
dd <- merge(merge(aa, bb), cc)
# Provide row names for ranges
rownames(dd) <- c("Range 78000 to 123810","Range 123810 to 247620","Range 247620 to 371430","Range 371430 to 495241","Range 495241 to 619051","Range 619051 to 3100000")
# reorder columns
kc_house_poisson <- dd[,c(2,3,4,5,6,7)]
kc_house_poisson
## range_count bedrooms bathrooms floors sqft_living
## Range 78000 to 123810 63 2 1 1 848
## Range 123810 to 247620 1093 3 1 1 1099
## Range 247620 to 371430 1423 3 1 1 1225
## Range 371430 to 495241 980 3 1 1 1323
## Range 495241 to 619051 476 3 2 1 1563
## Range 619051 to 3100000 350 3 2 1 2003
## sqft_lot
## Range 78000 to 123810 9942
## Range 123810 to 247620 11382
## Range 247620 to 371430 11644
## Range 371430 to 495241 10599
## Range 495241 to 619051 15744
## Range 619051 to 3100000 17023
The Poisson regression model for the response, count per price range, generalized as a linear model, is fit using glm() as,
model3 <- glm(range_count ~ bedrooms + bathrooms + floors + sqft_living + sqft_lot, family = poisson(link = "log"), kc_house_poisson)
summary.glm(model3)
##
## Call:
## glm(formula = range_count ~ bedrooms + bathrooms + floors + sqft_living +
## sqft_lot, family = poisson(link = "log"), data = kc_house_poisson)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.246e+00 3.915e-01 -3.182 0.00146 **
## bedrooms 2.942e+00 1.393e-01 21.116 < 2e-16 ***
## bathrooms -1.461e+00 1.754e-01 -8.331 < 2e-16 ***
## floors NA NA NA NA
## sqft_living -5.889e-04 1.217e-04 -4.840 1.30e-06 ***
## sqft_lot 1.475e-04 3.232e-05 4.563 5.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2118.86 on 5 degrees of freedom
## Residual deviance: 67.66 on 1 degrees of freedom
## AIC: 126
##
## Number of Fisher Scoring iterations: 4
Note AIC score.
The adequacy of the Poisson regression model is checked by testing the goodness of fit using the Deviance.
The hypotheses when using the Deviance are,
\(H_0\): The model fits the data well.
\(H_1\): The model does not fit the data well.
deviance(model3)
## [1] 67.66024
pchisq(model3$deviance, df=model3$df.residual, lower.tail=FALSE) # model3$df.residual = 1
## [1] 1.942412e-16
The chi-square test statistic of 67.66024 with 1 degree of freedom gives a p-value of 1.942412e-16, indicating that the null hypothesis is not plausible. Hence, the Poisson model does not fit the data well.
Let's fit a different model.
A Generalized Linear Model (GLM) with a gamma distributed response is used when the distribution of the response variable is continuous and positively skewed. In such a case, since the response variable is not normally distributed, the GLM relates the mean of the response to the predictor variables through a log link function.
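In formula form, the log link relates the mean of the gamma distributed response to the predictors as,
\[log(E(y_i)) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}\]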
In the context of our kc_house data, the response, price, is right skewed, as shown by the histogram below,
hist(kc_house$price)
Hence, a log link will be used.
Let's fit the Generalized Linear Model for the gamma distributed price response using glm() with the log link,
model4 <- glm(price~factor(bedrooms)+factor(bathrooms)+factor(floors)+sqft_living+sqft_lot,
family=Gamma(link="log"),kc_house)
summary.glm(model4)
##
## Call:
## glm(formula = price ~ factor(bedrooms) + factor(bathrooms) +
## factor(floors) + sqft_living + sqft_lot, family = Gamma(link = "log"),
## data = kc_house)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.222e+01 3.691e-02 330.975 < 2e-16 ***
## factor(bedrooms)2 2.848e-02 3.597e-02 0.792 0.428494
## factor(bedrooms)3 -1.316e-01 3.646e-02 -3.609 0.000311 ***
## factor(bathrooms)2 3.151e-02 1.673e-02 1.884 0.059674 .
## factor(bathrooms)3 5.481e-02 3.974e-02 1.379 0.167847
## factor(floors)2 1.041e-01 2.326e-02 4.473 7.89e-06 ***
## factor(floors)3 1.874e-01 4.799e-02 3.904 9.61e-05 ***
## sqft_living 4.768e-04 1.681e-05 28.362 < 2e-16 ***
## sqft_lot -4.642e-07 1.604e-07 -2.894 0.003822 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.1531135)
##
## Null deviance: 932.60 on 4384 degrees of freedom
## Residual deviance: 640.31 on 4376 degrees of freedom
## AIC: 115678
##
## Number of Fisher Scoring iterations: 5
The bathrooms predictor is insignificant.
A likelihood-ratio test on the model parameters/predictors will indicate which of the predictors have a significant effect on the response. The test using the anova table is,
Anova(model4,type = 3)
## Analysis of Deviance Table (Type III tests)
##
## Response: price
## LR Chisq Df Pr(>Chisq)
## factor(bedrooms) 141.13 2 < 2.2e-16 ***
## factor(bathrooms) 3.84 2 0.146657
## factor(floors) 30.73 2 2.124e-07 ***
## sqft_living 813.68 1 < 2.2e-16 ***
## sqft_lot 8.86 1 0.002919 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can remove the bathrooms predictor from the model as it is not significant at the 5% significance level.
Let's fit a reduced model by removing the bathrooms predictor,
reducedmodel4 <- glm(price~factor(bedrooms)+factor(floors)+sqft_living+sqft_lot,
family=Gamma(link="log"),kc_house)
summary.glm(reducedmodel4)
##
## Call:
## glm(formula = price ~ factor(bedrooms) + factor(floors) + sqft_living +
## sqft_lot, family = Gamma(link = "log"), data = kc_house)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.220e+01 3.618e-02 337.248 < 2e-16 ***
## factor(bedrooms)2 2.640e-02 3.590e-02 0.735 0.462232
## factor(bedrooms)3 -1.311e-01 3.636e-02 -3.605 0.000316 ***
## factor(floors)2 1.154e-01 2.196e-02 5.256 1.54e-07 ***
## factor(floors)3 2.150e-01 4.512e-02 4.765 1.95e-06 ***
## sqft_living 4.950e-04 1.414e-05 34.994 < 2e-16 ***
## sqft_lot -4.682e-07 1.601e-07 -2.924 0.003476 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.1528168)
##
## Null deviance: 932.6 on 4384 degrees of freedom
## Residual deviance: 640.9 on 4378 degrees of freedom
## AIC: 115678
##
## Number of Fisher Scoring iterations: 5
Note AIC score.
Using p-value:
anova(model4,reducedmodel4, test="Chi")
## Analysis of Deviance Table
##
## Model 1: price ~ factor(bedrooms) + factor(bathrooms) + factor(floors) +
## sqft_living + sqft_lot
## Model 2: price ~ factor(bedrooms) + factor(floors) + sqft_living + sqft_lot
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 4376 640.31
## 2 4378 640.90 -2 -0.58785 0.1467
Since the p-value (0.1467) > 0.05, removing bathrooms does not significantly worsen the fit.
Using AIC scores: both \(reducedmodel4\) and \(model4\) have the same AIC score.
Given the equivalent fit, \(model4\) (with bathrooms) is retained going forward.
The adequacy of the gamma GLM is checked by testing the goodness of fit using the Deviance.
The hypotheses when using the Deviance are,
\(H_0\): The model fits the data well. \(H_1\): The model does not fit the data well.
deviance(model4)
## [1] 640.3131
pchisq(model4$deviance, df=model4$df.residual, lower.tail=FALSE) # model4$df.residual = 4376
## [1] 1
The chi-square test statistic of 640.3131 with 4376 degrees of freedom gives a p-value of 1, indicating that the null hypothesis is plausible. Hence, this GLM with a log link and gamma distributed response fits the data well.
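As an illustration, fitted mean prices on the original dollar scale can be obtained directly from the gamma GLM; a minimal sketch using the response-scale prediction,
head(predict(model4, type = "response")) # fitted mean prices in USD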
Let's summarize the 4 finalized models:
Model 1 - Linear Regression Model: \(model1\) (log-transformed response)
Model 2 - Logistic Regression Model: \(reducedmodel2a\)
Model 3 - Poisson Regression Model: \(model3\)
Model 4 - GLM Regression Model with gamma distributed response: \(model4\)
From this summary, the Poisson regression model \(model3\) does not fit our House Sales data and can be rejected.
Coming to the remaining 3 models: the linear regression model \(model1\) has a log-transformed response, the logistic regression model \(reducedmodel2a\) has an odds response, and the GLM \(model4\) has a gamma distributed response. Since the three responses are on different scales, no direct statistical tests can be done to compare these models.
Fitting \(reducedmodel2a\) feels forceful, as our response (price) was not naturally in the format of successes and failures. Also, recall we had to take the mean of the sqft_living predictor, which is a crude way of creating data. For these reasons, the logistic model, although adequate, does not feel like the best choice of model.
Choosing between \(model1\) and \(model4\) is difficult, but with the context of the House Sales data in mind, a GLM with a gamma distributed response seems most apt, as the response (price) has a right-skewed distribution. Also, recall that the normality assumption on the errors/residuals was violated by \(model1\).
Using multiple model and variable selection methods, 3 adequate models were finalized, each with a different response type. A GLM regression model having a log link for the gamma distributed response was chosen as best fitting in the context of the data in hand (the response is right skewed). The model fits the data well based on the deviance statistic (p-value of 1). Although the bathrooms predictor is insignificant, \(model4\) with this parameter is retained over the reduced model, as the deviance comparison between the two shows no significant difference in fit (p-value = 0.1467).
Kaggle (2016) House Sales in King County, USA. Predict house price using regression, Kaggle website, accessed 25th May 2023. https://www.kaggle.com/datasets/harlfoxem/housesalesprediction
GeoDa (2020) 2014-15 Home Sales in King County, WA, Geodacenter github website, accessed 2nd June 2023. https://geodacenter.github.io/data-and-lab//KingCounty-HouseSales2015/
GIS (2023) Zipcodes for King County and Surrounding Area (Shorelines) / zipcode shore area, GIS website, accessed 2nd June 2023. https://gis-kingcounty.opendata.arcgis.com/datasets/zipcodes-for-king-county-and-surrounding-area-shorelines-zipcode-shore-area/explore
CSDS (2020) The Center for Spatial Data Science, University of Chicago website, accessed 2nd June 2023. https://spatial.uchicago.edu/