# Loading packages and data
library(alr4)
## Loading required package: car
## Loading required package: carData
## Loading required package: effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(smss)
library(MPV)
## Loading required package: lattice
## Loading required package: KernSmooth
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
data("house.selling.price.2")
str(house.selling.price.2)
## 'data.frame': 93 obs. of 5 variables:
## $ P : num 48.5 55 68 137 309.4 ...
## $ S : num 1.1 1.01 1.45 2.4 3.3 0.4 1.28 0.74 0.78 0.97 ...
## $ Be : int 3 3 3 3 4 1 3 3 2 3 ...
## $ Ba : int 1 2 2 3 3 1 1 1 1 1 ...
## $ New: int 0 0 0 0 1 0 0 0 0 0 ...
The first variable I would delete in backward elimination would be Beds, since it has the largest P-value of all the variables (0.487).
The first variable I would add in forward selection would be Size, because it has the smallest P-value (effectively 0).
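The first step of each procedure can be checked directly in R. The sketch below is only an illustration: `first_steps` is a hypothetical helper that is not part of the original analysis, and it uses single-term F tests from `drop1()` and `add1()`, which may differ from the table the answers above were read from.

```r
# Hypothetical helper: first step of backward elimination and of forward
# selection for a fitted full model, using single-term F tests.
first_steps <- function(full) {
  null <- update(full, . ~ 1)  # intercept-only model on the same data
  list(
    backward = drop1(full, test = "F"),                       # drop each term in turn
    forward  = add1(null, scope = formula(full), test = "F")  # add each term in turn
  )
}

# Assumes the smss package (loaded at the top) provides the housing data
if (requireNamespace("smss", quietly = TRUE)) {
  data("house.selling.price.2", package = "smss")
  print(first_steps(lm(P ~ ., data = house.selling.price.2)))
}
```

In the `backward` table the term with the largest P-value is the first candidate for removal. In the `forward` table the term with the smallest P-value is the first candidate for entry.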
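Beds could have such a large P-value despite its correlation with price partly because there are only 93 observations in the sample, and a small sample size can lead to misleading P-values. Beds is also likely correlated with Size and Baths, so once those variables are in the model it adds little unique information about price.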
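One way to probe how much Beds overlaps with the other predictors is to look at the correlation matrix and at variance inflation factors. This is a minimal sketch rather than part of the original analysis: `vif_by_hand` is a hypothetical helper that regresses each predictor on the others and reports 1 / (1 - R²).

```r
# Hypothetical helper: variance inflation factor for each column of a
# data frame of numeric predictors.
vif_by_hand <- function(predictors) {
  sapply(names(predictors), function(v) {
    # Regress predictor v on all of the remaining predictors
    fit <- lm(reformulate(setdiff(names(predictors), v), response = v),
              data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Assumes the smss package (loaded at the top) provides the housing data
if (requireNamespace("smss", quietly = TRUE)) {
  data("house.selling.price.2", package = "smss")
  # Pairwise correlations, including each predictor's correlation with P
  print(round(cor(house.selling.price.2), 2))
  print(vif_by_hand(house.selling.price.2[, c("S", "Be", "Ba", "New")]))
}
```

A predictor that is well explained by the others gets a large VIF and an inflated standard error, which pushes its P-value up even if it is correlated with the outcome on its own.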
# R2 and Adjusted R2 for the model with beds
summary(lm(P ~ ., data = house.selling.price.2))
##
## Call:
## lm(formula = P ~ ., data = house.selling.price.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.212 -9.546 1.277 9.406 71.953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.795 12.104 -3.453 0.000855 ***
## S 64.761 5.630 11.504 < 2e-16 ***
## Be -2.766 3.960 -0.698 0.486763
## Ba 19.203 5.650 3.399 0.001019 **
## New 18.984 3.873 4.902 4.3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.36 on 88 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
## F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
# R2 and Adjusted R2 for the model without beds
summary(lm(P ~ . -Be, data = house.selling.price.2))
##
## Call:
## lm(formula = P ~ . - Be, data = house.selling.price.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.804 -9.496 0.917 7.931 73.338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -47.992 8.209 -5.847 8.15e-08 ***
## S 62.263 4.335 14.363 < 2e-16 ***
## Ba 20.072 5.495 3.653 0.000438 ***
## New 18.371 3.761 4.885 4.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.31 on 89 degrees of freedom
## Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
## F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
# PRESS model with beds
PRESS(lm(P ~ ., data = house.selling.price.2))
## [1] 28390.22
# PRESS model without beds
PRESS(lm(P ~ . -Be, data = house.selling.price.2))
## [1] 27860.05
# AIC model with beds
AIC(lm(P ~ ., data = house.selling.price.2))
## [1] 790.6225
# AIC model without beds
AIC(lm(P ~ . -Be, data = house.selling.price.2))
## [1] 789.1366
# BIC model with beds
BIC(lm(P ~ ., data = house.selling.price.2))
## [1] 805.8181
# BIC model without beds
BIC(lm(P ~ . -Be, data = house.selling.price.2))
## [1] 801.7996
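Based on these criteria, I would prefer the model that excludes Beds to predict the selling price of homes. On every criterion except R2, the model without Beds performs better: Adjusted R2 is higher (0.8637 vs. 0.8629), and PRESS, AIC, and BIC are all lower. R2 is slightly higher with Beds (0.8689 vs. 0.8681), but R2 can never decrease when a predictor is added, so it is not a fair basis for comparing models of different sizes.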
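The same comparison can be collected into one table instead of refitting the model for every criterion. The sketch below is an illustration under stated assumptions: `press` and `compare_models` are hypothetical helpers, and PRESS is computed from the leave-one-out identity e_i / (1 - h_ii) rather than with `MPV::PRESS`.

```r
# PRESS from leave-one-out residuals e_i / (1 - h_ii); this avoids
# refitting the model n times.
press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Hypothetical helper: one row of selection criteria per fitted model.
compare_models <- function(models) {
  do.call(rbind, lapply(models, function(fit) {
    s <- summary(fit)
    data.frame(R2 = s$r.squared, adjR2 = s$adj.r.squared,
               PRESS = press(fit), AIC = AIC(fit), BIC = BIC(fit))
  }))
}

# Assumes the smss package (loaded at the top) provides the housing data
if (requireNamespace("smss", quietly = TRUE)) {
  data("house.selling.price.2", package = "smss")
  fits <- list(
    with_beds    = lm(P ~ ., data = house.selling.price.2),
    without_beds = lm(P ~ . - Be, data = house.selling.price.2)
  )
  print(compare_models(fits))
}
```

Higher is better for R2 and Adjusted R2. Lower is better for PRESS, AIC, and BIC.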
# Loading data
data("trees")
str(trees)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
# Multiple regression model with Volume as outcome and Girth and Height as explanatory variables
m1 <- lm(formula = Volume ~ Girth + Height, data = trees)
summary(m1)
##
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
# Regression diagnostic plots
plot(m1)
Based on the Residuals vs Fitted plot, the assumption of linearity appears to be violated, since the smoothed line is curved in a U shape (a positive quadratic pattern). Furthermore, in the Scale-Location plot, the line is not horizontal, suggesting a violation of the assumption of constant variance.
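The two panels discussed above can be drawn on their own, and the curvature can be probed numerically. This is a sketch only: the squared Girth term and the nested-model F test are an assumed follow-up check, not something the original analysis ran.

```r
data("trees")
m1 <- lm(Volume ~ Girth + Height, data = trees)

# Draw only the two panels discussed:
# which = 1 is Residuals vs Fitted, which = 3 is Scale-Location
op <- par(mfrow = c(1, 2))
plot(m1, which = c(1, 3))
par(op)

# Assumed follow-up check (not in the original analysis): does a squared
# Girth term explain variation that the linear terms miss?
m1_quad <- update(m1, . ~ . + I(Girth^2))
curvature_test <- anova(m1, m1_quad)
print(curvature_test)
```

A small P-value in the second row of the `anova()` table would support the visual impression that the linear specification misses curvature.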
# Loading data
data("florida")
str(florida)
## 'data.frame': 67 obs. of 3 variables:
## $ Gore : int 47300 2392 18850 3072 97318 386518 2155 29641 25501 14630 ...
## $ Bush : int 34062 5610 38637 5413 115185 177279 2873 35419 29744 41745 ...
## $ Buchanan: int 262 73 248 65 570 789 90 182 270 186 ...
# Linear regression model where Buchanan is the outcome variable and Bush is the explanatory variable
m2 <- lm(Buchanan ~ Bush, data = florida)
summary(m2)
##
## Call:
## lm(formula = Buchanan ~ Bush, data = florida)
##
## Residuals:
## Min 1Q Median 3Q Max
## -907.50 -46.10 -29.19 12.26 2610.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.529e+01 5.448e+01 0.831 0.409
## Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 353.9 on 65 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
## F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
# Diagnostic plots
plot(m2)
In each of the diagnostic plots, Palm Beach is an extreme outlier well above the line of best fit.
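The visual impression can be backed up by asking R which observation is most extreme. The sketch below assumes that `florida` comes from the alr4 package loaded at the top and that its row labels identify the counties. `most_extreme` is a hypothetical helper.

```r
# Hypothetical helper: the observation with the largest absolute
# studentized residual, returned as a named value.
most_extreme <- function(fit) {
  r <- rstudent(fit)
  r[which.max(abs(r))]
}

# Assumes the alr4 package provides the florida data
if (requireNamespace("alr4", quietly = TRUE)) {
  data("florida", package = "alr4")
  m2 <- lm(Buchanan ~ Bush, data = florida)
  print(most_extreme(m2))  # the name is the row label in florida
  # Three most influential observations by Cook's distance
  print(sort(cooks.distance(m2), decreasing = TRUE)[1:3])
}
```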
# Linear regression model with both variables log-transformed
m3 <- lm(log(Buchanan) ~ log(Bush), data = florida)
summary(m3)
##
## Call:
## lm(formula = log(Buchanan) ~ log(Bush), data = florida)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96075 -0.25949 0.01282 0.23826 1.66564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.57712 0.38919 -6.622 8.04e-09 ***
## log(Bush) 0.75772 0.03936 19.251 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4673 on 65 degrees of freedom
## Multiple R-squared: 0.8508, Adjusted R-squared: 0.8485
## F-statistic: 370.6 on 1 and 65 DF, p-value: < 2.2e-16
# Diagnostic Plots of log
plot(m3)
After log-transforming both variables, Palm Beach remains an outlier, but it is far less extreme than before.
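To put a number on "less extreme", the largest studentized residual can be compared across the two fits. This is a minimal sketch: it assumes `florida` comes from the alr4 package, and `max_abs_rstudent` is a hypothetical helper.

```r
# Hypothetical helper: size of the largest absolute studentized residual.
max_abs_rstudent <- function(fit) max(abs(rstudent(fit)))

# Assumes the alr4 package provides the florida data
if (requireNamespace("alr4", quietly = TRUE)) {
  data("florida", package = "alr4")
  fits <- list(
    raw = lm(Buchanan ~ Bush, data = florida),
    log = lm(log(Buchanan) ~ log(Bush), data = florida)
  )
  print(sapply(fits, max_abs_rstudent))
}
```

Studentized residuals are scale-free, so the two values are comparable even though the outcomes are on different scales.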