Question 1

# Loading packages and data
library(alr4)
## Loading required package: car
## Loading required package: carData
## Loading required package: effects
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(smss)
library(MPV)
## Loading required package: lattice
## Loading required package: KernSmooth
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
data("house.selling.price.2")
str(house.selling.price.2)
## 'data.frame':    93 obs. of  5 variables:
##  $ P  : num  48.5 55 68 137 309.4 ...
##  $ S  : num  1.1 1.01 1.45 2.4 3.3 0.4 1.28 0.74 0.78 0.97 ...
##  $ Be : int  3 3 3 3 4 1 3 3 2 3 ...
##  $ Ba : int  1 2 2 3 3 1 1 1 1 1 ...
##  $ New: int  0 0 0 0 1 0 0 0 0 0 ...

A

The first variable I would delete in backward elimination is Beds (Be), since it has the largest P-value of all the variables (0.487 in the full model).
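The first backward-elimination step can be sketched in code. This is only an illustration of the "largest P-value" rule, and the object names `full`, `pvals`, and `drop_first` are made up for the example.

```r
library(smss)  # provides house.selling.price.2
data("house.selling.price.2")

# Full model with all four predictors
full <- lm(P ~ ., data = house.selling.price.2)

# P-values of the slope coefficients (the intercept row is dropped)
pvals <- summary(full)$coefficients[-1, "Pr(>|t|)"]

# Candidate for removal: the predictor with the largest P-value
drop_first <- names(which.max(pvals))
drop_first
```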

B

The first variable I would add in forward selection is Size (S), because it has the smallest P-value (reported as < 2e-16, so effectively 0).
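One way to check the first forward-selection step is to start from the intercept-only model and test each predictor on its own with `add1()`. This is a sketch, not the only way to do it; `null_model`, `step1`, and `add_first` are names chosen for the example.

```r
library(smss)
data("house.selling.price.2")

# Intercept-only starting point for forward selection
null_model <- lm(P ~ 1, data = house.selling.price.2)

# F test for adding each candidate predictor by itself
step1 <- add1(null_model, scope = ~ S + Be + Ba + New, test = "F")
step1

# The predictor with the largest F statistic would enter first
add_first <- rownames(step1)[which.max(step1[["F value"]])]
add_first
```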

C

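Beds can have such a large P-value despite its substantial correlation with price because the P-value in the multiple regression tests the partial effect of Beds after Size, Baths, and New are already in the model. Beds is probably correlated with the other predictors, especially Size, so it adds little explanatory power once they are included. The modest sample of 93 observations also limits how precisely that partial effect can be estimated.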
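A quick way to see whether overlap among the predictors is plausible here is to look at the pairwise correlations and the variance inflation factors. This is a sketch: it assumes the `car` package (already loaded above through `alr4`) for `vif()`, and `cors` and `vifs` are example names.

```r
library(smss)
library(car)  # for vif()
data("house.selling.price.2")

# Pairwise correlations among price and the four predictors
cors <- cor(house.selling.price.2)
round(cors, 2)

# Variance inflation factors for the full model;
# values well above 1 indicate overlap with the other predictors
full <- lm(P ~ ., data = house.selling.price.2)
vifs <- vif(full)
vifs
```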

D

R2 & Adjusted R2

# R2 and Adjusted R2 for the model with beds
summary(lm(P ~ ., data = house.selling.price.2))
## 
## Call:
## lm(formula = P ~ ., data = house.selling.price.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.212  -9.546   1.277   9.406  71.953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -41.795     12.104  -3.453 0.000855 ***
## S             64.761      5.630  11.504  < 2e-16 ***
## Be            -2.766      3.960  -0.698 0.486763    
## Ba            19.203      5.650   3.399 0.001019 ** 
## New           18.984      3.873   4.902  4.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.36 on 88 degrees of freedom
## Multiple R-squared:  0.8689, Adjusted R-squared:  0.8629 
## F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
# R2 and Adjusted R2 without beds 
summary(lm(P ~ . -Be, data = house.selling.price.2))
## 
## Call:
## lm(formula = P ~ . - Be, data = house.selling.price.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.804  -9.496   0.917   7.931  73.338 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
## S             62.263      4.335  14.363  < 2e-16 ***
## Ba            20.072      5.495   3.653 0.000438 ***
## New           18.371      3.761   4.885 4.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.31 on 89 degrees of freedom
## Multiple R-squared:  0.8681, Adjusted R-squared:  0.8637 
## F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16

PRESS

# PRESS model with beds
PRESS(lm(P ~ ., data = house.selling.price.2))
## [1] 28390.22
# PRESS model without beds 
PRESS(lm(P ~ . -Be, data = house.selling.price.2))
## [1] 27860.05
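For reference, the PRESS values above can be reproduced without `MPV`: PRESS is the sum of squared leave-one-out prediction errors, and for a linear model each such error equals the ordinary residual divided by one minus its leverage. The helper `press_by_hand()` below is a hypothetical name used only for this sketch.

```r
library(smss)
data("house.selling.price.2")

# PRESS = sum of squared leave-one-out residuals e_i / (1 - h_ii)
press_by_hand <- function(model) {
  loo_resid <- residuals(model) / (1 - hatvalues(model))
  sum(loo_resid^2)
}

fit_with    <- lm(P ~ ., data = house.selling.price.2)
fit_without <- lm(P ~ . - Be, data = house.selling.price.2)

p_with    <- press_by_hand(fit_with)
p_without <- press_by_hand(fit_without)
c(with_beds = p_with, without_beds = p_without)
```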

AIC

# AIC model with beds
AIC(lm(P ~ ., data = house.selling.price.2))
## [1] 790.6225
# AIC model without beds
AIC(lm(P ~ . -Be, data = house.selling.price.2))
## [1] 789.1366

BIC

# BIC model with beds
BIC(lm(P ~ ., data = house.selling.price.2))
## [1] 805.8181
# BIC model without beds
BIC(lm(P ~ . -Be, data = house.selling.price.2))
## [1] 801.7996

E

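I would prefer to use the model that excludes Beds to predict the selling price of homes based on these criteria. For every criterion except R2, the model without Beds performs better: it has the higher Adjusted R2 (0.8637 vs. 0.8629) and the lower PRESS, AIC, and BIC. R2 is slightly higher with Beds (0.8689 vs. 0.8681), but R2 can never decrease when a predictor is added, so it is not a fair basis for comparing these two models.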
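The comparison is easier to read with all five criteria side by side. The sketch below collects them into one matrix; `models` and `criteria` are example names, and `PRESS()` comes from the `MPV` package loaded earlier.

```r
library(smss)
library(MPV)  # for PRESS()
data("house.selling.price.2")

models <- list(
  with_beds    = lm(P ~ ., data = house.selling.price.2),
  without_beds = lm(P ~ . - Be, data = house.selling.price.2)
)

# One column per model, one row per criterion
criteria <- sapply(models, function(m) {
  c(R2    = summary(m)$r.squared,
    adjR2 = summary(m)$adj.r.squared,
    PRESS = PRESS(m),
    AIC   = AIC(m),
    BIC   = BIC(m))
})
round(criteria, 4)
```

Higher is better for R2 and Adjusted R2; lower is better for PRESS, AIC, and BIC.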

Question 2

# Loading data 
data("trees")
str(trees)
## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

A

# Multiple regression model with Volume as outcome and Girth and Height as explanatory variables
m1 <- lm(formula = Volume ~ Girth + Height, data = trees)
summary(m1)
## 
## Call:
## lm(formula = Volume ~ Girth + Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

B

# Regression diagnostic plots 
plot(m1)

Based on the Residuals vs Fitted plot, the assumption of linearity appears to be violated, since the smoothed line follows a U-shaped (convex) curve rather than staying flat around zero. Furthermore, in the Scale-Location plot, the line is not horizontal, suggesting a violation of the assumption of constant variance.
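The visual impression of curvature can be backed up with a rough numerical check. The sketch below redraws the two plots discussed and then tests whether adding a squared Girth term improves the fit. The quadratic term is only one possible way to probe nonlinearity, chosen here for illustration, and `m1_quad` and `curvature_test` are example names.

```r
# trees ships with base R, so no extra packages are needed
m1 <- lm(Volume ~ Girth + Height, data = trees)

# Residuals vs Fitted (1) and Scale-Location (3) side by side
par(mfrow = c(1, 2))
plot(m1, which = c(1, 3))
par(mfrow = c(1, 1))

# If the relationship were linear, a squared term should add little
m1_quad <- update(m1, . ~ . + I(Girth^2))
curvature_test <- anova(m1, m1_quad)
curvature_test
```

A small P-value in the second row of the ANOVA table would be consistent with the curvature seen in the Residuals vs Fitted plot.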

Question 3

# Loading data 
data("florida")
str(florida)
## 'data.frame':    67 obs. of  3 variables:
##  $ Gore    : int  47300 2392 18850 3072 97318 386518 2155 29641 25501 14630 ...
##  $ Bush    : int  34062 5610 38637 5413 115185 177279 2873 35419 29744 41745 ...
##  $ Buchanan: int  262 73 248 65 570 789 90 182 270 186 ...

A

# Linear regression model where Buchanan is the outcome variable and Bush is the explanatory variable
m2 <- lm(Buchanan ~ Bush, data = florida)
summary(m2)
## 
## Call:
## lm(formula = Buchanan ~ Bush, data = florida)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -907.50  -46.10  -29.19   12.26 2610.19 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.529e+01  5.448e+01   0.831    0.409    
## Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 353.9 on 65 degrees of freedom
## Multiple R-squared:  0.3889, Adjusted R-squared:  0.3795 
## F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
# Diagnostic plots 
plot(m2)

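In each of the diagnostic plots, Palm Beach stands out as an extreme outlier. Its residual is the maximum shown in the summary above (about 2,610 votes), far beyond the bulk of the residuals, so the county sits well above the fitted line.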
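The outlier can also be identified numerically rather than by eye. This sketch uses externally studentized residuals and assumes that the row names of `florida` are the county names; `rs` and `worst` are example names.

```r
library(alr4)  # provides florida
data("florida")

m2 <- lm(Buchanan ~ Bush, data = florida)

# Externally studentized residuals: each residual scaled by a
# standard error estimated without that observation
rs <- rstudent(m2)

# County with the largest absolute studentized residual
# (assumes row names of florida are county names)
worst <- names(which.max(abs(rs)))
worst
rs[worst]
```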

B

# Linear regression model on the log of both variables
m3 <- lm(log(Buchanan) ~ log(Bush), data = florida)
summary(m3)
## 
## Call:
## lm(formula = log(Buchanan) ~ log(Bush), data = florida)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96075 -0.25949  0.01282  0.23826  1.66564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
## log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4673 on 65 degrees of freedom
## Multiple R-squared:  0.8508, Adjusted R-squared:  0.8485 
## F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
# Diagnostic Plots of log 
plot(m3)

After log-transforming both variables, Palm Beach remains an outlier, but a much less extreme one than before.
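To put a number on "less extreme", the studentized residual of the same county can be compared across the two fits. This is a sketch: it picks the county with the largest studentized residual in the untransformed model and assumes the row names of `florida` are county names; `county` and `outlier_size` are example names.

```r
library(alr4)
data("florida")

m2 <- lm(Buchanan ~ Bush, data = florida)            # raw counts
m3 <- lm(log(Buchanan) ~ log(Bush), data = florida)  # log-log

# Most extreme county in the untransformed model
county <- names(which.max(abs(rstudent(m2))))

# Studentized residual of that county under each model
outlier_size <- c(raw = unname(rstudent(m2)[county]),
                  logged = unname(rstudent(m3)[county]))
outlier_size
```

A studentized residual that is still well above 3 after the transformation would agree with the reading that the county remains an outlier.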