DACSS 603
(SMSS 14.3, 14.4, merged & modified)
(Data file: house.selling.price.2 from smss R package)
For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.
The data is imported from the smss package and summarized below.
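A minimal sketch of the import and summary call (assuming the smss package is installed):
library(smss) #course data package
data(house.selling.price.2)
summary(house.selling.price.2) #output shown below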
       P                S              Be              Ba             New
 Min.   : 17.50   Min.   :0.40   Min.   :1.000   Min.   :1.000   Min.   :0.0000
 1st Qu.: 72.90   1st Qu.:1.33   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:0.0000
 Median : 96.00   Median :1.57   Median :3.000   Median :2.000   Median :0.0000
 Mean   : 99.53   Mean   :1.65   Mean   :3.183   Mean   :1.957   Mean   :0.3011
 3rd Qu.:115.00   3rd Qu.:1.98   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:1.0000
 Max.   :309.40   Max.   :3.85   Max.   :5.000   Max.   :3.000   Max.   :1.0000
head(house.selling.price.2,10) #output review
P S Be Ba New
1 48.5 1.10 3 1 0
2 55.0 1.01 3 2 0
3 68.0 1.45 3 2 0
4 137.0 2.40 3 3 0
5 309.4 3.30 4 3 1
6 17.5 0.40 1 1 0
7 19.6 1.28 3 1 0
8 24.5 0.74 3 1 0
9 34.8 0.78 2 1 0
10 32.0 0.97 3 1 0
Before getting started with the solutions, I create a correlation matrix.
library(rstatix) #provides cor_mat()
house_price_2<-house.selling.price.2 #new object for data
cor_mat(house_price_2) #correlation matrix
# A tibble: 5 x 6
rowname P S Be Ba New
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 P 1 0.9 0.59 0.71 0.36
2 S 0.9 1 0.67 0.66 0.18
3 Be 0.59 0.67 1 0.33 0.27
4 Ba 0.71 0.66 0.33 1 0.18
5 New 0.36 0.18 0.27 0.18 1
Now we fit the multiple regression model for the house.selling.price.2 data; the summary provides the table of coefficients.
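A minimal sketch of the call that produces this summary (the object name fit_all is my own placeholder; the formula and data come from the Call line in the output below):
fit_all <- lm(P ~ ., data = house_price_2) #regress price on all four predictors
summary(fit_all) #coefficient table shown below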
Call:
lm(formula = P ~ ., data = house_price_2)
Residuals:
Min 1Q Median 3Q Max
-36.212 -9.546 1.277 9.406 71.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.795 12.104 -3.453 0.000855 ***
S 64.761 5.630 11.504 < 2e-16 ***
Be -2.766 3.960 -0.698 0.486763
Ba 19.203 5.650 3.399 0.001019 **
New 18.984 3.873 4.902 4.3e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
Correlation matrix of the housing data, with sample size and pairwise p-values
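This output matches the format of Hmisc::rcorr(), which reports the correlations, the sample size, and a matrix of p-values; a sketch of the call, assuming that package, is:
library(Hmisc) #provides rcorr()
rcorr(as.matrix(house_price_2)) #correlations, n, and p-values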
P S Be Ba New
P 1.00 0.90 0.59 0.71 0.36
S 0.90 1.00 0.67 0.66 0.18
Be 0.59 0.67 1.00 0.33 0.27
Ba 0.71 0.66 0.33 1.00 0.18
New 0.36 0.18 0.27 0.18 1.00
n= 93
P
       P      S      Be     Ba     New
P             0.0000 0.0000 0.0000 0.0005
S      0.0000        0.0000 0.0000 0.0910
Be     0.0000 0.0000        0.0011 0.0096
Ba     0.0000 0.0000 0.0011        0.0807
New    0.0005 0.0910 0.0096 0.0807
With these four predictors:
Part A
For backward elimination, which variable would be deleted first? Why?
Backward elimination starts with the full model and successively drops the predictor that contributes least, i.e., the one with the largest p-value, so long as removing it improves the model. Since Be (Beds) has by far the largest p-value (0.487), it would be deleted first. I then refit the model without Be and verify the result with the step() function.
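The reduced model summarized below was presumably fit along these lines; the object name fit_Backward matches the step() call further down, and the formula matches the Call line in the output:
fit_Backward <- lm(P ~ . - Be, data = house.selling.price.2) #refit without Be
summary(fit_Backward)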
Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
Ba 20.072 5.495 3.653 0.000438 ***
New 18.371 3.761 4.885 4.54e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
Analysis:
With Be (Beds) removed, every remaining variable in the model has a p-value below 0.05 and is statistically significant, and the standard errors of S, Ba and New shrink slightly.
For verification, I use the step() function from the stats package to run backward elimination on the reduced model; it confirms that none of the remaining predictors (S, Ba, New) should be dropped, since every removal would increase the AIC.
step(object = fit_Backward, direction = "backward",
     scope = formula(fit_Backward)) #backward elimination
Start: AIC=523.21
P ~ (S + Be + Ba + New) - Be
Df Sum of Sq RSS AIC
<none> 23684 523.21
- Ba 1 3550 27234 534.20
- New 1 6349 30033 543.30
- S 1 54898 78582 632.75
Call:
lm(formula = P ~ (S + Be + Ba + New) - Be, data = house.selling.price.2)
Coefficients:
(Intercept) S Ba New
-47.99 62.26 20.07 18.37
Part B
For forward selection, which variable would be added first? Why?
In forward selection we start from the null (intercept-only) model and add variables one at a time, at each step adding the one that improves the fit the most. The first variable added would be S (size of home): it has by far the strongest association with selling price (correlation of 0.90 with P) and would have the smallest p-value among the one-predictor models. The step output below confirms this, since adding S produces the largest drop in AIC.
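Forward selection starts from the intercept-only model, which is presumably fit as follows; the object name fit_Forward matches the stepAIC() call below and the formula matches the Call line in the output:
fit_Forward <- lm(P ~ 1, data = house.selling.price.2) #intercept-only (null) model
summary(fit_Forward)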
Call:
lm(formula = P ~ 1, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-82.033 -26.633 -3.533 15.467 209.867
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.533 4.582 21.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.18 on 92 degrees of freedom
Again for verification I use stepwise selection, this time in the forward direction with the stepAIC() function from the MASS package.
library(MASS) #provides stepAIC()
step_forward<-stepAIC(object=fit_Forward,direction = "forward",
                      scope = P~S+Be+Ba+New) #scope specifies which predictors may enter the model
Start: AIC=705.63
P ~ 1
Df Sum of Sq RSS AIC
+ S 1 145097 34508 554.22
+ Ba 1 91484 88121 641.41
+ Be 1 62578 117028 667.79
+ New 1 22833 156772 694.99
<none> 179606 705.63
Step: AIC=554.22
P ~ S
Df Sum of Sq RSS AIC
+ New 1 7274.7 27234 534.20
+ Ba 1 4475.6 30033 543.30
<none> 34508 554.22
+ Be 1 40.4 34468 556.11
Step: AIC=534.2
P ~ S + New
Df Sum of Sq RSS AIC
+ Ba 1 3550.1 23684 523.21
+ Be 1 588.8 26645 534.17
<none> 27234 534.20
Step: AIC=523.21
P ~ S + New + Ba
Df Sum of Sq RSS AIC
<none> 23684 523.21
+ Be 1 130.55 23553 524.70
step_forward #output of step_forward
Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)
Coefficients:
(Intercept) S New Ba
-47.99 62.26 18.37 20.07
Notably, backward elimination and forward selection arrive at the same model, with identical coefficients:
\(\widehat{Price} = -47.99 + 62.26(Size) + 18.37(New) + 20.07(Ba)\)
Part C
Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
The large p-value for BEDS does not mean bedrooms are unrelated to price; it means BEDS adds little information once the other predictors are in the model. BEDS is strongly correlated with house size (r = 0.67 with S) and moderately with bathrooms (r = 0.33 with Ba), so most of what BEDS tells us about price is already captured by S and Ba. This multicollinearity inflates the standard error of the BEDS coefficient and produces the large p-value, even though the marginal correlation between BEDS and PRICE is a substantial 0.59. With only n = 93 observations there is also limited power to detect the small partial effect that remains.
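One quick way to see this collinearity, assuming the car package is available (this check is not part of the original output), is to compute variance inflation factors for the full model; a VIF noticeably above 1 for Be would indicate that much of its information is shared with the other predictors:
library(car) #provides vif()
vif(lm(P ~ S + Be + Ba + New, data = house.selling.price.2)) #variance inflation factors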
Part D
Using software with these four predictors, find the model that would be selected using each criterion: \(R^{2}\), adjusted \(R^{2}\), PRESS, AIC, and BIC.
Solutions:
The following code performs stepwise selection in both directions, using variations of the linear regression models fit earlier.
housing_price<-lm(P~1,data = house.selling.price.2)
all_housing_price_m1<-lm(P~.,data = house.selling.price.2)
stepwise_housing<-step(object=housing_price,direction = "both", #both step wise directions
scope = formula(all_housing_price_m1)) #specifies which predictors we want to enter into the model
Start: AIC=705.63
P ~ 1
Df Sum of Sq RSS AIC
+ S 1 145097 34508 554.22
+ Ba 1 91484 88121 641.41
+ Be 1 62578 117028 667.79
+ New 1 22833 156772 694.99
<none> 179606 705.63
Step: AIC=554.22
P ~ S
Df Sum of Sq RSS AIC
+ New 1 7275 27234 534.20
+ Ba 1 4476 30033 543.30
<none> 34508 554.22
+ Be 1 40 34468 556.11
- S 1 145097 179606 705.63
Step: AIC=534.2
P ~ S + New
Df Sum of Sq RSS AIC
+ Ba 1 3550 23684 523.21
+ Be 1 589 26645 534.17
<none> 27234 534.20
- New 1 7275 34508 554.22
- S 1 129539 156772 694.99
Step: AIC=523.21
P ~ S + New + Ba
Df Sum of Sq RSS AIC
<none> 23684 523.21
+ Be 1 131 23553 524.70
- Ba 1 3550 27234 534.20
- New 1 6349 30033 543.30
- S 1 54898 78582 632.75
summary(stepwise_housing)
Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
New 18.371 3.761 4.885 4.54e-06 ***
Ba 20.072 5.495 3.653 0.000438 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
\(R^{2}\)
The candidate models can now be compared on each criterion. We start with Model 1, which includes all of the predictors in the housing data. The broom package is used to extract model statistics in a concise one-row format via its glance() function.
Model 1
Call:
lm(formula = P ~ ., data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-36.212 -9.546 1.277 9.406 71.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.795 12.104 -3.453 0.000855 ***
S 64.761 5.630 11.504 < 2e-16 ***
Be -2.766 3.960 -0.698 0.486763
Ba 19.203 5.650 3.399 0.001019 **
New 18.984 3.873 4.902 4.3e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
#model 1
library(broom) #provides glance()
glance(all_housing_price_m1)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.869 0.863 16.4 146. 5.94e-38 4 -389. 791.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
For the Model_1 with all predictors our \(R^{2}\) is 0.8689
In Model_2 the Beds variable is removed, since it is strongly correlated with Size and has a large p-value in the full model, making it the weakest predictor.
Model 2
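A sketch of the Model 2 fit; the object name is taken from the glance() call below and the formula from the Call line in the output:
housing_price_nobed_m2 <- lm(P ~ . - Be, data = house.selling.price.2) #Model 2: drop Beds
summary(housing_price_nobed_m2)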
Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
Ba 20.072 5.495 3.653 0.000438 ***
New 18.371 3.761 4.885 4.54e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
glance(housing_price_nobed_m2)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.868 0.864 16.3 195. 4.96e-39 3 -390. 789.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
\(R^{2}\) for the Model_2 is 0.8681
Model 3
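The Call shown below indicates how Model 3 was fit (the object name comes from the glance() call further down). Note what subset = -Ba actually does here:
#subset selects rows, not predictors: -Ba drops rows 1-3 (the values that Ba takes),
#so the Ba predictor itself stays in the model and only 90 of the 93 houses are used
housing_price_nobed_noba_m3 <- lm(P ~ . - Be, data = house.selling.price.2, subset = -Ba)
summary(housing_price_nobed_noba_m3)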
Call:
lm(formula = P ~ . - Be, data = house.selling.price.2, subset = -Ba)
Residuals:
Min 1Q Median 3Q Max
-34.787 -9.785 0.897 8.108 73.084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.016 8.588 -5.707 1.60e-07 ***
S 61.863 4.473 13.832 < 2e-16 ***
Ba 20.996 5.760 3.645 0.000457 ***
New 18.197 3.821 4.763 7.68e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.5 on 86 degrees of freedom
Multiple R-squared: 0.8654, Adjusted R-squared: 0.8607
F-statistic: 184.3 on 3 and 86 DF, p-value: < 2.2e-16
#model 3 output
glance(housing_price_nobed_noba_m3)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.865 0.861 16.5 184. 2.48e-37 3 -378. 766.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
We see the \(R^{2}\) is 0.8654. Note, however, that the Call above shows this model was fit with subset = -Ba. In lm() the subset argument selects rows, not predictors, so this drops observations 1-3 rather than removing the Bath variable: Ba still appears in the coefficient table, and the model is fit to only 90 of the 93 houses (86 residual degrees of freedom). Model_3 is therefore the Model_2 predictors (S, Ba, New) fit to a subset of the data, and its fit statistics are not directly comparable to those of the other models.
Model 4
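A sketch of the Model 4 fit; the object name is taken from the glance() call below and the formula from the Call line in the output:
housing_price_new_m4 <- lm(P ~ S + New, data = house.selling.price.2) #Model 4: Size and New only
summary(housing_price_new_m4)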
Call:
lm(formula = P ~ S + New, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-47.207 -9.763 -0.091 9.984 76.405
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.089 5.977 -4.365 3.39e-05 ***
S 72.575 3.508 20.690 < 2e-16 ***
New 19.587 3.995 4.903 4.16e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared: 0.8484, Adjusted R-squared: 0.845
F-statistic: 251.8 on 2 and 90 DF, p-value: < 2.2e-16
#model 4
glance(housing_price_new_m4)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.848 0.845 17.4 252. 1.37e-37 2 -396. 800.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
\(R^{2}\) for Model_4 is 0.8484
Interpretation:
Comparing all four models on \(R^{2}\), Model_1 with all four predictors has the largest value, 0.8689. This is expected: \(R^{2}\) can never decrease when a predictor is added, so this criterion always favors the most complex model.
Adjusted \(R^{2}\)
Comparing all four models on adjusted \(R^{2}\), which penalizes additional predictors via \(R^{2}_{adj} = 1-(1-R^{2})\frac{n-1}{n-p-1}\), Model_2 has the largest value, 0.8637, and is the preferred model under this criterion.
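As a convenience (this comparison is not part of the original output; the object names refer to the models defined above), the adjusted \(R^{2}\) values can be pulled from all four fits at once:
models <- list(m1 = all_housing_price_m1, m2 = housing_price_nobed_m2,
               m3 = housing_price_nobed_noba_m3, m4 = housing_price_new_m4)
sapply(models, function(m) summary(m)$adj.r.squared) #adjusted R-squared, side by side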
PRESS
PRESS <- function(model) {
  # leave-one-out (deleted) residuals: ordinary residuals scaled by 1/(1 - hat value)
  press_res <- residuals(model) / (1 - lm.influence(model)$hat)
  sum(press_res^2)
}
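This implements the leave-one-out identity \(PRESS = \sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^{2}\), where \(e_i\) is the ith residual and \(h_{ii}\) its hat (leverage) value, so the deleted residuals are obtained without refitting the model n times.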
PRESS for Models 1 through 4
PRESS(all_housing_price_m1) #model 1 - all variables
[1] 28390.22
PRESS(housing_price_nobed_m2) #model 2 - excludes variable Bed (includes size)
[1] 27860.05
PRESS(housing_price_nobed_noba_m3) #model 3 - fit with subset = -Ba (keeps S, Ba and New; 90 observations)
[1] 27690.82
PRESS(housing_price_new_m4) #model 4 - S and New only
[1] 31066
Interpretation:
Given several regression models, the one with the lowest PRESS is expected to predict best on new data. Among the models fit to all 93 houses, Model_2 (which excludes Beds) has the lowest PRESS at 27860.05. Model_3 shows a slightly smaller value of 27690.82, but because of the subset = -Ba call it uses only 90 observations, so its PRESS is not directly comparable.
AIC
#AIC for all 4 models
AIC(all_housing_price_m1) #model 1
[1] 790.6225
AIC(housing_price_nobed_m2) #model 2
[1] 789.1366
AIC(housing_price_nobed_noba_m3) #model 3
[1] 765.8876
AIC(housing_price_new_m4) #model 4
[1] 800.1262
Interpretation:
The goal is to minimize AIC. Among the models fit to all 93 houses, Model_2 (which excludes only Beds) has the smallest AIC at 789.1366, compared with 790.6225 for the full model and 800.1262 for Model_4. Model_3 reports a lower AIC of 765.8876, but as noted above it was fit with subset = -Ba, so it keeps the Ba predictor and uses only 90 observations; its AIC is therefore not directly comparable to the others.
BIC
#BIC for all 4 models
BIC(all_housing_price_m1) #model 1
[1] 805.8181
BIC(housing_price_nobed_m2) #model 2
[1] 801.7996
BIC(housing_price_nobed_noba_m3) #model 3
[1] 778.3866
BIC(housing_price_new_m4) #model 4
[1] 810.2566
Interpretation:
As with AIC, the goal is to minimize BIC. Among the models fit to all 93 houses, Model_2 (Size, New and Bath, excluding Beds) has the smallest BIC at 801.7996, compared with 805.8181 for the full model and 810.2566 for Model_4. Model_3's smaller value of 778.3866 again reflects the subset = -Ba fit on 90 observations and is not directly comparable.
Part E
Explain which model you prefer and why.
Of all the models analyzed, my preference is Model_2, which drops the Beds variable and keeps Size, Bath and New. Every predictor in it is statistically significant, it has the largest adjusted \(R^{2}\) (0.8637, versus 0.8629 for the full model), and it has the smallest PRESS, AIC and BIC among the models fit to all 93 houses. Model_3 reports lower AIC, BIC and PRESS values and a lower adjusted \(R^{2}\) of 0.8607, but it was fit with subset = -Ba, so it still contains the Bath predictor and uses only 90 observations; its statistics are not comparable and do not change the choice of Model_2.
(Data file: trees from base R)
From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”
Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,
Part A
Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables
First, we load the trees data.
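A minimal sketch of the loading code (trees ships with base R's datasets package):
data(trees) #built-in black cherry tree measurements
head(trees, 10) #first ten rows shown below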
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
7 11.0 66 15.6
8 11.0 75 18.2
9 11.1 80 22.6
10 11.2 75 19.9
summary(trees) #view of output
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
The documentation notes that Girth is erroneously labelled and is actually the diameter in inches, so I rename the variable Girth to Diameter. Let's fix it!
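A sketch of the renaming step, assuming dplyr; the object name trees_d matches the Call in the model output further down:
library(dplyr) #provides rename()
trees_d <- rename(trees, Diameter = Girth) #Girth is really the diameter in inches
head(trees_d)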
Diameter Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
We now fit the model with Volume as the outcome while Diameter and Height are the explanatory variables.
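A sketch of the fit; the object name Tree_Vol matches the plot() and cooks.distance() calls used later:
Tree_Vol <- lm(Volume ~ Diameter + Height, data = trees_d) #multiple regression model
summary(Tree_Vol)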
Call:
lm(formula = Volume ~ Diameter + Height, data = trees_d)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Diameter 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Analysis:
We see that Diameter and Height both have small p-values, indicating they are statistically significant predictors of tree volume, which makes intuitive sense. The model also fits the data very well: the \(R^{2}\) of 0.948 and adjusted \(R^{2}\) of 0.9442 mean that roughly 95% of the variability in volume is explained by diameter and height.
Part B
Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?
A 2 x 3 plotting grid (for example, via par(mfrow = c(2, 3))) is created to display the six regression diagnostic plots.
Diagnostic regression plots
plot(Tree_Vol,pch=18,col="blue",which = 1:6)#view output of regression diagnostic plots.
Interpretation:
1. Residuals vs. Fitted
This plot is used to check whether the residuals exhibit non-linear patterns. If the red line across the center of the plot is roughly horizontal, the linearity assumption is reasonable. For the trees data the red line bends noticeably, with a U-shaped pattern in the residuals (positive at the extremes of the fitted values, negative in the middle). This suggests the linearity assumption is violated and that a curvilinear or transformed model (for example, one reflecting that volume scales roughly with diameter squared times height) may be more appropriate.
2. Normal Q-Q
This plot is used to determine if the residuals of the regression model are normally distributed. If the points in this plot fall roughly along a straight diagonal line, we can then assume the residuals are normally distributed. For this data we see that the points fall roughly along the straight diagonal line. A few of the observations deviate a bit from the line near the top, but not enough to declare that the residuals are non-normally distributed.
3. Scale-Location
This plot checks the assumption of equal variance which is also called “homoscedasticity” among the residuals in our regression model. If the red line is roughly horizontal across the plot, then the assumption of equal variance is likely met. In this case we see that the red line isn’t horizontal across the plot, which may suggest that equal variance may be violated.
4. Cook’s Distance
The Cook’s Distance plot is an estimate of the influence of a data point. It takes into account both the leverage and residual of each observation. Cook’s Distance is a summary of how much a regression model changes when the ith observation is removed.In this case the Cook’s Distance for each of the n=31 cases are taken into account. We see the observations with the highest/most influential Cooks Distances are 31 at 0.605, 18 at 0.177 and 3 at .167.This is verified below using the Cook’s Distance function(). We therefore see there are several observations in the data set are highly influential.
Determining the influential observations using the cooks.distance() function
cooks_D <- cooks.distance(Tree_Vol)
influential <- cooks_D[(cooks_D > (3 * mean(cooks_D, na.rm = TRUE)))]
influential
3 18 31
0.1673192 0.1775359 0.6052326
5. Residuals vs. Leverage
This plot is used to identify influential, high-leverage observations: points falling outside the dashed Cook's distance contours are influential. Here observation 31, which combines the highest leverage with a sizeable residual, stands out as an influential point.
6. Cook's Distance vs. Leverage
This plot again highlights the influential points seen in the Cook's distance plot: observations 3, 18 and especially 31 are the most influential data points.
(inspired by ALR 9.16)
(Data file: florida in alr R package)
In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.
The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
First, the florida data set is imported.
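A sketch of the import, assuming the alr4 companion package supplies the florida data (the assignment only names the "alr" package, so the package name is an assumption):
library(alr4) #assumed source of the florida data
data("florida")
head(florida, 10) #first ten counties shown below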
Gore Bush Buchanan
ALACHUA 47300 34062 262
BAKER 2392 5610 73
BAY 18850 38637 248
BRADFORD 3072 5413 65
BREVARD 97318 115185 570
BROWARD 386518 177279 789
CALHOUN 2155 2873 90
CHARLOTTE 29641 35419 182
CITRUS 25501 29744 270
CLAY 14630 41745 186
Using the summary() and str() functions, I take a look at the data set.
summary(florida)
Gore Bush Buchanan
Min. : 788 Min. : 1316 Min. : 9.0
1st Qu.: 3055 1st Qu.: 4746 1st Qu.: 46.5
Median : 14152 Median : 20196 Median : 114.0
Mean : 43341 Mean : 43356 Mean : 258.5
3rd Qu.: 45974 3rd Qu.: 56542 3rd Qu.: 285.5
Max. :386518 Max. :289456 Max. :3407.0
str(florida)
'data.frame': 67 obs. of 3 variables:
$ Gore : int 47300 2392 18850 3072 97318 386518 2155 29641 25501 14630 ...
$ Bush : int 34062 5610 38637 5413 115185 177279 2873 35419 29744 41745 ...
$ Buchanan: int 262 73 248 65 570 789 90 182 270 186 ...
I now fit the simple linear regression model for the florida data, with the Buchanan vote as the outcome and the Bush vote as the explanatory variable.
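A sketch of the fit; the object name florida_vote matches the diagnostic-plot and cooks.distance() calls below:
florida_vote <- lm(Buchanan ~ Bush, data = florida) #simple linear regression
summary(florida_vote)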
Call:
lm(formula = Buchanan ~ Bush, data = florida)
Residuals:
Min 1Q Median 3Q Max
-907.50 -46.10 -29.19 12.26 2610.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.529e+01 5.448e+01 0.831 0.409
Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
Interpretation:
The Bush vote is a statistically significant predictor of the Buchanan vote, with a p-value of 1.73e-08. However, the \(R^{2}\) of 0.389 and adjusted \(R^{2}\) of 0.380 are low, indicating that the linear fit explains only a modest share of the county-to-county variation.
I now look at diagnostic regression plots of the data.
Diagnostic regression plots
plot(florida_vote,pch=18,col="blue",which = 1:6)#view output of regression diagnostic plots.
Interpretation:
The diagnostic plots give more visual insight into the florida data. Palm Beach County stands out as an extreme outlier that violates the regression assumptions; Dade County is also flagged as influential, although far less dramatically than Palm Beach County.
Residuals vs. Fitted
The red line across the center of the plot is nearly horizontal, so for most counties the residuals are consistent with a linear relationship. The clear exception is Palm Beach County, whose residual sits far above the rest near the top of the plot; it is an obvious outlier that violates the model assumptions.
Normal Q-Q
Most points fall roughly along the straight diagonal line, so the residuals are approximately normally distributed. The exception is Palm Beach County, which deviates far from the line and violates the normality assumption.
Scale-Location
This plot checks the assumption of equal variance (homoscedasticity) among the residuals. If the red line is roughly horizontal across the plot, the assumption is likely met. Here the red line is not horizontal, and the Palm Beach County residual again stands out as an outlier, suggesting the equal-variance assumption is violated.
Cook's Distance
Once again I use the function cooks.distance() to help determine the influential observations.
#Cook's Distance function to calculate influential observations
cooks_Dis <- cooks.distance(florida_vote)
influential <- cooks_Dis[(cooks_Dis > (3 * mean(cooks_Dis, na.rm = TRUE)))]
influential
DADE PALM BEACH
1.981366 2.231935
Once again we see that Palm Beach County and Dade County display the most influential observations. The Cook’s Distance plot supports this, with Palm Beach County displaying the most influence.
Residuals vs. Leverage
This plot is used to identify influential or high-leverage observations: points falling outside the dashed Cook's distance contours are influential. Palm Beach County falls well outside the contours and is clearly an influential observation.
Cook’s Distance vs. Leverage
We again see that Palm Beach County and Dade County are the most highly influential data points/observations as similarly observed in the Cook’s Distance Plot.