DACSS 603
(SMSS 14.3, 14.4, merged & modified)
(Data file: house.selling.price.2 from smss R package)
For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.
The data is imported from the smss package and summarized below.
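A minimal sketch of the import and summary call (assuming the smss package is installed):
library(smss) #course data package
data(house.selling.price.2)
summary(house.selling.price.2) #output shown below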
       P                S              Be              Ba             New
 Min.   : 17.50   Min.   :0.40   Min.   :1.000   Min.   :1.000   Min.   :0.0000
 1st Qu.: 72.90   1st Qu.:1.33   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:0.0000
 Median : 96.00   Median :1.57   Median :3.000   Median :2.000   Median :0.0000
 Mean   : 99.53   Mean   :1.65   Mean   :3.183   Mean   :1.957   Mean   :0.3011
 3rd Qu.:115.00   3rd Qu.:1.98   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:1.0000
 Max.   :309.40   Max.   :3.85   Max.   :5.000   Max.   :3.000   Max.   :1.0000
head(house.selling.price.2,10) #output review
P S Be Ba New
1 48.5 1.10 3 1 0
2 55.0 1.01 3 2 0
3 68.0 1.45 3 2 0
4 137.0 2.40 3 3 0
5 309.4 3.30 4 3 1
6 17.5 0.40 1 1 0
7 19.6 1.28 3 1 0
8 24.5 0.74 3 1 0
9 34.8 0.78 2 1 0
10 32.0 0.97 3 1 0
Before getting started with the solutions, I create a correlation matrix.
library(rstatix) #provides cor_mat()
house_price_2<-house.selling.price.2 #new object for data
cor_mat(house_price_2) #correlation matrix
# A tibble: 5 x 6
rowname P S Be Ba New
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 P 1 0.9 0.59 0.71 0.36
2 S 0.9 1 0.67 0.66 0.18
3 Be 0.59 0.67 1 0.33 0.27
4 Ba 0.71 0.66 0.33 1 0.18
5 New 0.36 0.18 0.27 0.18 1
Now we fit the multiple regression model for the house.selling.price.2 data; the summary provides the table of coefficients.
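A minimal sketch of the call that produces this summary (the object name fit_all is my own placeholder; the formula and data come from the Call line in the output below):
fit_all <- lm(P ~ ., data = house_price_2) #regress price on all four predictors
summary(fit_all) #coefficient table shown below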
Call:
lm(formula = P ~ ., data = house_price_2)
Residuals:
Min 1Q Median 3Q Max
-36.212 -9.546 1.277 9.406 71.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.795 12.104 -3.453 0.000855 ***
S 64.761 5.630 11.504 < 2e-16 ***
Be -2.766 3.960 -0.698 0.486763
Ba 19.203 5.650 3.399 0.001019 **
New 18.984 3.873 4.902 4.3e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
Correlation matrix of the housing data, with sample size and pairwise p-values
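This output matches the format of Hmisc::rcorr(), which reports the correlations, the sample size, and a matrix of p-values; a sketch of the call, assuming that package, is:
library(Hmisc) #provides rcorr()
rcorr(as.matrix(house_price_2)) #correlations, n, and p-values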
P S Be Ba New
P 1.00 0.90 0.59 0.71 0.36
S 0.90 1.00 0.67 0.66 0.18
Be 0.59 0.67 1.00 0.33 0.27
Ba 0.71 0.66 0.33 1.00 0.18
New 0.36 0.18 0.27 0.18 1.00
n= 93
P
       P      S      Be     Ba     New
P             0.0000 0.0000 0.0000 0.0005
S      0.0000        0.0000 0.0000 0.0910
Be     0.0000 0.0000        0.0011 0.0096
Ba     0.0000 0.0000 0.0011        0.0807
New    0.0005 0.0910 0.0096 0.0807
With these four predictors:
Part A
For backward elimination, which variable would be deleted first? Why?
Backward elimination starts with the full model and successively drops the predictor that contributes least, i.e., the one with the largest p-value, so long as removing it improves the model. Since Be (Beds) has by far the largest p-value (0.487), it would be deleted first. I then refit the model without Be and verify the result with the step() function.
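The reduced model summarized below was presumably fit along these lines; the object name fit_Backward matches the step() call further down, and the formula matches the Call line in the output:
fit_Backward <- lm(P ~ . - Be, data = house.selling.price.2) #refit without Be
summary(fit_Backward)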
Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
Ba 20.072 5.495 3.653 0.000438 ***
New 18.371 3.761 4.885 4.54e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
Analysis:
With Be (Beds) removed, every remaining variable in the model has a p-value below 0.05 and is statistically significant, and the standard errors of S, Ba and New shrink slightly.
For verification, I use the step() function from the stats package to run backward elimination on the reduced model; it confirms that none of the remaining predictors (S, Ba, New) should be dropped, since every removal would increase the AIC.
step(object = fit_Backward, direction = "backward",
     scope = formula(fit_Backward)) #backward elimination
Start: AIC=523.21
P ~ (S + Be + Ba + New) - Be
Df Sum of Sq RSS AIC
<none> 23684 523.21
- Ba 1 3550 27234 534.20
- New 1 6349 30033 543.30
- S 1 54898 78582 632.75
Call:
lm(formula = P ~ (S + Be + Ba + New) - Be, data = house.selling.price.2)
Coefficients:
(Intercept) S Ba New
-47.99 62.26 20.07 18.37
Part B
For forward selection, which variable would be added first? Why?
In forward selection we start from the null (intercept-only) model and add variables one at a time, at each step adding the one that improves the fit the most. The first variable added would be S (size of home): it has by far the strongest association with selling price (correlation of 0.90 with P) and would have the smallest p-value among the one-predictor models. The step output below confirms this, since adding S produces the largest drop in AIC.
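Forward selection starts from the intercept-only model, which is presumably fit as follows; the object name fit_Forward matches the stepAIC() call below and the formula matches the Call line in the output:
fit_Forward <- lm(P ~ 1, data = house.selling.price.2) #intercept-only (null) model
summary(fit_Forward)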
Call:
lm(formula = P ~ 1, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-82.033 -26.633 -3.533 15.467 209.867
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.533 4.582 21.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.18 on 92 degrees of freedom
Again for verification I use stepwise selection, this time in the forward direction with the stepAIC() function from the MASS package.
library(MASS) #provides stepAIC()
step_forward<-stepAIC(object=fit_Forward,direction = "forward",
                      scope = P~S+Be+Ba+New) #scope specifies which predictors may enter the model
Start: AIC=705.63
P ~ 1
Df Sum of Sq RSS AIC
+ S 1 145097 34508 554.22
+ Ba 1 91484 88121 641.41
+ Be 1 62578 117028 667.79
+ New 1 22833 156772 694.99
<none> 179606 705.63
Step: AIC=554.22
P ~ S
Df Sum of Sq RSS AIC
+ New 1 7274.7 27234 534.20
+ Ba 1 4475.6 30033 543.30
<none> 34508 554.22
+ Be 1 40.4 34468 556.11
Step: AIC=534.2
P ~ S + New
Df Sum of Sq RSS AIC
+ Ba 1 3550.1 23684 523.21
+ Be 1 588.8 26645 534.17
<none> 27234 534.20
Step: AIC=523.21
P ~ S + New + Ba
Df Sum of Sq RSS AIC
<none> 23684 523.21
+ Be 1 130.55 23553 524.70
step_forward #output of step_forward
Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)
Coefficients:
(Intercept) S New Ba
-47.99 62.26 18.37 20.07
Notably, backward elimination and forward selection arrive at the same model, with identical coefficients:
\(\widehat{Price} = -47.99 + 62.26(Size) + 18.37(New) + 20.07(Ba)\)
Part C
Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
The large p-value for BEDS does not mean bedrooms are unrelated to price; it means BEDS adds little information once the other predictors are in the model. BEDS is strongly correlated with house size (r = 0.67 with S) and moderately with bathrooms (r = 0.33 with Ba), so most of what BEDS tells us about price is already captured by S and Ba. This multicollinearity inflates the standard error of the BEDS coefficient and produces the large p-value, even though the marginal correlation between BEDS and PRICE is a substantial 0.59. With only n = 93 observations there is also limited power to detect the small partial effect that remains.
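One quick way to see this collinearity, assuming the car package is available (this check is not part of the original output), is to compute variance inflation factors for the full model; a VIF noticeably above 1 for Be would indicate that much of its information is shared with the other predictors:
library(car) #provides vif()
vif(lm(P ~ S + Be + Ba + New, data = house.selling.price.2)) #variance inflation factors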
Part D
Using software with these four predictors, find the model that would be selected using each criterion: \(R^{2}\), adjusted \(R^{2}\), PRESS, AIC, and BIC.
Solutions:
The following code performs stepwise selection in both directions, using variations of the linear regression models fit earlier.
housing_price<-lm(P~1,data = house.selling.price.2)
all_housing_price_m1<-lm(P~.,data = house.selling.price.2)
stepwise_housing<-step(object=housing_price,direction = "both", #both step wise directions
scope = formula(all_housing_price_m1)) #specifies which predictors we want to enter into the model
Start: AIC=705.63
P ~ 1
Df Sum of Sq RSS AIC
+ S 1 145097 34508 554.22
+ Ba 1 91484 88121 641.41
+ Be 1 62578 117028 667.79
+ New 1 22833 156772 694.99
<none> 179606 705.63
Step: AIC=554.22
P ~ S
Df Sum of Sq RSS AIC
+ New 1 7275 27234 534.20
+ Ba 1 4476 30033 543.30
<none> 34508 554.22
+ Be 1 40 34468 556.11
- S 1 145097 179606 705.63
Step: AIC=534.2
P ~ S + New
Df Sum of Sq RSS AIC
+ Ba 1 3550 23684 523.21
+ Be 1 589 26645 534.17
<none> 27234 534.20
- New 1 7275 34508 554.22
- S 1 129539 156772 694.99
Step: AIC=523.21
P ~ S + New + Ba
Df Sum of Sq RSS AIC
<none> 23684 523.21
+ Be 1 131 23553 524.70
- Ba 1 3550 27234 534.20
- New 1 6349 30033 543.30
- S 1 54898 78582 632.75
summary(stepwise_housing)
Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
New 18.371 3.761 4.885 4.54e-06 ***
Ba 20.072 5.495 3.653 0.000438 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
\(R^{2}\)
The candidate models can now be compared on each criterion. We start with Model 1, which includes all of the predictors in the housing data. The broom package is used to extract model statistics in a concise one-row format via its glance() function.
Model 1
Call:
lm(formula = P ~ ., data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-36.212 -9.546 1.277 9.406 71.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.795 12.104 -3.453 0.000855 ***
S 64.761 5.630 11.504 < 2e-16 ***
Be -2.766 3.960 -0.698 0.486763
Ba 19.203 5.650 3.399 0.001019 **
New 18.984 3.873 4.902 4.3e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
#model 1
library(broom) #provides glance()
glance(all_housing_price_m1)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.869 0.863 16.4 146. 5.94e-38 4 -389. 791.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
For the Model_1 with all predictors our \(R^{2}\) is 0.8689
In Model_2 the Beds variable is removed, since it is strongly correlated with Size and has a large p-value in the full model, making it the weakest predictor.
Model 2
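A sketch of the Model 2 fit; the object name is taken from the glance() call below and the formula from the Call line in the output:
housing_price_nobed_m2 <- lm(P ~ . - Be, data = house.selling.price.2) #Model 2: drop Beds
summary(housing_price_nobed_m2)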
Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
Ba 20.072 5.495 3.653 0.000438 ***
New 18.371 3.761 4.885 4.54e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
glance(housing_price_nobed_m2)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.868 0.864 16.3 195. 4.96e-39 3 -390. 789.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
\(R^{2}\) for the Model_2 is 0.8681
Model 3
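The Call shown below indicates how Model 3 was fit (the object name comes from the glance() call further down). Note what subset = -Ba actually does here:
#subset selects rows, not predictors: -Ba drops rows 1-3 (the values that Ba takes),
#so the Ba predictor itself stays in the model and only 90 of the 93 houses are used
housing_price_nobed_noba_m3 <- lm(P ~ . - Be, data = house.selling.price.2, subset = -Ba)
summary(housing_price_nobed_noba_m3)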
Call:
lm(formula = P ~ . - Be, data = house.selling.price.2, subset = -Ba)
Residuals:
Min 1Q Median 3Q Max
-34.787 -9.785 0.897 8.108 73.084
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -49.016 8.588 -5.707 1.60e-07 ***
S 61.863 4.473 13.832 < 2e-16 ***
Ba 20.996 5.760 3.645 0.000457 ***
New 18.197 3.821 4.763 7.68e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.5 on 86 degrees of freedom
Multiple R-squared: 0.8654, Adjusted R-squared: 0.8607
F-statistic: 184.3 on 3 and 86 DF, p-value: < 2.2e-16
#model 3 output
glance(housing_price_nobed_noba_m3)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.865 0.861 16.5 184. 2.48e-37 3 -378. 766.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
We see the \(R^{2}\) is 0.8654. Note, however, that the Call above shows this model was fit with subset = -Ba. In lm() the subset argument selects rows, not predictors, so this drops observations 1-3 rather than removing the Bath variable: Ba still appears in the coefficient table, and the model is fit to only 90 of the 93 houses (86 residual degrees of freedom). Model_3 is therefore the Model_2 predictors (S, Ba, New) fit to a subset of the data, and its fit statistics are not directly comparable to those of the other models.
Model 4
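A sketch of the Model 4 fit; the object name is taken from the glance() call below and the formula from the Call line in the output:
housing_price_new_m4 <- lm(P ~ S + New, data = house.selling.price.2) #Model 4: Size and New only
summary(housing_price_new_m4)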
Call:
lm(formula = P ~ S + New, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-47.207 -9.763 -0.091 9.984 76.405
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.089 5.977 -4.365 3.39e-05 ***
S 72.575 3.508 20.690 < 2e-16 ***
New 19.587 3.995 4.903 4.16e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared: 0.8484, Adjusted R-squared: 0.845
F-statistic: 251.8 on 2 and 90 DF, p-value: < 2.2e-16
#model 4
glance(housing_price_new_m4)
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.848 0.845 17.4 252. 1.37e-37 2 -396. 800.
# ... with 4 more variables: BIC <dbl>, deviance <dbl>,
# df.residual <int>, nobs <int>
\(R^{2}\) for Model_4 is 0.8484
Interpretation:
Comparing all four models on \(R^{2}\), Model_1 with all four predictors has the largest value, 0.8689. This is expected: \(R^{2}\) can never decrease when a predictor is added, so this criterion always favors the most complex model.
Adjusted \(R^{2}\)
Comparing all four models on adjusted \(R^{2}\), which penalizes additional predictors via \(R^{2}_{adj} = 1-(1-R^{2})\frac{n-1}{n-p-1}\), Model_2 has the largest value, 0.8637, and is the preferred model under this criterion.
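As a convenience (this comparison is not part of the original output; the object names refer to the models defined above), the adjusted \(R^{2}\) values can be pulled from all four fits at once:
models <- list(m1 = all_housing_price_m1, m2 = housing_price_nobed_m2,
               m3 = housing_price_nobed_noba_m3, m4 = housing_price_new_m4)
sapply(models, function(m) summary(m)$adj.r.squared) #adjusted R-squared, side by side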
PRESS
PRESS <- function(model) {
  # leave-one-out (deleted) residuals: ordinary residuals scaled by 1/(1 - hat value)
  press_res <- residuals(model) / (1 - lm.influence(model)$hat)
  sum(press_res^2)
}
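This implements the leave-one-out identity \(PRESS = \sum_{i=1}^{n}\left(\frac{e_i}{1-h_{ii}}\right)^{2}\), where \(e_i\) is the ith residual and \(h_{ii}\) its hat (leverage) value, so the deleted residuals are obtained without refitting the model n times.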
PRESS for Models 1 through 4
PRESS(all_housing_price_m1) #model 1 - all variables
[1] 28390.22
PRESS(housing_price_nobed_m2) #model 2 - excludes variable Bed (includes size)
[1] 27860.05
PRESS(housing_price_nobed_noba_m3) #model 3 - fit with subset = -Ba (keeps S, Ba and New; 90 observations)
[1] 27690.82
PRESS(housing_price_new_m4) #model 4 - S and New only
[1] 31066
Interpretation:
Given several regression models, the one with the lowest PRESS is expected to predict best on new data. Among the models fit to all 93 houses, Model_2 (which excludes Beds) has the lowest PRESS at 27860.05. Model_3 shows a slightly smaller value of 27690.82, but because of the subset = -Ba call it uses only 90 observations, so its PRESS is not directly comparable.
AIC
#AIC for all 4 models
AIC(all_housing_price_m1) #model 1
[1] 790.6225
AIC(housing_price_nobed_m2) #model 2
[1] 789.1366
AIC(housing_price_nobed_noba_m3) #model 3
[1] 765.8876
AIC(housing_price_new_m4) #model 4
[1] 800.1262
Interpretation:
The goal is to minimize AIC. Among the models fit to all 93 houses, Model_2 (which excludes only Beds) has the smallest AIC at 789.1366, compared with 790.6225 for the full model and 800.1262 for Model_4. Model_3 reports a lower AIC of 765.8876, but as noted above it was fit with subset = -Ba, so it keeps the Ba predictor and uses only 90 observations; its AIC is therefore not directly comparable to the others.
BIC
#BIC for all 4 models
BIC(all_housing_price_m1) #model 1
[1] 805.8181
BIC(housing_price_nobed_m2) #model 2
[1] 801.7996
BIC(housing_price_nobed_noba_m3) #model 3
[1] 778.3866
BIC(housing_price_new_m4) #model 4
[1] 810.2566
Interpretation:
As with AIC, the goal is to minimize BIC. Among the models fit to all 93 houses, Model_2 (Size, New and Bath, excluding Beds) has the smallest BIC at 801.7996, compared with 805.8181 for the full model and 810.2566 for Model_4. Model_3's smaller value of 778.3866 again reflects the subset = -Ba fit on 90 observations and is not directly comparable.
Part E
Explain which model you prefer and why.
Of all the models analyzed, my preference is Model_2, which drops the Beds variable and keeps Size, Bath and New. Every predictor in it is statistically significant, it has the largest adjusted \(R^{2}\) (0.8637, versus 0.8629 for the full model), and it has the smallest PRESS, AIC and BIC among the models fit to all 93 houses. Model_3 reports lower AIC, BIC and PRESS values and a lower adjusted \(R^{2}\) of 0.8607, but it was fit with subset = -Ba, so it still contains the Bath predictor and uses only 90 observations; its statistics are not comparable and do not change the choice of Model_2.
(Data file: trees from base R)
From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”
Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,
Part A
Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables
First, we load the trees data.
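A minimal sketch of the loading code (trees ships with base R's datasets package):
data(trees) #built-in black cherry tree measurements
head(trees, 10) #first ten rows shown below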
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
7 11.0 66 15.6
8 11.0 75 18.2
9 11.1 80 22.6
10 11.2 75 19.9
summary(trees) #view of output
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
The documentation notes that Girth is erroneously labelled and is actually the diameter in inches, so I rename the variable Girth to Diameter. Let's fix it!
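A sketch of the renaming step, assuming dplyr; the object name trees_d matches the Call in the model output further down:
library(dplyr) #provides rename()
trees_d <- rename(trees, Diameter = Girth) #Girth is really the diameter in inches
head(trees_d)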
Diameter Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
We now fit the model with Volume as the outcome while Diameter and Height are the explanatory variables.
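A sketch of the fit; the object name Tree_Vol matches the plot() and cooks.distance() calls used later:
Tree_Vol <- lm(Volume ~ Diameter + Height, data = trees_d) #multiple regression model
summary(Tree_Vol)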
Call:
lm(formula = Volume ~ Diameter + Height, data = trees_d)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Diameter 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Analysis:
We see that Diameter and Height both have small p-values, indicating they are statistically significant predictors of tree volume, which makes intuitive sense. The model also fits the data very well: the \(R^{2}\) of 0.948 and adjusted \(R^{2}\) of 0.9442 mean that roughly 95% of the variability in volume is explained by diameter and height.
Part B
Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?
A 2 x 3 plotting grid (for example, via par(mfrow = c(2, 3))) is created to display the six regression diagnostic plots.
Diagnostic regression plots
plot(Tree_Vol,pch=18,col="blue",which = 1:6)#view output of regression diagnostic plots.
Interpretation:
1. Residuals vs. Fitted
This plot is used to check whether the residuals exhibit non-linear patterns. If the red line across the center of the plot is roughly horizontal, the linearity assumption is reasonable. For the trees data the red line bends noticeably, with a U-shaped pattern in the residuals (positive at the extremes of the fitted values, negative in the middle). This suggests the linearity assumption is violated and that a curvilinear or transformed model (for example, one reflecting that volume scales roughly with diameter squared times height) may be more appropriate.
2. Normal Q-Q
This plot is used to determine if the residuals of the regression model are normally distributed. If the points in this plot fall roughly along a straight diagonal line, we can then assume the residuals are normally distributed. For this data we see that the points fall roughly along the straight diagonal line. A few of the observations deviate a bit from the line near the top, but not enough to declare that the residuals are non-normally distributed.
3. Scale-Location
This plot checks the assumption of equal variance which is also called “homoscedasticity” among the residuals in our regression model. If the red line is roughly horizontal across the plot, then the assumption of equal variance is likely met. In this case we see that the red line isn’t horizontal across the plot, which may suggest that equal variance may be violated.
4. Cook’s Distance
The Cook’s Distance plot is an estimate of the influence of a data point. It takes into account both the leverage and residual of each observation. Cook’s Distance is a summary of how much a regression model changes when the ith observation is removed.In this case the Cook’s Distance for each of the n=31 cases are taken into account. We see the observations with the highest/most influential Cooks Distances are 31 at 0.605, 18 at 0.177 and 3 at .167.This is verified below using the Cook’s Distance function(). We therefore see there are several observations in the data set are highly influential.
Determining the influential observations using the cooks.distance() function
cooks_D <- cooks.distance(Tree_Vol)
influential <- cooks_D[(cooks_D > (3 * mean(cooks_D, na.rm = TRUE)))]
influential
3 18 31
0.1673192 0.1775359 0.6052326
5. Residuals vs. Leverage
This plot is used to identify influential, high-leverage observations: points falling outside the dashed Cook's distance contours are influential. Here observation 31, which combines the highest leverage with a sizeable residual, stands out as an influential point.
6. Cook's Distance vs. Leverage
This plot again highlights the influential points seen in the Cook's distance plot: observations 3, 18 and especially 31 are the most influential data points.
(inspired by ALR 9.16)
(Data file: florida in alr R package)
In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.
The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
First, the florida data set is imported.
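A sketch of the import, assuming the alr4 companion package supplies the florida data (the assignment only names the "alr" package, so the package name is an assumption):
library(alr4) #assumed source of the florida data
data("florida")
head(florida, 10) #first ten counties shown below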
Gore Bush Buchanan
ALACHUA 47300 34062 262
BAKER 2392 5610 73
BAY 18850 38637 248
BRADFORD 3072 5413 65
BREVARD 97318 115185 570
BROWARD 386518 177279 789
CALHOUN 2155 2873 90
CHARLOTTE 29641 35419 182
CITRUS 25501 29744 270
CLAY 14630 41745 186
Using the summary() and str() functions, I take a look at the data set.
summary(florida)
Gore Bush Buchanan
Min. : 788 Min. : 1316 Min. : 9.0
1st Qu.: 3055 1st Qu.: 4746 1st Qu.: 46.5
Median : 14152 Median : 20196 Median : 114.0
Mean : 43341 Mean : 43356 Mean : 258.5
3rd Qu.: 45974 3rd Qu.: 56542 3rd Qu.: 285.5
Max. :386518 Max. :289456 Max. :3407.0
str(florida)
'data.frame': 67 obs. of 3 variables:
$ Gore : int 47300 2392 18850 3072 97318 386518 2155 29641 25501 14630 ...
$ Bush : int 34062 5610 38637 5413 115185 177279 2873 35419 29744 41745 ...
$ Buchanan: int 262 73 248 65 570 789 90 182 270 186 ...
I now fit the simple linear regression model for the florida data, with the Buchanan vote as the outcome and the Bush vote as the explanatory variable.
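A sketch of the fit; the object name florida_vote matches the diagnostic-plot and cooks.distance() calls below:
florida_vote <- lm(Buchanan ~ Bush, data = florida) #simple linear regression
summary(florida_vote)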
Call:
lm(formula = Buchanan ~ Bush, data = florida)
Residuals:
Min 1Q Median 3Q Max
-907.50 -46.10 -29.19 12.26 2610.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.529e+01 5.448e+01 0.831 0.409
Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
Interpretation:
The Bush vote is a statistically significant predictor of the Buchanan vote, with a p-value of 1.73e-08. However, the \(R^{2}\) of 0.389 and adjusted \(R^{2}\) of 0.380 are low, indicating that the linear fit explains only a modest share of the county-to-county variation.
I now look at diagnostic regression plots of the data.
Diagnostic regression plots
plot(florida_vote,pch=18,col="blue",which = 1:6)#view output of regression diagnostic plots.
Interpretation:
The diagnostic plots give more visual insight into the florida data. Palm Beach County stands out as an extreme outlier that violates the regression assumptions; Dade County is also flagged as influential, although far less dramatically than Palm Beach County.
Residuals vs. Fitted
The red line across the center of the plot is nearly horizontal, so for most counties the residuals are consistent with a linear relationship. The clear exception is Palm Beach County, whose residual sits far above the rest near the top of the plot; it is an obvious outlier that violates the model assumptions.
Normal Q-Q
Most points fall roughly along the straight diagonal line, so the residuals are approximately normally distributed. The exception is Palm Beach County, which deviates far from the line and violates the normality assumption.
Scale-Location
This plot checks the assumption of equal variance (homoscedasticity) among the residuals. If the red line is roughly horizontal across the plot, the assumption is likely met. Here the red line is not horizontal, and the Palm Beach County residual again stands out as an outlier, suggesting the equal-variance assumption is violated.
Cook's Distance
Once again I use the function cooks.distance() to help determine the influential observations.
#Cook's Distance function to calculate influential observations
cooks_Dis <- cooks.distance(florida_vote)
influential <- cooks_Dis[(cooks_Dis > (3 * mean(cooks_Dis, na.rm = TRUE)))]
influential
DADE PALM BEACH
1.981366 2.231935
Once again we see that Palm Beach County and Dade County display the most influential observations. The Cook’s Distance plot supports this, with Palm Beach County displaying the most influence.
Residuals vs. Leverage
This plot is used to identify influential or high-leverage observations: points falling outside the dashed Cook's distance contours are influential. Palm Beach County falls well outside the contours and is clearly an influential observation.
Cook’s Distance vs. Leverage
We again see that Palm Beach County and Dade County are the most highly influential data points/observations as similarly observed in the Cook’s Distance Plot.