HOMEWORK 4 Part I

DACSS 603, Spring 2022

Erin Tracy
4/5/2022

PART I

Question 1

library("smss")
data("house.selling.price.2")
install.packages('plyr', repos = "http://cran.us.r-project.org")

For the house.selling.price.2 data the tables show a correlation matrix and a model fit using four predictors of selling price.

house<- house.selling.price.2
head(house)
      P    S Be Ba New
1  48.5 1.10  3  1   0
2  55.0 1.01  3  2   0
3  68.0 1.45  3  2   0
4 137.0 2.40  3  3   0
5 309.4 3.30  4  3   1
6  17.5 0.40  1  1   0
summary(house)
       P                S              Be              Ba       
 Min.   : 17.50   Min.   :0.40   Min.   :1.000   Min.   :1.000  
 1st Qu.: 72.90   1st Qu.:1.33   1st Qu.:3.000   1st Qu.:2.000  
 Median : 96.00   Median :1.57   Median :3.000   Median :2.000  
 Mean   : 99.53   Mean   :1.65   Mean   :3.183   Mean   :1.957  
 3rd Qu.:115.00   3rd Qu.:1.98   3rd Qu.:4.000   3rd Qu.:2.000  
 Max.   :309.40   Max.   :3.85   Max.   :5.000   Max.   :3.000  
      New        
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.3011  
 3rd Qu.:1.0000  
 Max.   :1.0000  
lm(P~ S +Be + Ba + New, data = house)

Call:
lm(formula = P ~ S + Be + Ba + New, data = house)

Coefficients:
(Intercept)            S           Be           Ba          New  
    -41.795       64.761       -2.766       19.203       18.984  
cor<- cor(x=house)
cor
            P         S        Be        Ba       New
P   1.0000000 0.8988136 0.5902675 0.7136960 0.3565540
S   0.8988136 1.0000000 0.6691137 0.6624828 0.1762879
Be  0.5902675 0.6691137 1.0000000 0.3337966 0.2672091
Ba  0.7136960 0.6624828 0.3337966 1.0000000 0.1820651
New 0.3565540 0.1762879 0.2672091 0.1820651 1.0000000
#correlate(house)

With these four predictors,

A. For backward elimination, which variable would be deleted first? Why?
lm(P~., data = house)|>summary()

Call:
lm(formula = P ~ ., data = house)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

We would delete Bedrooms (Be) first because it has the largest p-value (0.487) of the four predictors, which makes it the least significant term in the full model.
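
One way to double-check this choice is drop1(), which reports an F test for removing each predictor in turn from the full model. The following is only a sketch and assumes the smss package is installed; the object names full, drop_table and p_values are illustrative and not part of the original analysis.

```r
library(smss)
data("house.selling.price.2")
house <- house.selling.price.2

# Full model with all four predictors
full <- lm(P ~ S + Be + Ba + New, data = house)

# F test for dropping each single term from the full model
drop_table <- drop1(full, test = "F")
print(drop_table)

# Candidate for removal: the term with the largest p-value
p_values <- drop_table[-1, "Pr(>F)"]        # first row is "<none>"
names(p_values) <- rownames(drop_table)[-1]
names(which.max(p_values))
```

Because each term has one degree of freedom, these F tests give the same p-values as the t tests in the summary table above.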

B. For forward selection, which variable would be added first? Why?

Size (S) would be added first because it has the strongest correlation with price (0.899) of the four predictors, and therefore the smallest p-value when each predictor is tested on its own.
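
The first forward-selection step can be sketched with add1(), which tests each candidate predictor against the intercept-only model. This assumes the smss package is installed; null_model, add_table and f_values are illustrative names only.

```r
library(smss)
data("house.selling.price.2")
house <- house.selling.price.2

# Start from the intercept-only model
null_model <- lm(P ~ 1, data = house)

# F test for adding each candidate predictor on its own
add_table <- add1(null_model, scope = ~ S + Be + Ba + New, test = "F")
print(add_table)

# Candidate to enter first: the term with the largest F statistic
f_values <- add_table[-1, "F value"]        # first row is "<none>"
names(f_values) <- rownames(add_table)[-1]
names(which.max(f_values))
```

The largest F statistic is used here rather than the smallest p-value because several of the p-values are extremely close to zero.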

C. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

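I think Be carries little information beyond what the other predictors already provide. Bedrooms is strongly correlated with size (0.669), so once S is in the model, the number of bedrooms explains almost no additional variation in price. Its correlation with price (0.590) mostly reflects the fact that bigger houses tend to have more bedrooms.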
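
To see how the Be estimate changes once size is accounted for, one can compare a simple regression on Be with a model that also includes S, and compute a variance inflation factor by hand. This is a rough sketch, assuming the smss package is installed; the object names are illustrative.

```r
library(smss)
data("house.selling.price.2")
house <- house.selling.price.2

# Bedrooms alone: simple regression of price on Be
coef_alone <- summary(lm(P ~ Be, data = house))$coefficients
coef_alone

# Bedrooms after adjusting for size: compare the Be row with the fit above
coef_adjusted <- summary(lm(P ~ S + Be, data = house))$coefficients
coef_adjusted

# Variance inflation factor for Be, computed by hand:
# regress Be on the other predictors and use 1 / (1 - R^2)
r2_be <- summary(lm(Be ~ S + Ba + New, data = house))$r.squared
vif_be <- 1 / (1 - r2_be)
vif_be
```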

D. Using software with these four predictors, find the model that would be selected using each criterion:

  1. R2

  2. Adjusted R2

lm(P~., data = house)|>summary()

(The output is identical to the summary shown in part A.)

The summary reports R2 = 0.8689 and adjusted R2 = 0.8629 for the full model. R2 can never decrease when a predictor is added, so the R2 criterion always selects the full four-predictor model. Adjusted R2 penalizes extra predictors, so it has to be compared across candidate models.

  3. PRESS
PRESS <- function(linear.model) {
    pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
    sum(pr^2)
}

fit <- lm( P ~ Be + Ba + S + New , data = house)
PRESS(fit)
[1] 28390.22

PRESS is computed with a small helper function (credit: Piazza) that takes a fitted lm object as its linear.model argument.
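
The helper relies on the fact that, for least squares, a leave-one-out prediction error equals the ordinary residual divided by one minus the leverage. The sketch below checks this against an explicit leave-one-out loop. It assumes the smss package is installed; press_shortcut, loo_errors and press_loo are illustrative names.

```r
library(smss)
data("house.selling.price.2")
house <- house.selling.price.2

fit <- lm(P ~ Be + Ba + S + New, data = house)

# Shortcut: leave-one-out residual = ordinary residual / (1 - leverage)
press_shortcut <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Explicit version: refit without row i, then predict row i
loo_errors <- sapply(seq_len(nrow(house)), function(i) {
  fit_i <- lm(P ~ Be + Ba + S + New, data = house[-i, ])
  house$P[i] - predict(fit_i, newdata = house[i, ])
})
press_loo <- sum(loo_errors^2)

# The two computations should agree up to floating-point error
all.equal(press_shortcut, press_loo)
```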

  4. AIC
AIC(lm(P~., data = house))
[1] 790.6225
  5. BIC
BIC(lm(P~., data = house))
[1] 805.8181
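
The values above describe the full model only. To choose a model with each criterion, the same quantities have to be computed for several candidate models and compared. The sketch below does this for a small, hypothetical candidate set (a complete search would cover all 15 non-empty subsets of the four predictors). It assumes the smss package is installed; the press helper repeats the PRESS calculation used above.

```r
library(smss)
data("house.selling.price.2")
house <- house.selling.price.2

# PRESS from leave-one-out residuals
press <- function(model) sum((residuals(model) / (1 - hatvalues(model)))^2)

# Hypothetical candidate set: the full model and two smaller ones
candidates <- list(
  "S + Be + Ba + New" = P ~ S + Be + Ba + New,
  "S + Ba + New"      = P ~ S + Ba + New,
  "S + New"           = P ~ S + New
)

# One row of criteria per candidate model
comparison <- do.call(rbind, lapply(names(candidates), function(label) {
  model <- lm(candidates[[label]], data = house)
  s <- summary(model)
  data.frame(
    model  = label,
    r2     = s$r.squared,
    adj_r2 = s$adj.r.squared,
    press  = press(model),
    aic    = AIC(model),
    bic    = BIC(model)
  )
}))
comparison
```

Larger values are better for r2 and adj_r2; smaller values are better for press, aic and bic.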

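E. Explain which model you prefer and why.

I prefer the full multiple regression model with all four predictors, because it uses the most information about each house. Its one weak term is Be (p = 0.487), so the model without Be is a simpler alternative worth comparing.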

Question 2

(Data file: trees from base R) From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

A. fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

data(trees)
summary(trees)
     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00  
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
fit<- lm(Volume~ Girth + Height, data = trees)
fit

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Coefficients:
(Intercept)        Girth       Height  
   -57.9877       4.7082       0.3393  
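
Since the goal is volume prediction, the fitted model can be used with predict(). The sketch below uses a hypothetical new tree; the values 13 and 76 are chosen only for illustration, and new_tree and pred are illustrative names.

```r
data(trees)
fit <- lm(Volume ~ Girth + Height, data = trees)

# Hypothetical new tree: 13 in diameter, 76 ft tall
new_tree <- data.frame(Girth = 13, Height = 76)

# Point prediction with a 95% prediction interval
pred <- predict(fit, newdata = new_tree, interval = "prediction", level = 0.95)
pred
```

The prediction interval is wider than a confidence interval for the mean because it also accounts for the variation of an individual tree around the regression surface.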

B. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

par(mfrow=c(2,2)) 
plot(fit)
par(mfrow=c(1,1))

Residuals vs Fitted: The residuals do not scatter evenly around a horizontal line, which suggests that the linearity assumption is violated.

Normal Q-Q: The residuals appear to be approximately normally distributed.

Scale-Location: The points are not equally spread around a horizontal line, which suggests possible heteroscedasticity.
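
One possible remedy, which is not part of the original analysis, is a log-log model. The volume of a cylinder or cone is proportional to diameter squared times height, so the relationship becomes linear after taking logs. A sketch for comparison, with fit_log as an illustrative name:

```r
data(trees)

# Original additive model from part A
fit <- lm(Volume ~ Girth + Height, data = trees)

# Alternative: log-log model, motivated by volume ~ diameter^2 * height
fit_log <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
coef(fit_log)

# Diagnostic plots for the alternative model, for comparison with part B
par(mfrow = c(2, 2))
plot(fit_log)
par(mfrow = c(1, 1))
```

If the geometric argument holds, the coefficient on log(Girth) should be near 2 and the coefficient on log(Height) near 1.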

Question 3

(inspired by ALR 9.16)

(Data file: florida in the alr4 R package)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

library(alr4)

data(florida)

florida
               Gore   Bush Buchanan
ALACHUA       47300  34062      262
BAKER          2392   5610       73
BAY           18850  38637      248
BRADFORD       3072   5413       65
BREVARD       97318 115185      570
BROWARD      386518 177279      789
CALHOUN        2155   2873       90
CHARLOTTE     29641  35419      182
CITRUS        25501  29744      270
CLAY          14630  41745      186
COLLIER       29905  60426      122
COLUMBIA       7047  10964       89
DADE         328702 289456      561
DE SOTO        3322   4256       36
DIXIE          1825   2698       29
DUVAL        107680 152082      650
ESCAMBIA      40958  73029      504
FLAGLER       13891  12608       83
FRANKLIN       2042   2448       33
GADSDEN        9565   4750       39
GILCHRIST      1910   3300       29
GLADES         1420   1840        9
GULF           2389   3546       71
HAMILTON       1718   2153       24
HARDEE         2341   3764       30
HENDRY         3239   4743       22
HERNANDO      32644  30646      242
HIGHLANDS     14152  20196       99
HILLSBOROUGH 166581 176967      836
HOLMES         2154   4985       76
INDIAN RIVER  19769  28627      105
JACKSON        6868   9138      102
JEFFERSON      3038   2481       29
LAFAYETTE       788   1669       10
LAKE          36555  49963      289
LEE           73560 106141      305
LEON          61425  39053      282
LEVY           5403   6860       67
LIBERTY        1011   1316       39
MADISON        3011   3038       29
MANATEE       49169  57948      272
MARION        44648  55135      563
MARTIN        26619  33864      108
MONROE        16483  16059       47
NASSAU         6952  16404       90
OKALOOSA      16924  52043      267
OKEECHOBEE     4588   5058       43
ORANGE       140115 134476      446
OSCEOLA       28177  26216      145
PALM BEACH   268945 152846     3407
PASCO         69550  68581      570
PINELLAS     199660 184312     1010
POLK          74977  90101      538
PUTNAM        12091  13439      147
ST. JOHNS     19482  39497      229
ST. LUCIE     41559  34705      124
SANTA ROSA    12795  36248      311
SARASOTA      72854  83100      305
SEMINOLE      58888  75293      194
SUMTER         9634  12126      114
SUWANNEE       4084   8014      108
TAYLOR         2647   4051       27
UNION          1399   2326       26
VOLUSIA       97063  82214      396
WAKULLA        3835   4511       46
WALTON         5637  12176      120
WASHINGTON     2796   4983       88
head(florida)
           Gore   Bush Buchanan
ALACHUA   47300  34062      262
BAKER      2392   5610       73
BAY       18850  38637      248
BRADFORD   3072   5413       65
BREVARD   97318 115185      570
BROWARD  386518 177279      789
summary(florida)
      Gore             Bush           Buchanan     
 Min.   :   788   Min.   :  1316   Min.   :   9.0  
 1st Qu.:  3055   1st Qu.:  4746   1st Qu.:  46.5  
 Median : 14152   Median : 20196   Median : 114.0  
 Mean   : 43341   Mean   : 43356   Mean   : 258.5  
 3rd Qu.: 45974   3rd Qu.: 56542   3rd Qu.: 285.5  
 Max.   :386518   Max.   :289456   Max.   :3407.0  
?florida

lm(Buchanan~., data= florida)|>summary()

Call:
lm(formula = Buchanan ~ ., data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-913.70  -65.68  -36.96   21.41 2204.20 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 82.3532929 52.1819372   1.578  0.11945   
Gore         0.0043013  0.0013309   3.232  0.00194 **
Bush        -0.0002379  0.0017476  -0.136  0.89216   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 330.7 on 64 degrees of freedom
Multiple R-squared:  0.4747,    Adjusted R-squared:  0.4582 
F-statistic: 28.91 on 2 and 64 DF,  p-value: 1.132e-09
# install.packages("ggplot2")  # run once if ggplot2 is not already installed
library(ggplot2)

ggplot(data = florida) +
  geom_point(mapping = aes(x = Bush, y = Buchanan))

Yes. The scatterplot shows one point far from the rest (more than 3,000 votes for Buchanan), and the data listing confirms that this point is Palm Beach, with 3,407 Buchanan votes.
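
The question asks for a simple linear regression with Bush as the only explanatory variable, together with the regression diagnostic plots, whereas the model fitted above also includes Gore. A sketch of that version follows, assuming the alr4 package is installed; fit_bush, std_res and cooks are illustrative names, and the 4/n cutoff for Cook's distance is only a common rule of thumb.

```r
library(alr4)
data(florida)

# Simple linear regression: Buchanan votes on Bush votes only
fit_bush <- lm(Buchanan ~ Bush, data = florida)
summary(fit_bush)

# Standard four diagnostic plots
par(mfrow = c(2, 2))
plot(fit_bush)
par(mfrow = c(1, 1))

# County with the largest absolute standardized residual
std_res <- rstandard(fit_bush)
names(which.max(abs(std_res)))

# Counties whose Cook's distance exceeds the rough 4/n rule of thumb
cooks <- cooks.distance(fit_bush)
names(cooks)[cooks > 4 / nrow(florida)]
```

A county that stands apart in the Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage panels, and that also tops the standardized residuals, is a strong outlier candidate.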

See separate File for PART II