These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.

Unit 3: Penguins Revisited

pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
[1] "Observation"       "body_mass_g"       "species"          
[4] "island"            "bill_length_mm"    "bill_depth_mm"    
[7] "flipper_length_mm" "sex"               "year"             

Hypothesized Linear Model:

\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen\)

Check the model assumptions.

penmodfinal<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)

#There are a few options to view plots for assumptions

#Residuals Plots of explanatory variables vs residuals
residualPlots(penmodfinal,tests=F)

#Residual vs Fitted and QQ plot
plot(penmodfinal, which=c(1,2))

#histogram of residuals
hist(residuals(penmodfinal))

  • Lack of Fit Plot: Residual Plots
  • Constant Variance: Residual Plots and Residual vs Fitted
  • Normality: QQ Plot and Histogram of Residuals
  • Independence: We do not have time series data
  1. Is the mean zero assumption violated? How do you know?

There were no clear trends that could be modeled in the residual plots. The assumption was not violated.

  1. Is the constant variance assumption violated? How do you know?

There were no clear fanning patterns in the residual plots nor the residual vs predicted plots. The assumption was not violated.

  1. Is the normality assumptions violated? How do you know?

The observations fall linearly on the QQ plot. (The residuals are unimodal and symmetric in the histogram). The assumption is not violated.

  1. Is the independence assumption violated? How do you know?

We do not believe the information of one penguin depends on another penguin. The assumption is not violated.

  1. If you noted any violations, attempt to correct them.

There were no severe violations.

Although no transformations were needed, here is an example of how we would interpret and code for a few different variable transformations.

#attempt transformations within plot to create a more linear relationship

# ORIGINAL
plot((body_mass_g)~flipper_length_mm,data=pendata)

#You can transform both x and y in any function
# ATTEMPT combinations of log and square root of x and y
plot(sqrt(body_mass_g)~flipper_length_mm,data=pendata)

plot(log(body_mass_g)~flipper_length_mm,data=pendata)

plot(log(body_mass_g)~log(flipper_length_mm),data=pendata)

# compute correlations
cor(pendata$body_mass_g,pendata$flipper_length_mm)
[1] NA
# note the NA because of missing values
# add the use= argument
cor(pendata$body_mass_g,pendata$flipper_length_mm, use="complete.obs")
[1] 0.8712018
cor(log(pendata$body_mass_g),log(pendata$flipper_length_mm), use="complete.obs")
[1] 0.8588604
#Within lm() we can add both transformations
# try several
penmodT<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(penmodT)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16

NOTE:

  1. When we transform the explanatory variables, the interpretations do not change drastically
  • EX: for a one unit increase in the transformed value of X (square root of flipper length), the predicted value of y changes by beta hat_i
  1. Similarly, the transformation for y follow:
  • EX: for a one unit increase in the value of x, the predicted value of the transformed value of y (square root of y) changes by beta hat_i
  • Prediction Interval: When we have this set of explanatory variables we expect the square root of y to be between the interval computed.
  • Prediction Interval Transformed: When we have this set of explanatory variables we expect the value y to be between the interval computed if we inverse the transformation of the end points
  • Confidence Interval: When we have this set of explanatory variables we expect the average of the transformed value of y to be between the interval computed.
  • Confidence Interval: we cannot perform the inverse of the transformation to get the original units.

Identify outliers and influential observations.

How to find n when there are missing values?

#Determine n
summary(penmodfinal)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16
#numDF + (k+1) = n-k-1+(k+1) = n
n<-df.residual(penmodfinal)+length(coef(penmodfinal))
n
[1] 333
#Leverage Cut off
levcut<-(2*(length(coef(penmodfinal))))/n
levcut
[1] 0.04804805
#Cook's Distance Cutoff
cdcut<-4/n
cdcut
[1] 0.01201201

n will be how many observations were used to fit the model, not necessarily how many observations were in the original data table. To determine this number we can go to the model F degrees of freedom k, n-k-1. In the pen model we have 7, 325. So k=7 and n=333.

#Cooks Distance Thresholds
plot(penmodfinal,which=4)

#Leverage vs Studentized Residuals
influencePlot(penmodfinal,fill=F)
#Deleted Studentized Residals
ols_plot_resid_stud_fit(penmodfinal)

Based on the labeled observations from the plots:

  • With respect to Cook’s Distance, observations 293, 314, 315 are considered influential because their values exceed the threshold of 4/n=4/333=0.012.

  • With respect to Leverage, observations 314, 15, 295 (and others) are considered “leverage”=outliers in x because their values exceed the threshold of 2(7+1)/333=0.048.

  • With respect to Studentizied Residuals, observations 170, 40, 315, 314 (and others) are considered outliers in y because their values exceed the threshold of +/-2 .

  • With respect to Deleted Studentizied Residuals, observations that are labelled red (164, 35, 304, etc) are considered influential because their values exceed the threshold of +/-2.

However, how can we determine all of the observations that exceed our thresholds? We can use the which() function to print values that meet certain criteria from a logical expression.

#General expressions about the variables in the data table

#Quantitative Variables
which(pendata$bill_depth_mm<17.3)
  [1]  11  25  31  35  45  59  61  63  65  67  69  73  77  79  81  93  99 103
 [19] 107 109 111 117 119 121 123 125 129 139 141 142 143 144 145 151 153 154
 [37] 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
 [55] 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 191
 [73] 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
 [91] 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227
[109] 228 229 230 231 232 233 234 235 236 237 239 240 241 242 243 244 245 246
[127] 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264
[145] 265 266 267 268 269 270 271 273 274 275 276 291 299 307 309 312 315 320
[163] 327 333 338 339
# Qualitative Variables
which(pendata$species=="Gentoo")
  [1] 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
 [19] 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
 [37] 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
 [55] 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
 [73] 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242
 [91] 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
[109] 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276
#Multiple arguments
which(pendata$species=="Gentoo"&pendata$bill_depth_mm<17.3)
  [1] 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
 [19] 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
 [37] 189 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
 [55] 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225
 [73] 226 227 228 229 230 231 232 233 234 235 236 237 239 240 241 242 243 244
 [91] 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262
[109] 263 264 265 266 267 268 269 270 271 273 274 275 276
which(pendata$species=="Gentoo"|pendata$bill_depth_mm<17.3)
  [1]  11  25  31  35  45  59  61  63  65  67  69  73  77  79  81  93  99 103
 [19] 107 109 111 117 119 121 123 125 129 139 141 142 143 144 145 151 153 154
 [37] 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
 [55] 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
 [73] 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
 [91] 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
[109] 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244
[127] 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262
[145] 263 264 265 266 267 268 269 270 271 272 273 274 275 276 291 299 307 309
[163] 312 315 320 327 333 338 339
# Using our influential diagnostics
which(hatvalues(penmodfinal)>16/333)
 15  20 115 130 143 186 190 240 288 294 308 314 340 
 10  15 109 124 137 179 183 232 277 283 297 303 329 
which(cooks.distance(penmodfinal)>4/333)
 40  82 112 118 122 166 170 193 232 285 293 294 297 298 314 315 325 
 35  76 106 112 116 160 164 186 224 274 282 283 286 287 303 304 314 
which(abs(rstudent(penmodfinal))>2)
  8  40  46  82  86  94 105 110 120 122 132 166 170 196 232 285 293 314 315 325 
  7  35  41  76  80  88  99 104 114 116 126 160 164 189 224 274 282 303 304 314 
which(abs(stdres(penmodfinal))>2)
  8  40  46  82  86  94 105 110 120 122 132 166 170 196 232 285 293 314 315 325 
  7  35  41  76  80  88  99 104 114 116 126 160 164 189 224 274 282 303 304 314 
# The top row is the original observation number, the second number is the index (position)

#Multiple arguments
which(hatvalues(penmodfinal)>16/333&cooks.distance(penmodfinal)>4/333&abs(rstudent(penmodfinal))>2&abs(stdres(penmodfinal))>2)
314 
303 
  • Observations 15, 20, 115, 130, 143, 186, 190, 240, 288, 294, 308, 314, 340 exceed the threshold for leverage.

  • Observations 40, 82, 112, 118, 122, 166, 170, 193, 232, 285, 293, 294, 297, 298, 314, 315, 325 exceed the threshold for Cook’s Distance.

  • Observations 8, 40, 46, 82, 86, 94, 105, 110, 120, 122, 132, 166, 170, 196, 232, 285, 293, 314, 315, 325 exceed the threshold for Deleted Studentized Residuals.

  • Observations 8, 40, 46, 82, 86, 94, 105, 110, 120, 122, 132, 166, 170, 196, 232, 285, 293, 314, 315, 325 exceed the threshold for Studentized Residuals.

  • Observation 314 exceeds in all 4 categories.

Next steps would be to evaluate closer with dffits or dfbetas and see how those observations compare to the other observations in the data. Then look at the EDA. Determine if the observations are very influential and/or truly representative of the data. We should NOT just remove without further exploration.

#Explore DFFits and DFBetas for the observations identified

dffits(penmodfinal)[c(15,40,170,294,314,315)]
          20           45          176          305          325          326 
-0.006256213 -0.111866703 -0.100099623  0.147376774 -0.400590799  0.061502594 
#Note those observation numbers do not match because there are missing values
# in the penguins data frame
dim(pendata)
[1] 344   9
length(dffits(penmodfinal))
[1] 333

Notice when we attempt to report the DFFits for observations 15,40,170,294,314,315 we see labels for 20, 45, 176, 305, 325, 326. This is because this square bracket notation for subsetting will call the position number (index) in the list of dffits, not the observation number from the data table.

Why are those things different? There are missing values in the penguin data table. So observation 20 ends up being position number 15 in the list of DFFits. There were 344 values in the data table, but only 333 to fit the model.

How can we reconcile?

  • call for the observation numbers explicitly
  • remove NAs and relabel the rows
#First we will store the DFFits to make them easier to work with
pendffits<-dffits(penmodfinal)

# Calling the observation numbers (row names)
# don't get too worried on the syntax
pendffits[c("15","40","170","294","314","315")]
        15         40        170        294        314        315 
 0.1286874  0.3818284  0.3833745  0.3424947  0.5436063 -0.4620501 
# Now we can compare these values to the rest
summary(dffits(penmodfinal))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.462050 -0.093274 -0.003084  0.001633  0.101550  0.543606 
# Slightly different syntax for DFBetas
# Because DFBetas is a matrix. not just a vector
pendfbeta<-dfbetas(penmodfinal)
pendfbeta[c("15","40","170","294","314","315"),]
    (Intercept) bill_depth_mm bill_length_mm flipper_length_mm speciesChinstrap
15  -0.08557838    0.09310769     0.06910227        0.02660027     0.0655398436
40   0.02653293    0.05209415     0.03999337       -0.25112153    -0.0043988499
170 -0.12168270    0.11147375     0.13543939        0.04868884     0.0246383890
294 -0.01852371    0.02920232     0.08105004       -0.17108150    -0.1584788882
314  0.26622725   -0.32579551    -0.36372431        0.14860592    -0.0004064124
315  0.21521173   -0.21042438    -0.25248533       -0.04511119    -0.1638646983
    speciesGentoo      sexmale bill_depth_mm:bill_length_mm
15     0.02492217  0.009146321                  -0.08271890
40     0.12938768  0.170995071                  -0.04103063
170   -0.05802485  0.169876311                  -0.13743246
294   -0.04443237 -0.119146477                  -0.03129413
314    0.07753382 -0.100034429                   0.36711794
315    0.18047644 -0.001915482                   0.25222563
summary(dfbetas(penmodfinal))
  (Intercept)         bill_depth_mm        bill_length_mm      
 Min.   :-2.241e-01   Min.   :-0.3257955   Min.   :-3.637e-01  
 1st Qu.:-1.834e-02   1st Qu.:-0.0121734   1st Qu.:-9.957e-03  
 Median :-3.307e-04   Median : 0.0009279   Median : 2.128e-04  
 Mean   : 7.712e-05   Mean   :-0.0000870   Mean   :-7.541e-05  
 3rd Qu.: 1.552e-02   3rd Qu.: 0.0194839   3rd Qu.: 1.921e-02  
 Max.   : 2.662e-01   Max.   : 0.2017456   Max.   : 2.407e-01  
 flipper_length_mm    speciesChinstrap     speciesGentoo       
 Min.   :-2.511e-01   Min.   :-2.327e-01   Min.   :-3.006e-01  
 1st Qu.:-1.948e-02   1st Qu.:-1.623e-02   1st Qu.:-1.962e-02  
 Median : 1.204e-04   Median :-7.051e-05   Median : 6.127e-04  
 Mean   :-3.359e-05   Mean   :-9.018e-05   Mean   : 8.558e-06  
 3rd Qu.: 2.059e-02   3rd Qu.: 1.830e-02   3rd Qu.: 1.995e-02  
 Max.   : 2.652e-01   Max.   : 2.515e-01   Max.   : 2.159e-01  
    sexmale           bill_depth_mm:bill_length_mm
 Min.   :-2.001e-01   Min.   :-2.387e-01          
 1st Qu.:-2.291e-02   1st Qu.:-1.882e-02          
 Median :-1.235e-03   Median :-4.818e-04          
 Mean   :-9.972e-05   Mean   : 9.593e-05          
 3rd Qu.: 1.990e-02   3rd Qu.: 1.117e-02          
 Max.   : 2.072e-01   Max.   : 3.671e-01          
pendata[c("15","40","170","294","314","315"),]
summary(pendata)
  Observation      body_mass_g     species             island         
 Min.   :  1.00   Min.   :2700   Length:344         Length:344        
 1st Qu.: 86.75   1st Qu.:3550   Class :character   Class :character  
 Median :172.50   Median :4050   Mode  :character   Mode  :character  
 Mean   :172.50   Mean   :4202                                        
 3rd Qu.:258.25   3rd Qu.:4750                                        
 Max.   :344.00   Max.   :6300                                        
                  NA's   :2                                           
 bill_length_mm  bill_depth_mm   flipper_length_mm     sex           
 Min.   :32.10   Min.   :13.10   Min.   :172.0     Length:344        
 1st Qu.:39.23   1st Qu.:15.60   1st Qu.:190.0     Class :character  
 Median :44.45   Median :17.30   Median :197.0     Mode  :character  
 Mean   :43.92   Mean   :17.15   Mean   :200.9                       
 3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0                       
 Max.   :59.60   Max.   :21.50   Max.   :231.0                       
 NA's   :2       NA's   :2       NA's   :2                           
      year     
 Min.   :2007  
 1st Qu.:2007  
 Median :2008  
 Mean   :2008  
 3rd Qu.:2009  
 Max.   :2009  
               

Now we can compare the DFFits and DFBetas to the rest. Additionally, we can compare the x values to the rest of the data.

Upon investigating, observation 314 appears to be very influential and has high values of DFFits. If you want to remove just one or more observations, or a set of observations, we can use a - or the which function within a subset.

# You can do this directly into the data argument in lm
penmodrem.1<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[-314,])
summary(penmodrem.1)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata[-314, 
    ])

Residuals:
   Min     1Q Median     3Q    Max 
-704.2 -170.2    0.9  175.3  875.8 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6775.864   1838.196  -3.686 0.000267 ***
bill_depth_mm                  365.764     98.923   3.697 0.000256 ***
bill_length_mm                 145.132     41.470   3.500 0.000531 ***
flipper_length_mm               16.108      2.881   5.592 4.78e-08 ***
speciesChinstrap              -200.730     81.904  -2.451 0.014782 *  
speciesGentoo                  913.577    131.798   6.932 2.26e-11 ***
sexmale                        396.318     47.184   8.399 1.42e-15 ***
bill_depth_mm:bill_length_mm    -7.180      2.313  -3.104 0.002079 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 283 on 324 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8793,    Adjusted R-squared:  0.8767 
F-statistic: 337.1 on 7 and 324 DF,  p-value: < 2.2e-16
# note DF is different

#remove more than one
penmodrem.2<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[-c(15,40,170,294,314,315),])
summary(penmodrem.2)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata[-c(15, 
    40, 170, 294, 314, 315), ])

Residuals:
    Min      1Q  Median      3Q     Max 
-681.17 -171.80    1.54  171.99  688.00 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6820.422   1833.965  -3.719 0.000236 ***
bill_depth_mm                  359.304     99.122   3.625 0.000336 ***
bill_length_mm                 142.566     41.217   3.459 0.000616 ***
flipper_length_mm               17.212      2.859   6.020 4.79e-09 ***
speciesChinstrap              -181.755     81.946  -2.218 0.027260 *  
speciesGentoo                  883.023    128.661   6.863 3.52e-11 ***
sexmale                        385.230     46.275   8.325 2.51e-15 ***
bill_depth_mm:bill_length_mm    -7.107      2.309  -3.079 0.002260 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 274.2 on 319 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8845,    Adjusted R-squared:  0.882 
F-statistic:   349 on 7 and 319 DF,  p-value: < 2.2e-16
#remove based on logical expression
penmodrem.3<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[which(pendata$bill_depth_mm<17.3),])
summary(penmodrem.3)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata[which(pendata$bill_depth_mm < 
    17.3), ])

Residuals:
    Min      1Q  Median      3Q     Max 
-683.38 -162.91  -14.99  155.11  860.97 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -4913.958   4618.410  -1.064  0.28901    
bill_depth_mm                  254.936    281.470   0.906  0.36650    
bill_length_mm                  82.344     94.554   0.871  0.38519    
flipper_length_mm               17.357      4.703   3.691  0.00031 ***
speciesChinstrap              -117.223    135.779  -0.863  0.38930    
speciesGentoo                 1036.777    213.654   4.853 2.98e-06 ***
sexmale                        447.063     71.778   6.228 4.34e-09 ***
bill_depth_mm:bill_length_mm    -3.869      6.021  -0.642  0.52152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 269 on 153 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.917, Adjusted R-squared:  0.9132 
F-statistic: 241.4 on 7 and 153 DF,  p-value: < 2.2e-16
# You can store the subset data in a new object or you can include the subset
# directly into whatever code you perform, plot, lm, cor, etc

## Quantitative Variables

quantsubsetpen<-pendata[which(pendata$bill_depth_mm<17.3),]

#Example
plot((body_mass_g)~flipper_length_mm,data=quantsubsetpen)

plot((body_mass_g)~flipper_length_mm,data=pendata[which(pendata$bill_depth_mm<17.3),])

# Qualitative Variables

qualsubsetpen<-pendata[which(pendata$species=="Gentoo"),]

# Multiple Arguments

multaubsetpen1<-pendata[which(pendata$species=="Gentoo"&pendata$bill_depth_mm<17.3),]

multaubsetpen2<-pendata[which(pendata$species=="Gentoo"|pendata$bill_depth_mm<17.3),]

Cautions on removing observations

  • We want to retain as much data as possible
  • If we remove some extreme observations then other observations may be identified as extreme

Removing NAs from the original data

#This is not necessarily best practice but an option

#remove na values from data

pennasubset<-na.omit(pendata)
dim(pennasubset)
[1] 333   9
#Then we will want to renumber the observations
#row.names is specifically for data frame
row.names(pennasubset) <- 1:nrow(pennasubset)

#Note this is the SAME model because the missing values were not used to fit 

penmodna<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pennasubset)
summary(penmodna)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pennasubset)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16
#revisit outliers/influential observations
#Cooks Distance Thresholds
plot(penmodna,which=4)

#Leverage vs Studentized Residuals
influencePlot(penmodna,fill=F)
#Now subset the dffits and or dfbetas

dffits(penmodna)[c( 10,35,164,283,303,304)]
        10         35        164        283        303        304 
 0.1286874  0.3818284  0.3833745  0.3424947  0.5436063 -0.4620501 
summary(dffits(penmodna))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.462050 -0.093274 -0.003084  0.001633  0.101550  0.543606 

On Your Own

Consider again the Gasturbine example.

gasturbine<-read.delim("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/GASTURBINE.txt")
gasturbine$log.CPRATIO<-log(gasturbine$CPRATIO)
gasturbine$log.AIRFLOW<-log(gasturbine$AIRFLOW)
gasturbine$log.POWER<-log(gasturbine$POWER)

#updated model
gasmod5<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP,data=gasturbine)
summary(gasmod5)

Call:
lm(formula = HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP, data = gasturbine)

Residuals:
    Min      1Q  Median      3Q     Max 
-1025.8  -297.9  -115.3   225.8  1425.1 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.436e+04  7.333e+02  19.582  < 2e-16 ***
RPM          1.051e-01  1.071e-02   9.818 2.55e-14 ***
INLET.TEMP  -9.223e+00  7.869e-01 -11.721  < 2e-16 ***
EXH.TEMP     1.243e+01  2.071e+00   6.000 1.06e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 465 on 63 degrees of freedom
Multiple R-squared:  0.9189,    Adjusted R-squared:  0.915 
F-statistic: 237.9 on 3 and 63 DF,  p-value: < 2.2e-16
  1. First, write the hypothesized model that is fit.

  2. Identify ALL of the observations that exceed each of the thresholds by using the which() function.

  3. Investigate the DFFits and DFBetas for any relevant observations.

  4. Considering the potential constant variance violation and any influential observations you’ve made. Attempt to correct the violation and/or remove up to 5 observations. Your goal is to fit a good model and retain as much of the data as possible AND meet the assumptions with he updated model. Attempt three different corrections (combinations of transformations and removing observations)

# Use this space to attempt transformations and removing observations
# and recheck the assumptions for each step
  1. Summarize your findings from above. Which combination of transformation and removing observations did you determine was appropriate? Are the assumptions better met?

  2. Compute a confidence interval for E(y) for observation 1 in the data table. Interpret this interval practically.

  3. In your opinion, is it worth it to use the model you summarized in question 5 instead of “gasmod5”. Why or why not? What metrics of assessment are you using?