These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.
pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
[1] "Observation" "body_mass_g" "species"
[4] "island" "bill_length_mm" "bill_depth_mm"
[7] "flipper_length_mm" "sex" "year"
Hypothesized Linear Model:
\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen\)
penmodfinal<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
#There are a few options to view plots for assumptions
#Residuals Plots of explanatory variables vs residuals
residualPlots(penmodfinal,tests=F)
#Residual vs Fitted and QQ plot
plot(penmodfinal, which=c(1,2))
#histogram of residuals
hist(residuals(penmodfinal))
There were no clear trends that could be modeled in the residual plots. The assumption was not violated.
There were no clear fanning patterns in the residual plots nor the residual vs predicted plots. The assumption was not violated.
The observations fall linearly on the QQ plot. (The residuals are unimodal and symmetric in the histogram). The assumption is not violated.
We do not believe the information of one penguin depends on another penguin. The assumption is not violated.
There were no severe violations.
Although no transformations were needed, here is an example of how we would interpret and code for a few different variable transformations.
#attempt transformations within plot to create a more linear relationship
# ORIGINAL
plot((body_mass_g)~flipper_length_mm,data=pendata)
#You can transform both x and y in any function
# ATTEMPT combinations of log and square root of x and y
plot(sqrt(body_mass_g)~flipper_length_mm,data=pendata)
plot(log(body_mass_g)~flipper_length_mm,data=pendata)
plot(log(body_mass_g)~log(flipper_length_mm),data=pendata)
# compute correlations
cor(pendata$body_mass_g,pendata$flipper_length_mm)
[1] NA
# note the NA because of missing values
# add the use= argument
cor(pendata$body_mass_g,pendata$flipper_length_mm, use="complete.obs")
[1] 0.8712018
cor(log(pendata$body_mass_g),log(pendata$flipper_length_mm), use="complete.obs")
[1] 0.8588604
#Within lm() we can add both transformations
# try several
penmodT<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(penmodT)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
NOTE:
How to find n when there are missing values?
#Determine n
summary(penmodfinal)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
#numDF + (k+1) = n-k-1+(k+1) = n
n<-df.residual(penmodfinal)+length(coef(penmodfinal))
n
[1] 333
#Leverage Cut off
levcut<-(2*(length(coef(penmodfinal))))/n
levcut
[1] 0.04804805
#Cook's Distance Cutoff
cdcut<-4/n
cdcut
[1] 0.01201201
n will be how many observations were used to fit the model, not necessarily how many observations were in the original data table. To determine this number we can go to the model F degrees of freedom k, n-k-1. In the pen model we have 7, 325. So k=7 and n=333.
#Cooks Distance Thresholds
plot(penmodfinal,which=4)
#Leverage vs Studentized Residuals
influencePlot(penmodfinal,fill=F)
#Deleted Studentized Residals
ols_plot_resid_stud_fit(penmodfinal)
Based on the labeled observations from the plots:
With respect to Cook’s Distance, observations 293, 314, 315 are considered influential because their values exceed the threshold of 4/n=4/333=0.012.
With respect to Leverage, observations 314, 15, 295 (and others) are considered “leverage”=outliers in x because their values exceed the threshold of 2(7+1)/333=0.048.
With respect to Studentizied Residuals, observations 170, 40, 315, 314 (and others) are considered outliers in y because their values exceed the threshold of +/-2 .
With respect to Deleted Studentizied Residuals, observations that are labelled red (164, 35, 304, etc) are considered influential because their values exceed the threshold of +/-2.
However, how can we determine all of the
observations that exceed our thresholds? We can use the
which()
function to print values that meet certain criteria
from a logical expression.
#General expressions about the variables in the data table
#Quantitative Variables
which(pendata$bill_depth_mm<17.3)
[1] 11 25 31 35 45 59 61 63 65 67 69 73 77 79 81 93 99 103
[19] 107 109 111 117 119 121 123 125 129 139 141 142 143 144 145 151 153 154
[37] 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
[55] 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 191
[73] 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
[91] 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227
[109] 228 229 230 231 232 233 234 235 236 237 239 240 241 242 243 244 245 246
[127] 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264
[145] 265 266 267 268 269 270 271 273 274 275 276 291 299 307 309 312 315 320
[163] 327 333 338 339
# Qualitative Variables
which(pendata$species=="Gentoo")
[1] 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
[19] 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
[37] 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206
[55] 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
[73] 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242
[91] 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
[109] 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276
#Multiple arguments
which(pendata$species=="Gentoo"&pendata$bill_depth_mm<17.3)
[1] 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170
[19] 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
[37] 189 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
[55] 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225
[73] 226 227 228 229 230 231 232 233 234 235 236 237 239 240 241 242 243 244
[91] 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262
[109] 263 264 265 266 267 268 269 270 271 273 274 275 276
which(pendata$species=="Gentoo"|pendata$bill_depth_mm<17.3)
[1] 11 25 31 35 45 59 61 63 65 67 69 73 77 79 81 93 99 103
[19] 107 109 111 117 119 121 123 125 129 139 141 142 143 144 145 151 153 154
[37] 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
[55] 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
[73] 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
[91] 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
[109] 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244
[127] 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262
[145] 263 264 265 266 267 268 269 270 271 272 273 274 275 276 291 299 307 309
[163] 312 315 320 327 333 338 339
# Using our influential diagnostics
which(hatvalues(penmodfinal)>16/333)
15 20 115 130 143 186 190 240 288 294 308 314 340
10 15 109 124 137 179 183 232 277 283 297 303 329
which(cooks.distance(penmodfinal)>4/333)
40 82 112 118 122 166 170 193 232 285 293 294 297 298 314 315 325
35 76 106 112 116 160 164 186 224 274 282 283 286 287 303 304 314
which(abs(rstudent(penmodfinal))>2)
8 40 46 82 86 94 105 110 120 122 132 166 170 196 232 285 293 314 315 325
7 35 41 76 80 88 99 104 114 116 126 160 164 189 224 274 282 303 304 314
which(abs(stdres(penmodfinal))>2)
8 40 46 82 86 94 105 110 120 122 132 166 170 196 232 285 293 314 315 325
7 35 41 76 80 88 99 104 114 116 126 160 164 189 224 274 282 303 304 314
# The top row is the original observation number, the second number is the index (position)
#Multiple arguments
which(hatvalues(penmodfinal)>16/333&cooks.distance(penmodfinal)>4/333&abs(rstudent(penmodfinal))>2&abs(stdres(penmodfinal))>2)
314
303
Observations 15, 20, 115, 130, 143, 186, 190, 240, 288, 294, 308, 314, 340 exceed the threshold for leverage.
Observations 40, 82, 112, 118, 122, 166, 170, 193, 232, 285, 293, 294, 297, 298, 314, 315, 325 exceed the threshold for Cook’s Distance.
Observations 8, 40, 46, 82, 86, 94, 105, 110, 120, 122, 132, 166, 170, 196, 232, 285, 293, 314, 315, 325 exceed the threshold for Deleted Studentized Residuals.
Observations 8, 40, 46, 82, 86, 94, 105, 110, 120, 122, 132, 166, 170, 196, 232, 285, 293, 314, 315, 325 exceed the threshold for Studentized Residuals.
Observation 314 exceeds in all 4 categories.
Next steps would be to evaluate closer with dffits or dfbetas and see how those observations compare to the other observations in the data. Then look at the EDA. Determine if the observations are very influential and/or truly representative of the data. We should NOT just remove without further exploration.
#Explore DFFits and DFBetas for the observations identified
dffits(penmodfinal)[c(15,40,170,294,314,315)]
20 45 176 305 325 326
-0.006256213 -0.111866703 -0.100099623 0.147376774 -0.400590799 0.061502594
#Note those observation numbers do not match because there are missing values
# in the penguins data frame
dim(pendata)
[1] 344 9
length(dffits(penmodfinal))
[1] 333
Notice when we attempt to report the DFFits for observations 15,40,170,294,314,315 we see labels for 20, 45, 176, 305, 325, 326. This is because this square bracket notation for subsetting will call the position number (index) in the list of dffits, not the observation number from the data table.
Why are those things different? There are missing values in the penguin data table. So observation 20 ends up being position number 15 in the list of DFFits. There were 344 values in the data table, but only 333 to fit the model.
How can we reconcile?
#First we will store the DFFits to make them easier to work with
pendffits<-dffits(penmodfinal)
# Calling the observation numbers (row names)
# don't get too worried on the syntax
pendffits[c("15","40","170","294","314","315")]
15 40 170 294 314 315
0.1286874 0.3818284 0.3833745 0.3424947 0.5436063 -0.4620501
# Now we can compare these values to the rest
summary(dffits(penmodfinal))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.462050 -0.093274 -0.003084 0.001633 0.101550 0.543606
# Slightly different syntax for DFBetas
# Because DFBetas is a matrix. not just a vector
pendfbeta<-dfbetas(penmodfinal)
pendfbeta[c("15","40","170","294","314","315"),]
(Intercept) bill_depth_mm bill_length_mm flipper_length_mm speciesChinstrap
15 -0.08557838 0.09310769 0.06910227 0.02660027 0.0655398436
40 0.02653293 0.05209415 0.03999337 -0.25112153 -0.0043988499
170 -0.12168270 0.11147375 0.13543939 0.04868884 0.0246383890
294 -0.01852371 0.02920232 0.08105004 -0.17108150 -0.1584788882
314 0.26622725 -0.32579551 -0.36372431 0.14860592 -0.0004064124
315 0.21521173 -0.21042438 -0.25248533 -0.04511119 -0.1638646983
speciesGentoo sexmale bill_depth_mm:bill_length_mm
15 0.02492217 0.009146321 -0.08271890
40 0.12938768 0.170995071 -0.04103063
170 -0.05802485 0.169876311 -0.13743246
294 -0.04443237 -0.119146477 -0.03129413
314 0.07753382 -0.100034429 0.36711794
315 0.18047644 -0.001915482 0.25222563
summary(dfbetas(penmodfinal))
(Intercept) bill_depth_mm bill_length_mm
Min. :-2.241e-01 Min. :-0.3257955 Min. :-3.637e-01
1st Qu.:-1.834e-02 1st Qu.:-0.0121734 1st Qu.:-9.957e-03
Median :-3.307e-04 Median : 0.0009279 Median : 2.128e-04
Mean : 7.712e-05 Mean :-0.0000870 Mean :-7.541e-05
3rd Qu.: 1.552e-02 3rd Qu.: 0.0194839 3rd Qu.: 1.921e-02
Max. : 2.662e-01 Max. : 0.2017456 Max. : 2.407e-01
flipper_length_mm speciesChinstrap speciesGentoo
Min. :-2.511e-01 Min. :-2.327e-01 Min. :-3.006e-01
1st Qu.:-1.948e-02 1st Qu.:-1.623e-02 1st Qu.:-1.962e-02
Median : 1.204e-04 Median :-7.051e-05 Median : 6.127e-04
Mean :-3.359e-05 Mean :-9.018e-05 Mean : 8.558e-06
3rd Qu.: 2.059e-02 3rd Qu.: 1.830e-02 3rd Qu.: 1.995e-02
Max. : 2.652e-01 Max. : 2.515e-01 Max. : 2.159e-01
sexmale bill_depth_mm:bill_length_mm
Min. :-2.001e-01 Min. :-2.387e-01
1st Qu.:-2.291e-02 1st Qu.:-1.882e-02
Median :-1.235e-03 Median :-4.818e-04
Mean :-9.972e-05 Mean : 9.593e-05
3rd Qu.: 1.990e-02 3rd Qu.: 1.117e-02
Max. : 2.072e-01 Max. : 3.671e-01
pendata[c("15","40","170","294","314","315"),]
summary(pendata)
Observation body_mass_g species island
Min. : 1.00 Min. :2700 Length:344 Length:344
1st Qu.: 86.75 1st Qu.:3550 Class :character Class :character
Median :172.50 Median :4050 Mode :character Mode :character
Mean :172.50 Mean :4202
3rd Qu.:258.25 3rd Qu.:4750
Max. :344.00 Max. :6300
NA's :2
bill_length_mm bill_depth_mm flipper_length_mm sex
Min. :32.10 Min. :13.10 Min. :172.0 Length:344
1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 Class :character
Median :44.45 Median :17.30 Median :197.0 Mode :character
Mean :43.92 Mean :17.15 Mean :200.9
3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
Max. :59.60 Max. :21.50 Max. :231.0
NA's :2 NA's :2 NA's :2
year
Min. :2007
1st Qu.:2007
Median :2008
Mean :2008
3rd Qu.:2009
Max. :2009
Now we can compare the DFFits and DFBetas to the rest. Additionally, we can compare the x values to the rest of the data.
Upon investigating, observation 314 appears to be very influential and has high values of DFFits. If you want to remove just one or more observations, or a set of observations, we can use a - or the which function within a subset.
# You can do this directly into the data argument in lm
penmodrem.1<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[-314,])
summary(penmodrem.1)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata[-314,
])
Residuals:
Min 1Q Median 3Q Max
-704.2 -170.2 0.9 175.3 875.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6775.864 1838.196 -3.686 0.000267 ***
bill_depth_mm 365.764 98.923 3.697 0.000256 ***
bill_length_mm 145.132 41.470 3.500 0.000531 ***
flipper_length_mm 16.108 2.881 5.592 4.78e-08 ***
speciesChinstrap -200.730 81.904 -2.451 0.014782 *
speciesGentoo 913.577 131.798 6.932 2.26e-11 ***
sexmale 396.318 47.184 8.399 1.42e-15 ***
bill_depth_mm:bill_length_mm -7.180 2.313 -3.104 0.002079 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 283 on 324 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8793, Adjusted R-squared: 0.8767
F-statistic: 337.1 on 7 and 324 DF, p-value: < 2.2e-16
# note DF is different
#remove more than one
penmodrem.2<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[-c(15,40,170,294,314,315),])
summary(penmodrem.2)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata[-c(15,
40, 170, 294, 314, 315), ])
Residuals:
Min 1Q Median 3Q Max
-681.17 -171.80 1.54 171.99 688.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6820.422 1833.965 -3.719 0.000236 ***
bill_depth_mm 359.304 99.122 3.625 0.000336 ***
bill_length_mm 142.566 41.217 3.459 0.000616 ***
flipper_length_mm 17.212 2.859 6.020 4.79e-09 ***
speciesChinstrap -181.755 81.946 -2.218 0.027260 *
speciesGentoo 883.023 128.661 6.863 3.52e-11 ***
sexmale 385.230 46.275 8.325 2.51e-15 ***
bill_depth_mm:bill_length_mm -7.107 2.309 -3.079 0.002260 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 274.2 on 319 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8845, Adjusted R-squared: 0.882
F-statistic: 349 on 7 and 319 DF, p-value: < 2.2e-16
#remove based on logical expression
penmodrem.3<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[which(pendata$bill_depth_mm<17.3),])
summary(penmodrem.3)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata[which(pendata$bill_depth_mm <
17.3), ])
Residuals:
Min 1Q Median 3Q Max
-683.38 -162.91 -14.99 155.11 860.97
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4913.958 4618.410 -1.064 0.28901
bill_depth_mm 254.936 281.470 0.906 0.36650
bill_length_mm 82.344 94.554 0.871 0.38519
flipper_length_mm 17.357 4.703 3.691 0.00031 ***
speciesChinstrap -117.223 135.779 -0.863 0.38930
speciesGentoo 1036.777 213.654 4.853 2.98e-06 ***
sexmale 447.063 71.778 6.228 4.34e-09 ***
bill_depth_mm:bill_length_mm -3.869 6.021 -0.642 0.52152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 269 on 153 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.917, Adjusted R-squared: 0.9132
F-statistic: 241.4 on 7 and 153 DF, p-value: < 2.2e-16
# You can store the subset data in a new object or you can include the subset
# directly into whatever code you perform, plot, lm, cor, etc
## Quantitative Variables
quantsubsetpen<-pendata[which(pendata$bill_depth_mm<17.3),]
#Example
plot((body_mass_g)~flipper_length_mm,data=quantsubsetpen)
plot((body_mass_g)~flipper_length_mm,data=pendata[which(pendata$bill_depth_mm<17.3),])
# Qualitative Variables
qualsubsetpen<-pendata[which(pendata$species=="Gentoo"),]
# Multiple Arguments
multaubsetpen1<-pendata[which(pendata$species=="Gentoo"&pendata$bill_depth_mm<17.3),]
multaubsetpen2<-pendata[which(pendata$species=="Gentoo"|pendata$bill_depth_mm<17.3),]
Cautions on removing observations
#This is not necessarily best practice but an option
#remove na values from data
pennasubset<-na.omit(pendata)
dim(pennasubset)
[1] 333 9
#Then we will want to renumber the observations
#row.names is specifically for data frame
row.names(pennasubset) <- 1:nrow(pennasubset)
#Note this is the SAME model because the missing values were not used to fit
penmodna<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pennasubset)
summary(penmodna)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pennasubset)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
#revisit outliers/influential observations
#Cooks Distance Thresholds
plot(penmodna,which=4)
#Leverage vs Studentized Residuals
influencePlot(penmodna,fill=F)
#Now subset the dffits and or dfbetas
dffits(penmodna)[c( 10,35,164,283,303,304)]
10 35 164 283 303 304
0.1286874 0.3818284 0.3833745 0.3424947 0.5436063 -0.4620501
summary(dffits(penmodna))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.462050 -0.093274 -0.003084 0.001633 0.101550 0.543606
Consider again the Gasturbine example.
gasturbine<-read.delim("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/GASTURBINE.txt")
gasturbine$log.CPRATIO<-log(gasturbine$CPRATIO)
gasturbine$log.AIRFLOW<-log(gasturbine$AIRFLOW)
gasturbine$log.POWER<-log(gasturbine$POWER)
#updated model
gasmod5<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP,data=gasturbine)
summary(gasmod5)
Call:
lm(formula = HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP, data = gasturbine)
Residuals:
Min 1Q Median 3Q Max
-1025.8 -297.9 -115.3 225.8 1425.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.436e+04 7.333e+02 19.582 < 2e-16 ***
RPM 1.051e-01 1.071e-02 9.818 2.55e-14 ***
INLET.TEMP -9.223e+00 7.869e-01 -11.721 < 2e-16 ***
EXH.TEMP 1.243e+01 2.071e+00 6.000 1.06e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 465 on 63 degrees of freedom
Multiple R-squared: 0.9189, Adjusted R-squared: 0.915
F-statistic: 237.9 on 3 and 63 DF, p-value: < 2.2e-16
First, write the hypothesized model that is fit.
Identify ALL of the observations that exceed each of the thresholds by using the which() function.
Investigate the DFFits and DFBetas for any relevant observations.
Considering the potential constant variance violation and any influential observations you’ve made. Attempt to correct the violation and/or remove up to 5 observations. Your goal is to fit a good model and retain as much of the data as possible AND meet the assumptions with he updated model. Attempt three different corrections (combinations of transformations and removing observations)
# Use this space to attempt transformations and removing observations
# and recheck the assumptions for each step
Summarize your findings from above. Which combination of transformation and removing observations did you determine was appropriate? Are the assumptions better met?
Compute a confidence interval for E(y) for observation 1 in the data table. Interpret this interval practically.
In your opinion, is it worth it to use the model you summarized in question 5 instead of “gasmod5”. Why or why not? What metrics of assessment are you using?