Description

Research Questions

What variables correlate the most to total mercury?

Import data

fishermen = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/fishermen_mercury.csv", header = TRUE)
kable(head(fishermen))
fisherman age restime height weight fishmlwk fishpart MeHg TotHg
1 45 6 175 70 14 2 4.011 4.484
1 38 13 173 73 7 1 4.026 4.789
1 24 2 168 66 7 2 3.578 3.856
1 41 2 183 80 7 1 10.988 11.435
1 43 11 175 78 21 1 10.520 10.849
1 58 2 176 75 21 1 6.169 6.457

Creating Dummy Variables

fishermen$catpart <- ifelse(fishermen$fishpart < 1, "none",
                           ifelse(fishermen$fishpart < 2, "muscle",
                           ifelse(fishermen$fishpart <3, "both",
                           ifelse(fishermen$fishpart <4, "whole"))))

fishermen$fisherman = ifelse(fishermen$fisherman <1, "no", "yes")
full.model = lm(TotHg ~ ., data = fishermen)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1167133 0.2774186 0.4207117 0.6746938
fishermanyes 0.0422451 0.0333500 1.2667207 0.2076293
age 0.0023698 0.0015684 1.5109517 0.1333450
restime -0.0023521 0.0024489 -0.9604571 0.3386943
height -0.0009551 0.0015503 -0.6160618 0.5389830
weight -0.0014187 0.0017601 -0.8060325 0.4217672
fishmlwk 0.0035735 0.0024855 1.4377503 0.1530238
fishpart 0.0254875 0.0411188 0.6198488 0.5364947
MeHg 1.0263888 0.0041529 247.1519556 0.0000000
catpartmuscle 0.0626317 0.0535675 1.1692106 0.2445614
catpartnone 0.1137736 0.0939457 1.2110567 0.2281768
par(mfrow=c(2,2))
plot(full.model)

Stepwise Regression

drop1(full.model, test = "F")
## Single term deletions
## 
## Model:
## TotHg ~ fisherman + age + restime + height + weight + fishmlwk + 
##     fishpart + MeHg + catpart
##           Df Sum of Sq    RSS     AIC    F value Pr(>F)    
## <none>                   1.62 -574.86                      
## fisherman  1      0.02   1.64 -575.12     1.6046 0.2076    
## age        1      0.03   1.65 -574.39     2.2830 0.1333    
## restime    1      0.01   1.63 -575.86     0.9225 0.3387    
## height     1      0.00   1.63 -576.44     0.3795 0.5390    
## weight     1      0.01   1.63 -576.15     0.6497 0.4218    
## fishmlwk   1      0.03   1.65 -574.62     2.0671 0.1530    
## fishpart   0      0.00   1.62 -574.86                      
## MeHg       1    799.39 801.01  260.38 61084.0892 <2e-16 ***
## catpart    2      0.02   1.64 -577.14     0.7917 0.4553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(update(full.model, ~ . -height -restime -fishpart -weight -catpart -fisherman), test = "F")
## Single term deletions
## 
## Model:
## TotHg ~ age + fishmlwk + MeHg
##          Df Sum of Sq     RSS     AIC    F value  Pr(>F)    
## <none>                   1.70 -582.69                       
## age       1      0.03    1.73 -582.00     2.6394 0.10665    
## fishmlwk  1      0.08    1.78 -578.24     6.4046 0.01257 *  
## MeHg      1   1040.85 1042.55  281.96 80272.9456 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
step.model = lm(TotHg ~ age + fishmlwk + MeHg, data = fishermen)
kable(summary(step.model)$coef, caption ="Statistics of Regression Coefficients")
Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0624310 0.0429722 -1.452822 0.1486643
age 0.0020668 0.0012722 1.624624 0.1066463
fishmlwk 0.0050316 0.0019882 2.530731 0.0125658
MeHg 1.0250212 0.0036178 283.324806 0.0000000
par(mfrow = c(2,2))
plot(step.model)

Finding the better model

#define plotting area
par(pty = "s", mfrow = c(1, 3))
#Q-Q plot for original model
qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)
#Q-Q plot for Box-Cox transformed model
qqnorm(step.model$residuals, main = "Step Model")
qqline(step.model$residuals)

The final model

kable(summary(step.model)$coef, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0624310 0.0429722 -1.452822 0.1486643
age 0.0020668 0.0012722 1.624624 0.1066463
fishmlwk 0.0050316 0.0019882 2.530731 0.0125658
MeHg 1.0250212 0.0036178 283.324806 0.0000000

Summary of the final model

\[TotHg= -0.0624310+0.0020668*age+0.0050316*fishmlwk+1.0250212*MeHg\] The estimated regression coefficients are based on total mercury. The main factor adding to total mercury is methlyn mercury, which is expected.

We used two models to find the best model. There are three variables total in our model, two are significant and one is nearly significant and practically speaking it makes sense to leave it in our model. Our assumption of variance is violated so we need to fix that by using bootstrap regressions.