Description
Research Questions
What variables correlate the most to total mercury?
Import data
fishermen = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/fishermen_mercury.csv", header = TRUE)
kable(head(fishermen))
| 1 |
45 |
6 |
175 |
70 |
14 |
2 |
4.011 |
4.484 |
| 1 |
38 |
13 |
173 |
73 |
7 |
1 |
4.026 |
4.789 |
| 1 |
24 |
2 |
168 |
66 |
7 |
2 |
3.578 |
3.856 |
| 1 |
41 |
2 |
183 |
80 |
7 |
1 |
10.988 |
11.435 |
| 1 |
43 |
11 |
175 |
78 |
21 |
1 |
10.520 |
10.849 |
| 1 |
58 |
2 |
176 |
75 |
21 |
1 |
6.169 |
6.457 |
Creating Dummy Variables
fishermen$catpart <- ifelse(fishermen$fishpart < 1, "none",
ifelse(fishermen$fishpart < 2, "muscle",
ifelse(fishermen$fishpart <3, "both",
ifelse(fishermen$fishpart <4, "whole"))))
fishermen$fisherman = ifelse(fishermen$fisherman <1, "no", "yes")
full.model = lm(TotHg ~ ., data = fishermen)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
Statistics of Regression Coefficients
| (Intercept) |
0.1167133 |
0.2774186 |
0.4207117 |
0.6746938 |
| fishermanyes |
0.0422451 |
0.0333500 |
1.2667207 |
0.2076293 |
| age |
0.0023698 |
0.0015684 |
1.5109517 |
0.1333450 |
| restime |
-0.0023521 |
0.0024489 |
-0.9604571 |
0.3386943 |
| height |
-0.0009551 |
0.0015503 |
-0.6160618 |
0.5389830 |
| weight |
-0.0014187 |
0.0017601 |
-0.8060325 |
0.4217672 |
| fishmlwk |
0.0035735 |
0.0024855 |
1.4377503 |
0.1530238 |
| fishpart |
0.0254875 |
0.0411188 |
0.6198488 |
0.5364947 |
| MeHg |
1.0263888 |
0.0041529 |
247.1519556 |
0.0000000 |
| catpartmuscle |
0.0626317 |
0.0535675 |
1.1692106 |
0.2445614 |
| catpartnone |
0.1137736 |
0.0939457 |
1.2110567 |
0.2281768 |
par(mfrow=c(2,2))
plot(full.model)

Stepwise Regression
drop1(full.model, test = "F")
## Single term deletions
##
## Model:
## TotHg ~ fisherman + age + restime + height + weight + fishmlwk +
## fishpart + MeHg + catpart
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1.62 -574.86
## fisherman 1 0.02 1.64 -575.12 1.6046 0.2076
## age 1 0.03 1.65 -574.39 2.2830 0.1333
## restime 1 0.01 1.63 -575.86 0.9225 0.3387
## height 1 0.00 1.63 -576.44 0.3795 0.5390
## weight 1 0.01 1.63 -576.15 0.6497 0.4218
## fishmlwk 1 0.03 1.65 -574.62 2.0671 0.1530
## fishpart 0 0.00 1.62 -574.86
## MeHg 1 799.39 801.01 260.38 61084.0892 <2e-16 ***
## catpart 2 0.02 1.64 -577.14 0.7917 0.4553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(update(full.model, ~ . -height -restime -fishpart -weight -catpart -fisherman), test = "F")
## Single term deletions
##
## Model:
## TotHg ~ age + fishmlwk + MeHg
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1.70 -582.69
## age 1 0.03 1.73 -582.00 2.6394 0.10665
## fishmlwk 1 0.08 1.78 -578.24 6.4046 0.01257 *
## MeHg 1 1040.85 1042.55 281.96 80272.9456 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
step.model = lm(TotHg ~ age + fishmlwk + MeHg, data = fishermen)
kable(summary(step.model)$coef, caption ="Statistics of Regression Coefficients")
Statistics of Regression Coefficients
| (Intercept) |
-0.0624310 |
0.0429722 |
-1.452822 |
0.1486643 |
| age |
0.0020668 |
0.0012722 |
1.624624 |
0.1066463 |
| fishmlwk |
0.0050316 |
0.0019882 |
2.530731 |
0.0125658 |
| MeHg |
1.0250212 |
0.0036178 |
283.324806 |
0.0000000 |
par(mfrow = c(2,2))
plot(step.model)

Finding the better model
#define plotting area
par(pty = "s", mfrow = c(1, 3))
#Q-Q plot for original model
qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)
#Q-Q plot for Box-Cox transformed model
qqnorm(step.model$residuals, main = "Step Model")
qqline(step.model$residuals)

The final model
kable(summary(step.model)$coef, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
| (Intercept) |
-0.0624310 |
0.0429722 |
-1.452822 |
0.1486643 |
| age |
0.0020668 |
0.0012722 |
1.624624 |
0.1066463 |
| fishmlwk |
0.0050316 |
0.0019882 |
2.530731 |
0.0125658 |
| MeHg |
1.0250212 |
0.0036178 |
283.324806 |
0.0000000 |
Summary of the final model
\[TotHg=
-0.0624310+0.0020668*age+0.0050316*fishmlwk+1.0250212*MeHg\] The
estimated regression coefficients are based on total mercury. The main
factor adding to total mercury is methlyn mercury, which is
expected.
We used two models to find the best model. There are three variables
total in our model, two are significant and one is nearly significant
and practically speaking it makes sense to leave it in our model. Our
assumption of variance is violated so we need to fix that by using
bootstrap regressions.