| fisherman | age | restime | height | weight | fishmlwk | fishpart | MeHg | TotHg |
|---|---|---|---|---|---|---|---|---|
| 1 | 45 | 6 | 175 | 70 | 14 | 2 | 4.011 | 4.484 |
| 1 | 38 | 13 | 173 | 73 | 7 | 1 | 4.026 | 4.789 |
| 1 | 24 | 2 | 168 | 66 | 7 | 2 | 3.578 | 3.856 |
| 1 | 41 | 2 | 183 | 80 | 7 | 1 | 10.988 | 11.435 |
| 1 | 43 | 11 | 175 | 78 | 21 | 1 | 10.520 | 10.849 |
| 1 | 58 | 2 | 176 | 75 | 21 | 1 | 6.169 | 6.457 |
I am curious in finding out the factors effecting mercury levels in Kuwaiti fishermen. My research question is, “What factors effect mercury levels in Kuwaiti fishermen. Mercury is a toxic metal and is capable of causing mercury poison to human beings. One way of consuming mercury is through seafood. Mercury poisoning can cause hearing and speech difficulties, lack of coordination, nerve loss in hands and face, and more problems. Kuwaitian’s diet consists of mainly seafood and this information could be helpful to living a healthier life style by this analysis.
This is a data set consisting of 135 observations and 10 variables. Those variables include: Fisherman indicator (fisherman) Age in years (age) Residence Time in years (restime) Height in cm (height) Weight in kg (weight) Fish meals per week ((fishmlwk) Parts of fish consumed: 0=none, 1=muscle tissue only, 2=mt and sometimes whole fish, 3=whole fish (fishpart) Methyl Mercury in mg/g (MeHg) Total Mercury in mg/g (TotHg).
fishermen$catpart <- ifelse(fishermen$fishpart < 1, "none",
ifelse(fishermen$fishpart >1, "whole", "muscle"))
#If fishpart <1, none. If fishpart >1, whole. If fishpart =1, muscle.
#I combine fishparts 2 and 3 because practically speaking, if you sometimes eating the whole fish is the same as eating just the whole fish. This tells me if you don't eat fish, you eat strictly muscle, or you eat both.
fishermen$fisherman = ifelse(fishermen$fisherman <1, "no", "yes")
#If fisherman <1, no. If fisherman => 1, yes.
final.data = fishermen[, -c(7)]
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.1808230 | 0.2773141 | 0.6520512 | 0.5155653 |
| fishermanyes | 0.0440958 | 0.0331341 | 1.3308276 | 0.1856682 |
| age | 0.0024037 | 0.0015636 | 1.5372964 | 0.1267482 |
| restime | -0.0022616 | 0.0024385 | -0.9274474 | 0.3554816 |
| height | -0.0008528 | 0.0015377 | -0.5546366 | 0.5801343 |
| weight | -0.0013890 | 0.0017551 | -0.7914070 | 0.4302057 |
| fishmlwk | 0.0035123 | 0.0024774 | 1.4177149 | 0.1587620 |
| MeHg | 1.0266092 | 0.0041274 | 248.7295718 | 0.0000000 |
| catpartnone | 0.0283584 | 0.0483997 | 0.5859213 | 0.5589849 |
| catpartwhole | -0.0341177 | 0.0273794 | -1.2461112 | 0.2150539 |
Here is a summary of our full model.
We can see from the residual plots that there are some minor violations: the variance of the residuals is not constant the QQ plot indicates the distribution of residuals is slightly off the normal distribution. The residual plot seems to have a weak curve pattern.
## Single term deletions
##
## Model:
## TotHg ~ age + fishmlwk + MeHg
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1.70 -582.69
## age 1 0.03 1.73 -582.00 2.6394 0.10665
## fishmlwk 1 0.08 1.78 -578.24 6.4046 0.01257 *
## MeHg 1 1040.85 1042.55 281.96 80272.9456 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0624310 | 0.0429722 | -1.452822 | 0.1486643 |
| age | 0.0020668 | 0.0012722 | 1.624624 | 0.1066463 |
| fishmlwk | 0.0050316 | 0.0019882 | 2.530731 | 0.0125658 |
| MeHg | 1.0250212 | 0.0036178 | 283.324806 | 0.0000000 |
Here is a summary of step.model we created from using backwards step wise regresssion.
We can see from the residual plots that there are some minor violations: the variance of the residuals is not constant the QQ plot indicates the distribution of residuals is slightly off the normal distribution. The residual plot seems to have a weak curve pattern.
We used Box Cox transformations by trying to use the log function on our predictor variables to try to make the data normally distributed. By looking at the lambda values in the figures above, we see that our approximate lambda values can by 0.1, and 0.95. This will give us 2 more models. This is where our exact line is which is the middle line, while the other two are showing the 95% confidence intervals. When our lambda values are not 0, our box cox rules state that we raise the dependent variable to the lambda value.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 1.0207337 | 0.0181508 | 56.2362654 | 0.0000000 |
| age | -0.0000740 | 0.0005374 | -0.1376811 | 0.8907039 |
| fishmlwk | -0.0012279 | 0.0008398 | -1.4620875 | 0.1461121 |
| MeHg | 0.0282159 | 0.0015281 | 18.4644721 | 0.0000000 |
Here is a summary our our transformed model raising TotHg ^0.1.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.1559243 | 0.0443470 | 3.516007 | 0.0006024 |
| age | 0.0014258 | 0.0013129 | 1.085978 | 0.2794830 |
| fishmlwk | 0.0044467 | 0.0020518 | 2.167178 | 0.0320298 |
| MeHg | 0.8939690 | 0.0037336 | 239.440710 | 0.0000000 |
Here is a summary our our transformed model raising TotHg ^0.95.
Our Q-Q plots show no normality.
| SSE | R.sq | R.adj | Cp | AIC | SBC | |
|---|---|---|---|---|---|---|
| full.model | 1.6277780 | 0.9985932 | 0.9984919 | 10 | -576.4379 | -547.3852 |
| step.model | 1.6986018 | 0.9985320 | 0.9984984 | 4 | -582.6884 | -571.0673 |
| sq.rt.1.TotHg | 0.3030456 | 0.7329920 | 0.7268773 | 4 | -815.3848 | -803.7637 |
| sq.rt.9.TotHg | 1.8090228 | 0.9979452 | 0.9978981 | 4 | -574.1859 | -562.5648 |
Using the goodness of fits measures, I am choosing the step.model to be my final model. I am choosing it because it has a higher R adjusted squared, and a lower AIC. We see that sq.rt.1.TotHg has the lowest AIC value, but its R adjusted squared value is significantly lower than the other models. Its data is also the least normal as shown by Q-Q plots.
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0624310 | 0.0429722 | -1.452822 | 0.1486643 |
| age | 0.0020668 | 0.0012722 | 1.624624 | 0.1066463 |
| fishmlwk | 0.0050316 | 0.0019882 | 2.530731 | 0.0125658 |
| MeHg | 1.0250212 | 0.0036178 | 283.324806 | 0.0000000 |
The table shows a summary of the final model, step.model.
\[TotHg= -0.0624310+0.0020668*age+0.0050316*fishmlwk+1.0250212*MeHg\]
The estimated regression coefficients are based on total mercury. The main factor adding to total mercury is methyl mercury, which is expected.
We used four models to find the best model. There are three variables and an intercept in our final model, two variables are significant, one variables is not signifcant but is close, and the intercept is not significant. Our assumption of variance is violated and so is our assumption of normality.
Our histograms show that the coefficients for age are skewed to the right and the coefficients for the intercept are skewed to the left. The coefficients for fishmlwk and MeHg appear to be normally distributed.
| Estimate | Std. Error | t value | Pr(>|t|) | btc.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -0.0624 | 0.0430 | -1.4528 | 0.1487 | [ -0.1517 , 0.0107 ] |
| age | 0.0021 | 0.0013 | 1.6246 | 0.1066 | [ -5e-04 , 0.0053 ] |
| fishmlwk | 0.0050 | 0.0020 | 2.5307 | 0.0126 | [ 0 , 0.0113 ] |
| MeHg | 1.0250 | 0.0036 | 283.3248 | 0.0000 | [ 1.0171 , 1.0306 ] |
Our bootstrap 95% confidence intervals for the coefficients are telling the same story that our p-values are. The p-values for age and intercept are both not significant, and 0 is in both of their bootstrap confidence intervals which indicates that they are not significant.
None of the histograms show normal distribution.
| Estimate | Std. Error | t value | Pr(>|t|) | btr.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -0.0624 | 0.0430 | -1.4528 | 0.1487 | [ -0.1416 , 0.0137 ] |
| age | 0.0021 | 0.0013 | 1.6246 | 0.1066 | [ -1e-04 , 0.0045 ] |
| fishmlwk | 0.0050 | 0.0020 | 2.5307 | 0.0126 | [ 0 , 0.0089 ] |
| MeHg | 1.0250 | 0.0036 | 283.3248 | 0.0000 | [ 0 , 1.0317 ] |
| Estimate | Std. Error | Pr(>|t|) | btc.ci.95 | btr.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -0.0624 | 0.0430 | 0.1487 | [ -0.1517 , 0.0107 ] | [ -0.1416 , 0.0137 ] |
| age | 0.0021 | 0.0013 | 0.1066 | [ -5e-04 , 0.0053 ] | [ -1e-04 , 0.0045 ] |
| fishmlwk | 0.0050 | 0.0020 | 0.0126 | [ 0 , 0.0113 ] | [ 0 , 0.0089 ] |
| MeHg | 1.0250 | 0.0036 | 0.0000 | [ 1.0171 , 1.0306 ] | [ 0 , 1.0317 ] |
Our bootstrap 95% confidence intervals for the residuals are telling the same story that our p-values are. The p-values for age and intercept are both not significant, and 0 is in both of their bootstrap confidence intervals which indicates that they are not significant.
| btc.wd | btr.wd |
|---|---|
| 0.1624429 | 0.1553043 |
| 0.0057872 | 0.0046100 |
| 0.0113275 | 0.0089176 |
| 0.0135227 | 1.0316980 |
We see that the widths of both of the bootstraps are similar.
In our analysis we created four different models. We created new variables as well. We used our full model, we did backwards step wise regression, and created two different models based on our box cox transformations. Our assumptions were violated in all four of the models. The were not normally distributed and the variance is not constant. We chose step.model to be our final model because it had the highest R adjusted square value and the lowest AIC value. The model contained an intercept and 3 variables. The intercept and variable “fishmlwk” was not significant. The only significant variables were MeHg and fishmlwk.
One way to fix this would be to gather more data. Our problem is that our data is not normal. We could possibly remove the intercept and age because they are not significant either. We found that MeHg and fishmlwk could help us predict the total mercury in Kuwaiti fishermen, but we would need more data.
In conclusion, this model should not be used to predict total mercury in Kuwaiti fishermen. The intercept and the variable age were not significant as shown by p-values. Our bootstrap confidence intervals also show that the intercept and variables age were not significant. The data violated our assumptions which led us to a poor model.