fisherman	age	restime	height	weight	fishmlwk	fishpart	MeHg	TotHg
1	45	6	175	70	14	2	4.011	4.484
1	38	13	173	73	7	1	4.026	4.789
1	24	2	168	66	7	2	3.578	3.856
1	41	2	183	80	7	1	10.988	11.435
1	43	11	175	78	21	1	10.520	10.849
1	58	2	176	75	21	1	6.169	6.457

Introduction

I am curious in finding out the factors effecting mercury levels in Kuwaiti fishermen. My research question is, “What factors effect mercury levels in Kuwaiti fishermen. Mercury is a toxic metal and is capable of causing mercury poison to human beings. One way of consuming mercury is through seafood. Mercury poisoning can cause hearing and speech difficulties, lack of coordination, nerve loss in hands and face, and more problems. Kuwaitian’s diet consists of mainly seafood and this information could be helpful to living a healthier life style by this analysis.

This is a data set consisting of 135 observations and 10 variables. Those variables include: Fisherman indicator (fisherman) Age in years (age) Residence Time in years (restime) Height in cm (height) Weight in kg (weight) Fish meals per week ((fishmlwk) Parts of fish consumed: 0=none, 1=muscle tissue only, 2=mt and sometimes whole fish, 3=whole fish (fishpart) Methyl Mercury in mg/g (MeHg) Total Mercury in mg/g (TotHg).

Data preparation and exploratory analysis

fishermen$catpart <- ifelse(fishermen$fishpart < 1, "none",
                           ifelse(fishermen$fishpart >1, "whole", "muscle")) 
#If fishpart <1, none. If fishpart >1, whole. If fishpart =1, muscle.

#I combine fishparts 2 and 3 because practically speaking, if you sometimes eating the whole fish is the same as eating just the whole fish. This tells me if you don't eat fish, you eat strictly muscle, or you eat both.

fishermen$fisherman = ifelse(fishermen$fisherman <1, "no", "yes")
#If fisherman <1, no. If fisherman => 1, yes.

final.data = fishermen[, -c(7)]

Model Building

Full Model

Statistics of Regression Coefficients
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	0.1808230	0.2773141	0.6520512	0.5155653
fishermanyes	0.0440958	0.0331341	1.3308276	0.1856682
age	0.0024037	0.0015636	1.5372964	0.1267482
restime	-0.0022616	0.0024385	-0.9274474	0.3554816
height	-0.0008528	0.0015377	-0.5546366	0.5801343
weight	-0.0013890	0.0017551	-0.7914070	0.4302057
fishmlwk	0.0035123	0.0024774	1.4177149	0.1587620
MeHg	1.0266092	0.0041274	248.7295718	0.0000000
catpartnone	0.0283584	0.0483997	0.5859213	0.5589849
catpartwhole	-0.0341177	0.0273794	-1.2461112	0.2150539

Here is a summary of our full model.

We can see from the residual plots that there are some minor violations: the variance of the residuals is not constant the QQ plot indicates the distribution of residuals is slightly off the normal distribution. The residual plot seems to have a weak curve pattern.

Backwards Step Wise Regression

## Single term deletions
## 
## Model:
## TotHg ~ age + fishmlwk + MeHg
##          Df Sum of Sq     RSS     AIC    F value  Pr(>F)    
## <none>                   1.70 -582.69                       
## age       1      0.03    1.73 -582.00     2.6394 0.10665    
## fishmlwk  1      0.08    1.78 -578.24     6.4046 0.01257 *  
## MeHg      1   1040.85 1042.55  281.96 80272.9456 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Statistics of Regression Coefficients
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-0.0624310	0.0429722	-1.452822	0.1486643
age	0.0020668	0.0012722	1.624624	0.1066463
fishmlwk	0.0050316	0.0019882	2.530731	0.0125658
MeHg	1.0250212	0.0036178	283.324806	0.0000000

Here is a summary of step.model we created from using backwards step wise regresssion.

We can see from the residual plots that there are some minor violations: the variance of the residuals is not constant the QQ plot indicates the distribution of residuals is slightly off the normal distribution. The residual plot seems to have a weak curve pattern.

Box Cox Transformation

We used Box Cox transformations by trying to use the log function on our predictor variables to try to make the data normally distributed. By looking at the lambda values in the figures above, we see that our approximate lambda values can by 0.1, and 0.95. This will give us 2 more models. This is where our exact line is which is the middle line, while the other two are showing the 95% confidence intervals. When our lambda values are not 0, our box cox rules state that we raise the dependent variable to the lambda value.

Transformed models

Sq.Rt.1 model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	1.0207337	0.0181508	56.2362654	0.0000000
age	-0.0000740	0.0005374	-0.1376811	0.8907039
fishmlwk	-0.0012279	0.0008398	-1.4620875	0.1461121
MeHg	0.0282159	0.0015281	18.4644721	0.0000000

Here is a summary our our transformed model raising TotHg ^0.1.

Sq.Rt.9 model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	0.1559243	0.0443470	3.516007	0.0006024
age	0.0014258	0.0013129	1.085978	0.2794830
fishmlwk	0.0044467	0.0020518	2.167178	0.0320298
MeHg	0.8939690	0.0037336	239.440710	0.0000000

Here is a summary our our transformed model raising TotHg ^0.95.

Side by Side Showing No Normality

Our Q-Q plots show no normality.

Goodness of Fits Measures

Goodness-of-fit Measures of Candidate Models
	SSE	R.sq	R.adj	Cp	AIC	SBC
full.model	1.6277780	0.9985932	0.9984919	10	-576.4379	-547.3852
step.model	1.6986018	0.9985320	0.9984984	4	-582.6884	-571.0673
sq.rt.1.TotHg	0.3030456	0.7329920	0.7268773	4	-815.3848	-803.7637
sq.rt.9.TotHg	1.8090228	0.9979452	0.9978981	4	-574.1859	-562.5648

Using the goodness of fits measures, I am choosing the step.model to be my final model. I am choosing it because it has a higher R adjusted squared, and a lower AIC. We see that sq.rt.1.TotHg has the lowest AIC value, but its R adjusted squared value is significantly lower than the other models. Its data is also the least normal as shown by Q-Q plots.

The final model

Inferential Statistics of Final Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-0.0624310	0.0429722	-1.452822	0.1486643
age	0.0020668	0.0012722	1.624624	0.1066463
fishmlwk	0.0050316	0.0019882	2.530731	0.0125658
MeHg	1.0250212	0.0036178	283.324806	0.0000000

The table shows a summary of the final model, step.model.

\[TotHg= -0.0624310+0.0020668*age+0.0050316*fishmlwk+1.0250212*MeHg\]

The estimated regression coefficients are based on total mercury. The main factor adding to total mercury is methyl mercury, which is expected.

We used four models to find the best model. There are three variables and an intercept in our final model, two variables are significant, one variables is not signifcant but is close, and the intercept is not significant. Our assumption of variance is violated and so is our assumption of normality.

Bootstrapping the Final Model

Bootstrapping the coefficients

Our histograms show that the coefficients for age are skewed to the right and the coefficients for the intercept are skewed to the left. The coefficients for fishmlwk and MeHg appear to be normally distributed.

Regression Coefficient Matrix
	Estimate	Std. Error	t value	Pr(>\|t\|)	btc.ci.95
(Intercept)	-0.0624	0.0430	-1.4528	0.1487	[ -0.1517 , 0.0107 ]
age	0.0021	0.0013	1.6246	0.1066	[ -5e-04 , 0.0053 ]
fishmlwk	0.0050	0.0020	2.5307	0.0126	[ 0 , 0.0113 ]
MeHg	1.0250	0.0036	283.3248	0.0000	[ 1.0171 , 1.0306 ]

Our bootstrap 95% confidence intervals for the coefficients are telling the same story that our p-values are. The p-values for age and intercept are both not significant, and 0 is in both of their bootstrap confidence intervals which indicates that they are not significant.

Bootstrap Residuals

None of the histograms show normal distribution.

Regression Coefficient Matrix with 95% Residual Bootstrap CI
	Estimate	Std. Error	t value	Pr(>\|t\|)	btr.ci.95
(Intercept)	-0.0624	0.0430	-1.4528	0.1487	[ -0.1416 , 0.0137 ]
age	0.0021	0.0013	1.6246	0.1066	[ -1e-04 , 0.0045 ]
fishmlwk	0.0050	0.0020	2.5307	0.0126	[ 0 , 0.0089 ]
MeHg	1.0250	0.0036	283.3248	0.0000	[ 0 , 1.0317 ]

Final Combined Inferential Statistics: p-values and Bootstrap CIs
	Estimate	Std. Error	Pr(>\|t\|)	btc.ci.95	btr.ci.95
(Intercept)	-0.0624	0.0430	0.1487	[ -0.1517 , 0.0107 ]	[ -0.1416 , 0.0137 ]
age	0.0021	0.0013	0.1066	[ -5e-04 , 0.0053 ]	[ -1e-04 , 0.0045 ]
fishmlwk	0.0050	0.0020	0.0126	[ 0 , 0.0113 ]	[ 0 , 0.0089 ]
MeHg	1.0250	0.0036	0.0000	[ 1.0171 , 1.0306 ]	[ 0 , 1.0317 ]

Our bootstrap 95% confidence intervals for the residuals are telling the same story that our p-values are. The p-values for age and intercept are both not significant, and 0 is in both of their bootstrap confidence intervals which indicates that they are not significant.

width of the two bootstrap confidence intervals
btc.wd	btr.wd
0.1624429	0.1553043
0.0057872	0.0046100
0.0113275	0.0089176
0.0135227	1.0316980

We see that the widths of both of the bootstraps are similar.

Summary and Discussion

In our analysis we created four different models. We created new variables as well. We used our full model, we did backwards step wise regression, and created two different models based on our box cox transformations. Our assumptions were violated in all four of the models. The were not normally distributed and the variance is not constant. We chose step.model to be our final model because it had the highest R adjusted square value and the lowest AIC value. The model contained an intercept and 3 variables. The intercept and variable “fishmlwk” was not significant. The only significant variables were MeHg and fishmlwk.

One way to fix this would be to gather more data. Our problem is that our data is not normal. We could possibly remove the intercept and age because they are not significant either. We found that MeHg and fishmlwk could help us predict the total mercury in Kuwaiti fishermen, but we would need more data.

In conclusion, this model should not be used to predict total mercury in Kuwaiti fishermen. The intercept and the variable age were not significant as shown by p-values. Our bootstrap confidence intervals also show that the intercept and variables age were not significant. The data violated our assumptions which led us to a poor model.

Project 1 no code

Connor McKee

2023-02-27