library(knitr)
library(MASS)
library(nleqslv)
library("scales") 

fishermen = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/fishermen_mercury.csv", header = TRUE) #reading in csv file and naming it
kable(head(fishermen))  #clean table of the first 5 observations of the data

fisherman	age	restime	height	weight	fishmlwk	fishpart	MeHg	TotHg
1	45	6	175	70	14	2	4.011	4.484
1	38	13	173	73	7	1	4.026	4.789
1	24	2	168	66	7	2	3.578	3.856
1	41	2	183	80	7	1	10.988	11.435
1	43	11	175	78	21	1	10.520	10.849
1	58	2	176	75	21	1	6.169	6.457

Introduction

I am curious in finding out the factors effecting mercury levels in Kuwaiti fishermen. My research question is, “What factors effect mercury levels in Kuwaiti fishermen. Mercury is a toxic metal and is capable of causing mercury poison to human beings. One way of consuming mercury is through seafood. Mercury poisoning can cause hearing and speech difficulties, lack of coordination, nerve loss in hands and face, and more problems. Kuwaitian’s diet consists of mainly seafood and this information could be helpful to living a healthier life style by this analysis.

This is a data set consisting of 135 observations and 10 variables. Those variables include: Fisherman indicator (fisherman) Age in years (age) Residence Time in years (restime) Height in cm (height) Weight in kg (weight) Fish meals per week ((fishmlwk) Parts of fish consumed: 0=none, 1=muscle tissue only, 2=mt and sometimes whole fish, 3=whole fish (fishpart) Methyl Mercury in mg/g (MeHg) Total Mercury in mg/g (TotHg).

Data preparation and exploratory analysis

fishermen$catpart <- ifelse(fishermen$fishpart < 1, "none",
                           ifelse(fishermen$fishpart >1, "whole", "muscle")) 
#If fishpart <1, none. If fishpart >1, whole. If fishpart =1, muscle.

#I combine fishparts 2 and 3 because practically speaking, if you sometimes eating the whole fish is the same as eating just the whole fish. This tells me if you don't eat fish, you eat strictly muscle, or you eat both.

fishermen$fisherman = ifelse(fishermen$fisherman <1, "no", "yes")
#If fisherman <1, no. If fisherman => 1, yes.

final.data = fishermen[, -c(7)]

Model Building

Full Model

full.model = lm(TotHg ~ ., data = final.data) # full model using 9 variables, look at p-values
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")

Statistics of Regression Coefficients
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	0.1808230	0.2773141	0.6520512	0.5155653
fishermanyes	0.0440958	0.0331341	1.3308276	0.1856682
age	0.0024037	0.0015636	1.5372964	0.1267482
restime	-0.0022616	0.0024385	-0.9274474	0.3554816
height	-0.0008528	0.0015377	-0.5546366	0.5801343
weight	-0.0013890	0.0017551	-0.7914070	0.4302057
fishmlwk	0.0035123	0.0024774	1.4177149	0.1587620
MeHg	1.0266092	0.0041274	248.7295718	0.0000000
catpartnone	0.0283584	0.0483997	0.5859213	0.5589849
catpartwhole	-0.0341177	0.0273794	-1.2461112	0.2150539

par(mfrow=c(2,2)) #create a 2x2 table of graphs to check for violations
plot(full.model)

We can see from the residual plots that there are some minor violations: the variance of the residuals is not constant the QQ plot indicates the distribution of residuals is slightly off the normal distribution. The residual plot seems to have a weak curve pattern.

Backwards Step Wise Regression

drop1(update(full.model, ~ . -height -restime -weight -catpart -fisherman), test = "F")

## Single term deletions
## 
## Model:
## TotHg ~ age + fishmlwk + MeHg
##          Df Sum of Sq     RSS     AIC    F value  Pr(>F)    
## <none>                   1.70 -582.69                       
## age       1      0.03    1.73 -582.00     2.6394 0.10665    
## fishmlwk  1      0.08    1.78 -578.24     6.4046 0.01257 *  
## MeHg      1   1040.85 1042.55  281.96 80272.9456 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Using backwards step wise regression. The variables listed above were removed due to p > 0.05, but I kept age(p=0.10665)

step.model.data = final.data[, -c(1,3,4,5,9)] #data frame for my step wise model.

step.model = lm(TotHg ~ age + fishmlwk + MeHg, data = final.data) #Creating model for step wise
kable(summary(step.model)$coef, caption ="Statistics of Regression Coefficients")

Statistics of Regression Coefficients
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-0.0624310	0.0429722	-1.452822	0.1486643
age	0.0020668	0.0012722	1.624624	0.1066463
fishmlwk	0.0050316	0.0019882	2.530731	0.0125658
MeHg	1.0250212	0.0036178	283.324806	0.0000000

par(mfrow = c(2,2)) #create a 2x2 table of graphs to check for violations
plot(step.model)

Box Cox Transformation

par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
##
boxcox(TotHg ~ log(age) + fishmlwk + MeHg, data = final.data, 
       lambda = seq(0, 3, length = 10), # Vector of values of lambda
       xlab=expression(paste(lambda, ": log age"))) # X-axis title
##
boxcox(TotHg ~ age + log(1+fishmlwk) + MeHg, data = final.data, lambda = seq(0, 3, length = 10), 
       xlab=expression(paste(lambda, ": log fishmlwk ")))
##
boxcox(TotHg ~ age + fishmlwk + log(MeHg), data = final.data, lambda = seq(-3, 2, length = 10), 
       xlab=expression(paste(lambda, ": log MeHg")))
##
boxcox(TotHg ~ age + log(1+fishmlwk) + log(MeHg), data = final.data, lambda = seq(-3, 2, length = 10), 
       xlab=expression(paste(lambda, ": log MeHg, log fishmlwk")))

We used Box Cox transformations by trying to use the log function on our predictor variables to try to make the data normally distributed. By looking at the lambda values in the figures above, we see that our approximate lambda values can by 0.1, and 0.95. This will give us 2 more models. This is where our exact line is which is the middle line, while the other two are showing the 95% confidence intervals. When our lambda values are not 0, our box cox rules state that we raise the dependent variable to the lambda value.

Transformed models

sq.rt.1.TotHg = lm((TotHg)^0.1 ~ age + fishmlwk + MeHg, data = final.data) #transforming model ^0.1
kable(summary(sq.rt.1.TotHg)$coef, caption = "Sq.Rt.1 model")

Sq.Rt.1 model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	1.0207337	0.0181508	56.2362654	0.0000000
age	-0.0000740	0.0005374	-0.1376811	0.8907039
fishmlwk	-0.0012279	0.0008398	-1.4620875	0.1461121
MeHg	0.0282159	0.0015281	18.4644721	0.0000000

sq.rt.9.TotHg = lm((TotHg)^0.95 ~ age + fishmlwk + MeHg, data = final.data) #transforming model ^0.9
kable(summary(sq.rt.9.TotHg)$coef, caption = "Sq.Rt.9 model")

Sq.Rt.9 model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	0.1559243	0.0443470	3.516007	0.0006024
age	0.0014258	0.0013129	1.085978	0.2794830
fishmlwk	0.0044467	0.0020518	2.167178	0.0320298
MeHg	0.8939690	0.0037336	239.440710	0.0000000

Side by Side Showing No Normality

#define plotting area
par(pty = "s", mfrow = c(2, 2))
#Q-Q plot for original model and creating Q-Q line
qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)
#Q-Q plot for step model and creating Q-Q line
qqnorm(step.model$residuals, main = "Step Model")
qqline(step.model$residuals)
#Q-Q plot for sq.rt.1.TotHg model and creating Q-Q line
qqnorm(sq.rt.1.TotHg$residuals, main = "sq.rt.1.TotHg")
qqline(sq.rt.1.TotHg$residuals)
#Q-Q plot for sq.rt.9.TotHg model and creating Q-Q line
qqnorm(sq.rt.9.TotHg$residuals, main = "sq.rt.9.TotHg")
qqline(sq.rt.9.TotHg$residuals)

Goodness of Fits Measures

select=function(m){                    # creating an object called m: model
 e = m$resid                           # residuals inside the model
 n0 = length(e)                        # sample size of the residuals
 SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error formula
 R.sq=summary(m)$r.squared             # R square formula
 R.adj=summary(m)$adj.r                # Adjusted R square formula
 MSE=(summary(m)$sigma)^2              # square error formula
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mellow's p formula
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion formula
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion formula
 
 # creating object table which is a data frame using data frame with object=numerical value.
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC")
 tbl
}

# Output of the summary of our goodness of fit measures values with each model as row names.
output.sum = rbind(select(full.model), select(step.model), select(sq.rt.1.TotHg), select(sq.rt.9.TotHg))
row.names(output.sum) = c("full.model", "step.model","sq.rt.1.TotHg", "sq.rt.9.TotHg")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")

Goodness-of-fit Measures of Candidate Models
	SSE	R.sq	R.adj	Cp	AIC	SBC
full.model	1.6277780	0.9985932	0.9984919	10	-576.4379	-547.3852
step.model	1.6986018	0.9985320	0.9984984	4	-582.6884	-571.0673
sq.rt.1.TotHg	0.3030456	0.7329920	0.7268773	4	-815.3848	-803.7637
sq.rt.9.TotHg	1.8090228	0.9979452	0.9978981	4	-574.1859	-562.5648

Using the goodness of fits measures, I am choosing the step.model to be my final model. I am choosing it because it has a higher R adjusted squared, and a lower AIC. We see that sq.rt.1.TotHg has the lowest AIC value, but its R adjusted squared value is significantly lower than the other models. Its data is also the least normal as shown by Q-Q plots.

The final model

kable(summary(step.model)$coef, caption = "Inferential Statistics of Final Model")

Inferential Statistics of Final Model
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-0.0624310	0.0429722	-1.452822	0.1486643
age	0.0020668	0.0012722	1.624624	0.1066463
fishmlwk	0.0050316	0.0019882	2.530731	0.0125658
MeHg	1.0250212	0.0036178	283.324806	0.0000000

\[TotHg= -0.0624310+0.0020668*age+0.0050316*fishmlwk+1.0250212*MeHg\]

The estimated regression coefficients are based on total mercury. The main factor adding to total mercury is methyl mercury, which is expected.

We used two models to find the best model. There are three variables total in our final model, two are significant and one is nearly significant and practically speaking it makes sense to leave it in our model. Our assumption of variance is violated and so is our assumption of normality.

Bootstrapping the Final Model

Bootstrapping the coefficients

cmtrx <- summary(step.model)$coef

##
B = 1000       # number of bootstrap replicates
## 
num.p = dim(model.frame(step.model))[2]  # number of parameters in the model
smpl.n = dim(model.frame(step.model))[1] # sample size
## zero matrix to store bootstrap coefficients 
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p)        #creating the coefficiant matrix
## 
for (i in 1:B){
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # fit final model to the bootstrap sample
  step.model.btc = lm(TotHg ~ age + fishmlwk + MeHg, data=final.data[bootc.id,])     
  coef.mtrx[i,] = coef(step.model.btc)    # this will extract coefficients from bootstrap regression model
}

boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## coefficient matrix of the final model
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 ) #creating object with mix and max bootstrap coeffificent matrixes
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id])) #creating object with mean and standard deviation of bootstrap coefficient matrix
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3") #objects we created earlier
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
  #legend("topright", c(""))
}

par(mfrow=c(2,2))  # histograms of bootstrap coefs
#creating histogram with each variable
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" ) 
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Age" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="fishmlwk" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="MeHg" )

Our histograms show that the coefficients for age are skewed to the right and the coefficients for the intercept are skewed to the left. The coefficients for fishmlwk and MeHg appear to be normally distributed.

num.p = dim(coef.mtrx)[2]  # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2),8)
  uci.975 = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8)
  btc.wd[i] =  uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
 }
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)), 
      caption = "Regression Coefficient Matrix")

Regression Coefficient Matrix
	Estimate	Std. Error	t value	Pr(>\|t\|)	btc.ci.95
(Intercept)	-0.0624	0.0430	-1.4528	0.1487	[ -0.151 , 0.016 ]
age	0.0021	0.0013	1.6246	0.1066	[ -5e-04 , 0.0052 ]
fishmlwk	0.0050	0.0020	2.5307	0.0126	[ 0 , 0.0114 ]
MeHg	1.0250	0.0036	283.3248	0.0000	[ 1.0171 , 1.0302 ]

Our bootstrap 95% confidence intervals for the coefficients are telling the same story that our p-values are. The p-values for age and intercept are both not significant, and 0 is in both of their bootstrap confidence intervals which indicates that they are not significant.

Bootstrap Residuals

## Final model
step.model.mercury <- lm(lm(TotHg ~ age + fishmlwk + MeHg, data = fishermen), data = final.data)
model.resid = step.model.mercury$residuals
##
B=1000
num.p = dim(model.matrix(step.model.mercury))[2]   # creates the number of parameters
samp.n = dim(model.matrix(step.model.mercury))[1]  # creates the sample size
btr.mtrx = matrix(rep(0,6*B), ncol=num.p) # creates the zero matrix to store boot coefs
for (i in 1:B){
  ## Bootstrap response values
  bt.sm.mercury = step.model.mercury$fitted.values + 
        sample(step.model.mercury$residuals, samp.n, replace = TRUE)  # bootstrap residuals #repleace so same probability
  
  final.data$bt.sm.mercury =  bt.sm.mercury   #  send the boot response to the data
  btr.model = lm(bt.sm.mercury ~ age + fishmlwk + MeHg, data = final.data)   
  btr.mtrx[i,]=btr.model$coefficients
}

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  ## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
  ## var.id = variable ID (1, 2, ..., k+1)
  ## var.nm = variable name on the hist title, must be the string in the double quotes
  ## Bootstrap sampling distribution of the estimated coefficients
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  # height of the histogram - use it to make a nice-looking histogram.
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")       # normal density curve         
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")    # loess curve
}

par(mfrow=c(2,2))  # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="age" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="fishmlwk" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="MeHg" )

None of the histograms show normal distribution.

num.p = dim(coef.mtrx)[2]  # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
  uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci)), 
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")

Regression Coefficient Matrix with 95% Residual Bootstrap CI
	Estimate	Std. Error	t value	Pr(>\|t\|)	btr.ci.95
(Intercept)	-0.0624	0.0430	-1.4528	0.1487	[ -0.1331 , 0.0118 ]
age	0.0021	0.0013	1.6246	0.1066	[ -1e-04 , 0.0042 ]
fishmlwk	0.0050	0.0020	2.5307	0.0126	[ 0 , 0.009 ]
MeHg	1.0250	0.0036	283.3248	0.0000	[ 0 , 1.032 ]

kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci)), 
      caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")

Final Combined Inferential Statistics: p-values and Bootstrap CIs
	Estimate	Std. Error	Pr(>\|t\|)	btc.ci.95	btr.ci.95
(Intercept)	-0.0624	0.0430	0.1487	[ -0.151 , 0.016 ]	[ -0.1331 , 0.0118 ]
age	0.0021	0.0013	0.1066	[ -5e-04 , 0.0052 ]	[ -1e-04 , 0.0042 ]
fishmlwk	0.0050	0.0020	0.0126	[ 0 , 0.0114 ]	[ 0 , 0.009 ]
MeHg	1.0250	0.0036	0.0000	[ 1.0171 , 1.0302 ]	[ 0 , 1.032 ]

Our bootstrap 95% confidence intervals for the residuals are telling the same story that our p-values are. The p-values for age and intercept are both not significant, and 0 is in both of their bootstrap confidence intervals which indicates that they are not significant.

kable(cbind(btc.wd, btr.wd), caption="width of the two bootstrap confidence intervals")

width of the two bootstrap confidence intervals
btc.wd	btr.wd
0.1669553	0.1449167
0.0057675	0.0042862
0.0113337	0.0089844
0.0131720	1.0319823

We see that the widths of both of the bootstraps are similar.

Summary and Discussion

In our analysis we created four different models. We created new variables as well. We used our full model, we did backwards step wise regression, and created two different models based on our box cox transformations. Our assumptions were violated in all four of the models. The were not normally distributed and the variance is not constant. We chose step.model to be our final model because it had the highest R adjusted square value and the lowest AIC value. The model contained an intercept and 3 variables. The intercept and variable “fishmlwk” was not significant. The only significant variables were MeHg and fishmlwk.

In conclusion, this model should not be used to predict total mercury in Kuwaiti fishermen. The data violated our assumptions which led us to a poor model.

One way to fix this would be to gather more data. Our problem is that our data is not normal. We could possibly not use the intercept and age because they are not significant either. We found that MeHg and fishmlwk could help us predict the total mercury in Kuwaiti fishermen, but we would need more data.

Project 1 with code

Connor McKee

2023-02-24