library(knitr)
library(MASS)
library(nleqslv)
library("scales")
fishermen = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/fishermen_mercury.csv", header = TRUE) #reading in csv file and naming it
kable(head(fishermen)) #clean table of the first 5 observations of the data
| fisherman | age | restime | height | weight | fishmlwk | fishpart | MeHg | TotHg |
|---|---|---|---|---|---|---|---|---|
| 1 | 45 | 6 | 175 | 70 | 14 | 2 | 4.011 | 4.484 |
| 1 | 38 | 13 | 173 | 73 | 7 | 1 | 4.026 | 4.789 |
| 1 | 24 | 2 | 168 | 66 | 7 | 2 | 3.578 | 3.856 |
| 1 | 41 | 2 | 183 | 80 | 7 | 1 | 10.988 | 11.435 |
| 1 | 43 | 11 | 175 | 78 | 21 | 1 | 10.520 | 10.849 |
| 1 | 58 | 2 | 176 | 75 | 21 | 1 | 6.169 | 6.457 |
I am curious in finding out the factors effecting mercury levels in Kuwaiti fishermen. My research question is, “What factors effect mercury levels in Kuwaiti fishermen. Mercury is a toxic metal and is capable of causing mercury poison to human beings. One way of consuming mercury is through seafood. Mercury poisoning can cause hearing and speech difficulties, lack of coordination, nerve loss in hands and face, and more problems. Kuwaitian’s diet consists of mainly seafood and this information could be helpful to living a healthier life style by this analysis.
This is a data set consisting of 135 observations and 10 variables. Those variables include: Fisherman indicator (fisherman) Age in years (age) Residence Time in years (restime) Height in cm (height) Weight in kg (weight) Fish meals per week ((fishmlwk) Parts of fish consumed: 0=none, 1=muscle tissue only, 2=mt and sometimes whole fish, 3=whole fish (fishpart) Methyl Mercury in mg/g (MeHg) Total Mercury in mg/g (TotHg).
fishermen$catpart <- ifelse(fishermen$fishpart < 1, "none",
ifelse(fishermen$fishpart >1, "whole", "muscle"))
#If fishpart <1, none. If fishpart >1, whole. If fishpart =1, muscle.
#I combine fishparts 2 and 3 because practically speaking, if you sometimes eating the whole fish is the same as eating just the whole fish. This tells me if you don't eat fish, you eat strictly muscle, or you eat both.
fishermen$fisherman = ifelse(fishermen$fisherman <1, "no", "yes")
#If fisherman <1, no. If fisherman => 1, yes.
final.data = fishermen[, -c(7)]
full.model = lm(TotHg ~ ., data = final.data) # full model using 9 variables, look at p-values
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.1808230 | 0.2773141 | 0.6520512 | 0.5155653 |
| fishermanyes | 0.0440958 | 0.0331341 | 1.3308276 | 0.1856682 |
| age | 0.0024037 | 0.0015636 | 1.5372964 | 0.1267482 |
| restime | -0.0022616 | 0.0024385 | -0.9274474 | 0.3554816 |
| height | -0.0008528 | 0.0015377 | -0.5546366 | 0.5801343 |
| weight | -0.0013890 | 0.0017551 | -0.7914070 | 0.4302057 |
| fishmlwk | 0.0035123 | 0.0024774 | 1.4177149 | 0.1587620 |
| MeHg | 1.0266092 | 0.0041274 | 248.7295718 | 0.0000000 |
| catpartnone | 0.0283584 | 0.0483997 | 0.5859213 | 0.5589849 |
| catpartwhole | -0.0341177 | 0.0273794 | -1.2461112 | 0.2150539 |
par(mfrow=c(2,2)) #create a 2x2 table of graphs to check for violations
plot(full.model)
We can see from the residual plots that there are some minor violations: the variance of the residuals is not constant the QQ plot indicates the distribution of residuals is slightly off the normal distribution. The residual plot seems to have a weak curve pattern.
drop1(update(full.model, ~ . -height -restime -weight -catpart -fisherman), test = "F")
## Single term deletions
##
## Model:
## TotHg ~ age + fishmlwk + MeHg
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1.70 -582.69
## age 1 0.03 1.73 -582.00 2.6394 0.10665
## fishmlwk 1 0.08 1.78 -578.24 6.4046 0.01257 *
## MeHg 1 1040.85 1042.55 281.96 80272.9456 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Using backwards step wise regression. The variables listed above were removed due to p > 0.05, but I kept age(p=0.10665)
step.model.data = final.data[, -c(1,3,4,5,9)] #data frame for my step wise model.
step.model = lm(TotHg ~ age + fishmlwk + MeHg, data = final.data) #Creating model for step wise
kable(summary(step.model)$coef, caption ="Statistics of Regression Coefficients")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0624310 | 0.0429722 | -1.452822 | 0.1486643 |
| age | 0.0020668 | 0.0012722 | 1.624624 | 0.1066463 |
| fishmlwk | 0.0050316 | 0.0019882 | 2.530731 | 0.0125658 |
| MeHg | 1.0250212 | 0.0036178 | 283.324806 | 0.0000000 |
par(mfrow = c(2,2)) #create a 2x2 table of graphs to check for violations
plot(step.model)
We can see from the residual plots that there are some minor violations: the variance of the residuals is not constant the QQ plot indicates the distribution of residuals is slightly off the normal distribution. The residual plot seems to have a weak curve pattern.
par(pty = "s", mfrow = c(2, 2), oma=c(.1,.1,.1,.1), mar=c(4, 0, 2, 0))
##
boxcox(TotHg ~ log(age) + fishmlwk + MeHg, data = final.data,
lambda = seq(0, 3, length = 10), # Vector of values of lambda
xlab=expression(paste(lambda, ": log age"))) # X-axis title
##
boxcox(TotHg ~ age + log(1+fishmlwk) + MeHg, data = final.data, lambda = seq(0, 3, length = 10),
xlab=expression(paste(lambda, ": log fishmlwk ")))
##
boxcox(TotHg ~ age + fishmlwk + log(MeHg), data = final.data, lambda = seq(-3, 2, length = 10),
xlab=expression(paste(lambda, ": log MeHg")))
##
boxcox(TotHg ~ age + log(1+fishmlwk) + log(MeHg), data = final.data, lambda = seq(-3, 2, length = 10),
xlab=expression(paste(lambda, ": log MeHg, log fishmlwk")))
We used Box Cox transformations by trying to use the log function on our predictor variables to try to make the data normally distributed. By looking at the lambda values in the figures above, we see that our approximate lambda values can by 0.1, and 0.95. This will give us 2 more models. This is where our exact line is which is the middle line, while the other two are showing the 95% confidence intervals. When our lambda values are not 0, our box cox rules state that we raise the dependent variable to the lambda value.
sq.rt.1.TotHg = lm((TotHg)^0.1 ~ age + fishmlwk + MeHg, data = final.data) #transforming model ^0.1
kable(summary(sq.rt.1.TotHg)$coef, caption = "Sq.Rt.1 model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 1.0207337 | 0.0181508 | 56.2362654 | 0.0000000 |
| age | -0.0000740 | 0.0005374 | -0.1376811 | 0.8907039 |
| fishmlwk | -0.0012279 | 0.0008398 | -1.4620875 | 0.1461121 |
| MeHg | 0.0282159 | 0.0015281 | 18.4644721 | 0.0000000 |
sq.rt.9.TotHg = lm((TotHg)^0.95 ~ age + fishmlwk + MeHg, data = final.data) #transforming model ^0.9
kable(summary(sq.rt.9.TotHg)$coef, caption = "Sq.Rt.9 model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 0.1559243 | 0.0443470 | 3.516007 | 0.0006024 |
| age | 0.0014258 | 0.0013129 | 1.085978 | 0.2794830 |
| fishmlwk | 0.0044467 | 0.0020518 | 2.167178 | 0.0320298 |
| MeHg | 0.8939690 | 0.0037336 | 239.440710 | 0.0000000 |
#define plotting area
par(pty = "s", mfrow = c(2, 2))
#Q-Q plot for original model and creating Q-Q line
qqnorm(full.model$residuals, main = "Full-Model")
qqline(full.model$residuals)
#Q-Q plot for step model and creating Q-Q line
qqnorm(step.model$residuals, main = "Step Model")
qqline(step.model$residuals)
#Q-Q plot for sq.rt.1.TotHg model and creating Q-Q line
qqnorm(sq.rt.1.TotHg$residuals, main = "sq.rt.1.TotHg")
qqline(sq.rt.1.TotHg$residuals)
#Q-Q plot for sq.rt.9.TotHg model and creating Q-Q line
qqnorm(sq.rt.9.TotHg$residuals, main = "sq.rt.9.TotHg")
qqline(sq.rt.9.TotHg$residuals)
select=function(m){ # creating an object called m: model
e = m$resid # residuals inside the model
n0 = length(e) # sample size of the residuals
SSE=(m$df)*(summary(m)$sigma)^2 # sum of squared error formula
R.sq=summary(m)$r.squared # R square formula
R.adj=summary(m)$adj.r # Adjusted R square formula
MSE=(summary(m)$sigma)^2 # square error formula
Cp=(SSE/MSE)-(n0-2*(n0-m$df)) # Mellow's p formula
AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df) # Akaike information criterion formula
SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df) # Schwarz Bayesian Information criterion formula
# creating object table which is a data frame using data frame with object=numerical value.
tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC))
names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC")
tbl
}
# Output of the summary of our goodness of fit measures values with each model as row names.
output.sum = rbind(select(full.model), select(step.model), select(sq.rt.1.TotHg), select(sq.rt.9.TotHg))
row.names(output.sum) = c("full.model", "step.model","sq.rt.1.TotHg", "sq.rt.9.TotHg")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
| SSE | R.sq | R.adj | Cp | AIC | SBC | |
|---|---|---|---|---|---|---|
| full.model | 1.6277780 | 0.9985932 | 0.9984919 | 10 | -576.4379 | -547.3852 |
| step.model | 1.6986018 | 0.9985320 | 0.9984984 | 4 | -582.6884 | -571.0673 |
| sq.rt.1.TotHg | 0.3030456 | 0.7329920 | 0.7268773 | 4 | -815.3848 | -803.7637 |
| sq.rt.9.TotHg | 1.8090228 | 0.9979452 | 0.9978981 | 4 | -574.1859 | -562.5648 |
Using the goodness of fits measures, I am choosing the step.model to be my final model. I am choosing it because it has a higher R adjusted squared, and a lower AIC. We see that sq.rt.1.TotHg has the lowest AIC value, but its R adjusted squared value is significantly lower than the other models. Its data is also the least normal as shown by Q-Q plots.
kable(summary(step.model)$coef, caption = "Inferential Statistics of Final Model")
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0624310 | 0.0429722 | -1.452822 | 0.1486643 |
| age | 0.0020668 | 0.0012722 | 1.624624 | 0.1066463 |
| fishmlwk | 0.0050316 | 0.0019882 | 2.530731 | 0.0125658 |
| MeHg | 1.0250212 | 0.0036178 | 283.324806 | 0.0000000 |
\[TotHg= -0.0624310+0.0020668*age+0.0050316*fishmlwk+1.0250212*MeHg\]
The estimated regression coefficients are based on total mercury. The main factor adding to total mercury is methyl mercury, which is expected.
We used two models to find the best model. There are three variables total in our final model, two are significant and one is nearly significant and practically speaking it makes sense to leave it in our model. Our assumption of variance is violated and so is our assumption of normality.
cmtrx <- summary(step.model)$coef
##
B = 1000 # number of bootstrap replicates
##
num.p = dim(model.frame(step.model))[2] # number of parameters in the model
smpl.n = dim(model.frame(step.model))[1] # sample size
## zero matrix to store bootstrap coefficients
coef.mtrx = matrix(rep(0, B*num.p), ncol = num.p) #creating the coefficiant matrix
##
for (i in 1:B){
bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # fit final model to the bootstrap sample
step.model.btc = lm(TotHg ~ age + fishmlwk + MeHg, data=final.data[bootc.id,])
coef.mtrx[i,] = coef(step.model.btc) # this will extract coefficients from bootstrap regression model
}
boot.hist = function(cmtrx, bt.coef.mtrx, var.id, var.nm){
## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
## var.id = variable ID (1, 2, ..., k+1)
## var.nm = variable name on the hist title, must be the string in the double quotes
## coefficient matrix of the final model
## Bootstrap sampling distribution of the estimated coefficients
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 ) #creating object with mix and max bootstrap coeffificent matrixes
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id])) #creating object with mean and standard deviation of bootstrap coefficient matrix
# height of the histogram - use it to make a nice-looking histogram.
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3") #objects we created earlier
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue")
#legend("topright", c(""))
}
par(mfrow=c(2,2)) # histograms of bootstrap coefs
#creating histogram with each variable
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Age" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="fishmlwk" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="MeHg" )
Our histograms show that the coefficients for age are skewed to the right and the coefficients for the intercept are skewed to the left. The coefficients for fishmlwk and MeHg appear to be normally distributed.
num.p = dim(coef.mtrx)[2] # number of parameters
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2),8)
uci.975 = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8)
btc.wd[i] = uci.975 - lci.025
btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)),
caption = "Regression Coefficient Matrix")
| Estimate | Std. Error | t value | Pr(>|t|) | btc.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -0.0624 | 0.0430 | -1.4528 | 0.1487 | [ -0.151 , 0.016 ] |
| age | 0.0021 | 0.0013 | 1.6246 | 0.1066 | [ -5e-04 , 0.0052 ] |
| fishmlwk | 0.0050 | 0.0020 | 2.5307 | 0.0126 | [ 0 , 0.0114 ] |
| MeHg | 1.0250 | 0.0036 | 283.3248 | 0.0000 | [ 1.0171 , 1.0302 ] |
Our bootstrap 95% confidence intervals for the coefficients are telling the same story that our p-values are. The p-values for age and intercept are both not significant, and 0 is in both of their bootstrap confidence intervals which indicates that they are not significant.
## Final model
step.model.mercury <- lm(lm(TotHg ~ age + fishmlwk + MeHg, data = fishermen), data = final.data)
model.resid = step.model.mercury$residuals
##
B=1000
num.p = dim(model.matrix(step.model.mercury))[2] # creates the number of parameters
samp.n = dim(model.matrix(step.model.mercury))[1] # creates the sample size
btr.mtrx = matrix(rep(0,6*B), ncol=num.p) # creates the zero matrix to store boot coefs
for (i in 1:B){
## Bootstrap response values
bt.sm.mercury = step.model.mercury$fitted.values +
sample(step.model.mercury$residuals, samp.n, replace = TRUE) # bootstrap residuals #repleace so same probability
final.data$bt.sm.mercury = bt.sm.mercury # send the boot response to the data
btr.model = lm(bt.sm.mercury ~ age + fishmlwk + MeHg, data = final.data)
btr.mtrx[i,]=btr.model$coefficients
}
boot.hist = function(bt.coef.mtrx, var.id, var.nm){
## bt.coef.mtrx = matrix for storing bootstrap estimates of coefficients
## var.id = variable ID (1, 2, ..., k+1)
## var.nm = variable name on the hist title, must be the string in the double quotes
## Bootstrap sampling distribution of the estimated coefficients
x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
# height of the histogram - use it to make a nice-looking histogram.
highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density)
ylimit <- max(c(y1.1,highestbar))
hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="",
col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
lines(x = x1.1, y = y1.1, col = "red3") # normal density curve
lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") # loess curve
}
par(mfrow=c(2,2)) # histograms of bootstrap coefs
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="age" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="fishmlwk" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="MeHg" )
None of the histograms show normal distribution.
num.p = dim(coef.mtrx)[2] # number of parameters
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p){
lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
btr.wd[i] = uci.975 - lci.025
btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
#as.data.frame(btc.ci)
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btr.ci.95=btr.ci)),
caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
| Estimate | Std. Error | t value | Pr(>|t|) | btr.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -0.0624 | 0.0430 | -1.4528 | 0.1487 | [ -0.1331 , 0.0118 ] |
| age | 0.0021 | 0.0013 | 1.6246 | 0.1066 | [ -1e-04 , 0.0042 ] |
| fishmlwk | 0.0050 | 0.0020 | 2.5307 | 0.0126 | [ 0 , 0.009 ] |
| MeHg | 1.0250 | 0.0036 | 283.3248 | 0.0000 | [ 0 , 1.032 ] |
kable(as.data.frame(cbind(formatC(cmtrx[,-3],4,format="f"), btc.ci.95=btc.ci,btr.ci.95=btr.ci)),
caption="Final Combined Inferential Statistics: p-values and Bootstrap CIs")
| Estimate | Std. Error | Pr(>|t|) | btc.ci.95 | btr.ci.95 | |
|---|---|---|---|---|---|
| (Intercept) | -0.0624 | 0.0430 | 0.1487 | [ -0.151 , 0.016 ] | [ -0.1331 , 0.0118 ] |
| age | 0.0021 | 0.0013 | 0.1066 | [ -5e-04 , 0.0052 ] | [ -1e-04 , 0.0042 ] |
| fishmlwk | 0.0050 | 0.0020 | 0.0126 | [ 0 , 0.0114 ] | [ 0 , 0.009 ] |
| MeHg | 1.0250 | 0.0036 | 0.0000 | [ 1.0171 , 1.0302 ] | [ 0 , 1.032 ] |
Our bootstrap 95% confidence intervals for the residuals are telling the same story that our p-values are. The p-values for age and intercept are both not significant, and 0 is in both of their bootstrap confidence intervals which indicates that they are not significant.
kable(cbind(btc.wd, btr.wd), caption="width of the two bootstrap confidence intervals")
| btc.wd | btr.wd |
|---|---|
| 0.1669553 | 0.1449167 |
| 0.0057675 | 0.0042862 |
| 0.0113337 | 0.0089844 |
| 0.0131720 | 1.0319823 |
We see that the widths of both of the bootstraps are similar.
In our analysis we created four different models. We created new variables as well. We used our full model, we did backwards step wise regression, and created two different models based on our box cox transformations. Our assumptions were violated in all four of the models. The were not normally distributed and the variance is not constant. We chose step.model to be our final model because it had the highest R adjusted square value and the lowest AIC value. The model contained an intercept and 3 variables. The intercept and variable “fishmlwk” was not significant. The only significant variables were MeHg and fishmlwk.
In conclusion, this model should not be used to predict total mercury in Kuwaiti fishermen. The data violated our assumptions which led us to a poor model.
One way to fix this would be to gather more data. Our problem is that our data is not normal. We could possibly not use the intercept and age because they are not significant either. We found that MeHg and fishmlwk could help us predict the total mercury in Kuwaiti fishermen, but we would need more data.