Diabetes is defined as a disorder of the assimilation, use and storage of sugars supplied by the diet. Its management relies on monitoring overweight and obesity and on regular glycemic control. Lowering blood glucose in diabetic patients reduces macrovascular and especially microvascular complications. The glycohemoglobin (A1c) assay provides a simple estimate of average blood glucose levels over the preceding two months. The A1c level allows caregivers and patients to evaluate glycemic control and set therapeutic goals. The objective of this project is to study the association between glycohemoglobin, overweight and obesity in patients between 20 and 70 years of age with pre-diabetic and diabetic conditions, using data collected by the National Health and Nutrition Examination Survey (NHANES).
In this project, we use three statistical methods to examine the association between glycohemoglobin, overweight and obesity in patients. We processed the data and kept only the variables needed for this project. We chose logistic regression, Poisson regression and one-way analysis of variance. We compared the results of the three methods and determined that logistic regression gave better results than the other two.
Our project analyzed the NHANES data set, which is publicly available from the repository of the Department of Biostatistics at Vanderbilt University. The data are intended to support research on body size, glycohemoglobin and demographics in the United States population. The data set consists of 6,795 observations, with 20 variables recorded for each person interviewed in the survey. The data were collected during household interviews, through computer-assisted software at the center, and during the laboratory exam at the examination center. Standardized questionnaires were used, and only persons 16 years of age and older and emancipated minors were interviewed.
The variables are: seqn: Respondent sequence number, integer. gh: Glycohemoglobin, double. sex: Gender, integer. age: Age, double. re: Race/Ethnicity, integer. tx: On Insulin or Diabetes Meds, integer. bmi: Body Mass Index, double.
The data set was obtained from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/nhgh.rda. We processed the data and removed unwanted variables. Our R code runs on the processed data we created; it will not run properly on the data set in the repository unless the same processing is applied first. In our analysis, gh (glycohemoglobin) is the response (y) variable, and the remaining 5 variables (sex, age, race, bmi and tx) are the predictor (x) variables. We recoded bmi as a categorical variable (underweight, normal, overweight and obese), sex as a categorical variable (female and male) and race as a categorical variable (Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black and Other Race Including Multi-Racial), making them suitable for our analysis. Of the 6,795 participants identified by the National Health and Nutrition Examination Survey (NHANES), this study included the 4,648 adults between 20 and 70 years of age who met the inclusion criteria. It excluded people younger than 20 or older than 70 years (n=2,147).
# We import this dataset
nhanes_dataset <- read.csv("//Users/mikelapika/Desktop/Datasets/nhanes.csv", header = TRUE, sep = ",")
# We start the data processing: we remove unwanted variables and keep only the variables used in this analysis.
nhanes_dataset <- nhanes_dataset[,-1]
colnames(nhanes_dataset) <- c("sex","age","race","tx","dx","weight","height","bmi","gh")
keeping <- c("sex","age","race","tx","dx","bmi","gh")
nhanes_dataset <- nhanes_dataset[keeping]
col1 <- sapply(nhanes_dataset, anyNA) # apply anyNA() to every column of nhanes_dataset
col1
## sex age race tx dx bmi gh
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# Data processing to predict
library(tidyverse)
nhanes_dataset2 <- nhanes_dataset
# Convert sex and race to factors (explicit column access, so attach() is not needed here)
nhanes_dataset2$sex <- factor(nhanes_dataset2$sex)
nhanes_dataset2$race <- factor(nhanes_dataset2$race)
# Make sure the treatment (tx) and diagnosis (dx) indicators are stored as numeric
nhanes_dataset2$tx <- as.numeric(nhanes_dataset2$tx)
nhanes_dataset2$dx <- as.numeric(nhanes_dataset2$dx)
str(nhanes_dataset2)
## 'data.frame': 6795 obs. of 7 variables:
## $ sex : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ age : num 34.2 16.8 60.2 26.1 49.7 ...
## $ race: Factor w/ 5 levels "Mexican American",..: 3 2 2 1 3 3 3 4 2 2 ...
## $ tx : num 0 0 1 0 0 0 1 0 0 1 ...
## $ dx : num 0 0 1 0 0 1 1 0 0 1 ...
## $ bmi : num 32.2 22 42.4 32.6 30.6 ...
## $ gh : num 5.2 5.7 6 5.1 5.3 5.4 6.8 5.1 5.6 11 ...
# Keep patients older than 20 and younger than 70:
nhanes_dataset3 <- nhanes_dataset2 %>% filter(age > 20 & age < 70)
# attach() makes age, bmi and tx visible by name for the recoding steps below
attach(nhanes_dataset3)
# Recode gh as a binary factor: 1 if gh > 6.5, otherwise 0
nhanes_dataset3$gh <- ifelse(nhanes_dataset3$gh>6.5,1,0)
nhanes_dataset3$gh <- as.factor(nhanes_dataset3$gh)
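# Recode age into categories
# Note: age is fractional, so values strictly between 29 and 30 or between 39 and 40
# match no bin and stay NA; this likely explains the observations dropped as missing in the adjusted models
nhanes_dataset3$agecat[20<=age & age<=29] <- "20-29"
nhanes_dataset3$agecat[30<=age & age<=39] <- "30-39"
nhanes_dataset3$agecat[40<=age & age<=70] <- "40-69"
# Convert agecat to a factor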
nhanes_dataset3$agecat <- as.factor(nhanes_dataset3$agecat)
# Recode bmi into categories
# Boundary values (18.5, 25, 30) end up in the higher category, because later assignments overwrite earlier ones
nhanes_dataset3$bmicat[0<=bmi & bmi<=18.5] <- "underweight"
nhanes_dataset3$bmicat[18.5<=bmi & bmi<=25] <- "normal"
nhanes_dataset3$bmicat[25.0<=bmi & bmi<=30] <- "overweight"
nhanes_dataset3$bmicat[30.0<=bmi & bmi<=100] <- "obese"
# Convert bmicat to a factor
nhanes_dataset3$bmicat <- as.factor(nhanes_dataset3$bmicat)
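# Change tx to a categorical variable
# Note: tx is a single yes/no indicator (on insulin or diabetes meds). With the labels below,
# "On Insulin" marks tx == 0 (presumably not treated) and "Diabetes Meds" marks tx == 1 (presumably treated).
nhanes_dataset3$tx <- as.factor(nhanes_dataset3$tx)
nhanes_dataset3$txcat[tx =="0"] <- "On Insulin"
nhanes_dataset3$txcat[tx =="1"] <- "Diabetes Meds"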
nhanes_dataset3$txcat <- as.factor(nhanes_dataset3$txcat)
str(nhanes_dataset3)
## 'data.frame': 4648 obs. of 10 variables:
## $ sex : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
## $ age : num 34.2 60.2 26.1 49.7 43 ...
## $ race : Factor w/ 5 levels "Mexican American",..: 3 2 1 3 2 1 3 1 3 3 ...
## $ tx : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 1 1 ...
## $ dx : num 0 1 0 0 1 0 0 0 0 0 ...
## $ bmi : num 32.2 42.4 32.6 30.6 39.9 ...
## $ gh : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## $ agecat: Factor w/ 3 levels "20-29","30-39",..: 2 3 1 3 3 3 3 1 3 3 ...
## $ bmicat: Factor w/ 4 levels "normal","obese",..: 2 2 2 2 2 3 3 3 3 1 ...
## $ txcat : Factor w/ 2 levels "Diabetes Meds",..: 2 1 2 2 1 2 2 2 2 2 ...
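The interval-based recoding above can also be written with cut(), which assigns every value inside the breaks to exactly one bin. The following is a minimal sketch, not the code used to produce the results in this report: the helper names age_to_cat and bmi_to_cat are hypothetical, and the half-open intervals (for example, ages from 20 up to but not including 30 labelled "20-29") are an assumption about the intended categories.

```r
# Hypothetical helpers (not part of the original analysis).
# Assumption: bins are half-open, [lower, upper), so fractional values
# such as age 29.5 still fall into a category.
age_to_cat <- function(age) {
  cut(age, breaks = c(20, 30, 40, 70),
      labels = c("20-29", "30-39", "40-69"), right = FALSE)
}

bmi_to_cat <- function(bmi) {
  cut(bmi, breaks = c(0, 18.5, 25, 30, Inf),
      labels = c("underweight", "normal", "overweight", "obese"),
      right = FALSE)
}

# Example values chosen for illustration only
example_age <- age_to_cat(c(26.1, 29.5, 39.9, 60.2))
example_bmi <- bmi_to_cat(c(17.0, 18.5, 25.0, 32.2))
```

Because this changes which participants receive an age category, refitting the models with these helpers would not necessarily reproduce the output shown in this report.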
# Pairwise scatterplots to inspect the relationships between the variables in the dataset
my_cols <- c("#00AFBB", "#E7B800", "#FC4E07")
pairs(nhanes_dataset3, pch = 19,col = my_cols)
#Logistic regression
# load package
library(sjPlot)
library(sjmisc)
library(sjlabelled)
#Unadjusted model
m1 <- glm(gh ~ relevel(bmicat, ref="normal"),data=nhanes_dataset3,family="binomial")
summary(m1)
##
## Call:
## glm(formula = gh ~ relevel(bmicat, ref = "normal"), family = "binomial",
## data = nhanes_dataset3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5797 -0.5797 -0.3511 -0.2367 2.6793
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -3.5612 0.1739 -20.476
## relevel(bmicat, ref = "normal")obese 1.8627 0.1857 10.030
## relevel(bmicat, ref = "normal")overweight 0.8058 0.2045 3.941
## relevel(bmicat, ref = "normal")underweight -13.0049 278.9415 -0.047
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## relevel(bmicat, ref = "normal")obese < 2e-16 ***
## relevel(bmicat, ref = "normal")overweight 8.12e-05 ***
## relevel(bmicat, ref = "normal")underweight 0.963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2750.3 on 4647 degrees of freedom
## Residual deviance: 2561.8 on 4644 degrees of freedom
## AIC: 2569.8
##
## Number of Fisher Scoring iterations: 15
#adjusted
m2 <- glm(gh ~ relevel(bmicat, ref="normal") + relevel(sex,ref="female") + relevel(agecat,ref="20-29") + relevel(race,ref="Other Hispanic") + relevel(txcat,ref="On Insulin") , data = nhanes_dataset3,family="binomial")
summary(m2)
##
## Call:
## glm(formula = gh ~ relevel(bmicat, ref = "normal") + relevel(sex,
## ref = "female") + relevel(agecat, ref = "20-29") + relevel(race,
## ref = "Other Hispanic") + relevel(txcat, ref = "On Insulin"),
## family = "binomial", data = nhanes_dataset3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9676 -0.2844 -0.2029 -0.1157 3.1973
##
## Coefficients:
## Estimate
## (Intercept) -5.81601
## relevel(bmicat, ref = "normal")obese 1.10043
## relevel(bmicat, ref = "normal")overweight 0.30835
## relevel(bmicat, ref = "normal")underweight -13.10721
## relevel(sex, ref = "female")male 0.45241
## relevel(agecat, ref = "20-29")30-39 0.88549
## relevel(agecat, ref = "20-29")40-69 1.91776
## relevel(race, ref = "Other Hispanic")Mexican American 0.40250
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black -0.07294
## relevel(race, ref = "Other Hispanic")Non-Hispanic White -0.42732
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial 0.51400
## relevel(txcat, ref = "On Insulin")Diabetes Meds 3.72287
## Std. Error
## (Intercept) 0.50110
## relevel(bmicat, ref = "normal")obese 0.23107
## relevel(bmicat, ref = "normal")overweight 0.24972
## relevel(bmicat, ref = "normal")underweight 445.87036
## relevel(sex, ref = "female")male 0.14420
## relevel(agecat, ref = "20-29")30-39 0.48252
## relevel(agecat, ref = "20-29")40-69 0.42648
## relevel(race, ref = "Other Hispanic")Mexican American 0.24369
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black 0.25483
## relevel(race, ref = "Other Hispanic")Non-Hispanic White 0.23683
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial 0.35409
## relevel(txcat, ref = "On Insulin")Diabetes Meds 0.14741
## z value
## (Intercept) -11.606
## relevel(bmicat, ref = "normal")obese 4.762
## relevel(bmicat, ref = "normal")overweight 1.235
## relevel(bmicat, ref = "normal")underweight -0.029
## relevel(sex, ref = "female")male 3.137
## relevel(agecat, ref = "20-29")30-39 1.835
## relevel(agecat, ref = "20-29")40-69 4.497
## relevel(race, ref = "Other Hispanic")Mexican American 1.652
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black -0.286
## relevel(race, ref = "Other Hispanic")Non-Hispanic White -1.804
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial 1.452
## relevel(txcat, ref = "On Insulin")Diabetes Meds 25.256
## Pr(>|z|)
## (Intercept) < 2e-16
## relevel(bmicat, ref = "normal")obese 1.91e-06
## relevel(bmicat, ref = "normal")overweight 0.2169
## relevel(bmicat, ref = "normal")underweight 0.9765
## relevel(sex, ref = "female")male 0.0017
## relevel(agecat, ref = "20-29")30-39 0.0665
## relevel(agecat, ref = "20-29")40-69 6.90e-06
## relevel(race, ref = "Other Hispanic")Mexican American 0.0986
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black 0.7747
## relevel(race, ref = "Other Hispanic")Non-Hispanic White 0.0712
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial 0.1466
## relevel(txcat, ref = "On Insulin")Diabetes Meds < 2e-16
##
## (Intercept) ***
## relevel(bmicat, ref = "normal")obese ***
## relevel(bmicat, ref = "normal")overweight
## relevel(bmicat, ref = "normal")underweight
## relevel(sex, ref = "female")male **
## relevel(agecat, ref = "20-29")30-39 .
## relevel(agecat, ref = "20-29")40-69 ***
## relevel(race, ref = "Other Hispanic")Mexican American .
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black
## relevel(race, ref = "Other Hispanic")Non-Hispanic White .
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial
## relevel(txcat, ref = "On Insulin")Diabetes Meds ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2682.7 on 4458 degrees of freedom
## Residual deviance: 1496.8 on 4447 degrees of freedom
## (189 observations deleted due to missingness)
## AIC: 1520.8
##
## Number of Fisher Scoring iterations: 16
#Levels: Mexican American Non-Hispanic Black Non-Hispanic White Other Hispanic Other Race Including Multi-Racial
tab_model(m1,m2,show.se = TRUE,dv.labels = c("Model 1", "Model 2"),string.se = "Std Err",digits=4,show.ci = FALSE)
| Predictors | Model 1 Odds Ratios | Model 1 Std Err | Model 1 p | Model 2 Odds Ratios | Model 2 Std Err | Model 2 p |
|---|---|---|---|---|---|---|
| (Intercept) | 0.0284 | 0.1739 | <0.001 | 0.0030 | 0.5011 | <0.001 |
| obese | 6.4409 | 0.1857 | <0.001 | 3.0055 | 0.2311 | <0.001 |
| overweight | 2.2384 | 0.2045 | <0.001 | 1.3612 | 0.2497 | 0.217 |
| underweight | 0.0000 | 278.9415 | 0.963 | 0.0000 | 445.8704 | 0.977 |
| male |  |  |  | 1.5721 | 0.1442 | 0.002 |
| 30-39 |  |  |  | 2.4242 | 0.4825 | 0.066 |
| 40-69 |  |  |  | 6.8057 | 0.4265 | <0.001 |
| Mexican American |  |  |  | 1.4956 | 0.2437 | 0.099 |
| Non-Hispanic Black |  |  |  | 0.9297 | 0.2548 | 0.775 |
| Non-Hispanic White |  |  |  | 0.6523 | 0.2368 | 0.071 |
| Other Race Including Multi-Racial |  |  |  | 1.6720 | 0.3541 | 0.147 |
| Diabetes Meds |  |  |  | 41.3829 | 0.1474 | <0.001 |
| Observations | 4648 |  |  | 4459 |  |  |
| Tjur’s R2 | 0.039 |  |  | 0.441 |  |  |
In the unadjusted model (Model 1), both obesity and overweight are significantly associated with an A1c level above 6.5, which indicates a possible diabetic condition: compared with normal-weight participants, obese participants have about 6.4 times the odds (OR = 6.44, p < 0.001) and overweight participants about 2.2 times the odds (OR = 2.24, p < 0.001). After adjustment for sex, age, race and treatment (Model 2), obesity remains significant (OR = 3.01, p < 0.001), whereas overweight does not (OR = 1.36, p = 0.217). This makes sense in real life. In Model 2, men (OR = 1.57, p = 0.002) and participants aged 40-69 (OR = 6.81, p < 0.001) are also more likely to have an elevated A1c level, while no race category differs significantly from the reference at the 5% level. The underweight estimate has a very large standard error and cannot be interpreted. We conclude that there is a significant association between diabetes and obesity, and between diabetes and overweight before adjustment. Some predictors significantly increase the odds, while the predictors with odds ratios below 1 are not statistically significant.
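The odds ratios in the table are the exponentiated model coefficients. As a rough check, the sketch below recomputes the Model 1 odds ratios from the estimates and standard errors printed in the summary of m1. The approximate 95% Wald interval (estimate plus or minus 1.96 standard errors) is an added illustration and was not part of the original output.

```r
# Estimates and standard errors copied from summary(m1) above
est <- c(obese = 1.8627, overweight = 0.8058)
se  <- c(obese = 0.1857, overweight = 0.2045)

# Odds ratio = exp(coefficient)
odds_ratio <- exp(est)

# Approximate 95% Wald confidence limits (illustrative assumption)
ci_lower <- exp(est - 1.96 * se)
ci_upper <- exp(est + 1.96 * se)

round(cbind(odds_ratio, ci_lower, ci_upper), 2)
```

With the fitted model object available, exp(coef(m1)) returns the same odds ratios.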
# Basic barplot:
counts <- table(nhanes_dataset3$bmicat, nhanes_dataset3$gh)
counts <- scale(counts , FALSE, colSums(counts )) * 100
barplot(counts, main="Obesity and Overweight",
xlab="Diabetes", col=c("lightblue","red","green","yellow"),
legend = rownames(counts), beside=TRUE,names.arg=c("No-Diabetes", "Diabetes"),args.legend = list(x = "topright", bty = "n"))
The combined proportion of people who are obese or overweight was higher among persons with a diabetic condition than among people without one, as shown in our graph.
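The percentages plotted above are column percentages: within each diabetes group, the share of each BMI category. A small sketch with made-up counts (the numbers below are illustrative, not the NHANES counts) shows the same calculation with prop.table().

```r
# Made-up counts for illustration only (rows: BMI category, columns: gh group)
toy_counts <- matrix(c(50, 30, 15, 5,
                       10, 60, 30, 0),
                     nrow = 4,
                     dimnames = list(c("normal", "obese", "overweight", "underweight"),
                                     c("0", "1")))

# margin = 2 divides each cell by its column total
toy_pct <- prop.table(toy_counts, margin = 2) * 100
toy_pct
```

Each column of toy_pct sums to 100, so bars are comparable between the two groups even when the groups differ greatly in size.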
#Poisson regression
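# Convert gh from factor to numeric
# Caution: as.numeric() on a factor returns the level codes (1 and 2), not the labels 0 and 1;
# the output below was produced with this coding. as.numeric(as.character(...)) would give 0/1.
nhanes_dataset3$gh <- as.numeric(nhanes_dataset3$gh)
#Unadjusted model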
m3 <- glm(gh ~ relevel(bmicat, ref="normal"),data=nhanes_dataset3,family = poisson(link = "log"))
summary(m3)
##
## Call:
## glm(formula = gh ~ relevel(bmicat, ref = "normal"), family = poisson(link = "log"),
## data = nhanes_dataset3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.14733 -0.14733 -0.05863 -0.02737 0.84785
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.02725 0.02812 0.969
## relevel(bmicat, ref = "normal")obese 0.11656 0.03565 3.270
## relevel(bmicat, ref = "normal")overweight 0.03082 0.03747 0.823
## relevel(bmicat, ref = "normal")underweight -0.02725 0.11960 -0.228
## Pr(>|z|)
## (Intercept) 0.33253
## relevel(bmicat, ref = "normal")obese 0.00108 **
## relevel(bmicat, ref = "normal")overweight 0.41079
## relevel(bmicat, ref = "normal")underweight 0.81980
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 278.59 on 4647 degrees of freedom
## Residual deviance: 265.49 on 4644 degrees of freedom
## AIC: 9818
##
## Number of Fisher Scoring iterations: 4
#adjusted
m4 <- glm(gh ~ relevel(bmicat, ref="normal") + relevel(sex,ref="female") + relevel(agecat,ref="20-29") + relevel(race,ref="Other Hispanic") , data = nhanes_dataset3,family = poisson(link = "log"))
summary(m4)
##
## Call:
## glm(formula = gh ~ relevel(bmicat, ref = "normal") + relevel(sex,
## ref = "female") + relevel(agecat, ref = "20-29") + relevel(race,
## ref = "Other Hispanic"), family = poisson(link = "log"),
## data = nhanes_dataset3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.23603 -0.12673 -0.07130 0.00009 0.93619
##
## Coefficients:
## Estimate
## (Intercept) -0.017800
## relevel(bmicat, ref = "normal")obese 0.097667
## relevel(bmicat, ref = "normal")overweight 0.008399
## relevel(bmicat, ref = "normal")underweight -0.018697
## relevel(sex, ref = "female")male 0.019877
## relevel(agecat, ref = "20-29")30-39 0.009308
## relevel(agecat, ref = "20-29")40-69 0.102167
## relevel(race, ref = "Other Hispanic")Mexican American 0.015061
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black -0.007467
## relevel(race, ref = "Other Hispanic")Non-Hispanic White -0.042178
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial 0.025187
## Std. Error
## (Intercept) 0.057904
## relevel(bmicat, ref = "normal")obese 0.037544
## relevel(bmicat, ref = "normal")overweight 0.039176
## relevel(bmicat, ref = "normal")underweight 0.121524
## relevel(sex, ref = "female")male 0.028879
## relevel(agecat, ref = "20-29")30-39 0.049266
## relevel(agecat, ref = "20-29")40-69 0.039477
## relevel(race, ref = "Other Hispanic")Mexican American 0.052676
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black 0.054026
## relevel(race, ref = "Other Hispanic")Non-Hispanic White 0.047669
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial 0.074274
## z value
## (Intercept) -0.307
## relevel(bmicat, ref = "normal")obese 2.601
## relevel(bmicat, ref = "normal")overweight 0.214
## relevel(bmicat, ref = "normal")underweight -0.154
## relevel(sex, ref = "female")male 0.688
## relevel(agecat, ref = "20-29")30-39 0.189
## relevel(agecat, ref = "20-29")40-69 2.588
## relevel(race, ref = "Other Hispanic")Mexican American 0.286
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black -0.138
## relevel(race, ref = "Other Hispanic")Non-Hispanic White -0.885
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial 0.339
## Pr(>|z|)
## (Intercept) 0.75854
## relevel(bmicat, ref = "normal")obese 0.00929
## relevel(bmicat, ref = "normal")overweight 0.83025
## relevel(bmicat, ref = "normal")underweight 0.87772
## relevel(sex, ref = "female")male 0.49128
## relevel(agecat, ref = "20-29")30-39 0.85015
## relevel(agecat, ref = "20-29")40-69 0.00965
## relevel(race, ref = "Other Hispanic")Mexican American 0.77495
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black 0.89008
## relevel(race, ref = "Other Hispanic")Non-Hispanic White 0.37625
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial 0.73453
##
## (Intercept)
## relevel(bmicat, ref = "normal")obese **
## relevel(bmicat, ref = "normal")overweight
## relevel(bmicat, ref = "normal")underweight
## relevel(sex, ref = "female")male
## relevel(agecat, ref = "20-29")30-39
## relevel(agecat, ref = "20-29")40-69 **
## relevel(race, ref = "Other Hispanic")Mexican American
## relevel(race, ref = "Other Hispanic")Non-Hispanic Black
## relevel(race, ref = "Other Hispanic")Non-Hispanic White
## relevel(race, ref = "Other Hispanic")Other Race Including Multi-Racial
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 272.98 on 4458 degrees of freedom
## Residual deviance: 246.29 on 4448 degrees of freedom
## (189 observations deleted due to missingness)
## AIC: 9430.5
##
## Number of Fisher Scoring iterations: 4
#Levels: Mexican American Non-Hispanic Black Non-Hispanic White Other Hispanic Other Race Including Multi-Racial
tab_model(m3,m4,show.se = TRUE,dv.labels = c("Model 1", "Model 2"),string.se = "Std Err",digits=4,show.ci = FALSE)
| Predictors | Model 1 Incidence Rate Ratios | Model 1 Std Err | Model 1 p | Model 2 Incidence Rate Ratios | Model 2 Std Err | Model 2 p |
|---|---|---|---|---|---|---|
| (Intercept) | 1.0276 | 0.0281 | 0.333 | 0.9824 | 0.0579 | 0.759 |
| obese | 1.1236 | 0.0356 | 0.001 | 1.1026 | 0.0375 | 0.009 |
| overweight | 1.0313 | 0.0375 | 0.411 | 1.0084 | 0.0392 | 0.830 |
| underweight | 0.9731 | 0.1196 | 0.820 | 0.9815 | 0.1215 | 0.878 |
| male |  |  |  | 1.0201 | 0.0289 | 0.491 |
| 30-39 |  |  |  | 1.0094 | 0.0493 | 0.850 |
| 40-69 |  |  |  | 1.1076 | 0.0395 | 0.010 |
| Mexican American |  |  |  | 1.0152 | 0.0527 | 0.775 |
| Non-Hispanic Black |  |  |  | 0.9926 | 0.0540 | 0.890 |
| Non-Hispanic White |  |  |  | 0.9587 | 0.0477 | 0.376 |
| Other Race Including Multi-Racial |  |  |  | 1.0255 | 0.0743 | 0.735 |
| Observations | 4648 |  |  | 4459 |  |  |
| Nagelkerke’s R2 | 0.048 |  |  | 0.100 |  |  |
Unlike in the logistic regression, few variables are significant here: only obesity (p = 0.001 unadjusted, p = 0.009 adjusted) and the 40-69 age category (p = 0.010) reach significance. We also tried using all variables to predict “gh” and found that the only significant variable was “tx”, which records whether a person is taking insulin or diabetes medication. This also makes sense: an A1c level greater than 6.5 suggests diabetes, so those people are likely to be on treatment, whereas the others are not.
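One detail matters when reading the Poisson output above: gh was a factor with levels "0" and "1" before the conversion, and as.numeric() applied to a factor returns the internal level codes rather than the labels. The sketch below uses a small made-up vector to show the difference; the variable names are hypothetical.

```r
# Made-up factor with the same levels as gh
gh_factor <- factor(c(0, 1, 0, 0, 1))

# Level codes: 1 for "0", 2 for "1"
codes <- as.numeric(gh_factor)

# Original 0/1 values, recovered through the labels
values <- as.numeric(as.character(gh_factor))
```

A mean response slightly above 1, as implied by the intercept of m3, is what a 1/2 coding would produce.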
#Oneway ANOVA
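nhanes_dataset3$gh <- as.numeric(nhanes_dataset3$gh) # gh is already numeric at this point, so this has no effect
#Unadjusted model
# Note: aov() fits an ordinary linear model, so the family argument does not appear to be used by the fit
m5 <- aov(gh ~ relevel(bmicat, ref="normal"),data=nhanes_dataset3,family = poisson(link = "log"))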
summary(m5)
## Df Sum Sq Mean Sq F value Pr(>F)
## relevel(bmicat, ref = "normal") 3 14.3 4.766 62.28 <2e-16 ***
## Residuals 4644 355.4 0.077
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#As the p-value is less than the 0.05 significance level, we conclude that there are significant differences between the groups, marked with "*" in the model summary.
#The F-statistic is 62.28 and is highly significant: the corresponding p-value is far below the 1% (0.01) level of significance, so we reject the null hypothesis.
#To find out which pairs of bmicat categories differ, we use Tukey's HSD test.
TukeyHSD(m5, conf.level = 0.99)
## Tukey multiple comparisons of means
## 99% family-wise confidence level
##
## Fit: aov(formula = gh ~ relevel(bmicat, ref = "normal"), data = nhanes_dataset3, family = poisson(link = "log"))
##
## $`relevel(bmicat, ref = "normal")`
## diff lwr upr p adj
## obese-normal 0.12703650 0.0951796686 0.15889333 0.0000000
## overweight-normal 0.03215926 -0.0007912466 0.06510976 0.0127134
## underweight-normal -0.02761982 -0.1307607089 0.07552107 0.8382319
## overweight-obese -0.09487724 -0.1247793019 -0.06497518 0.0000000
## underweight-obese -0.15465632 -0.2568641481 -0.05244849 0.0000149
## underweight-overweight -0.05977908 -0.1623330549 0.04277490 0.2659503
#The TukeyHSD command shows the pair-wise difference of bmicat at 1% level of significance. Here, the "diff" column provides mean differences. The "lwr" and "upr" columns provide lower and upper 99% confidence bounds, respectively. Finally, the "p adj" column provides the p-values adjusted for the number of comparisons made.
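#The output shows that three differences are significant at the 1% level: obese-normal and overweight-obese (both with adjusted p-values reported as 0.0000000) and underweight-obese (adjusted p-value 0.0000149). The overweight-normal difference (adjusted p-value 0.0127) is not significant at the 1% level, since its 99% interval includes zero.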
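A pairwise difference is significant at the 1% family-wise level exactly when its 99% interval excludes zero. The sketch below applies that rule to the bounds printed in the TukeyHSD output for m5, rounded to four decimals; the data frame is built by hand for illustration.

```r
# Bounds copied (rounded) from the TukeyHSD(m5) output above
tukey_m5 <- data.frame(
  pair = c("obese-normal", "overweight-normal", "underweight-normal",
           "overweight-obese", "underweight-obese", "underweight-overweight"),
  lwr  = c(0.0952, -0.0008, -0.1308, -0.1248, -0.2569, -0.1623),
  upr  = c(0.1589,  0.0651,  0.0755, -0.0650, -0.0524,  0.0428)
)

# Significant at the 1% level when the interval does not contain zero
tukey_m5$significant <- tukey_m5$lwr > 0 | tukey_m5$upr < 0
tukey_m5[tukey_m5$significant, "pair"]
```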
plot(TukeyHSD(m5, conf.level = 0.99),las=1, col = "red")
# Extract the residuals
aov_residuals <- residuals(object = m5 )
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )
##
## Shapiro-Wilk normality test
##
## data: aov_residuals
## W = 0.48137, p-value < 2.2e-16
#The p-value is far below the threshold, so we reject the null hypothesis of normality: the normality assumption is not met. This is expected, because gh takes only two values here.
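A small p-value in the Shapiro-Wilk test is evidence against normality. That outcome is expected whenever the response takes only two values, as the simulated sketch below illustrates; the data are randomly generated and are not the NHANES data, and all names are hypothetical.

```r
set.seed(1)

# Simulated two-valued response and a four-level grouping factor
sim <- data.frame(
  y     = rbinom(500, size = 1, prob = 0.1),
  group = factor(sample(c("a", "b", "c", "d"), 500, replace = TRUE))
)

# One-way ANOVA on a binary response, then a normality test on its residuals
sim_fit  <- aov(y ~ group, data = sim)
sim_test <- shapiro.test(residuals(sim_fit))
sim_test$p.value < 0.05
```

For a two-valued outcome, a model designed for binary data, such as the logistic regression used earlier, avoids relying on this assumption.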
#adjusted
m6 <- aov(gh ~ relevel(bmicat, ref="normal") + relevel(sex,ref="female") + relevel(agecat,ref="20-29") + relevel(race,ref="Other Hispanic"), data = nhanes_dataset3,family = poisson(link = "log"))
summary(m6)
## Df Sum Sq Mean Sq F value Pr(>F)
## relevel(bmicat, ref = "normal") 3 14.5 4.817 64.247 < 2e-16
## relevel(sex, ref = "female") 1 0.5 0.515 6.867 0.00881
## relevel(agecat, ref = "20-29") 2 10.7 5.352 71.379 < 2e-16
## relevel(race, ref = "Other Hispanic") 4 3.3 0.830 11.070 6.2e-09
## Residuals 4448 333.5 0.075
##
## relevel(bmicat, ref = "normal") ***
## relevel(sex, ref = "female") **
## relevel(agecat, ref = "20-29") ***
## relevel(race, ref = "Other Hispanic") ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 189 observations deleted due to missingness
#Levels: Mexican American Non-Hispanic Black Non-Hispanic White Other Hispanic Other Race Including Multi-Racial
TukeyHSD(m6, conf.level = 0.99)
## Tukey multiple comparisons of means
## 99% family-wise confidence level
##
## Fit: aov(formula = gh ~ relevel(bmicat, ref = "normal") + relevel(sex, ref = "female") + relevel(agecat, ref = "20-29") + relevel(race, ref = "Other Hispanic"), data = nhanes_dataset3, family = poisson(link = "log"))
##
## $`relevel(bmicat, ref = "normal")`
## diff lwr upr p adj
## obese-normal 0.13057152 0.0983161994 0.16282685 0.0000000
## overweight-normal 0.03319435 -0.0001081167 0.06649683 0.0103335
## underweight-normal -0.02808511 -0.1316388620 0.07546865 0.8329961
## overweight-obese -0.09737717 -0.1275625163 -0.06719182 0.0000000
## underweight-obese -0.15865663 -0.2612503963 -0.05606286 0.0000090
## underweight-overweight -0.06127946 -0.1642072494 0.04164833 0.2480598
##
## $`relevel(sex, ref = "female")`
## diff lwr upr p adj
## male-female 0.02137691 0.0002377424 0.04251607 0.0091934
##
## $`relevel(agecat, ref = "20-29")`
## diff lwr upr p adj
## 30-39-20-29 0.002731517 -0.03664575 0.04210879 0.9777117
## 40-69-20-29 0.101501264 0.07007754 0.13292499 0.0000000
## 40-69-30-39 0.098769747 0.06689899 0.13064051 0.0000000
##
## $`relevel(race, ref = "Other Hispanic")`
## diff
## Mexican American-Other Hispanic 0.016304095
## Non-Hispanic Black-Other Hispanic -0.008334878
## Non-Hispanic White-Other Hispanic -0.045201044
## Other Race Including Multi-Racial-Other Hispanic 0.027619561
## Non-Hispanic Black-Mexican American -0.024638973
## Non-Hispanic White-Mexican American -0.061505138
## Other Race Including Multi-Racial-Mexican American 0.011315466
## Non-Hispanic White-Non-Hispanic Black -0.036866166
## Other Race Including Multi-Racial-Non-Hispanic Black 0.035954439
## Other Race Including Multi-Racial-Non-Hispanic White 0.072820604
## lwr
## Mexican American-Other Hispanic -0.03309305
## Non-Hispanic Black-Other Hispanic -0.05879065
## Non-Hispanic White-Other Hispanic -0.08953942
## Other Race Including Multi-Racial-Other Hispanic -0.04164594
## Non-Hispanic Black-Mexican American -0.06771172
## Non-Hispanic White-Mexican American -0.09721744
## Other Race Including Multi-Racial-Mexican American -0.05277187
## Non-Hispanic White-Non-Hispanic Black -0.07402898
## Other Race Including Multi-Racial-Non-Hispanic Black -0.02895237
## Other Race Including Multi-Racial-Non-Hispanic White 0.01254623
## upr
## Mexican American-Other Hispanic 0.0657012397
## Non-Hispanic Black-Other Hispanic 0.0421208919
## Non-Hispanic White-Other Hispanic -0.0008626708
## Other Race Including Multi-Racial-Other Hispanic 0.0968850627
## Non-Hispanic Black-Mexican American 0.0184337752
## Non-Hispanic White-Mexican American -0.0257928408
## Other Race Including Multi-Racial-Mexican American 0.0754028023
## Non-Hispanic White-Non-Hispanic Black 0.0002966512
## Other Race Including Multi-Racial-Non-Hispanic Black 0.1008612444
## Other Race Including Multi-Racial-Non-Hispanic White 0.1330949770
## p adj
## Mexican American-Other Hispanic 0.8195161
## Non-Hispanic Black-Other Hispanic 0.9833909
## Non-Hispanic White-Other Hispanic 0.0080699
## Other Race Including Multi-Racial-Other Hispanic 0.6921159
## Non-Hispanic Black-Mexican American 0.3376950
## Non-Hispanic White-Mexican American 0.0000003
## Other Race Including Multi-Racial-Mexican American 0.9787273
## Non-Hispanic White-Non-Hispanic Black 0.0109056
## Other Race Including Multi-Racial-Non-Hispanic Black 0.3712817
## Other Race Including Multi-Racial-Non-Hispanic White 0.0008052
#plot(TukeyHSD(m6, conf.level = 0.99),las=1, col = "red")
# Extract the residuals
aov_residuals2 <- residuals(object = m6 )
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals2 )
##
## Shapiro-Wilk normality test
##
## data: aov_residuals2
## W = 0.59646, p-value < 2.2e-16
Conclusions
We compared three methods: logistic regression, Poisson regression and one-way analysis of variance. Logistic regression gave the clearest results: its adjusted model reached a Tjur’s R2 of 0.441, compared with 0.039 for the unadjusted model, while the adjusted Poisson model reached a Nagelkerke’s R2 of only 0.100. The one-way ANOVA detected differences between BMI categories, but its residuals departed strongly from normality (Shapiro-Wilk p-value < 2.2e-16), so its results should be read with caution. From these results, we would recommend logistic regression for studying the association between glycohemoglobin, overweight and obesity in this data set.
The purpose of this project was to study the association between glycohemoglobin, overweight and obesity. Our results support findings from other studies on the relationship between overweight, obesity and glycohemoglobin in people with diabetic conditions. In the adjusted logistic model, elevated glycohemoglobin is significantly associated with obesity, male sex and the 40-69 age category.
In conclusion, of the three methods above, we prefer logistic regression for analysing this data set. Obesity and overweight are associated with diabetes, although this analysis cannot establish that they cause it. Logistic regression and Poisson regression are similar. The analysis relies on the algebraic relationship between probability, logit and log odds. Using the definition of log odds, the parameters of the model can be estimated with either logistic or Poisson regression. Although the parameter estimates are not identical between logistic and Poisson regression (due to the use of binning in Poisson regression), the probability predictions of the two models are similar.
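The relationship between probability, odds and log odds mentioned above can be written out directly. The sketch below uses the intercept of the unadjusted logistic model (m1) as input; plogis() and qlogis() are the base R inverse-logit and logit functions, and the variable names are chosen for illustration.

```r
# Intercept of m1: log odds of gh = 1 in the normal-weight reference group
log_odds <- -3.5612

odds <- exp(log_odds)          # odds = exp(log odds)
prob <- odds / (1 + odds)      # probability = odds / (1 + odds)

# Base R equivalents
prob_check     <- plogis(log_odds)   # inverse logit
log_odds_check <- qlogis(prob)       # logit
```

The resulting odds match the intercept odds ratio reported for Model 1 in the logistic regression table.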