The dataset under analysis includes data on potential medical indicators of type 2 diabetes for 403 patients. Using response surface methods, the experiment will test which of 4 factors suggest a predisposition to the onset of type 2 diabetes as measured by the response variable, percent glycosolated hemoglobin. The World Health Organization suggests using a glycosolated hemoglobin threshold of at least 6.5% to determine the presence of type 2 diabetes.
remove(list=ls())
diabetes = read.csv("C:/Users/Trevor/Documents/diabetes.csv", header=TRUE)
#view first few lines
head(diabetes)
## glyhb newchol newstabglu newhdl newage chol stabglu hdl age gender
## 1 7.87 3 3 2 3 292 235 55 79 male
## 2 4.87 3 3 2 2 293 115 54 50 male
## 3 9.18 2 3 3 2 219 112 73 59 male
## 4 5.23 2 3 3 3 219 112 73 76 male
## 5 7.40 3 2 3 3 246 104 62 66 female
## 6 13.70 3 3 2 2 237 233 58 49 female
## height ratio location weight frame bp.1s bp.1d bp.2s bp.2d waist hip
## 1 70 5.3 Buckingham 165 170 90 170 100 39 41
## 2 71 5.4 Buckingham 170 medium 131 75 NA NA 34 39
## 3 66 3.0 Buckingham 170 medium 146 92 168 98 37 40
## 4 64 3.0 Buckingham 105 medium 125 82 NA NA 29 33
## 5 66 4.0 Louisa 189 medium 200 94 208 90 45 46
## 6 62 4.1 Buckingham 189 large 130 90 NA NA 43 47
## time.ppn
## 1 240
## 2 120
## 3 120
## 4 60
## 5 195
## 6 195
#observe the structure of the data
str(diabetes)
## 'data.frame': 403 obs. of 22 variables:
## $ glyhb : num 7.87 4.87 9.18 5.23 7.4 ...
## $ newchol : int 3 3 2 2 3 3 3 2 3 2 ...
## $ newstabglu: int 3 3 3 3 2 3 3 3 3 3 ...
## $ newhdl : int 2 2 3 3 3 2 3 3 3 3 ...
## $ newage : int 3 2 2 3 3 2 3 3 3 3 ...
## $ chol : int 292 293 219 219 246 237 296 213 232 209 ...
## $ stabglu : int 235 115 112 112 104 233 262 203 184 113 ...
## $ hdl : int 55 54 73 73 62 58 60 75 114 65 ...
## $ age : int 79 50 59 76 66 49 74 71 91 76 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 1 1 1 ...
## $ height : int 70 71 66 64 66 62 63 63 61 60 ...
## $ ratio : num 5.3 5.4 3 3 4 ...
## $ location : Factor w/ 2 levels "Buckingham","Louisa": 1 1 1 1 2 1 1 1 1 1 ...
## $ weight : int 165 170 170 105 189 189 183 165 127 143 ...
## $ frame : Factor w/ 4 levels "","large","medium",..: 1 3 3 3 3 2 2 3 1 2 ...
## $ bp.1s : int 170 131 146 125 200 130 159 150 170 156 ...
## $ bp.1d : int 90 75 92 82 94 90 99 80 82 78 ...
## $ bp.2s : int 170 NA 168 NA 208 NA 160 145 NA 144 ...
## $ bp.2d : int 100 NA 98 NA 90 NA 103 80 NA 76 ...
## $ waist : int 39 34 37 29 45 43 42 34 35 35 ...
## $ hip : int 41 39 40 33 46 47 48 42 38 40 ...
## $ time.ppn : int 240 120 120 60 195 195 300 960 120 1200 ...
#403 observations of 24 variables
The data are tabluated into 24 columns, with detailed measurements of various vital functions and characteristics of each patient. This project will test the effects of 4 factors on the response. Due to the nature of the experimental design (3^k), the 4 predictor variables needed to be reduced to three levels for use in the linear model. This was performed by defining values less than the individual column quantiles as “1”, values greater than or equal to the first quantile and less than or equal to the third quantiles as “2”, and values greater than the third quantiles as “3”.
summary(diabetes)
## glyhb newchol newstabglu newhdl newage
## Min. : 2.68 Min. :1.00 Min. :1 Min. :1.00 Min. :1.00
## 1st Qu.: 4.38 1st Qu.:2.00 1st Qu.:2 1st Qu.:2.00 1st Qu.:2.00
## Median : 4.84 Median :2.00 Median :2 Median :2.00 Median :2.00
## Mean : 5.59 Mean :2.01 Mean :2 Mean :2.01 Mean :1.99
## 3rd Qu.: 5.60 3rd Qu.:2.00 3rd Qu.:2 3rd Qu.:2.00 3rd Qu.:2.00
## Max. :16.11 Max. :3.00 Max. :3 Max. :3.00 Max. :3.00
## NA's :13
## chol stabglu hdl age gender
## Min. : 78 Min. : 48 Min. : 12.0 Min. :19.0 female:234
## 1st Qu.:179 1st Qu.: 81 1st Qu.: 38.0 1st Qu.:34.0 male :169
## Median :204 Median : 89 Median : 46.0 Median :45.0
## Mean :208 Mean :107 Mean : 50.5 Mean :46.9
## 3rd Qu.:230 3rd Qu.:106 3rd Qu.: 59.0 3rd Qu.:60.0
## Max. :443 Max. :385 Max. :120.0 Max. :92.0
## NA's :1 NA's :1
## height ratio location weight frame
## Min. :52 Min. : 1.50 Buckingham:200 Min. : 99 : 12
## 1st Qu.:63 1st Qu.: 3.20 Louisa :203 1st Qu.:151 large :103
## Median :66 Median : 4.20 Median :172 medium:184
## Mean :66 Mean : 4.52 Mean :178 small :104
## 3rd Qu.:69 3rd Qu.: 5.40 3rd Qu.:200
## Max. :76 Max. :19.30 Max. :325
## NA's :5 NA's :1 NA's :1
## bp.1s bp.1d bp.2s bp.2d
## Min. : 90 Min. : 48.0 Min. :110 Min. : 60.0
## 1st Qu.:121 1st Qu.: 75.0 1st Qu.:138 1st Qu.: 84.0
## Median :136 Median : 82.0 Median :149 Median : 92.0
## Mean :137 Mean : 83.3 Mean :152 Mean : 92.5
## 3rd Qu.:147 3rd Qu.: 90.0 3rd Qu.:161 3rd Qu.:100.0
## Max. :250 Max. :124.0 Max. :238 Max. :124.0
## NA's :5 NA's :5 NA's :262 NA's :262
## waist hip time.ppn
## Min. :26.0 Min. :30 Min. : 5
## 1st Qu.:33.0 1st Qu.:39 1st Qu.: 90
## Median :37.0 Median :42 Median : 240
## Mean :37.9 Mean :43 Mean : 341
## 3rd Qu.:41.0 3rd Qu.:46 3rd Qu.: 518
## Max. :56.0 Max. :64 Max. :1560
## NA's :2 NA's :2 NA's :3
Randomization is not applicable here due to the nature of the data. 1046 subjects were initially interviewed in a study to understand the prevalence of obesity and diabetes in central Virginia for African Americans. The 403 of these subjects included in this dataset were the onces that were actually screened for diabetes.
The null hypothesis is that each of the 4 predictor variabes has no effect on the response, percent glycosolated hemoglobin. Analysis of Variance (ANOVA) will be used to understand the main effects of cholesterol (“newchol”, in mg/dL), stabilized glucose (“newstabglu”, in mg/dL), high density lipoprotein (“newhdl”, in mg/dL), and age (“newage”, in years) on the response.
For these data, there are no replicates. Furthermore, the background information does not expressly state that repeated measures were used in collecting the data from each patient.
Blocking will not be utilized here for simplicity’s sake and because the objective is to focus on the design of the experiment.
Graph showing trends in the data:
par(mfrow=c(2,2))
plot(diabetes$glyhb~diabetes$newchol,xlab="newchol (mg/dL)",ylab="level of glycosolated hemoglobin (%)")
plot(diabetes$glyhb~diabetes$newstabglu,xlab="newstabglu (mg/dL)",ylab="level of glycosolated hemoglobin (%)")
plot(diabetes$glyhb~diabetes$newhdl,xlab="newhdl (mg/dL)",ylab="level of glycosolated hemoglobin (%)")
plot(diabetes$glyhb~diabetes$newage,xlab="newage (years)",ylab="level of glycosolated hemoglobin (%)")
Download Response Surface Methods Package:
library(rsm)
## Warning: package 'rsm' was built under R version 3.1.2
Create Initial model:
model1=rsm(glyhb~SO(newchol,newstabglu,newhdl,newage), data=diabetes)
summary(model1)
##
## Call:
## rsm(formula = glyhb ~ SO(newchol, newstabglu, newhdl, newage),
## data = diabetes)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.24562 2.01091 3.11 0.0020 **
## newchol -1.00159 0.86695 -1.16 0.2487
## newstabglu -3.91489 0.92190 -4.25 2.7e-05 ***
## newhdl 1.02032 0.94603 1.08 0.2815
## newage 0.66410 0.90002 0.74 0.4611
## newchol:newstabglu 0.59059 0.21422 2.76 0.0061 **
## newchol:newhdl -0.00549 0.19913 -0.03 0.9780
## newchol:newage -0.03363 0.21864 -0.15 0.8779
## newstabglu:newhdl -0.36435 0.20221 -1.80 0.0724 .
## newstabglu:newage 0.02168 0.22088 0.10 0.9219
## newhdl:newage -0.22777 0.20285 -1.12 0.2622
## newchol^2 0.06168 0.18779 0.33 0.7427
## newstabglu^2 1.21910 0.18987 6.42 4.1e-10 ***
## newhdl^2 -0.03199 0.18076 -0.18 0.8596
## newage^2 0.04085 0.19645 0.21 0.8354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.432, Adjusted R-squared: 0.411
## F-statistic: 20.4 on 14 and 375 DF, p-value: <2e-16
##
## Analysis of Variance Table
##
## Response: glyhb
## Df Sum Sq Mean Sq F value
## FO(newchol, newstabglu, newhdl, newage) 4 621 155.3 52.45
## TWI(newchol, newstabglu, newhdl, newage) 6 101 16.8 5.67
## PQ(newchol, newstabglu, newhdl, newage) 4 124 30.9 10.43
## Residuals 375 1111 3.0
## Lack of fit 54 189 3.5 1.22
## Pure error 321 922 2.9
## Pr(>F)
## FO(newchol, newstabglu, newhdl, newage) < 2e-16
## TWI(newchol, newstabglu, newhdl, newage) 1.2e-05
## PQ(newchol, newstabglu, newhdl, newage) 5.1e-08
## Residuals
## Lack of fit 0.16
## Pure error
##
## Stationary point of response surface:
## newchol newstabglu newhdl newage
## 5.4004 0.7288 3.0250 2.3339
##
## Eigenanalysis:
## $values
## [1] 1.3140 0.1250 -0.0156 -0.1337
##
## $vectors
## [,1] [,2] [,3] [,4]
## newchol -0.22751 -0.28529 0.917411 0.1588
## newstabglu -0.96457 -0.02125 -0.221215 -0.1422
## newhdl 0.13246 -0.54309 0.007451 -0.8291
## newage -0.01706 0.78944 0.330703 -0.5168
In this particular model based on response surface methods, we reject the H0 that “newstabglu” has no effect on the response. Furthermore, we reject the H0 that “newchol” and “newstabglu” jointly have no effect on the response. We also reject the H0 that the pure quadratic of “newstabglu” has no effect on the response.
The stationary point of the second order surface is as follows: newchol (5.4004142), newstabglu (0.7288239), newhdl (3.0250310), newage (2.3339078)
The response surface can be characterized based on the following Eigenvalues: newchol (1.31396501), newstabglu (0.12498168), newhdl (-0.01560483), newage (-0.13369428)
newchol:newstabglu (maximum response), newhdl:newage (minimum response), newchol:newhdl (saddle point), newchol:newage (saddle point), newstabglu:newhdl (saddle point), newstabglu:newage (saddle point)
par(mfrow=c(2,3))
contour(model1, ~newchol+newstabglu+newhdl+newage, image=TRUE, at=summary(model1$canonical$xs))
View response surfaces. One thing to remember, however, is that these can be slightly misleading because they only take into account two factors at a time. In this case, comparing just two factors at a time can produce contradictory results. For this reason, it is crucial to compare all factors simultaneously to understand the true optimal operating point.
library(rgl)
## Warning: package 'rgl' was built under R version 3.1.2
attach(diabetes)
par(mfrow=c(1,1))
persp(model1, ~ newchol + newstabglu, image = TRUE,
at = c(summary(model1)$canonical$xs, Block="B2"),theta=30,zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model1, ~ newchol + newhdl, image = TRUE,
at = c(summary(model1)$canonical$xs, Block="B2"),theta=30,zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model1, ~ newchol + newage, image = TRUE,
at = c(summary(model1)$canonical$xs, Block="B2"),theta=30,zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model1, ~ newstabglu + newhdl, image = TRUE,
at = c(summary(model1)$canonical$xs, Block="B2"),zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model1, ~ newstabglu + newage, image = TRUE,
at = c(summary(model1)$canonical$xs, Block="B2"),zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
persp(model1, ~ newhdl + newage, image = TRUE,
at = c(summary(model1)$canonical$xs, Block="B2"),zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
Estimate Parameters using Shapiro-Wilk test of normality
shapiro.test(diabetes$glyhb)
##
## Shapiro-Wilk normality test
##
## data: diabetes$glyhb
## W = 0.7218, p-value < 2.2e-16
We reject the H0 that the response variable, percent glycosolcated hemoglobin, is normally distributed. Thus, we attempt to transform the data:
shapiro.test(diabetes$glyhb^2)
##
## Shapiro-Wilk normality test
##
## data: diabetes$glyhb^2
## W = 0.5748, p-value < 2.2e-16
shapiro.test(sqrt(diabetes$glyhb))
##
## Shapiro-Wilk normality test
##
## data: sqrt(diabetes$glyhb)
## W = 0.7956, p-value < 2.2e-16
shapiro.test(log(diabetes$glyhb))
##
## Shapiro-Wilk normality test
##
## data: log(diabetes$glyhb)
## W = 0.8624, p-value < 2.2e-16
shapiro.test(log10(diabetes$glyhb))
##
## Shapiro-Wilk normality test
##
## data: log10(diabetes$glyhb)
## W = 0.8624, p-value < 2.2e-16
shapiro.test(1/diabetes$glyhb)
##
## Shapiro-Wilk normality test
##
## data: 1/diabetes$glyhb
## W = 0.9487, p-value = 2.171e-10
It appears that the data cannot be readily transformed into a normal distribution, which causes reliability issues with the model.
Diagnostics and Model Adequacy Checking:
par(mfrow=c(2,2))
plot(model1)
The model appears to have a very good fit to the data since the points are evenly dispersed across the Cartesian plane of residual error and the fitted model.
References:
Department of Biostatistics, Vanderbilt University. 2014. Originally obtained from Dr. John Schorling, Department of Medicine, University of Virginia School of Medicine. “http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.html”
Willems JP, Saunders JT, DE Hunt, JB Schorling: Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Medical Journal 90:814-820; 1997