Recipe 9: Response Surface Methods for Diabetes Dataset

Design of Experiments

ISYE 6020

Trevor Manzanares

Rensselaer Polytechnic Institute

12/1/14

The dataset under analysis includes data on potential medical indicators of type 2 diabetes for 403 patients. Using response surface methods, the experiment will test which of 4 factors suggest a predisposition to the onset of type 2 diabetes as measured by the response variable, percent glycosolated hemoglobin. The World Health Organization suggests using a glycosolated hemoglobin threshold of at least 6.5% to determine the presence of type 2 diabetes.

remove(list=ls())
diabetes = read.csv("C:/Users/Trevor/Documents/diabetes.csv", header=TRUE) 

#view first few lines
head(diabetes)
##   glyhb newchol newstabglu newhdl newage chol stabglu hdl age gender
## 1  7.87       3          3      2      3  292     235  55  79   male
## 2  4.87       3          3      2      2  293     115  54  50   male
## 3  9.18       2          3      3      2  219     112  73  59   male
## 4  5.23       2          3      3      3  219     112  73  76   male
## 5  7.40       3          2      3      3  246     104  62  66 female
## 6 13.70       3          3      2      2  237     233  58  49 female
##   height ratio   location weight  frame bp.1s bp.1d bp.2s bp.2d waist hip
## 1     70   5.3 Buckingham    165          170    90   170   100    39  41
## 2     71   5.4 Buckingham    170 medium   131    75    NA    NA    34  39
## 3     66   3.0 Buckingham    170 medium   146    92   168    98    37  40
## 4     64   3.0 Buckingham    105 medium   125    82    NA    NA    29  33
## 5     66   4.0     Louisa    189 medium   200    94   208    90    45  46
## 6     62   4.1 Buckingham    189  large   130    90    NA    NA    43  47
##   time.ppn
## 1      240
## 2      120
## 3      120
## 4       60
## 5      195
## 6      195
#observe the structure of the data
str(diabetes)
## 'data.frame':    403 obs. of  22 variables:
##  $ glyhb     : num  7.87 4.87 9.18 5.23 7.4 ...
##  $ newchol   : int  3 3 2 2 3 3 3 2 3 2 ...
##  $ newstabglu: int  3 3 3 3 2 3 3 3 3 3 ...
##  $ newhdl    : int  2 2 3 3 3 2 3 3 3 3 ...
##  $ newage    : int  3 2 2 3 3 2 3 3 3 3 ...
##  $ chol      : int  292 293 219 219 246 237 296 213 232 209 ...
##  $ stabglu   : int  235 115 112 112 104 233 262 203 184 113 ...
##  $ hdl       : int  55 54 73 73 62 58 60 75 114 65 ...
##  $ age       : int  79 50 59 76 66 49 74 71 91 76 ...
##  $ gender    : Factor w/ 2 levels "female","male": 2 2 2 2 1 1 1 1 1 1 ...
##  $ height    : int  70 71 66 64 66 62 63 63 61 60 ...
##  $ ratio     : num  5.3 5.4 3 3 4 ...
##  $ location  : Factor w/ 2 levels "Buckingham","Louisa": 1 1 1 1 2 1 1 1 1 1 ...
##  $ weight    : int  165 170 170 105 189 189 183 165 127 143 ...
##  $ frame     : Factor w/ 4 levels "","large","medium",..: 1 3 3 3 3 2 2 3 1 2 ...
##  $ bp.1s     : int  170 131 146 125 200 130 159 150 170 156 ...
##  $ bp.1d     : int  90 75 92 82 94 90 99 80 82 78 ...
##  $ bp.2s     : int  170 NA 168 NA 208 NA 160 145 NA 144 ...
##  $ bp.2d     : int  100 NA 98 NA 90 NA 103 80 NA 76 ...
##  $ waist     : int  39 34 37 29 45 43 42 34 35 35 ...
##  $ hip       : int  41 39 40 33 46 47 48 42 38 40 ...
##  $ time.ppn  : int  240 120 120 60 195 195 300 960 120 1200 ...
#403 observations of 24 variables

The data are tabluated into 24 columns, with detailed measurements of various vital functions and characteristics of each patient. This project will test the effects of 4 factors on the response. Due to the nature of the experimental design (3^k), the 4 predictor variables needed to be reduced to three levels for use in the linear model. This was performed by defining values less than the individual column quantiles as “1”, values greater than or equal to the first quantile and less than or equal to the third quantiles as “2”, and values greater than the third quantiles as “3”.

summary(diabetes)
##      glyhb          newchol       newstabglu     newhdl         newage    
##  Min.   : 2.68   Min.   :1.00   Min.   :1    Min.   :1.00   Min.   :1.00  
##  1st Qu.: 4.38   1st Qu.:2.00   1st Qu.:2    1st Qu.:2.00   1st Qu.:2.00  
##  Median : 4.84   Median :2.00   Median :2    Median :2.00   Median :2.00  
##  Mean   : 5.59   Mean   :2.01   Mean   :2    Mean   :2.01   Mean   :1.99  
##  3rd Qu.: 5.60   3rd Qu.:2.00   3rd Qu.:2    3rd Qu.:2.00   3rd Qu.:2.00  
##  Max.   :16.11   Max.   :3.00   Max.   :3    Max.   :3.00   Max.   :3.00  
##  NA's   :13                                                               
##       chol        stabglu         hdl             age          gender   
##  Min.   : 78   Min.   : 48   Min.   : 12.0   Min.   :19.0   female:234  
##  1st Qu.:179   1st Qu.: 81   1st Qu.: 38.0   1st Qu.:34.0   male  :169  
##  Median :204   Median : 89   Median : 46.0   Median :45.0               
##  Mean   :208   Mean   :107   Mean   : 50.5   Mean   :46.9               
##  3rd Qu.:230   3rd Qu.:106   3rd Qu.: 59.0   3rd Qu.:60.0               
##  Max.   :443   Max.   :385   Max.   :120.0   Max.   :92.0               
##  NA's   :1                   NA's   :1                                  
##      height       ratio             location       weight       frame    
##  Min.   :52   Min.   : 1.50   Buckingham:200   Min.   : 99         : 12  
##  1st Qu.:63   1st Qu.: 3.20   Louisa    :203   1st Qu.:151   large :103  
##  Median :66   Median : 4.20                    Median :172   medium:184  
##  Mean   :66   Mean   : 4.52                    Mean   :178   small :104  
##  3rd Qu.:69   3rd Qu.: 5.40                    3rd Qu.:200               
##  Max.   :76   Max.   :19.30                    Max.   :325               
##  NA's   :5    NA's   :1                        NA's   :1                 
##      bp.1s         bp.1d           bp.2s         bp.2d      
##  Min.   : 90   Min.   : 48.0   Min.   :110   Min.   : 60.0  
##  1st Qu.:121   1st Qu.: 75.0   1st Qu.:138   1st Qu.: 84.0  
##  Median :136   Median : 82.0   Median :149   Median : 92.0  
##  Mean   :137   Mean   : 83.3   Mean   :152   Mean   : 92.5  
##  3rd Qu.:147   3rd Qu.: 90.0   3rd Qu.:161   3rd Qu.:100.0  
##  Max.   :250   Max.   :124.0   Max.   :238   Max.   :124.0  
##  NA's   :5     NA's   :5       NA's   :262   NA's   :262    
##      waist           hip        time.ppn   
##  Min.   :26.0   Min.   :30   Min.   :   5  
##  1st Qu.:33.0   1st Qu.:39   1st Qu.:  90  
##  Median :37.0   Median :42   Median : 240  
##  Mean   :37.9   Mean   :43   Mean   : 341  
##  3rd Qu.:41.0   3rd Qu.:46   3rd Qu.: 518  
##  Max.   :56.0   Max.   :64   Max.   :1560  
##  NA's   :2      NA's   :2    NA's   :3

Randomization is not applicable here due to the nature of the data. 1046 subjects were initially interviewed in a study to understand the prevalence of obesity and diabetes in central Virginia for African Americans. The 403 of these subjects included in this dataset were the onces that were actually screened for diabetes.

The null hypothesis is that each of the 4 predictor variabes has no effect on the response, percent glycosolated hemoglobin. Analysis of Variance (ANOVA) will be used to understand the main effects of cholesterol (“newchol”, in mg/dL), stabilized glucose (“newstabglu”, in mg/dL), high density lipoprotein (“newhdl”, in mg/dL), and age (“newage”, in years) on the response.

For these data, there are no replicates. Furthermore, the background information does not expressly state that repeated measures were used in collecting the data from each patient.

Blocking will not be utilized here for simplicity’s sake and because the objective is to focus on the design of the experiment.

Graph showing trends in the data:

par(mfrow=c(2,2))
plot(diabetes$glyhb~diabetes$newchol,xlab="newchol (mg/dL)",ylab="level of glycosolated hemoglobin (%)")
plot(diabetes$glyhb~diabetes$newstabglu,xlab="newstabglu (mg/dL)",ylab="level of glycosolated hemoglobin (%)")
plot(diabetes$glyhb~diabetes$newhdl,xlab="newhdl (mg/dL)",ylab="level of glycosolated hemoglobin (%)")
plot(diabetes$glyhb~diabetes$newage,xlab="newage (years)",ylab="level of glycosolated hemoglobin (%)")

plot of chunk unnamed-chunk-3

Download Response Surface Methods Package:

library(rsm)
## Warning: package 'rsm' was built under R version 3.1.2

Create Initial model:

model1=rsm(glyhb~SO(newchol,newstabglu,newhdl,newage), data=diabetes)
summary(model1)
## 
## Call:
## rsm(formula = glyhb ~ SO(newchol, newstabglu, newhdl, newage), 
##     data = diabetes)
## 
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.24562    2.01091    3.11   0.0020 ** 
## newchol            -1.00159    0.86695   -1.16   0.2487    
## newstabglu         -3.91489    0.92190   -4.25  2.7e-05 ***
## newhdl              1.02032    0.94603    1.08   0.2815    
## newage              0.66410    0.90002    0.74   0.4611    
## newchol:newstabglu  0.59059    0.21422    2.76   0.0061 ** 
## newchol:newhdl     -0.00549    0.19913   -0.03   0.9780    
## newchol:newage     -0.03363    0.21864   -0.15   0.8779    
## newstabglu:newhdl  -0.36435    0.20221   -1.80   0.0724 .  
## newstabglu:newage   0.02168    0.22088    0.10   0.9219    
## newhdl:newage      -0.22777    0.20285   -1.12   0.2622    
## newchol^2           0.06168    0.18779    0.33   0.7427    
## newstabglu^2        1.21910    0.18987    6.42  4.1e-10 ***
## newhdl^2           -0.03199    0.18076   -0.18   0.8596    
## newage^2            0.04085    0.19645    0.21   0.8354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared:  0.432,  Adjusted R-squared:  0.411 
## F-statistic: 20.4 on 14 and 375 DF,  p-value: <2e-16
## 
## Analysis of Variance Table
## 
## Response: glyhb
##                                           Df Sum Sq Mean Sq F value
## FO(newchol, newstabglu, newhdl, newage)    4    621   155.3   52.45
## TWI(newchol, newstabglu, newhdl, newage)   6    101    16.8    5.67
## PQ(newchol, newstabglu, newhdl, newage)    4    124    30.9   10.43
## Residuals                                375   1111     3.0        
## Lack of fit                               54    189     3.5    1.22
## Pure error                               321    922     2.9        
##                                           Pr(>F)
## FO(newchol, newstabglu, newhdl, newage)  < 2e-16
## TWI(newchol, newstabglu, newhdl, newage) 1.2e-05
## PQ(newchol, newstabglu, newhdl, newage)  5.1e-08
## Residuals                                       
## Lack of fit                                 0.16
## Pure error                                      
## 
## Stationary point of response surface:
##    newchol newstabglu     newhdl     newage 
##     5.4004     0.7288     3.0250     2.3339 
## 
## Eigenanalysis:
## $values
## [1]  1.3140  0.1250 -0.0156 -0.1337
## 
## $vectors
##                [,1]     [,2]      [,3]    [,4]
## newchol    -0.22751 -0.28529  0.917411  0.1588
## newstabglu -0.96457 -0.02125 -0.221215 -0.1422
## newhdl      0.13246 -0.54309  0.007451 -0.8291
## newage     -0.01706  0.78944  0.330703 -0.5168

In this particular model based on response surface methods, we reject the H0 that “newstabglu” has no effect on the response. Furthermore, we reject the H0 that “newchol” and “newstabglu” jointly have no effect on the response. We also reject the H0 that the pure quadratic of “newstabglu” has no effect on the response.

The stationary point of the second order surface is as follows: newchol (5.4004142), newstabglu (0.7288239), newhdl (3.0250310), newage (2.3339078)

The response surface can be characterized based on the following Eigenvalues: newchol (1.31396501), newstabglu (0.12498168), newhdl (-0.01560483), newage (-0.13369428)

newchol:newstabglu (maximum response), newhdl:newage (minimum response), newchol:newhdl (saddle point), newchol:newage (saddle point), newstabglu:newhdl (saddle point), newstabglu:newage (saddle point)

par(mfrow=c(2,3))
contour(model1, ~newchol+newstabglu+newhdl+newage, image=TRUE, at=summary(model1$canonical$xs))

plot of chunk unnamed-chunk-6

View response surfaces. One thing to remember, however, is that these can be slightly misleading because they only take into account two factors at a time. In this case, comparing just two factors at a time can produce contradictory results. For this reason, it is crucial to compare all factors simultaneously to understand the true optimal operating point.

library(rgl)
## Warning: package 'rgl' was built under R version 3.1.2
attach(diabetes)

par(mfrow=c(1,1))

persp(model1, ~ newchol + newstabglu, image = TRUE,  
    at = c(summary(model1)$canonical$xs, Block="B2"),theta=30,zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model1, ~ newchol + newhdl, image = TRUE,  
    at = c(summary(model1)$canonical$xs, Block="B2"),theta=30,zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model1, ~ newchol + newage, image = TRUE,  
    at = c(summary(model1)$canonical$xs, Block="B2"),theta=30,zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model1, ~ newstabglu + newhdl, image = TRUE,  
    at = c(summary(model1)$canonical$xs, Block="B2"),zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model1, ~ newstabglu + newage, image = TRUE,  
    at = c(summary(model1)$canonical$xs, Block="B2"),zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

persp(model1, ~ newhdl + newage, image = TRUE,  
    at = c(summary(model1)$canonical$xs, Block="B2"),zlab="glycololated hemoglobin(%)",col.lab=33,contour="colors")
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter
## Warning: "image" is not a graphical parameter

plot of chunk unnamed-chunk-7

Estimate Parameters using Shapiro-Wilk test of normality

shapiro.test(diabetes$glyhb)
## 
##  Shapiro-Wilk normality test
## 
## data:  diabetes$glyhb
## W = 0.7218, p-value < 2.2e-16

We reject the H0 that the response variable, percent glycosolcated hemoglobin, is normally distributed. Thus, we attempt to transform the data:

shapiro.test(diabetes$glyhb^2) 
## 
##  Shapiro-Wilk normality test
## 
## data:  diabetes$glyhb^2
## W = 0.5748, p-value < 2.2e-16
shapiro.test(sqrt(diabetes$glyhb)) 
## 
##  Shapiro-Wilk normality test
## 
## data:  sqrt(diabetes$glyhb)
## W = 0.7956, p-value < 2.2e-16
shapiro.test(log(diabetes$glyhb)) 
## 
##  Shapiro-Wilk normality test
## 
## data:  log(diabetes$glyhb)
## W = 0.8624, p-value < 2.2e-16
shapiro.test(log10(diabetes$glyhb)) 
## 
##  Shapiro-Wilk normality test
## 
## data:  log10(diabetes$glyhb)
## W = 0.8624, p-value < 2.2e-16
shapiro.test(1/diabetes$glyhb) 
## 
##  Shapiro-Wilk normality test
## 
## data:  1/diabetes$glyhb
## W = 0.9487, p-value = 2.171e-10

It appears that the data cannot be readily transformed into a normal distribution, which causes reliability issues with the model.

Diagnostics and Model Adequacy Checking:

par(mfrow=c(2,2))
plot(model1)

plot of chunk unnamed-chunk-10

The model appears to have a very good fit to the data since the points are evenly dispersed across the Cartesian plane of residual error and the fitted model.

References:

Department of Biostatistics, Vanderbilt University. 2014. Originally obtained from Dr. John Schorling, Department of Medicine, University of Virginia School of Medicine. “http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/diabetes.html

Willems JP, Saunders JT, DE Hunt, JB Schorling: Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Medical Journal 90:814-820; 1997