Recipes for the Design of Experiments

Visits to Physician’s Office

Jane Braun

RPI

November 20th Version 1.0

1. Setting

System under test

The data used in this recipe was found using the “100+ interesting data sets” webpage. It is part of the R dataset package called “Ecdat,” which various datasets for the purpose of econometrics.

This dataset in particular has data of number of physician visits by individuals in the United States.

# Read in dataset from Ecdat package
#install.packages("Ecdat")
# library(Ecdat)
data(OFP, package = "Ecdat")
x <- OFP

Factors and Levels

The original data being examined has 4,406 observations of 19 variables.

head(x)

##   ofp ofnp opp opnp emr hosp numchron adldiff age black    sex maried
## 1   5    0   0    0   0    1        2       0 6.9   yes   male    yes
## 2   1    0   2    0   2    0        2       0 7.4    no female    yes
## 3  13    0   0    0   3    3        4       1 6.6   yes female     no
## 4  16    0   5    0   1    1        2       1 7.6    no   male    yes
## 5   3    0   0    0   0    0        2       1 7.9    no female    yes
## 6  17    0   0    0   0    0        5       1 6.6    no female     no
##   school faminc employed privins medicaid region  hlth
## 1      6 2.8810      yes     yes       no  other other
## 2     10 2.7478       no     yes       no  other other
## 3     10 0.6532       no      no      yes  other  poor
## 4      3 0.6588       no     yes       no  other  poor
## 5      6 0.6588       no     yes       no  other other
## 6      7 0.3301       no      no      yes  other  poor

summary(x)

##       ofp              ofnp              opp                opnp         
##  Min.   : 0.000   Min.   :  0.000   Min.   :  0.0000   Min.   :  0.0000  
##  1st Qu.: 1.000   1st Qu.:  0.000   1st Qu.:  0.0000   1st Qu.:  0.0000  
##  Median : 4.000   Median :  0.000   Median :  0.0000   Median :  0.0000  
##  Mean   : 5.774   Mean   :  1.618   Mean   :  0.7508   Mean   :  0.5361  
##  3rd Qu.: 8.000   3rd Qu.:  1.000   3rd Qu.:  0.0000   3rd Qu.:  0.0000  
##  Max.   :89.000   Max.   :104.000   Max.   :141.0000   Max.   :155.0000  
##       emr               hosp          numchron        adldiff     
##  Min.   : 0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 0.0000   Median :0.000   Median :1.000   Median :0.000  
##  Mean   : 0.2635   Mean   :0.296   Mean   :1.542   Mean   :0.204  
##  3rd Qu.: 0.0000   3rd Qu.:0.000   3rd Qu.:2.000   3rd Qu.:0.000  
##  Max.   :12.0000   Max.   :8.000   Max.   :8.000   Max.   :1.000  
##       age         black          sex       maried         school     
##  Min.   : 6.600   no :3890   female:2628   no :2000   Min.   : 0.00  
##  1st Qu.: 6.900   yes: 516   male  :1778   yes:2406   1st Qu.: 8.00  
##  Median : 7.300                                       Median :11.00  
##  Mean   : 7.402                                       Mean   :10.29  
##  3rd Qu.: 7.800                                       3rd Qu.:12.00  
##  Max.   :10.900                                       Max.   :18.00  
##      faminc        employed   privins    medicaid       region    
##  Min.   :-1.0125   no :3951   no : 985   no :4004   other  :1614  
##  1st Qu.: 0.9122   yes: 455   yes:3421   yes: 402   noreast: 837  
##  Median : 1.6982                                    midwest:1157  
##  Mean   : 2.5271                                    west   : 798  
##  3rd Qu.: 3.1728                                                  
##  Max.   :54.8351                                                  
##         hlth     
##  other    :3509  
##  excellent: 343  
##  poor     : 554  
##                  
##                  
##

# As seen below, there are 452 observations of 12 variables, including: 5 factors and 7 numeric variables.
str(x)

## 'data.frame':    4406 obs. of  19 variables:
##  $ ofp     : int  5 1 13 16 3 17 9 3 1 0 ...
##  $ ofnp    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ opp     : int  0 2 0 5 0 0 0 0 0 0 ...
##  $ opnp    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emr     : int  0 2 3 1 0 0 0 0 0 0 ...
##  $ hosp    : int  1 0 3 1 0 0 0 0 0 0 ...
##  $ numchron: int  2 2 4 2 2 5 0 0 0 0 ...
##  $ adldiff : int  0 0 1 1 1 1 0 0 0 0 ...
##  $ age     : num  6.9 7.4 6.6 7.6 7.9 6.6 7.5 8.7 7.3 7.8 ...
##  $ black   : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
##  $ maried  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 1 1 1 ...
##  $ school  : int  6 10 10 3 6 7 8 8 8 8 ...
##  $ faminc  : num  2.881 2.748 0.653 0.659 0.659 ...
##  $ employed: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ privins : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 2 2 2 2 ...
##  $ medicaid: Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "other","noreast",..: 1 1 1 1 1 1 3 3 3 3 ...
##  $ hlth    : Factor w/ 3 levels "other","excellent",..: 1 1 3 3 1 3 1 1 1 1 ...

x <- x[,c(1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)] # eliminate unused variables
x$school <- as.numeric(x$school) # assign as a numeric variable
x$school <- cut(x$school, c(0, 12, 19)) #create intervals out of continuous variable
x$school <- as.factor(x$school) #reassign as a factor 
x <- x[x$hlth %in% c("excellent", "poor"),] # eliminate "other" level
x$hlth <- droplevels(x$hlth) # drop unused level
x <- x[, c(1,2,3,4,5,7,11)] # eliminate all unused variables


# Assign all factors as binary
x$black <- as.character(x$black)
x$black[x$black == "no"] <- 1
x$black[x$black == "yes"] <- -1

x$sex <- as.character(x$sex)
x$sex[x$sex == "female"] <- 1
x$sex[x$sex == "male"] <- -1

x$employed <- as.character(x$employed)
x$employed[x$employed == "no"] <- 1
x$employed[x$employed == "yes"] <- -1

x$school <- as.character(x$school)
x$school[x$school == "(0,12]"] <- 1
x$school[x$school == "(12,19]"] <- -1

x$maried <- as.character(x$maried)
x$maried[x$maried == "no"] <- 1
x$maried[x$maried == "yes"] <- -1

x$hlth <- as.character(x$hlth)
x$hlth[x$hlth == "excellent"] <- 1
x$hlth[x$hlth == "poor"] <- -1

x$hlth <- as.integer(x$hlth)
x$sex <- as.integer(x$sex)
x$black <- as.integer(x$black)
x$maried <- as.integer(x$maried)
x$school <- as.integer(x$school)
x$employed <- as.integer(x$employed)
x$ofp <- as.numeric(x$ofp)

Continuous variables

The only continuous variable in this dataset is the number of physician visits by the individual.

Response variables

The response variable being analyzed in this dataset is the number of physician visits. As seen below, the range of unemployment is between 0 and 117 weeks, with a mean of 18.51 weeks.

summary(x$ofp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   5.000   6.806   9.000  89.000

The Data

The data is organized into 19 variables, but was narrowed down to 7, including: 6 factors and 1 numeric variable.

Randomization

There was no randomization in this recipe because the data points are based on observations about physician visits among American individuals.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

A fractional factorial design will be used to complete analysis. Each factor will be narrowed down to two levels through either subsetting intervals, or by excluding certain levels.

A 2^(6-1) design matrix is used to look at all the possible combinations of factors.

Randomize:

The experiment was based on observations, and was therefore not randomized.

Replicate:

The experiment had no repeated measures.

Block:

There was no blocking in this recipe.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

x_plot <- x # rename so that we can rename as factors without screwing up original data

summary(x_plot)

##       ofp             black              sex              maried        
##  Min.   : 0.000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.: 1.000   1st Qu.: 1.0000   1st Qu.:-1.0000   1st Qu.:-1.00000  
##  Median : 5.000   Median : 1.0000   Median : 1.0000   Median :-1.00000  
##  Mean   : 6.806   Mean   : 0.7547   Mean   : 0.1929   Mean   :-0.02341  
##  3rd Qu.: 9.000   3rd Qu.: 1.0000   3rd Qu.: 1.0000   3rd Qu.: 1.00000  
##  Max.   :89.000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000  
##                                                                         
##      school           employed            hlth        
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000  
##  1st Qu.: 1.0000   1st Qu.: 1.0000   1st Qu.:-1.0000  
##  Median : 1.0000   Median : 1.0000   Median :-1.0000  
##  Mean   : 0.5972   Mean   : 0.7904   Mean   :-0.2352  
##  3rd Qu.: 1.0000   3rd Qu.: 1.0000   3rd Qu.: 1.0000  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000  
##  NA's   :33

x_plot$black <- as.factor(x_plot$black)
x_plot$sex <- as.factor(x_plot$sex)
x_plot$maried <- as.factor(x_plot$maried)
x_plot$school <- as.factor(x_plot$school)
x_plot$employed <- as.factor(x_plot$employed)
x_plot$hlth <- as.factor(x_plot$hlth)

The boxplot below shows the number of doctor visits versus race. There does not seem to be any distinct difference in medians between black and non black individuals. The range of number of doctor visits for black individuals does, however have a larger range.

par(mfrow=c(1,1))
plot(x_plot$ofp ~ x_plot$black, xlab = "Black (1) or Non-Black (-1)", ylab = "Number of Doctors Visits", main="Number of Doctor Visits versus Race" )

The boxplot below shows the frequency of doctor visits by the gender of the individual. The median frequency for females is slightly greater than females, but the males have a larger range in frequencies.

plot(x_plot$ofp ~ x_plot$sex, xlab = "Male (-1) or Female (1)", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Gender")

The boxplot below shows frequency of docor visits of employed vs. unemployed individuals. The employed individuals has a larger median, and seems to be right scewed.

plot(x_plot$ofp ~ x_plot$employed, xlab = "Unemployed (-1) or Employed (1)", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Employment Status")

The boxplot below shows the number of doctor visits versus the marital status of the individual. Married individuals have a larger range, but the medians of the two groups seem relatively consistent.

plot(x_plot$ofp ~ x_plot$maried, xlab = "Married (-1) or Single (1))", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Married Status")

The boxplot below shows the frequency of doctor visits and their relationship to the health of the individual. As expected, poor health individuals have greater medians, along with a much larger range - indicating that that poor health individuals have various levels of severity.

plot(x_plot$ofp ~ x_plot$hlth, xlab = "Excellent (1) or Poor (-1)", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Health")

The boxplot below shows the frequency of doctor visits related to the years of school the individual has had. There doesn’t seem to be any signidicant difference in means or ranges of the two groups.

plot(x_plot$ofp ~ x_plot$school, xlab = "Low (12 or less years) (1) or High (greater than 12 years) (-1)", ylab = "Number of Doctors Visits", main = "Number of Doctors Visits versus Years of School")

Testing

Analysis of Variance

The ANOVA below is a 6 factor, 2 level analysis of various. The null hypothesis states that the variation in the number of annual doctors visits (ofp) cannot be explained by anything other than randomization. The alternative, however, states that it is possible that something other than randomization can explain the variation in annual doctors visits.

model <- aov(ofp~black + sex + maried + school + employed + hlth, data=x)
summary(model)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## black         1     15      15   0.239  0.625    
## sex           1     68      68   1.105  0.293    
## maried        1     16      16   0.261  0.610    
## school        1    104     104   1.694  0.193    
## employed      1      9       9   0.143  0.706    
## hlth          1   6303    6303 102.772 <2e-16 ***
## Residuals   857  52560      61                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 33 observations deleted due to missingness

In this case, the p-values for all of the main effects of the six factors are greater than an alpha of 0.05. Therefore they are insignificant, and we must fail to reject the null. Therefore, it is possible that frequency of doctor visits can still not be explained by anything other than randomization.

Design Matrix Construction

# install.packages("FrF2")
library(FrF2)

## Warning: package 'FrF2' was built under R version 3.1.2

## Loading required package: DoE.base

## Warning: package 'DoE.base' was built under R version 3.1.2

## Loading required package: grid
## Loading required package: conf.design

## Warning: package 'conf.design' was built under R version 3.1.2

## 
## Attaching package: 'DoE.base'
## 
## The following objects are masked from 'package:stats':
## 
##     aov, lm
## 
## The following object is masked from 'package:graphics':
## 
##     plot.design

attach(x)

# Constructing the design matrix

design <- FrF2(32, nfactors=6, estimable = formula("~black + sex + maried + school + employed + hlth +hlth:(~black + sex + maried + school + employed + hlth)"), factor.names = c("black", "sex", "maried", "school", "employed","hlth"), res5=TRUE, clear=FALSE)

design

##    black sex maried school employed hlth
## 1     -1  -1     -1      1        1   -1
## 2     -1  -1     -1     -1       -1   -1
## 3     -1  -1     -1     -1        1    1
## 4      1   1     -1      1        1   -1
## 5     -1  -1     -1      1       -1    1
## 6     -1   1     -1      1        1    1
## 7     -1  -1      1     -1       -1    1
## 8      1   1      1      1        1    1
## 9      1  -1     -1     -1       -1    1
## 10     1  -1     -1     -1        1   -1
## 11    -1  -1      1      1       -1   -1
## 12     1  -1      1      1       -1    1
## 13    -1   1      1     -1        1    1
## 14    -1   1     -1     -1        1   -1
## 15     1  -1      1     -1        1    1
## 16     1  -1      1      1        1   -1
## 17     1   1      1      1       -1   -1
## 18    -1   1      1      1       -1    1
## 19     1   1      1     -1        1   -1
## 20    -1   1      1      1        1   -1
## 21     1  -1     -1      1        1    1
## 22     1   1      1     -1       -1    1
## 23    -1   1      1     -1       -1   -1
## 24     1   1     -1      1       -1    1
## 25     1   1     -1     -1        1    1
## 26    -1  -1      1     -1        1   -1
## 27     1  -1     -1      1       -1   -1
## 28     1  -1      1     -1       -1   -1
## 29    -1   1     -1      1       -1   -1
## 30    -1   1     -1     -1       -1    1
## 31    -1  -1      1      1        1    1
## 32     1   1     -1     -1       -1   -1
## class=design, type= FrF2.estimable

aliasprint(design)

## $legend
## [1] A=black    B=sex      C=maried   D=school   E=employed F=hlth    
## 
## [[2]]
## [1] no aliasing among main effects and 2fis

# Creating new datafram which include the combined data of the original, along with the factor level combinations

new = merge(design, x, by=c("black", "sex", "maried", "school", "employed", "hlth"), all = FALSE)

head(new)

##   black sex maried school employed hlth ofp
## 1    -1  -1     -1      1       -1    1   0
## 2    -1  -1     -1      1       -1    1   2
## 3    -1  -1     -1      1       -1    1   2
## 4    -1  -1     -1      1        1   -1   9
## 5    -1  -1     -1      1        1   -1   0
## 6    -1  -1     -1      1        1   -1   7

# Run the same ANOVA with the new dataset

model2 <- aov(ofp ~black + sex + maried + school + employed + hlth, data=new)
summary(model2)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## black         1    227     227   3.006   0.0837 .  
## sex           1     48      48   0.634   0.4263    
## maried        1     89      89   1.173   0.2794    
## school        1     10      10   0.126   0.7225    
## employed      1    240     240   3.177   0.0755 .  
## hlth          1   4379    4379  58.020 1.87e-13 ***
## Residuals   403  30417      75                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Running the ANOVA for the new dataset result in different p-values from our original model. Now, race and health are both almost significant, with p-values of 0.0837 and 0.0755 respectively, with an alpha of 0.05. However, since these p-values are still not less than 0.05, we must fail to reject the null hypothesis - that the variation in number of doctor visists cannot be explained by anything other than randomization.

Diagnostics/Model Adequacy Checking

Assumptions for ANOVA tests include that the data is normally distributed.

shapiro.test(x$ofp)

## 
##  Shapiro-Wilk normality test
## 
## data:  x$ofp
## W = 0.711, p-value < 2.2e-16

The null of a Shapiro-Wilk test states that the data presented is normally distributed. Unfortunately, the p-value is less than an alpha of 0.05, and therefore, we must reject the null hypothesis. This suggests, instead, that the frequency of doctor visits data is not normally distributed.

par(mfrow = c(1,1))
qqnorm(residuals(model))
qqline(residuals(model))

The Q-Q Normality Plots of the residuals for the original model do not appear to be normal. When normal, the data points would lay linearly, and in this case, they do not seem to lie along the line. This suggests that the data may not be parametric.

par(mfrow = c(1,1))
qqnorm(residuals(model2))
qqline(residuals(model2))

Once again, the Q-Q normality plot of the residuals, this time for the second model, do not appear normal.

plot(fitted(model),residuals(model))

The plots of the fitted values versus the residual values are supposed to appear symmetric over a residual of 0. The datapoints of the original model do not appear to be symmetric over the length of the dynamic range, and are not symmetric over 0.

plot(fitted(model2),residuals(model2))

As before, the plots of the fitted values versus the residual values for the the new model do not appear to be normal.

4. Contingencies

Based on the model adequacy testing, it appears that the data is not normally distributed, and that this may not be the best fit of the data.

A Kruskal-Wallis test can be used to evaluate whether groups’ distributions are identical, without the assumption that the data is normally distributed.

With the kruskal-wallis test, we could only analyze the main effect of one factor at a time. I chose to analyze race and health.

kruskal.test(ofp ~ black, data=new)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  ofp by black
## Kruskal-Wallis chi-squared = 3.0358, df = 1, p-value = 0.08145

The p-value of this Kruskal-Wallis test is greater than an alpha of 0.05. Therefore, you must fail to reject the null - that variation in frequency of doctor visits cannot be explained by anything other than randomization.

kruskal.test(ofp ~ hlth, data=new)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  ofp by hlth
## Kruskal-Wallis chi-squared = 65.8292, df = 1, p-value = 4.918e-16

The p-value for the factor, health, of this Kruskal-Wallis test is less than an alpha of 0.05. Therefore, you must reject the null - that variation in frequency of doctor visits cannot be explained by something other than randomization, and suggest an alternative - that the randomization can be explained by something other than randomization.

4. References

http://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf

Doctor Visits

Jane Braun

Novemer 20th, 2014

Recipe 8: Fractional Factorial Design

Recipes for the Design of Experiments

Visits to Physician’s Office

Jane Braun

RPI

November 20th Version 1.0

1. Setting

System under test

Factors and Levels

Continuous variables

Response variables

The Data

Randomization

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

Randomize:

Replicate:

Block:

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Testing

Analysis of Variance

Design Matrix Construction

Diagnostics/Model Adequacy Checking

4. Contingencies

4. References