Rensselaer Polytechnic Institute

1.Setting

library(Ecdat)

System Under Test

The data are extracted from the 1987-88 National Medical Expenditure Survey(NMES). A subsample of people ages 66 and over all of whom are covered by Medicare. The study consists of 4406 observations and 22 variables.

A dataframe containing:

  • ofp : number of physician office visits
  • ofnp : number of nonphysician office visits
  • opp : number of physician outpatient visits
  • opnp : number of nonphysician outpatient visits
  • emr : number of emergency room visits
  • hosp : number of hospitalizations
  • numchron : number of chronic conditions
  • adldiff : the person has a condition that limits activities of daily living ?
  • age : age in years (divided by 10)
  • black : is the person african-american ?
  • sex : is the person male ?
  • maried : is the person maried ?
  • school : number of years of education
  • faminc : family income in 10000$
  • employed : is the person employed ?
  • privins : is the person covered by private health insurance ?
  • medicaid : is the person covered by medicaid ?
  • region : the region (noreast, midwest,west)
  • hlth self :-perceived health (excellent, poor, other)

First 6 rows of Tobacco data:

head(OFP)
##   ofp ofnp opp opnp emr hosp numchron adldiff age black    sex maried
## 1   5    0   0    0   0    1        2       0 6.9   yes   male    yes
## 2   1    0   2    0   2    0        2       0 7.4    no female    yes
## 3  13    0   0    0   3    3        4       1 6.6   yes female     no
## 4  16    0   5    0   1    1        2       1 7.6    no   male    yes
## 5   3    0   0    0   0    0        2       1 7.9    no female    yes
## 6  17    0   0    0   0    0        5       1 6.6    no female     no
##   school faminc employed privins medicaid region  hlth
## 1      6 2.8810      yes     yes       no  other other
## 2     10 2.7478       no     yes       no  other other
## 3     10 0.6532       no      no      yes  other  poor
## 4      3 0.6588       no     yes       no  other  poor
## 5      6 0.6588       no     yes       no  other other
## 6      7 0.3301       no      no      yes  other  poor

Summary of the Tobacco dataframe:

summary(OFP)
##       ofp              ofnp              opp                opnp         
##  Min.   : 0.000   Min.   :  0.000   Min.   :  0.0000   Min.   :  0.0000  
##  1st Qu.: 1.000   1st Qu.:  0.000   1st Qu.:  0.0000   1st Qu.:  0.0000  
##  Median : 4.000   Median :  0.000   Median :  0.0000   Median :  0.0000  
##  Mean   : 5.774   Mean   :  1.618   Mean   :  0.7508   Mean   :  0.5361  
##  3rd Qu.: 8.000   3rd Qu.:  1.000   3rd Qu.:  0.0000   3rd Qu.:  0.0000  
##  Max.   :89.000   Max.   :104.000   Max.   :141.0000   Max.   :155.0000  
##       emr               hosp          numchron        adldiff     
##  Min.   : 0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 0.0000   Median :0.000   Median :1.000   Median :0.000  
##  Mean   : 0.2635   Mean   :0.296   Mean   :1.542   Mean   :0.204  
##  3rd Qu.: 0.0000   3rd Qu.:0.000   3rd Qu.:2.000   3rd Qu.:0.000  
##  Max.   :12.0000   Max.   :8.000   Max.   :8.000   Max.   :1.000  
##       age         black          sex       maried         school     
##  Min.   : 6.600   no :3890   female:2628   no :2000   Min.   : 0.00  
##  1st Qu.: 6.900   yes: 516   male  :1778   yes:2406   1st Qu.: 8.00  
##  Median : 7.300                                       Median :11.00  
##  Mean   : 7.402                                       Mean   :10.29  
##  3rd Qu.: 7.800                                       3rd Qu.:12.00  
##  Max.   :10.900                                       Max.   :18.00  
##      faminc        employed   privins    medicaid       region    
##  Min.   :-1.0125   no :3951   no : 985   no :4004   other  :1614  
##  1st Qu.: 0.9122   yes: 455   yes:3421   yes: 402   noreast: 837  
##  Median : 1.6982                                    midwest:1157  
##  Mean   : 2.5271                                    west   : 798  
##  3rd Qu.: 3.1728                                                  
##  Max.   :54.8351                                                  
##         hlth     
##  other    :3509  
##  excellent: 343  
##  poor     : 554  
##                  
##                  
## 

Factors and Levels

For Project 3, it was required to select a dataset with two 2 level IVs and two 3-level IVs. For project 3 purposes, I changed factor “Region”’s levels from 4 to 3.

Factors and Levels of each factor is listed below:

str(OFP)
## 'data.frame':    4406 obs. of  19 variables:
##  $ ofp     : int  5 1 13 16 3 17 9 3 1 0 ...
##  $ ofnp    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ opp     : int  0 2 0 5 0 0 0 0 0 0 ...
##  $ opnp    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emr     : int  0 2 3 1 0 0 0 0 0 0 ...
##  $ hosp    : int  1 0 3 1 0 0 0 0 0 0 ...
##  $ numchron: int  2 2 4 2 2 5 0 0 0 0 ...
##  $ adldiff : int  0 0 1 1 1 1 0 0 0 0 ...
##  $ age     : num  6.9 7.4 6.6 7.6 7.9 6.6 7.5 8.7 7.3 7.8 ...
##  $ black   : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
##  $ maried  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 1 1 1 ...
##  $ school  : int  6 10 10 3 6 7 8 8 8 8 ...
##  $ faminc  : num  2.881 2.748 0.653 0.659 0.659 ...
##  $ employed: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ privins : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 2 2 2 2 ...
##  $ medicaid: Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "other","noreast",..: 1 1 1 1 1 1 3 3 3 3 ...
##  $ hlth    : Factor w/ 3 levels "other","excellent",..: 1 1 3 3 1 3 1 1 1 1 ...

The 3 factors being studied include:

  • employed: no, yes
  • medicaid : no, yes
  • region : noreast, midwest, west (removed “other” level)
  • health : other, excellent, poor
levels(OFP$employed)
## [1] "no"  "yes"
levels(OFP$medicaid)
## [1] "no"  "yes"
levels(OFP$hlth)
## [1] "other"     "excellent" "poor"
levels(OFP$region)
## [1] "other"   "noreast" "midwest" "west"

Used “droplevels” to levels of Factor “region” from 4 to 3.

OFP = subset(OFP, region != "other")
OFP$region <- as.factor(OFP$region)
levels(droplevels(OFP$region))
## [1] "noreast" "midwest" "west"
head(OFP)
##    ofp ofnp opp opnp emr hosp numchron adldiff age black    sex maried
## 7    9    0   0    0   0    0        0       0 7.5    no female     no
## 8    3    0   0    0   0    0        0       0 8.7    no female     no
## 9    1    0   0    0   0    0        0       0 7.3    no female     no
## 10   0    0   0    0   0    0        0       0 7.8    no female     no
## 11   0    0   0    0   0    0        1       0 6.6    no   male    yes
## 12  44    5   2    0   0    1        5       1 6.9    no female    yes
##    school faminc employed privins medicaid  region  hlth
## 7       8 0.8280       no     yes       no midwest other
## 8       8 3.0456       no     yes       no midwest other
## 9       8 3.0456       no     yes       no midwest other
## 10      8 3.0456       no     yes       no midwest other
## 11      8 2.9498      yes     yes       no midwest other
## 12     15 2.9498       no     yes       no midwest other

Continuous Variables

There are 11 continuous variables:

  • age : age in years (divided by 10)
  • faminc : family income in 10000$

Reponse Variable

There are 6 reponse variables that are mutually exclusive measures of utilization according to “Demandd for Medical Care by the Elderly: A Finite Mixture Approach”

  • ofp : visits to a physician in an office setting
  • ofnp : visits to a non-physician in an office setting
  • opp : visits to a physician in a hospital outpatient setting
  • opnp : visits to a non-physician in an outpatient settings
  • EMR : visits to an emergency room
  • HOSP : number of hospital stays

For thie experiment, we will use “ofp” as our response variable.

The Data: How is it organized and what does it look like?

The data analyzed in the recipe provides a comprehensive picture of how Americans use and pay for health services. The dataset is a subsample of individuals ages 66 and over (a total of 4406 observations). Each individual is covered by medicare. According to the data, most elderly persons see a physician at least once a year and small fraction go to an emergency room or are admitted to a hospital, even as outpatients.

A new dataset is created by subsetting the dataset using the 4 factors discussed above (2 factors with 2 levels, 2 factors with 3 levels).

myvars <- c("ofp", "employed", "medicaid", "region", "hlth")
newdata <- droplevels(OFP[myvars])
str(newdata)
## 'data.frame':    2792 obs. of  5 variables:
##  $ ofp     : int  9 3 1 0 0 44 2 1 19 19 ...
##  $ employed: Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ medicaid: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ region  : Factor w/ 3 levels "noreast","midwest",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ hlth    : Factor w/ 3 levels "other","excellent",..: 1 1 1 1 1 1 1 3 1 1 ...
summary(newdata)
##       ofp         employed   medicaid       region            hlth     
##  Min.   : 0.000   no :2504   no :2585   noreast: 837   other    :2272  
##  1st Qu.: 1.000   yes: 288   yes: 207   midwest:1157   excellent: 238  
##  Median : 4.000                         west   : 798   poor     : 282  
##  Mean   : 5.888                                                        
##  3rd Qu.: 8.000                                                        
##  Max.   :89.000

Next, I made necessary changes to the factors chosen as listed below in order to form the required factorial design:

Factors “employed”, “medicaid” change no to 0, yes to 1 Factor “region”,

levels(newdata$employed) <-c("0", "1")
levels(newdata$medicaid) <- c("0", "1")
levels(newdata$region) <-c("0", "1", "2")
levels(newdata$hlth) <- c("0", "1", "2")
levels(newdata$employed)
## [1] "0" "1"

Notice that the 1st six observations are 7th-12th observations of the original subsample. That is because for the 1st six observations of the original data, region factor was “other” which is the level we removed.

str(newdata)
## 'data.frame':    2792 obs. of  5 variables:
##  $ ofp     : int  9 3 1 0 0 44 2 1 19 19 ...
##  $ employed: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ medicaid: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ region  : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hlth    : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 3 1 1 ...
summary(newdata)
##       ofp         employed medicaid region   hlth    
##  Min.   : 0.000   0:2504   0:2585   0: 837   0:2272  
##  1st Qu.: 1.000   1: 288   1: 207   1:1157   1: 238  
##  Median : 4.000                     2: 798   2: 282  
##  Mean   : 5.888                                      
##  3rd Qu.: 8.000                                      
##  Max.   :89.000

2. Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment’s end goal is to contruct a 2^m-3 fractional factorial design with the highest resolution possible.

To reach that goal, following steps need to be done:

  • Full factorial design will be created of the form 2^2 * 3^2

  • 3 levels will be transformed into 2 level factors to form 2^6 full factorial design

  • 2^6 full factorial design can be transformed into 2^m-3 design.

Using aliases, the aliasing structure of this final fractional factorial design will be determined. From the main effects, a linear model will be constructed and tested with ANOVA. ## What is the rationale for this design? The rationale for constructing fractional and factorial design, in general, is to perform a cost-effective experiment with the reduction of number of running experiments while maintaining high efficiency. In addition, determining aliasing structure helps individually analyze two interactions, or a main effect and an interaction that might share the same column in the dataset.

3. Statistical Analysis

Exploratory Data Analysis- Graphics and Descriptive Summary

Below are box plots that display the relationship between the Response Variable, “ofp” and the four IVs.

boxplot(newdata$ofp ~ newdata$employed, xlab = "employed", ylab = "visits to a physician in an office setting", main = "Effect of Employment on Number ofVisits to a Physician")

boxplot(newdata$ofp ~ newdata$medicaid, xlab = "medicaid", ylab = "visits to a physician in an office setting", main = "Effect of Having Medicaid on Number ofVisits to a Physician")

boxplot(newdata$ofp ~ newdata$region, xlab = "region", ylab = "visits to a physician in an office setting", main = "Effect of Region on Number ofVisits to a Physician")

boxplot(newdata$ofp ~ newdata$hlth, xlab = "hlth", ylab = "visits to a physician in an office setting", main = "Effect of Health on Number of Visits to a Physician")

## Testing

Treatment Structure

library("FrF2")
## Warning: package 'FrF2' was built under R version 3.3.2
## Warning: package 'DoE.base' was built under R version 3.3.2
## Warning: package 'conf.design' was built under R version 3.3.2
fullfactorial <- FrF2(64,6)
print(fullfactorial)
##     A  B  C  D  E  F
## 1  -1  1 -1 -1 -1  1
## 2   1 -1 -1 -1  1 -1
## 3  -1 -1  1  1  1 -1
## 4  -1  1  1 -1  1 -1
## 5   1  1 -1  1 -1  1
## 6   1 -1  1 -1 -1 -1
## 7  -1  1 -1  1  1  1
## 8   1 -1 -1  1 -1 -1
## 9  -1  1 -1  1 -1  1
## 10 -1  1  1  1  1  1
## 11 -1 -1  1  1 -1 -1
## 12  1 -1  1  1  1 -1
## 13 -1  1  1  1 -1 -1
## 14 -1  1  1 -1 -1  1
## 15 -1 -1  1 -1 -1 -1
## 16  1 -1  1 -1 -1  1
## 17  1 -1 -1 -1  1  1
## 18  1 -1 -1 -1 -1 -1
## 19  1 -1 -1  1 -1  1
## 20 -1  1  1  1 -1  1
## 21 -1 -1 -1  1  1 -1
## 22  1  1 -1  1  1 -1
## 23  1  1 -1 -1  1  1
## 24  1  1  1 -1 -1  1
## 25 -1 -1  1 -1  1  1
## 26 -1 -1 -1 -1  1 -1
## 27  1  1 -1 -1 -1 -1
## 28 -1 -1 -1 -1 -1  1
## 29 -1  1 -1  1 -1 -1
## 30 -1  1  1  1  1 -1
## 31  1 -1 -1  1  1 -1
## 32 -1 -1 -1  1 -1  1
## 33 -1 -1 -1 -1 -1 -1
## 34  1  1 -1  1  1  1
## 35 -1 -1  1  1  1  1
## 36 -1 -1 -1 -1  1  1
## 37  1  1  1  1 -1  1
## 38  1 -1  1 -1  1  1
## 39  1 -1 -1 -1 -1  1
## 40  1  1  1 -1  1  1
## 41  1  1  1 -1  1 -1
## 42 -1  1  1 -1  1  1
## 43 -1  1 -1 -1 -1 -1
## 44 -1  1  1 -1 -1 -1
## 45 -1 -1 -1  1  1  1
## 46 -1 -1  1  1 -1  1
## 47 -1  1 -1  1  1 -1
## 48  1  1  1  1  1 -1
## 49  1  1  1  1  1  1
## 50 -1 -1  1 -1 -1  1
## 51 -1  1 -1 -1  1  1
## 52 -1 -1  1 -1  1 -1
## 53 -1 -1 -1  1 -1 -1
## 54  1 -1 -1  1  1  1
## 55  1 -1  1  1  1  1
## 56  1  1 -1 -1  1 -1
## 57  1  1  1  1 -1 -1
## 58 -1  1 -1 -1  1 -1
## 59  1  1 -1 -1 -1  1
## 60  1  1  1 -1 -1 -1
## 61  1 -1  1  1 -1  1
## 62  1 -1  1  1 -1 -1
## 63  1 -1  1 -1  1 -1
## 64  1  1 -1  1 -1 -1
## class=design, type= full factorial

It is clear that conducting a full factorial design experiments requires many runs and lots of time to examine the data. The fractional factorial design below (2^6-3) results in only 8 experimental runs with 3 being the highest resoluion.

fracfactorial <- FrF2(8,6, res3 = T)
print(fracfactorial)
##    A  B  C  D  E  F
## 1 -1  1 -1 -1  1 -1
## 2  1  1  1  1  1  1
## 3  1 -1 -1 -1 -1  1
## 4 -1 -1 -1  1  1  1
## 5 -1 -1  1  1 -1 -1
## 6 -1  1  1 -1 -1  1
## 7  1 -1  1 -1  1 -1
## 8  1  1 -1  1 -1 -1
## class=design, type= FrF2
aliasprint(fracfactorial)
## $legend
## [1] A=A B=B C=C D=D E=E F=F
## 
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD

In this section, we conduct an ANOVA test to test the main effects and the 2-factor interaction effects.

model1 <- lm(ofp~(employed + medicaid + region + hlth)^2, data = newdata)
anova(model1)
## Analysis of Variance Table
## 
## Response: ofp
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## employed             1     20   20.47  0.4550 0.5000194    
## medicaid             1    198  197.80  4.3968 0.0360968 *  
## region               2    418  208.90  4.6435 0.0096985 ** 
## hlth                 2   4032 2016.06 44.8136 < 2.2e-16 ***
## employed:medicaid    1     48   48.15  1.0704 0.3009520    
## employed:region      2     18    9.19  0.2043 0.8152420    
## employed:hlth        2    712  356.21  7.9180 0.0003724 ***
## medicaid:region      2     41   20.58  0.4574 0.6329499    
## medicaid:hlth        2    146   73.15  1.6260 0.1969061    
## region:hlth          4    662  165.41  3.6769 0.0054275 ** 
## Residuals         2772 124706   44.99                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this section, an Anova table for the main effects is shown. This table shows that only “employed”" does not have statistically significant effects on the visits to the physicians office.

model2 <- lm(ofp~employed +medicaid +region+hlth, data = newdata)
anova(model2)
## Analysis of Variance Table
## 
## Response: ofp
##             Df Sum Sq Mean Sq F value  Pr(>F)    
## employed     1     20   20.47  0.4513 0.50179    
## medicaid     1    198  197.80  4.3605 0.03687 *  
## region       2    418  208.90  4.6052 0.01008 *  
## hlth         2   4032 2016.06 44.4435 < 2e-16 ***
## Residuals 2785 126334   45.36                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Diagnostics/Model Adequacy Checking

We plot a normal Q-Q plot to check for normality.

qqnorm(residuals(model2))
qqnorm(residuals(model2))

The residuals fit the normal line pretty well. This is a good sign.

plot(fitted(model2),residuals(model2))

There is clear display of linearity although most are dispursed randomly.

4. References to the literature

Montgomery, Douglas C. 2012. Design and Analysis of Experiments, 8th Edition

5. Appendices

Raw Data Dataset- “OFP” from Ecdat R package