Rensselaer Polytechnic Institute
library(Ecdat)
The data are extracted from the 1987-88 National Medical Expenditure Survey(NMES). A subsample of people ages 66 and over all of whom are covered by Medicare. The study consists of 4406 observations and 22 variables.
A dataframe containing:
First 6 rows of Tobacco data:
head(OFP)
## ofp ofnp opp opnp emr hosp numchron adldiff age black sex maried
## 1 5 0 0 0 0 1 2 0 6.9 yes male yes
## 2 1 0 2 0 2 0 2 0 7.4 no female yes
## 3 13 0 0 0 3 3 4 1 6.6 yes female no
## 4 16 0 5 0 1 1 2 1 7.6 no male yes
## 5 3 0 0 0 0 0 2 1 7.9 no female yes
## 6 17 0 0 0 0 0 5 1 6.6 no female no
## school faminc employed privins medicaid region hlth
## 1 6 2.8810 yes yes no other other
## 2 10 2.7478 no yes no other other
## 3 10 0.6532 no no yes other poor
## 4 3 0.6588 no yes no other poor
## 5 6 0.6588 no yes no other other
## 6 7 0.3301 no no yes other poor
Summary of the Tobacco dataframe:
summary(OFP)
## ofp ofnp opp opnp
## Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 4.000 Median : 0.000 Median : 0.0000 Median : 0.0000
## Mean : 5.774 Mean : 1.618 Mean : 0.7508 Mean : 0.5361
## 3rd Qu.: 8.000 3rd Qu.: 1.000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :89.000 Max. :104.000 Max. :141.0000 Max. :155.0000
## emr hosp numchron adldiff
## Min. : 0.0000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.000
## Median : 0.0000 Median :0.000 Median :1.000 Median :0.000
## Mean : 0.2635 Mean :0.296 Mean :1.542 Mean :0.204
## 3rd Qu.: 0.0000 3rd Qu.:0.000 3rd Qu.:2.000 3rd Qu.:0.000
## Max. :12.0000 Max. :8.000 Max. :8.000 Max. :1.000
## age black sex maried school
## Min. : 6.600 no :3890 female:2628 no :2000 Min. : 0.00
## 1st Qu.: 6.900 yes: 516 male :1778 yes:2406 1st Qu.: 8.00
## Median : 7.300 Median :11.00
## Mean : 7.402 Mean :10.29
## 3rd Qu.: 7.800 3rd Qu.:12.00
## Max. :10.900 Max. :18.00
## faminc employed privins medicaid region
## Min. :-1.0125 no :3951 no : 985 no :4004 other :1614
## 1st Qu.: 0.9122 yes: 455 yes:3421 yes: 402 noreast: 837
## Median : 1.6982 midwest:1157
## Mean : 2.5271 west : 798
## 3rd Qu.: 3.1728
## Max. :54.8351
## hlth
## other :3509
## excellent: 343
## poor : 554
##
##
##
For Project 3, it was required to select a dataset with two 2 level IVs and two 3-level IVs. For project 3 purposes, I changed factor “Region”’s levels from 4 to 3.
Factors and Levels of each factor is listed below:
str(OFP)
## 'data.frame': 4406 obs. of 19 variables:
## $ ofp : int 5 1 13 16 3 17 9 3 1 0 ...
## $ ofnp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ opp : int 0 2 0 5 0 0 0 0 0 0 ...
## $ opnp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ emr : int 0 2 3 1 0 0 0 0 0 0 ...
## $ hosp : int 1 0 3 1 0 0 0 0 0 0 ...
## $ numchron: int 2 2 4 2 2 5 0 0 0 0 ...
## $ adldiff : int 0 0 1 1 1 1 0 0 0 0 ...
## $ age : num 6.9 7.4 6.6 7.6 7.9 6.6 7.5 8.7 7.3 7.8 ...
## $ black : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 1 1 1 ...
## $ school : int 6 10 10 3 6 7 8 8 8 8 ...
## $ faminc : num 2.881 2.748 0.653 0.659 0.659 ...
## $ employed: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ privins : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 2 2 2 2 ...
## $ medicaid: Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 1 1 ...
## $ region : Factor w/ 4 levels "other","noreast",..: 1 1 1 1 1 1 3 3 3 3 ...
## $ hlth : Factor w/ 3 levels "other","excellent",..: 1 1 3 3 1 3 1 1 1 1 ...
The 3 factors being studied include:
levels(OFP$employed)
## [1] "no" "yes"
levels(OFP$medicaid)
## [1] "no" "yes"
levels(OFP$hlth)
## [1] "other" "excellent" "poor"
levels(OFP$region)
## [1] "other" "noreast" "midwest" "west"
Used “droplevels” to levels of Factor “region” from 4 to 3.
OFP = subset(OFP, region != "other")
OFP$region <- as.factor(OFP$region)
levels(droplevels(OFP$region))
## [1] "noreast" "midwest" "west"
head(OFP)
## ofp ofnp opp opnp emr hosp numchron adldiff age black sex maried
## 7 9 0 0 0 0 0 0 0 7.5 no female no
## 8 3 0 0 0 0 0 0 0 8.7 no female no
## 9 1 0 0 0 0 0 0 0 7.3 no female no
## 10 0 0 0 0 0 0 0 0 7.8 no female no
## 11 0 0 0 0 0 0 1 0 6.6 no male yes
## 12 44 5 2 0 0 1 5 1 6.9 no female yes
## school faminc employed privins medicaid region hlth
## 7 8 0.8280 no yes no midwest other
## 8 8 3.0456 no yes no midwest other
## 9 8 3.0456 no yes no midwest other
## 10 8 3.0456 no yes no midwest other
## 11 8 2.9498 yes yes no midwest other
## 12 15 2.9498 no yes no midwest other
There are 11 continuous variables:
There are 6 reponse variables that are mutually exclusive measures of utilization according to “Demandd for Medical Care by the Elderly: A Finite Mixture Approach”
For thie experiment, we will use “ofp” as our response variable.
The data analyzed in the recipe provides a comprehensive picture of how Americans use and pay for health services. The dataset is a subsample of individuals ages 66 and over (a total of 4406 observations). Each individual is covered by medicare. According to the data, most elderly persons see a physician at least once a year and small fraction go to an emergency room or are admitted to a hospital, even as outpatients.
A new dataset is created by subsetting the dataset using the 4 factors discussed above (2 factors with 2 levels, 2 factors with 3 levels).
myvars <- c("ofp", "employed", "medicaid", "region", "hlth")
newdata <- droplevels(OFP[myvars])
str(newdata)
## 'data.frame': 2792 obs. of 5 variables:
## $ ofp : int 9 3 1 0 0 44 2 1 19 19 ...
## $ employed: Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ medicaid: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ region : Factor w/ 3 levels "noreast","midwest",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ hlth : Factor w/ 3 levels "other","excellent",..: 1 1 1 1 1 1 1 3 1 1 ...
summary(newdata)
## ofp employed medicaid region hlth
## Min. : 0.000 no :2504 no :2585 noreast: 837 other :2272
## 1st Qu.: 1.000 yes: 288 yes: 207 midwest:1157 excellent: 238
## Median : 4.000 west : 798 poor : 282
## Mean : 5.888
## 3rd Qu.: 8.000
## Max. :89.000
Next, I made necessary changes to the factors chosen as listed below in order to form the required factorial design:
Factors “employed”, “medicaid” change no to 0, yes to 1 Factor “region”,
levels(newdata$employed) <-c("0", "1")
levels(newdata$medicaid) <- c("0", "1")
levels(newdata$region) <-c("0", "1", "2")
levels(newdata$hlth) <- c("0", "1", "2")
levels(newdata$employed)
## [1] "0" "1"
Notice that the 1st six observations are 7th-12th observations of the original subsample. That is because for the 1st six observations of the original data, region factor was “other” which is the level we removed.
str(newdata)
## 'data.frame': 2792 obs. of 5 variables:
## $ ofp : int 9 3 1 0 0 44 2 1 19 19 ...
## $ employed: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## $ medicaid: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ region : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hlth : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 3 1 1 ...
summary(newdata)
## ofp employed medicaid region hlth
## Min. : 0.000 0:2504 0:2585 0: 837 0:2272
## 1st Qu.: 1.000 1: 288 1: 207 1:1157 1: 238
## Median : 4.000 2: 798 2: 282
## Mean : 5.888
## 3rd Qu.: 8.000
## Max. :89.000
This experiment’s end goal is to contruct a 2^m-3 fractional factorial design with the highest resolution possible.
To reach that goal, following steps need to be done:
Full factorial design will be created of the form 2^2 * 3^2
3 levels will be transformed into 2 level factors to form 2^6 full factorial design
2^6 full factorial design can be transformed into 2^m-3 design.
Using aliases, the aliasing structure of this final fractional factorial design will be determined. From the main effects, a linear model will be constructed and tested with ANOVA. ## What is the rationale for this design? The rationale for constructing fractional and factorial design, in general, is to perform a cost-effective experiment with the reduction of number of running experiments while maintaining high efficiency. In addition, determining aliasing structure helps individually analyze two interactions, or a main effect and an interaction that might share the same column in the dataset.
Below are box plots that display the relationship between the Response Variable, “ofp” and the four IVs.
boxplot(newdata$ofp ~ newdata$employed, xlab = "employed", ylab = "visits to a physician in an office setting", main = "Effect of Employment on Number ofVisits to a Physician")
boxplot(newdata$ofp ~ newdata$medicaid, xlab = "medicaid", ylab = "visits to a physician in an office setting", main = "Effect of Having Medicaid on Number ofVisits to a Physician")
boxplot(newdata$ofp ~ newdata$region, xlab = "region", ylab = "visits to a physician in an office setting", main = "Effect of Region on Number ofVisits to a Physician")
boxplot(newdata$ofp ~ newdata$hlth, xlab = "hlth", ylab = "visits to a physician in an office setting", main = "Effect of Health on Number of Visits to a Physician")
## Testing
library("FrF2")
## Warning: package 'FrF2' was built under R version 3.3.2
## Warning: package 'DoE.base' was built under R version 3.3.2
## Warning: package 'conf.design' was built under R version 3.3.2
fullfactorial <- FrF2(64,6)
print(fullfactorial)
## A B C D E F
## 1 -1 1 -1 -1 -1 1
## 2 1 -1 -1 -1 1 -1
## 3 -1 -1 1 1 1 -1
## 4 -1 1 1 -1 1 -1
## 5 1 1 -1 1 -1 1
## 6 1 -1 1 -1 -1 -1
## 7 -1 1 -1 1 1 1
## 8 1 -1 -1 1 -1 -1
## 9 -1 1 -1 1 -1 1
## 10 -1 1 1 1 1 1
## 11 -1 -1 1 1 -1 -1
## 12 1 -1 1 1 1 -1
## 13 -1 1 1 1 -1 -1
## 14 -1 1 1 -1 -1 1
## 15 -1 -1 1 -1 -1 -1
## 16 1 -1 1 -1 -1 1
## 17 1 -1 -1 -1 1 1
## 18 1 -1 -1 -1 -1 -1
## 19 1 -1 -1 1 -1 1
## 20 -1 1 1 1 -1 1
## 21 -1 -1 -1 1 1 -1
## 22 1 1 -1 1 1 -1
## 23 1 1 -1 -1 1 1
## 24 1 1 1 -1 -1 1
## 25 -1 -1 1 -1 1 1
## 26 -1 -1 -1 -1 1 -1
## 27 1 1 -1 -1 -1 -1
## 28 -1 -1 -1 -1 -1 1
## 29 -1 1 -1 1 -1 -1
## 30 -1 1 1 1 1 -1
## 31 1 -1 -1 1 1 -1
## 32 -1 -1 -1 1 -1 1
## 33 -1 -1 -1 -1 -1 -1
## 34 1 1 -1 1 1 1
## 35 -1 -1 1 1 1 1
## 36 -1 -1 -1 -1 1 1
## 37 1 1 1 1 -1 1
## 38 1 -1 1 -1 1 1
## 39 1 -1 -1 -1 -1 1
## 40 1 1 1 -1 1 1
## 41 1 1 1 -1 1 -1
## 42 -1 1 1 -1 1 1
## 43 -1 1 -1 -1 -1 -1
## 44 -1 1 1 -1 -1 -1
## 45 -1 -1 -1 1 1 1
## 46 -1 -1 1 1 -1 1
## 47 -1 1 -1 1 1 -1
## 48 1 1 1 1 1 -1
## 49 1 1 1 1 1 1
## 50 -1 -1 1 -1 -1 1
## 51 -1 1 -1 -1 1 1
## 52 -1 -1 1 -1 1 -1
## 53 -1 -1 -1 1 -1 -1
## 54 1 -1 -1 1 1 1
## 55 1 -1 1 1 1 1
## 56 1 1 -1 -1 1 -1
## 57 1 1 1 1 -1 -1
## 58 -1 1 -1 -1 1 -1
## 59 1 1 -1 -1 -1 1
## 60 1 1 1 -1 -1 -1
## 61 1 -1 1 1 -1 1
## 62 1 -1 1 1 -1 -1
## 63 1 -1 1 -1 1 -1
## 64 1 1 -1 1 -1 -1
## class=design, type= full factorial
It is clear that conducting a full factorial design experiments requires many runs and lots of time to examine the data. The fractional factorial design below (2^6-3) results in only 8 experimental runs with 3 being the highest resoluion.
fracfactorial <- FrF2(8,6, res3 = T)
print(fracfactorial)
## A B C D E F
## 1 -1 1 -1 -1 1 -1
## 2 1 1 1 1 1 1
## 3 1 -1 -1 -1 -1 1
## 4 -1 -1 -1 1 1 1
## 5 -1 -1 1 1 -1 -1
## 6 -1 1 1 -1 -1 1
## 7 1 -1 1 -1 1 -1
## 8 1 1 -1 1 -1 -1
## class=design, type= FrF2
aliasprint(fracfactorial)
## $legend
## [1] A=A B=B C=C D=D E=E F=F
##
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
##
## $fi2
## [1] AF=BE=CD
In this section, we conduct an ANOVA test to test the main effects and the 2-factor interaction effects.
model1 <- lm(ofp~(employed + medicaid + region + hlth)^2, data = newdata)
anova(model1)
## Analysis of Variance Table
##
## Response: ofp
## Df Sum Sq Mean Sq F value Pr(>F)
## employed 1 20 20.47 0.4550 0.5000194
## medicaid 1 198 197.80 4.3968 0.0360968 *
## region 2 418 208.90 4.6435 0.0096985 **
## hlth 2 4032 2016.06 44.8136 < 2.2e-16 ***
## employed:medicaid 1 48 48.15 1.0704 0.3009520
## employed:region 2 18 9.19 0.2043 0.8152420
## employed:hlth 2 712 356.21 7.9180 0.0003724 ***
## medicaid:region 2 41 20.58 0.4574 0.6329499
## medicaid:hlth 2 146 73.15 1.6260 0.1969061
## region:hlth 4 662 165.41 3.6769 0.0054275 **
## Residuals 2772 124706 44.99
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this section, an Anova table for the main effects is shown. This table shows that only “employed”" does not have statistically significant effects on the visits to the physicians office.
model2 <- lm(ofp~employed +medicaid +region+hlth, data = newdata)
anova(model2)
## Analysis of Variance Table
##
## Response: ofp
## Df Sum Sq Mean Sq F value Pr(>F)
## employed 1 20 20.47 0.4513 0.50179
## medicaid 1 198 197.80 4.3605 0.03687 *
## region 2 418 208.90 4.6052 0.01008 *
## hlth 2 4032 2016.06 44.4435 < 2e-16 ***
## Residuals 2785 126334 45.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We plot a normal Q-Q plot to check for normality.
qqnorm(residuals(model2))
qqnorm(residuals(model2))
The residuals fit the normal line pretty well. This is a good sign.
plot(fitted(model2),residuals(model2))
There is clear display of linearity although most are dispursed randomly.
Montgomery, Douglas C. 2012. Design and Analysis of Experiments, 8th Edition
Raw Data Dataset- “OFP” from Ecdat R package