Recipes for the Design of Experiments:

Recipe Outline

Trevor Corrao

RPI

11/9/16 Version 1

1. Setting

System under test

For this recipe, a dataset of childrens’ health status in the U.S. in 1986 is analyzed. The health metric is a continuous Dependent Variable, which is dependent on factors including number of doctor visits, number of children per household, and access to healthcare. There were 485 observations. This data can be found under the Doctor section library in the Ecdat Package.

#reading in data
library("Ecdat", lib.loc="~/R/win-library/3.1")
## Warning: package 'Ecdat' was built under R version 3.1.3
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 3.1.3
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
## 
##     Orange
library("Ecfun", lib.loc="~/R/win-library/3.1")
data(Doctor)
summary(Doctor)
##      doctor         children         access           health         
##  Min.   : 0.00   Min.   :1.000   Min.   :0.0000   Min.   :-1.524000  
##  1st Qu.: 0.00   1st Qu.:1.000   1st Qu.:0.2500   1st Qu.:-1.066000  
##  Median : 1.00   Median :2.000   Median :0.3500   Median :-0.421000  
##  Mean   : 1.61   Mean   :2.264   Mean   :0.3812   Mean   :-0.000041  
##  3rd Qu.: 2.00   3rd Qu.:3.000   3rd Qu.:0.5000   3rd Qu.: 0.657000  
##  Max.   :48.00   Max.   :9.000   Max.   :0.9200   Max.   : 7.217000

Factors and Levels

The three factors being considered for the experiment are:

  1. Doctor: the number of doctor visits.

  2. Children: the number of children in the household.

  3. Access: a measure of access to health care

In the Experimental Design portion of this Project, I will explore the factors, and decide which IV to block accordingly.

#setting factors
Doctor$doctor=as.factor(Doctor$doctor)
Doctor$children=as.factor(Doctor$children)
Doctor$access=as.factor(Doctor$access)

Continuous Variables

The response variable, health, is considered a continuous dependent variable. We hypothesize that the metric is dependent on the factors listed above. It can assume all values, not just integer values, over a range (-1.524<= x <= 7.217). That range is simply what was experienced in this study, but if further studies were conducted, the range could certainly broaden.

Deciding Upon a Response Variable

Initially, my intuition was telling me to use doctor (the number of doctor visits) as the response variable. This is a viable option as it very well could be dependent on the other three attributes. I would expect that with an increased number of children and increased access to health care, the number of visits would increase as well. With a larger “health” number, (larger positive numbers are associated with poorer health), I would expect the number of doctor visits to increase as well. This analysis could be performed. But, after critically evaluating the assignment, I realized that my dependent variable, the doctor visits, would only produce discrete integer variables. The assignment calls for continuous DV. So, I flipped my argument to analyze the effects of the factors on the “health” metric. I hypothesize that with increased access to healthcare and increased number of doctor visits, the health metric will go down. As for number of children, I do not forsee a correlation, but I will be performing further analysis under Experimental Design. I believe that number of children will prove to be a nuisance factor, and then I will have to select which factor, of the remaining two, to block.

The Data: How is it organized and what does it look like?

The structure of the Doctor data is as follows:

#analyzing structure
  str(Doctor)
## 'data.frame':    485 obs. of  4 variables:
##  $ doctor  : Factor w/ 17 levels "0","1","2","3",..: 1 2 1 1 12 4 1 7 2 1 ...
##  $ children: Factor w/ 8 levels "1","2","3","4",..: 1 3 4 2 1 1 2 1 1 1 ...
##  $ access  : Factor w/ 18 levels "0","0.08","0.17",..: 11 3 10 7 15 4 11 15 4 15 ...
##  $ health  : num  0.495 0.52 -1.227 -1.524 0.173 ...

First and last 6 observations of the dataset:

 head(Doctor)
##   doctor children access health
## 1      0        1    0.5  0.495
## 2      1        3   0.17  0.520
## 3      0        4   0.42 -1.227
## 4      0        2   0.33 -1.524
## 5     11        1   0.67  0.173
## 6      3        1   0.25 -0.905
 tail(Doctor)
##     doctor children access health
## 480      0        2    0.5 -1.524
## 481      0        2    0.5 -0.421
## 482      0        2   0.36 -0.930
## 483      9        1   0.33 -1.066
## 484      0        2   0.33 -1.363
## 485      1        3    0.5 -0.099

2.(Experimental) Design

Organization of the Experiment to Test the Hypothesis

The contributing factors and the response variables were selected naturally, just by following the guidelines outlined in the assignment. Health is certainly going to be the response variable, since it is a continuous dependent variable. Now, figuring out which independent variables to withhold depends on some inferrential statistics. Using box plots, below, I will eliminate one of the three factors, and consider it a nuisance factor.

#checking for relationships
boxplot(Doctor$health~Doctor$doctor, xlab="the number of doctor visits
+ ", ylab="Health")

boxplot(Doctor$health~Doctor$children, xlab="The number of children in the household", ylab="Health")

boxplot(Doctor$health~Doctor$access, xlab="Access to Health Care
+ ", ylab="Health")

As the boxplots highlight, the Number of children in the houselhold does not truly correlate with a significant variation in health level. The means are relatively even as the number of children increases. The other Independent variable, access and doctor, show much more variation. Thereby, I will define children as a nuisance factor, and focus on doctor and access as my “pair of categorical IVs.”

As we proceed, I will be blocking the access factor, to investigate the effect that number of doctor visits has on health.

A randomized, incomplete block design will be the approach to examine the effects of doctor on health. ANOVA testing will be performed to further examine the impacts.

Selection of Sample Size

To determine the proper sample size for our experiment, it is imperative that we set type I and type II error metrics. This will allow us to determine the effects of the relevant factor. Typically, one would use Cohen’s d to find the effect size, and then utilize the G Power tool to calculate total sample size. This is highly dependent on the notion that the factor would only have two levels. Since my factor, doctor, has 48 levels, i could not simply use Cohens d to estimate effect and sample size. I had to perfrom a sensitivity analysis within G Power to estimate the sample size. Our actual sample size is relatively small, so I was unsure if G power would increase or decrease the suggested sample size. Using the typical Type I and Type II errors to be used are .05 and .95, respectively, I received a total sample size of 528, and an effect size of .0.3049884. (This is surprising to me, given previous experiences where our original sample size acted as a population, and our G power sample size was a smaller portion of that populaton.)

Rationale for This Design

This design allows us to evaluate the main effects of the targeted factor on the response variable, without the interaction effects of other factors weighing in.

Randomization

This data is random because it is drawn from 485 random households across the United States. By looking at the physical data, we can observe an assortment of “children,” “access,” “health,” and “doctor” fields. We did not have any say during the gathering of the sample, so we have to accept the data for what it is, with its current level of randomness.

Replication

There are replicates, study observed the exact same factors on each participating household. Each household is an observation, but also a replicate. They would be “repeated” if the same study were performed on the same household more than once.

Blocking

There is blocking in this model. The “children factor” was blocked immediately, since it is a nuisance factor at this point. The more significant blocking was on “access.” This block allowed us to analyze the effects of “doctor” directly on “health,” without interaction effects from “access.”

3. (Statistical) Analysis

The first tet we will focus on is Null Hypothesis testing. Other tests will be explored in the next section.

(Exploratory Data Analysis) Graphics and descriptive summary

Summary Stats for the entire dataset:

  summary(Doctor)
##      doctor       children       access        health         
##  0      :241   1      :171   0.33   : 86   Min.   :-1.524000  
##  1      : 96   2      :142   0.5    : 84   1st Qu.:-1.066000  
##  2      : 55   3      : 93   0.17   : 56   Median :-0.421000  
##  3      : 30   4      : 49   0.42   : 48   Mean   :-0.000041  
##  4      : 14   5      : 19   0.67   : 40   3rd Qu.: 0.657000  
##  6      : 12   6      :  6   0.58   : 37   Max.   : 7.217000  
##  (Other): 37   (Other):  5   (Other):134

Mean Health Metric:

 mean(Doctor$health)
## [1] -4.123711e-05

We must first draw a histogram, to learn more about the distribution.

hist(Doctor$health)

As you can see, the distribution is not normal, but almost more exponential. We cannot assume normality while conducting further analysis.

Boxplot for doctor visits.

  boxplot(Doctor$health~Doctor$doctor, xlab="Doctor Visits", ylab="Health")

After analyzing the box plot for the following factors, there are a lot of things we can conclude. It seems that the trend is as doctor visits increase, the health metric increases as well. The health metric is quite couter-intuative though, as we have to keep in mind that the higher the number is, the less healthy the patient. So from the boxplot we can infer that unhealthy patients are taking more trips to the doctor. This hypothesis does make sense, according to the data. But, one could also argue that going to the doctor more will keep you healthier. This theor is not supported by the boxplot.

Testing

Model 1: ANOVA for the impact of number of doctor visits on health. This is taking into account the entire sample size of 485.

#anova test
model1=aov(Doctor$health~Doctor$doctor)
anova(model1)
## Analysis of Variance Table
## 
## Response: Doctor$health
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## Doctor$doctor  16 151.58  9.4737  5.2592 3.071e-10 ***
## Residuals     468 843.03  1.8013                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The probability that the variation in health is directly caused by randomization is 0.000000000307. The F statistic came out to 5.2592. This being said, we reject our null hypothesis. The number of doctor visits could potentially be enough to predict the overall health of an individual.

The following qqnorm plot confirms the notion that our sample does not fit a normal distribution.

#model normality check
qqnorm(residuals(model1))
qqline(residuals(model1))

testset=Doctor[(Doctor),528]
model2=aov(Doctor$health~Doctor$doctor, data=testset)
anova(model2)
## Analysis of Variance Table
## 
## Response: Doctor$health
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## Doctor$doctor  16 151.58  9.4737  5.2592 3.071e-10 ***
## Residuals     468 843.03  1.8013                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eval=FALSE

This output suggest that the probability that the variation in health is directly caused by randomization is 0.00001174. The F statistic came out to 5.994. This being said, we reject our null hypothesis. The number of doctor visits could potentially be enough to predict the overall health of an individual.

The difference between our G Power n and our original n, in terms of significant impact on the model, is almost negligible.

model3=aov(Doctor$health~Doctor$doctor + Doctor$access)
anova(model3)
## Analysis of Variance Table
## 
## Response: Doctor$health
##                Df Sum Sq Mean Sq F value    Pr(>F)    
## Doctor$doctor  16 151.58  9.4737  5.2397 3.709e-10 ***
## Doctor$access  17  27.59  1.6229  0.8976    0.5772    
## Residuals     451 815.44  1.8081                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This ANOVA suggests that our blocking variable, access, does not have a significant effect on health. THis further solidifies the hypothesis that visits to the doctor is highly correlated with health.

Multiple Models

The only other model that we can consider for main effects, while still blocking “access,” would be ANOVA for the “children” factor. The results can be seen in the following ANOVA. It is safe to say that the number of children in a household does not have a significant impact on the Health of a child.

model4=aov(Doctor$health~Doctor$children)
anova(model4)
## Analysis of Variance Table
## 
## Response: Doctor$health
##                  Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$children   7  11.42  1.6315  0.7915 0.5946
## Residuals       477 983.19  2.0612

We can also model interaction effects between the factors. The first ANOVA will model the interaction between “children”" and “doctor.”

#alternate interaction models
model5=aov(Doctor$health~Doctor$children*Doctor$doctor)
anova(model5)
## Analysis of Variance Table
## 
## Response: Doctor$health
##                                Df Sum Sq Mean Sq F value    Pr(>F)    
## Doctor$children                 7  11.42  1.6315  0.8905    0.5138    
## Doctor$doctor                  16 151.37  9.4604  5.1635 6.273e-10 ***
## Doctor$children:Doctor$doctor  30  42.16  1.4053  0.7670    0.8096    
## Residuals                     431 789.66  1.8322                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(Doctor$children, Doctor$doctor, Doctor$health)

According to our ANOVA, the INteraction effects are insignificant, but according to our interaction plot, we can see that there certainly are interactions amongsth the datafields. They are just negligible in our case apparently.

The second ANOVA for interaction will be between “children” and “access.”

model6=aov(Doctor$health~Doctor$children*Doctor$access)
anova(model6)
## Analysis of Variance Table
## 
## Response: Doctor$health
##                                Df Sum Sq Mean Sq F value Pr(>F)
## Doctor$children                 7  11.42  1.6315  0.7749 0.6087
## Doctor$access                  17  34.25  2.0147  0.9568 0.5067
## Doctor$children:Doctor$access  44  73.03  1.6598  0.7883 0.8328
## Residuals                     416 875.91  2.1056
interaction.plot(Doctor$children, Doctor$access, Doctor$health)

Again, our ANOVA proves that the main effects for access to healthcare are above .05, deeming it insignificant. We also notice again that there are interactions between the factors, but the ANOVA claims that they are insignificant in retrospect.

Our last ANOVA for interaction is between “access” and “doctor.”

model7=aov(Doctor$health~Doctor$access*Doctor$doctor)
anova(model7)
## Analysis of Variance Table
## 
## Response: Doctor$health
##                              Df Sum Sq Mean Sq F value    Pr(>F)    
## Doctor$access                17  33.39  1.9640  1.1813  0.276559    
## Doctor$doctor                16 145.78  9.1113  5.4800 1.372e-10 ***
## Doctor$access:Doctor$doctor  62 168.67  2.7206  1.6363  0.003054 ** 
## Residuals                   389 646.77  1.6626                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
interaction.plot(Doctor$access, Doctor$doctor, Doctor$health)

This ANOVA and interaction chart suggest that there is a significant interaction effect between the access to healthcare, doctor visits, and overall health. This makes sense when applied to the real world as well.

Interval Estimation

Assuming the STDEV(health) of the population to be q.43352, the margin of error for our health metric at a 95% confidence level is 0.1275795. The confidence interval is between -0.1276208 and 0.1275383.

#interval estimation construction
 sd(Doctor$health)
## [1] 1.43352
n=485
sigma=1.43352
sem=sigma/sqrt(n); sem
## [1] 0.06509279
E=qnorm(.975)*sem; E
## [1] 0.1275795
xbar=mean(Doctor$health)
xbar+c(-E, E)
## [1] -0.1276208  0.1275383

Estimation

To best estimate the health metric, one could use the “doctor” factor. As seen in the parameters below, there is a relationship between “doctor” visits and “health.” The higher the number if visits, generally speaking, the higher the health metric (which ultimately signifies poor health.) This is a sound estimation strategy.

coef(model1)
##     (Intercept)  Doctor$doctor1  Doctor$doctor2  Doctor$doctor3 
##      -0.3943444       0.5181881       0.5272535       1.1424444 
##  Doctor$doctor4  Doctor$doctor5  Doctor$doctor6  Doctor$doctor7 
##       0.7937015       0.4282194       1.6350111       1.6524873 
##  Doctor$doctor8  Doctor$doctor9 Doctor$doctor10 Doctor$doctor11 
##       1.2325444       1.9497194       0.8980111       0.5673444 
## Doctor$doctor12 Doctor$doctor15 Doctor$doctor16 Doctor$doctor24 
##      -0.3746556      -0.0766556      -0.6716556       6.1683444 
## Doctor$doctor48 
##       2.4263444

Diagnostics/Model Adequacy Checking

The qualms that I experienced include the following. First off, the sensitivity analysis in GPower kept on giving me a number higher than my population size. This forces me to re-use random datum in my ANOVA. The second issue was that in my Interval Estimation I used Normal distribution, but I am not sure if this is accurate since originally the distribution was more exponential.

If we take a look at the qqnorm plots for the doctor vs health and the children vs health, we can realize that the doctor factor did not have a perfect fit, but it was the closest of the factors.

qqnorm(residuals(model1))
qqline(residuals(model1))

qqnorm(residuals(model4))
qqline(residuals(model4))

4. References to the Literature.

N/A

5. Appendices

Raw Data: Ecdat Doctor data

All Complete R code included in the RMarkDown.