Experimental Setting

System under test

The data being tested are from the Ecdat package in R, and the data set is named OFP. This was a survey of a variety of factors which impacted how often people went to a variety of medical appointments.

The data will be analyzed by combining all of these visits into one variable, and analyzing 4 factors that may have an impact on the number of times a subject visited a doctor in some capacity. This analysis will be done by expressing two 2-level factors and two 3-level factors in the data set as combinations of 2-level factors. Once this is evaluated, a fractional factorial design will be conducted in order to limit the number of experimental runs required to obtain a concrete result. This will culminate in the formation of a model regarding the factors which affect patient visits to a physician.

Factors and Levels

The original testing independent variables of this dataset are as follows:

Sex, with levels male and female

Employed, with levels yes and no

Region, with levels noreast, west, midwest and other

Hlth, with levels excellent, poor, and other

All of these variables are currently assumed to have some effect on the number of visits a patient requires.

Continuous Variables

The response variable, totalvisits is a sum of the number of patient visits to a variety of medical practitioners that was originally surveyed. This represents the number of times that a subject had an issue for which they had to see a medical professional.

Response Variables

The response variable is overall number of visits totalvisits, which is derived as explained above.

Data Overview

head(df)
##   ofp ofnp opp opnp emr hosp numchron adldiff age black    sex maried
## 1   5    0   0    0   0    1        2       0 6.9   yes   male    yes
## 2   1    0   2    0   2    0        2       0 7.4    no female    yes
## 3  13    0   0    0   3    3        4       1 6.6   yes female     no
## 4  16    0   5    0   1    1        2       1 7.6    no   male    yes
## 5   3    0   0    0   0    0        2       1 7.9    no female    yes
## 6  17    0   0    0   0    0        5       1 6.6    no female     no
##   school faminc employed privins medicaid region  hlth
## 1      6 2.8810      yes     yes       no  other other
## 2     10 2.7478       no     yes       no  other other
## 3     10 0.6532       no      no      yes  other  poor
## 4      3 0.6588       no     yes       no  other  poor
## 5      6 0.6588       no     yes       no  other other
## 6      7 0.3301       no      no      yes  other  poor
str(df)
## 'data.frame':    4406 obs. of  19 variables:
##  $ ofp     : int  5 1 13 16 3 17 9 3 1 0 ...
##  $ ofnp    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ opp     : int  0 2 0 5 0 0 0 0 0 0 ...
##  $ opnp    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ emr     : int  0 2 3 1 0 0 0 0 0 0 ...
##  $ hosp    : int  1 0 3 1 0 0 0 0 0 0 ...
##  $ numchron: int  2 2 4 2 2 5 0 0 0 0 ...
##  $ adldiff : int  0 0 1 1 1 1 0 0 0 0 ...
##  $ age     : num  6.9 7.4 6.6 7.6 7.9 6.6 7.5 8.7 7.3 7.8 ...
##  $ black   : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
##  $ maried  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 1 1 1 ...
##  $ school  : int  6 10 10 3 6 7 8 8 8 8 ...
##  $ faminc  : num  2.881 2.748 0.653 0.659 0.659 ...
##  $ employed: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ privins : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 2 2 2 2 ...
##  $ medicaid: Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "other","noreast",..: 1 1 1 1 1 1 3 3 3 3 ...
##  $ hlth    : Factor w/ 3 levels "other","excellent",..: 1 1 3 3 1 3 1 1 1 1 ...

The data set contains 4406 observations of 19 separate variables.

  1. The following alterations will be performed:

  2. All incomplete cases in the set will be removed.

  3. The ofp, ofnp, opp, and opnp variables will be combined in order to show the total number of doctors visits (excluding hospital visits) a patient made over the span of the study.

  4. The data set will be condensed down to the response variable totalvisits, and the independent variables sex, employed, region, and hlth.

The result of these operations is below. The data set is much smaller and easier to work with now.

head(df2)
##      sex employed region  hlth totalvisits
## 1   male      yes  other other           5
## 2 female       no  other other           3
## 3 female       no  other  poor          13
## 4   male       no  other  poor          21
## 5 female       no  other other           3
## 6 female       no  other  poor          17
str(df2)
## 'data.frame':    4406 obs. of  5 variables:
##  $ sex        : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
##  $ employed   : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region     : Factor w/ 4 levels "other","noreast",..: 1 1 1 1 1 1 3 3 3 3 ...
##  $ hlth       : Factor w/ 3 levels "other","excellent",..: 1 1 3 3 1 3 1 1 1 1 ...
##  $ totalvisits: int  5 3 13 21 3 17 9 3 1 0 ...
summary(df2)
##      sex       employed       region            hlth     
##  female:2628   no :3951   other  :1614   other    :3509  
##  male  :1778   yes: 455   noreast: 837   excellent: 343  
##                           midwest:1157   poor     : 554  
##                           west   : 798                   
##                                                          
##                                                          
##   totalvisits     
##  Min.   :  0.000  
##  1st Qu.:  2.000  
##  Median :  6.000  
##  Mean   :  8.679  
##  3rd Qu.: 11.000  
##  Max.   :296.000

Finally, in order to enable a factorial design with levels of high and low, integer factors will replace the character factors currently in the data set. The following changes will be made to the variables:

sex: male = 1, female = 0 employed: yes = 1, no = 0 region: other and midwest = 0, west = 1, noreast = 2 hlth: poor = 0, other = 1, excellent = 2

The decisions regarding combinations and ordering of factors are based on experimenter’s judgment. other seems to describe the south and great plains region of the USA, which fits more closely with the midwest than any of the other regions in the country. other was chosen as the middle level in health factor, as it seems that poor and excellent adequately define the bounds of self-described health.

The results of these transformations are shown below.

##   sex employed region health totalvisits
## 1   1        1      0      1           5
## 2   0        0      0      1           3
## 3   0        0      0      0          13
## 4   1        0      0      0          21
## 5   0        0      0      1           3
## 6   0        0      0      0          17
## 'data.frame':    4406 obs. of  5 variables:
##  $ sex        : num  1 0 0 1 0 0 0 0 0 0 ...
##  $ employed   : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ region     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ health     : num  1 1 0 0 1 0 1 1 1 1 ...
##  $ totalvisits: int  5 3 13 21 3 17 9 3 1 0 ...
##       sex            employed          region           health      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.4035   Mean   :0.1033   Mean   :0.5611   Mean   :0.9521  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :2.0000   Max.   :2.0000  
##   totalvisits     
##  Min.   :  0.000  
##  1st Qu.:  2.000  
##  Median :  6.000  
##  Mean   :  8.679  
##  3rd Qu.: 11.000  
##  Max.   :296.000

Experimental Design

This experiment seeks to observe the impact of a variety of conditions on the number of doctor’s visits required by a subject. The choice of a fractional factorial design was made in order to reduce the computing power and runs necessary to reach an adequate conclusion.

How will the experiment be organized and conducted?

A fractional factorial design of the 2m-3 type will be used in order to simplify the calculations. The three level factor variables will be decomposed into 2 two-level factors for the purposes of calculating the required data necessary to obtain the appropriate data. For these, the sum of the two-level variables will yield the 3-level value that is chosen. (This is the reason for the selection of 0-2 as levels instead of -1-1.) Replicates will be used in order to obtain the highest level of accuracy. No one factor combination will have more replicates than another, so replication may be somewhat limited, depending on the size of the subsets.

The fractional factorial design will be used in order to compute the resolution of the design, and then main and any interaction effects. The results will be assessed for aliasing in order to determine the true amount of data that results from a more limited fractional design. Once main and interaction effects are computed, a model will be created through the combination of these effects and ANOVA analysis. This will determine the appropriate factors that will create the final model.

What is the rationale for this design?

A fractional factorial model will be used to reduce the computing power required to obtain a reasonably accurate solution. While it is not strictly necessary in analyzing a pre-created data set, it would vastly simplify the process if the data set were to be created from scratch, or was taken in a manner other than a survey. Alias testing will be used in order to determine the true accuracy and resolution of the design. This allows the researcher to know the limitations of this level of design, and determine if another method is more appropriate for their needs.

Assumptions

This data set meets many random design assumptions. It has many replicates overall, although some group replicates are more limited. Blocking will not be used in this design, but the data set does have the ability to block if it is deemed necessary. Overall, this is a strong data set for our purposes.

Statistical Analysis

Descriptive Summary

Boxplots of the response variable totalvisits will be displayed, sorted by each combination of the independent factors.

This data set has some flaws. It is very heavy on outliers. While the means of the data set appear to be relatively similar, some combinations have severe outliers, which compromise the overall utility of the data set. In particular, ANOVA is not likely to be useful in many of these tests. However, as non-parametric tests have not been covered, I will be moving forward under the assumption that this data set is usable, in spite of these flaws.

Testing

The hypotheses for the test are as follows:

Ho - There is no statistically significant difference between the responses due to differing factor levels of the independent variables

Ha - There is a statistically significant difference between the responses due to the factor level of the independent variables

The testing will be performed using main and interaction effect analysis, and then an ANOVA will be used to corroborate and create the final model

Treatment Structure

In order to adequately test every interaction of the experiment, it would be necessary to have 64 runs. This design would look as follows.

## creating full factorial with 64 runs ...
##     A  B  C  D  E  F
## 1   1  1  1  1 -1 -1
## 2  -1 -1 -1 -1  1 -1
## 3   1 -1  1 -1 -1 -1
## 4   1  1  1 -1  1  1
## 5   1 -1  1 -1  1  1
## 6  -1 -1  1  1 -1 -1
## 7  -1  1  1 -1 -1 -1
## 8  -1 -1 -1  1  1 -1
## 9   1 -1 -1  1 -1 -1
## 10  1  1 -1  1 -1 -1
## 11 -1  1  1  1  1  1
## 12  1 -1  1  1  1  1
## 13 -1  1  1  1  1 -1
## 14  1 -1  1 -1 -1  1
## 15 -1  1 -1  1 -1  1
## 16 -1  1 -1 -1 -1  1
## 17 -1  1 -1 -1 -1 -1
## 18  1 -1 -1  1  1  1
## 19 -1 -1  1 -1  1  1
## 20  1  1  1  1  1  1
## 21 -1  1 -1  1 -1 -1
## 22 -1  1 -1  1  1 -1
## 23  1  1  1 -1 -1 -1
## 24 -1  1 -1  1  1  1
## 25 -1  1  1  1 -1 -1
## 26 -1 -1  1  1  1  1
## 27  1  1  1  1  1 -1
## 28  1  1 -1 -1  1  1
## 29  1  1 -1  1 -1  1
## 30  1 -1  1  1  1 -1
## 31 -1 -1  1  1 -1  1
## 32  1 -1  1 -1  1 -1
## 33  1  1 -1 -1 -1  1
## 34 -1 -1 -1  1 -1 -1
## 35  1  1 -1  1  1 -1
## 36  1 -1 -1 -1  1  1
## 37 -1  1 -1 -1  1  1
## 38  1  1 -1  1  1  1
## 39 -1 -1  1 -1 -1 -1
## 40  1 -1 -1 -1 -1 -1
## 41  1  1  1 -1  1 -1
## 42  1  1 -1 -1 -1 -1
## 43 -1 -1 -1  1 -1  1
## 44  1 -1 -1  1  1 -1
## 45 -1 -1 -1 -1 -1 -1
## 46 -1  1  1 -1 -1  1
## 47 -1 -1 -1 -1 -1  1
## 48 -1 -1  1 -1 -1  1
## 49 -1  1  1  1 -1  1
## 50 -1  1  1 -1  1 -1
## 51 -1 -1 -1  1  1  1
## 52  1  1  1  1 -1  1
## 53  1  1 -1 -1  1 -1
## 54 -1 -1  1  1  1 -1
## 55  1 -1 -1  1 -1  1
## 56  1 -1 -1 -1 -1  1
## 57  1 -1  1  1 -1  1
## 58 -1  1 -1 -1  1 -1
## 59  1  1  1 -1 -1  1
## 60 -1  1  1 -1  1  1
## 61 -1 -1  1 -1  1 -1
## 62 -1 -1 -1 -1  1  1
## 63  1 -1 -1 -1  1 -1
## 64  1 -1  1  1 -1 -1
## class=design, type= full factorial

A fractional factorial design will be computed using the FrF2 function in R. This will be done as a 26-3 design. The output is shown below, and has 8 runs, as is expected in this type of design:

##    A  B  C  D  E  F
## 1  1 -1 -1 -1 -1  1
## 2  1 -1  1 -1  1 -1
## 3  1  1 -1  1 -1 -1
## 4 -1 -1  1  1 -1 -1
## 5 -1  1  1 -1 -1  1
## 6  1  1  1  1  1  1
## 7 -1 -1 -1  1  1  1
## 8 -1  1 -1 -1  1 -1
## class=design, type= FrF2
## $legend
## [1] A=A B=B C=C D=D E=E F=F
## 
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD

Subsets of the full data set will be created for each of these level combinations. In this design, the sum of the regionl and regionx variables will show the actual factor level in the 3-level factors that are being decomposed in the design above.

It is relatively easy to see that the model is highly aliased. This is expected, as the model is only resolution III. It appears that only the main effects are confounded with two level interactions, meaning that there are not a significant number of effects that can be separated from each other. Only the main effect results are reliable.

ISYE 6020: Calculating generator I

To start, we can create a full factorial design matrix for the matrix which has 2k-p factors. With k = 6 and p = 3, that is a 23 design matrix, as shown below:

##   A B C
## A - - -
## B + - -
## C - + -
## D + + -
## E - - +
## F + - +
## G - + +
## H + + +

From this point, the remaining k-p = 3 rows are filled in using the two factor interactions AB, AC, BC. As this is a low resolution design, the ABC product will not always be the same sign. Following this produuct, the final matrix looks as follows:

##   A B C D E F
## A - - - + + +
## B + - - - - +
## C - + - - + -
## D + + - + - -
## E - - + + - -
## F + - + - + -
## G - + + - + -
## H + + + + + +

Comparing this to the output of an unrandomized fractional factorial generator created using FrF2 in R yields the same result, demonstrating that the generator is correctly calculated.

a <- FrF2(8,6, res3 = T, default.levels = c("-","+"), randomize = F)
a
##   A B C D E F
## 1 - - - + + +
## 2 + - - - - +
## 3 - + - - + -
## 4 + + - + - -
## 5 - - + + - -
## 6 + - + - + -
## 7 - + + - - +
## 8 + + + + + +
## class=design, type= FrF2

Based on the interactions D = AB, E = AC, and F = BC, the generators result in a generator of I = ABD = ACE = BCF.

Adding a generator row for I leads to the following matrix. We can corroborate that ABD, ACE, and BCF are all equal to I.

##   I A B C D E F
## A + - - - + + +
## B + + - - - - +
## C + - + - - + -
## D + + + - + - -
## E + - - + + - -
## F + + - + - + -
## G + - + + - - +
## H + + + + + + +

Main and Interaction Effect Calculation

In order to examine the main effects of the groups, we will calculate main effects using 2 replicates of the totalvisits output of the subsets shown in the design matrix. The subsets are calculated so that any sum of 1 in the three-factor variable is considered the medium level, where 0 and 2 are low and high, respectively. It is worth noting that this method is susceptible to outliers within this data set, and assumes that any values obtained are based off of “normal” values in the data set.

The main effects will be calculated using the following formula with all of the runs for each factor:

\[ME_n = \frac{\sum{totalvisits_{n=1}} - \sum{totalvisits_{n=0}}}{4}\]

This equation produces the average difference between the values where the main effect variable is 1 and where it is 0. This is averaged over the four differences that can be taken from the runs in the design.

Effects Table
Due to the aliasing, once the main effects and interaction effect AF are computed, there are no more computations left to be done. The table below shows the main and and all of the interaction effects. However, as the aliasing is so severe in this model, only the first 6 effects and interaction effect AF=BE=CD can be distinguished as separate and accurate. All higher level interactions are assumed to be negligible, by virtue of the fractional design paradigm.

Replicate 1

##    Main Effect
## A          2.5
## B          3.0
## C         -6.0
## D          8.5
## E        -10.5
## F         -8.0
## AB         8.5
## AC       -10.5
## AD         3.0
## AE        -6.0
## AF        -4.5
## BC        -8.0
## BD         2.5
## BE        -4.5
## BF        -6.0
## CD        -4.5
## CE         2.5
## CF         3.0
## DE        -8.0
## DF       -10.5
## EF         8.5

Replicate 2

##    Main Effect
## A          2.5
## B          2.5
## C         -4.0
## D         -5.5
## E          4.0
## F         -8.0
## AB        -5.5
## AC         4.0
## AD         2.5
## AE        -4.0
## AF        -1.0
## BC        -8.0
## BD         2.5
## BE        -1.0
## BF        -4.0
## CD        -1.0
## CE         2.5
## CF         2.5
## DE        -8.0
## DF         4.0
## EF        -5.5

It is easy to see that there is a significant amount of variance between the main effects between the two replicates, making it difficult to determine what effects are significant and in what direction (positive or negative).

Analysis of Full Data Set
Due to massive inconsistencies between replicates and runs of the data shown above due to significant outliers, the mean totalvisits of each run in the design is used instead of taking a random sample of n=2 (smallest subgroup size). The results from this point onward will be discussed using those numbers, shown below.

##    Main Effect
## A   -0.2269288
## B   -0.1762097
## C   -1.2255017
## D    2.8773048
## E   -6.6480638
## F   -6.5169795
## AB   2.8773048
## AC  -6.6480638
## AD  -0.1762097
## AE  -1.2255017
## AF  -2.8594550
## BC  -6.5169795
## BD  -0.2269288
## BE  -2.8594550
## BF  -1.2255017
## CD  -2.8594550
## CE  -0.2269288
## CF  -0.1762097
## DE  -6.5169795
## DF  -6.6480638
## EF   2.8773048

Model Construction

From the table above, we can see that of the 6 factors (decomposed from 4), there are 4 main effects that are reasonably large (C, D, E, and F). These are the four factors that combine to form the three level factors region and health. As a result, these four (decomposed from two) factors are tentatively inserted into the model.

In order to corroborate the insertion of these values into the model, we will conduct an ANOVA to observe significance. This will be conducted on the full data set, with only 4 factors, instead of the 6 that the original set was decomposed to, in order to ensure that the ANOVA reflects the differences in the variables.

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## sex                1    451     451   3.259 0.0711 .  
## employed           1    311     311   2.249 0.1338    
## region             1    797     797   5.758 0.0165 *  
## health             1  10959   10959  79.140 <2e-16 ***
## sex:employed       1      1       1   0.004 0.9485    
## sex:region         1     11      11   0.078 0.7796    
## sex:health         1    132     132   0.954 0.3287    
## employed:region    1      5       5   0.034 0.8546    
## employed:health    1     93      93   0.669 0.4134    
## region:health      1    173     173   1.252 0.2632    
## Residuals       4395 608624     138                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of the ANOVA seem to corroborate the results of the ME analysis. The factors region and health are statistically significant, and none of the other factors are. This results in the model looking as follows:

\[totalvisits = Intercept + \beta_1*region + \beta_2*health\]

or alternatively as

\[totalvisits = Intercept - 1.225*C + 2.877*D -6.65*E - 6.51*F\]

Model Diagnostics

Based on the ANOVA above, the residuals are skewed significantly above 0, and deviate significantly from the norm on the higher values. This is most likely a limitation of the cut-down data set, where important explanatory factors are lost. There is a significant amount of person to person variation in the number of doctors visits present in the data set (some of which is likely due to luck), and outliers are going to tend significantly higher, as a subject can’t visit the doctor fewer than 0 times.

Appendix 1: Raw Data

The data was drawn from the R ecdat library, and the dataset used was OFP. The structure and the head of the data set can be seen in the section Experimental Setting - Data Overview

Appendix 2: R Code

# Please note that some variable names are strange as a result of shuffled analysis methods. 

# require appropriate libraries and acquire data
require(Ecdat)
require(FrF2)
df <- OFP

# Show top of df
head(df)

# Show structure
str(df)

# Limit to complete cases
df2 <- df[complete.cases(df),]
df2$totalvisits <- df2$ofp + df2$ofnp + df2$opp + df2$opnp # Sum of total visits

df2 <- df2[,c(11,15,18,19,20)] # Only required rows

head(df2) # Top of data

str(df2) # Structure of data

summary(df2) # Summary of data

l = nrow(df2) # Find length of frame

# Create four new frames for writing numbers instead of factors 
sexnum = data.frame(l)
emplnum = data.frame(l)
region = data.frame(l)
health = data.frame(l)

## Replacement loop
for (i in 1:l){
  
  # sex replace male with 1 and female with 0
  if (df2$sex[i] == "male"){
    sexnum[i,1] <- 1
    }  else {
    sexnum[i,1] <- 0
   }
  
  # employment replace yes with 1 and no with 0
  if (df2$employed[i] == "yes"){
    emplnum[i,1] <- 1
   }  else {
    emplnum[i,1] <- 0
   }
  
  # region west = 1, other + midwest = 1, noreast = 2
  if (df2$region[i] == "west"){
    
    region[i,1] <- 1
  }
  if (df2$region[i] == "other"){
    region[i,1] <- 0
  }  
  if (df2$region[i] == "midwest") {
    region[i,1] <- 0
  } 
  if (df2$region[i] == "noreast") {
    region[i,1] <- 2
  }
  
  #health poor = 0, other = 1, excellent = 2
  if (df2$hlth[i] == "poor"){
    health[i,1] <- 0
  }
  
  if (df2$hlth[i] == "other") {
    health[i,1] <- 1
  } 
  if (df2$hlth[i] == "excellent") {
    health[i,1] <- 2
  }
  
}

# Create numbered data frame out of column vectors and response variable 
df3 <- cbind(sexnum,emplnum,region,health,df2$totalvisits)

# Title columns of data frame appropriately
colnames(df3) <- c("sex","employed","region","health","totalvisits")

# Show head, structure and summary for newest data set
head(df3)

str(df3)

summary(df3)

# Boxplot of overall factors with labeled axes 
boxplot(df3$totalvisits ~ df3$sex + df3$employed + df3$region + df3$health, xlab= "sex.employed.region.health",ylab="Total doctor's visits", main="Analysis of Testing Variables")

## Show design
b <- FrF2(8,6,res3 = T) # Create fractional factorial design

print(b) # Show design structure

aliasprint(b) # Show aliasing in data

# 2^3 matrix
a <- matrix(c("-","-","-","+","-","-","-","+","-","+","+","-","-","-","+","+","-","+","-","+","+","+","+","+"),nrow = 8,byrow = T) # Three factor structure

a <- as.table(a) # Create table
a # display table

## Total matrix 
a <- matrix(c("-","-","-","+","+","+","+","-","-","-","-","+","-","+","-","-","+","-","+","+","-","+","-","-","-","-","+","+","-","-","+","-","+","-","+","-","-","+","+","-","+","-","+","+","+","+","+","+"),nrow = 8,byrow = T) # Matrix with multiplied new columns, different chunk so name 'a' is reused

a = as.table(a) # Create table
a # show table

## Fractional design comparison
a <- FrF2(8,6, res3 = T, default.levels = c("-","+"), randomize = F) # Unrandomized Fractional design for comparison

a # show table

## Subset creation for factorial design
set001100 <- subset(df3,sex == "0" & employed == "0" & region == "2" & health == "0")

set011001 <- subset(df3,sex == "0" & employed == "1" & region == "1" & health == "1")

set000111 <- subset(df3,sex == "0" & employed == "0" & region == "1" & health == "2")

set111111 <- subset(df3,sex == "1" & employed == "1" & region == "2" & health == "2")

set100001 <- subset(df3,sex == "1" & employed == "0" & region == "0" & health == "1")

set010010 <- subset(df3,sex == "0" & employed == "1" & region == "0" & health == "1")

set110100 <- subset(df3,sex == "1" & employed == "1" & region == "1" & health == "0")

set101010 <- subset(df3,sex == "1" & employed == "0" & region == "1" & health == "1")

# Creation of function to draw two random rows from subsets above
myfun <- function (df){
  a <- sample(nrow(df))
  b <- a[1]
  c <- a[2]
  
  
  return(c(df$totalvisits[b],df$totalvisits[c]))
}

## Use  functions to get group samples
mean_111111 <- myfun(set111111)
mean_101010 <- myfun(set101010)
mean_110100 <- myfun(set110100)
mean_100001 <- myfun(set100001)
mean_001100 <- myfun(set001100)
mean_011001 <- myfun(set011001)
mean_010010 <- myfun(set010010)
mean_000111 <- myfun(set000111)

## Take the mean of whole group to avoid outliers
fullmean_111111 <- mean(set111111$totalvisits)
fullmean_101010 <- mean(set101010$totalvisits)
fullmean_110100 <- mean(set110100$totalvisits)
fullmean_100001 <- mean(set100001$totalvisits)
fullmean_001100 <- mean(set001100$totalvisits)
fullmean_011001 <- mean(set011001$totalvisits)
fullmean_010010 <- mean(set010010$totalvisits)
fullmean_000111 <- mean(set000111$totalvisits)


## Main effect calculations according to equation in report
ME_A1 <- 1/4 * ( (mean_100001[1] + mean_110100[1] + mean_101010[1] + mean_111111[1]) - (mean_000111[1] + mean_010010[1] + mean_011001[1] + mean_001100[1]))

ME_B1 <- 1/4 * ((mean_110100[1] + mean_111111[1] + mean_010010[1] + mean_011001[1]) - (mean_100001[1] + mean_001100[1] + mean_101010[1] + mean_000111[1]))

ME_C1 <- 1/4 * ((mean_011001[1] + mean_001100[1] + mean_101010[1] + mean_111111[1]) - (mean_000111[1] + mean_010010[1] + mean_100001[1] + mean_110100[1]))

ME_D1 <- 1/4 * ((mean_000111[1] + mean_001100[1] + mean_110100[1] + mean_111111[1]) - (mean_010010[1] + mean_100001[1] + mean_101010[1] + mean_011001[1]))

ME_E1 <- 1/4 * ((mean_000111[1]  + mean_010010[1] + mean_101010[1] + mean_111111[1] - (mean_011001[1] + mean_001100[1] + mean_110100[1] + mean_100001[1])))
  
ME_F1 <- 1/4 * ((mean_000111[1] + mean_011001[1] + mean_100001[1] + mean_111111[1] - (mean_010010[1] + mean_001100[1] + mean_110100[1] + mean_101010[1])))

# Interaction effect
IE_AF1 <- 1/4 * ((mean_010010[1] + mean_001100[1] + mean_111111[1] + mean_100001[1] - (mean_000111[1] + mean_011001[1] + mean_110100[1] + mean_101010[1])))

## Second replicate Calculations

ME_A2 <- 1/4 * ( (mean_100001[2] + mean_110100[2] + mean_101010[2] + mean_111111[2]) - (mean_000111[2] + mean_010010[2] + mean_011001[2] + mean_001100[2]))

ME_B2 <- 1/4 * ((mean_110100[2] + mean_111111[2] + mean_010010[2] + mean_011001[2]) - (mean_100001[2] + mean_001100[2] + mean_101010[2] + mean_000111[2]))

ME_C2 <- 1/4 * ((mean_011001[2] + mean_001100[2] + mean_101010[2] + mean_111111[2]) - (mean_000111[2] + mean_010010[2] + mean_100001[2] + mean_110100[2]))

ME_D2 <- 1/4 * ((mean_000111[2] + mean_001100[2] + mean_110100[2] + mean_111111[2]) - (mean_010010[2] + mean_100001[2] + mean_101010[2] + mean_011001[2]))

ME_E2 <- 1/4 * ((mean_000111[2]  + mean_010010[2] + mean_101010[2] + mean_111111[2] - (mean_011001[2] + mean_001100[2] + mean_110100[2] + mean_100001[2])))
  
ME_F2 <- 1/4 * ((mean_000111[2] + mean_011001[2] + mean_100001[2] + mean_111111[2] - (mean_010010[2] + mean_001100[2] + mean_110100[2] + mean_101010[2])))

#Interaction
IE_AF2 <- 1/4 * ((mean_010010[2] + mean_001100[2] + mean_111111[2] + mean_100001[2] - (mean_000111[2] + mean_011001[2] + mean_110100[2] + mean_101010[2])))


## Full group mean calculation

ME_A3 <- 1/4 * ( (fullmean_100001 + fullmean_110100 + fullmean_101010 + fullmean_111111) - (fullmean_000111 + fullmean_010010 + fullmean_011001 + fullmean_001100))

ME_B3 <- 1/4 * ((fullmean_110100 + fullmean_111111 + fullmean_010010 + fullmean_011001) - (fullmean_100001 + fullmean_001100 + fullmean_101010 + fullmean_000111))

ME_C3 <- 1/4 * ((fullmean_011001 + fullmean_001100 + fullmean_101010 + fullmean_111111) - (fullmean_000111 + fullmean_010010 + fullmean_100001 + fullmean_110100))

ME_D3 <- 1/4 * ((fullmean_000111 + fullmean_001100 + fullmean_110100 + fullmean_111111) - (fullmean_010010 + fullmean_100001 + fullmean_101010 + fullmean_011001))

ME_E3 <- 1/4 * ((fullmean_000111  + fullmean_010010 + fullmean_101010 + fullmean_111111 - (fullmean_011001 + fullmean_001100 + fullmean_110100 + fullmean_100001)))
  
ME_F3 <- 1/4 * ((fullmean_000111 + fullmean_011001 + fullmean_100001 + fullmean_111111 - (fullmean_010010 + fullmean_001100 + fullmean_110100 + fullmean_101010)))

IE_AF3 <- 1/4 * ((fullmean_010010 + fullmean_001100 + fullmean_111111 + fullmean_100001 - (fullmean_000111 + fullmean_011001 + fullmean_110100 + fullmean_101010)))

## Table formation
main_effect_table1 <- matrix(c(ME_A1,ME_B1,ME_C1,ME_D1,ME_E1,ME_F1,ME_D1,ME_E1,ME_B1,ME_C1,IE_AF1, ME_F1, ME_A1, IE_AF1, ME_C1, IE_AF1, ME_A1, ME_B1, ME_F1, ME_E1, ME_D1), ncol = 1) # Display replicate 1 values
  
  

main_effect_table1 <- as.table(main_effect_table1) # Convert to table

rownames(main_effect_table1) <- c("A","B","C","D","E","F","AB","AC","AD","AE","AF","BC","BD","BE","BF","CD","CE","CF","DE","DF","EF") # Name rows

colnames(main_effect_table1) <- "Main Effect" # Name column

main_effect_table1 # Show table


# Table 2 simply repeats table 1 
main_effect_table2 <- matrix(c(ME_A2,ME_B2,ME_C2,ME_D2,ME_E2,ME_F2,ME_D2,ME_E2,ME_B2,ME_C2,IE_AF2, ME_F2, ME_A2, IE_AF2, ME_C2, IE_AF2, ME_A2, ME_B2, ME_F2, ME_E2, ME_D2), ncol = 1)
  
main_effect_table2 <- as.table(main_effect_table2)

rownames(main_effect_table2) <- c("A","B","C","D","E","F","AB","AC","AD","AE","AF","BC","BD","BE","BF","CD","CE","CF","DE","DF","EF")

colnames(main_effect_table2) <- "Main Effect"

main_effect_table2
}

# Table 3 repeats the procedure again

main_effect_table3 <- matrix(c(ME_A3,ME_B3,ME_C3,ME_D3,ME_E3,ME_F3,ME_D3,ME_E3,ME_B3,ME_C3,IE_AF3, ME_F3, ME_A3, IE_AF3, ME_C3, IE_AF3, ME_A3, ME_B3, ME_F3, ME_E3, ME_D3), ncol = 1)
  
  

main_effect_table3 <- as.table(main_effect_table3)

rownames(main_effect_table3) <- c("A","B","C","D","E","F","AB","AC","AD","AE","AF","BC","BD","BE","BF","CD","CE","CF","DE","DF","EF")

colnames(main_effect_table3) <- "Main Effect"

main_effect_table3

##  Anova and Model Diagnostics
av <- aov(totalvisits~sex + employed + region + health + sex:employed + sex:region + sex:health + employed:region + employed:health + region:health, data = df3) # Perform ANOVA

summary(av) # Summary display

av <- aov(totalvisits~region + health, data = df3)# ANOVA for final model

plot(av,which = c(1:2)) # Plot residuals and QQ