Experimental Setting
System under test
The data being tested are from the Ecdat package in R, and the data set is named OFP. This was a survey of a variety of factors which impacted how often people went to a variety of medical appointments.
The data will be analyzed by combining all of these visits into one variable, and analyzing 4 factors that may have an impact on the number of times a subject visited a doctor in some capacity. This analysis will be done by expressing two 2-level factors and two 3-level factors in the data set as combinations of 2-level factors. Once this is evaluated, a fractional factorial design will be conducted in order to limit the number of experimental runs required to obtain a concrete result. This will culminate in the formation of a model regarding the factors which affect patient visits to a physician.
Factors and Levels
The original testing independent variables of this dataset are as follows:
Sex, with levels male and female
Employed, with levels yes and no
Region, with levels noreast, west, midwest and other
Hlth, with levels excellent, poor, and other
All of these variables are currently assumed to have some effect on the number of visits a patient requires.
Continuous Variables
The response variable, totalvisits is a sum of the number of patient visits to a variety of medical practitioners that was originally surveyed. This represents the number of times that a subject had an issue for which they had to see a medical professional.
Response Variables
The response variable is overall number of visits totalvisits, which is derived as explained above.
Data Overview
head(df)
## ofp ofnp opp opnp emr hosp numchron adldiff age black sex maried
## 1 5 0 0 0 0 1 2 0 6.9 yes male yes
## 2 1 0 2 0 2 0 2 0 7.4 no female yes
## 3 13 0 0 0 3 3 4 1 6.6 yes female no
## 4 16 0 5 0 1 1 2 1 7.6 no male yes
## 5 3 0 0 0 0 0 2 1 7.9 no female yes
## 6 17 0 0 0 0 0 5 1 6.6 no female no
## school faminc employed privins medicaid region hlth
## 1 6 2.8810 yes yes no other other
## 2 10 2.7478 no yes no other other
## 3 10 0.6532 no no yes other poor
## 4 3 0.6588 no yes no other poor
## 5 6 0.6588 no yes no other other
## 6 7 0.3301 no no yes other poor
str(df)
## 'data.frame': 4406 obs. of 19 variables:
## $ ofp : int 5 1 13 16 3 17 9 3 1 0 ...
## $ ofnp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ opp : int 0 2 0 5 0 0 0 0 0 0 ...
## $ opnp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ emr : int 0 2 3 1 0 0 0 0 0 0 ...
## $ hosp : int 1 0 3 1 0 0 0 0 0 0 ...
## $ numchron: int 2 2 4 2 2 5 0 0 0 0 ...
## $ adldiff : int 0 0 1 1 1 1 0 0 0 0 ...
## $ age : num 6.9 7.4 6.6 7.6 7.9 6.6 7.5 8.7 7.3 7.8 ...
## $ black : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 1 1 1 ...
## $ school : int 6 10 10 3 6 7 8 8 8 8 ...
## $ faminc : num 2.881 2.748 0.653 0.659 0.659 ...
## $ employed: Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ privins : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 2 2 2 2 ...
## $ medicaid: Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 1 1 ...
## $ region : Factor w/ 4 levels "other","noreast",..: 1 1 1 1 1 1 3 3 3 3 ...
## $ hlth : Factor w/ 3 levels "other","excellent",..: 1 1 3 3 1 3 1 1 1 1 ...
The data set contains 4406 observations of 19 separate variables.
The following alterations will be performed:
All incomplete cases in the set will be removed.
The ofp, ofnp, opp, and opnp variables will be combined in order to show the total number of doctors visits (excluding hospital visits) a patient made over the span of the study.
The data set will be condensed down to the response variable totalvisits, and the independent variables sex, employed, region, and hlth.
The result of these operations is below. The data set is much smaller and easier to work with now.
head(df2)
## sex employed region hlth totalvisits
## 1 male yes other other 5
## 2 female no other other 3
## 3 female no other poor 13
## 4 male no other poor 21
## 5 female no other other 3
## 6 female no other poor 17
str(df2)
## 'data.frame': 4406 obs. of 5 variables:
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 2 1 1 1 1 1 1 ...
## $ employed : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "other","noreast",..: 1 1 1 1 1 1 3 3 3 3 ...
## $ hlth : Factor w/ 3 levels "other","excellent",..: 1 1 3 3 1 3 1 1 1 1 ...
## $ totalvisits: int 5 3 13 21 3 17 9 3 1 0 ...
summary(df2)
## sex employed region hlth
## female:2628 no :3951 other :1614 other :3509
## male :1778 yes: 455 noreast: 837 excellent: 343
## midwest:1157 poor : 554
## west : 798
##
##
## totalvisits
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 6.000
## Mean : 8.679
## 3rd Qu.: 11.000
## Max. :296.000
Finally, in order to enable a factorial design with levels of high and low, integer factors will replace the character factors currently in the data set. The following changes will be made to the variables:
sex: male = 1, female = 0 employed: yes = 1, no = 0 region: other and midwest = 0, west = 1, noreast = 2 hlth: poor = 0, other = 1, excellent = 2
The decisions regarding combinations and ordering of factors are based on experimenter’s judgment. other seems to describe the south and great plains region of the USA, which fits more closely with the midwest than any of the other regions in the country. other was chosen as the middle level in health factor, as it seems that poor and excellent adequately define the bounds of self-described health.
The results of these transformations are shown below.
## sex employed region health totalvisits
## 1 1 1 0 1 5
## 2 0 0 0 1 3
## 3 0 0 0 0 13
## 4 1 0 0 0 21
## 5 0 0 0 1 3
## 6 0 0 0 0 17
## 'data.frame': 4406 obs. of 5 variables:
## $ sex : num 1 0 0 1 0 0 0 0 0 0 ...
## $ employed : num 1 0 0 0 0 0 0 0 0 0 ...
## $ region : num 0 0 0 0 0 0 0 0 0 0 ...
## $ health : num 1 1 0 0 1 0 1 1 1 1 ...
## $ totalvisits: int 5 3 13 21 3 17 9 3 1 0 ...
## sex employed region health
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.4035 Mean :0.1033 Mean :0.5611 Mean :0.9521
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :2.0000 Max. :2.0000
## totalvisits
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 6.000
## Mean : 8.679
## 3rd Qu.: 11.000
## Max. :296.000
Appendix 2: R Code
# Please note that some variable names are strange as a result of shuffled analysis methods.
# require appropriate libraries and acquire data
require(Ecdat)
require(FrF2)
df <- OFP
# Show top of df
head(df)
# Show structure
str(df)
# Limit to complete cases
df2 <- df[complete.cases(df),]
df2$totalvisits <- df2$ofp + df2$ofnp + df2$opp + df2$opnp # Sum of total visits
df2 <- df2[,c(11,15,18,19,20)] # Only required rows
head(df2) # Top of data
str(df2) # Structure of data
summary(df2) # Summary of data
l = nrow(df2) # Find length of frame
# Create four new frames for writing numbers instead of factors
sexnum = data.frame(l)
emplnum = data.frame(l)
region = data.frame(l)
health = data.frame(l)
## Replacement loop
for (i in 1:l){
# sex replace male with 1 and female with 0
if (df2$sex[i] == "male"){
sexnum[i,1] <- 1
} else {
sexnum[i,1] <- 0
}
# employment replace yes with 1 and no with 0
if (df2$employed[i] == "yes"){
emplnum[i,1] <- 1
} else {
emplnum[i,1] <- 0
}
# region west = 1, other + midwest = 1, noreast = 2
if (df2$region[i] == "west"){
region[i,1] <- 1
}
if (df2$region[i] == "other"){
region[i,1] <- 0
}
if (df2$region[i] == "midwest") {
region[i,1] <- 0
}
if (df2$region[i] == "noreast") {
region[i,1] <- 2
}
#health poor = 0, other = 1, excellent = 2
if (df2$hlth[i] == "poor"){
health[i,1] <- 0
}
if (df2$hlth[i] == "other") {
health[i,1] <- 1
}
if (df2$hlth[i] == "excellent") {
health[i,1] <- 2
}
}
# Create numbered data frame out of column vectors and response variable
df3 <- cbind(sexnum,emplnum,region,health,df2$totalvisits)
# Title columns of data frame appropriately
colnames(df3) <- c("sex","employed","region","health","totalvisits")
# Show head, structure and summary for newest data set
head(df3)
str(df3)
summary(df3)
# Boxplot of overall factors with labeled axes
boxplot(df3$totalvisits ~ df3$sex + df3$employed + df3$region + df3$health, xlab= "sex.employed.region.health",ylab="Total doctor's visits", main="Analysis of Testing Variables")
## Show design
b <- FrF2(8,6,res3 = T) # Create fractional factorial design
print(b) # Show design structure
aliasprint(b) # Show aliasing in data
# 2^3 matrix
a <- matrix(c("-","-","-","+","-","-","-","+","-","+","+","-","-","-","+","+","-","+","-","+","+","+","+","+"),nrow = 8,byrow = T) # Three factor structure
a <- as.table(a) # Create table
a # display table
## Total matrix
a <- matrix(c("-","-","-","+","+","+","+","-","-","-","-","+","-","+","-","-","+","-","+","+","-","+","-","-","-","-","+","+","-","-","+","-","+","-","+","-","-","+","+","-","+","-","+","+","+","+","+","+"),nrow = 8,byrow = T) # Matrix with multiplied new columns, different chunk so name 'a' is reused
a = as.table(a) # Create table
a # show table
## Fractional design comparison
a <- FrF2(8,6, res3 = T, default.levels = c("-","+"), randomize = F) # Unrandomized Fractional design for comparison
a # show table
## Subset creation for factorial design
set001100 <- subset(df3,sex == "0" & employed == "0" & region == "2" & health == "0")
set011001 <- subset(df3,sex == "0" & employed == "1" & region == "1" & health == "1")
set000111 <- subset(df3,sex == "0" & employed == "0" & region == "1" & health == "2")
set111111 <- subset(df3,sex == "1" & employed == "1" & region == "2" & health == "2")
set100001 <- subset(df3,sex == "1" & employed == "0" & region == "0" & health == "1")
set010010 <- subset(df3,sex == "0" & employed == "1" & region == "0" & health == "1")
set110100 <- subset(df3,sex == "1" & employed == "1" & region == "1" & health == "0")
set101010 <- subset(df3,sex == "1" & employed == "0" & region == "1" & health == "1")
# Creation of function to draw two random rows from subsets above
myfun <- function (df){
a <- sample(nrow(df))
b <- a[1]
c <- a[2]
return(c(df$totalvisits[b],df$totalvisits[c]))
}
## Use functions to get group samples
mean_111111 <- myfun(set111111)
mean_101010 <- myfun(set101010)
mean_110100 <- myfun(set110100)
mean_100001 <- myfun(set100001)
mean_001100 <- myfun(set001100)
mean_011001 <- myfun(set011001)
mean_010010 <- myfun(set010010)
mean_000111 <- myfun(set000111)
## Take the mean of whole group to avoid outliers
fullmean_111111 <- mean(set111111$totalvisits)
fullmean_101010 <- mean(set101010$totalvisits)
fullmean_110100 <- mean(set110100$totalvisits)
fullmean_100001 <- mean(set100001$totalvisits)
fullmean_001100 <- mean(set001100$totalvisits)
fullmean_011001 <- mean(set011001$totalvisits)
fullmean_010010 <- mean(set010010$totalvisits)
fullmean_000111 <- mean(set000111$totalvisits)
## Main effect calculations according to equation in report
ME_A1 <- 1/4 * ( (mean_100001[1] + mean_110100[1] + mean_101010[1] + mean_111111[1]) - (mean_000111[1] + mean_010010[1] + mean_011001[1] + mean_001100[1]))
ME_B1 <- 1/4 * ((mean_110100[1] + mean_111111[1] + mean_010010[1] + mean_011001[1]) - (mean_100001[1] + mean_001100[1] + mean_101010[1] + mean_000111[1]))
ME_C1 <- 1/4 * ((mean_011001[1] + mean_001100[1] + mean_101010[1] + mean_111111[1]) - (mean_000111[1] + mean_010010[1] + mean_100001[1] + mean_110100[1]))
ME_D1 <- 1/4 * ((mean_000111[1] + mean_001100[1] + mean_110100[1] + mean_111111[1]) - (mean_010010[1] + mean_100001[1] + mean_101010[1] + mean_011001[1]))
ME_E1 <- 1/4 * ((mean_000111[1] + mean_010010[1] + mean_101010[1] + mean_111111[1] - (mean_011001[1] + mean_001100[1] + mean_110100[1] + mean_100001[1])))
ME_F1 <- 1/4 * ((mean_000111[1] + mean_011001[1] + mean_100001[1] + mean_111111[1] - (mean_010010[1] + mean_001100[1] + mean_110100[1] + mean_101010[1])))
# Interaction effect
IE_AF1 <- 1/4 * ((mean_010010[1] + mean_001100[1] + mean_111111[1] + mean_100001[1] - (mean_000111[1] + mean_011001[1] + mean_110100[1] + mean_101010[1])))
## Second replicate Calculations
ME_A2 <- 1/4 * ( (mean_100001[2] + mean_110100[2] + mean_101010[2] + mean_111111[2]) - (mean_000111[2] + mean_010010[2] + mean_011001[2] + mean_001100[2]))
ME_B2 <- 1/4 * ((mean_110100[2] + mean_111111[2] + mean_010010[2] + mean_011001[2]) - (mean_100001[2] + mean_001100[2] + mean_101010[2] + mean_000111[2]))
ME_C2 <- 1/4 * ((mean_011001[2] + mean_001100[2] + mean_101010[2] + mean_111111[2]) - (mean_000111[2] + mean_010010[2] + mean_100001[2] + mean_110100[2]))
ME_D2 <- 1/4 * ((mean_000111[2] + mean_001100[2] + mean_110100[2] + mean_111111[2]) - (mean_010010[2] + mean_100001[2] + mean_101010[2] + mean_011001[2]))
ME_E2 <- 1/4 * ((mean_000111[2] + mean_010010[2] + mean_101010[2] + mean_111111[2] - (mean_011001[2] + mean_001100[2] + mean_110100[2] + mean_100001[2])))
ME_F2 <- 1/4 * ((mean_000111[2] + mean_011001[2] + mean_100001[2] + mean_111111[2] - (mean_010010[2] + mean_001100[2] + mean_110100[2] + mean_101010[2])))
#Interaction
IE_AF2 <- 1/4 * ((mean_010010[2] + mean_001100[2] + mean_111111[2] + mean_100001[2] - (mean_000111[2] + mean_011001[2] + mean_110100[2] + mean_101010[2])))
## Full group mean calculation
ME_A3 <- 1/4 * ( (fullmean_100001 + fullmean_110100 + fullmean_101010 + fullmean_111111) - (fullmean_000111 + fullmean_010010 + fullmean_011001 + fullmean_001100))
ME_B3 <- 1/4 * ((fullmean_110100 + fullmean_111111 + fullmean_010010 + fullmean_011001) - (fullmean_100001 + fullmean_001100 + fullmean_101010 + fullmean_000111))
ME_C3 <- 1/4 * ((fullmean_011001 + fullmean_001100 + fullmean_101010 + fullmean_111111) - (fullmean_000111 + fullmean_010010 + fullmean_100001 + fullmean_110100))
ME_D3 <- 1/4 * ((fullmean_000111 + fullmean_001100 + fullmean_110100 + fullmean_111111) - (fullmean_010010 + fullmean_100001 + fullmean_101010 + fullmean_011001))
ME_E3 <- 1/4 * ((fullmean_000111 + fullmean_010010 + fullmean_101010 + fullmean_111111 - (fullmean_011001 + fullmean_001100 + fullmean_110100 + fullmean_100001)))
ME_F3 <- 1/4 * ((fullmean_000111 + fullmean_011001 + fullmean_100001 + fullmean_111111 - (fullmean_010010 + fullmean_001100 + fullmean_110100 + fullmean_101010)))
IE_AF3 <- 1/4 * ((fullmean_010010 + fullmean_001100 + fullmean_111111 + fullmean_100001 - (fullmean_000111 + fullmean_011001 + fullmean_110100 + fullmean_101010)))
## Table formation
main_effect_table1 <- matrix(c(ME_A1,ME_B1,ME_C1,ME_D1,ME_E1,ME_F1,ME_D1,ME_E1,ME_B1,ME_C1,IE_AF1, ME_F1, ME_A1, IE_AF1, ME_C1, IE_AF1, ME_A1, ME_B1, ME_F1, ME_E1, ME_D1), ncol = 1) # Display replicate 1 values
main_effect_table1 <- as.table(main_effect_table1) # Convert to table
rownames(main_effect_table1) <- c("A","B","C","D","E","F","AB","AC","AD","AE","AF","BC","BD","BE","BF","CD","CE","CF","DE","DF","EF") # Name rows
colnames(main_effect_table1) <- "Main Effect" # Name column
main_effect_table1 # Show table
# Table 2 simply repeats table 1
main_effect_table2 <- matrix(c(ME_A2,ME_B2,ME_C2,ME_D2,ME_E2,ME_F2,ME_D2,ME_E2,ME_B2,ME_C2,IE_AF2, ME_F2, ME_A2, IE_AF2, ME_C2, IE_AF2, ME_A2, ME_B2, ME_F2, ME_E2, ME_D2), ncol = 1)
main_effect_table2 <- as.table(main_effect_table2)
rownames(main_effect_table2) <- c("A","B","C","D","E","F","AB","AC","AD","AE","AF","BC","BD","BE","BF","CD","CE","CF","DE","DF","EF")
colnames(main_effect_table2) <- "Main Effect"
main_effect_table2
}
# Table 3 repeats the procedure again
main_effect_table3 <- matrix(c(ME_A3,ME_B3,ME_C3,ME_D3,ME_E3,ME_F3,ME_D3,ME_E3,ME_B3,ME_C3,IE_AF3, ME_F3, ME_A3, IE_AF3, ME_C3, IE_AF3, ME_A3, ME_B3, ME_F3, ME_E3, ME_D3), ncol = 1)
main_effect_table3 <- as.table(main_effect_table3)
rownames(main_effect_table3) <- c("A","B","C","D","E","F","AB","AC","AD","AE","AF","BC","BD","BE","BF","CD","CE","CF","DE","DF","EF")
colnames(main_effect_table3) <- "Main Effect"
main_effect_table3
## Anova and Model Diagnostics
av <- aov(totalvisits~sex + employed + region + health + sex:employed + sex:region + sex:health + employed:region + employed:health + region:health, data = df3) # Perform ANOVA
summary(av) # Summary display
av <- aov(totalvisits~region + health, data = df3)# ANOVA for final model
plot(av,which = c(1:2)) # Plot residuals and QQ