Bok, Joonhyuk

Rensselaer Polytechnic Institute

Dec 16 2016 version 3

1.1 System under test

From the datasets in the Ecdat R package, we select “Mathlevel”, which is useful for predicting the SAT Math score.

From the Mathlevel data, we select ‘language’, ‘sex’, ‘physiccourse’, and ‘chemistcourse’ as factors that could explain the SAT Math score; they have 2, 2, 3, and 3 levels, respectively. We choose ‘sat’ as the response variable.

The original dataset was collected to examine the effect of math level on learning economics and consists of 8 variables and 609 observations. The original data frame contains:

mathlevel highest level of math attained, an ordered factor with levels 170, 171a, 172, 171b, 172b, 221a, 221b
sat SAT Math score
language status of foreign language proficiency
sex male, female
major one of other, eco, oss (other social sciences), ns (natural sciences), hum (humanities)
mathcourse number of courses in advanced math (0 to 3)
physiccourse number of courses in physics (0 to 2)
chemistcourse number of courses in chemistry (0 to 2)

We use a subset of these variables as factors in the analysis for Project 3.

We can inspect the head and tail of the dataframe in order to see what the data look like.

load("C:/Users/bokjh3/Desktop/Ecdat_0.2-9/Ecdat/data/Mathlevel.rda")
head(Mathlevel, n=10)
##    mathlevel sat language    sex major mathcourse physiccourse
## 1        170 670       no   male    ns          1            2
## 2        170 660       no   male other          1            1
## 3        170 610       no female   eco          1            0
## 4        170 620      yes   male   eco          1            0
## 5        170 430       no   male   eco          0            1
## 6        170 580       no female   oss          2            1
## 7        170 550      yes female other          1            0
## 8        170 510       no female   eco          1            1
## 9        170 560      yes   male   hum          1            0
## 10       170 670       no   male   oss          1            0
##    chemistcourse
## 1              1
## 2              1
## 3              1
## 4              1
## 5              1
## 6              1
## 7              1
## 8              1
## 9              0
## 10             1
tail(Mathlevel, n=10)
##     mathlevel sat language    sex major mathcourse physiccourse
## 600      221b 660       no female    ns          2            1
## 601      221b 670       no female    ns          2            1
## 602      221b 670       no   male other          2            1
## 603      221b 660       no   male    ns          1            0
## 604      221b 590       no female    ns          2            1
## 605      221b 580       no female   oss          2            1
## 606      221b 770       no   male   oss          2            1
## 607      221b 660       no   male other          2            1
## 608      221b 710       no female   eco          2            0
## 609      221b 590       no female   oss          2            0
##     chemistcourse
## 600             1
## 601             1
## 602             1
## 603             1
## 604             1
## 605             1
## 606             1
## 607             1
## 608             1
## 609             1

1.2 Factor and Levels

The four selected factors are language, sex, physiccourse, and chemistcourse. We convert physiccourse and chemistcourse from numeric variables to factors for the analysis.

language status of foreign language proficiency
sex male, female
physiccourse number of courses in physics (0 to 2)
chemistcourse number of courses in chemistry (0 to 2)

Mathlevel$physiccourse <- as.factor(Mathlevel$physiccourse)
Mathlevel$chemistcourse <- as.factor(Mathlevel$chemistcourse)
str(Mathlevel)
## 'data.frame':    609 obs. of  8 variables:
##  $ mathlevel    : Ord.factor w/ 7 levels "170"<"171a"<"172a"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sat          : int  670 660 610 620 430 580 550 510 560 670 ...
##  $ language     : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 2 1 ...
##  $ sex          : Factor w/ 2 levels "male","female": 1 1 2 1 1 2 2 2 1 1 ...
##  $ major        : Factor w/ 5 levels "other","eco",..: 4 1 2 2 2 3 1 2 5 3 ...
##  $ mathcourse   : num  1 1 1 1 0 2 1 1 1 1 ...
##  $ physiccourse : Factor w/ 3 levels "0","1","2": 3 2 1 1 2 2 1 2 1 1 ...
##  $ chemistcourse: Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 1 2 ...

1.3 Continuous variables

sat is recorded as an integer score and is treated as a continuous variable. To conduct ANOVA, a continuous dependent variable must be chosen as the response from the variables in the dataset.

1.4 Response variables

The response variable is sat, the SAT Math score, which is treated as a continuous variable. We investigate the effects of the factors language, sex, physiccourse, and chemistcourse on this response.

1.5 The Data: How is it organized and what does it look like?

The “Mathlevel” data is a cross-section from 1983 to 1986 in the United States. It has 609 observations and is organized by the following variables: mathlevel, sat, language, sex, major, mathcourse, physiccourse, and chemistcourse.

We select language, sex, physiccourse, and chemistcourse as factors, which have 2, 2, 3, and 3 levels respectively, and sat as the continuous response variable.

2. Experimental Design

2.1 How will the experiment be organized and conducted to test the hypothesis?

A Taguchi design reduces the total number of runs while still allowing main effects to be computed.

To simplify the design, each 3-level factor is decomposed into two 2-level factors, coded so that the sum of the two 2-level variables reproduces the original 3-level value (0 = 0 + 0, 1 = 1 + 0 or 0 + 1, 2 = 1 + 1). This gives a total of six 2-level factors for inclusion in our design.

The Taguchi experimental design is then used for further analysis with the smallest number of experimental runs.

2.2 What is the rationale for this design?

A Taguchi design is another way to reduce the number of runs relative to a full factorial design. The Taguchi model needs fewer runs and still allows the main effects to be estimated.

2.3 Randomize: What is the Randomization Scheme?

Randomization is a technique used to balance the effect of extraneous or uncontrollable conditions that can affect the results of an experiment. In this experiment, we do not randomize because we have no control over data collection; instead, we assume that the dataset satisfies the assumptions of a randomized design.

2.4 Replicate: Are there replicates and/or repeated measures?

Replicates are multiple experimental runs with the same factor settings (levels). Replicates are subject to the same sources of variability, independently of each other. We can replicate combinations of factor levels, groups of factor level combinations, or entire designs. There is no replication in this research.

2.5 Block: Did you use blocking in the design?

In experimental design, blocking is a technique used to deal with nuisance factors that may affect the results of the experiment. The experiment is organized into blocks, where the nuisance factor is maintained at a constant level in each block. Blocking is not utilized in this research.

2.6 Decomposing 3-level factors to 2-level factors

The dataset has 4 factors: 2 with 2 levels and 2 with 3 levels. Each 3-level factor is decomposed into two 2-level factors (A and B) by the following process, leaving six 2-level factors for the analysis. Rows at level 1 are entered twice, once coded (1,0) and once coded (0,1), so the expanded data frame dat has more rows than the original.

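# Expand physiccourse (levels 0, 1, 2) into two 0/1 columns whose sum is the
# original level: 0 -> (0,0); 1 -> (1,0) and (0,1), duplicating the row; 2 -> (1,1).
# Seed with row 1. Caution: row 1 has physiccourse == 2, which the loop below
# would code as (1,1), but this seed codes it as (0,0).
x = Mathlevel[1,]
physiccourse = c(0,0)

L = levels(Mathlevel[,'physiccourse'])
for(i in seq(2,dim(Mathlevel)[1])){
  if(Mathlevel[i,'physiccourse']==L[1]){
    # level 0: one row coded (0,0)
    x = rbind(x,Mathlevel[i,])
    physiccourse = rbind(physiccourse,c(0,0))
  }
  else if(Mathlevel[i,'physiccourse']==L[2]){
    # level 1: the row is entered twice, coded (1,0) and (0,1)
    x = rbind(x,Mathlevel[i,])
    x = rbind(x,Mathlevel[i,])
    physiccourse = rbind(physiccourse,c(1,0))
    physiccourse = rbind(physiccourse,c(0,1))
  }
  else{
    # level 2: one row coded (1,1)
    x = rbind(x,Mathlevel[i,])
    physiccourse = rbind(physiccourse,c(1,1))
  }
}

# Attach the two new columns and drop the original physiccourse (column 7)
x = cbind(x,physiccourse)
x = x[,c(1,2,3,4,5,6,8,9,10)]
colnames(x) = c(colnames(Mathlevel[1,c(1,2,3,4,5,6,8)]),'physiccourseA','physiccourseB')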

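# Repeat the expansion for chemistcourse, starting from the expanded frame x
Lc = levels(Mathlevel[,'chemistcourse'])

# Seed with row 1, which has chemistcourse == 1: two copies coded (1,0) and (0,1)
chemistcourse = c(1,0)
chemistcourse = rbind(chemistcourse,c(0,1))
dat = x[1,]
dat = rbind(dat,dat)

for(i in seq(2,dim(x)[1])){
  if(x[i,'chemistcourse']==Lc[1]){
    # level 0: one row coded (0,0)
    dat = rbind(dat,x[i,])
    chemistcourse = rbind(chemistcourse,c(0,0))
  }
  else if(x[i,'chemistcourse']==Lc[2]){
    # level 1: the row is entered twice, coded (1,0) and (0,1)
    dat = rbind(dat,x[i,])
    dat = rbind(dat,x[i,])
    chemistcourse = rbind(chemistcourse,c(1,0))
    chemistcourse = rbind(chemistcourse,c(0,1))
  }
  else{
    # level 2: one row coded (1,1)
    dat = rbind(dat,x[i,])
    chemistcourse = rbind(chemistcourse,c(1,1))
  }
}

dat = cbind(dat,chemistcourse)

colnames(dat) = c(colnames(x),'chemistcourseA','chemistcourseB')
# Keeps all 11 columns, including the original chemistcourse
dat = dat[,c(1,2,3,4,5,6,7,8,9,10,11)]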
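
The same coding rule can be written without growing data frames row by row. The following is a minimal sketch on a toy factor, not the code used for the results in this report; the helper name decompose3 and its output columns row, A, and B are assumptions introduced for illustration.

```r
# Hypothetical helper (not part of the original analysis): expand a 3-level
# factor with levels "0", "1", "2" into two 0/1 columns A and B with A + B = level.
decompose3 <- function(level) {
  level <- as.integer(as.character(level))
  # Level 1 produces two output rows, levels 0 and 2 produce one
  reps <- ifelse(level == 1, 2L, 1L)
  idx <- rep(seq_along(level), times = reps)
  lv <- level[idx]
  # For duplicated rows, the first copy is coded (1,0) and the second (0,1)
  first_copy <- !duplicated(idx)
  A <- ifelse(lv == 2, 1L, ifelse(lv == 1 & first_copy, 1L, 0L))
  B <- ifelse(lv == 2, 1L, ifelse(lv == 1 & !first_copy, 1L, 0L))
  data.frame(row = idx, A = A, B = B)
}

toy <- factor(c(0, 1, 2))
out <- decompose3(toy)
print(out)
```

Indexing the original data frame by out$row and binding A and B would give the expanded rows. Unlike the seeded loops above, every row, including the first, goes through the same rule.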

2.7 Appropriate Taguchi Design

Utilizing the taguchiChoose function in the qualityTools package, we can investigate all possible Taguchi designs based on the number of factors and the number of levels for each factor.

install.packages("qualityTools", repos='http://cran.us.r-project.org')
## Installing package into 'C:/Users/bokjh3/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'qualityTools' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\bokjh3\AppData\Local\Temp\RtmpGIOuGb\downloaded_packages
library(qualityTools)
## Warning: package 'qualityTools' was built under R version 3.3.2
## Loading required package: Rsolnp
## Warning: package 'Rsolnp' was built under R version 3.3.2
## Loading required package: MASS
## 
## Attaching package: 'qualityTools'
## The following object is masked from 'package:stats':
## 
##     sigma
taguchiChoose(factors1 = 6, level1 = 2)
## 6 factors on 2 levels and 0 factors on 0 levels with 0 desired interactions to be estimated
## 
## Possible Designs:
## 
## L8_2 L12_2 L16_2 L32_2
## 
## Use taguchiDesign("L8_2") or different to create a taguchi design object

Several Taguchi designs are available, but we are only interested in main effects and want to minimize the number of experimental runs. The L8_2 design, which has 8 runs and up to 7 two-level columns, is therefore the most appropriate.

Next, we create our design using the taguchiDesign function. We also set the random seed so that our results are reproducible.

set.seed(1587)
design <- taguchiDesign("L8_2")
## Warning in `[<-`(`*tmp*`, i, value = <S4 object of class
## structure("taguchiFactor", package = "qualityTools")>): implicit list
## embedding of S4 objects is deprecated

names(design) = c("dat$physiccourseA", "dat$physiccourseB", "dat$chemistcourseA", "chemistcourseB", "language", "sex")
design
##   StandOrder RunOrder Replicate A B C D E F G  y
## 1          8        1         1 2 2 1 2 1 1 2 NA
## 2          5        2         1 2 1 2 1 2 1 2 NA
## 3          3        3         1 1 2 2 1 1 2 2 NA
## 4          6        4         1 2 1 2 2 1 2 1 NA
## 5          4        5         1 1 2 2 2 2 1 1 NA
## 6          1        6         1 1 1 1 1 1 1 1 NA
## 7          7        7         1 2 2 1 1 2 2 1 NA
## 8          2        8         1 1 1 1 2 2 2 2 NA
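
Eight runs are enough for main effects because the array is balanced: each column holds each level four times, and every pair of columns holds each of the four level combinations twice. The sketch below re-enters columns A to G exactly as printed above into a plain matrix, so it does not need qualityTools, and checks both properties with base R. The object names L8, balanced, and orthogonal are illustrative.

```r
# Columns A..G of the design, in the run order printed above
L8 <- matrix(c(
  2, 2, 1, 2, 1, 1, 2,
  2, 1, 2, 1, 2, 1, 2,
  1, 2, 2, 1, 1, 2, 2,
  2, 1, 2, 2, 1, 2, 1,
  1, 2, 2, 2, 2, 1, 1,
  1, 1, 1, 1, 1, 1, 1,
  2, 2, 1, 1, 2, 2, 1,
  1, 1, 1, 2, 2, 2, 2
), nrow = 8, byrow = TRUE, dimnames = list(NULL, LETTERS[1:7]))

# Balance: each level appears 4 times in every column
balanced <- all(apply(L8, 2, function(col) all(table(col) == 4)))

# Orthogonality: each pair of columns contains every level combination twice
pairs <- combn(ncol(L8), 2)
orthogonal <- all(apply(pairs, 2, function(p) {
  all(table(L8[, p[1]], L8[, p[2]]) == 2)
}))

print(c(balanced = balanced, orthogonal = orthogonal))
```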

3. Statistical Analysis

3.1 Data collection

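We collect the data for our 8 experimental runs based on the design. To build the subsets, we convert language and sex from factors to numerics, so that design levels 1 and 2 match the numeric codes. For physiccourse and chemistcourse, design levels 1 and 2 in the A and B columns are read as 0 and 1, and their sum gives the course count used in each subset. One observation is then sampled at random from each subset. Note that run 7 of the design has A = 2 and B = 2, which corresponds to physiccourse == 2, whereas set7 below filters on physiccourse == 0.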

Mathlevel$language <- as.numeric(Mathlevel$language)
Mathlevel$sex <- as.numeric(Mathlevel$sex)
set1 <- subset(Mathlevel, physiccourse == 2 & chemistcourse == 1 & language == 1 & sex == 1)
set2 <- subset(Mathlevel, physiccourse == 1 & chemistcourse == 1 & language == 2 & sex == 1)
set3 <- subset(Mathlevel, physiccourse == 1 & chemistcourse == 1 & language == 1 & sex == 2)
set4 <- subset(Mathlevel, physiccourse == 1 & chemistcourse == 2 & language == 1 & sex == 2)
set5 <- subset(Mathlevel, physiccourse == 1 & chemistcourse == 2 & language == 2 & sex == 1)
set6 <- subset(Mathlevel, physiccourse == 0 & chemistcourse == 0 & language == 1 & sex == 1)
set7 <- subset(Mathlevel, physiccourse == 0 & chemistcourse == 0 & language == 2 & sex == 2)
set8 <- subset(Mathlevel, physiccourse == 0 & chemistcourse == 1 & language == 2 & sex == 2)
run1 <- set1[sample(1:nrow(set1), 1), ]
run2 <- set2[sample(1:nrow(set2), 1), ]
run3 <- set3[sample(1:nrow(set3), 1), ]
run4 <- set4[sample(1:nrow(set4), 1), ]
run5 <- set5[sample(1:nrow(set5), 1), ]
run6 <- set6[sample(1:nrow(set6), 1), ]
run7 <- set7[sample(1:nrow(set7), 1), ]
run8 <- set8[sample(1:nrow(set8), 1), ]
response <- c(run1$sat, run2$sat, run3$sat, run4$sat, run5$sat, run6$sat, run7$sat, run8$sat)
response
## [1] 580 780 650 590 670 560 610 550
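
Sampling with sample(1:nrow(set), 1) silently misbehaves when a subset is empty, because 1:0 is the vector c(1, 0). A guarded draw makes an empty design cell visible. This is a sketch on toy data; the helper draw_one and the data frame toy are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: draw one row from a design cell, failing loudly if the
# cell contains no observations
draw_one <- function(cell) {
  if (nrow(cell) == 0) stop("empty design cell: no observation to sample")
  cell[sample.int(nrow(cell), 1), , drop = FALSE]
}

toy <- data.frame(sat = c(500, 600, 700), sex = c(1, 1, 2))

set.seed(1)
picked <- draw_one(subset(toy, sex == 2))   # only one candidate row
empty_result <- tryCatch(
  draw_one(subset(toy, sex == 3)),          # no rows match
  error = function(e) "empty"
)
print(picked)
print(empty_result)
```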

Now we add the response column to our design.

response(design) = response
summary(design)
## Taguchi SINGLE Design
## Information about the factors:
## 
##                         A                 B                  C
## value 1                 1                 1                  1
## value 2                 2                 2                  2
## name    dat$physiccourseA dat$physiccourseB dat$chemistcourseA
## unit                                                          
## type              numeric           numeric            numeric
##                      D        E       F       G
## value 1              1        1       1       1
## value 2              2        2       2       2
## name    chemistcourseB language     sex    <NA>
## unit                                           
## type           numeric  numeric numeric numeric
## 
## -----------
## 
##   StandOrder RunOrder Replicate A B C D E F G response
## 1          8        1         1 2 2 1 2 1 1 2      580
## 2          5        2         1 2 1 2 1 2 1 2      780
## 3          3        3         1 1 2 2 1 1 2 2      650
## 4          6        4         1 2 1 2 2 1 2 1      590
## 5          4        5         1 1 2 2 2 2 1 1      670
## 6          1        6         1 1 1 1 1 1 1 1      560
## 7          7        7         1 2 2 1 1 2 2 1      610
## 8          2        8         1 1 1 1 2 2 2 2      550
## 
## -----------

3.2 Main Effects

According to the main effects plot, physiccourse and sex appear to have a notable effect on the SAT Math Score.

effectPlot(design)
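
The quantities behind a main effects plot can also be computed directly: for each column, the mean response at level 2 minus the mean response at level 1. The sketch below re-enters the design columns and the eight responses from Section 3.1 into base R objects, so it does not depend on qualityTools. The names runs, y, and effects are illustrative.

```r
# Design columns A..G and responses, copied from the summary in Section 3.1
runs <- matrix(c(
  2, 2, 1, 2, 1, 1, 2,
  2, 1, 2, 1, 2, 1, 2,
  1, 2, 2, 1, 1, 2, 2,
  2, 1, 2, 2, 1, 2, 1,
  1, 2, 2, 2, 2, 1, 1,
  1, 1, 1, 1, 1, 1, 1,
  2, 2, 1, 1, 2, 2, 1,
  1, 1, 1, 2, 2, 2, 2
), nrow = 8, byrow = TRUE, dimnames = list(NULL, LETTERS[1:7]))
y <- c(580, 780, 650, 590, 670, 560, 610, 550)

# Main effect of each column: mean(y at level 2) - mean(y at level 1)
effects <- apply(runs, 2, function(col) mean(y[col == 2]) - mean(y[col == 1]))
print(round(effects, 1))
```

For these eight responses, the difference is 32.5 for column A (physiccourseA), 97.5 for column C (chemistcourseA), and -47.5 for column F (sex). With a single observation per run there is no error estimate, so these differences describe the sampled runs rather than test significance.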

3.3 Testing

In this section, we conduct an ANOVA test. We estimate main effects on the full dataset in order to check the validity of the Taguchi results. The ANOVA table for the main effects is shown below. It shows that sex and physiccourse have statistically significant effects on the SAT Math Score at the 5% level, as expected from the main effects determined by our Taguchi design experiment.

model = lm(sat~language+sex+physiccourse+chemistcourse,data=Mathlevel)
anova(model)
## Analysis of Variance Table
## 
## Response: sat
##                Df  Sum Sq Mean Sq F value    Pr(>F)    
## language        1   11963   11963  3.5283   0.06081 .  
## sex             1   66461   66461 19.6012 1.133e-05 ***
## physiccourse    2   24716   12358  3.6446   0.02671 *  
## chemistcourse   2   13678    6839  2.0169   0.13396    
## Residuals     602 2041192    3391                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
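
anova() on an lm fit reports sequential sums of squares, so with unbalanced observational data the F test for a term can depend on the order in which terms enter the formula. One way to check this, sketched here on simulated data with hypothetical variable names (g1, g2, y), is to compare it with drop1(), which tests each term against the full model.

```r
# Simulated, unbalanced two-factor data (not the Mathlevel data)
set.seed(42)
toy <- data.frame(
  g1 = factor(sample(c("a", "b"), 60, replace = TRUE)),
  g2 = factor(sample(c("x", "y", "z"), 60, replace = TRUE))
)
toy$y <- 500 + 30 * (toy$g1 == "b") + rnorm(60, sd = 50)

fit <- lm(y ~ g1 + g2, data = toy)

seq_tab  <- anova(fit)              # sequential: g1 first, then g2 given g1
marg_tab <- drop1(fit, test = "F")  # each term removed from the full model

print(seq_tab)
print(marg_tab)
```

The last term in the formula has the same sum of squares in both tables; earlier terms can differ when the design is unbalanced.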

3.4 Comparison to Fractional Factorial Design Results

To compare the Fractional Factorial Design with the Taguchi Design, we look at main effects plots for both: the effect plot for the Taguchi design and boxplots for the Fractional Factorial Design. The boxplots show the main effect of each factor in the Fractional Factorial Design on the response variable, sat.

boxplot(dat$sat~dat$language, xlab="language", ylab="SAT Math Score")

boxplot(dat$sat~dat$sex, xlab="sex", ylab="SAT Math Score")

boxplot(dat$sat~dat$physiccourseA, xlab="physiccourseA", ylab="SAT Math Score")

boxplot(dat$sat~dat$physiccourseB, xlab="physiccourseB", ylab="SAT Math Score")

boxplot(dat$sat~dat$chemistcourseA, xlab="chemistcourseA", ylab="SAT Math Score")

boxplot(dat$sat~dat$chemistcourseB, xlab="chemistcourseB", ylab="SAT Math Score")

We also conduct an ANOVA test for the main effects based on the Fractional Factorial Design.

model_FFD = lm(sat~language+sex+physiccourseA+physiccourseB+chemistcourseA+chemistcourseB,data=dat)
anova(model_FFD)
## Analysis of Variance Table
## 
## Response: sat
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## language          1   40659   40659 11.6992  0.000639 ***
## sex               1  198469  198469 57.1074 6.478e-14 ***
## physiccourseA     1    7089    7089  2.0397  0.153413    
## physiccourseB     1   33684   33684  9.6923  0.001879 ** 
## chemistcourseA    1    2080    2080  0.5985  0.439263    
## chemistcourseB    1   18501   18501  5.3236  0.021150 *  
## Residuals      1833 6370355    3475                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Discussion

A Taguchi design focuses on the main effects of the few factors that contribute most to the response. The purpose of this comparison is therefore to see how a Taguchi design compares with a fractional factorial design in estimating main effects.

Both designs can estimate main effects with only 8 experimental runs for our six 2-level factors. They require fewer experimental runs and therefore save considerable time and resources. However, neither can guarantee the statistical precision of results based on the full dataset.

The results of the two analyses differ considerably even though they use the same dataset. Judged against the ANOVA on the full dataset, the Taguchi model gives a better estimate of the main effects than the Fractional Factorial Design.

For this dataset, the Taguchi design appears to be the more effective design. This is plausible because, in the Fractional Factorial Design, the main effects are aliased (confounded) with the 2-factor interactions.

3.5 Diagnostics / Model Adequacy Checking

First, we look at the histogram of sat to check the assumption of normality. The distribution appears approximately normal.

hist(Mathlevel$sat, main = "SAT Math Score")

Quantile-Quantile (Q-Q) plots are used to check the distributional assumption for a set of data. In the Q-Q plot, the residuals fall close to a straight line, which suggests that the normality assumption is met and supports the use of ANOVA to test for significant differences.

qqnorm(residuals(model))
qqline(residuals(model))
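
A formal test can complement the visual check. This sketch applies shapiro.test() from base R's stats package to the residuals of a model fitted to simulated data; the data and the names toy, fit, and sw are placeholders, not the Mathlevel results.

```r
# Simulated one-factor data with normal errors (not the Mathlevel data)
set.seed(7)
toy <- data.frame(g = factor(rep(c("a", "b", "c"), each = 30)))
toy$y <- 600 + 20 * as.integer(toy$g) + rnorm(90, sd = 55)

fit <- lm(y ~ g, data = toy)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(residuals(fit))
print(sw)
```

For large samples the test can flag departures that are too small to matter for ANOVA, so it is best read alongside the Q-Q plot rather than in place of it.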

A Residuals vs. Fits plot is a common graph in residual analysis: a scatter plot of the residuals against the fitted values (the estimated responses). In this plot, the points appear randomly scattered, although there are a few outliers and some linear patterns.

plot(fitted(model),residuals(model))

4. References to the literature

Montgomery, Douglas C. 2012. Design and Analysis of Experiments, 8th Edition.

5. Appendices

Raw data

The Mathlevel dataset can be found by installing and loading the Ecdat R package.
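
Loading the data from the package avoids depending on a machine-specific file path like the one used in Section 1.1. A minimal sketch, assuming the Ecdat package is already installed:

```r
# Assumes Ecdat is installed, e.g. via install.packages("Ecdat")
if (requireNamespace("Ecdat", quietly = TRUE)) {
  data(Mathlevel, package = "Ecdat")  # loads the data frame into the workspace
  str(Mathlevel)
} else {
  message("Ecdat is not installed; run install.packages('Ecdat') first")
}
```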