Recipe 3: Multi-Factor Analysis

Recipes for the Design of Experiments: Recipe Outline

Global Health and Mortality

Jane Braun

RPI

October 7th Version 1.0

1. Setting

System under test

What is the problem that you were given?

The problem given is how mortality is affected by various factors.

data.text <- read.csv("C:/Users/braunj6/Documents/Fall 2014/Design of Experiments/data-text.csv")

x<-data.text
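The str(x) output later in this recipe lists the text columns as factors, which was the read.csv() default before R 4.0.0. On newer versions of R the default is stringsAsFactors = FALSE, so the same call would return character columns. A minimal sketch of the difference, assuming a small hypothetical CSV written to a temporary file in place of the real data-text.csv:

```r
# Hypothetical three-row extract; the real file has 1,746 rows and 12 columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "WHO region,Sex,Numeric",
  "Europe,Male,144",
  "Europe,Both sexes,68",
  "Americas,Both sexes,174"
), csv_path)

# stringsAsFactors = TRUE reproduces the pre-4.0.0 behaviour on any R version.
toy <- read.csv(csv_path, stringsAsFactors = TRUE)

# read.csv() turns the header "WHO region" into the syntactic name WHO.region.
is.factor(toy$WHO.region)
nlevels(toy$WHO.region)
```

On the real file the equivalent call would add stringsAsFactors = TRUE to the read.csv() line above.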

Factors and Levels

The data being examined had 12 variables and 1,746 observations. Some of the variables included year, region, income group, country, gender, and mortality rate. Year had 3 levels, region had 6 levels, income group had 4 levels, country had 194 levels, etc.

head(x)
##                                                                                 Indicator
## 1 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 2 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 3 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 4 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 5 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 6 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
##   PUBLISH.STATES Year      WHO.region World.Bank.income.group
## 1      Published 1990          Europe             High-income
## 2      Published 2012          Europe             High-income
## 3      Published 1990        Americas             High-income
## 4      Published 2012 Western Pacific             High-income
## 5      Published 2000          Europe             High-income
## 6      Published 2012          Europe             High-income
##               Country        Sex Display.Value Numeric Low High Comments
## 1             Andorra       Male           144     144  NA   NA       NA
## 2             Andorra Both sexes            68      68  NA   NA       NA
## 3 Antigua and Barbuda Both sexes           174     174  NA   NA       NA
## 4           Australia       Male            75      75  NA   NA       NA
## 5             Austria       Male           126     126  NA   NA       NA
## 6             Austria       Male            91      91  NA   NA       NA
summary(x)
##                                                                                    Indicator   
##  Adult mortality rate (probability of dying between 15 and 60 years per 1000 population):1746  
##                                                                                                
##                                                                                                
##                                                                                                
##                                                                                                
##                                                                                                
##                                                                                                
##    PUBLISH.STATES      Year                      WHO.region 
##  Published:1746   Min.   :1990   Africa               :414  
##                   1st Qu.:1990   Americas             :315  
##                   Median :2000   Eastern Mediterranean:198  
##                   Mean   :2001   Europe               :477  
##                   3rd Qu.:2012   South-East Asia      : 99  
##                   Max.   :2012   Western Pacific      :243  
##                                                             
##         World.Bank.income.group                Country    
##  High-income        :459        Afghanistan        :   9  
##  Low-income         :405        Albania            :   9  
##  Lower-middle-income:477        Algeria            :   9  
##  Upper-middle-income:405        Andorra            :   9  
##                                 Angola             :   9  
##                                 Antigua and Barbuda:   9  
##                                 (Other)            :1692  
##          Sex      Display.Value    Numeric      Low         
##  Both sexes:582   Min.   : 34   Min.   : 34   Mode:logical  
##  Female    :582   1st Qu.:113   1st Qu.:113   NA's:1746     
##  Male      :582   Median :177   Median :177                 
##                   Mean   :206   Mean   :206                 
##                   3rd Qu.:277   3rd Qu.:277                 
##                   Max.   :764   Max.   :764                 
##                                                             
##    High         Comments      
##  Mode:logical   Mode:logical  
##  NA's:1746      NA's:1746     
##                               
##                               
##                               
##                               
## 
str(x)
## 'data.frame':    1746 obs. of  12 variables:
##  $ Indicator              : Factor w/ 1 level "Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PUBLISH.STATES         : Factor w/ 1 level "Published": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year                   : int  1990 2012 1990 2012 2000 2012 1990 2000 2000 2000 ...
##  $ WHO.region             : Factor w/ 6 levels "Africa","Americas",..: 4 4 2 6 4 4 4 4 3 3 ...
##  $ World.Bank.income.group: Factor w/ 4 levels "High-income",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country                : Factor w/ 194 levels "Afghanistan",..: 4 4 6 9 10 10 17 17 13 13 ...
##  $ Sex                    : Factor w/ 3 levels "Both sexes","Female",..: 3 1 1 3 3 3 1 1 2 1 ...
##  $ Display.Value          : int  144 68 174 75 126 91 107 100 83 96 ...
##  $ Numeric                : int  144 68 174 75 126 91 107 100 83 96 ...
##  $ Low                    : logi  NA NA NA NA NA NA ...
##  $ High                   : logi  NA NA NA NA NA NA ...
##  $ Comments               : logi  NA NA NA NA NA NA ...

Continuous variables

The continuous variable in this dataset is the mortality rate (Numeric).

Response variables

The response variable being analyzed is the global adult mortality rate: the probability of dying between the ages of 15 and 60, expressed per 1000 population.

The Data: How is it organized and what does it look like?

The data is organized into 12 columns, with 1,746 observations.

Randomization

This data is observational rather than randomized: it records the mortality rate at the end of each year for every combination of the factors. Each value is therefore fixed, not assigned at random.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The experiment will use a multi-factor analysis of variance. It will use various attributes as factors, such as year, region, income group, and gender, and will measure the variation in mortality rate among these groups.

What is the rationale for this design?

The rationale for this design is that the goal was to compare the mean mortality rate across groups. The ANOVA was set up to include interactions between the factors. We also wanted to see how much of the variation lay within groups and how much lay between them.

Randomize: What is the Randomization Scheme?

Because the dataset was a set of observations, there was no randomization.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures in this design. There is only one value for each combination of country, year, and sex.
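The row count is consistent with one observation per cell: 194 countries x 3 years x 3 sex categories = 1,746. A sketch of how that could be checked, using placeholder country labels (hypothetical) in place of the real country names:

```r
# Placeholder grid with the same shape as the data:
# 194 countries, 3 years, 3 sex categories.
grid <- expand.grid(
  Country = paste0("country_", 1:194),   # hypothetical labels
  Year    = c(1990, 2000, 2012),
  Sex     = c("Both sexes", "Female", "Male")
)

# Count the rows in every Country x Year x Sex cell.
cell_counts <- table(grid$Country, grid$Year, grid$Sex)

nrow(grid)               # 1746
all(cell_counts == 1)    # TRUE: no cell holds more than one value
```

On the real data the same check would be table(x$Country, x$Year, x$Sex).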

Block: Did you use blocking in the design?

Blocking was used for year, region, income group, and gender.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

# Frequency of Region
barplot(table(x$WHO.region), main = "Frequency of Region")

# This plot by region shows the frequency of observations, grouped by their region in the world. There are six regions, with Europe appearing most frequently, and South-East Asia the least. This would make sense, because Europe has the most countries out of all the regions.

# Frequency of Income Groups
barplot(table(x$World.Bank.income.group), main = "Frequency of Income Group")

# The frequencies by each income group were very similar. Lower-middle-income was the highest frequency at 477, while low-income and upper-middle-income were the lowest frequencies, at 405 observations. 

boxplot(x$Numeric ~ x$WHO.region)

# The boxplots show the median, IQR, and outliers of the mortality rate, broken down by region. Africa stands out as having the highest mortality rate, while the other regions are relatively similar to one another.

boxplot(x$Numeric ~ x$World.Bank.income.group)

# This boxplot shows the mortality rate broken down by income level. As seen in the plot, the low-income generally has the highest mortality rate, while the high-income is the lowest. 
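The boxplots can be backed with numbers by computing the per-group median and IQR directly. A sketch using simulated values: the group names match two levels of World.Bank.income.group, but the numbers are made up for illustration.

```r
set.seed(42)

# Simulated stand-in for x: two income groups with different typical rates.
toy <- data.frame(
  income  = factor(rep(c("High-income", "Low-income"), each = 50)),
  Numeric = c(rnorm(50, mean = 100, sd = 20), rnorm(50, mean = 300, sd = 60))
)

# Median and interquartile range per group, the quantities a boxplot draws.
group_median <- tapply(toy$Numeric, toy$income, median)
group_iqr    <- tapply(toy$Numeric, toy$income, IQR)

round(cbind(median = group_median, IQR = group_iqr), 1)
```

On the real data the same idea would be tapply(x$Numeric, x$World.Bank.income.group, median).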

Testing

The focus of this recipe was on a >2 factor, >2 level ANOVA test. Therefore, the factors being analyzed had to have multiple levels.

#ANOVA Testing

#Comparing mortality versus multiple factors: year, region, income group, and gender
# H0: There is no difference between the means of the samples. The variation in mortality is not caused by anything other than randomization.
# Ha: The variation in mortality can be caused by something other than randomization.


model1 <- aov(Numeric ~ Year* WHO.region* World.Bank.income.group* Sex, data = x)
summary(model1)
##                                               Df   Sum Sq Mean Sq F value
## Year                                           1   610913  610913  127.61
## WHO.region                                     5 11786237 2357247  492.37
## World.Bank.income.group                        3  2468145  822715  171.85
## Sex                                            2  1404979  702490  146.73
## Year:WHO.region                                5    98650   19730    4.12
## Year:World.Bank.income.group                   3    57949   19316    4.03
## WHO.region:World.Bank.income.group            13  1020251   78481   16.39
## Year:Sex                                       2     8436    4218    0.88
## WHO.region:Sex                                10    87292    8729    1.82
## World.Bank.income.group:Sex                    6    60455   10076    2.10
## Year:WHO.region:World.Bank.income.group       13    96493    7423    1.55
## Year:WHO.region:Sex                           10     6466     647    0.14
## Year:World.Bank.income.group:Sex               6     3961     660    0.14
## WHO.region:World.Bank.income.group:Sex        26    36135    1390    0.29
## Year:WHO.region:World.Bank.income.group:Sex   26     5141     198    0.04
## Residuals                                   1614  7727064    4788        
##                                             Pr(>F)    
## Year                                        <2e-16 ***
## WHO.region                                  <2e-16 ***
## World.Bank.income.group                     <2e-16 ***
## Sex                                         <2e-16 ***
## Year:WHO.region                             0.0010 ** 
## Year:World.Bank.income.group                0.0072 ** 
## WHO.region:World.Bank.income.group          <2e-16 ***
## Year:Sex                                    0.4145    
## WHO.region:Sex                              0.0521 .  
## World.Bank.income.group:Sex                 0.0500 *  
## Year:WHO.region:World.Bank.income.group     0.0928 .  
## Year:WHO.region:Sex                         0.9993    
## Year:World.Bank.income.group:Sex            0.9913    
## WHO.region:World.Bank.income.group:Sex      0.9998    
## Year:WHO.region:World.Bank.income.group:Sex 1.0000    
## Residuals                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

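The summary of the ANOVA gives the p-value for each individual factor, along with the p-values for the interactions between the factors. The null hypothesis states that the variation in the response variable, mortality rate, cannot be explained by anything other than randomization. The p-values in this ANOVA indicate that this is not the case. At an alpha of 0.05 there are significant p-values for year, region, income group, gender, and the interactions Year:WHO.region, Year:World.Bank.income.group, WHO.region:World.Bank.income.group, and World.Bank.income.group:Sex (the last only marginally, at p = 0.0500). The remaining interactions are not significant: Year:Sex, WHO.region:Sex, Year:WHO.region:World.Bank.income.group, and every three-way and four-way interaction that involves Sex.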
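One detail in the table is worth noting: Year has 1 degree of freedom, which means aov() treated it as a numeric covariate (a linear trend) rather than as a 3-level factor. A sketch of the difference on simulated data; the data frame and the YearF column are hypothetical and not part of the recipe:

```r
set.seed(1)

# Simulated data: 3 years x 3 sex categories x 10 replicates.
toy <- expand.grid(
  Year = c(1990, 2000, 2012),
  Sex  = c("Both sexes", "Female", "Male"),
  rep  = 1:10
)
toy$Numeric <- 200 - 2 * (toy$Year - 1990) + rnorm(nrow(toy), sd = 20)

# Year left numeric: fitted as a single slope.
fit_numeric <- aov(Numeric ~ Year * Sex, data = toy)

# Year converted to a factor: fitted as 3 levels.
toy$YearF  <- factor(toy$Year)
fit_factor <- aov(Numeric ~ YearF * Sex, data = toy)

df_numeric <- summary(fit_numeric)[[1]]$Df
df_factor  <- summary(fit_factor)[[1]]$Df

df_numeric[1]   # 1 degree of freedom for numeric Year
df_factor[1]    # 2 degrees of freedom for the 3-level factor
```

Whether to treat Year as a factor depends on the question being asked; with only three years, the factor form makes no assumption that the change is linear.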

Diagnostics/Model Adequacy Checking

par(mfrow = c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))

The Q-Q normality plot of the residuals shows that the residuals are not particularly normal. If they were normal, the points would fall along the Q-Q line.
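A Q-Q plot is a visual check. A formal complement is the Shapiro-Wilk test, available in base R as shapiro.test(), which accepts between 3 and 5,000 values. A sketch on simulated residuals (assumed stand-ins, not the residuals of model1):

```r
set.seed(7)

# Simulated stand-ins for model residuals (assumption, not the recipe's data).
skewed_resid <- rexp(500) - 1   # right-skewed, clearly non-normal
normal_resid <- rnorm(500)      # drawn from a normal distribution

skewed_test <- shapiro.test(skewed_resid)
normal_test <- shapiro.test(normal_resid)

# A small p-value is evidence against normality.
c(skewed = skewed_test$p.value, normal = normal_test$p.value)
```

For this recipe the call would be shapiro.test(residuals(model1)). With a sample this large the test can flag even small departures, so it is best read alongside the Q-Q plot.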

interaction.plot(x$Sex, x$World.Bank.income.group, x$Numeric)

# The interaction plot between gender and income group suggests little interaction between the two factors, in line with the borderline p-value (0.0500) for this term in the ANOVA.

interaction.plot(x$WHO.region, x$World.Bank.income.group, x$Numeric)

# The interaction plot between region and income group suggests that the two factors interact. Interaction shows up as lines that cross or that do not run parallel to each other.
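An interaction plot draws the mean response in each cell of two factors, so the same information can be read from a table of cell means. A sketch on simulated data with an interaction built in; the factor levels and values are hypothetical:

```r
set.seed(3)

# Simulated 2 x 2 layout with 20 replicates per cell.
toy <- expand.grid(
  region = c("R1", "R2"),
  income = c("High", "Low"),
  rep    = 1:20
)

# Built-in interaction: low income raises the response only in region R2.
toy$Numeric <- 100 +
  150 * (toy$region == "R2" & toy$income == "Low") +
  rnorm(nrow(toy), sd = 10)

# Cell means: the points that interaction.plot() would connect.
cell_means <- with(toy, tapply(Numeric, list(region, income), mean))
cell_means

# Difference of differences: close to 0 when the lines are parallel.
interaction_contrast <-
  (cell_means["R2", "Low"] - cell_means["R2", "High"]) -
  (cell_means["R1", "Low"] - cell_means["R1", "High"])
interaction_contrast
```

A contrast far from zero corresponds to the non-parallel lines described above.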

interaction.plot(x$Year, x$World.Bank.income.group, x$Numeric)

# The interaction plot between year and income group shows that while none of the lines cross, they do not run exactly parallel.
plot(fitted(model1),residuals(model1))

The plot of the residuals against the fitted values is not an even scatter. The points cluster tightly in some places and spread out in others. This suggests that the constant-variance assumption of the ANOVA may not hold, which adds to the concern raised by the Q-Q plot.
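A residuals-versus-fitted plot is mainly a visual check of whether the spread of the residuals is the same across groups. Bartlett's test, bartlett.test() in base R, is one formal complement. A sketch on simulated groups with deliberately unequal spread (hypothetical data, not the recipe's):

```r
set.seed(11)

# Simulated groups: A and B have small spread, C has large spread.
toy <- data.frame(
  group   = factor(rep(c("A", "B", "C"), each = 50)),
  Numeric = c(rnorm(50, 100, 5), rnorm(50, 150, 5), rnorm(50, 300, 40))
)
fit <- aov(Numeric ~ group, data = toy)

# Standard deviation of the residuals within each group.
resid_sd <- tapply(residuals(fit), toy$group, sd)
resid_sd

# Bartlett's test: the null hypothesis is equal variance in all groups.
bt <- bartlett.test(Numeric ~ group, data = toy)
bt$p.value
```

Bartlett's test is itself sensitive to non-normality, so a small p-value here should be read together with the Q-Q plot rather than on its own.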

Tukey's range test is used alongside an ANOVA to find which means differ significantly from each other, by comparing the means in pairs. The test compares the mean of each treatment level to the mean of every other treatment level.

The null hypothesis for a Tukey test states that there is no difference between the means of a pair of groups, while the alternative states that there is a significant difference between the means.

tukey1 <- TukeyHSD(aov(Numeric ~ WHO.region, data = x))
tukey1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Numeric ~ WHO.region, data = x)
## 
## $WHO.region
##                                          diff       lwr      upr  p adj
## Americas-Africa                       -186.15 -205.0689 -167.226 0.0000
## Eastern Mediterranean-Africa          -163.50 -185.3682 -141.634 0.0000
## Europe-Africa                         -214.35 -231.3482 -197.350 0.0000
## South-East Asia-Africa                -130.55 -158.8649 -102.238 0.0000
## Western Pacific-Africa                -172.30 -192.7526 -151.849 0.0000
## Eastern Mediterranean-Americas          22.65   -0.3058   45.598 0.0556
## Europe-Americas                        -28.20  -46.5755   -9.828 0.0002
## South-East Asia-Americas                55.60   26.4364   84.755 0.0000
## Western Pacific-Americas                13.85   -7.7613   35.454 0.4478
## Europe-Eastern Mediterranean           -50.85  -72.2428  -29.453 0.0000
## South-East Asia-Eastern Mediterranean   32.95    1.7981   64.101 0.0310
## Western Pacific-Eastern Mediterranean   -8.80  -33.0287   15.429 0.9058
## South-East Asia-Europe                  83.80   55.8473  111.748 0.0000
## Western Pacific-Europe                  42.05   22.1022   61.994 0.0000
## Western Pacific-South-East Asia        -41.75  -71.9239  -11.575 0.0012
plot(tukey1)

# The Tukey test gives adjusted p-values below alpha = 0.05 for most pairs of regions. The exceptions are the Western Pacific and the Eastern Mediterranean, the Western Pacific and the Americas, and the Eastern Mediterranean and the Americas. For the pairs with significant p-values, there is a noticeable difference in means.
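With 15 pairs in the table, the significant ones can also be pulled out programmatically rather than read by eye. A sketch on simulated data with three hypothetical groups, where only group C differs from the others by construction:

```r
set.seed(5)

# Simulated groups: A and B share a mean, C is shifted upward.
toy <- data.frame(
  group   = factor(rep(c("A", "B", "C"), each = 30)),
  Numeric = c(rnorm(30, 0), rnorm(30, 0), rnorm(30, 5))
)
tk <- TukeyHSD(aov(Numeric ~ group, data = toy))

# TukeyHSD() returns one matrix per factor, with columns diff, lwr, upr, "p adj".
comparisons <- tk$group

# Keep the rows whose adjusted p-value is below 0.05.
# drop = FALSE keeps the result a matrix even if only one row remains.
sig <- comparisons[comparisons[, "p adj"] < 0.05, , drop = FALSE]
rownames(sig)
```

For the recipe's object the matrix would be tukey1$WHO.region.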

tukey2 <- TukeyHSD(aov(Numeric ~ World.Bank.income.group, data = x))
tukey2
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Numeric ~ World.Bank.income.group, data = x)
## 
## $World.Bank.income.group
##                                            diff     lwr     upr p adj
## Low-income-High-income                   212.45  196.06  228.84     0
## Lower-middle-income-High-income          116.23  100.51  131.96     0
## Upper-middle-income-High-income           67.44   51.05   83.84     0
## Lower-middle-income-Low-income           -96.22 -112.46  -79.97     0
## Upper-middle-income-Low-income          -145.01 -161.90 -128.11     0
## Upper-middle-income-Lower-middle-income  -48.79  -65.04  -32.54     0
plot(tukey2)

# The Tukey test gives an adjusted p-value below alpha = 0.05 for every pair of income groups. Each pair therefore shows a noticeable difference in means.

4. Contingencies

Based on the model adequacy testing, it appears that the data is not normally distributed. Unfortunately, normality is an assumption for using ANOVA tests.

A Kruskal-Wallis test can be used to evaluate whether groups' distributions are identical, without the assumption that the data is normally distributed.

kruskal.test(Numeric ~ WHO.region, data = x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Numeric by WHO.region
## Kruskal-Wallis chi-squared = 736.5, df = 5, p-value < 2.2e-16
kruskal.test(Numeric ~ World.Bank.income.group, data = x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Numeric by World.Bank.income.group
## Kruskal-Wallis chi-squared = 823.4, df = 3, p-value < 2.2e-16
kruskal.test(Numeric ~ Sex, data = x)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Numeric by Sex
## Kruskal-Wallis chi-squared = 146.8, df = 2, p-value < 2.2e-16

The null hypothesis for a Kruskal-Wallis test states that the variation in distributions between the groups cannot be explained by anything other than randomization. Because the p-values are less than 0.05 for all three tests, we can reject the null. Instead, we can support the alternative, that the variation in group distributions can be explained by something other than randomization: in this case, region, income group, and gender.
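A Kruskal-Wallis test says that at least one group differs, but not which one. A rank-based counterpart to the Tukey comparisons is pairwise.wilcox.test() in base R, which adjusts the p-values for multiple comparisons. A sketch on simulated data with three hypothetical groups; the Holm adjustment is one reasonable choice, not something the recipe specifies:

```r
set.seed(9)

# Simulated groups: A and B share a distribution, C is shifted upward.
toy <- data.frame(
  group   = factor(rep(c("A", "B", "C"), each = 30)),
  Numeric = c(rnorm(30, 0), rnorm(30, 0), rnorm(30, 5))
)

# Overall test across all three groups.
kw <- kruskal.test(Numeric ~ group, data = toy)
kw$p.value

# Pairwise Wilcoxon rank sum tests with Holm-adjusted p-values.
pw <- pairwise.wilcox.test(toy$Numeric, toy$group, p.adjust.method = "holm")

# Lower-triangular matrix: rows B and C, columns A and B.
pw$p.value
```

On the real data the call would be pairwise.wilcox.test(x$Numeric, x$WHO.region, p.adjust.method = "holm"). The real mortality rates are integers and contain ties, so R would warn that exact p-values cannot be computed and would use a normal approximation instead.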

5. References

The data can be found at the World Health Organization's website. http://apps.who.int/gho/data/node.main.2?lang=en