What is the problem that you were given?
The problem given is to determine how adult mortality is affected by factors such as year, region, income group, and gender.
data.text <- read.csv("C:/Users/braunj6/Documents/Fall 2014/Design of Experiments/data-text.csv")
x<-data.text
The data being examined had 12 variables and 1,746 observations. Some of the variables included year, region, income group, country, gender, and mortality rate. Year had 3 levels, region had 6 levels, income group had 4 levels, country had 194 levels, etc.
head(x)
## Indicator
## 1 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 2 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 3 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 4 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 5 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## 6 Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)
## PUBLISH.STATES Year WHO.region World.Bank.income.group
## 1 Published 1990 Europe High-income
## 2 Published 2012 Europe High-income
## 3 Published 1990 Americas High-income
## 4 Published 2012 Western Pacific High-income
## 5 Published 2000 Europe High-income
## 6 Published 2012 Europe High-income
## Country Sex Display.Value Numeric Low High Comments
## 1 Andorra Male 144 144 NA NA NA
## 2 Andorra Both sexes 68 68 NA NA NA
## 3 Antigua and Barbuda Both sexes 174 174 NA NA NA
## 4 Australia Male 75 75 NA NA NA
## 5 Austria Male 126 126 NA NA NA
## 6 Austria Male 91 91 NA NA NA
summary(x)
## Indicator
## Adult mortality rate (probability of dying between 15 and 60 years per 1000 population):1746
##
##
##
##
##
##
## PUBLISH.STATES Year WHO.region
## Published:1746 Min. :1990 Africa :414
## 1st Qu.:1990 Americas :315
## Median :2000 Eastern Mediterranean:198
## Mean :2001 Europe :477
## 3rd Qu.:2012 South-East Asia : 99
## Max. :2012 Western Pacific :243
##
## World.Bank.income.group Country
## High-income :459 Afghanistan : 9
## Low-income :405 Albania : 9
## Lower-middle-income:477 Algeria : 9
## Upper-middle-income:405 Andorra : 9
## Angola : 9
## Antigua and Barbuda: 9
## (Other) :1692
## Sex Display.Value Numeric Low
## Both sexes:582 Min. : 34 Min. : 34 Mode:logical
## Female :582 1st Qu.:113 1st Qu.:113 NA's:1746
## Male :582 Median :177 Median :177
## Mean :206 Mean :206
## 3rd Qu.:277 3rd Qu.:277
## Max. :764 Max. :764
##
## High Comments
## Mode:logical Mode:logical
## NA's:1746 NA's:1746
##
##
##
##
##
str(x)
## 'data.frame': 1746 obs. of 12 variables:
## $ Indicator : Factor w/ 1 level "Adult mortality rate (probability of dying between 15 and 60 years per 1000 population)": 1 1 1 1 1 1 1 1 1 1 ...
## $ PUBLISH.STATES : Factor w/ 1 level "Published": 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : int 1990 2012 1990 2012 2000 2012 1990 2000 2000 2000 ...
## $ WHO.region : Factor w/ 6 levels "Africa","Americas",..: 4 4 2 6 4 4 4 4 3 3 ...
## $ World.Bank.income.group: Factor w/ 4 levels "High-income",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Country : Factor w/ 194 levels "Afghanistan",..: 4 4 6 9 10 10 17 17 13 13 ...
## $ Sex : Factor w/ 3 levels "Both sexes","Female",..: 3 1 1 3 3 3 1 1 2 1 ...
## $ Display.Value : int 144 68 174 75 126 91 107 100 83 96 ...
## $ Numeric : int 144 68 174 75 126 91 107 100 83 96 ...
## $ Low : logi NA NA NA NA NA NA ...
## $ High : logi NA NA NA NA NA NA ...
## $ Comments : logi NA NA NA NA NA NA ...
The continuous variable in this dataset is the mortality rate (Numeric).
The response variable being analyzed is the global adult mortality rate, defined as the probability of dying between the ages of 15 and 60 per 1000 population.
The data is organized into 12 columns, with 1,746 observations.
The data are observational rather than randomized: the mortality rate is recorded for each year across all of the factors, so the values are fixed measurements and no treatment was randomly assigned.
The experiment will use a multi-factor analysis of variance. It will use various attributes as factors, such as year, region, income group, and gender, and will measure the variation in mortality rate among these groups.
The rationale for this type of design is that the goal was to compare mean mortality rates between the groups. The ANOVA was set up to include interactions between the factors, so that we could examine the variation both within and between groups.
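To make the "include interaction" choice concrete: in an R model formula, the `*` operator expands to all main effects plus all of their interactions. The short sketch below only inspects formulas (no data is needed) and uses the column names from this dataset.

```r
# Term labels that R generates for a two-factor crossed formula.
two_way <- attr(terms(Numeric ~ Year * Sex), "term.labels")
two_way
# "Year" "Sex" "Year:Sex"

# The full four-factor crossed formula produces 15 terms:
# 4 main effects, 6 two-way, 4 three-way and 1 four-way interaction.
four_way <- attr(
  terms(Numeric ~ Year * WHO.region * World.Bank.income.group * Sex),
  "term.labels"
)
length(four_way)
# 15
```

These 15 terms correspond to the 15 non-residual rows that `summary()` reports for a model of this form.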
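Because the dataset is a set of observations rather than the result of a designed experiment, no randomization was performed.
There are no repeated measures in this design: there is only one value for each combination of country, year, and sex (194 countries x 3 years x 3 sex categories = 1,746 observations). Countries that share the same year, region, income group, and sex act as the replicates within each cell of the model.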
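The number of observations per cell can be checked by cross-tabulating the data. The sketch below uses a small hypothetical data frame (`toy`, with made-up countries and an assumed region assignment), not the WHO file, so the counts are illustrative only.

```r
# Hypothetical toy data: one row per country-year-sex combination,
# mirroring the layout of the real dataset.
toy <- expand.grid(
  Country = c("A", "B", "C", "D"),
  Year    = c(1990, 2000, 2012),
  Sex     = c("Both sexes", "Female", "Male")
)
# Assumed region assignment, for illustration only.
toy$WHO.region <- ifelse(toy$Country %in% c("A", "B"), "Europe", "Africa")

# Observations per country-year-sex cell: each cell holds exactly one row.
cell_counts <- table(toy$Country, toy$Year, toy$Sex)
all(cell_counts == 1)

# Observations per region-year-sex cell: countries in the same region
# share a cell, so each cell holds more than one row.
group_counts <- table(toy$WHO.region, toy$Year, toy$Sex)
range(group_counts)
```

On the real data, the same two `table()` calls with `x` in place of `toy` would show whether any country-year-sex combination is duplicated or missing.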
No separate blocking variable was used; year, region, income group, and gender all enter the model as factors.
# Frequency of Region
barplot(table(x$WHO.region), main = "Frequency of Region")
# This plot by region shows the frequency of observations, grouped by their region in the world. There are six regions, with Europe appearing most frequently, and South-East Asia the least. This would make sense, because Europe has the most countries out of all the regions.
# Frequency of Income Groups
barplot(table(x$World.Bank.income.group), main = "Frequency of Income Group")
# The frequencies by each income group were very similar. Lower-middle-income was the highest frequency at 477, while low-income and upper-middle-income were the lowest frequencies, at 405 observations.
boxplot(x$Numeric ~ x$WHO.region)
# The boxplots show the median, IQR, and outliers of the mortality rate, broken down by region. Africa stands out as having the highest mortality rate, while the other regions remain relatively similar.
boxplot(x$Numeric ~ x$World.Bank.income.group)
# This boxplot shows the mortality rate broken down by income level. As seen in the plot, the low-income generally has the highest mortality rate, while the high-income is the lowest.
The focus of this recipe was on a >2 factor, >2 level ANOVA test. Therefore, the factors being analyzed had to have multiple levels.
#ANOVA Testing
#Comparing mortality versus multiple factors: year, region, income group, and gender
# H0: There is no difference between the group means. The variation in mortality is not caused by anything other than random chance.
# Ha: The variation in mortality can be explained by something other than random chance.
model1 <- aov(Numeric ~ Year* WHO.region* World.Bank.income.group* Sex, data = x)
summary(model1)
## Df Sum Sq Mean Sq F value
## Year 1 610913 610913 127.61
## WHO.region 5 11786237 2357247 492.37
## World.Bank.income.group 3 2468145 822715 171.85
## Sex 2 1404979 702490 146.73
## Year:WHO.region 5 98650 19730 4.12
## Year:World.Bank.income.group 3 57949 19316 4.03
## WHO.region:World.Bank.income.group 13 1020251 78481 16.39
## Year:Sex 2 8436 4218 0.88
## WHO.region:Sex 10 87292 8729 1.82
## World.Bank.income.group:Sex 6 60455 10076 2.10
## Year:WHO.region:World.Bank.income.group 13 96493 7423 1.55
## Year:WHO.region:Sex 10 6466 647 0.14
## Year:World.Bank.income.group:Sex 6 3961 660 0.14
## WHO.region:World.Bank.income.group:Sex 26 36135 1390 0.29
## Year:WHO.region:World.Bank.income.group:Sex 26 5141 198 0.04
## Residuals 1614 7727064 4788
## Pr(>F)
## Year <2e-16 ***
## WHO.region <2e-16 ***
## World.Bank.income.group <2e-16 ***
## Sex <2e-16 ***
## Year:WHO.region 0.0010 **
## Year:World.Bank.income.group 0.0072 **
## WHO.region:World.Bank.income.group <2e-16 ***
## Year:Sex 0.4145
## WHO.region:Sex 0.0521 .
## World.Bank.income.group:Sex 0.0500 *
## Year:WHO.region:World.Bank.income.group 0.0928 .
## Year:WHO.region:Sex 0.9993
## Year:World.Bank.income.group:Sex 0.9913
## WHO.region:World.Bank.income.group:Sex 0.9998
## Year:WHO.region:World.Bank.income.group:Sex 1.0000
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
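The summary of the ANOVA gives the p-value for each individual factor, along with the p-value for each interaction between the factors. The null hypothesis states that the variation in the response variable, mortality rate, cannot be explained by anything other than random chance. The p-values in this ANOVA indicate that this is not the case. At the 0.05 level, there are significant p-values for year, region, income group, and gender, and for the year:region, year:income group, and region:income group interactions. The income group:gender interaction is borderline (p = 0.0500). The interactions that are not significant are year:gender, region:gender, and all of the three-way and four-way interactions.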
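One detail in the table above is that Year has 1 degree of freedom, which means `aov()` treated it as a numeric covariate rather than as a factor with 3 levels. A minimal sketch of the difference, using hypothetical simulated data (`toy`) rather than the WHO file, is:

```r
set.seed(1)
# Hypothetical data: 3 years x 2 sexes x 10 replicates, random response.
toy <- expand.grid(
  Year = c(1990, 2000, 2012),
  Sex  = c("Female", "Male"),
  rep  = 1:10
)
toy$Numeric <- rnorm(nrow(toy), mean = 200, sd = 30)

# Year left as a number: fitted as a single linear slope (Df = 1).
fit_numeric <- aov(Numeric ~ Year * Sex, data = toy)
# Year wrapped in factor(): fitted as a 3-level factor (Df = 2).
fit_factor  <- aov(Numeric ~ factor(Year) * Sex, data = toy)

df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]
df_factor  <- summary(fit_factor)[[1]][["Df"]][1]
c(numeric = df_numeric, factor = df_factor)
```

Whether to treat Year as a factor depends on the question being asked; with only three time points, the factor form makes no assumption that the trend is linear.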
par(mfrow = c(1,1))
qqnorm(residuals(model1))
qqline(residuals(model1))
The normal Q-Q plot of the residuals shows that the residuals are not particularly normal. If they were normally distributed, the points would fall along the Q-Q line.
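A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test (`shapiro.test()` in base R, which accepts between 3 and 5000 values) to the residuals of a model fitted to hypothetical skewed data; it is an illustration, not part of the original analysis.

```r
set.seed(42)
# Hypothetical right-skewed response in three groups.
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 50),
  y     = rexp(150, rate = 1 / 100)
)
fit <- aov(y ~ group, data = toy)

# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(residuals(fit))
sw$p.value
```

A small p-value is evidence against normality. With large samples the test can flag even minor departures, so it is best read alongside the Q-Q plot rather than instead of it.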
interaction.plot(x$Sex, x$World.Bank.income.group, x$Numeric)
# The interaction plot between gender and income group suggests that there is little interaction between the two factors.
interaction.plot(x$WHO.region, x$World.Bank.income.group, x$Numeric)
# The interaction plot between region and income group shows that there is interaction between the two factors. Interaction is indicated by lines that cross or that do not run parallel to each other.
interaction.plot(x$Year, x$World.Bank.income.group, x$Numeric)
# The interaction plot between year and income group shows that, while none of the lines cross, they are not exactly parallel, which suggests a modest interaction.
plot(fitted(model1),residuals(model1))
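In the plot of the residuals against the fitted values, the points are not evenly scattered around zero: they are clustered tightly in places and spread out over the course of the plot. An uneven spread like this points to non-constant variance rather than to non-normality as such, but either way it suggests that the ANOVA assumptions are not fully met.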
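The spread of the response across groups can also be tested directly. The sketch below uses the Fligner-Killeen test (`fligner.test()` in base R), which is relatively robust to non-normal data, on hypothetical groups whose spread grows with the mean. The data frame and its values are made up for illustration.

```r
set.seed(7)
# Hypothetical data: three groups with increasing mean and spread.
toy <- data.frame(
  region = factor(rep(c("r1", "r2", "r3"), each = 40)),
  y      = c(rnorm(40, 100, 5), rnorm(40, 150, 20), rnorm(40, 300, 60))
)

# Null hypothesis: the variances are equal across groups.
fk <- fligner.test(y ~ region, data = toy)
fk$p.value
```

On the real data, the equivalent call would be `fligner.test(Numeric ~ WHO.region, data = x)`; a small p-value would indicate unequal variances between regions.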
Tukey's range tests are used alongside ANOVAs to find means that are significantly different from each other, by comparing pairs of means. This test compares the mean of each treatment level to the mean of every other treatment level.
The null hypothesis for a Tukey test states that there is no difference between the means of a pair of groups, while the alternative states that there is a significant difference between the means.
tukey1 <- TukeyHSD(aov(Numeric ~ WHO.region, data = x))
tukey1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Numeric ~ WHO.region, data = x)
##
## $WHO.region
## diff lwr upr p adj
## Americas-Africa -186.15 -205.0689 -167.226 0.0000
## Eastern Mediterranean-Africa -163.50 -185.3682 -141.634 0.0000
## Europe-Africa -214.35 -231.3482 -197.350 0.0000
## South-East Asia-Africa -130.55 -158.8649 -102.238 0.0000
## Western Pacific-Africa -172.30 -192.7526 -151.849 0.0000
## Eastern Mediterranean-Americas 22.65 -0.3058 45.598 0.0556
## Europe-Americas -28.20 -46.5755 -9.828 0.0002
## South-East Asia-Americas 55.60 26.4364 84.755 0.0000
## Western Pacific-Americas 13.85 -7.7613 35.454 0.4478
## Europe-Eastern Mediterranean -50.85 -72.2428 -29.453 0.0000
## South-East Asia-Eastern Mediterranean 32.95 1.7981 64.101 0.0310
## Western Pacific-Eastern Mediterranean -8.80 -33.0287 15.429 0.9058
## South-East Asia-Europe 83.80 55.8473 111.748 0.0000
## Western Pacific-Europe 42.05 22.1022 61.994 0.0000
## Western Pacific-South-East Asia -41.75 -71.9239 -11.575 0.0012
plot(tukey1)
# The Tukey test gives adjusted p-values below alpha = 0.05 for most pairs of regions. The exceptions are the Western Pacific and the Eastern Mediterranean, the Western Pacific and the Americas, and the Eastern Mediterranean and the Americas. For the pairs with significant p-values, there is a noticeable difference in mean mortality rate.
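Rather than reading the table by eye, the significant pairs can be pulled out programmatically. `TukeyHSD()` returns a list with one matrix per factor, with columns `diff`, `lwr`, `upr`, and `p adj`. The sketch below filters that matrix on hypothetical data (`toy`); the group names and values are made up.

```r
set.seed(3)
# Hypothetical data: r1 and r2 are close, r3 is far from both.
toy <- data.frame(
  region = factor(rep(c("r1", "r2", "r3"), each = 30)),
  y      = c(rnorm(30, 100, 10), rnorm(30, 102, 10), rnorm(30, 200, 10))
)
tk <- TukeyHSD(aov(y ~ region, data = toy))

# Keep only the pairs whose adjusted p-value is below 0.05.
res <- tk$region
sig <- res[res[, "p adj"] < 0.05, , drop = FALSE]
sig
```

On the real results, the same filter would be applied to `tukey1$WHO.region`.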
tukey2 <- TukeyHSD(aov(Numeric ~ World.Bank.income.group, data = x))
tukey2
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Numeric ~ World.Bank.income.group, data = x)
##
## $World.Bank.income.group
## diff lwr upr p adj
## Low-income-High-income 212.45 196.06 228.84 0
## Lower-middle-income-High-income 116.23 100.51 131.96 0
## Upper-middle-income-High-income 67.44 51.05 83.84 0
## Lower-middle-income-Low-income -96.22 -112.46 -79.97 0
## Upper-middle-income-Low-income -145.01 -161.90 -128.11 0
## Upper-middle-income-Lower-middle-income -48.79 -65.04 -32.54 0
plot(tukey2)
# The Tukey test gives adjusted p-values below alpha = 0.05 for every pair of income groups, so each pair shows a noticeable difference in mean mortality rate.
Based on the model adequacy testing, it appears that the residuals are not normally distributed. Unfortunately, normally distributed residuals are an assumption of ANOVA tests.
A Kruskal-Wallis test can be used to evaluate whether groups' distributions are identical, without the assumption that the data is normally distributed.
kruskal.test(Numeric ~ WHO.region, data = x)
##
## Kruskal-Wallis rank sum test
##
## data: Numeric by WHO.region
## Kruskal-Wallis chi-squared = 736.5, df = 5, p-value < 2.2e-16
kruskal.test(Numeric ~ World.Bank.income.group, data = x)
##
## Kruskal-Wallis rank sum test
##
## data: Numeric by World.Bank.income.group
## Kruskal-Wallis chi-squared = 823.4, df = 3, p-value < 2.2e-16
kruskal.test(Numeric ~ Sex, data = x)
##
## Kruskal-Wallis rank sum test
##
## data: Numeric by Sex
## Kruskal-Wallis chi-squared = 146.8, df = 2, p-value < 2.2e-16
The null hypothesis for a Kruskal-Wallis test states that the groups come from the same distribution, so that any differences between them are due to random chance. Because the p-values are less than 0.05 for all three tests, we can reject the null hypothesis in each case. The results instead support the alternative, that the differences in group distributions can be explained by something other than random chance: in this case, region, income group, and gender.
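A Kruskal-Wallis test only says that at least one group differs; it does not say which. A common nonparametric follow-up, analogous to the Tukey test used with the ANOVA, is a set of pairwise Wilcoxon rank sum tests with a multiple-comparison correction. The sketch below uses `pairwise.wilcox.test()` from base R on hypothetical data; it was not part of the original analysis.

```r
set.seed(11)
# Hypothetical skewed data: r1 and r2 are similar, r3 is shifted upward.
toy <- data.frame(
  region = factor(rep(c("r1", "r2", "r3"), each = 30)),
  y      = c(rexp(30, 1 / 100), rexp(30, 1 / 110), rexp(30, 1 / 400))
)

# Pairwise rank sum tests with Holm-adjusted p-values.
pw <- pairwise.wilcox.test(toy$y, toy$region, p.adjust.method = "holm")
pw$p.value
```

The result is a lower-triangular matrix of adjusted p-values, one per pair of groups. On the real data, the equivalent call would be `pairwise.wilcox.test(x$Numeric, x$WHO.region, p.adjust.method = "holm")`; because the mortality values are integers, R may warn that exact p-values cannot be computed with ties.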
The data can be found at the World Health Organization's website. http://apps.who.int/gho/data/node.main.2?lang=en