After examining the available data sets in Ecdat, BudgetFood data set has been selected.
Installation of the package and selection of the data-set is shown below:
#install.packages("pid")
#install.packages("Ecdat")
library(pid)
library("Ecdat")
prj <- BudgetFood
BudgetFood data set was collected for examining Null Hypothesis Statistical Testing and consist of 6 variables and 23,972 observations. It represents a cross-section from 1980 and the main source of the data is Journal of Applied Econometrics.
Definitions of the variables in data set are as below:
head(prj)
wfood totexp age size town sex
1 0.4676991 1290941 43 5 2 man
2 0.3130226 1277978 40 3 2 man
3 0.3764819 845852 28 3 2 man
4 0.4396909 527698 60 1 2 woman
5 0.4036149 1103220 37 5 2 man
6 0.1992503 1768128 35 4 2 man
Factors and levels of each factor are listed as below:
str(prj)
'data.frame': 23972 obs. of 6 variables:
$ wfood : num 0.468 0.313 0.376 0.44 0.404 ...
$ totexp: num 1290941 1277978 845852 527698 1103220 ...
$ age : num 43 40 28 60 37 35 40 68 43 51 ...
$ size : num 5 3 3 1 5 4 4 2 9 7 ...
$ town : num 2 2 2 2 2 2 2 2 2 2 ...
$ sex : Factor w/ 2 levels "man","woman": 1 1 1 2 1 1 1 2 1 1 ...
However, one of the factors that we are planning to use in the analysis, town, is defined as integer. We should change it to a factor before starting the analysis.
prj$town <- as.factor(prj$town)
str(prj$town)
Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
We will use following 2 categorical independent variables with below levels:
levels(prj$town)
[1] "1" "2" "3" "4" "5"
levels(prj$sex)
[1] "man" "woman"
In the data set we have three continuous variables two of them are independent variables and we will not use them in our study. However, we will use only one continuous variable as our Dependent Variable.
As it has been mentioned before data consists of 8 variables. However we should re-arrange our data according to the Project-2 request: Select a data set, a pair of categorical independent variables and a continuous dependent variable.
General view of the data is as below we also removed the NA values by using na.omit function.
prj2 <- prj[,c(1,5,6)]
prj2 <- na.omit(prj2)
head(prj2)
wfood town sex
1 0.4676991 2 man
2 0.3130226 2 man
3 0.3764819 2 man
4 0.4396909 2 woman
5 0.4036149 2 man
6 0.1992503 2 man
summary(prj2)
wfood town sex
Min. :0.0000 1:2903 man :20624
1st Qu.:0.2583 2:3986 woman: 3347
Median :0.3642 3:4362
Mean :0.3783 4:9882
3rd Qu.:0.4847 5:2838
Max. :0.9966
We are using 2 factors one with two and the other with multiple levels. We will use the variable with multiple levels as the blocking and use the other one testing our null hypothesis.
For this experiment our null hypothesis will be: Percentage of total expenditure which the household spend on food does not affected by gender of the reference person. Experiment will follow below outline: First general descriptive statistics will be studied. After sample size will be calculated for the required by using G*power. ANOVA will be used for evaluating the null hypothesis. Assumptions of ANOVA will be controlled. (normality, randomization) Finally, model adequacy will be checked.
Data set has been retrieved from Ecdat, and factors had been already chosen. In this situation it is not possible for us to define the reasons of why these factors were selected, but we choose this dataset because it fits the requirements of Project.2.
Replication, local control(blocking) and randomization are fundamentals of a designing an experiment. Replication and blocking increases the precision in the experiment and randomization helps reducing bias.
Randomization is possible with below three conditions:
If all these are realized in a random behavior we can assume the analyze is randomized.
For this data set We do not have a certain information about random selection.
However, with randomly sub-setting and ordering the data we can assure the random assignment and random execution according to the sample size that we defined.
“Repeat and replicate measurements are both multiple response measurements taken at the same combination of factor settings; but repeat measurements are taken during the same experimental run or consecutive runs, while replicate measurements are taken during identical but different experimental runs, which are often randomized.”[1]
In this dataset we do not have any replication or repeated measures.
“A block is a categorical variable that explains variation in the response variable that is not caused by the factors. Although each measurement should be taken under consistent experimental conditions (other than the factors that are being varied as part of the experiment), this is not always possible. Use blocks in designed experiments and analysis to minimize bias and variance of the error because of nuisance factors.” [2]
As the definition of nuisance factor, it might have some effect on the response but on the other hand it should not be in main focus of the researcher. Also from DOE modules : “Variability between blocks can be large, variability within a block should be relatively small. In general, a block is a specific level of the nuisance factor” So this will be analyzed in our descriptive summary with boxplots and means of the blocking factor levels.
Histogram of our response variable (wfood) is plotted for visually analyzing the normal distribution.
hist(prj2$wfood, xlab="Food Expenditure Ratio", main = paste("Histogram of" , "Food Expenditure Ratio"))
From this histogram we could assume that our response variable is nearly Normally Distributed.
Town size is our blocking factor and gender is our independent variable and their boxplots and means are shown as following :
boxplot(prj2$wfood~prj2$town, xlab="Town Size", ylab="Food Expenditure Ratio")
title("Town")
means1 <- by(prj2$wfood, prj2$town, mean)
points(1:5, means1, pch = 23, cex = 2, bg = "red")
text(1:5 - 0.1, means1,labels = format(means1, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")
boxplot(prj2$wfood~prj2$sex, xlab="Gender of Individual", ylab="Food Expenditure Ratio")
title("Sex")
means2 <- by(prj2$wfood, prj2$sex, mean)
points(1:2, means2, pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means2,labels = format(means2, format = "f" , digits = 2),pos = 3, cex = 0.9, col = "red")
For defining the sample size, we are going to use the G*power software. For this calculation we need below values:
cohensD function from lsr library will be used for effect size calculation:
library(lsr)
P2_man <- subset(prj2$wfood,prj2$sex=="man")
P2_woman <- subset(prj2$wfood,prj2$sex=="woman")
cohensD(P2_woman,P2_man)
[1] 0.1344472
This calculated effect size and selected \(\alpha\), \(\beta\) and number of groups gives us a sample size of 608.
We have 5 blocks (town size: 1 to 5) and one independent variable (sex: man or woman). I have randomly select 61 samples for each block and tries to generate a balanced design with 610 samples. Sampling the data set I used sample.row function from package kimisc which is a relatively new one. (February.2016)
library(kimisc)
m1 <- sample.rows(subset(prj2, sex == "man" & town =="1"), 61)
m2 <- sample.rows(subset(prj2, sex == "man" & town =="2"), 61)
m3 <- sample.rows(subset(prj2, sex == "man" & town =="3"), 61)
m4 <- sample.rows(subset(prj2, sex == "man" & town =="4"), 61)
m5 <- sample.rows(subset(prj2, sex == "man" & town =="5"), 61)
w1 <- sample.rows(subset(prj2, sex == "woman" & town =="1"), 61)
w2 <- sample.rows(subset(prj2, sex == "woman" & town =="2"), 61)
w3 <- sample.rows(subset(prj2, sex == "woman" & town =="3"), 61)
w4 <- sample.rows(subset(prj2, sex == "woman" & town =="4"), 61)
w5 <- sample.rows(subset(prj2, sex == "woman" & town =="5"), 61)
prj2s <- rbind(m1,m2,m3,m4,m5,w1,w2,w3,w4,w5)
summary(prj2s)
wfood town sex
Min. :0.0000 1:122 man :305
1st Qu.:0.2701 2:122 woman:305
Median :0.3791 3:122
Mean :0.3937 4:122
3rd Qu.:0.5106 5:122
Max. :0.9692
Now we can generate our ANOVA analysis and test the null hypothesis:
model_1 <-aov(prj2s$wfood~prj2s$sex+prj2s$town)
anova(model_1)
Analysis of Variance Table
Response: prj2s$wfood
Df Sum Sq Mean Sq F value Pr(>F)
prj2s$sex 1 0.0514 0.05145 1.9178 0.1666
prj2s$town 4 1.4550 0.36376 13.5597 1.363e-10 ***
Residuals 604 16.2034 0.02683
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(model_1))
qqline(residuals(model_1))
plot(fitted(model_1),residuals(model_1))
We are selecting our sample randomly from a much bigger data set, because of this some sets generate a small p value (<0.05) but mostly generate a higher p value. (>0.05) I believe this is a good opportunity for using resampling and validate our findings.
I mostly find the p value higher than 0.05 so I fail to reject the null hypothesis. Therefore, there is no significant difference between the food expenditure ratio and gender of the selected person.
The results of ANOVA are based on the assumption that the data is normally distributed and randomly selected, q-q plots and scatter plots also show these assumptions had been fulfilled.
“Tukey’s method is used in ANOVA to create confidence intervals for all pairwise differences between factor level means while controlling the family error rate to a level you specify. It is important to consider the family error rate when making multiple comparisons because your chances of making a type I error for a series of comparisons is greater than the error rate for any one comparison alone. To counter this higher error rate, Tukey’s method adjusts the confidence level for each individual interval so that the resulting simultaneous confidence level is equal to the value you specify.” [3]
Using Tukey’s method, we specified that the entire set of comparisons should have a family error rate of 0.05 (equivalent to a 95% simultaneous confidence level).
tukey1<-TukeyHSD(model_1)
plot(tukey1)
TukeyHSD(model_1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov.default(formula = prj2s$wfood ~ prj2s$sex + prj2s$town)
$`prj2s$sex`
diff lwr upr p adj
woman-man 0.01836739 -0.007680266 0.04441505 0.1666142
$`prj2s$town`
diff lwr upr p adj
2-1 -0.02510261 -0.08248007 0.0322748400 0.7531385
3-1 -0.05836705 -0.11574451 -0.0009895983 0.0439231
4-1 -0.10224694 -0.15962440 -0.0448694904 0.0000137
5-1 -0.13344935 -0.19082680 -0.0760718919 0.0000000
3-2 -0.03326444 -0.09064189 0.0241130155 0.5068180
4-2 -0.07714433 -0.13452178 -0.0197668766 0.0023661
5-2 -0.10834673 -0.16572419 -0.0509692780 0.0000032
4-3 -0.04387989 -0.10125735 0.0134975617 0.2246311
5-3 -0.07508229 -0.13245975 -0.0177048397 0.0033980
5-4 -0.03120240 -0.08857986 0.0261750524 0.5707725
The confidence interval for the difference between the means of sex; parallel to the ANOVA p calculations; contain zero. Confidence intervals that contain zero indicate no difference in another saying the difference is statistically insignificant.
“Resampling is the method that consists of drawing repeated samples from the original data samples. Resampling involves the selection of randomized cases with replacement from the original data sample in such a manner that each number of the sample drawn has a number of cases that are similar to the original data sample. Due to replacement, the drawn number of samples that are used by the method of resampling consists of repetitive cases.” [4]
We will use two approaches for resampling of our data:
Monte Carlo simulation:
summary(aov(prj2s$wfood~prj2s$sex))
## Df Sum Sq Mean Sq F value Pr(>F)
## prj2s$sex 1 0.051 0.05145 1.771 0.184
## Residuals 608 17.658 0.02904
meanstar = mean(prj2s$wfood)
sdstar = sqrt(0.031) #From the summary of ANOVA !!!
#We are randomly selecting the Sample group everytime when we run the script, so this value may not fit %100 with the above summary.
#
#
simsex = prj2s$sex
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = rnorm(305, mean=meanstar, sd=sdstar)
groupB = rnorm(305, mean=meanstar, sd=sdstar)
simwfood = c(groupA,groupB)
simdata = data.frame(simwfood,simsex)
Fstar[i] = oneway.test(simwfood~simsex, var.equal=T, data=simdata)$statistic
}
hist(Fstar,main="Monte Carlo", ylim=c(0,1), xlim=c(0, 8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,5,90),type="b",col="red")
Bootstrapping - this technique (Efron, 1979) samples a data set with replacement (i.e. a single case may be randomly sampled several times into the bootstrap set). The bootstrap can be applied any number of times, for increased accuracy. Compared with random sampling, the use of sampling with replacement can help to iron out generalization problems caused by the finite size of the data set. [5]
From the example presented in DOE class following code can be applied to this data:
meanstar = with(prj2s, tapply(wfood,sex,mean))
grpA = prj2s$wfood[prj2s$sex=="man"] - meanstar[1]
grpB = prj2s$wfood[prj2s$sex=="woman"] - meanstar[2]
simsex = prj2s$sex
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=305, replace=T)
groupB = sample(grpB, size=305, replace=T)
simwfood = c(groupA,groupB)
simdata = data.frame(simwfood,simsex)
Fstar[i] = oneway.test(simwfood~simsex, var.equal=T, data=simdata)$statistic
}
Plot comparing bootstrapped distribution to the known F distribution
hist(Fstar, main="BootStrapping", ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,5,90),type="b",col="red") # The F distribution
The alpha level from the bootstrapped distribution
print(realFstar<-oneway.test(wfood~sex, var.equal=T, data=prj2s)$statistic)
F
1.771398
mean(Fstar>=realFstar)
[1] 0.1815
estimate \(\alpha\) = 0.05 value of the test statistic, given F* from empirical distribution above generate quantiles of the analytic F distribution
qf(.95,5,90)
[1] 2.315689
\(\alpha\) = 0.05 value of bootstrapped F* :
quantile(Fstar,.95)
95%
3.757723
Now we can again generate ANOVA test with the new data we created with bootstramp resampling:
model_2=lm(simwfood~simsex,data=simdata)
anova(model_2)
Analysis of Variance Table
Response: simwfood
Df Sum Sq Mean Sq F value Pr(>F)
simsex 1 0.0695 0.069451 2.4275 0.1197
Residuals 608 17.3948 0.028610
qqnorm(residuals(model_2))
qqline(residuals(model_2))
par(mfrow=c(2,2))
plot(model_2)
With this p value (higher than 0.05) we again fail to reject the null hypothesis. So this supports our initial finding, but as I mentioned before even I find a p value smaller than 0.05 in my initial analysis (first set selected) the bootstramp resampling is always generating similar results (fail to reject)
Normal distribution evaluation with QQ plot and distribution on Residuals vs. Fits plot shows us that assumptions have been covered for also this model.
[4]- http://www.statisticssolutions.com/sample-size-calculation-and-sample-size-justification-resampling/
[5]- https://documents.software.dell.com/statistics/textbook/statistics-glossary/r
Mean point addition to BoxPlots: https://stat.ethz.ch/pipermail/r-help/2004-January/044833.html