Project.2

Null Hypothesis Statistical Testing

Kaan UNNU

RPI

11.10.2016 V.1

1. Setting

After examining the available data sets in Ecdat, BudgetFood data set has been selected.

Installation of the package and selection of the data-set is shown below:

#install.packages("pid")
#install.packages("Ecdat")
library(pid)
library("Ecdat")
prj <- BudgetFood

System under test

BudgetFood data set was collected for examining Null Hypothesis Statistical Testing and consist of 6 variables and 23,972 observations. It represents a cross-section from 1980 and the main source of the data is Journal of Applied Econometrics.

Definitions of the variables in data set are as below:

wfood : percentage of total expenditure which the household has spent on food
totexp : total expenditure of the household
age : age of reference person in the household
size : size of the household
town : size of the town where the household is placed
sex : sex of reference person (man,woman)

head(prj)

      wfood  totexp age size town   sex
1 0.4676991 1290941  43    5    2   man
2 0.3130226 1277978  40    3    2   man
3 0.3764819  845852  28    3    2   man
4 0.4396909  527698  60    1    2 woman
5 0.4036149 1103220  37    5    2   man
6 0.1992503 1768128  35    4    2   man

Factors and Levels

Factors and levels of each factor are listed as below:

str(prj)

'data.frame':   23972 obs. of  6 variables:
 $ wfood : num  0.468 0.313 0.376 0.44 0.404 ...
 $ totexp: num  1290941 1277978 845852 527698 1103220 ...
 $ age   : num  43 40 28 60 37 35 40 68 43 51 ...
 $ size  : num  5 3 3 1 5 4 4 2 9 7 ...
 $ town  : num  2 2 2 2 2 2 2 2 2 2 ...
 $ sex   : Factor w/ 2 levels "man","woman": 1 1 1 2 1 1 1 2 1 1 ...

However, one of the factors that we are planning to use in the analysis, town, is defined as integer. We should change it to a factor before starting the analysis.

prj$town <- as.factor(prj$town)
str(prj$town)

 Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...

We will use following 2 categorical independent variables with below levels:

town : 5 levels
sex : 2 levels

levels(prj$town)

[1] "1" "2" "3" "4" "5"

levels(prj$sex)

[1] "man"   "woman"

Continuous variables & Response variables

In the data set we have three continuous variables two of them are independent variables and we will not use them in our study. However, we will use only one continuous variable as our Dependent Variable.

wfood : percentage of total expenditure which the household has spent on food

The Data: How is it organized and what does it look like?

As it has been mentioned before data consists of 8 variables. However we should re-arrange our data according to the Project-2 request: Select a data set, a pair of categorical independent variables and a continuous dependent variable.

General view of the data is as below we also removed the NA values by using na.omit function.

prj2 <- prj[,c(1,5,6)]
prj2 <- na.omit(prj2)
head(prj2)

      wfood town   sex
1 0.4676991    2   man
2 0.3130226    2   man
3 0.3764819    2   man
4 0.4396909    2 woman
5 0.4036149    2   man
6 0.1992503    2   man

summary(prj2)

     wfood        town        sex       
 Min.   :0.0000   1:2903   man  :20624  
 1st Qu.:0.2583   2:3986   woman: 3347  
 Median :0.3642   3:4362                
 Mean   :0.3783   4:9882                
 3rd Qu.:0.4847   5:2838                
 Max.   :0.9966

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

We are using 2 factors one with two and the other with multiple levels. We will use the variable with multiple levels as the blocking and use the other one testing our null hypothesis.

For this experiment our null hypothesis will be: Percentage of total expenditure which the household spend on food does not affected by gender of the reference person. Experiment will follow below outline: First general descriptive statistics will be studied. After sample size will be calculated for the required by using G*power. ANOVA will be used for evaluating the null hypothesis. Assumptions of ANOVA will be controlled. (normality, randomization) Finally, model adequacy will be checked.

What is the rationale for this design?

Data set has been retrieved from Ecdat, and factors had been already chosen. In this situation it is not possible for us to define the reasons of why these factors were selected, but we choose this dataset because it fits the requirements of Project.2.

Randomization,Blocking and Replication

Replication, local control(blocking) and randomization are fundamentals of a designing an experiment. Replication and blocking increases the precision in the experiment and randomization helps reducing bias.

Randomize: What is the Randomization Scheme?

Randomization is possible with below three conditions:

Random selection : Selection of your sample set from the population.
Random assignment : Assigning the samples to different groups or treatments
Random execution : Order of the experiments

If all these are realized in a random behavior we can assume the analyze is randomized.

For this data set We do not have a certain information about random selection.

However, with randomly sub-setting and ordering the data we can assure the random assignment and random execution according to the sample size that we defined.

Replicate: Are there replicates and/or repeated measures?

“Repeat and replicate measurements are both multiple response measurements taken at the same combination of factor settings; but repeat measurements are taken during the same experimental run or consecutive runs, while replicate measurements are taken during identical but different experimental runs, which are often randomized.”[1]

In this dataset we do not have any replication or repeated measures.

Block: Did you use blocking in the design?

“A block is a categorical variable that explains variation in the response variable that is not caused by the factors. Although each measurement should be taken under consistent experimental conditions (other than the factors that are being varied as part of the experiment), this is not always possible. Use blocks in designed experiments and analysis to minimize bias and variance of the error because of nuisance factors.” [2]

As the definition of nuisance factor, it might have some effect on the response but on the other hand it should not be in main focus of the researcher. Also from DOE modules : “Variability between blocks can be large, variability within a block should be relatively small. In general, a block is a specific level of the nuisance factor” So this will be analyzed in our descriptive summary with boxplots and means of the blocking factor levels.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Histogram of our response variable (wfood) is plotted for visually analyzing the normal distribution.

hist(prj2$wfood, xlab="Food Expenditure Ratio", main = paste("Histogram of" , "Food Expenditure Ratio"))

From this histogram we could assume that our response variable is nearly Normally Distributed.

Town size is our blocking factor and gender is our independent variable and their boxplots and means are shown as following :

boxplot(prj2$wfood~prj2$town, xlab="Town Size", ylab="Food Expenditure Ratio")
title("Town")
means1 <- by(prj2$wfood, prj2$town, mean)
points(1:5, means1, pch = 23, cex = 2, bg = "red")
text(1:5 - 0.1, means1,labels = format(means1, format = "f", digits = 2),pos = 3, cex = 0.9, col = "red")

boxplot(prj2$wfood~prj2$sex, xlab="Gender of Individual", ylab="Food Expenditure Ratio")
title("Sex")
means2 <- by(prj2$wfood, prj2$sex, mean)
points(1:2, means2, pch = 23, cex = 2, bg = "red")
text(1:2 - 0.1, means2,labels = format(means2, format = "f" , digits = 2),pos = 3, cex = 0.9, col = "red")

Testing

For defining the sample size, we are going to use the G*power software. For this calculation we need below values:

Effect size = will be calculated by CohensD
\(\alpha\) = 0.05
\(\beta\) = 0.20 (power=0.80)
Number of groups = 5

cohensD function from lsr library will be used for effect size calculation:

library(lsr)
P2_man <- subset(prj2$wfood,prj2$sex=="man")
P2_woman <- subset(prj2$wfood,prj2$sex=="woman")
cohensD(P2_woman,P2_man)

[1] 0.1344472

This calculated effect size and selected \(\alpha\), \(\beta\) and number of groups gives us a sample size of 608.

We have 5 blocks (town size: 1 to 5) and one independent variable (sex: man or woman). I have randomly select 61 samples for each block and tries to generate a balanced design with 610 samples. Sampling the data set I used sample.row function from package kimisc which is a relatively new one. (February.2016)

library(kimisc)
m1 <- sample.rows(subset(prj2, sex == "man" & town =="1"), 61)
m2 <- sample.rows(subset(prj2, sex == "man" & town =="2"), 61)
m3 <- sample.rows(subset(prj2, sex == "man" & town =="3"), 61)
m4 <- sample.rows(subset(prj2, sex == "man" & town =="4"), 61)
m5 <- sample.rows(subset(prj2, sex == "man" & town =="5"), 61)
w1 <- sample.rows(subset(prj2, sex == "woman" & town =="1"), 61)
w2 <- sample.rows(subset(prj2, sex == "woman" & town =="2"), 61)
w3 <- sample.rows(subset(prj2, sex == "woman" & town =="3"), 61)
w4 <- sample.rows(subset(prj2, sex == "woman" & town =="4"), 61)
w5 <- sample.rows(subset(prj2, sex == "woman" & town =="5"), 61)
prj2s <- rbind(m1,m2,m3,m4,m5,w1,w2,w3,w4,w5)
summary(prj2s)

     wfood        town       sex     
 Min.   :0.0000   1:122   man  :305  
 1st Qu.:0.2701   2:122   woman:305  
 Median :0.3791   3:122              
 Mean   :0.3937   4:122              
 3rd Qu.:0.5106   5:122              
 Max.   :0.9692

Now we can generate our ANOVA analysis and test the null hypothesis:

model_1 <-aov(prj2s$wfood~prj2s$sex+prj2s$town)
anova(model_1)

Analysis of Variance Table

Response: prj2s$wfood
            Df  Sum Sq Mean Sq F value    Pr(>F)    
prj2s$sex    1  0.0514 0.05145  1.9178    0.1666    
prj2s$town   4  1.4550 0.36376 13.5597 1.363e-10 ***
Residuals  604 16.2034 0.02683                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

qqnorm(residuals(model_1))
qqline(residuals(model_1))

plot(fitted(model_1),residuals(model_1))

We are selecting our sample randomly from a much bigger data set, because of this some sets generate a small p value (<0.05) but mostly generate a higher p value. (>0.05) I believe this is a good opportunity for using resampling and validate our findings.

I mostly find the p value higher than 0.05 so I fail to reject the null hypothesis. Therefore, there is no significant difference between the food expenditure ratio and gender of the selected person.

The results of ANOVA are based on the assumption that the data is normally distributed and randomly selected, q-q plots and scatter plots also show these assumptions had been fulfilled.

Alternatives to the Null Hypothesis Significance Testing

Confidence Intervals:

“Tukey’s method is used in ANOVA to create confidence intervals for all pairwise differences between factor level means while controlling the family error rate to a level you specify. It is important to consider the family error rate when making multiple comparisons because your chances of making a type I error for a series of comparisons is greater than the error rate for any one comparison alone. To counter this higher error rate, Tukey’s method adjusts the confidence level for each individual interval so that the resulting simultaneous confidence level is equal to the value you specify.” [3]

Using Tukey’s method, we specified that the entire set of comparisons should have a family error rate of 0.05 (equivalent to a 95% simultaneous confidence level).

tukey1<-TukeyHSD(model_1)
plot(tukey1)

TukeyHSD(model_1)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov.default(formula = prj2s$wfood ~ prj2s$sex + prj2s$town)

$`prj2s$sex`
                diff          lwr        upr     p adj
woman-man 0.01836739 -0.007680266 0.04441505 0.1666142

$`prj2s$town`
           diff         lwr           upr     p adj
2-1 -0.02510261 -0.08248007  0.0322748400 0.7531385
3-1 -0.05836705 -0.11574451 -0.0009895983 0.0439231
4-1 -0.10224694 -0.15962440 -0.0448694904 0.0000137
5-1 -0.13344935 -0.19082680 -0.0760718919 0.0000000
3-2 -0.03326444 -0.09064189  0.0241130155 0.5068180
4-2 -0.07714433 -0.13452178 -0.0197668766 0.0023661
5-2 -0.10834673 -0.16572419 -0.0509692780 0.0000032
4-3 -0.04387989 -0.10125735  0.0134975617 0.2246311
5-3 -0.07508229 -0.13245975 -0.0177048397 0.0033980
5-4 -0.03120240 -0.08857986  0.0261750524 0.5707725

The confidence interval for the difference between the means of sex; parallel to the ANOVA p calculations; contain zero. Confidence intervals that contain zero indicate no difference in another saying the difference is statistically insignificant.

ReSampling:

“Resampling is the method that consists of drawing repeated samples from the original data samples. Resampling involves the selection of randomized cases with replacement from the original data sample in such a manner that each number of the sample drawn has a number of cases that are similar to the original data sample. Due to replacement, the drawn number of samples that are used by the method of resampling consists of repetitive cases.” [4]

We will use two approaches for resampling of our data:

Monte Carlo simulation:

summary(aov(prj2s$wfood~prj2s$sex))

##              Df Sum Sq Mean Sq F value Pr(>F)
## prj2s$sex     1  0.051 0.05145   1.771  0.184
## Residuals   608 17.658 0.02904

meanstar = mean(prj2s$wfood)
sdstar = sqrt(0.031) #From the summary of ANOVA !!! 
#We are randomly selecting the Sample group everytime when we run the script, so this value may not fit %100 with the above summary.
#
#
simsex = prj2s$sex
R = 10000  
Fstar = numeric(R) 
for (i in 1:R) {
   groupA = rnorm(305, mean=meanstar, sd=sdstar)   
   groupB = rnorm(305, mean=meanstar, sd=sdstar)
   simwfood = c(groupA,groupB)
   simdata = data.frame(simwfood,simsex)
   Fstar[i] = oneway.test(simwfood~simsex, var.equal=T, data=simdata)$statistic
}
hist(Fstar,main="Monte Carlo", ylim=c(0,1), xlim=c(0, 8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,5,90),type="b",col="red")

Bootstrapping - this technique (Efron, 1979) samples a data set with replacement (i.e. a single case may be randomly sampled several times into the bootstrap set). The bootstrap can be applied any number of times, for increased accuracy. Compared with random sampling, the use of sampling with replacement can help to iron out generalization problems caused by the finite size of the data set. [5]

From the example presented in DOE class following code can be applied to this data:

meanstar = with(prj2s, tapply(wfood,sex,mean))
grpA = prj2s$wfood[prj2s$sex=="man"] - meanstar[1]
grpB = prj2s$wfood[prj2s$sex=="woman"] - meanstar[2]
simsex = prj2s$sex

R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=305, replace=T)
  groupB = sample(grpB, size=305, replace=T)

  simwfood = c(groupA,groupB)
  simdata = data.frame(simwfood,simsex)
  Fstar[i] = oneway.test(simwfood~simsex, var.equal=T, data=simdata)$statistic
}

Plot comparing bootstrapped distribution to the known F distribution

hist(Fstar, main="BootStrapping", ylim=c(0,1), xlim=c(0,8), prob=T)
x=seq(.25,6,.25)
points(x,y=df(x,5,90),type="b",col="red")  # The F distribution

The alpha level from the bootstrapped distribution

print(realFstar<-oneway.test(wfood~sex, var.equal=T, data=prj2s)$statistic)

       F 
1.771398

mean(Fstar>=realFstar)

[1] 0.1815

estimate \(\alpha\) = 0.05 value of the test statistic, given F* from empirical distribution above generate quantiles of the analytic F distribution

qf(.95,5,90)

[1] 2.315689

\(\alpha\) = 0.05 value of bootstrapped F* :

quantile(Fstar,.95)

     95% 
3.757723

Now we can again generate ANOVA test with the new data we created with bootstramp resampling:

model_2=lm(simwfood~simsex,data=simdata)
anova(model_2)

Analysis of Variance Table

Response: simwfood
           Df  Sum Sq  Mean Sq F value Pr(>F)
simsex      1  0.0695 0.069451  2.4275 0.1197
Residuals 608 17.3948 0.028610

qqnorm(residuals(model_2))
qqline(residuals(model_2))

par(mfrow=c(2,2))
plot(model_2)

With this p value (higher than 0.05) we again fail to reject the null hypothesis. So this supports our initial finding, but as I mentioned before even I find a p value smaller than 0.05 in my initial analysis (first set selected) the bootstramp resampling is always generating similar results (fail to reject)

Normal distribution evaluation with QQ plot and distribution on Residuals vs. Fits plot shows us that assumptions have been covered for also this model.

4. References to the literature

[1] - http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/doe/basics/replicates-and-repeats-in-designed-experiments/

[2] - http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/doe/basics/what-is-a-block/

[3] - http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/anova/multiple-comparisons/what-is-tukey-s-method/

[4]- http://www.statisticssolutions.com/sample-size-calculation-and-sample-size-justification-resampling/

[5]- https://documents.software.dell.com/statistics/textbook/statistics-glossary/r

Mean point addition to BoxPlots: https://stat.ethz.ch/pipermail/r-help/2004-January/044833.html

5. Appendices

https://cran.r-project.org/web/packages/Ecdat/Ecdat.pdf