RPI

November 10, 2016 V1.0

1. Setting

From the Ecdat package, VietNamH dataset was used for this project. This dataset has a total of 5999 observations describing the household expenses of households in Vietnam. The dataset can be loaded as follows:

#install.packages("Ecdat")
library("Ecdat")

## Loading required package: Ecfun

## 
## Attaching package: 'Ecfun'

## The following object is masked from 'package:base':
## 
##     sign

## 
## Attaching package: 'Ecdat'

## The following object is masked from 'package:datasets':
## 
##     Orange

df <- VietNamH

System Under Test

As described above, the VietNamH dataset has data about the household expenses of households in Vietnam. Specifically, we have the following variables in the data set:

sex: Gender of household head (male, female)
age: Age of household head
educyr: Schooling year of household head
farm: yes if farm household
urban: yes if urban household
hhsize: Size of the household
lntotal: Natural logarighm of total expenditure of the household
lnmed: Natural logarithm of medical expenditure of the household
lnrlfood: Natural logarithm of food expenditure of the household
lnexp12m: Natural logarithm of total household health care expenditure for 12 months
commune: Commune

This dataset has been taken from the Vietnam World Bank Livings Standards Survey.

head(df)

##      sex age educyr farm urban hhsize  lntotal     lnmed  lnrlfood
## 1 female  68      4   no   yes      6 10.13649 11.233210  8.639339
## 2 female  57      8   no   yes      6 10.25206  8.505120  9.345752
## 3   male  42     14   no   yes      6 10.93231  8.713418 10.226330
## 4 female  72      9   no   yes      6 10.26749  9.291736  9.263722
## 5 female  73      1   no   yes      8 10.48811  7.555382  9.592890
## 6 female  66     13   no   yes      7 10.52660  9.789702  9.372034
##    lnexp12m commune
## 1 11.233210       1
## 2  8.505120       1
## 3  8.713418       1
## 4  9.291736       1
## 5  7.555382       1
## 6  9.789702       1

Factors and Levels

The factors and levels of each variable are shown below:

str(df)

## 'data.frame':    5999 obs. of  11 variables:
##  $ sex     : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
##  $ age     : int  68 57 42 72 73 66 73 46 50 45 ...
##  $ educyr  : num  4 8 14 9 1 13 2 9 12 12 ...
##  $ farm    : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ urban   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hhsize  : int  6 6 6 6 8 7 9 4 5 4 ...
##  $ lntotal : num  10.1 10.3 10.9 10.3 10.5 ...
##  $ lnmed   : num  11.23 8.51 8.71 9.29 7.56 ...
##  $ lnrlfood: num  8.64 9.35 10.23 9.26 9.59 ...
##  $ lnexp12m: num  11.23 8.51 8.71 9.29 7.56 ...
##  $ commune : Factor w/ 194 levels "1","10","100",..: 1 1 1 1 1 1 1 1 1 1 ...

For the purposes of our analysis, we need the variables sex and ubran to be categorical variables.

df$sex <- as.factor(df$sex)
df$urban <- as.factor(df$urban)

Now we can restructure our dataset to keep only three variables that are in consideration: sex, urban and lntotal and look at the summary:

df = df[, c(7,1,5)]
str(df)

## 'data.frame':    5999 obs. of  3 variables:
##  $ lntotal: num  10.1 10.3 10.9 10.3 10.5 ...
##  $ sex    : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
##  $ urban  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...

levels(df$sex)

## [1] "male"   "female"

levels(df$urban)

## [1] "no"  "yes"

Thus, all in all, we have the following three variables:

sex: Independent Variable, having two levels (male, female)
urban: Independent Variable, having two levels (yes, no)
lntotal: Dependent Variable, continuous in nature, with mean value 9.342

Continuous Variables and Response Variables

In the modified dataset, we have two independent variables that are both categorical (sex and urban) and a dependent response variable, that is continuous (lntotal).

The Data: How is it organized and what does it look like?

As mentioned above, the data consists of two independent categorical variables and a dependent continuous variable.

head(df)

##    lntotal    sex urban
## 1 10.13649 female   yes
## 2 10.25206 female   yes
## 3 10.93231   male   yes
## 4 10.26749 female   yes
## 5 10.48811 female   yes
## 6 10.52660 female   yes

summary(df)

##     lntotal           sex       urban     
##  Min.   : 6.543   male  :4375   no :4269  
##  1st Qu.: 8.920   female:1624   yes:1730  
##  Median : 9.311                           
##  Mean   : 9.342                           
##  3rd Qu.: 9.759                           
##  Max.   :12.202

2. Experimental Design

How will the experiment be organized and conducted to test the hypothesis?

In our analysis, we are using two factors with two levels each. We will block on the variable sex and test our hypothesis on the variable urban. The null hypothesis is as follows:

\(H_0\): There is no difference between the family expenditure in urban and non-urban areas.
\(H_a\): There is difference between the family expenditure in urban and non-urban areas, and this difference is statistically significant.

Now that the null and alternative hypotheses have been established, we will use the following analysis methodology. First, we will perform exploratory data analysis using histograms for categorical variables and box plots for continuous variables. Secondly, we will block on the variable sex and test the null hypothesis using the other categorical variable (urban). Thirdly, the sample size needed to achieve the desired power will be computed using G*Power. The null hypothesis will be evaluated using ANOVA. Finally, model adequacy checking will be performed using Q-Q plots and fitted-residuals plot.

What is the rationale for this design?

The dataset consists of two categorical independent variables having two levels each, and one dependent continuous response variable. We want to block on the first IV (sex) in order to eliminate the source of variablity caused by the levels of the factor sex. With blocking in effect, we can perform ANOVA on the resulting data to estimate the main effects as we would be left with just one IV. Finally, two alternatives to the NHST will be deployed.

Randomization, Blocking and Replication

For ANOVA, Randomization must occur at three levels: Random Selection, Random Assignment and Random Execution. Random Selection refers to selecting a random set of sample from the population. Random Assignment refers to assigning those samples to different groups. Random Execution refers to ordering the experimental trials randomly [1]. The data comes from the Vietnam World Bank Livings Standards Survey. There doesn’t seem to be any information pertaining to random selection. However, random assignment and random execution is ensured in the analysis.

Blocking refers to arranging the experimental runs in blocks that are similar to each other [2]. In this experiment, as mentioned above, we would be blocking on the categorical IV sex in order to eliminate any variablity caused by the levels of this factor.

Replication is the repeatition of an experimental condition so that the variablity associated with the phenomenon under study can be estimated [3]. Repeated measures refers to using the same subjects with every branch of research, including the control group [4]. In this dataset, there are several replications of possible configurations of the input IVs. There is no evidence of repeated measures.

3. Statistical Analysis

Exploratory Data Analysis (EDA)

The following figure shows the histograms for the response variable (lntotal), which is the total annual expenditure of a particular household:

# Histogram for sex
hist(df$lntotal, xlab="Natural Logarithm of Total Expenditure", main="Histogram of Log of Total Expenditure")

Just by visual inspection, it seems that the response variable is normally distributed.

Following figures show the box plots for the two IVs (sex, i.e. sex of the head of the household, and urban, i.e. whether the household is urban or not).

boxplot(df$lntotal~df$sex, xlab="Sex", ylab="Log of Total Expenditure", main="Boxplot of the IV Sex")

boxplot(df$lntotal~df$urban, xlab="Urban?", ylab="Log of Total Expenditure", main="Boxplot of the IV Urban")

Testing

Now that the exploratory data analysis is completed, we can perform the actual analysis. However, before we begin the analysis, we need the effect size, \(\alpha\), \(\beta\) and number of blocks. We will calculate the effect size using the cohensD function in R.

# install.packages("lsr")
library(lsr)
df_male <- subset(df$lntotal, df$sex == "male")
df_female <- subset(df$lntotal, df$sex == "female")
cohensD(df_male, df_female)

## [1] 0.1841349

Thus, we have following inputs to G*Power:

Effect Size (given by CohensD): 0.1841349
\(\alpha\): 0.05
\(\beta\): 0.20 (power = 0.80)
Number of groups = 2

Thus, from the figure above, the sample size is 234. Using the sample size of 234, we can create a subset of our dataset by randomly picking 234 samples.

SAMPLE_SIZE <- 234
df_sample <- df[sample(nrow(df), SAMPLE_SIZE),]
head(df_sample)

##        lntotal    sex urban
## 1908  8.720575   male    no
## 3768  9.461303   male    no
## 286  10.389930 female   yes
## 477   8.383971 female   yes
## 4704 10.238810   male    no
## 190  10.769820   male   yes

summary(df_sample)

##     lntotal           sex      urban    
##  Min.   : 7.504   male  :174   no :166  
##  1st Qu.: 8.863   female: 60   yes: 68  
##  Median : 9.248                         
##  Mean   : 9.322                         
##  3rd Qu.: 9.728                         
##  Max.   :11.073

Now that we have sampled 234 rows from our original dataset, we can perform 2-way ANOVA in order to determine the \(F\)-statistic and the associated \(p\)-value.

model1 <- aov(df_sample$lntotal~df_sample$sex*df_sample$urban)
anova(model1)

## Analysis of Variance Table
## 
## Response: df_sample$lntotal
##                                Df Sum Sq Mean Sq F value Pr(>F)    
## df_sample$sex                   1  0.255   0.255  0.6540 0.4195    
## df_sample$urban                 1 32.597  32.597 83.6245 <2e-16 ***
## df_sample$sex:df_sample$urban   1  0.028   0.028  0.0718 0.7890    
## Residuals                     230 89.654   0.390                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After repeating the above experiment several times, it is clear that there is a main effect of the factor urban, i.e. there is some relation between the household expenditures of familier living in urban areas as compared to those living in non-urban areas. There is no statistically significant main effect of the factor sex, i.e. the household expenditure is not really affected by whether the head of the house is male of female. Also, there doesn’t seem to be any statistically significant interaction effect of the two factors.

We can go ahead and check the model adequacy for the above model.

qqnorm(residuals(model1))
qqline(residuals(model1))

plot(fitted(model1), residuals(model1), xlab="Fitted", ylab="Residuals")

The Q-Q plot and the fitted-residuals plot suggest that the data is normally distributed and the assumptions of ANOVA are satisfied. However, depending on the samples chosen, the results of ANOVA are different.

Alternatives to NHST

Confidence Intervals:

From the DataCamp video lectures, we know that “Tukey’s method is a single-step multiple comparison procedure and statistical test. It can be used in conjunction with ANOVA to find means that are significantly different from each other.” [5]

tukey <- TukeyHSD(model1)
plot(tukey)

In line with the conclusions from ANOVA, the 95% confidence interval for the sex factor contains 0.00 while the urban factor doesn’t. Thus, there is a significant effect of the urban factor, but not the sex factor. Also, note that the two of the confidence intervals containing the interaction plot also contain 0.00, which means there doesn’t seem to be a statistically significant interaction effect of the two factors.

TukeyHSD(model1)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = df_sample$lntotal ~ df_sample$sex * df_sample$urban)
## 
## $`df_sample$sex`
##                   diff        lwr       upr     p adj
## female-male 0.07558828 -0.1085817 0.2597583 0.4195368
## 
## $`df_sample$urban`
##             diff       lwr       upr p adj
## yes-no 0.7913101 0.6141929 0.9684272     0
## 
## $`df_sample$sex:df_sample$urban`
##                            diff        lwr       upr     p adj
## female:no-male:no    -0.1863743 -0.5122824 0.1395339 0.4512950
## male:yes-male:no      0.8361649  0.5396922 1.1326375 0.0000000
## female:yes-male:no    0.7027723  0.3768641 1.0286804 0.0000004
## male:yes-female:no    1.0225392  0.6279244 1.4171539 0.0000000
## female:yes-female:no  0.8891466  0.4719645 1.3063286 0.0000006
## female:yes-male:yes  -0.1333926 -0.5280073 0.2612221 0.8179055

Resampling

“In statistics, resampling is any of a variety of methods for doing the following tasks: (i) Estimating the precision of sample statistics by using subsets of available data or drawing randomly with replacement from a set of data points; (ii) Exchanging labels on data points when performing significance tests; (iii) Validating models by using random subsets.” [6] For the purposes of our analysis, we will use bootstrapping. “Bootstrapping refers to any test or metric that relies on random sampling with replacement. It allows assigning measures of accuracy to sample estimates.” [7] Using the code from the lecture slides, we can do the following analysis.

meanstar = with(df_sample, tapply(lntotal, sex, mean))
groupA = df_sample$lntotal[df_sample$sex == "male"] - meanstar[1]
groupB = df_sample$lntotal[df_sample$sex == "female"] - meanstar[2]
ssex = df_sample$sex

# Number of iterations
N = 10000 
Fstar = numeric(N)
for (i in 1:N)
{
  grpA = sample(groupA, SAMPLE_SIZE/2, replace=T)
  grpB = sample(groupB, SAMPLE_SIZE/2, replace=T)
  
  slntotal = c(grpA, grpB)
  sdata = data.frame(slntotal, ssex)
  Fstar[i] = oneway.test(slntotal~ssex, var.equal = T, data=sdata)$statistic
}

hist(Fstar, main="Bootstrapping Plot")

The \(\alpha\) can be determined from the bootstrapped distribution as follows:

print(trueFstar <- oneway.test(lntotal~sex, var.equal=T, data=df_sample)$statistic)

##         F 
## 0.4836471

mean(Fstar >= trueFstar)

## [1] 0.4795

qf(0.95, 5, 90)

## [1] 2.315689

quantile(Fstar, 0.95)

##     95% 
## 3.65075

Now that we have the bootstrap sampling, we can perform ANOVA again.

model2 = lm(slntotal~ssex, data=sdata)
anova(model2)

## Analysis of Variance Table
## 
## Response: slntotal
##            Df  Sum Sq Mean Sq F value Pr(>F)
## ssex        1   0.002 0.00237   0.004 0.9494
## Residuals 232 136.335 0.58765

qqnorm(residuals(model2))
qqline(residuals(model2))

plot(fitted(model2), residuals(model2))

From the ANOVA table, we see that there does not seem to be any significant effect of the IV sex on the household expenditure, i.e. the sex of the household head does not affect the expenditure. This conclusion simply reinforces our conclusions from ANOVA and Confidence Interval tests.

References

[1] [Randomized experiment](https://en.wikipedia.org/wiki/Randomized_experiment)
[2] [Blocking](https://en.wikipedia.org/wiki/Blocking_(statistics))
[3] [Replication](https://en.wikipedia.org/wiki/Replication_(statistics))
[4] [Repeated Measures](https://en.wikipedia.org/wiki/Repeated_measures_design)
[5] [Tukey’s Test](https://en.wikipedia.org/wiki/Tukey%27s_range_test)
6(https://en.wikipedia.org/wiki/Resampling_(statistics))
[7] [Bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))

Project 2: Null Hypythesis Statistical Testing

Prasanna Date

RPI