From the Ecdat package, VietNamH dataset was used for this project. This dataset has a total of 5999 observations describing the household expenses of households in Vietnam. The dataset can be loaded as follows:
#install.packages("Ecdat")
library("Ecdat")
## Loading required package: Ecfun
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
df <- VietNamH
As described above, the VietNamH dataset has data about the household expenses of households in Vietnam. Specifically, we have the following variables in the data set:
sex: Gender of household head (male, female)age: Age of household headeducyr: Schooling year of household headfarm: yes if farm householdurban: yes if urban householdhhsize: Size of the householdlntotal: Natural logarighm of total expenditure of the householdlnmed: Natural logarithm of medical expenditure of the householdlnrlfood: Natural logarithm of food expenditure of the householdlnexp12m: Natural logarithm of total household health care expenditure for 12 monthscommune: CommuneThis dataset has been taken from the Vietnam World Bank Livings Standards Survey.
head(df)
## sex age educyr farm urban hhsize lntotal lnmed lnrlfood
## 1 female 68 4 no yes 6 10.13649 11.233210 8.639339
## 2 female 57 8 no yes 6 10.25206 8.505120 9.345752
## 3 male 42 14 no yes 6 10.93231 8.713418 10.226330
## 4 female 72 9 no yes 6 10.26749 9.291736 9.263722
## 5 female 73 1 no yes 8 10.48811 7.555382 9.592890
## 6 female 66 13 no yes 7 10.52660 9.789702 9.372034
## lnexp12m commune
## 1 11.233210 1
## 2 8.505120 1
## 3 8.713418 1
## 4 9.291736 1
## 5 7.555382 1
## 6 9.789702 1
The factors and levels of each variable are shown below:
str(df)
## 'data.frame': 5999 obs. of 11 variables:
## $ sex : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
## $ age : int 68 57 42 72 73 66 73 46 50 45 ...
## $ educyr : num 4 8 14 9 1 13 2 9 12 12 ...
## $ farm : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ urban : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ hhsize : int 6 6 6 6 8 7 9 4 5 4 ...
## $ lntotal : num 10.1 10.3 10.9 10.3 10.5 ...
## $ lnmed : num 11.23 8.51 8.71 9.29 7.56 ...
## $ lnrlfood: num 8.64 9.35 10.23 9.26 9.59 ...
## $ lnexp12m: num 11.23 8.51 8.71 9.29 7.56 ...
## $ commune : Factor w/ 194 levels "1","10","100",..: 1 1 1 1 1 1 1 1 1 1 ...
For the purposes of our analysis, we need the variables sex and ubran to be categorical variables.
df$sex <- as.factor(df$sex)
df$urban <- as.factor(df$urban)
Now we can restructure our dataset to keep only three variables that are in consideration: sex, urban and lntotal and look at the summary:
df = df[, c(7,1,5)]
str(df)
## 'data.frame': 5999 obs. of 3 variables:
## $ lntotal: num 10.1 10.3 10.9 10.3 10.5 ...
## $ sex : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
## $ urban : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
levels(df$sex)
## [1] "male" "female"
levels(df$urban)
## [1] "no" "yes"
Thus, all in all, we have the following three variables:
sex: Independent Variable, having two levels (male, female)urban: Independent Variable, having two levels (yes, no)lntotal: Dependent Variable, continuous in nature, with mean value 9.342In the modified dataset, we have two independent variables that are both categorical (sex and urban) and a dependent response variable, that is continuous (lntotal).
As mentioned above, the data consists of two independent categorical variables and a dependent continuous variable.
head(df)
## lntotal sex urban
## 1 10.13649 female yes
## 2 10.25206 female yes
## 3 10.93231 male yes
## 4 10.26749 female yes
## 5 10.48811 female yes
## 6 10.52660 female yes
summary(df)
## lntotal sex urban
## Min. : 6.543 male :4375 no :4269
## 1st Qu.: 8.920 female:1624 yes:1730
## Median : 9.311
## Mean : 9.342
## 3rd Qu.: 9.759
## Max. :12.202
In our analysis, we are using two factors with two levels each. We will block on the variable sex and test our hypothesis on the variable urban. The null hypothesis is as follows:
Now that the null and alternative hypotheses have been established, we will use the following analysis methodology. First, we will perform exploratory data analysis using histograms for categorical variables and box plots for continuous variables. Secondly, we will block on the variable sex and test the null hypothesis using the other categorical variable (urban). Thirdly, the sample size needed to achieve the desired power will be computed using G*Power. The null hypothesis will be evaluated using ANOVA. Finally, model adequacy checking will be performed using Q-Q plots and fitted-residuals plot.
The dataset consists of two categorical independent variables having two levels each, and one dependent continuous response variable. We want to block on the first IV (sex) in order to eliminate the source of variablity caused by the levels of the factor sex. With blocking in effect, we can perform ANOVA on the resulting data to estimate the main effects as we would be left with just one IV. Finally, two alternatives to the NHST will be deployed.
For ANOVA, Randomization must occur at three levels: Random Selection, Random Assignment and Random Execution. Random Selection refers to selecting a random set of sample from the population. Random Assignment refers to assigning those samples to different groups. Random Execution refers to ordering the experimental trials randomly [1]. The data comes from the Vietnam World Bank Livings Standards Survey. There doesn’t seem to be any information pertaining to random selection. However, random assignment and random execution is ensured in the analysis.
Blocking refers to arranging the experimental runs in blocks that are similar to each other [2]. In this experiment, as mentioned above, we would be blocking on the categorical IV sex in order to eliminate any variablity caused by the levels of this factor.
Replication is the repeatition of an experimental condition so that the variablity associated with the phenomenon under study can be estimated [3]. Repeated measures refers to using the same subjects with every branch of research, including the control group [4]. In this dataset, there are several replications of possible configurations of the input IVs. There is no evidence of repeated measures.
The following figure shows the histograms for the response variable (lntotal), which is the total annual expenditure of a particular household:
# Histogram for sex
hist(df$lntotal, xlab="Natural Logarithm of Total Expenditure", main="Histogram of Log of Total Expenditure")
Just by visual inspection, it seems that the response variable is normally distributed.
Following figures show the box plots for the two IVs (sex, i.e. sex of the head of the household, and urban, i.e. whether the household is urban or not).
boxplot(df$lntotal~df$sex, xlab="Sex", ylab="Log of Total Expenditure", main="Boxplot of the IV Sex")
boxplot(df$lntotal~df$urban, xlab="Urban?", ylab="Log of Total Expenditure", main="Boxplot of the IV Urban")
Now that the exploratory data analysis is completed, we can perform the actual analysis. However, before we begin the analysis, we need the effect size, \(\alpha\), \(\beta\) and number of blocks. We will calculate the effect size using the cohensD function in R.
# install.packages("lsr")
library(lsr)
df_male <- subset(df$lntotal, df$sex == "male")
df_female <- subset(df$lntotal, df$sex == "female")
cohensD(df_male, df_female)
## [1] 0.1841349
Thus, we have following inputs to G*Power:
Thus, from the figure above, the sample size is 234. Using the sample size of 234, we can create a subset of our dataset by randomly picking 234 samples.
SAMPLE_SIZE <- 234
df_sample <- df[sample(nrow(df), SAMPLE_SIZE),]
head(df_sample)
## lntotal sex urban
## 1908 8.720575 male no
## 3768 9.461303 male no
## 286 10.389930 female yes
## 477 8.383971 female yes
## 4704 10.238810 male no
## 190 10.769820 male yes
summary(df_sample)
## lntotal sex urban
## Min. : 7.504 male :174 no :166
## 1st Qu.: 8.863 female: 60 yes: 68
## Median : 9.248
## Mean : 9.322
## 3rd Qu.: 9.728
## Max. :11.073
Now that we have sampled 234 rows from our original dataset, we can perform 2-way ANOVA in order to determine the \(F\)-statistic and the associated \(p\)-value.
model1 <- aov(df_sample$lntotal~df_sample$sex*df_sample$urban)
anova(model1)
## Analysis of Variance Table
##
## Response: df_sample$lntotal
## Df Sum Sq Mean Sq F value Pr(>F)
## df_sample$sex 1 0.255 0.255 0.6540 0.4195
## df_sample$urban 1 32.597 32.597 83.6245 <2e-16 ***
## df_sample$sex:df_sample$urban 1 0.028 0.028 0.0718 0.7890
## Residuals 230 89.654 0.390
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After repeating the above experiment several times, it is clear that there is a main effect of the factor urban, i.e. there is some relation between the household expenditures of familier living in urban areas as compared to those living in non-urban areas. There is no statistically significant main effect of the factor sex, i.e. the household expenditure is not really affected by whether the head of the house is male of female. Also, there doesn’t seem to be any statistically significant interaction effect of the two factors.
We can go ahead and check the model adequacy for the above model.
qqnorm(residuals(model1))
qqline(residuals(model1))
plot(fitted(model1), residuals(model1), xlab="Fitted", ylab="Residuals")
The Q-Q plot and the fitted-residuals plot suggest that the data is normally distributed and the assumptions of ANOVA are satisfied. However, depending on the samples chosen, the results of ANOVA are different.
From the DataCamp video lectures, we know that “Tukey’s method is a single-step multiple comparison procedure and statistical test. It can be used in conjunction with ANOVA to find means that are significantly different from each other.” [5]
tukey <- TukeyHSD(model1)
plot(tukey)
In line with the conclusions from ANOVA, the 95% confidence interval for the sex factor contains 0.00 while the urban factor doesn’t. Thus, there is a significant effect of the urban factor, but not the sex factor. Also, note that the two of the confidence intervals containing the interaction plot also contain 0.00, which means there doesn’t seem to be a statistically significant interaction effect of the two factors.
TukeyHSD(model1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = df_sample$lntotal ~ df_sample$sex * df_sample$urban)
##
## $`df_sample$sex`
## diff lwr upr p adj
## female-male 0.07558828 -0.1085817 0.2597583 0.4195368
##
## $`df_sample$urban`
## diff lwr upr p adj
## yes-no 0.7913101 0.6141929 0.9684272 0
##
## $`df_sample$sex:df_sample$urban`
## diff lwr upr p adj
## female:no-male:no -0.1863743 -0.5122824 0.1395339 0.4512950
## male:yes-male:no 0.8361649 0.5396922 1.1326375 0.0000000
## female:yes-male:no 0.7027723 0.3768641 1.0286804 0.0000004
## male:yes-female:no 1.0225392 0.6279244 1.4171539 0.0000000
## female:yes-female:no 0.8891466 0.4719645 1.3063286 0.0000006
## female:yes-male:yes -0.1333926 -0.5280073 0.2612221 0.8179055
“In statistics, resampling is any of a variety of methods for doing the following tasks: (i) Estimating the precision of sample statistics by using subsets of available data or drawing randomly with replacement from a set of data points; (ii) Exchanging labels on data points when performing significance tests; (iii) Validating models by using random subsets.” [6] For the purposes of our analysis, we will use bootstrapping. “Bootstrapping refers to any test or metric that relies on random sampling with replacement. It allows assigning measures of accuracy to sample estimates.” [7] Using the code from the lecture slides, we can do the following analysis.
meanstar = with(df_sample, tapply(lntotal, sex, mean))
groupA = df_sample$lntotal[df_sample$sex == "male"] - meanstar[1]
groupB = df_sample$lntotal[df_sample$sex == "female"] - meanstar[2]
ssex = df_sample$sex
# Number of iterations
N = 10000
Fstar = numeric(N)
for (i in 1:N)
{
grpA = sample(groupA, SAMPLE_SIZE/2, replace=T)
grpB = sample(groupB, SAMPLE_SIZE/2, replace=T)
slntotal = c(grpA, grpB)
sdata = data.frame(slntotal, ssex)
Fstar[i] = oneway.test(slntotal~ssex, var.equal = T, data=sdata)$statistic
}
hist(Fstar, main="Bootstrapping Plot")
The \(\alpha\) can be determined from the bootstrapped distribution as follows:
print(trueFstar <- oneway.test(lntotal~sex, var.equal=T, data=df_sample)$statistic)
## F
## 0.4836471
mean(Fstar >= trueFstar)
## [1] 0.4795
qf(0.95, 5, 90)
## [1] 2.315689
quantile(Fstar, 0.95)
## 95%
## 3.65075
Now that we have the bootstrap sampling, we can perform ANOVA again.
model2 = lm(slntotal~ssex, data=sdata)
anova(model2)
## Analysis of Variance Table
##
## Response: slntotal
## Df Sum Sq Mean Sq F value Pr(>F)
## ssex 1 0.002 0.00237 0.004 0.9494
## Residuals 232 136.335 0.58765
qqnorm(residuals(model2))
qqline(residuals(model2))
plot(fitted(model2), residuals(model2))
From the ANOVA table, we see that there does not seem to be any significant effect of the IV sex on the household expenditure, i.e. the sex of the household head does not affect the expenditure. This conclusion simply reinforces our conclusions from ANOVA and Confidence Interval tests.