This recipe is examining the budget share of food for Spanish households from the Ecdat package.Using this data set, we are testing a model with a single factor with multiple levels on a single response variable.
#installing package
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\tranc3\AppData\Local\Temp\RtmpWAU1Z6\downloaded_packages
library("Ecdat", lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
#Pulling the Food data from the library
Food1<-BudgetFood
Food<-subset(Food1,Food1$size<17)
We are examining the effect of the size of the household on the response variable percentage of total expenditure which the household has spend on food. One may assume that a larger size of the household would lead to a larger percentage of total expenditure which the household has spent on food. Household sizes range from 1-15.
head(Food)
## wfood totexp age size town sex
## 1 0.4677 1290941 43 5 2 man
## 2 0.3130 1277978 40 3 2 man
## 3 0.3765 845852 28 3 2 man
## 4 0.4397 527698 60 1 2 woman
## 5 0.4036 1103220 37 5 2 man
## 6 0.1993 1768128 35 4 2 man
tail(Food)
## wfood totexp age size town sex
## 23967 0.6568 257376 81 2 4 man
## 23968 0.5569 432818 87 4 4 man
## 23969 0.4333 1096826 45 2 4 man
## 23970 0.5726 807030 36 3 4 man
## 23971 0.0000 107320 76 1 4 woman
## 23972 0.3907 342832 54 2 4 man
The continuous variables in the data set are the percentage of total expenditure which the household has spent on food and the total expenditure of the household.
In this experiment, we are looking at the percentage of total expenditure which the household has spent on food and the effect of the single factor size of the household.
The dataset has 23972 observations of households in Spain. They are organized into 6 variables: wfood, totexp, age, size, town, and sex. Wfood is the percentage of total expenditure which the household has spent on food. totexp is the total expenditure of the household. age is the age of reference person in the household. size is the size of the household. Town is the size of the town where the household is placed and categorized into 5 groups, 1 is for small and 5 is for big. sex is the sex of reference person (man or woman)
The dataset consists of observations so it unknow if there was randomization.
This experiment being conducted is to determine if the variation of percentage of total expenditure which the household has spend on food can be attributed to the variation in the size of the household. The null hypothesis for this ANOVA is that the variation in the percentage of total expenditure which the household has spent on food can not be attributed to anything other than randomization. The alternate hypothesis is that the variation in the percentage of total expenditure which the household has spent on food can be attributed to the size of the household.
One may think that most households would spend the same percentage of their total expenditure on food. Or some may assume that a larger household may use a larger percentage on food. We can see if the size of ones household has an effect on the percentage of total expenditure which the household spent on food.
The dataset is the observations of spanish households so it was not randomized
There are no replicates or repeated measures
There was no blocking used in the design.
Food$size=as.factor(Food$size)
summary(Food)
## wfood totexp age size
## Min. :0.000 Min. : 14601 Min. :16.0 4 :5478
## 1st Qu.:0.258 1st Qu.: 449796 1st Qu.:38.0 2 :5146
## Median :0.364 Median : 731130 Median :50.0 3 :4409
## Mean :0.378 Mean : 865564 Mean :50.5 5 :3563
## 3rd Qu.:0.485 3rd Qu.: 1112640 3rd Qu.:62.0 1 :1945
## Max. :0.997 Max. :11397547 Max. :99.0 6 :1899
## (Other):1530
## town sex
## Min. :1.00 man :20623
## 1st Qu.:2.00 woman: 3347
## Median :4.00
## Mean :3.24
## 3rd Qu.:4.00
## Max. :5.00
##
boxplot(wfood~size, data=Food, xlab="Size of Household", ylab="Percentage of expenditure spent on food")
title("Percentage of expenditure spend on food")
When looking at the boxplots, households with one person has a range of percentages from 0 to 1. This means that a person that lives by themselves either spends none or almost all of their expenditures on food.The medians for the percentage of total expenditure spent on food for all of the different sizes of households are all around .4.I find it interesting that the ranges of values for the households are larger for smaller households and the range gets smaller as the size of the household increases.This makes sense because when households get bigger, people probably have to be more careful with money so they cant spend a lot of money on food but also have to spend a good amount to make sure everyone gets fed.
model1=aov(wfood~size, data=Food)
anova(model1)
## Analysis of Variance Table
##
## Response: wfood
## Df Sum Sq Mean Sq F value Pr(>F)
## size 14 16 1.174 43.9 <2e-16 ***
## Residuals 23955 641 0.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis for this ANOVA is that the variation in the percentage of total expenditure which the household has spent on food can not be attributed to anything other than randomization. The alternate hypothesis is that the variation in the percentage of total expenditure which the household has spent on food can be attributed to the size of the household.
From our model we would reject our null hypothesis and the variation in the percentage of total expenditure which the household has spent on food can be attributed to something other than randomization such as size of household. There is a 2e-16 probability of getting the F value of 43.9.
#Makes an array of the means of the levels
with(Food,tapply(wfood, size, mean))
## 1 2 3 4 5 6 7 8 9 10
## 0.3903 0.4185 0.3590 0.3529 0.3635 0.3807 0.3937 0.4074 0.4288 0.4285
## 11 12 13 14 15
## 0.4616 0.4842 0.5008 0.5796 0.5908
#makes an array of the variances of the levels
with(Food,tapply(wfood, size, var))
## 1 2 3 4 5 6 7 8 9
## 0.04634 0.03457 0.02505 0.02086 0.02041 0.02034 0.02390 0.02478 0.02094
## 10 11 12 13 14 15
## 0.03068 0.03592 0.03131 0.05137 0.06127 0.02391
#makes an array of the lenths of the levels
with(Food,tapply(wfood, size, length))
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1945 5146 4409 5478 3563 1899 843 376 182 79 21 16 7 3 3
#summary ANOVA model
summary(aov(wfood~size, data=Food))
## Df Sum Sq Mean Sq F value Pr(>F)
## size 14 16 1.174 43.9 <2e-16 ***
## Residuals 23955 641 0.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#mean of the percentages
meanstar=mean(Food$wfood)
#squareroot of the mean square of the residuals
sdstar=sqrt(.027)
simsize=Food$size
R=10000 #number of runs
Fstar = numeric(R)
for (i in 1:R) {
# make residual normally distributed with known pooled variance
# this loop is a Monte Carlo simulation since distribution is known normal
groupA = rnorm(1598, mean=meanstar, sd=sdstar)
groupB = rnorm(1598, mean=meanstar, sd=sdstar)
groupC = rnorm(1598, mean=meanstar, sd=sdstar)
groupD = rnorm(1598, mean=meanstar, sd=sdstar)
groupE = rnorm(1598, mean=meanstar, sd=sdstar)
groupF = rnorm(1598, mean=meanstar, sd=sdstar)
groupG = rnorm(1598, mean=meanstar, sd=sdstar)
groupH = rnorm(1598, mean=meanstar, sd=sdstar)
groupI = rnorm(1598, mean=meanstar, sd=sdstar)
groupJ = rnorm(1598, mean=meanstar, sd=sdstar)
groupK = rnorm(1598, mean=meanstar, sd=sdstar)
groupL = rnorm(1598, mean=meanstar, sd=sdstar)
groupM = rnorm(1598, mean=meanstar, sd=sdstar)
groupN = rnorm(1598, mean=meanstar, sd=sdstar)
groupO = rnorm(1598, mean=meanstar, sd=sdstar)
simyield = c(groupA,groupB,groupC,groupD,groupE,groupF, groupG, groupH, groupI, groupJ, groupK, groupL, groupM, groupN, groupO)
simdata = data.frame(simyield,simsize)
Fstar[i] = oneway.test(simyield~simsize, var.equal=T, data=simdata)$statistic
}
hist(Fstar,ylim=c(0,1), xlim=c(0,3),prob=T)
x=seq(.25,3,.1)
points(x,y=df(x,14,23955),type="b",col="red")
The code above performed a montecarlo simulation. A montecarlo simulation performs risk analysis by building models of possible reults by substitute a range of values. We made random normal distributed data with 23970 observations. The results shown are the f distribution. The histogram shows the distribution with 14 degrees of freedom for the size and 23955 degrees of freedom for the residuals. When looking at the histogram, the empirical and analytical distributions have a good fit.
meanstar = with(Food, tapply(wfood,size,mean))
grpA = Food$wfood[Food$size=="1"] - meanstar[1]
grpB = Food$wfood[Food$size=="2"] - meanstar[2]
grpC = Food$wfood[Food$size=="3"] - meanstar[3]
grpD = Food$wfood[Food$size=="4"] - meanstar[4]
grpE = Food$wfood[Food$size=="5"] - meanstar[5]
grpF = Food$wfood[Food$size=="6"] - meanstar[6]
grpG = Food$wfood[Food$size=="7"] - meanstar[7]
grpH= Food$wfood[Food$size=="8"] - meanstar[8]
grpI= Food$wfood[Food$size=="9"] - meanstar[9]
grpJ= Food$wfood[Food$size=="10"] - meanstar[10]
grpK= Food$wfood[Food$size=="11"] - meanstar[11]
grpL= Food$wfood[Food$size=="12"] - meanstar[12]
grpM= Food$wfood[Food$size=="13"] - meanstar[13]
grpN= Food$wfood[Food$size=="14"] - meanstar[14]
grpO= Food$wfood[Food$size=="15"] - meanstar[15]
simblock = Food$size
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
groupA = sample(grpA, size=1598, replace=T)
groupB = sample(grpB, size=1598, replace=T)
groupC = sample(grpC, size=1598, replace=T)
groupD = sample(grpD, size=1598, replace=T)
groupE = sample(grpE, size=1598, replace=T)
groupF = sample(grpF, size=1598, replace=T)
groupG = sample(grpG, size=1598, replace=T)
groupH = sample(grpH, size=1598, replace=T)
groupI = sample(grpI, size=1598, replace=T)
groupJ = sample(grpJ, size=1598, replace=T)
groupK = sample(grpK, size=1598, replace=T)
groupL = sample(grpL, size=1598, replace=T)
groupM = sample(grpM, size=1598, replace=T)
groupN = sample(grpN, size=1598, replace=T)
groupO = sample(grpO, size=1598, replace=T)
simyield = c(groupA,groupB,groupC,groupD,groupE,groupF,groupG,groupH,groupI,groupJ,groupK,groupL,groupM,groupN,groupO)
simdata = data.frame(simyield,simblock)
Fstar[i] = oneway.test(simyield~simblock, var.equal=T, data=simdata)$statistic
}
hist(Fstar,ylim=c(0,1), xlim=c(0,3),prob=T)
x=seq(.25,3,.1)
points(x,y=df(x,14,23955),type="b",col="red")
print(realFstar<-oneway.test(wfood~size, var.equal=T, data=Food)$statistic)
## F
## 43.86
mean(Fstar>=realFstar)
## [1] 0
qf(.95,14,23955)
## [1] 1.692
quantile(Fstar,.95)
## 95%
## 1.644
The bootstrap method uses random sampling with replacement. It estimates properties of an estimator (variance) by measuring those properties when sampling from an approximating distribution. When looking at the bootstrap, the models of the analytical and the empirical have a good fit.
model2=aov(simyield~simblock, data=simdata)
anova(model2)
## Analysis of Variance Table
##
## Response: simyield
## Df Sum Sq Mean Sq F value Pr(>F)
## simblock 14 0 0.0234 0.81 0.66
## Residuals 23955 691 0.0289
Looking at the ANOVA of the resampled data, the p-value changed to .94. With an F value of .48 and a p-value of .94, we would fail to reject the null hypothesis and the variation in the percentage of total expenditure which the household has spent on food cannot be attributed to anything other than randomization.
qqnorm(residuals(model1))
qqline(residuals(model1))
plot(fitted(model1), residuals(model1))
qqnorm(residuals(model2))
qqline(residuals(model2))
plot(fitted(model2), residuals(model2))
A Q-Q plot can be used to compare the shape of the distribution of the dataset. The Q-Q plot and Q-Q line of the residuals appear to be normal. However at each of the ends, the residuals seem to stray from the line. When looking at the plot of the fitted model and the residuals, a good fit would be if the plot was symmetric at zero and had a high variation. The points seem to be symmetric at zero however the points at the beginning have higher variation and it decreases towards the right of the plot.
Since the dataset was all observation from homes in Spain, it is possible that all of the values were from chance and that there might have been something different about the year the observations were made.There may be other nuisance factors that were not recorded that year. Also when looking at the boxplots of the variables we were testing, the distributions werent normal. The Kruskal Wallis test does not assume a normal distrubtion of the residuals and could be used to test if the variation can be attributed to anything other than randomization without assuming a normal distribution.
kruskal.test(wfood~size, data=Food)
##
## Kruskal-Wallis rank sum test
##
## data: wfood by size
## Kruskal-Wallis chi-squared = 533.9, df = 14, p-value < 2.2e-16
When looking at the results of the Kruskal-Wallis test, we would reject the null hypothesis and the variation in the percentage of total expenditure which the household has spent on food can be attributed to the size of the household.