Cheryl Tran

RPI

11/05/2014 Version 1

1. Setting

System under test

This recipe examines the budget share of food for Spanish households, using the BudgetFood data set from the Ecdat package. Using this data set, we test a model with a single factor with multiple levels on a single response variable.

#installing package
install.packages("Ecdat", repos='http://cran.us.r-project.org')
## package 'Ecdat' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\tranc3\AppData\Local\Temp\RtmpWAU1Z6\downloaded_packages
library("Ecdat", lib.loc="C:/Program Files/R/R-3.1.1/library")
## Warning: package 'Ecdat' was built under R version 3.1.2
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
#Pulling the Food data from the Ecdat package
Food1 <- BudgetFood
#Keeping only households with fewer than 17 members
Food <- subset(Food1, size < 17)

Factors and Levels

We are examining the effect of the size of the household on the response variable: the percentage of total expenditure which the household has spent on food. One might assume that a larger household would spend a larger percentage of its total expenditure on food. Household sizes range from 1 to 15.
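As a quick sanity check of the factor and its levels (a small sketch, assuming the Food data frame built above), the number of households at each size can be tabulated:

#counts of households at each size level (1-15) after the subset above
table(Food$size)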

head(Food)
##    wfood  totexp age size town   sex
## 1 0.4677 1290941  43    5    2   man
## 2 0.3130 1277978  40    3    2   man
## 3 0.3765  845852  28    3    2   man
## 4 0.4397  527698  60    1    2 woman
## 5 0.4036 1103220  37    5    2   man
## 6 0.1993 1768128  35    4    2   man
tail(Food)
##        wfood  totexp age size town   sex
## 23967 0.6568  257376  81    2    4   man
## 23968 0.5569  432818  87    4    4   man
## 23969 0.4333 1096826  45    2    4   man
## 23970 0.5726  807030  36    3    4   man
## 23971 0.0000  107320  76    1    4 woman
## 23972 0.3907  342832  54    2    4   man

Continuous variables (if any)

The continuous variables in the data set are the percentage of total expenditure which the household has spent on food (wfood) and the total expenditure of the household (totexp).

Response variables

In this experiment, the response variable is the percentage of total expenditure which the household has spent on food, and we are testing the effect of the single factor, size of the household, on that response.

The Data: How is it organized and what does it look like?

The dataset has 23972 observations of households in Spain, organized into 6 variables:

- wfood: the percentage of total expenditure which the household has spent on food
- totexp: the total expenditure of the household
- age: the age of the reference person in the household
- size: the size of the household
- town: the size of the town where the household is located, categorized into 5 groups (1 is small, 5 is big)
- sex: the sex of the reference person (man or woman)
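To see this organization directly (a minimal check, assuming the Ecdat package is loaded as above), the structure of the raw data frame can be printed:

#inspect the structure of the raw dataset: 23972 observations of 6 variables
str(BudgetFood)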

Randomization

The dataset consists of observational data, so it is unknown whether any randomization was used.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

This experiment is being conducted to determine whether the variation in the percentage of total expenditure which the household has spent on food can be attributed to variation in the size of the household. The null hypothesis for this ANOVA is that the variation in the percentage of total expenditure which the household has spent on food cannot be attributed to anything other than randomization. The alternative hypothesis is that the variation in the percentage of total expenditure which the household has spent on food can be attributed to the size of the household.

What is the rationale for this design?

One may think that most households would spend the same percentage of their total expenditure on food, or one may assume that a larger household would spend a larger percentage on food. This design lets us see whether the size of one's household has an effect on the percentage of total expenditure which the household spent on food.

Randomize: What is the Randomization Scheme?

The dataset consists of observations of Spanish households, so it was not randomized.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures.

Block: Did you use blocking in the design?

There was no blocking used in the design.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Food$size=as.factor(Food$size)
summary(Food)
##      wfood           totexp              age            size     
##  Min.   :0.000   Min.   :   14601   Min.   :16.0   4      :5478  
##  1st Qu.:0.258   1st Qu.:  449796   1st Qu.:38.0   2      :5146  
##  Median :0.364   Median :  731130   Median :50.0   3      :4409  
##  Mean   :0.378   Mean   :  865564   Mean   :50.5   5      :3563  
##  3rd Qu.:0.485   3rd Qu.: 1112640   3rd Qu.:62.0   1      :1945  
##  Max.   :0.997   Max.   :11397547   Max.   :99.0   6      :1899  
##                                                    (Other):1530  
##       town         sex       
##  Min.   :1.00   man  :20623  
##  1st Qu.:2.00   woman: 3347  
##  Median :4.00                
##  Mean   :3.24                
##  3rd Qu.:4.00                
##  Max.   :5.00                
## 
boxplot(wfood~size, data=Food, xlab="Size of Household", ylab="Percentage of expenditure spent on food")
title("Percentage of expenditure spent on food")

(Plot: boxplots of the percentage of expenditure spent on food by household size.) When looking at the boxplots, households with one person have percentages ranging from 0 to 1, meaning that a person who lives alone spends anywhere from none to almost all of their expenditure on food. The medians of the percentage of total expenditure spent on food are around 0.4 for every household size. Interestingly, the range of values is larger for smaller households and narrows as the household size increases. This makes sense: as households get bigger, people probably have to be more careful with money, so they cannot spend a large share on food, yet they still have to spend a fair amount to make sure everyone gets fed.
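The observations about the medians and spreads can be checked numerically; this is a small sketch assuming the Food data frame and the factor conversion above.

#group medians of wfood by household size (all near 0.4 in the boxplots)
with(Food, tapply(wfood, size, median))
#group interquartile ranges of wfood, to quantify the spread seen in the boxplots
with(Food, tapply(wfood, size, IQR))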

Testing

model1=aov(wfood~size, data=Food)
anova(model1)
## Analysis of Variance Table
## 
## Response: wfood
##              Df Sum Sq Mean Sq F value Pr(>F)    
## size         14     16   1.174    43.9 <2e-16 ***
## Residuals 23955    641   0.027                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis for this ANOVA is that the variation in the percentage of total expenditure which the household has spent on food cannot be attributed to anything other than randomization. The alternative hypothesis is that the variation in the percentage of total expenditure which the household has spent on food can be attributed to the size of the household.

From our model we would reject the null hypothesis: the variation in the percentage of total expenditure which the household has spent on food can be attributed to something other than randomization, namely the size of the household. Under the null hypothesis, the probability of obtaining an F value of 43.9 or larger is less than 2e-16.
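As a hedged check (not part of the original code), that p-value can be reproduced directly from the F statistic and its degrees of freedom:

#upper-tail probability of F = 43.9 on (14, 23955) degrees of freedom;
#this matches the p-value reported in the ANOVA table (effectively zero)
pf(43.9, df1 = 14, df2 = 23955, lower.tail = FALSE)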

#Makes an array of the means of the levels
with(Food,tapply(wfood, size, mean))
##      1      2      3      4      5      6      7      8      9     10 
## 0.3903 0.4185 0.3590 0.3529 0.3635 0.3807 0.3937 0.4074 0.4288 0.4285 
##     11     12     13     14     15 
## 0.4616 0.4842 0.5008 0.5796 0.5908
#makes an array of the variances of the levels
with(Food,tapply(wfood, size, var))
##       1       2       3       4       5       6       7       8       9 
## 0.04634 0.03457 0.02505 0.02086 0.02041 0.02034 0.02390 0.02478 0.02094 
##      10      11      12      13      14      15 
## 0.03068 0.03592 0.03131 0.05137 0.06127 0.02391
#makes an array of the lengths (sample sizes) of the levels
with(Food,tapply(wfood, size, length))
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 1945 5146 4409 5478 3563 1899  843  376  182   79   21   16    7    3    3
#summary ANOVA model
summary(aov(wfood~size, data=Food))
##                Df Sum Sq Mean Sq F value Pr(>F)    
## size           14     16   1.174    43.9 <2e-16 ***
## Residuals   23955    641   0.027                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#mean of the percentages (grand mean of wfood)
meanstar=mean(Food$wfood)
#square root of the mean square of the residuals (pooled standard deviation)
sdstar=sqrt(.027)
#group labels taken from the actual data
simsize=Food$size
R=10000 #number of simulation runs
Fstar = numeric(R)
for (i in 1:R) {
  # this loop is a Monte Carlo simulation: under the null model every group is
  # drawn from the same normal distribution, with the pooled mean and standard
  # deviation estimated above (15 groups of 1598 = 23970 values per run)
  groupA = rnorm(1598, mean=meanstar, sd=sdstar)
  groupB = rnorm(1598, mean=meanstar, sd=sdstar)
  groupC = rnorm(1598, mean=meanstar, sd=sdstar)
  groupD = rnorm(1598, mean=meanstar, sd=sdstar)
  groupE = rnorm(1598, mean=meanstar, sd=sdstar)
  groupF = rnorm(1598, mean=meanstar, sd=sdstar)
  groupG = rnorm(1598, mean=meanstar, sd=sdstar)
  groupH = rnorm(1598, mean=meanstar, sd=sdstar)
  groupI = rnorm(1598, mean=meanstar, sd=sdstar)
  groupJ = rnorm(1598, mean=meanstar, sd=sdstar)
  groupK = rnorm(1598, mean=meanstar, sd=sdstar)
  groupL = rnorm(1598, mean=meanstar, sd=sdstar)
  groupM = rnorm(1598, mean=meanstar, sd=sdstar)
  groupN = rnorm(1598, mean=meanstar, sd=sdstar)
  groupO = rnorm(1598, mean=meanstar, sd=sdstar)
  simyield = c(groupA,groupB,groupC,groupD,groupE,groupF, groupG, groupH, groupI, groupJ, groupK, groupL, groupM, groupN, groupO)
  simdata = data.frame(simyield,simsize)
  Fstar[i] = oneway.test(simyield~simsize, var.equal=T, data=simdata)$statistic
}
hist(Fstar,ylim=c(0,1), xlim=c(0,3),prob=T)
x=seq(.25,3,.1)
points(x,y=df(x,14,23955),type="b",col="red")

(Plot: histogram of the simulated F statistics with the theoretical F(14, 23955) density overlaid in red.) The code above performs a Monte Carlo simulation. A Monte Carlo simulation builds models of possible results by repeatedly drawing values from a known distribution and recomputing the statistic of interest. Here we generated normally distributed data with 23970 observations per run and recorded the resulting F statistics. The histogram shows the simulated F distribution alongside the theoretical F distribution with 14 degrees of freedom for size and 23955 degrees of freedom for the residuals. Looking at the histogram, the empirical and analytical distributions have a good fit.
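As a supplementary, quantitative check of that visual fit (an addition, not part of the original analysis), a Kolmogorov-Smirnov test can compare the simulated F statistics to the theoretical F(14, 23955) distribution:

#compare the empirical distribution of Fstar to the theoretical F distribution;
#a large p-value indicates the simulated and analytical distributions agree
ks.test(Fstar, "pf", df1 = 14, df2 = 23955)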

Now the bootstrap version of ANOVA

meanstar = with(Food, tapply(wfood,size,mean))
grpA = Food$wfood[Food$size=="1"] - meanstar[1]
grpB = Food$wfood[Food$size=="2"] - meanstar[2]
grpC = Food$wfood[Food$size=="3"] - meanstar[3]
grpD = Food$wfood[Food$size=="4"] - meanstar[4]
grpE = Food$wfood[Food$size=="5"] - meanstar[5]
grpF = Food$wfood[Food$size=="6"] - meanstar[6]
grpG = Food$wfood[Food$size=="7"] - meanstar[7]
grpH = Food$wfood[Food$size=="8"] - meanstar[8]
grpI = Food$wfood[Food$size=="9"] - meanstar[9]
grpJ = Food$wfood[Food$size=="10"] - meanstar[10]
grpK = Food$wfood[Food$size=="11"] - meanstar[11]
grpL = Food$wfood[Food$size=="12"] - meanstar[12]
grpM = Food$wfood[Food$size=="13"] - meanstar[13]
grpN = Food$wfood[Food$size=="14"] - meanstar[14]
grpO = Food$wfood[Food$size=="15"] - meanstar[15]
simblock = Food$size
R = 10000
Fstar = numeric(R)
for (i in 1:R) {
  groupA = sample(grpA, size=1598, replace=T)
  groupB = sample(grpB, size=1598, replace=T)
  groupC = sample(grpC, size=1598, replace=T)
  groupD = sample(grpD, size=1598, replace=T)
  groupE = sample(grpE, size=1598, replace=T)
  groupF = sample(grpF, size=1598, replace=T)
  groupG = sample(grpG, size=1598, replace=T)
  groupH = sample(grpH, size=1598, replace=T)
  groupI = sample(grpI, size=1598, replace=T)
  groupJ = sample(grpJ, size=1598, replace=T)
  groupK = sample(grpK, size=1598, replace=T)
  groupL = sample(grpL, size=1598, replace=T)
  groupM = sample(grpM, size=1598, replace=T)
  groupN = sample(grpN, size=1598, replace=T)
  groupO = sample(grpO, size=1598, replace=T)
  simyield = c(groupA,groupB,groupC,groupD,groupE,groupF,groupG,groupH,groupI,groupJ,groupK,groupL,groupM,groupN,groupO)
  simdata = data.frame(simyield,simblock)
  Fstar[i] = oneway.test(simyield~simblock, var.equal=T, data=simdata)$statistic
}
hist(Fstar,ylim=c(0,1), xlim=c(0,3),prob=T)
x=seq(.25,3,.1)
points(x,y=df(x,14,23955),type="b",col="red")

(Plot: histogram of the bootstrap F statistics with the theoretical F(14, 23955) density overlaid in red.)

print(realFstar<-oneway.test(wfood~size, var.equal=T, data=Food)$statistic)
##     F 
## 43.86
mean(Fstar>=realFstar)
## [1] 0
qf(.95,14,23955)
## [1] 1.692
quantile(Fstar,.95)
##   95% 
## 1.644

The bootstrap method uses random sampling with replacement. It estimates properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. Looking at the bootstrap histogram, the analytical and empirical distributions again have a good fit.
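For reference, the fifteen centering and resampling lines above can be written more compactly with split() and lapply(); this is only a sketch of the same logic, not the code that produced the results shown.

#center wfood within each size group, then resample each group's residuals
centered <- with(Food, split(wfood - ave(wfood, size), size))
boot_F <- function() {
  simyield <- unlist(lapply(centered, sample, size = 1598, replace = TRUE))
  simsize <- factor(rep(names(centered), each = 1598))
  #F statistic for one bootstrap replicate under the null of equal group means
  oneway.test(simyield ~ simsize, var.equal = TRUE)$statistic
}
boot_F()  #one bootstrap F statistic; replicate(10000, boot_F()) gives the full set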

ANOVA of Resampled Data

model2=aov(simyield~simblock, data=simdata)
anova(model2)
## Analysis of Variance Table
## 
## Response: simyield
##              Df Sum Sq Mean Sq F value Pr(>F)
## simblock     14      0  0.0234    0.81   0.66
## Residuals 23955    691  0.0289

Looking at the ANOVA of the resampled data (the last dataset generated in the bootstrap loop), the F value is 0.81 with a p-value of 0.66; the exact values change from run to run since the data are resampled. With such a large p-value, we fail to reject the null hypothesis: the variation in the resampled percentages cannot be attributed to anything other than randomization, as expected since the group means were subtracted out before resampling.

Diagnostics/Model Adequacy Checking

qqnorm(residuals(model1))
qqline(residuals(model1))

(Plot: normal Q-Q plot of the residuals of model1.)

plot(fitted(model1), residuals(model1))

(Plot: residuals of model1 versus fitted values.)

qqnorm(residuals(model2))
qqline(residuals(model2))

(Plot: normal Q-Q plot of the residuals of model2.)

plot(fitted(model2), residuals(model2))

(Plot: residuals of model2 versus fitted values.) A Q-Q plot can be used to compare the shape of the distribution of the residuals to a normal distribution. The Q-Q plot and Q-Q line of the residuals appear roughly normal; however, at each of the ends the residuals stray from the line. When looking at the plot of the fitted values against the residuals, a good fit would show points symmetric about zero with roughly constant variance across the fitted values. The points do seem to be symmetric about zero, but the points on the left have higher variation and the spread decreases towards the right of the plot.
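One formal check of the constant-variance assumption discussed above (an addition, not part of the original analysis) is Bartlett's test from base R; note that it assumes normal residuals, so its result should be read alongside the Q-Q plots.

#Bartlett's test of homogeneity of variances of wfood across household sizes;
#a small p-value suggests the group variances are not all equal
bartlett.test(wfood ~ size, data = Food)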

4. Contingencies

Since the dataset consists entirely of observations of homes in Spain, it is possible that the values arose by chance and that there was something unusual about the year the observations were made. There may also be other nuisance factors that were not recorded that year. In addition, when looking at the boxplots of the variables we were testing, the distributions were not normal. The Kruskal-Wallis test does not assume a normal distribution of the residuals and can be used to test whether the variation can be attributed to anything other than randomization without assuming normality.

kruskal.test(wfood~size, data=Food)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  wfood by size
## Kruskal-Wallis chi-squared = 533.9, df = 14, p-value < 2.2e-16

Looking at the results of the Kruskal-Wallis test, we would again reject the null hypothesis: the variation in the percentage of total expenditure which the household has spent on food can be attributed to the size of the household.

5. References to the literature

http://jae.wiley.com/jae/

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code