Analysis of covariance

Hongyu Chen

RPI, RIN:661405156

Oct.27 V1.0

1. Setting

System under test

This analysis uses data on the budget share of food for Spanish households, taken from the ‘BudgetFood’ data set in the ‘Ecdat’ package. The aim is to investigate how the size of the town in which a household lives affects the percentage of total expenditure spent on food, with the age of the reference person and the total expenditure of the household as explanatory variables.

The dataset is loaded and examined below:

library("Ecdat")
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
data<-BudgetFood
head(data)
##    wfood  totexp age size town   sex
## 1 0.4677 1290941  43    5    2   man
## 2 0.3130 1277978  40    3    2   man
## 3 0.3765  845852  28    3    2   man
## 4 0.4397  527698  60    1    2 woman
## 5 0.4036 1103220  37    5    2   man
## 6 0.1993 1768128  35    4    2   man
summary(data)
##      wfood           totexp              age            size     
##  Min.   :0.000   Min.   :   14601   Min.   :16.0   Min.   : 1.0  
##  1st Qu.:0.258   1st Qu.:  449820   1st Qu.:38.0   1st Qu.: 2.0  
##  Median :0.364   Median :  731114   Median :50.0   Median : 4.0  
##  Mean   :0.378   Mean   :  865550   Mean   :50.5   Mean   : 3.7  
##  3rd Qu.:0.485   3rd Qu.: 1112533   3rd Qu.:62.0   3rd Qu.: 5.0  
##  Max.   :0.997   Max.   :11397547   Max.   :99.0   Max.   :37.0  
##       town         sex       
##  Min.   :1.00   man  :20624  
##  1st Qu.:2.00   woman: 3347  
##  Median :4.00   NA's :    1  
##  Mean   :3.24                
##  3rd Qu.:4.00                
##  Max.   :5.00

Factors and Levels

In this study we focus on the group of households whose total expenditure is relatively small.

The factor of interest is the size of the town where the household is located. It is categorised into 5 groups, from 1 for small towns to 5 for big ones. There are two explanatory variables: the age of the reference person of the household and total expenditure.

The sample data are selected below.

# factors and levels setting
P<-data$totexp<=50000
sample<-data[P,]
sample
##         wfood totexp age size town   sex
## 556   0.45383  39072  85    1    2 woman
## 1353  0.63494  47992  73    1    3   man
## 1696  0.35556  23400  77    1    1 woman
## 2093  0.39792  24176  66    2    3   man
## 2767  0.67126  46480  88    1    4 woman
## 4286  0.54171  48764  78    1    1 woman
## 4616  0.00000  22840  75    1    3 woman
## 4918  0.00000  46800  72    1    4 woman
## 5794  0.00000  14601  53    1    3 woman
## 6178  0.17825  29173  63    1    1 woman
## 6356  0.57179  49564  74    2    3   man
## 7333  0.48296  47052  82    1    1 woman
## 8090  0.53947  47424  78    2    2   man
## 8210  0.63381  30602  74    1    2 woman
## 8213  0.40000  29900  70    3    2 woman
## 9009  0.75801  34712  69    1    4 woman
## 9030  0.16663  20284  72    1    3 woman
## 9851  0.63925  28064  88    1    4   man
## 10535 0.05772  16216  84    2    2 woman
## 10569 0.37705  37788  67    2    1 woman
## 10704 0.44583  27176  73    1    4   man
## 11748 0.79393  44014  64    1    2   man
## 11789 0.77432  23236  78    1    4 woman
## 11960 0.70719  37868  82    1    1 woman
## 13540 0.23854  38584  76    1    4 woman
## 13564 0.57769  27184  86    1    4 woman
## 15041 0.61867  39336  73    1    2 woman
## 15105 0.08816  25364  71    1    4 woman
## 15347 0.49239  46256  67    2    4 woman
## 16382 0.00000  40560  65    3    2   man
## 17658 0.00000  22786  66    1    4 woman
## 19123 0.22862  41169  70    1    5 woman
## 20329 0.58266  40339  76    1    1 woman
## 20583 0.64191  45770  72    1    1 woman
## 22802 0.37539  41695  73    1    1 woman
## 23175 0.51520  47236  80    1    1 woman
## 23263 0.84146  42640  78    1    1   man
## 23869 0.80620  31476  77    1    4 woman
str(sample)
## 'data.frame':    38 obs. of  6 variables:
##  $ wfood : num  0.454 0.635 0.356 0.398 0.671 ...
##  $ totexp: num  39072 47992 23400 24176 46480 ...
##  $ age   : num  85 73 77 66 88 78 75 72 53 63 ...
##  $ size  : num  1 1 1 2 1 1 1 1 1 1 ...
##  $ town  : num  2 3 1 3 4 1 3 4 3 1 ...
##  $ sex   : Factor w/ 2 levels "man","woman": 2 1 2 1 2 2 2 2 2 2 ...

Continuous variables (if any)

There are several continuous variables in this test: the percentage of expenditure spent on food, total expenditure and the age of the reference person.

Response variables

The percentage of a household's total expenditure spent on food is the response variable. It is ‘wfood’ in the data set, recorded as a proportion between 0 and 1.

The Data: How is it organized and what does it look like?

The data frame contains columns for the percentage of expenditure on food, total expenditure, age of the reference person, size of the household, size of the town the household lives in, and sex of the reference person.

This analysis is interested in how the size of the town people live in affects the percentage of expenditure on food. The two explanatory variables are age and total expenditure.

Randomization

The survey was conducted among 23,972 households in Spain. From the information provided, it is reasonable to assume that all factors, including age, household size and town size, are completely randomized.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

We select one factor of interest, ‘town’, and two explanatory variables, ‘age’ and ‘totexp’. We then test whether the size of the town people live in influences the percentage of their expenditure spent on food, adjusting for the explanatory variables through an analysis of covariance.

Null hypothesis H0: the variation in the percentage of expenditure spent on food is due only to randomization.

Alternative hypothesis H1: the variation in the percentage of expenditure spent on food is due to something other than randomization.

What is the rationale for this design?

Engel’s law is an observation in economics stating that as income rises, the proportion of income spent on food falls. In other words, the income elasticity of demand of food is between 0 and 1. Engel’s coefficient is a significant parameter to evaluate quality of life.
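The link between Engel's law and the elasticity statement can be made explicit with a short derivation. This is a sketch under the assumption that food expenditure F is a differentiable function of income Y. The symbols F, Y, w and η are introduced here for illustration and do not appear elsewhere in this report.

```latex
\begin{align*}
w(Y) &= \frac{F(Y)}{Y}, \qquad \eta = \frac{dF}{dY}\,\frac{Y}{F}, \\
\frac{dw}{dY} &= \frac{1}{Y}\,\frac{dF}{dY} - \frac{F}{Y^{2}}
             = \frac{F}{Y^{2}}\,(\eta - 1).
\end{align*}
```

The food share w therefore falls as income rises exactly when η < 1, while food expenditure itself still rises as long as η > 0. In this report total expenditure (‘totexp’) stands in for income.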

Using analysis of covariance, we can further investigate how a factor affects the response variable while accounting for the explanatory variables.

Randomize: What is the Randomization Scheme?

Randomization depends on the way the data were collected, which is assumed to be random.

Replicate: Are there replicates and/or repeated measures?

There are no replicates or repeated measures in this test.

Block: Did you use blocking in the design?

We set a restriction on total expenditure: only households with relatively low expenditure, at most 50,000, are investigated.

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

#Histogram plot
par(mfrow=c(1,3))
for (i in 1:3) hist(sample[,i],main = names(sample)[i])

plot of chunk unnamed-chunk-3

#Boxplot
boxplot(wfood~town,data=sample,xlab="size of town",ylab="percentage of expenditure on food")

plot of chunk unnamed-chunk-3

The boxplot shows how the size of town relates to the percentage of expenditure spent on food. There are visible differences in the median of ‘wfood’ across town sizes, which may indicate that part of the variation in the percentage of expenditure spent on food is due to size of town.

#Plot each variable interested against others
plot(sample[,c(1,2,3,5)])

plot of chunk unnamed-chunk-4

plot(wfood~age,pch=as.character(town),data=sample)

plot of chunk unnamed-chunk-4

plot(wfood~totexp,pch=as.character(town),data=sample)

plot of chunk unnamed-chunk-4

In the plots of each variable against the others, with the percentage of expenditure spent on food as the dependent variable, it is hard to tell whether its relationship with the independent variables is positive or negative. The points are widely scattered.

In the plots of ‘wfood’ against ‘age’ and ‘wfood’ against ‘totexp’, the effect of town size is also not clear.

Testing

ANCOVA:

# Analysis of covariance 
sample$town=as.factor(sample$town)
sampleaov=lm(wfood~totexp+age+town, data=sample)
anova(sampleaov)
## Analysis of Variance Table
## 
## Response: wfood
##           Df Sum Sq Mean Sq F value Pr(>F)   
## totexp     1  0.434   0.434    7.69 0.0093 **
## age        1  0.273   0.273    4.84 0.0354 * 
## town       4  0.061   0.015    0.27 0.8936   
## Residuals 31  1.750   0.056                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(wfood~town,data=sample))
## Analysis of Variance Table
## 
## Response: wfood
##           Df Sum Sq Mean Sq F value Pr(>F)
## town       4  0.226  0.0566    0.81   0.53
## Residuals 33  2.292  0.0695

The ANCOVA table above shows that the P-values of both total expenditure and age of the reference person are less than 0.05, which indicates that these two explanatory variables are associated with variation in the percentage of expenditure spent on food.

By contrast, the factor we are interested in, namely the size of town, has a P-value (0.89) much larger than 0.05. Therefore we fail to reject the null hypothesis for town size: after adjusting for total expenditure and age, the data give no evidence that town size affects the percentage of expenditure spent on food. The one-way ANOVA without the explanatory variables (P-value 0.53) leads to the same conclusion.
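As a cross-check on the ‘town’ row of the ANCOVA table, the same adjusted test can be written as a comparison of two nested models, one without and one with ‘town’. This is a minimal sketch: the names `low`, `reduced` and `full` are introduced here for illustration, and the 38 rows are transcribed from the listing in section 1 (with ‘wfood’ at the printed five-decimal precision), so the result should agree with the table above only up to rounding.

```r
# The 38 low-expenditure households, transcribed from the listing in section 1.
low <- data.frame(
  wfood = c(0.45383, 0.63494, 0.35556, 0.39792, 0.67126, 0.54171, 0.00000, 0.00000,
            0.00000, 0.17825, 0.57179, 0.48296, 0.53947, 0.63381, 0.40000, 0.75801,
            0.16663, 0.63925, 0.05772, 0.37705, 0.44583, 0.79393, 0.77432, 0.70719,
            0.23854, 0.57769, 0.61867, 0.08816, 0.49239, 0.00000, 0.00000, 0.22862,
            0.58266, 0.64191, 0.37539, 0.51520, 0.84146, 0.80620),
  totexp = c(39072, 47992, 23400, 24176, 46480, 48764, 22840, 46800,
             14601, 29173, 49564, 47052, 47424, 30602, 29900, 34712,
             20284, 28064, 16216, 37788, 27176, 44014, 23236, 37868,
             38584, 27184, 39336, 25364, 46256, 40560, 22786, 41169,
             40339, 45770, 41695, 47236, 42640, 31476),
  age = c(85, 73, 77, 66, 88, 78, 75, 72, 53, 63, 74, 82, 78, 74, 70, 69, 72, 88, 84,
          67, 73, 64, 78, 82, 76, 86, 73, 71, 67, 65, 66, 70, 76, 72, 73, 80, 78, 77),
  town = factor(c(2, 3, 1, 3, 4, 1, 3, 4, 3, 1, 3, 1, 2, 2, 2, 4, 3, 4, 2,
                  1, 4, 2, 4, 1, 4, 4, 2, 4, 4, 2, 4, 5, 1, 1, 1, 1, 1, 4))
)

# Model with the two covariates only, and model that adds the town factor.
reduced <- lm(wfood ~ totexp + age, data = low)
full    <- lm(wfood ~ totexp + age + town, data = low)

# Partial F-test: does town explain anything beyond totexp and age?
cmp <- anova(reduced, full)
print(cmp)
```

Because ‘town’ enters last in the sequential table, its row there and this partial F-test address the same question: does town size explain anything once total expenditure and age are already in the model?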

Estimation (of Parameters)

# Summary of linear model
sampleaov=lm(wfood~totexp+age+town, data=sample)
summary(sampleaov)
## 
## Call:
## lm(formula = wfood ~ totexp + age + town, data = sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5422 -0.1201  0.0034  0.1268  0.3867 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -6.37e-01   4.38e-01   -1.45    0.157  
## totexp       9.18e-06   4.13e-06    2.22    0.034 *
## age          1.03e-02   5.61e-03    1.84    0.076 .
## town2       -2.08e-02   1.12e-01   -0.19    0.853  
## town3       -5.33e-02   1.31e-01   -0.41    0.687  
## town4        6.03e-03   1.03e-01    0.06    0.954  
## town5       -2.35e-01   2.50e-01   -0.94    0.354  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.238 on 31 degrees of freedom
## Multiple R-squared:  0.305,  Adjusted R-squared:  0.171 
## F-statistic: 2.27 on 6 and 31 DF,  p-value: 0.0624
confint(sampleaov)
##                  2.5 %    97.5 %
## (Intercept) -1.531e+00 0.2574579
## totexp       7.604e-07 0.0000176
## age         -1.126e-03 0.0217705
## town2       -2.489e-01 0.2072363
## town3       -3.209e-01 0.2143248
## town4       -2.050e-01 0.2170481
## town5       -7.451e-01 0.2745875

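In the coefficient estimates, only total expenditure has a P-value smaller than 0.05. The P-value of age is 0.076, and the 95% confidence intervals of all town coefficients include zero.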
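The ‘totexp’ coefficient is small only because expenditure is measured in single currency units. The arithmetic below rescales it to a larger step, using the estimate and confidence limits copied from the output above. The step of 10,000 is a hypothetical increment chosen for illustration, not a value used in the analysis.

```r
# Estimate and 95% confidence limits for totexp, copied from summary() and confint().
slope_totexp <- 9.18e-06
ci_totexp    <- c(lower = 7.604e-07, upper = 1.76e-05)

# Hypothetical increment in total expenditure, chosen only for illustration.
step <- 10000

# Change in the food share (a proportion) associated with one step.
effect    <- slope_totexp * step
effect_ci <- ci_totexp * step

print(effect)
print(effect_ci)
```

Within this low-expenditure sample, an extra 10,000 in total expenditure is associated with a food share higher by roughly 0.09, that is, about 9 percentage points, with a wide interval of roughly 0.008 to 0.18.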

Diagnostics/Model Adequacy Checking

#Q-Q norm
qqnorm(residuals(sampleaov))
qqline(residuals(sampleaov))

plot of chunk unnamed-chunk-7

#Plot of fitted and residuals 
plot(fitted(sampleaov),residuals(sampleaov))

plot of chunk unnamed-chunk-7

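The normal Q-Q plot shows an approximately linear pattern, and in the plot of residuals against fitted values the points are evenly distributed on each side of zero. This suggests that the normality and constant-variance assumptions of the model are reasonable.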
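The visual check of normality can be complemented with a Shapiro-Wilk test on the residuals. This is a minimal sketch, not part of the original analysis: the names `low`, `fit`, `res` and `sw` are introduced here for illustration, and the 38 rows are transcribed from the listing in section 1 at the printed precision.

```r
# The 38 low-expenditure households, transcribed from the listing in section 1.
low <- data.frame(
  wfood = c(0.45383, 0.63494, 0.35556, 0.39792, 0.67126, 0.54171, 0.00000, 0.00000,
            0.00000, 0.17825, 0.57179, 0.48296, 0.53947, 0.63381, 0.40000, 0.75801,
            0.16663, 0.63925, 0.05772, 0.37705, 0.44583, 0.79393, 0.77432, 0.70719,
            0.23854, 0.57769, 0.61867, 0.08816, 0.49239, 0.00000, 0.00000, 0.22862,
            0.58266, 0.64191, 0.37539, 0.51520, 0.84146, 0.80620),
  totexp = c(39072, 47992, 23400, 24176, 46480, 48764, 22840, 46800,
             14601, 29173, 49564, 47052, 47424, 30602, 29900, 34712,
             20284, 28064, 16216, 37788, 27176, 44014, 23236, 37868,
             38584, 27184, 39336, 25364, 46256, 40560, 22786, 41169,
             40339, 45770, 41695, 47236, 42640, 31476),
  age = c(85, 73, 77, 66, 88, 78, 75, 72, 53, 63, 74, 82, 78, 74, 70, 69, 72, 88, 84,
          67, 73, 64, 78, 82, 76, 86, 73, 71, 67, 65, 66, 70, 76, 72, 73, 80, 78, 77),
  town = factor(c(2, 3, 1, 3, 4, 1, 3, 4, 3, 1, 3, 1, 2, 2, 2, 4, 3, 4, 2,
                  1, 4, 2, 4, 1, 4, 4, 2, 4, 4, 2, 4, 5, 1, 1, 1, 1, 1, 4))
)

# Same model formula as in the report.
fit <- lm(wfood ~ totexp + age + town, data = low)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value would cast doubt on normal residuals.
sw <- shapiro.test(res)
print(sw)

# Count of negative, zero and positive residuals.
print(table(sign(res)))
```

With only 38 observations the test has limited power, so it is best read together with the Q-Q plot rather than instead of it.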

#Total expenditure with different town size
plot(wfood~totexp,pch=unclass(town),data=sample)
#Town 5 has a single household in this sample, so lines are fitted for towns 1 to 4 only
for (i in 1:4) abline(lm(wfood~totexp,data=sample[sample$town==i,]))

plot of chunk unnamed-chunk-8

#Age of reference person with different town size
plot(wfood~age,pch=unclass(town),data=sample)
#Town 5 has a single household in this sample, so lines are fitted for towns 1 to 4 only
for (i in 1:4) abline(lm(wfood~age,data=sample[sample$town==i,]))

plot of chunk unnamed-chunk-8

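In neither of the plots above are the fitted lines parallel across town sizes. ANCOVA assumes that the slope of the response on each explanatory variable is the same in every group, so this assumption may not hold here, although each line is fitted to only a few households.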
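Whether the slopes really differ by town size can also be examined formally by adding town-by-covariate interaction terms and comparing with the additive model. This is a minimal sketch, not part of the original analysis: the names `low`, `low4`, `additive`, `interact` and `slopes` are introduced here for illustration, the rows are transcribed from the listing in section 1 at the printed precision, and town 5 is dropped because a slope cannot be estimated from its single household.

```r
# The 38 low-expenditure households, transcribed from the listing in section 1.
low <- data.frame(
  wfood = c(0.45383, 0.63494, 0.35556, 0.39792, 0.67126, 0.54171, 0.00000, 0.00000,
            0.00000, 0.17825, 0.57179, 0.48296, 0.53947, 0.63381, 0.40000, 0.75801,
            0.16663, 0.63925, 0.05772, 0.37705, 0.44583, 0.79393, 0.77432, 0.70719,
            0.23854, 0.57769, 0.61867, 0.08816, 0.49239, 0.00000, 0.00000, 0.22862,
            0.58266, 0.64191, 0.37539, 0.51520, 0.84146, 0.80620),
  totexp = c(39072, 47992, 23400, 24176, 46480, 48764, 22840, 46800,
             14601, 29173, 49564, 47052, 47424, 30602, 29900, 34712,
             20284, 28064, 16216, 37788, 27176, 44014, 23236, 37868,
             38584, 27184, 39336, 25364, 46256, 40560, 22786, 41169,
             40339, 45770, 41695, 47236, 42640, 31476),
  age = c(85, 73, 77, 66, 88, 78, 75, 72, 53, 63, 74, 82, 78, 74, 70, 69, 72, 88, 84,
          67, 73, 64, 78, 82, 76, 86, 73, 71, 67, 65, 66, 70, 76, 72, 73, 80, 78, 77),
  town = factor(c(2, 3, 1, 3, 4, 1, 3, 4, 3, 1, 3, 1, 2, 2, 2, 4, 3, 4, 2,
                  1, 4, 2, 4, 1, 4, 4, 2, 4, 4, 2, 4, 5, 1, 1, 1, 1, 1, 4))
)

# Drop town 5 (one household) and its now-empty factor level.
low4 <- droplevels(subset(low, town != "5"))

# Common slopes for every town versus separate slopes per town.
additive <- lm(wfood ~ totexp + age + town, data = low4)
interact <- lm(wfood ~ (totexp + age) * town, data = low4)

# F-test of the six interaction terms (2 covariates x 3 non-reference towns).
slopes <- anova(additive, interact)
print(slopes)
```

A large p-value in this comparison would mean the apparent lack of parallelism is within what sampling noise could produce, and a small one would argue against the common-slope ANCOVA model.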

4. References to the literature

https://www2.bc.edu/~lewbel/CollectiveEngel.pdf

https://en.wikipedia.org/wiki/Engel%27s_law

5. Appendices

The raw data are the ‘BudgetFood’ data set in the R package ‘Ecdat’. The 38 households analysed here are listed in full in section 1.