Rensselaer Polytechnic Institute

1.Setting

library(Ecdat)
project2 <- Tobacco

System Under Test

The data analyzed in the recipe analyzes the houeholds tobacco budget share. The study consists of 2724 observations and 9 variables. A cross-section from 1995-1996.

Categoical IV:

  • Occuptaion: bluecol(Blue Collar), whitecol(White Collar), inactself(inactive and self-emloyed)
  • Region : flanders, wallon, brussels (regions in Beligum)

Continuous DV:

  • lnx: : log of total expenditures

A dataframe containing:

  • occupation : a factor with levels (bluecol, whitecol, inactself), the last level being inactive and self-employed

  • region : a factor with levels (flanders, wallon, brussels)

  • nkids : number of kids of more than two years old

  • nadults: number of adults in household

  • lnx : log of total expenditures

  • stobacco : budgetshare of tobacco

  • salcohol : budgetshare of alcohol

  • age : age in brackes (0-4)

First 6 rows of Tobacco data:

head(project2)
##   occupation   region nkids nkids2 nadults      lnx stobacco  salcohol age
## 1    bluecol flanders     1      0       2 14.19054        0 0.0000000   2
## 2  inactself flanders     0      0       3 13.90857        0 0.0022851   3
## 3   whitecol flanders     0      0       1 13.97461        0 0.0128755   2
## 4    bluecol flanders     1      0       2 13.76281        0 0.0059072   2
## 5  inactself flanders     2      0       1 13.80800        0 0.0219805   2
## 6   whitecol flanders     3      0       2 14.00313        0 0.0166913   2

Summary of the Tobacco dataframe:

summary(project2)
##      occupation        region         nkids            nkids2       
##  inactself:1417   brussels: 454   Min.   :0.0000   Min.   :0.00000  
##  bluecol  : 400   flanders:1231   1st Qu.:0.0000   1st Qu.:0.00000  
##  whitecol : 907   walloon :1039   Median :0.0000   Median :0.00000  
##                                   Mean   :0.5646   Mean   :0.04479  
##                                   3rd Qu.:1.0000   3rd Qu.:0.00000  
##                                   Max.   :5.0000   Max.   :2.00000  
##     nadults          lnx           stobacco          salcohol       
##  Min.   :1.00   Min.   :11.76   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:1.00   1st Qu.:13.41   1st Qu.:0.00000   1st Qu.:0.002906  
##  Median :2.00   Median :13.76   Median :0.00000   Median :0.010898  
##  Mean   :1.97   Mean   :13.73   Mean   :0.01224   Mean   :0.017828  
##  3rd Qu.:2.00   3rd Qu.:14.06   3rd Qu.:0.01381   3rd Qu.:0.024244  
##  Max.   :7.00   Max.   :15.33   Max.   :0.19276   Max.   :0.211124  
##       age       
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :2.408  
##  3rd Qu.:4.000  
##  Max.   :4.000

Factors and Levels

Factors and Levels of each factor is listed below:

str(Tobacco)
## 'data.frame':    2724 obs. of  9 variables:
##  $ occupation: Factor w/ 3 levels "inactself","bluecol",..: 2 1 3 2 1 3 1 2 3 1 ...
##  $ region    : Factor w/ 3 levels "brussels","flanders",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ nkids     : int  1 0 0 1 2 3 0 1 0 0 ...
##  $ nkids2    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nadults   : int  2 3 1 2 1 2 2 2 2 2 ...
##  $ lnx       : num  14.2 13.9 14 13.8 13.8 ...
##  $ stobacco  : num  0 0 0 0 0 ...
##  $ salcohol  : num  0 0.00229 0.01288 0.00591 0.02198 ...
##  $ age       : int  2 3 2 2 2 2 4 0 1 4 ...

The 3 factors being studied include:

  • Occuptaion: bluecol(Blue Collar), whitecol(White Collar), inactself(inactive and self-emloyed)
  • Region : flanders, wallon, brussels (regions in Beligum)
  • nadult : number of adults in household
project2$nadults <-as.factor(project2$nadults)
str(project2$nadults)
##  Factor w/ 7 levels "1","2","3","4",..: 2 3 1 2 1 2 2 2 2 2 ...
levels(project2$nadults)
## [1] "1" "2" "3" "4" "5" "6" "7"
levels(Tobacco$occupation)
## [1] "inactself" "bluecol"   "whitecol"
levels(Tobacco$region)
## [1] "brussels" "flanders" "walloon"

Continuous Variables

There are 3 continuous variables:

  • lnx : log of total expenditures
  • stobacco : budgetshare of tobacco
  • salcohol : budgetshare of alcohol

Reponse Variable

The response variable is lnx the log of total expenditures

The Data: How is it organized and what does it look like?

The study contains tobacco information from 2724 Belgian households and 9 variables, taken from the Belgian household budget survey of 1995/1996. The data are supplied by the National Institute of Statistics, Belgium. A new dataset is created by subsetting the data using the 2 factors discussed above.

myvars <- c("lnx", "occupation", "region")
newdata <- project2[myvars]
head(newdata)
##        lnx occupation   region
## 1 14.19054    bluecol flanders
## 2 13.90857  inactself flanders
## 3 13.97461   whitecol flanders
## 4 13.76281    bluecol flanders
## 5 13.80800  inactself flanders
## 6 14.00313   whitecol flanders
str(newdata)
## 'data.frame':    2724 obs. of  3 variables:
##  $ lnx       : num  14.2 13.9 14 13.8 13.8 ...
##  $ occupation: Factor w/ 3 levels "inactself","bluecol",..: 2 1 3 2 1 3 1 2 3 1 ...
##  $ region    : Factor w/ 3 levels "brussels","flanders",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(newdata)
##       lnx            occupation        region    
##  Min.   :11.76   inactself:1417   brussels: 454  
##  1st Qu.:13.41   bluecol  : 400   flanders:1231  
##  Median :13.76   whitecol : 907   walloon :1039  
##  Mean   :13.73                                   
##  3rd Qu.:14.06                                   
##  Max.   :15.33

2. Experimental Design

This experiment will test whether or not occupation of the head of the household affects the Houshold Budgetshare of Tobacco, blocked on the region of the Belgian households. An ANOVA test will be performed to determine the F statistic and the associated p-value for this factor. Then, two alternative methods to null hypothesis statistical testing are use to analyze these affects, the multiple models and confidence intervals.

Selection of Sample Size

  • Type 1 error Alpha = 0.05
  • Type 2 error Beta = 0.05
  • Power = 1 - Beta = 0.95

Below was used to calculate the effect size:

mean_bluecol <- mean(Tobacco$lnx[Tobacco$occupation == "bluecol"])
mean_whitecol <- mean(Tobacco$lnx[Tobacco$occupation == "whitecol"])
mean_inactself <- mean(Tobacco$lnx[Tobacco$occupation == "inactself"])
MEAN_TOTAL <- mean(Tobacco$lnx)

n_bluecol <- length(Tobacco$lnx[Tobacco$occupation == "bluecol"])
n_whitecol <- length(Tobacco$lnx[Tobacco$occupation == "whitecol"])
n_inactself <- length(Tobacco$lnx[Tobacco$occupation == "inactself"])

N <- length(Tobacco$lnx)

a <- (n_bluecol/N) * (mean_bluecol - MEAN_TOTAL) ^ 2
b <- (n_whitecol/N) * (mean_whitecol - MEAN_TOTAL) ^ 2
c <- (n_inactself/N) * (mean_inactself - MEAN_TOTAL) ^ 2

variation = var(Tobacco$lnx)
f_square = (a+b+c)/ variation
f <- sqrt(f_square)
print(f)
## [1] 0.3344385

Calculation of required sample size using G*Power to achieve the desired power.

G*Power

G*Power

Required Sample Size = 122

N = 122
set.seed (300)
Random_Tobacco = Tobacco[sample(nrow(Tobacco),N),]

What is the rationale for this design?

The rationale for this design is to use blocking on our recipe and use non-NHST methods to test and verify our results.

Randomize: What is the Randomization Scheme?

The data can be assumed to be randomly collected since it is part of Ecdat data package.

Replicate: Are there replicates and/or repeated measures?

The data was generated by selecting unique Belgian households. There is no mention of any repeated measures or replicates of this study.

Block: Did you use blocking in the design?

In this experiment, the data will be blocked on if household Budgetshare of tobbaco is greater in a certain region of Belgium. Region variable has 3 levels (flanders, wallon, brussels)

3. Statistical Analysis

Exploratory Data Analysis

Below is a graphic and descriptive summary of this recipe. The historgram represents the frequency of our response variable, lnx, the log of total expenditures.

##      occupation        region         nkids            nkids2       
##  inactself:1417   brussels: 454   Min.   :0.0000   Min.   :0.00000  
##  bluecol  : 400   flanders:1231   1st Qu.:0.0000   1st Qu.:0.00000  
##  whitecol : 907   walloon :1039   Median :0.0000   Median :0.00000  
##                                   Mean   :0.5646   Mean   :0.04479  
##                                   3rd Qu.:1.0000   3rd Qu.:0.00000  
##                                   Max.   :5.0000   Max.   :2.00000  
##                                                                     
##  nadults       lnx           stobacco          salcohol       
##  1: 706   Min.   :11.76   Min.   :0.00000   Min.   :0.000000  
##  2:1567   1st Qu.:13.41   1st Qu.:0.00000   1st Qu.:0.002906  
##  3: 316   Median :13.76   Median :0.00000   Median :0.010898  
##  4: 108   Mean   :13.73   Mean   :0.01224   Mean   :0.017828  
##  5:  18   3rd Qu.:14.06   3rd Qu.:0.01381   3rd Qu.:0.024244  
##  6:   8   Max.   :15.33   Max.   :0.19276   Max.   :0.211124  
##  7:   1                                                       
##       age       
##  Min.   :0.000  
##  1st Qu.:1.000  
##  Median :2.000  
##  Mean   :2.408  
##  3rd Qu.:4.000  
##  Max.   :4.000  
## 

Note: According to Working-Leser model, each share of food item (ie: tobacco) is simply a linear function of the log of prices and of the total expenditure on all the food items under consideration. Hence, the log of total expenditure representing Tobacco expenditure calculated by each share of tobacco by each Belgian Household.

According to the historgram, there is long left tail and short right tail. The distribute is skewed left which means that the mean is typically less than the median.

ANOVA - Testing Random Model (at random, N experimental cases)

Goal: To perform ANOVA and determine the F-statistic and the associated p-value for factor.

model<- aov(Random_Tobacco$lnx ~ Random_Tobacco$occupation + Random_Tobacco$region)
summary(model)
##                            Df Sum Sq Mean Sq F value   Pr(>F)    
## Random_Tobacco$occupation   2  4.017  2.0085  10.984 4.25e-05 ***
## Random_Tobacco$region       2  0.038  0.0190   0.104    0.901    
## Residuals                 117 21.394  0.1829                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis was that there is no relationship between the occupation in regards to the log Tobacco expenditure. Data contining “region” variable would be ignored because it is used as a blocking variable. From the ANOVA results, the F-statistic was 10.984 and the p-value was 0.0000425. The alpha value was set to 0.05. The p-value was much less than the alpha value and the F value was not close to 0.Therefore, since the p-value < alpha, we reject the null hypothesis, and the result was statistically significant.

Model Adequacy Checking

qqnorm(residuals(model))
qqline(residuals(model))

From the QQ plot, the residuals form a linear line. Hence, it appears that it is a sample from a normal distribution.

plot(fitted(model), residuals(model))

This residual model shows a random pattern, indicating a good fit for a linear model.

4. Alternatives to Null Hypothesis Statistical Testing

Multiple Models

model1 = lm(Tobacco$lnx ~ Tobacco$occupation * Tobacco$region)
qqnorm(residuals(model1))

plot(fitted(model1), residuals(model1))

model2 = lm(Tobacco$lnx ~ (Tobacco$occupation)^2 * Tobacco$region)
qqnorm(residuals(model2))

plot(fitted(model2), residuals(model2))

model3 = lm(Tobacco$lnx ~ ((Tobacco$region)^2) * Tobacco$occupation)
qqnorm(residuals(model3))

plot(fitted(model3), residuals(model3))

model4 = lm(Tobacco$lnx ~ ((Tobacco$occupation)^2) * (Tobacco$region)^2)
qqnorm(residuals(model4))

plot(fitted(model4), residuals(model4))

One of the assumptions for regression analysis is that the residuals are normally distributed. Four of the plots are normally distributed. Therefore, we can trust the results of this regression analysis.

Estimation

N = 122
set.seed (300)
Random_Tobacco = Tobacco[sample(nrow(Tobacco),N),]
model_rand = lm(lnx~occupation+region,data=Random_Tobacco)
coef(model_rand)
##        (Intercept)  occupationbluecol occupationwhitecol 
##       13.539561827        0.254456659        0.401236925 
##     regionflanders      regionwalloon 
##        0.041191185        0.006432225