Rensselaer Polytechnic Institute
library(Ecdat)
project2 <- Tobacco
The data analyzed in the recipe analyzes the houeholds tobacco budget share. The study consists of 2724 observations and 9 variables. A cross-section from 1995-1996.
Categoical IV:
Continuous DV:
A dataframe containing:
occupation : a factor with levels (bluecol, whitecol, inactself), the last level being inactive and self-employed
region : a factor with levels (flanders, wallon, brussels)
nkids : number of kids of more than two years old
nadults: number of adults in household
lnx : log of total expenditures
stobacco : budgetshare of tobacco
salcohol : budgetshare of alcohol
age : age in brackes (0-4)
First 6 rows of Tobacco data:
head(project2)
## occupation region nkids nkids2 nadults lnx stobacco salcohol age
## 1 bluecol flanders 1 0 2 14.19054 0 0.0000000 2
## 2 inactself flanders 0 0 3 13.90857 0 0.0022851 3
## 3 whitecol flanders 0 0 1 13.97461 0 0.0128755 2
## 4 bluecol flanders 1 0 2 13.76281 0 0.0059072 2
## 5 inactself flanders 2 0 1 13.80800 0 0.0219805 2
## 6 whitecol flanders 3 0 2 14.00313 0 0.0166913 2
Summary of the Tobacco dataframe:
summary(project2)
## occupation region nkids nkids2
## inactself:1417 brussels: 454 Min. :0.0000 Min. :0.00000
## bluecol : 400 flanders:1231 1st Qu.:0.0000 1st Qu.:0.00000
## whitecol : 907 walloon :1039 Median :0.0000 Median :0.00000
## Mean :0.5646 Mean :0.04479
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :5.0000 Max. :2.00000
## nadults lnx stobacco salcohol
## Min. :1.00 Min. :11.76 Min. :0.00000 Min. :0.000000
## 1st Qu.:1.00 1st Qu.:13.41 1st Qu.:0.00000 1st Qu.:0.002906
## Median :2.00 Median :13.76 Median :0.00000 Median :0.010898
## Mean :1.97 Mean :13.73 Mean :0.01224 Mean :0.017828
## 3rd Qu.:2.00 3rd Qu.:14.06 3rd Qu.:0.01381 3rd Qu.:0.024244
## Max. :7.00 Max. :15.33 Max. :0.19276 Max. :0.211124
## age
## Min. :0.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.408
## 3rd Qu.:4.000
## Max. :4.000
Factors and Levels of each factor is listed below:
str(Tobacco)
## 'data.frame': 2724 obs. of 9 variables:
## $ occupation: Factor w/ 3 levels "inactself","bluecol",..: 2 1 3 2 1 3 1 2 3 1 ...
## $ region : Factor w/ 3 levels "brussels","flanders",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ nkids : int 1 0 0 1 2 3 0 1 0 0 ...
## $ nkids2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nadults : int 2 3 1 2 1 2 2 2 2 2 ...
## $ lnx : num 14.2 13.9 14 13.8 13.8 ...
## $ stobacco : num 0 0 0 0 0 ...
## $ salcohol : num 0 0.00229 0.01288 0.00591 0.02198 ...
## $ age : int 2 3 2 2 2 2 4 0 1 4 ...
The 3 factors being studied include:
project2$nadults <-as.factor(project2$nadults)
str(project2$nadults)
## Factor w/ 7 levels "1","2","3","4",..: 2 3 1 2 1 2 2 2 2 2 ...
levels(project2$nadults)
## [1] "1" "2" "3" "4" "5" "6" "7"
levels(Tobacco$occupation)
## [1] "inactself" "bluecol" "whitecol"
levels(Tobacco$region)
## [1] "brussels" "flanders" "walloon"
There are 3 continuous variables:
The response variable is lnx the log of total expenditures
The study contains tobacco information from 2724 Belgian households and 9 variables, taken from the Belgian household budget survey of 1995/1996. The data are supplied by the National Institute of Statistics, Belgium. A new dataset is created by subsetting the data using the 2 factors discussed above.
myvars <- c("lnx", "occupation", "region")
newdata <- project2[myvars]
head(newdata)
## lnx occupation region
## 1 14.19054 bluecol flanders
## 2 13.90857 inactself flanders
## 3 13.97461 whitecol flanders
## 4 13.76281 bluecol flanders
## 5 13.80800 inactself flanders
## 6 14.00313 whitecol flanders
str(newdata)
## 'data.frame': 2724 obs. of 3 variables:
## $ lnx : num 14.2 13.9 14 13.8 13.8 ...
## $ occupation: Factor w/ 3 levels "inactself","bluecol",..: 2 1 3 2 1 3 1 2 3 1 ...
## $ region : Factor w/ 3 levels "brussels","flanders",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(newdata)
## lnx occupation region
## Min. :11.76 inactself:1417 brussels: 454
## 1st Qu.:13.41 bluecol : 400 flanders:1231
## Median :13.76 whitecol : 907 walloon :1039
## Mean :13.73
## 3rd Qu.:14.06
## Max. :15.33
This experiment will test whether or not occupation of the head of the household affects the Houshold Budgetshare of Tobacco, blocked on the region of the Belgian households. An ANOVA test will be performed to determine the F statistic and the associated p-value for this factor. Then, two alternative methods to null hypothesis statistical testing are use to analyze these affects, the multiple models and confidence intervals.
Below was used to calculate the effect size:
mean_bluecol <- mean(Tobacco$lnx[Tobacco$occupation == "bluecol"])
mean_whitecol <- mean(Tobacco$lnx[Tobacco$occupation == "whitecol"])
mean_inactself <- mean(Tobacco$lnx[Tobacco$occupation == "inactself"])
MEAN_TOTAL <- mean(Tobacco$lnx)
n_bluecol <- length(Tobacco$lnx[Tobacco$occupation == "bluecol"])
n_whitecol <- length(Tobacco$lnx[Tobacco$occupation == "whitecol"])
n_inactself <- length(Tobacco$lnx[Tobacco$occupation == "inactself"])
N <- length(Tobacco$lnx)
a <- (n_bluecol/N) * (mean_bluecol - MEAN_TOTAL) ^ 2
b <- (n_whitecol/N) * (mean_whitecol - MEAN_TOTAL) ^ 2
c <- (n_inactself/N) * (mean_inactself - MEAN_TOTAL) ^ 2
variation = var(Tobacco$lnx)
f_square = (a+b+c)/ variation
f <- sqrt(f_square)
print(f)
## [1] 0.3344385
Calculation of required sample size using G*Power to achieve the desired power.
G*Power
N = 122
set.seed (300)
Random_Tobacco = Tobacco[sample(nrow(Tobacco),N),]
The rationale for this design is to use blocking on our recipe and use non-NHST methods to test and verify our results.
The data can be assumed to be randomly collected since it is part of Ecdat data package.
The data was generated by selecting unique Belgian households. There is no mention of any repeated measures or replicates of this study.
In this experiment, the data will be blocked on if household Budgetshare of tobbaco is greater in a certain region of Belgium. Region variable has 3 levels (flanders, wallon, brussels)
Below is a graphic and descriptive summary of this recipe. The historgram represents the frequency of our response variable, lnx, the log of total expenditures.
## occupation region nkids nkids2
## inactself:1417 brussels: 454 Min. :0.0000 Min. :0.00000
## bluecol : 400 flanders:1231 1st Qu.:0.0000 1st Qu.:0.00000
## whitecol : 907 walloon :1039 Median :0.0000 Median :0.00000
## Mean :0.5646 Mean :0.04479
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :5.0000 Max. :2.00000
##
## nadults lnx stobacco salcohol
## 1: 706 Min. :11.76 Min. :0.00000 Min. :0.000000
## 2:1567 1st Qu.:13.41 1st Qu.:0.00000 1st Qu.:0.002906
## 3: 316 Median :13.76 Median :0.00000 Median :0.010898
## 4: 108 Mean :13.73 Mean :0.01224 Mean :0.017828
## 5: 18 3rd Qu.:14.06 3rd Qu.:0.01381 3rd Qu.:0.024244
## 6: 8 Max. :15.33 Max. :0.19276 Max. :0.211124
## 7: 1
## age
## Min. :0.000
## 1st Qu.:1.000
## Median :2.000
## Mean :2.408
## 3rd Qu.:4.000
## Max. :4.000
##
Note: According to Working-Leser model, each share of food item (ie: tobacco) is simply a linear function of the log of prices and of the total expenditure on all the food items under consideration. Hence, the log of total expenditure representing Tobacco expenditure calculated by each share of tobacco by each Belgian Household.
According to the historgram, there is long left tail and short right tail. The distribute is skewed left which means that the mean is typically less than the median.
Goal: To perform ANOVA and determine the F-statistic and the associated p-value for factor.
model<- aov(Random_Tobacco$lnx ~ Random_Tobacco$occupation + Random_Tobacco$region)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Random_Tobacco$occupation 2 4.017 2.0085 10.984 4.25e-05 ***
## Random_Tobacco$region 2 0.038 0.0190 0.104 0.901
## Residuals 117 21.394 0.1829
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis was that there is no relationship between the occupation in regards to the log Tobacco expenditure. Data contining “region” variable would be ignored because it is used as a blocking variable. From the ANOVA results, the F-statistic was 10.984 and the p-value was 0.0000425. The alpha value was set to 0.05. The p-value was much less than the alpha value and the F value was not close to 0.Therefore, since the p-value < alpha, we reject the null hypothesis, and the result was statistically significant.
qqnorm(residuals(model))
qqline(residuals(model))
From the QQ plot, the residuals form a linear line. Hence, it appears that it is a sample from a normal distribution.
plot(fitted(model), residuals(model))
This residual model shows a random pattern, indicating a good fit for a linear model.
model1 = lm(Tobacco$lnx ~ Tobacco$occupation * Tobacco$region)
qqnorm(residuals(model1))
plot(fitted(model1), residuals(model1))
model2 = lm(Tobacco$lnx ~ (Tobacco$occupation)^2 * Tobacco$region)
qqnorm(residuals(model2))
plot(fitted(model2), residuals(model2))
model3 = lm(Tobacco$lnx ~ ((Tobacco$region)^2) * Tobacco$occupation)
qqnorm(residuals(model3))
plot(fitted(model3), residuals(model3))
model4 = lm(Tobacco$lnx ~ ((Tobacco$occupation)^2) * (Tobacco$region)^2)
qqnorm(residuals(model4))
plot(fitted(model4), residuals(model4))
One of the assumptions for regression analysis is that the residuals are normally distributed. Four of the plots are normally distributed. Therefore, we can trust the results of this regression analysis.
N = 122
set.seed (300)
Random_Tobacco = Tobacco[sample(nrow(Tobacco),N),]
model_rand = lm(lnx~occupation+region,data=Random_Tobacco)
coef(model_rand)
## (Intercept) occupationbluecol occupationwhitecol
## 13.539561827 0.254456659 0.401236925
## regionflanders regionwalloon
## 0.041191185 0.006432225